Using BIRCH to Compute Approximate Rank Statistics on Massive Datasets

Charest, Lysiane; Plante, Jean-François

The BIRCH algorithm (Balanced Iterative Reducing and Clustering Hierarchies) handles massive datasets by reading the data file only once, clustering the data as it is read, and retaining only a few clustering features to summarize the data read so far. BIRCH thus makes it possible to analyze datasets that are too large to fit in the computer's main memory. We propose estimates of Spearman's \(\rho\) and Kendall's \(\tau\) that are calculated from BIRCH output and assess their performance through Monte Carlo studies. The numerical results show that the BIRCH-based estimates can achieve the same efficiency as the usual estimates of \(\rho\) and \(\tau\) while using only a fraction of the memory otherwise required.
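To make the single-pass idea concrete, here is a minimal sketch of a clustering feature in the usual BIRCH sense: the triple (number of points, per-coordinate linear sum, sum of squared norms). It is not the report's implementation. The class name `ClusteringFeature` and its methods are hypothetical, the sketch keeps a single summary rather than a full tree of clusters with an absorption threshold, and the rank-statistic estimators proposed in the report are not reproduced.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Iterable, List


@dataclass
class ClusteringFeature:
    """Hypothetical (N, LS, SS) summary of the points absorbed so far."""
    dim: int
    n: int = 0
    linear_sum: List[float] = field(default_factory=list)
    square_sum: float = 0.0

    def __post_init__(self) -> None:
        # Start with a zero vector of the right dimension.
        if not self.linear_sum:
            self.linear_sum = [0.0] * self.dim

    def add(self, point: Iterable[float]) -> None:
        """Absorb one point; the point itself is not stored."""
        point = list(point)
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
        self.square_sum += sum(x * x for x in point)

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        """Summaries are additive, so two clusters merge by adding fields."""
        merged = ClusteringFeature(self.dim)
        merged.n = self.n + other.n
        merged.linear_sum = [a + b for a, b in zip(self.linear_sum, other.linear_sum)]
        merged.square_sum = self.square_sum + other.square_sum
        return merged

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def radius(self) -> float:
        """Root mean squared distance of the absorbed points to the centroid."""
        c = self.centroid()
        return sqrt(max(0.0, self.square_sum / self.n - sum(v * v for v in c)))


# Toy stream standing in for a file that is read only once.
stream = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
cf = ClusteringFeature(dim=2)
for point in stream:
    cf.add(point)
```

Because the three fields are plain sums, memory use depends on the number of summaries kept and not on the number of rows read, which is the property the abstract relies on.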

Published November 2012, 23 pages

Research axis

Axis 1: Data valuation for decision making

Document

G-2012-76.pdf (600 KB)

GERAD

G-2012-76
