Groupe d’études et de recherche en analyse des décisions

Visualizing Clusters With a Density-Based Similarity Measure

Rebecca Nugent

The goal of clustering is to identify distinct groups in a dataset and assign a group label to each observation. To cast clustering as a statistical problem, we regard the data as a sample from an unknown density $$p(x)$$. To generate clusters, we estimate the properties of $$p(x)$$ with either parametric (model-based) or nonparametric methods. In contrast, the algorithmic approach to clustering (e.g., linkage methods, spectral clustering) applies an algorithm, often based on a distance measure, to data in $$m$$-dimensional space. Many commonly used clustering methods rely on functions of the Euclidean distance between observations to determine groupings; they identify spherical groups easily, but curvilinear groups less so. We first motivate the use of a density-based similarity measure and briefly introduce generalized single linkage, a graph-based clustering approach. We then describe a refinement algorithm used to bound the measure and explore the performance of this measure in clustering and visualization methods.
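To make the idea of a density-based similarity concrete, here is a minimal sketch under stated assumptions. The abstract does not define the measure, so the definition used here (the minimum of a kernel density estimate of $$p(x)$$ along the straight line segment joining two observations) is an illustrative assumption, not necessarily the speaker's method. The function names `kde` and `density_similarity`, the Gaussian kernel, the bandwidth, and the grid size are likewise hypothetical choices.

```python
import numpy as np

def kde(points, data, bandwidth):
    """Gaussian kernel density estimate of p(x), evaluated at each row of `points`."""
    points = np.atleast_2d(points)
    m = data.shape[1]  # dimension of the data space
    # Squared Euclidean distance from every evaluation point to every observation.
    sq_dist = np.sum((points[:, None, :] - data[None, :, :]) ** 2, axis=2)
    norm = (2.0 * np.pi * bandwidth ** 2) ** (m / 2.0)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2)).mean(axis=1) / norm

def density_similarity(x_i, x_j, data, bandwidth=0.5, n_grid=50):
    """Assumed similarity: the lowest estimated density on the segment from x_i to x_j."""
    t = np.linspace(0.0, 1.0, n_grid)[:, None]
    segment = (1.0 - t) * x_i + t * x_j  # n_grid points along the segment
    return kde(segment, data, bandwidth).min()

# Two well-separated synthetic groups in m = 2 dimensions (illustrative data only).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2))
group_b = rng.normal(loc=[4.0, 0.0], scale=0.3, size=(100, 2))
data = np.vstack([group_a, group_b])

within = density_similarity(group_a[0], group_a[1], data)
between = density_similarity(group_a[0], group_b[0], data)
print(within > between)
```

Under this assumed definition, two observations are similar when a path between them stays in a high-density region, so a pair from the same group scores higher than a pair separated by a low-density valley, regardless of the Euclidean distance between them.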