Groupe d’études et de recherche en analyse des décisions

Data clustering using large p-median models and primal-dual variable neighborhood search

Nenad Mladenović School of Mathematics, Brunel University, Royaume-Uni

Data clustering methods have been developed extensively in the data mining literature for detecting useful patterns in large datasets in the form of densely populated regions in a multi-dimensional Euclidean space. One of the challenges in dealing with these large databases is to obtain quality solutions within reasonable CPU time and memory requirements. Earlier partitioning methods such as PAM (1990) become inefficient on these larger sets due to repetitive and lengthy neighborhood searches in the solution space. To get around this problem, CLARA (1990) randomly samples the solution space first, and then solves a clustering problem (p-median) on the much smaller representative sample using PAM. Meanwhile CLARANS (2002) randomly selects neighborhood points up to a maximum number instead of searching the entire neighborhood. BIRCH (1996) uses hierarchical approach to cluster the data in a single pass. Improvements are made at the user's option by further passes through the (classified) data.

The purpose of this paper is to present a new approach for solving the p-median problem based on the variable neighborhood search metaheuristic. Using a highly efficient data structure and interchange updating procedure within the embedded local search we are able to analyze very large multi-dimensional datasets directly and quickly. Another feature of the our method is that a guaranteed bound on solution quality is obtained by solving heuristically a dual relaxation of the problem.