Clusterwise regression is a clustering technique where multiple lines or hyperplanes are fit to mutually exclusive subsets of a dataset such that the sum of squared errors (SSE) from each observation to its cluster's line is minimized. A column generation based approach is proposed for solving the clusterwise regression problem. The proposed strategy firstly relies on heuristic optimization of the original problem for insertion of the best known columns and all possible one observation perturbations into the reduced master problem. Secondly, a greedy heuristic based on the duals, another based on the best known solution to the original problem, and a multistart heuristic optimization are all employed to attempt to insert one or more columns early and stop the subproblem search. If these heuristics fail to identify an improving column, an exhaustive search is performed starting with incrementally larger ending subsets, all the while iteratively performing heuristic optimization to ensure a proper balance of exact and heuristic optimization. For the subproblem heuristics and branch and bound search, observations are sequenced by their duals and by their inclusion in joint pair branching rules. A series of experiments demonstrate that the extended column generation approach outperforms the best known alternative (BBHSE) when the number of clusters is greater than three. The experiments also explore the importance of each proposed component and demonstrate that all of them provide performance increases with minimal costs. Additionally, the current work further demonstrates and expands the successful use of the new paradigm of using incrementally larger ending subsets to strengthen the lower bounds of a branch and bound search as pioneered by Brusco's Repetitive Branch and Bound Algorithm (RBBA).
Published December 2011 , 20 pages