Groupe d’études et de recherche en analyse des décisions

G-2010-27

Fast Robust Model Selection in Large Datasets


Large datasets on which classical statistical analysis cannot be performed because of the curse of dimensionality are increasingly common in many research fields. In particular, in the linear regression context, a huge number of potential covariates is often available to explain a response variable, and the first step of a reasonable statistical analysis is to reduce the number of covariates using appropriate statistical criteria. Fast alternative methods that alleviate the computational burden of classical procedures have recently been proposed in the literature. However, these fast methods are based on classical statistical theory and are not robust to extreme observations. Simply replacing the classical statistical criteria with robust ones is not possible, because the complexity of the robust estimators and testing procedures leads to infeasible computations. In this paper, we propose alternative robust estimators, selection criteria and testing procedures for the linear regression model that are fast to compute and hence can be used in a fast model selection procedure. The robust estimator is a one-step weighted M-estimator that can be biased if the covariates are not orthogonal. We show that the bias is relatively small and can be reduced by iterating the M-estimator one or more steps further. For the variable selection process, we propose a simplified robust criterion based on a robust t-statistic for significance. We propose a complete algorithm for fast robust model selection, including considerations for huge sample sizes, and show the performance of our method in a simulation study. We also analyze two datasets and show that the results obtained by our method outperform those from robust LARS and random forests. Supplemental materials are also provided.
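To make the idea of a one-step weighted M-estimator concrete, the following is a minimal sketch in Python with NumPy. It is not the estimator proposed in the paper, whose weights and starting values are not specified in this abstract. As assumptions for illustration only, it uses Huber weights with tuning constant 1.345, a scale estimate based on the median absolute deviation, and an ordinary least squares starting value. The function names `huber_weights`, `mad_scale` and `one_step_weighted_m` are hypothetical.

```python
import numpy as np


def huber_weights(r, c=1.345):
    """Huber weights (an assumed choice): 1 if |r| <= c, else c / |r|."""
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))


def mad_scale(r):
    """Median absolute deviation, rescaled to be consistent at the normal."""
    return 1.4826 * np.median(np.abs(r - np.median(r)))


def one_step_weighted_m(X, y, beta0, c=1.345, steps=1):
    """Run `steps` reweighted least squares updates starting from beta0.

    steps=1 gives a one-step estimate; steps>1 iterates further.
    """
    beta = np.asarray(beta0, dtype=float)
    for _ in range(steps):
        resid = y - X @ beta
        # Floor the scale so a perfect fit does not cause division by zero.
        scale = max(mad_scale(resid), 1e-12)
        w = huber_weights(resid / scale, c)
        Xw = X * w[:, None]
        # Weighted normal equations: (X' W X) beta = X' W y.
        beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)
    return beta


# Illustrative data: a straight line with a few large outliers.
x = np.linspace(-1.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true
y[::10] += 20.0  # contaminate every tenth response

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]  # non-robust starting value
beta_one = one_step_weighted_m(X, y, beta_ls, steps=1)
print("least squares:", beta_ls)
print("one step:     ", beta_one)
```

Each update costs one weighted least squares solve, which is what makes this kind of estimator cheap compared with a fully iterated robust fit. Passing `steps` greater than 1 corresponds to iterating one or more steps further.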

31 pages