In this paper, we empirically investigate the robustness of random forests for regression problems. We also investigate the performance of five variations of the original random forest method, all aimed at improving robustness. All the proposed variations can be easily implemented using the R package randomForest. The first main idea behind these variations is to use the median, instead of the mean, to combine the predictions from the individual trees. The second idea is to build the trees using the ranks of the response instead of the original values. The competing methods are compared via a simulation study and on ten real data sets obtained from the UCI Machine Learning Repository. Our results show that the median-based random forests (using either the ranks or the original responses) offer good and stable performance on the simulated and real data sets considered and, as such, should be considered serious alternatives to the original random forest method.
Published October 2010, 17 pages
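
The two ideas named in the abstract, median aggregation of tree predictions and rank-transformed responses, can be illustrated with a minimal sketch. It is not the paper's implementation (which relies on the R package randomForest); the function names `aggregate` and `to_ranks`, the toy numbers, and the choice of average ranks for ties are all assumptions made for illustration only.

```python
from statistics import mean, median


def aggregate(tree_predictions, how="mean"):
    """Combine the per-tree predictions for a single observation."""
    if how == "mean":
        return mean(tree_predictions)
    if how == "median":
        return median(tree_predictions)
    raise ValueError("how must be 'mean' or 'median'")


def to_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks.

    Tie handling is an assumption: the abstract does not specify it.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


# Hypothetical predictions from five trees; one tree is thrown off by an outlier.
tree_predictions = [2.1, 1.9, 2.0, 2.2, 50.0]
mean_pred = aggregate(tree_predictions, "mean")      # pulled toward 50.0
median_pred = aggregate(tree_predictions, "median")  # stays near the bulk

# Hypothetical responses with one extreme value; ranks bound its influence.
responses = [3.2, 1000.0, 2.7, 3.2]
ranks = to_ranks(responses)

print(mean_pred, median_pred, ranks)
```

In this toy case the single aberrant tree shifts the mean far from the other predictions while the median is unaffected, and the extreme response of 1000.0 simply becomes the largest rank. How rank-scale predictions are mapped back to the original response scale is not described in the abstract and is left out of the sketch.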