G-2022-38

On removing diverse data for training machine learning models

Providing the right data to a machine learning model is an important step in ensuring the model's performance. Non-compliant training instances may lead to wrong predictions, yielding models that cannot be used in production. Instance or prototype selection methods are often used to curate training sets, thus leading to more reliable and efficient models. In this work, we investigate whether diversity is helpful as a criterion for choosing which instances to remove from a given training set. We test our hypothesis against a random selection method and Mahalanobis outlier selection, using benchmark data sets with different data characteristics. Our computational experiments demonstrate that selection by diversity achieves better classification performance than random selection, and it can hence be considered an alternative data selection criterion for effective model training.
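The abstract compares diversity-based selection against a Mahalanobis outlier baseline. As an illustration only, and not the paper's actual procedure, the sketch below shows a generic Mahalanobis-distance outlier filter alongside a simple diversity-preserving removal rule (greedily dropping instances that sit closest to another instance); the function names, the `remove_frac` parameter, and the greedy rule are all assumptions introduced here for clarity.

```python
import numpy as np

def mahalanobis_outlier_filter(X, y, remove_frac=0.1):
    """Drop the remove_frac fraction of instances with the largest
    Mahalanobis distance to the global mean (generic baseline sketch)."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))  # pinv guards against singular covariance
    diff = X - mu
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared Mahalanobis distances
    keep = np.argsort(d2)[: int(len(X) * (1 - remove_frac))]
    return X[keep], y[keep]

def diversity_filter(X, y, remove_frac=0.1):
    """Greedily remove instances that are closest to another instance,
    keeping a more spread-out (diverse) subset. Illustrative only."""
    keep = list(range(len(X)))
    n_remove = int(len(X) * remove_frac)
    for _ in range(n_remove):
        sub = X[keep]
        dists = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)
        np.fill_diagonal(dists, np.inf)
        # Remove the instance whose nearest neighbour is closest (least diverse).
        drop = int(np.argmin(dists.min(axis=1)))
        keep.pop(drop)
    return X[keep], y[keep]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = rng.integers(0, 2, size=200)
    Xo, yo = mahalanobis_outlier_filter(X, y, remove_frac=0.1)
    Xd, yd = diversity_filter(X, y, remove_frac=0.1)
    print(Xo.shape, Xd.shape)  # (180, 5) (180, 5)
```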

16 pages

Document

G2238.pdf (3 MB)