An example of nearest neighbor–based data cleaning. The X- and Y-axes represent two attributes in the feature space. Minority class examples are denoted by black circles and majority class examples by white circles. Red rectangles indicate the axis-parallel decision regions of the minority class learned by the decision tree algorithm. (a) An imbalanced data set with sparse minority class examples. The decision regions of the minority class also contain majority class examples. (b) One way to exclude the majority class examples is to shrink the decision regions by making them more specific. However, more specific regions require more splits in the decision tree, which leads to overfitting. (c) To identify the “dirty” examples that may mislead learning, the proposed method locates the k nearest neighbors (k = 3 in this example) of each minority class example. The 3 nearest neighbors of a minority class example are indicated by links. (d) A red cross marks each “dirty” example. (e) After the “dirty” examples are removed, the decision regions are “clean” (i.e., they contain only minority class examples). Using these clean decision regions, learning algorithms can more easily recognize the correct boundary between classes.
Hu et al. BMC Medical Informatics and Decision Making 2012 12:131 doi:10.1186/1472-6947-12-131
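The cleaning step in panels (c) to (e) can be illustrated with a minimal sketch. The caption does not state the exact rule for marking an example as “dirty”, so the code below assumes one plausible reading: a majority class example is dirty if it is among the k nearest neighbors of at least one minority class example. The function names `find_dirty_majority` and `clean_majority` and the toy data are hypothetical and are not taken from the paper.

```python
import math


def find_dirty_majority(minority, majority, k=3):
    """Return sorted indices of majority points flagged as 'dirty'.

    ASSUMPTION (not stated in the caption): a majority point is dirty if it
    is one of the k nearest neighbors of at least one minority point.
    """
    # Pool both classes, remembering each point's class and original index.
    pool = [("min", i, p) for i, p in enumerate(minority)]
    pool += [("maj", j, p) for j, p in enumerate(majority)]

    dirty = set()
    for i, query in enumerate(minority):
        # Euclidean distance to every other point (the query itself is skipped).
        neighbors = [
            (math.dist(query, point), label, idx)
            for label, idx, point in pool
            if not (label == "min" and idx == i)
        ]
        neighbors.sort(key=lambda item: item[0])
        # Any majority point among the k closest neighbors is flagged.
        for _, label, idx in neighbors[:k]:
            if label == "maj":
                dirty.add(idx)
    return sorted(dirty)


def clean_majority(minority, majority, k=3):
    """Return the majority points that remain after dirty ones are removed."""
    dirty = set(find_dirty_majority(minority, majority, k))
    return [p for j, p in enumerate(majority) if j not in dirty]


# Toy data: one majority point sits inside the minority cluster.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority = [(0.5, 0.5), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
remaining = clean_majority(minority, majority, k=3)
```

In this toy data only the majority point at (0.5, 0.5) lies close enough to the minority cluster to be flagged, so a learner trained on `minority` plus `remaining` would see a minority region free of majority examples, as in panel (e). A brute-force distance scan is used for clarity; a spatial index would be preferable for larger data sets.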