Examples of decision regions for data points projected onto a 2D space. The X- and Y-axes represent two attributes in the feature space. Minority class examples are denoted by black circles, and majority class examples by white circles. Red rectangles indicate the axis-parallel decision regions of the minority class learned by the decision tree algorithm. (a) In an imbalanced but coherent data set, the boundary between classes is clear. Over-sampling the minority class or under-sampling the majority class to balance the data set can help learning algorithms identify the decision regions. (b) If the data set is imbalanced and the minority class examples are sparsely scattered among the majority class examples, the decision regions are likely to include majority class examples, making classification more difficult. (c) Over-sampling the minority class by replication makes the decision regions more specific. Replicated minority class examples are indicated by larger black circles. As the decision regions become more specific, learning algorithms based on the divide-and-conquer method (e.g., a decision tree algorithm) become more prone to overfitting because they produce more partitions of the data during learning. (d) In contrast, under-sampling randomly selects a subset of the majority class equal in size to the minority class. Because the minority class examples are scattered, the decision regions may still contain majority class examples, and learning the boundary remains difficult.
Hu et al. BMC Medical Informatics and Decision Making 2012 12:131 doi:10.1186/1472-6947-12-131
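The two resampling strategies described in panels (c) and (d) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`split_by_class`, `random_oversample`, `random_undersample`), the binary labels, and the toy data set of 10 minority and 90 majority examples are all assumptions made for illustration, and only the Python standard library is used.

```python
import random


def split_by_class(examples, minority_label):
    """Split (features, label) pairs into minority and majority lists."""
    minority = [e for e in examples if e[1] == minority_label]
    majority = [e for e in examples if e[1] != minority_label]
    return minority, majority


def random_oversample(examples, minority_label, rng):
    """Replicate minority examples (drawn with replacement) until both
    classes have the same size, as in panel (c)."""
    minority, majority = split_by_class(examples, minority_label)
    n_extra = max(0, len(majority) - len(minority))
    replicas = rng.choices(minority, k=n_extra)  # exact copies, no new points
    return majority + minority + replicas


def random_undersample(examples, minority_label, rng):
    """Keep a random subset of the majority class (drawn without
    replacement) equal in size to the minority class, as in panel (d)."""
    minority, majority = split_by_class(examples, minority_label)
    kept = rng.sample(majority, k=min(len(minority), len(majority)))
    return kept + minority


# Hypothetical toy data: two uniform random attributes per example,
# label 1 for the 10 minority examples and 0 for the 90 majority examples.
rng = random.Random(0)
examples = [((rng.random(), rng.random()), 1 if i < 10 else 0)
            for i in range(100)]

over = random_oversample(examples, minority_label=1, rng=rng)
under = random_undersample(examples, minority_label=1, rng=rng)
```

Note that over-sampling by replication adds no new locations in the feature space: the 90 minority examples in `over` occupy only the 10 original points, which is consistent with the caption's remark that the decision regions become more specific. Under-sampling, by contrast, discards 80 of the 90 majority examples but leaves the scattered minority points where they were.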