|
Figure 4.
Examples of decision regions of data points projected to a 2D space. The X- and Y-axes represent two attributes in the feature space. The minority class
examples are denoted by black circles, and the majority class examples are denoted
by white circles. Red rectangles indicate the axis-parallel decision regions of the
minority class learned by the decision tree algorithm.
(a) In an imbalanced but coherent data set, the boundary between classes is clear. Over-sampling the minority class or under-sampling the majority class to balance the data set can help learning algorithms identify the decision regions.
(b) If the data set is imbalanced and the minority class examples are sparsely scattered among the majority class examples, the decision regions are likely to include majority class examples, making classification more difficult.
(c) Over-sampling the minority class by replication makes the decision regions more specific. The replicated minority class examples are indicated by larger black circles. As the decision regions become more specific, learning algorithms based on the divide-and-conquer method (e.g., a decision tree algorithm) are more prone to overfitting because they produce more partitions in the data during learning.
(d) In contrast, under-sampling the majority class randomly selects majority class examples until the size of the selected set equals that of the minority class. Because the minority class examples are scattered, the decision regions may still contain majority class examples, and learning the boundary remains difficult.
Hu et al. BMC Medical Informatics and Decision Making 2012 12:131 doi:10.1186/1472-6947-12-131 |
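The two balancing strategies described in panels (c) and (d) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the procedure used by Hu et al.; the function names `oversample_minority` and `undersample_majority`, the `seed` parameter, and the toy 2D points are all assumptions made for the example.

```python
import random


def oversample_minority(minority, majority, seed=0):
    """Random over-sampling with replication (hypothetical helper).

    Randomly chosen minority examples are duplicated until the minority
    class is as large as the majority class.
    """
    rng = random.Random(seed)
    deficit = max(len(majority) - len(minority), 0)
    replicas = [rng.choice(minority) for _ in range(deficit)]
    return list(minority) + replicas, list(majority)


def undersample_majority(minority, majority, seed=0):
    """Random under-sampling (hypothetical helper).

    Majority examples are drawn without replacement until the retained
    set is as large as the minority class.
    """
    rng = random.Random(seed)
    k = min(len(minority), len(majority))
    return list(minority), rng.sample(list(majority), k)


# Toy data (assumed): 3 minority points and 30 distinct majority points in 2D.
minority = [(0.2, 0.8), (0.5, 0.1), (0.9, 0.6)]
majority = [(x / 10, y / 10) for x in range(10) for y in range(3)]

over_min, over_maj = oversample_minority(minority, majority)
under_min, under_maj = undersample_majority(minority, majority)
```

Over-sampling adds no new locations in the feature space, only repeated points, which is consistent with the caption's remark that the decision regions become more specific. Under-sampling discards majority examples at random, so scattered minority examples can still share a region with the retained majority examples.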