Table 1

Misclassification rates by CART and RF modeling

| | Number of independent variables | Error rate, Class 1 | Error rate, Class 2 |
|---|---|---|---|
| Dataset 1: Down/Up | | Down | Up |
| Sample size | | 51 | 65 |
| CART | 164 | 0.41 | 0.46 |
| RF | 164 | 0.59 | 0.31 |
| RF + CART | 4 | 0.37 | 0.23 |
| Dataset 2: Transient/Sustained | | Transient | Sustained |
| Sample size | | 23 | 41 |
| CART | 159 | 0.22 | 0.68 |
| RF | 159 | 0.86 | 0.19 |
| RF + CART | 3 | 0.17 | 0.27 |

For each dataset, the synexpression group label was the dependent variable and the TFBSs were the independent variables. The CART model was derived using the Gini splitting criterion, equal priors, unit misclassification costs, and 10-fold cross-validation. The best tree was selected by minimum cost. The reported error rates are those on the test samples under cross-validation. RF was run with stratified sampling using an equal sample size for both classes, set to the number of observations in the smaller class. The reported error rates are the averages of the out-of-bag error rates over 100 runs of RF, each with 1000 trees. RF + CART denotes a CART model built on the most important variables selected by RF. For both datasets, RF + CART provided the best classification results, with the lowest misclassification rates.
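Three ingredients of the procedure described above can be sketched in a few lines of standard-library Python: the Gini impurity used as the splitting criterion, a class-balanced sample whose per-class size equals that of the smaller class, and per-class misclassification rates as reported in the table. This is only an illustrative sketch, not the software used in the study. The function names are hypothetical, and drawing with replacement within each class is an assumption, since the text does not state it. Only the class sizes of Dataset 1 are taken from the table; no real features or predictions are used.

```python
import random
from collections import Counter, defaultdict


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum over classes of p_k squared."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def balanced_bootstrap(labels, rng):
    """Return sample indices with an equal number of cases per class.

    The per-class sample size is the size of the smallest class.
    Sampling with replacement is an assumption of this sketch.
    """
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    n_min = min(len(idx) for idx in by_class.values())
    sample = []
    for idx in by_class.values():
        sample.extend(rng.choices(idx, k=n_min))
    return sample


def per_class_error(y_true, y_pred):
    """Fraction of misclassified cases within each true class."""
    totals, wrong = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t != p:
            wrong[t] += 1
    return {c: wrong[c] / totals[c] for c in totals}


# Class sizes of Dataset 1 (51 Down, 65 Up); labels only, no features.
labels = ["Down"] * 51 + ["Up"] * 65
rng = random.Random(0)  # fixed seed so the sketch is reproducible

in_bag = balanced_bootstrap(labels, rng)
in_bag_counts = Counter(labels[i] for i in in_bag)

# Cases never drawn are "out of bag" and would be used to estimate error.
out_of_bag = sorted(set(range(len(labels))) - set(in_bag))
```

In a full random forest, each tree would be grown on one such balanced sample and evaluated on its out-of-bag cases, and `per_class_error` would be applied to the aggregated out-of-bag predictions to obtain class-wise rates like those in the table.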

Qin et al. BMC Systems Biology 2009 3:73   doi:10.1186/1752-0509-3-73
