Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, California, USA

Biostatistics, School of Public Health, University of California, Los Angeles, California, USA

Abstract

Background

Ensemble predictors such as the random forest are known to have superior accuracy but their black-box predictions are difficult to interpret. In contrast, a generalized linear model (GLM) is very interpretable, especially when forward feature selection is used to construct the model. However, forward feature selection tends to overfit the data and leads to low predictive accuracy. Therefore, it remains an important research goal to combine the advantages of ensemble predictors (high accuracy) with the advantages of forward regression modeling (interpretability). To address this goal, several articles have explored GLM based ensemble predictors. Since limited evaluations suggested that these ensemble predictors were less accurate than alternative predictors, they have received little attention in the literature.

Results

Comprehensive evaluations involving hundreds of genomic data sets, the UCI machine learning benchmark data, and simulations are used to give GLM based ensemble predictors a new and careful look. A novel bootstrap aggregated (bagged) GLM predictor that incorporates several elements of randomness and instability (random subspace method, optional interaction terms, forward variable selection) often outperforms a host of alternative prediction methods including random forests and penalized regression models (ridge regression, elastic net, lasso). This random generalized linear model (RGLM) predictor provides variable importance measures that can be used to define a “thinned” ensemble predictor (involving few features) that retains excellent predictive accuracy.

Conclusion

RGLM is a state of the art predictor that shares the advantages of a random forest (excellent predictive accuracy, feature importance measures, out-of-bag estimates of accuracy) with those of a forward selected generalized linear model (interpretability). These methods are implemented in the freely available R software package

Background

Prediction methods (also known as classifiers, supervised machine learning methods, regression models, prognosticators, diagnostics) are widely used in biomedical research. For example, reliable prediction methods are essential for accurate disease classification, diagnosis and prognosis. Since prediction methods based on multiple features (also known as covariates or independent variables) can greatly outperform predictors based on a single feature

Ensemble predictors are particularly attractive since they are known to lead to highly accurate predictions. An ensemble predictor generates and integrates multiple versions of a single predictor (often referred to as base learner), and arrives at a final prediction by aggregating the predictions of multiple base learners, e.g. via plurality voting across the ensemble. One particular approach for constructing an ensemble predictor is bootstrap aggregation (bagging)

Breiman (1996) showed that bagging weak predictors (e.g. tree predictors or forward selected linear models) often yields substantial gains in predictive accuracy

This article is organized as follows. First, we present a motivating example that illustrates the high prediction accuracy of the RGLM. Second, we compare the RGLM with other state of the art predictors when it comes to binary outcome prediction. Toward this end, we use the UCI machine learning benchmark data, over 700 empirical gene expression comparisons, and extensive simulations. Third, we compare the RGLM with other predictors for quantitative (continuous) outcome prediction. Fourth, we describe several variable importance measures and show how they can be used to define a thinned version of the RGLM that uses only a few important features. Even for data sets comprising thousands of gene features, the thinned RGLM often involves fewer than 20 features and is thus more interpretable than most ensemble predictors.

Methods

Construction of the RGLM predictor

RGLM is an ensemble predictor based on bootstrap aggregation (bagging) of generalized linear models whose features (covariates) are selected using forward regression according to the AIC criterion. GLMs comprise a large class of regression models, e.g. linear regression for a normally distributed outcome, logistic regression for a binary outcome, multinomial regression for a multi-class outcome and Poisson regression for a count outcome

The steps of the RGLM construction are presented in Figure

Overview of the RGLM construction

**Overview of the RGLM construction.** The figure outlines the steps used in the construction of the RGLM. The pink rectangles represent data matrices at each step. Width of a rectangle reflects the number of remaining features.
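The ensemble loop outlined in the figure (bootstrap sampling, random feature subspace, per-bag feature selection, and plurality voting) can be sketched in code. The following Python sketch is illustrative only: to stay self-contained it swaps the forward-selected GLM base learner for a simple one-feature nearest-class-mean rule, whereas the actual method (implemented in R) fits a forward-selected logistic regression by AIC inside each bag.

```python
import random
import statistics

def rglm_sketch(X, y, X_test, n_bags=100, n_features_in_bag=None, seed=1):
    """Illustrative sketch of the RGLM ensemble loop for a binary outcome.

    Hypothetical stand-in base learner: a one-feature nearest-class-mean
    rule replaces the forward-selected GLM so the example runs without
    any regression library.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    k = n_features_in_bag or p
    votes = [0] * len(X_test)
    for _ in range(n_bags):
        # Step 1: draw a bootstrap sample of the observations (a "bag").
        idx = [rng.randrange(n) for _ in range(n)]
        # Step 2: random subspace -- a random subset of features per bag.
        feats = rng.sample(range(p), k)
        # Step 3 (stand-in for forward selection by AIC): keep the feature
        # whose class means are furthest apart within the bag.
        best = None  # (gap, mean of class 1, mean of class 0, feature)
        for f in feats:
            x1 = [X[i][f] for i in idx if y[i] == 1]
            x0 = [X[i][f] for i in idx if y[i] == 0]
            if not x1 or not x0:
                continue
            gap = abs(statistics.mean(x1) - statistics.mean(x0))
            if best is None or gap > best[0]:
                best = (gap, statistics.mean(x1), statistics.mean(x0), f)
        if best is None:
            continue  # degenerate bag: only one class was drawn
        _, mu1, mu0, f = best
        # Step 4: the fitted base learner votes on each test observation.
        for j, row in enumerate(X_test):
            votes[j] += 1 if abs(row[f] - mu1) < abs(row[f] - mu0) else 0
    # Step 5: aggregate by majority (plurality) voting across the bags.
    return [1 if v > n_bags / 2 else 0 for v in votes]
```

Because each bag sees a different bootstrap sample and a different feature subset, the individual base learners are unstable and diverse, which is exactly what bagging exploits.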

Importantly, RGLM also has a parameter

These methods are implemented in our R software package

Parameter choices for the RGLM predictor

As discussed below, we find that it is usually sufficient to consider only the **effective** number of features N^{∗}, which equals the number of features N plus the number of interaction terms considered. If N^{∗}<=10, then all features are used in each bag (i.e. nFeaturesInBag/N^{∗}=1). If N^{∗}>300, then nFeaturesInBag/N^{∗} is set to 0.2. The value of nFeaturesInBag/N^{∗} in the intermediate case (10<N^{∗}<=300) results from fitting an interpolation line through the two points (10,1) and (300, 0.2), i.e. nFeaturesInBag/N^{∗}=1.0276−0.00276N^{∗}. We find that RGLM is quite robust with respect to the parameter nFeaturesInBag.

This table shows the default values of nFeaturesInBag. N^{∗} is the effective number of features, which equals the number of features N plus the number of interaction terms considered. The constants 1.0276 and 0.00276 are obtained by interpolating a straight line between (10,1) and (300, 0.2).

| | N | nFeaturesInBag/N^{∗} | N^{∗} | nFeaturesInBag/N^{∗} |
|---|---|---|---|---|
| **No interaction** | 1−10 | 1 | 1−10 | 1 |
| | 11−300 | 1.0276−0.00276N | 11−300 | 1.0276−0.00276N^{∗} |
| | >300 | 0.2 | >300 | 0.2 |
| **2-way interaction** | 1−4 | 1 | 1−10 | 1 |
| | 5−24 | 1.0276−0.00276N(N+1)/2 | 11−300 | 1.0276−0.00276N^{∗} |
| | >24 | 0.2 | >300 | 0.2 |
| **3-way interaction** | 1−3 | 1 | 1−10 | 1 |
| | 4−12 | 1.0276−0.00276(N^{3}+5N)/6 | 11−300 | 1.0276−0.00276N^{∗} |
| | >12 | 0.2 | >300 | 0.2 |
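The piecewise-linear default for nFeaturesInBag can be written down directly. A small Python sketch (the function and argument names are ours, not those of the R package):

```python
from math import comb

def effective_n_features(n, max_order=1):
    """Effective feature count N*: main effects plus all interaction
    terms up to the given order (order 2 adds N*(N-1)/2 products, etc.)."""
    return sum(comb(n, k) for k in range(1, max_order + 1))

def default_bag_fraction(n_star):
    """Default nFeaturesInBag / N* given the effective feature count N*:
    1 for N* <= 10, 0.2 for N* > 300, and the straight line through
    (10, 1) and (300, 0.2) in between, i.e. 1.0276 - 0.00276 * N*."""
    if n_star <= 10:
        return 1.0
    if n_star > 300:
        return 0.2
    slope = (0.2 - 1.0) / (300 - 10)  # approximately -0.00276
    return 1.0 + slope * (n_star - 10)
```

For example, `effective_n_features(24, 2)` is 300, which is why the 2-way interaction column switches to the constant 0.2 only beyond N = 24.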

Relationship with related prediction methods

As discussed below, RGLM can be interpreted as a variant of a bagged predictor

1. RGLM allows for interaction terms between features, which greatly improve the performance on some data sets (in particular the UCI benchmark data sets). We refer to RGLM involving two-way or three-way interactions as RGLM.inter2 and RGLM.inter3, respectively.

2. RGLM has a parameter

3. RGLM has a parameter

4. RGLM optimizes the AIC criterion during forward selection.

5. RGLM has a “thinning threshold” parameter which allows one to reduce the number of features involved in prediction while maintaining good prediction accuracy. Since a thinned RGLM involves far fewer features, it facilitates understanding of how the ensemble arrives at its predictions.

RGLM is not only related to bagging but also to the random subspace method (RSM) proposed by

Another prediction method, random multinomial logit model (RMNL), also shares a similar idea with RGLM. It was recently proposed for multi-class outcome prediction

Software implementation

The RGLM method is implemented in the freely available R package

Short description of alternative prediction methods

**Forward selected generalized linear model predictor (forwardGLM)** We denote by

**Random forest (RF)** RF is an ensemble predictor that consists of a collection of decision trees which vote for the class of observations

**Recursive partitioning and regression trees (Rpart)** Classification and regression trees were generated using the default settings

**Linear discriminant analysis (LDA)** LDA aims to find a linear combination of features (referred to as discriminant variables) to predict a binary outcome (reviewed in

**Diagonal linear discriminant analysis (DLDA)** DLDA is similar to LDA but it ignores the correlation patterns between features. While this is often an unrealistic assumption, DLDA (also known as gene voting) has been found to work well in gene expression applications

**K nearest neighbor (KNN)** We used the

**Support vector machines (SVM)** We used the default parameters from the

**Shrunken centroids (SC)** The SC predictor is known to work well in the context of gene expression data

**Penalized regression models** Various convex penalties can be applied to generalized linear models. We considered ridge regression (corresponding to an L_{2} penalty), the lasso (corresponding to an L_{1} penalty) and the elastic net (which combines L_{1} and L_{2} penalties)

20 disease-related gene expression data sets

We use 20 disease related gene expression data sets involving cancer and other human diseases (described in Table

Sample size, number of features, data set IDs and outcomes for the 20 disease related gene expression data sets.

| **Data set** | **Samples** | **Features** | **Data set ID** | **Binary outcome** |
|---|---|---|---|---|
| **adenocarcinoma** | 76 | 9868 | NA | most prevalent class vs others |
| **brain** | 42 | 5597 | NA | most prevalent class vs others |
| **breast2** | 77 | 4869 | NA | most prevalent class vs others |
| **breast3** | 95 | 4869 | NA | most prevalent class vs others |
| **colon** | 62 | 2000 | NA | most prevalent class vs others |
| **leukemia** | 38 | 3051 | NA | most prevalent class vs others |
| **lymphoma** | 62 | 4026 | NA | most prevalent class vs others |
| **NCI60** | 61 | 5244 | NA | most prevalent class vs others |
| **prostate** | 102 | 6033 | NA | most prevalent class vs others |
| **srbct** | 63 | 2308 | NA | most prevalent class vs others |
| **BrainTumor2** | 50 | 10367 | NA | Anaplastic oligodendrogliomas vs Glioblastomas |
| **DLBCL** | 77 | 5469 | NA | follicular lymphoma vs diffuse large B-cell lymphoma |
| **lung1** | 58 | 10000 | GSE10245 | Adenocarcinoma vs Squamous cell carcinoma |
| **lung2** | 46 | 10000 | GSE18842 | Adenocarcinoma vs Squamous cell carcinoma |
| **lung3** | 71 | 10000 | GSE2109 | Adenocarcinoma vs Squamous cell carcinoma |
| **psoriasis1** | 180 | 10000 | GSE13355 | lesional vs healthy skin |
| **psoriasis2** | 82 | 10000 | GSE14905 | lesional vs healthy skin |
| **MSstage** | 26 | 10000 | E-MTAB-69 | relapsing vs remitting RRMS |
| **MSdiagnosis1** | 27 | 10000 | GSE21942 | RRMS vs healthy control |
| **MSdiagnosis2** | 44 | 10000 | E-MTAB-69 | RRMS vs healthy control |

Empirical gene expression data sets

For all data sets below, we considered 100 randomly selected gene traits, i.e. 100 randomly selected probes. They were directly used as continuous outcomes or dichotomized according to the median value (top half =1, bottom half =0) to generate binary outcomes. For all data sets except “Brain cancer”,
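The median dichotomization step is straightforward; a Python sketch (here ties at the median are assigned to the bottom half, which is an assumption on our part):

```python
import statistics

def dichotomize_by_median(trait):
    """Turn a continuous gene trait into a binary outcome:
    values above the median -> 1 (top half), the rest -> 0 (bottom half)."""
    med = statistics.median(trait)
    return [1 if v > med else 0 for v in trait]
```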

**Brain cancer data sets** These two related data sets contain 55 and 65 microarray samples of glioblastoma (brain cancer) patients, respectively. Gene expression profiles were measured using Affymetrix U133 microarrays. A detailed description can be found in

**SAFHS blood lymphocyte data set** This data set

**WB whole blood gene expression data set** This is the whole blood gene expression data from healthy controls. Peripheral blood samples from healthy individuals were analyzed using Illumina Human HT-12 microarrays. After pre-processing, 380 samples remained in the data set.

**Mouse tissue gene expression data sets** The 4 tissue specific gene expression data sets were generated by the lab of Jake Lusis at UCLA. These data sets measure gene expression levels (Agilent array platform) from adipose (239 samples), brain (221 samples), liver (272 samples) and muscle (252 samples) tissue of mice from the BXH F_{2} mouse intercross described in

Machine learning benchmark data sets

The 12 machine learning benchmark data sets used in this article are listed in Table

Sample size and number of features for the 12 UCI machine learning benchmark data sets.

| **Data set** | **Samples** | **Features** |
|---|---|---|
| **BreastCancer** | 699 | 9 |
| **HouseVotes84** | 435 | 16 |
| **Ionosphere** | 351 | 34 |
| **diabetes** | 768 | 8 |
| **Sonar** | 208 | 60 |
| **ringnorm** | 300 | 20 |
| **threenorm** | 300 | 20 |
| **twonorm** | 300 | 20 |
| **Glass** | 214 | 9 |
| **Satellite** | 6435 | 36 |
| **Vehicle** | 846 | 18 |
| **Vowel** | 990 | 10 |

Simulated gene expression data sets

We simulated an outcome variable

**Simulation study design.** This file describes the simulation studies and presents R code used for simulating the data set.

Click here for file

We considered 180 different simulation scenarios involving varying sizes of the training data (50, 100, 200, 500, 1000 or 2000 samples) and varying numbers of genes (60, 100, 500, 1000, 5000 or 10000 genes) that served as features. Each test set contained the same number of genes as the corresponding training set and 1000 samples. For each simulation scenario, we simulated 5 replicates resulting from different choices of the random seed.

Results

Motivating example: disease-related gene expression data sets

We compare the prediction accuracy of RGLM with that of other widely used methods on 20 gene expression data sets involving human disease related outcomes. Many of the 20 data sets (Table

To arrive at an unbiased estimate of prediction accuracy, we used 3-fold cross validation (averaged over 100 random partitions of the data into 3 folds). Note that the accuracy equals 1 minus the median misclassification error rate. Table
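The accuracy estimate described above can be sketched as follows (Python; `fit_predict` is a placeholder standing in for any of the compared prediction methods):

```python
import random
import statistics

def cv_accuracy(X, y, fit_predict, n_folds=3, n_repeats=100, seed=0):
    """k-fold cross-validation accuracy averaged over repeated random
    partitions of the data into folds:
    accuracy = 1 - median misclassification error rate.

    `fit_predict(X_train, y_train, X_test)` is any prediction rule that
    returns class labels for X_test (hypothetical interface).
    """
    rng = random.Random(seed)
    n = len(y)
    error_rates = []
    for _ in range(n_repeats):
        order = list(range(n))
        rng.shuffle(order)  # one random partition into folds
        wrong = 0
        for k in range(n_folds):
            test_idx = order[k::n_folds]
            test_set = set(test_idx)
            train_idx = [i for i in order if i not in test_set]
            preds = fit_predict([X[i] for i in train_idx],
                                [y[i] for i in train_idx],
                                [X[i] for i in test_idx])
            wrong += sum(p != y[i] for p, i in zip(preds, test_idx))
        error_rates.append(wrong / n)
    return 1 - statistics.median(error_rates)
```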

For each data set, the prediction accuracy was estimated using 3-fold cross validation (averaged over 100 random partitions of the data into 3 folds).

| **Data set** | **RGLM** | **RF** | **RFbigmtry** | **Rpart** | **LDA** | **DLDA** | **KNN** | **SVM** | **SC** |
|---|---|---|---|---|---|---|---|---|---|
| **adenocarcinoma** | 0.842 | 0.842 | 0.842 | 0.737 | 0.842 | 0.744 | 0.842 | 0.842 | 0.803 |
| **brain** | 0.881 | 0.810 | 0.833 | 0.762 | 0.810 | 0.929 | 0.881 | 0.786 | 0.929 |
| **breast2** | 0.623 | 0.610 | 0.636 | 0.584 | 0.610 | 0.636 | 0.584 | 0.558 | 0.636 |
| **breast3** | 0.705 | 0.695 | 0.716 | 0.611 | 0.695 | 0.705 | 0.669 | 0.674 | 0.700 |
| **colon** | 0.855 | 0.823 | 0.823 | 0.726 | 0.855 | 0.839 | 0.774 | 0.774 | 0.871 |
| **leukemia** | 0.921 | 0.895 | 0.921 | 0.816 | 0.868 | 0.974 | 0.974 | 0.763 | 0.974 |
| **lymphoma** | 0.968 | 1.000 | 1.000 | 0.903 | 0.960 | 0.984 | 0.984 | 1.000 | 0.984 |
| **NCI60** | 0.902 | 0.869 | 0.869 | 0.738 | 0.885 | 0.902 | 0.852 | 0.869 | 0.918 |
| **prostate** | 0.931 | 0.892 | 0.902 | 0.853 | 0.873 | 0.627 | 0.804 | 0.853 | 0.912 |
| **srbct** | 1.000 | 0.944 | 0.984 | 0.921 | 0.857 | 0.905 | 0.952 | 0.873 | 1.000 |
| **BrainTumor2** | 0.760 | 0.750 | 0.740 | 0.620 | 0.760 | 0.700 | 0.700 | 0.660 | 0.720 |
| **DLBCL** | 0.909 | 0.851 | 0.883 | 0.831 | 0.922 | 0.779 | 0.870 | 0.792 | 0.857 |
| **lung1** | 0.931 | 0.931 | 0.931 | 0.828 | 0.914 | 0.931 | 0.931 | 0.897 | 0.914 |
| **lung2** | 0.935 | 0.935 | 0.935 | 0.826 | 0.957 | 0.978 | 0.935 | 0.848 | 0.978 |
| **lung3** | 0.901 | 0.901 | 0.887 | 0.803 | 0.873 | 0.859 | 0.831 | 0.859 | 0.887 |
| **psoriasis1** | 0.989 | 0.994 | 0.989 | 0.978 | 0.994 | 0.989 | 0.989 | 0.983 | 0.989 |
| **psoriasis2** | 0.963 | 0.988 | 0.976 | 0.963 | 0.976 | 0.963 | 0.963 | 0.963 | 0.963 |
| **MSstage1** | 0.846 | 0.846 | 0.846 | 0.423 | 0.769 | 0.769 | 0.808 | 0.769 | 0.769 |
| **MSdiagnosis1** | 0.963 | 0.926 | 0.926 | 0.556 | 0.889 | 0.889 | 0.963 | 0.926 | 0.926 |
| **MSdiagnosis2** | 0.591 | 0.614 | 0.614 | 0.568 | 0.545 | 0.568 | 0.568 | 0.568 | 0.523 |
| **MeanAccuracy** | 0.871 | 0.856 | 0.863 | 0.752 | 0.843 | 0.833 | 0.844 | 0.813 | 0.863 |
| **Rank** | 1 | 4 | 2.5 | 9 | 6 | 7 | 5 | 8 | 2.5 |
| **Pvalue** | NA | 0.029 | 0.079 | 0.00014 | 0.0075 | 0.05 | 0.014 | 0.00042 | 0.37 |

As seen from Table

Our evaluations focused on the accuracy (and misclassification error). However, a host of other accuracy measures could be considered. Additional file

**Sensitivity and specificity of predictors in the 20 disease gene expression data sets.** For each data set and prediction method, the table reports the sensitivity and specificity estimated using 3-fold cross validation. More precisely, the table reports the average 3-fold CV estimate over 100 random partitions of the data into 3 folds. Median sensitivity and specificity across data sets are summarized at the bottom.

Click here for file

A strength of this empirical comparison is that it involves clinically or biologically interesting data sets but a severe limitation is that it only involves 20 comparisons. Therefore, we now turn to more comprehensive empirical comparisons.

Binary outcome prediction

Empirical study involving dichotomized gene traits

Many previous empirical comparisons of gene expression data considered fewer than 20 data sets. To arrive at 700 comparisons, we used the following approach: We started out with 7 human and mouse gene expression data sets. For each data set, we randomly chose 100 genes as gene traits (outcomes) resulting in 7×100 possible outcomes. We removed the gene corresponding to the gene trait from the feature set. Next, each gene trait was dichotomized by its median value to arrive at a binary outcome. RGLM achieved significantly higher accuracy than RF (p of order 10^{−51}), RFbigmtry (median difference =0.01, p of order 10^{−16}), LDA (median difference =0.06, p of order 10^{−53}), SVM (median difference =0.03, p of order 10^{−62}) and SC (median difference =0.04, p of order 10^{−71}). Other predictors perform even worse, and the corresponding p-values are not shown.

Binary outcome prediction in empirical gene expression data sets

**Binary outcome prediction in empirical gene expression data sets.** The boxplots show the test set prediction accuracies across 700 comparisons. The horizontal line inside each box represents the median accuracy. The horizontal dashed red line indicates the median accuracy of the RGLM predictor. P-values result from using the two-sided Wilcoxon signed rank test for evaluating whether the median accuracy of RGLM is the same as that of the mentioned method. For example, p.RF results from testing whether the median accuracy of RGLM is the same as that of the RF. **(A)** summarizes the test set performance for predicting 100 dichotomized gene traits from each of the 7 expression data sets. **(B-H)** show the results for individual data sets. 100 randomly chosen, dichotomized gene traits were used. Note the superior accuracy of the RGLM predictor across the different data sets.

The fact that RFbigmtry is more accurate in this situation than the default version of RF probably indicates that relatively few genes are informative for predicting a dichotomized gene trait. Also note that RGLM is much more accurate than the unbagged forward selected GLM which reflects that forward selection greatly overfits the training data. In conclusion, these comprehensive gene expression studies show that RGLM has outstanding prediction accuracy.

Machine learning benchmark data analysis

Here we evaluate the performance of RGLM on the UCI machine learning benchmark data sets which are often used for evaluating prediction methods

For each data set, the prediction accuracy was estimated using 3-fold cross validation (averaged over 100 random partitions of the data into 3 folds).

| **Data set** | **RGLM** | **RGLM.inter2** | **RF** | **RFbigmtry** | **Rpart** | **LDA** | **DLDA** | **KNN** | **SVM** | **SC** |
|---|---|---|---|---|---|---|---|---|---|---|
| **BreastCancer** | 0.964 | 0.959 | 0.969 | 0.961 | 0.941 | 0.957 | 0.959 | 0.966 | 0.967 | 0.956 |
| **HouseVotes84** | 0.961 | 0.963 | 0.958 | 0.954 | 0.954 | 0.951 | 0.914 | 0.924 | 0.958 | 0.938 |
| **Ionosphere** | 0.883 | 0.946 | 0.932 | 0.917 | 0.875 | 0.863 | 0.809 | 0.849 | 0.940 | 0.829 |
| **diabetes** | 0.768 | 0.759 | 0.759 | 0.754 | 0.741 | 0.768 | 0.732 | 0.740 | 0.757 | 0.743 |
| **Sonar** | 0.769 | 0.837 | 0.817 | 0.788 | 0.707 | 0.726 | 0.697 | 0.812 | 0.822 | 0.726 |
| **ringnorm** | 0.577 | 0.973 | 0.940 | 0.910 | 0.770 | 0.567 | 0.570 | 0.590 | 0.977 | 0.535 |
| **threenorm** | 0.803 | 0.827 | 0.807 | 0.777 | 0.653 | 0.817 | 0.825 | 0.815 | 0.853 | 0.817 |
| **twonorm** | 0.937 | 0.953 | 0.947 | 0.920 | 0.733 | 0.957 | 0.960 | 0.947 | 0.953 | 0.960 |
| **Glass** | 0.636 | 0.743 | 0.827 | 0.799 | 0.729 | 0.659 | 0.531 | 0.808 | 0.748 | 0.645 |
| **Satellite** | 0.986 | 0.987 | 0.988 | 0.985 | 0.961 | 0.985 | 0.734 | 0.990 | 0.988 | 0.803 |
| **Vehicle** | 0.965 | 0.986 | 0.986 | 0.973 | 0.944 | 0.967 | 0.729 | 0.909 | 0.974 | 0.752 |
| **Vowel** | 0.936 | 0.986 | 0.983 | 0.976 | 0.950 | 0.938 | 0.853 | 0.999 | 0.991 | 0.909 |
| **MeanAccuracy** | 0.849 | 0.910 | 0.909 | 0.893 | 0.830 | 0.846 | 0.776 | 0.862 | 0.911 | 0.801 |
| **Rank** | 6 | 2 | 2 | 4 | 8 | 7 | 10 | 5 | 2 | 9 |
| **Pvalue** | 0.0093 | NA | 0.26 | 0.042 | 0.00049 | 0.0093 | 0.0067 | 0.11 | 0.96 | 0.0015 |

Overall, we find that RGLM.inter2 ties with SVM (

**Sensitivity and specificity of predictors in the UCI machine learning benchmark data.** For each data set and prediction method, the table reports the sensitivity and specificity estimated using 3-fold cross validation. More precisely, the table reports the average 3-fold CV estimate over 100 random partitions of the data into 3 folds. Median sensitivity and specificity across data sets are summarized at the bottom.

Click here for file

A potential limitation of these comparisons is that we considered pairwise interaction terms for the RGLM predictor but not for the other predictors. To address this issue, we also considered pairwise interactions among features for other predictors. Additional file

**Prediction accuracy when including pairwise interactions between features in the UCI machine learning benchmark data.** This table is an extension to Table

Click here for file

Simulation study involving binary outcomes

As described in Methods, we simulated 180 gene expression data sets with binary outcomes. The number of features (genes) ranged from 60 to 10000. The sample sizes (number of observations) of the training data ranged from 50 to 2000. To robustly estimate the test set accuracy we chose a large size for the corresponding test set data,

Binary outcome prediction in simulation

**Binary outcome prediction in simulation.** This boxplot shows the test set prediction accuracies across the 180 simulation scenarios. The red dashed line indicates the median accuracy of the RGLM. P-values result from using the two-sided Wilcoxon signed rank test for evaluating whether the median accuracy of RGLM is the same as that of the mentioned method.

Continuous outcome prediction

In the following, we show that RGLM also performs exceptionally well when dealing with continuous quantitative outcomes. We not only compare RGLM to a standard forward selected linear model predictor (forwardGLM) but also to a random forest predictor (for a continuous outcome). We do not report the findings for the k-nearest neighbor predictor of a continuous outcome since it performed much worse than the above mentioned approaches in our gene expression applications (the accuracy of a KNN predictor was decreased by about 30 percent). We again split the data into training and test sets. We use the correlation between test set predictions and truly observed test set outcomes as a measure of predictive accuracy. Note that this correlation coefficient can take on negative values (in case of a poorly performing prediction method).
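The correlation-based accuracy measure is simply the Pearson correlation between test-set predictions and observed outcomes; a plain Python sketch:

```python
import math
import statistics

def prediction_cor(pred, obs):
    """Pearson correlation between test set predictions and observed
    outcomes, used as the accuracy measure for continuous traits
    (it can be negative for a poorly performing predictor)."""
    mp, mo = statistics.mean(pred), statistics.mean(obs)
    num = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    den = math.sqrt(sum((p - mp) ** 2 for p in pred) *
                    sum((o - mo) ** 2 for o in obs))
    return num / den
```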

Empirical study involving continuous gene traits

Here we used the same 700 gene expression comparisons as described above (100 randomly chosen gene traits from each of 7 gene expression data sets) but did not dichotomize the gene traits. Incidentally, prediction methods for gene traits are often used for imputing missing gene expression values. Our results presented in Figure

Continuous outcome prediction in empirical gene expression data sets

**Continuous outcome prediction in empirical gene expression data sets.** The boxplots show the test set prediction correlation in 700 applications. P-values result from using the two-sided Wilcoxon signed rank test for evaluating whether the median accuracy of RGLM is the same as that of the mentioned method. **(A)** summarizes the test set performance for predicting 100 continuous gene traits from each of the 7 expression data set. **(B-H)** show the results for individual data sets. RGLM is superior to other methods overall.

Mouse tissue expression data involving continuous clinical outcomes

Here we used the mouse liver and adipose tissue gene expression data sets to predict 21 clinical outcomes (detailed in Methods). Again, RGLM achieved significantly higher median prediction accuracy compared to the other predictors (Figure

Continuous clinical outcome prediction in mouse adipose and liver data sets

**Continuous clinical outcome prediction in mouse adipose and liver data sets.** The boxplots show the test set prediction correlation for predicting 21 clinical outcomes in **(A)** mouse adipose and **(B)** mouse liver. The red dashed line indicates the median correlation for RGLM. P-values result from using the two-sided Wilcoxon signed rank test for evaluating whether the median accuracy of RGLM is the same as that of the mentioned method.

Simulation study involving continuous outcomes

180 gene expression data sets were simulated in the same way as described previously (for evaluating a binary outcome), but here the outcome

Continuous outcome prediction in simulation studies

**Continuous outcome prediction in simulation studies.** This boxplot shows the test set prediction accuracy across the 180 simulation scenarios. The red dashed line indicates the median accuracy for the RGLM. Wilcoxon signed rank test p-values are presented.

Comparing RGLM with penalized regression models

In our previous comparisons, we found that RGLM greatly outperforms forward selected GLM methods based on the AIC criterion. Many powerful alternatives to forward variable selection have been developed in the literature, in particular penalized regression models. Here, we compare RGLM to 3 major types of penalized regression models: ridge regression, the elastic net and the lasso. RGLM significantly outperforms ridge regression (p of order 10^{−52}) and the lasso (p of order 10^{−10}) on the 700 dichotomized gene expression trait data. Also, RGLM is significantly better than the elastic net (p of order 10^{−27}) and the lasso (p of order 10^{−28}) in simulations with binary outcomes. Analogous results hold for continuous outcomes: RGLM outperforms the penalized models in the 700 continuous gene expression trait data (p of order 10^{−86}) and outperforms the elastic net (p of order 10^{−25}) and the lasso (p of order 10^{−27}) in simulations with continuous outcomes.

Penalized regression models versus RGLM

**Penalized regression models versus RGLM.** The heatmap reports the median difference in accuracy between RGLM and 3 types of penalized regression models in **(A)** binary outcome prediction and **(B)** continuous outcome prediction. Each cell entry reports the paired median difference in accuracy (upper number) and the corresponding Wilcoxon signed rank test p-value (lower number). The cell color indicates the significance of the finding, where red implies that RGLM outperforms penalized regression model and green implies the opposite. The color panel on the right side shows how colors correspond to −

As a caveat, we mention that cross validation methods were not used to inform the parameter choices of the penalized regression models since the RGLM predictor was also not allowed to fine tune its parameters. By only using default parameter choices we ensure a fair comparison. In a secondary analysis, however, we allowed penalized regression models to use cross validation for informing the choice of the parameters. While this slightly improved the performance of the penalized regression models (data not shown), it did not affect our main conclusion. RGLM outperforms penalized regression models in these comparisons.

Feature selection

Here we briefly describe how RGLM naturally gives rise to variable (feature) importance measures. We compare the variable importance measures of RGLM with alternative approaches and show how variable importance measures can be used for defining a thinned RGLM predictor with few features.

Variable importance measure

There is a vast literature on using ensemble predictors and bagging for selecting features. For example, Meinshausen and Bühlmann describe “stability selection” based on variable selection employed in regression models

To reveal relationships between different types of variable importance measures, we present a hierarchical cluster tree of RGLM measures, RF measures and standard marginal analysis based on correlations in Figure

Relationship between variable importance measures based on the Pearson correlation across 70 tests

**Relationship between variable importance measures based on the Pearson correlation across 70 tests.** This figure shows the hierarchical cluster tree (dendrogram) of 7 variable importance measures.

Leo Breiman already pointed out that random forests could be used for feature selection in genomic applications. Díaz-Uriarte et al. proposed a related gene selection method based on the RF which yields small sets of genes

**Comparison of RGLM based feature selection method with the RF based method of Díaz-Uriarte et al.** For each data set in the 20 disease gene expression data, the RF based variable selection method by Díaz-Uriarte

Click here for file

RGLM predictor thinning based on a variable importance measure

Both RGLM and random forest have superior prediction accuracy but they differ with respect to how many features are being used. Recall that the random forest is composed of individual trees. Each tree is constructed by repeated node splits. The number of features considered at each node split is determined by the RF parameter

In RGLM, the number of times a feature is selected by the forward regression models across all bags provides a natural variable importance measure, and it forms the basis of **RGLM predictor thinning**. Thus, features whose value of this importance measure lies below a given thinning threshold are removed from the ensemble.
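Counting feature selections across bags and thinning by a threshold can be sketched as follows (Python; the data structures and names are illustrative, not those of the R package, and the real method would refit each bag's GLM after thinning):

```python
from collections import Counter

def thin_ensemble(selected_features_per_bag, threshold):
    """RGLM thinning sketch.

    `selected_features_per_bag` is a list (one entry per bag) of the
    feature indices retained by that bag's forward-selected GLM.
    Features selected in no more than `threshold` bags are dropped
    from every bag.
    """
    # Importance measure: how often each feature was chosen across bags.
    counts = Counter(f for bag in selected_features_per_bag for f in bag)
    kept = {f for f, c in counts.items() if c > threshold}
    thinned = [[f for f in bag if f in kept]
               for bag in selected_features_per_bag]
    return counts, kept, thinned
```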

Figure

RGLM predictor thinning

**RGLM predictor thinning.** This figure averages the thinning results of 700 applications (predicting 100 gene traits from each of 7 empirical data set). **(A)** Accuracies decrease as the thinning threshold increases. The black and blue lines represent the median and mean accuracies, respectively. **(B)** The average fraction of genes left in final models (y-axis) drops quickly as the thinning threshold increases as shown in the black line. The function in Equation 1 approximates the relationship between the two variables as shown in the red line. **(C)** Number of genes used in prediction for no thinning versus thinning threshold equal to 20. On average, less than 20% of genes remain.

Interestingly, the accuracy diminishes very slowly for initial, low threshold values. But even low threshold values lead to a markedly sparser ensemble predictor (Figure

We have found that the following empirical function accurately describes the relationship between thinning threshold (

where

**Effect of the number of bags on RGLM predictor thinning.** This figure reports how prediction accuracy changes as variable thinning is applied to the RGLM. Results are averaged over the 100 dichotomized gene traits in the mouse adipose data set. The five rows correspond to

Click here for file

Our results demonstrate that the number of required features decreases rapidly even for low values of the thinning threshold without compromising the prediction accuracy of the thinned predictor. Figure

RGLM thinning versus RF thinning

The idea behind RGLM thinning is to remove features with low values of the variable importance measure. Of course, a similar idea can be applied to other predictors. Here we briefly evaluate the performance of a thinned random forest predictor which removed variables based on a low value of its importance measure (“mean decreased accuracy”). To arrive at an unbiased comparison, both RGLM and RF are thinned based on results obtained in the training data. Next, accuracies of the thinned predictors are evaluated in the test set data. Figure

RGLM thinning versus RF thinning

**RGLM thinning versus RF thinning.** This figure compares the thinned RGLM with the thinned RF in **(A)** the 20 disease related data sets and **(B)** the 700 gene expression traits. Numbers that connect dashed lines are RGLM thinning thresholds. For a pre-specified threshold, the number of features used for a thinned random forest is matched with that for the thinned RGLM (except for a threshold of 0). The

Discussion

Why was the RGLM not discovered earlier?

After Breiman proposed the idea of bagged linear regression models in 1996

How do modifications of a GLM affect the prediction accuracy

**How do modifications of a GLM affect the prediction accuracy.** The figure illustrates how two bad modifications to a GLM add up to a superior predictor (RGLM). In general, bagging or forward model selection alone lower the prediction accuracy of generalized linear models (such as logistic regression models). However, combining these two bad modifications leads to the superior prediction accuracy of the RGLM predictor. The figure may also explain why the benefits of RGLM type predictors were not previously recognized.

Additional reasons why the merits of RGLM were not recognized earlier may include the following. First, it may be a historical accident: bagging was quickly overshadowed by other, seemingly more accurate ways of constructing ensemble predictors, such as boosting

Second, previous comparisons of bagged predictors in the context of genomic data were based on limited empirical evaluations; many involved fewer than 20 microarray data sets when comparing predictors

Third, previous studies probably did not consider enough bootstrap samples (bags). While those studies used 10 to 50 bags, we always used 100 bags when constructing the RGLM. To illustrate how prediction accuracy depends on the number of bags, we evaluated the brain cancer data with 1 to 500 bags, using 5 gene traits randomly selected from those used in our binary and continuous outcome prediction analyses, respectively. The results are shown in Additional file

**Prediction accuracy versus number of bags used for RGLM.** This figure presents the results for predicting 5 gene traits in the brain cancer data set when different numbers of bags (bootstrap samples) are used for constructing the RGLM. Each color represents one gene trait. **(A)** Binary outcome prediction. The 5 gene traits were randomly selected from all 100 gene traits used in the binary outcome prediction section. **(B)** Continuous outcome prediction. The 5 gene traits were randomly selected from all 100 gene traits used in the continuous outcome prediction.


Strengths and limitations

RGLM shares many advantages of bagged predictors including a nearly unbiased estimate of the prediction accuracy (the out-of-bag estimate) and several variable importance measures. While our empirical studies focus on binary and continuous outcomes, it is straightforward to define RGLM for count outcomes (resulting in a random Poisson regression model) and for multi-class outcomes (resulting in a random multinomial regression model).
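
The out-of-bag estimate mentioned above is easy to illustrate. The minimal numpy sketch below (our simplification, not the authors' implementation: ensemble members are plain least-squares fits without the random subspace or forward selection steps) records, for each observation, the average prediction of the bags that did not sample it, and turns these held-out predictions into a nearly unbiased R² estimate:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy continuous-outcome data
n, p = 150, 8
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

n_bags = 100
oob_sum = np.zeros(n)   # running sum of out-of-bag predictions per observation
oob_cnt = np.zeros(n)   # number of bags in which each observation was out-of-bag
for _ in range(n_bags):
    rows = rng.integers(0, n, size=n)           # bootstrap sample (with replacement)
    oob = np.setdiff1d(np.arange(n), rows)      # observations not drawn into this bag
    A = np.column_stack([np.ones(n), X[rows]])
    beta, *_ = np.linalg.lstsq(A, y[rows], rcond=None)
    pred = np.column_stack([np.ones(len(oob)), X[oob]]) @ beta
    oob_sum[oob] += pred
    oob_cnt[oob] += 1

oob_pred = oob_sum / np.maximum(oob_cnt, 1)
ss_res = np.sum((y - oob_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print("out-of-bag estimate of R^2:", round(r2, 3))
```

Because each observation is out-of-bag in roughly a third of the bags, every observation receives a held-out prediction, so no separate test set is needed for this accuracy estimate.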

A noteworthy limitation of RGLM is its computational complexity, since the forward selection process (e.g. by the function

Our empirical studies demonstrate that RGLM compares favorably with the random forest, support vector machines, penalized regression models, and many other widely used prediction methods. As a caveat, we note that we used default parameter settings for each of these methods in order to ensure a fair comparison. Future studies could evaluate how these prediction methods compare when resampling schemes (e.g. cross-validation) are used to inform parameter choices. Our

Conclusions

Since individual forward-selected GLMs are highly interpretable, the resulting ensemble predictor is more interpretable than an RF predictor. Our empirical studies (20 disease related gene expression data sets, 700 gene expression traits, the UCI benchmark data) clearly highlight the outstanding prediction accuracy afforded by the RGLM. High accuracies are achieved not only in genomic data sets (many features, small sample size) but also in the UCI benchmark data (few features, large sample size).

Abbreviations

RGLM: Random generalized linear model; RGLM.inter2: RGLM considering pairwise interactions between features; RGLM.inter3: RGLM considering two-way and three-way interactions between features; forwardGLM: Forward selected generalized linear model; RF: Random forest with default mtry; RFbigmtry: Random forest with mtry equal to the total number of features; GLM: Generalized linear model; Rpart: Recursive partitioning; LDA: Linear discriminant analysis; DLDA: Diagonal linear discriminant analysis; KNN: K nearest neighbor; SVM: Support vector machine; SC: Shrunken centroids; RSM: Random subspace method; RMNL: Random multinomial logit model; RKNN: Random K nearest neighbor; E-RFE: Entropy-based recursive feature elimination; AIC: Akaike information criterion; aMV: Adjusted majority vote.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

LS carried out all analyses. PL helped with the R implementation and analysis. LS and SH developed the method and wrote the article. SH conceived of the study. All authors read and approved the final manuscript.

Acknowledgements

We acknowledge grant support from 1R01DA030913-01, P50CA092131, P30CA16042, UL1TR000124. We acknowledge the efforts of IGC and expO in providing data set GSE2109.