Open Access Highly Accessed Methodology article

Random generalized linear model: a highly accurate and interpretable ensemble predictor

Lin Song (1,2), Peter Langfelder (1) and Steve Horvath (1,2)*

Author Affiliations

1 Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, California, USA

2 Biostatistics, School of Public Health, University of California, Los Angeles, California, USA

BMC Bioinformatics 2013, 14:5  doi:10.1186/1471-2105-14-5

Published: 16 January 2013

Additional files

Additional file 1:

Simulation study design. This file describes the simulation studies and presents R code used for simulating the data set.

Format: PDF. Size: 128KB

Additional file 2:

Sensitivity and specificity of predictors in the 20 disease gene expression data sets. For each data set and prediction method, the table reports the sensitivity and specificity estimated using 3-fold cross validation. More precisely, the table reports the average 3-fold CV estimate over 100 random partitions of the data into 3 folds. Median sensitivity and specificity across data sets are summarized at the bottom.

Format: PDF. Size: 13KB
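
The estimation procedure described for this table (3-fold cross validation averaged over 100 random partitions) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the toy threshold predictor standing in for the real prediction methods, and the choice to pool held-out predictions across the 3 folds before computing sensitivity and specificity are all assumptions.

```python
import random
from statistics import mean

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec

def fit_threshold(x_train, y_train):
    """Hypothetical stand-in predictor: cut halfway between the class means.

    Assumes both classes are present in the training data.
    """
    m1 = mean(x for x, y in zip(x_train, y_train) if y == 1)
    m0 = mean(x for x, y in zip(x_train, y_train) if y == 0)
    cut = (m0 + m1) / 2
    sign = 1 if m1 >= m0 else -1
    return lambda x: 1 if sign * (x - cut) > 0 else 0

def repeated_cv(x, y, n_folds=3, n_repeats=100, seed=0):
    """Average CV sensitivity/specificity over random fold partitions."""
    rng = random.Random(seed)
    n = len(y)
    sens_runs, spec_runs = [], []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)  # a fresh random partition for each repeat
        folds = [idx[k::n_folds] for k in range(n_folds)]
        pred = [None] * n
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in idx if i not in held_out]
            predict = fit_threshold([x[i] for i in train_idx],
                                    [y[i] for i in train_idx])
            for i in test_idx:
                pred[i] = predict(x[i])  # each sample is predicted exactly once
        # Assumption: pool the held-out predictions of all folds per partition.
        sens, spec = sensitivity_specificity(y, pred)
        sens_runs.append(sens)
        spec_runs.append(spec)
    return mean(sens_runs), mean(spec_runs)
```

Calling `repeated_cv(x, y)` on a single numeric feature `x` and 0/1 labels `y` returns the two averaged estimates; any real predictor could replace `fit_threshold` as long as it returns a function mapping a sample to 0 or 1.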

Additional file 3:

Sensitivity and specificity of predictors in the UCI machine learning benchmark data. For each data set and prediction method, the table reports the sensitivity and specificity estimated using 3-fold cross validation. More precisely, the table reports the average 3-fold CV estimate over 100 random partitions of the data into 3 folds. Median sensitivity and specificity across data sets are summarized at the bottom.

Format: PDF. Size: 7KB

Additional file 4:

Prediction accuracy when including pairwise interactions between features in the UCI machine learning benchmark data. This table is an extension to Table 5. It shows the prediction accuracy of predictors other than RGLM when considering pairwise interactions between features in the same UCI mlbench data sets. Although several predictors show improvement, none of them beats RGLM.inter2.

Format: PDF. Size: 11KB
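
To make "including pairwise interactions between features" concrete, the sketch below expands a feature table with one extra column per pair of original features. It assumes that an interaction term is the product of the two features, which is a common convention but is not stated in this listing; the function name and the `a:b` column naming are hypothetical.

```python
from itertools import combinations

def add_pairwise_interactions(rows, names):
    """Append one product column for every unordered pair of features.

    rows  : list of samples, each a list of numeric feature values
    names : list of feature names, same length as each sample
    """
    pairs = list(combinations(range(len(names)), 2))
    new_names = list(names) + [f"{names[i]}:{names[j]}" for i, j in pairs]
    new_rows = [list(r) + [r[i] * r[j] for i, j in pairs] for r in rows]
    return new_rows, new_names
```

With p original features this adds p(p-1)/2 columns, so the expanded table grows quadratically; that is manageable for the small UCI benchmark data sets but would be much larger for gene expression data with thousands of features.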

Additional file 5:

Comparison of the RGLM-based feature selection method with the RF-based method of Díaz-Uriarte et al. For each of the 20 disease gene expression data sets, the RF-based variable selection method of Díaz-Uriarte et al. selects a small set of genes. For each selected gene, the file reports its rank according to the RGLM variable importance measure timesSelectedByForwardRegression. As expected, only a few of the selected genes rank highly by timesSelectedByForwardRegression, illustrating that the two variable selection methods differ.

Format: PDF. Size: 6KB

Additional file 6:

Effect of the number of bags on RGLM predictor thinning. This figure reports how prediction accuracy changes as variable thinning is applied to the RGLM. Results are averaged over the 100 dichotomized gene traits in the mouse adipose data set. The five rows correspond to nBags values of 20, 50, 100, 200 and 500, respectively. Within each row, the two panels have the same meaning as in Figure 9.

Format: PDF. Size: 22KB

Additional file 7:

Prediction accuracy versus number of bags used for RGLM. This figure presents the results for predicting 5 gene traits in the brain cancer data set when different numbers of bags (bootstrap samples) are used for constructing the RGLM. Each color represents one gene trait. (A) Binary outcome prediction. The 5 gene traits were randomly selected from all 100 gene traits used in the binary outcome prediction section. (B) Continuous outcome prediction. The 5 gene traits were randomly selected from all 100 gene traits used in the continuous outcome prediction.

Format: PDF. Size: 24KB