A computational pipeline for the development of multi-marker bio-signature panels and ensemble classifiers
1 NCE CECR Prevention of Organ Failure (PROOF) Centre of Excellence, Vancouver, BC, V6Z 1Y6, Canada
2 Department of Statistics, University of British Columbia, Vancouver, BC, V6T 1Z2, Canada
3 Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC, V6T 2B5, Canada
4 Immunity and Infection Research Centre, Vancouver, BC, V5Z 3J5, Canada
5 Immunology Laboratory, Vancouver General Hospital, Vancouver, BC, V5Z 1M9, Canada
6 Department of Medicine, University of British Columbia, Vancouver, BC, V5Z 1M9, Canada
7 James Hogg Research Centre, St. Paul’s Hospital, University of British Columbia, Vancouver, BC, V6Z 1Y6, Canada
8 Department of Computer Science, University of British Columbia, Vancouver, BC, V6T 1Z2, Canada
9 Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6T 1Z3, Canada
10 Institute for HEART+LUNG Health, Vancouver, BC, V6Z 1Y6, Canada
11 Department of Medicine, Division of Respiratory Medicine, University of British Columbia, Vancouver, BC, V5Z 1M9, Canada
12 Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
BMC Bioinformatics 2012, 13:326 doi:10.1186/1471-2105-13-326
Published: 8 December 2012
Biomarker panels derived separately from genomic and proteomic data and with a variety of computational methods have demonstrated promising classification performance in various diseases. An open question is how to create effective proteo-genomic panels. The framework of ensemble classifiers has been applied successfully in various analytical domains to combine classifiers so that the performance of the ensemble exceeds the performance of individual classifiers. Using blood-based diagnosis of acute renal allograft rejection as a case study, we address the following question in this paper: Can acute rejection classification performance be improved by combining individual genomic and proteomic classifiers in an ensemble?
The first part of the paper presents a computational biomarker development pipeline for genomic and proteomic data. The pipeline begins with data acquisition (e.g., from bio-samples to microarray data), proceeds through quality control and statistical analysis and mining of the data, and ends with various forms of validation. The pipeline ensures that the various classifiers to be combined later in an ensemble are diverse and adequate for clinical use. Five mRNA genomic and five proteomic classifiers were developed independently using single time-point blood samples from 11 acute-rejection and 22 non-rejection renal transplant patients. The second part of the paper examines five ensembles ranging in size from two to ten individual classifiers. Ensemble performance is characterized by area under the curve (AUC), sensitivity, and specificity. These measures are derived from the probabilities of acute rejection produced by the individual classifiers in the ensemble, combined using one of two aggregation methods: (1) Average Probability or (2) Vote Threshold. One ensemble demonstrated superior performance: it improved sensitivity and AUC beyond the best values observed for any of the individual classifiers in the ensemble, while staying within the range of observed specificity. The Vote Threshold aggregation method achieved improved sensitivity for all five ensembles, but typically at the cost of decreased specificity.
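The two aggregation methods named above can be illustrated with a minimal Python sketch. The abstract does not give their exact definitions, so everything here is an assumption made for illustration: the 0.5 probability cutoff, the `min_votes` parameter, the function names, and the toy probabilities, which are not data from the study.

```python
from statistics import mean

def average_probability(probs):
    """Ensemble score: mean of the individual classifiers' rejection probabilities."""
    return mean(probs)

def vote_threshold(probs, cutoff=0.5, min_votes=1):
    """Call acute rejection (1) if at least `min_votes` classifiers
    assign a probability at or above `cutoff` (both values are assumed)."""
    votes = sum(p >= cutoff for p in probs)
    return int(votes >= min_votes)

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """AUC as the chance that a rejection sample outscores a
    non-rejection sample, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 = acute rejection, 0 = non-rejection.
# Each row holds the probabilities from a two-classifier ensemble
# (say, one genomic and one proteomic classifier).
y_true = [1, 1, 0, 0, 0]
probs = [
    [0.80, 0.40],
    [0.35, 0.60],
    [0.20, 0.30],
    [0.55, 0.10],
    [0.30, 0.20],
]

avg_scores = [average_probability(p) for p in probs]
avg_pred = [int(s >= 0.5) for s in avg_scores]   # assumed 0.5 cutoff
vote_pred = [vote_threshold(p) for p in probs]

print("Average Probability:", sensitivity_specificity(y_true, avg_pred),
      "AUC =", auc(y_true, avg_scores))
print("Vote Threshold:     ", sensitivity_specificity(y_true, vote_pred))
```

With `min_votes=1`, a single confident classifier is enough to call rejection, so the vote rule flags at least as many samples as averaging does on the same inputs. On this toy data it flags more, which is consistent in direction with the trade-off reported above: higher sensitivity, lower specificity.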
Proteo-genomic biomarker ensemble classifiers show promise in the diagnosis of acute renal allograft rejection and can improve classification performance beyond that of individual genomic or proteomic classifiers alone. Validation of our results in an international multicenter study is currently underway.