Large-scale integration of cancer microarray data identifies a robust common cancer signature
1 The Institute for Computational Medicine and Center for Cardiovascular Bioinformatics and Modeling, Johns Hopkins University, Baltimore, MD 21218, USA
2 Department of Applied Mathematics and Statistics and Center for Imaging Sciences, Johns Hopkins University, Baltimore, MD 21218, USA
BMC Bioinformatics 2007, 8:275 doi:10.1186/1471-2105-8-275
Published: 30 July 2007
There is a continuing need to develop molecular diagnostic tools that complement histopathologic examination and increase the accuracy of cancer diagnosis. DNA microarrays provide a means of measuring gene expression signatures, which can then be used as components of genomics-based diagnostic tests to determine the presence of cancer.
In this study, we collect and integrate ~1500 microarray gene expression profiles from 26 published cancer data sets across 21 major human cancer types. We then apply a statistical method, referred to as the Top-Scoring Pair of Groups (TSPG) classifier, together with a repeated random sampling strategy, to the integrated training data sets and identify a common cancer signature consisting of 46 genes. These 46 genes divide naturally into two distinct groups: in cancer tissues, the genes in one group are typically expressed at lower levels than those in the other group. Given a new expression profile, the classifier discriminates cancer from normal tissue by ranking the expression values of the 46 signature genes and comparing the average ranks of the two groups. The signature is then validated by applying this decision rule to independent test data.
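The rank-based decision rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the gene identifiers, the handling of tied values, and the choice to call a profile "normal" when the two average ranks are equal are all assumptions made for illustration. The sketch only assumes that one group of signature genes is expected to rank lower than the other in cancer tissue.

```python
import numpy as np


def rank_values(values):
    """Return 1-based ranks (1 = lowest value).

    Assumption: ties are broken by position, which is a simplification.
    """
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks


def classify_profile(profile, low_group, high_group):
    """Classify one expression profile by comparing average ranks.

    profile    : dict mapping gene identifier -> expression value
    low_group  : genes assumed to be expressed less in cancer (hypothetical names)
    high_group : genes assumed to be expressed more in cancer (hypothetical names)
    """
    genes = list(low_group) + list(high_group)
    values = np.array([profile[g] for g in genes], dtype=float)

    # Rank only the signature genes against each other within this profile.
    ranks = rank_values(values)
    low_mean = ranks[: len(low_group)].mean()
    high_mean = ranks[len(low_group):].mean()

    # Assumed tie rule: equal average ranks are called "normal".
    return "cancer" if low_mean < high_mean else "normal"


# Toy example with four hypothetical genes instead of the 46 in the signature.
low_group = ["g1", "g2"]
high_group = ["g3", "g4"]
cancer_like = {"g1": 1.0, "g2": 2.0, "g3": 5.0, "g4": 6.0}
normal_like = {"g1": 6.0, "g2": 5.0, "g3": 2.0, "g4": 1.0}

print(classify_profile(cancer_like, low_group, high_group))
print(classify_profile(normal_like, low_group, high_group))
```

Because the rule depends only on the ordering of expression values within a single profile, it is unchanged by any monotone rescaling of that profile, which is one reason a rank-based rule is convenient when profiles from different studies and platforms are combined.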
By combining the TSPG method and repeated random sampling, a robust common cancer signature has been identified from large-scale microarray data integration. Upon further validation, this signature may be useful as a robust and objective diagnostic test for cancer.