Modeling the next generation sequencing sample processing pipeline for the purposes of classification
1 AgriLife Genomics and Bioinformatics Services, Texas AgriLife Research, Texas A&M System, College Station, TX, 77843, USA
2 Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, 43210, USA
3 Department of Veterinary Physiology and Pharmacology, Texas A&M University, College Station, TX, 77843, USA
4 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA
5 Translational Genomics Research Institute (TGen), 400 North Fifth Street, Suite 1600, Phoenix, AZ, 85004 USA
BMC Bioinformatics 2013, 14:307 doi:10.1186/1471-2105-14-307
Published: 11 October 2013
A key goal of systems biology and translational genomics is to utilize high-throughput measurements of cellular states to develop expression-based classifiers for discriminating among different phenotypes. Recent developments in Next Generation Sequencing (NGS) technologies can facilitate classifier design by providing expression measurements for tens of thousands of genes simultaneously via the abundance of their mRNA transcripts. Because NGS technologies apply a nonlinear transformation to the actual expression distributions, the resulting data can be less discriminative than the actual expression levels would be, were they directly observable.
Using state-of-the-art distributional modeling for the NGS processing pipeline, this paper studies how that pipeline, via the resulting nonlinear transformation, affects classification and feature selection. The effects of different factors are considered, and NGS-based classification is compared both to SAGE-based classification and to classification directly on the raw expression data, which are represented by a very high-dimensional model previously developed for gene expression. As expected, the nonlinear transformation resulting from NGS processing diminishes classification accuracy; however, owing to a larger number of reads, NGS-based classification outperforms SAGE-based classification.
A high number of reads can mitigate the degradation in classification performance caused by NGS processing. Hence, when performing an RNA-Seq analysis for the purposes of classification, using the highest possible coverage of the genome is recommended.
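The effect of read depth on classification can be illustrated with a toy simulation. The sketch below is not the model used in the paper: it assumes Gaussian latent log-expression, Poisson read counts whose means are proportional to relative abundance, a log(1 + count) feature transform, and a plain linear discriminant classifier. All function names, gene counts, sample sizes, and depths are hypothetical choices made for illustration only.

```python
import numpy as np


def sample_log_expr(n_per_class, n_genes, n_markers, shift, sigma, rng):
    """Draw latent log-expression; the first n_markers genes are shifted in class 1."""
    y = np.repeat([0, 1], n_per_class)
    x = rng.normal(0.0, sigma, size=(2 * n_per_class, n_genes))
    x[y == 1, :n_markers] += shift
    return x, y


def simulate_reads(log_expr, depth, rng):
    """Assumed sequencing model: Poisson counts with mean proportional to abundance."""
    abundance = np.exp(log_expr)
    # Scale each sample so that its expected total number of reads equals `depth`.
    rate = depth * abundance / abundance.sum(axis=1, keepdims=True)
    return rng.poisson(rate)


def lda_error(x_train, y_train, x_test, y_test):
    """Test error of linear discriminant analysis with a pooled covariance estimate."""
    m0 = x_train[y_train == 0].mean(axis=0)
    m1 = x_train[y_train == 1].mean(axis=0)
    centered = np.vstack([x_train[y_train == 0] - m0, x_train[y_train == 1] - m1])
    cov = centered.T @ centered / (len(x_train) - 2)
    cov += 1e-6 * np.eye(cov.shape[0])  # small ridge for numerical stability
    w = np.linalg.solve(cov, m1 - m0)
    b = -0.5 * w @ (m0 + m1)
    pred = (x_test @ w + b > 0).astype(int)
    return float(np.mean(pred != y_test))


def run_experiment(seed=0, low_depth=40, high_depth=100_000):
    """Compare LDA error on latent expression with error on simulated read counts."""
    rng = np.random.default_rng(seed)
    x_tr, y_tr = sample_log_expr(200, 20, 2, 1.0, 0.5, rng)
    x_te, y_te = sample_log_expr(2000, 20, 2, 1.0, 0.5, rng)
    errors = {"latent": lda_error(x_tr, y_tr, x_te, y_te)}
    for name, depth in (("low_depth", low_depth), ("high_depth", high_depth)):
        f_tr = np.log1p(simulate_reads(x_tr, depth, rng))
        f_te = np.log1p(simulate_reads(x_te, depth, rng))
        errors[name] = lda_error(f_tr, y_tr, f_te, y_te)
    return errors


if __name__ == "__main__":
    for name, err in run_experiment().items():
        print(f"{name}: test error = {err:.3f}")
```

Under these assumptions, the error on counts drawn at the low depth should be clearly higher than the error at the high depth, which is the qualitative pattern the recommendation above relies on. The exact numbers depend on the arbitrary parameters chosen here and say nothing about real RNA-Seq data.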