This article is part of the supplement: The 2010 International Conference on Bioinformatics and Computational Biology (BIOCOMP 2010): Systems Biology
Statistical methods on detecting differentially expressed genes for RNA-seq data
1 Biostatistics Epidemiology Research Design Core, Center for Clinical and Translational Sciences, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
2 The Chem21 Group Inc, 1780 Wilson Drive, Lake Forest, IL 60045, USA
3 Department of Statistical Science, Southern Methodist University, Dallas, TX 75275, USA
4 School of Mathematics, University of Manchester, Manchester, M13 9PL, UK
5 Rush University Cancer Center, Rush University Medical Center, Chicago, IL 60612, USA
6 Department of General Surgery and Immunology and Microbiology, Rush University Medical Center, Chicago, IL 60612, USA
7 Department of Radiation Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA
8 Department of Internal Medicine and Biochemistry, Rush University Medical Center, Chicago, IL 60612, USA
BMC Systems Biology 2011, 5(Suppl 3):S1 doi:10.1186/1752-0509-5-S3-S1
Published: 23 December 2011
For RNA-seq data, the aggregated counts of short reads from the same gene are used to approximate that gene's expression level. The count data can be modelled as samples from Poisson distributions with possibly different parameters. To detect genes that are differentially expressed between two conditions, statistical methods for detecting the difference between two Poisson means are used. When the expression level of a gene is low, i.e., the read count is small, mean differences are usually harder to detect, so statistical methods that are more powerful at low expression levels are particularly desirable. In the statistical literature, several methods have been proposed to compare two Poisson means (rates). In this paper, we compare these methods using simulated and real RNA-seq data.
Through a simulation study and real data analysis, we find that the Wald test applied to log-transformed data is more powerful than the other methods. The likelihood ratio test has power similar to that of the variance-stabilizing transformation test, and both are more powerful than the conditional exact test and the Fisher exact test.
When the count data in RNA-seq can be reasonably modelled by a Poisson distribution, the Wald-Log test is more powerful than the other methods compared and should be used to detect differentially expressed genes.
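As an illustration of the kind of comparison described above, the following is a minimal Python sketch of two tests for the equality of two Poisson rates: a Wald test on the log scale and a likelihood ratio test. It uses common textbook forms of these statistics, which may differ in detail from the definitions used in the paper. The function names, the exposure arguments `n1` and `n2` (for example, library sizes), and the 0.5 pseudo-count used when a count is zero are all assumptions made for this example.

```python
import math


def wald_log_test(x1, x2, n1=1.0, n2=1.0, pseudo=0.5):
    """Two-sided Wald test on the log scale for H0: lambda1 == lambda2.

    x1, x2 are total read counts for one gene under two conditions;
    n1, n2 are exposures (hypothetical here, e.g. library sizes).
    """
    # Assumption: add a small pseudo-count to both counts when either
    # is zero, so that the logarithm and the variance are defined.
    if x1 == 0 or x2 == 0:
        x1, x2 = x1 + pseudo, x2 + pseudo
    # Delta-method variance of log(X) for a Poisson count is about 1/X.
    z = (math.log(x1 / n1) - math.log(x2 / n2)) / math.sqrt(1.0 / x1 + 1.0 / x2)
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


def likelihood_ratio_test(x1, x2, n1=1.0, n2=1.0):
    """Likelihood ratio test for H0: lambda1 == lambda2 (chi-square, 1 df)."""
    # Pooled rate estimate under the null hypothesis.
    pooled = (x1 + x2) / (n1 + n2)

    def term(x, mu):
        # Convention: 0 * log(0) = 0.
        return 0.0 if x == 0 else x * math.log(x / mu)

    stat = 2.0 * (term(x1, n1 * pooled) + term(x2, n2 * pooled))
    stat = max(stat, 0.0)  # guard against tiny negative rounding errors
    # Upper tail of chi-square with 1 df: P(Z^2 > s) = erfc(sqrt(s / 2)).
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p


if __name__ == "__main__":
    # Hypothetical counts for a single low-expression gene.
    print(wald_log_test(20, 10))
    print(likelihood_ratio_test(20, 10))
```

Both functions use large-sample approximations (normal and chi-square reference distributions), so the p-values are only approximate for very small counts, which is the regime where the power comparison in this paper matters most.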