Abstract
Background
The cost of DNA sequencing has dropped dramatically in the past decade. As a result, sequencing technologies have been increasingly applied to genomic research. RNA-Seq is becoming a common technique for surveying gene expression based on DNA sequencing. Because it is not clear how increased sequencing capacity has affected measurement accuracy of mRNA, we sought to investigate that relationship.
Results
We empirically evaluated the accuracy of repeated gene expression measurements using RNA-Seq. We identified library preparation steps prior to DNA sequencing as the main source of error in this process. Studying three datasets, we show that accuracy indeed improves with sequencing depth. However, the rate of improvement as a function of sequence reads is generally slower than predicted by the binomial distribution. We therefore used the beta-binomial distribution to model the overdispersion. The overdispersion parameters we introduced depend explicitly on the number of reads, so that the resulting statistical uncertainty is consistent with the empirical observation that measurement accuracy increases with sequencing depth. The overdispersion parameters were determined by maximizing the likelihood. We show that our modified beta-binomial model had a lower false discovery rate than the binomial or the pure beta-binomial models.
Conclusion
We proposed a novel form of overdispersion that guarantees that accuracy improves with sequencing depth. We demonstrated that the new form provides a better fit to the data.
Background
To measure gene expression by RNA-Seq, RNA molecules are converted to DNA, sequenced, mapped to a gene database, and counted [1-3]. RNA-Seq thus provides a digital readout of gene expression levels. As the cost of next-generation sequencing drops rapidly, RNA-Seq may replace microarray methods in genome-wide surveys of gene expression. Compared to microarray technology, RNA-Seq has several advantages, including the ability to simultaneously detect mutations, discover alternative transcripts [4-6], and detect alternative splicing [7-10].
It is common to study changes in gene expression under a perturbation. The perturbation can be, for example, the deletion of a gene, which is important in characterizing the function of a new gene, or the stimulation of cells by a ligand, which is important in deciphering a pathway. Many experimental techniques, such as RNA interference [11], have been developed in recent years to make it easier to delete genes in mammalian cells. For an embryonic lethal gene in the mouse model, the Cre-lox system can be used to perform conditional gene knockout in a tissue-specific manner [12]. These gene deletion techniques facilitate the study of gene function for the large fraction of mammalian genes that remain to be characterized. Furthermore, two-sample comparisons apply when studying pathways through receptor stimulation, and such methods have become increasingly popular for examining signal transduction pathways holistically. In such studies, the emphasis is on the function of genes or pathways and not on the genetic background in which the study is carried out. Therefore, one repeats the experiments in the same cell line or in mice with identical genetic backgrounds, and expects to find no genetic variation. In this situation, differences in gene expression can be due to different handling of the biological samples (library preparation), as well as to statistical fluctuations from the finite number of tags mapped to each gene. The uncertainty in the outcome of RNA-Seq in repeated experiments with identical genetic background has yet to be characterized.
Such uncertainty affects the ability to affirm which genes are differentially expressed between a sample and a control. We focus on estimating the change in gene expression because the absolute amounts of RNA by themselves, as measured by the RPKM (reads per kilobase of gene length per million mapped reads) values [2], are not useful in most cases for biological interpretation. We hypothesize that experimental uncertainty is due primarily to the library preparation steps before sequencing, that it is intrinsic to the experimental protocol, and that it can therefore be characterized from repeated experiments. The expression difference is assessed through a p-value, which can be calculated from repeated experiments using a t-test. However, since in RNA-Seq the expression ratio derived from low sequence reads carries a larger error, it is valuable to statistically estimate the error due to low counts [13]. Binomial and beta-binomial [14,15] distributions can be used to characterize fluctuations at small tag counts.
A fundamental question in RNA-Seq analysis is how the accuracy of the measured gene expression change depends on the sequencing depth [16]. Here sequencing depth means the total number of sequenced reads, which can be increased by using more lanes. A binomial distribution is often used to compare two RNA-Seq experiments. In this model, the uncertainty approaches zero as 1/√N, where N is the tag count for the gene. Indeed, sequencing the same DNA in different sequencing lanes produces errors consistent with the binomial distribution [13,17]. However, comparisons of different samples have shown a dispersion larger than that given by the binomial distribution [18,19]. A beta-binomial distribution appropriately describes the overdispersion. This type of distribution has been used for the analysis of differential gene expression levels in SAGE libraries [20], and to model peptide count data with both within- and between-sample variation in label-free tandem mass spectrometry-based proteomics [21]. For this dispersion, the variance is a sum of two parts: the first part goes to zero as 1/N, and the second part is a constant independent of N. The constant is ideal for describing genetic variation. However, where genetic variation is not expected, it is inconsistent with the intuition that the accuracy of the measurement should improve with increasing depth of sequencing. In this paper, we provide empirical evidence that the variance goes to zero as the tag count N increases, but at a slower rate than 1/N. We aim to characterize these overdispersions gene by gene, using a pair of replicate experiments. We compared the results in a dataset in which multiple replicates were also available. We used a form of overdispersion based on a beta-binomial distribution, but one in which the overdispersion parameter depends explicitly on the number of tags. The form we used was suggested by a study of the standard deviation as a function of the tag count.
We demonstrate that the modified beta-binomial distribution improves performance.
Results
Normalization by proportion
The use of a proportion is a convenient way to compare two samples. Let n_{i} and m_{i} be the numbers of tags mapped to gene i in the two samples. The proportion is defined as p_{i} = n_{i}/(n_{i} + m_{i}). It is convenient to use the proportion because differences in proportion give rise to p-values using established statistics such as the binomial and beta-binomial distributions. A proportion is also a convenient component of a normalization procedure.
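As a minimal sketch of this definition (Python is used here for illustration only; the gene counts are hypothetical):

```python
import numpy as np

def proportion(n, m):
    """Per-gene proportion p_i = n_i / (n_i + m_i), where n_i and m_i are
    the tags mapped to gene i in the sample and in the control."""
    n = np.asarray(n, dtype=float)
    m = np.asarray(m, dtype=float)
    return n / (n + m)

# Hypothetical tag counts for three genes in sample vs control
p = proportion([30, 120, 5], [10, 80, 5])
```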
In order to detect differential expression in two samples, we must determine the ratio of counts in the two samples that corresponds to equal expression. One method, adopted in calculating the RPKM, assumes that the total number of tags sequenced, and equivalently the total amount of RNA, is constant. The problem with RPKM normalization is that the total is dominated by the few genes that receive the highest sequence reads; these genes may or may not remain constant under the two experimental conditions. One could also use housekeeping genes such as POLR2A (RNA polymerase II subunit A) or GAPDH in a normalization procedure. The problem with relying on housekeeping genes is that the normalization depends on the choice of genes. Since the number of housekeeping genes is small, this normalization procedure is subject to fluctuation due to the relatively small tag counts of these genes. Bullard et al. have shown good results with an upper-quartile normalization method [13].
The most conservative normalization procedure assumes that the maximum number of genes remains unchanged in the two experimental conditions. This corresponds to the maximum in the histogram of the ratio of tag counts n_{i}/m_{i}. The tag count proportion p_{i} is more convenient to use. The maximum in a histogram of p_{i} corresponds to the neutral proportion p_{n}, where the expression levels are assumed to be equal in the two samples. This maximum can be determined by fitting a Gaussian (or beta function) to the peak of the histogram (Figure 1). In this formulation, RPKM normalization corresponds to choosing p_{n} = N/(N + M), where N and M are the total numbers of tags mapped to genes in the experiment and control.
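A sketch of this peak-finding step on synthetic proportions (a Gaussian, one of the two fit options mentioned above, is fitted to the bins around the histogram peak; all numbers are made up):

```python
import numpy as np
from scipy.optimize import curve_fit

def peak_normalization(p, bins=50):
    """Estimate the neutral proportion p_n as the mode of the histogram of
    per-gene proportions, by fitting a Gaussian to the bins near the peak."""
    counts, edges = np.histogram(p, bins=bins, range=(0.0, 1.0))
    centers = 0.5 * (edges[:-1] + edges[1:])
    k = int(np.argmax(counts))
    lo, hi = max(0, k - 5), min(bins, k + 6)  # window around the peak
    gauss = lambda x, a, mu, s: a * np.exp(-0.5 * ((x - mu) / s) ** 2)
    popt, _ = curve_fit(gauss, centers[lo:hi], counts[lo:hi],
                        p0=(counts[k], centers[k], 0.05))
    return popt[1]  # mu, the estimated p_n

# Synthetic example: most genes unchanged, proportions centered near 0.52
rng = np.random.default_rng(0)
p = np.clip(rng.normal(0.52, 0.05, 5000), 0.01, 0.99)
p_n = peak_normalization(p)
```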
Figure 1. Histogram of proportions and peak-of-histogram normalization. The peak in the histogram corresponds to the largest density of genes. To determine the peak position, the histogram was fitted to a beta function. The blue curve shows the best fit, with the maximum at p_{n} = 0.527. This is to be compared with the proportion 0.525 corresponding to RPKM normalization.
This peak-of-histogram normalization is expected to be the most reasonable procedure for the Chiang dataset [22], which consists of wild type and TDP43-knockout samples (see Data and Methods for details). For this dataset, we expect the perturbation to global gene expression to be smaller than when comparing two different types of cells. Indeed, our peak-of-histogram normalization procedure resulted in a median base-2 logarithm of the expression ratio between wild type and knockout of 0.014, compared to 0.025 under RPKM normalization. This shows that peak normalization is comparable to, and perhaps slightly better than, RPKM normalization.
Normalization is performed according to the assumption that most of the genes do not change expression in the two experimental conditions. Although this convenient assumption is probably true in most cases, it has no ironclad biological justification.
The binomial distribution fits the variance from the same library but not from different libraries
We empirically studied errors in RNA-Seq experiments by examining the variance of replicated measurements. We first examined the fluctuation in reads mapped to a gene in duplicate experiments based on the same biological sample. The p-values of the differences were computed according to a binomial distribution by comparing to the neutral proportion p_{n} determined by peak normalization. For the same sample and the same library preparation sequenced in different lanes of the Illumina sequencer, the histogram of p-values is flat (Figure 2a). This indicates that the errors in different lanes containing samples from the same library are consistent with the binomial distribution. In contrast, the histogram of binomial p-values for two independent library preparations showed a clear overabundance of small p-values (Figure 2b). This demonstrates that the binomial distribution does not adequately describe the data: the dispersion of the random fluctuation is stronger than that given by the binomial distribution. We use the term library preparation to refer to an independent extraction of RNA, conversion to DNA, and PCR amplification of the DNA. Since the experiment and the control must come from separate library preparations, it is important to capture this overdispersion. The overabundance of small p-values for different libraries persisted when we used Fisher's exact test (data not shown). When we used the beta-binomial distribution to compute p-values for the different libraries, the histogram was flat, showing that the overdispersion is accounted for by the beta-binomial distribution. A Q-Q plot against either a binomial or beta-binomial distribution (data not shown) also indicated that the beta-binomial distribution fitted the data better.
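The per-gene binomial test against the neutral proportion can be sketched as follows (an illustration only, with hypothetical counts; the paper's own computation was done in R):

```python
from scipy.stats import binomtest  # SciPy >= 1.7

def binomial_pvalue(n_i, m_i, p_n):
    """Two-sided p-value that the split of tags (n_i, m_i) for one gene
    is consistent with the neutral proportion p_n."""
    return binomtest(n_i, n_i + m_i, p_n).pvalue

# A balanced split is consistent with p_n = 0.5 ...
p_same = binomial_pvalue(50, 50, 0.5)
# ... while a strongly skewed split is not.
p_diff = binomial_pvalue(90, 10, 0.5)
```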
Figure 2. Histograms of p-values of gene expression differences from duplicate experiments on the same biological sample. (a) Duplicate experiments from the same DNA library sequenced in different lanes; p-values were calculated from the binomial distribution. (Two datasets compared: Bullard SRR037457 vs SRR037458.) (b) When the binomial distribution is applied to the same biological sample prepared in two different libraries, more genes had small p-values than expected, erroneously predicting significantly differentially expressed genes where there should be none. (Two datasets compared: Bullard SRR037467 vs SRR037471.) (c) When the same two libraries are compared using the beta-binomial distribution, there is no longer a high density at small p-values. Peak-of-proportion normalization was used in these calculations. The histograms were drawn using the R package Bumclass [27].
Errors decreased with sequencing depth
We first addressed the uncertainty in RNA-Seq measurements, and how it relates to sequencing depth, empirically from repeated measurements. Specifically, from replicates of the biological sample, we calculated the standard deviation of the proportion. If the proportion followed the binomial distribution, we would expect Var(p_{i}) = p_{n}(1 - p_{n})/(n_{i} + m_{i}), where n_{i} and m_{i} are the tags mapped to gene i in two duplicate experiments on the sample (possibly from different libraries), and p_{n} is the normalization proportion. Figure 3 plots the variance of the proportion, averaged over pairs of duplicate experiments (Table 1), as a function of the mean tag count n_{i} + m_{i} for the three sets of experimental data. The variance of the proportion continued to decrease at large n_{i} + m_{i}, with no sign of saturation. However, the rate of decrease with the tag count depended on the dataset and was slower than that given by the binomial distribution.
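The binomial expectation Var(p) = p_{n}(1 - p_{n})/N quoted above can be checked by simulation (a sketch with synthetic counts, not the study data):

```python
import numpy as np

rng = np.random.default_rng(1)
p_n, N = 0.5, 1000                       # neutral proportion, tags per gene pair
n = rng.binomial(N, p_n, size=20000)     # simulated sample counts for one gene
p = n / N                                # proportions across simulated replicates
empirical_var = p.var()
binomial_var = p_n * (1 - p_n) / N       # p_n(1 - p_n) / N
```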
Figure 3. The variance of the proportion versus the mean tag count, on a base-10 log scale. The variances of the proportion were computed from replicates of the same biological samples. (a) Caltech dataset; (b) Chiang dataset; (c) Bullard dataset. Each point represents a gene, averaged over replicates (see Table 2 for the number of replicates in each dataset). The red line has a slope of −1. The black line is a fit to the data for mean (x-axis) larger than 2 (tag count greater than 100).
Table 1. Three datasets
Table 2. Two estimates of γ from the three datasets
Modified beta-binomial distribution
We used a beta-binomial distribution to describe the overdispersion in the data, as shown in Figure 2b. However, in the beta-binomial distribution the standard error approaches a constant as the mean tag count becomes very large, whereas empirically the standard error continues to decrease at large tag counts (Figure 3). We therefore made the following assumption about the form of the θ parameter in the beta-binomial distribution (see Methods for details). Let n_{i} and m_{i} be the numbers of tags mapped to gene i. We let θ_{i} depend explicitly on the tag counts:

θ_{i} = D_{i}/(n_{i} + m_{i})^{γ}.    (1)
Under this assumption, for 0 < γ < 1, the asymptotic form of the variance of the proportion at large tag count N_{i} = n_{i} + m_{i} according to the beta-binomial distribution is p_{n}(1 - p_{n})D_{i}/N_{i}^{γ}. The variance of the proportion under the modified beta-binomial distribution therefore does approach zero at large N_{i}, but at a slower rate than under the binomial distribution.
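This asymptotic behavior can be verified numerically. The sketch below assumes the common parameterization p = α/(α + β), θ = 1/(α + β), under which the beta-binomial variance of the proportion is p(1 - p)(1 + Nθ)/(N(1 + θ)); the parameter values are arbitrary:

```python
def bb_var_proportion(p, N, D, gamma):
    """Variance of the proportion under the beta-binomial distribution with
    the modified overdispersion theta = D / N**gamma (parameterization
    assumed: p = alpha/(alpha+beta), theta = 1/(alpha+beta))."""
    theta = D / N ** gamma
    return p * (1 - p) * (1 + N * theta) / (N * (1 + theta))

# At large N the variance approaches p(1-p) * D / N**gamma:
# still vanishing, but more slowly than the binomial p(1-p)/N.
N, p, D, gamma = 10**6, 0.5, 0.1, 0.5
var_bb = bb_var_proportion(p, N, D, gamma)
var_binomial = p * (1 - p) / N
var_asymptotic = p * (1 - p) * D / N ** gamma
```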
Determining the parameters γ and D_{i}
Although γ can be estimated from the slope and intercept of the log-scale plot of variance versus mean tag count (Figure 3), this estimate required multiple experiments and had low accuracy due to scatter in the data. For a better estimate of the parameters γ and D_{i} in Eq. (1), we used maximum-likelihood estimation (MLE). In this approach, the likelihood was derived from the beta-binomial distribution of the tag counts n_{i} and m_{i} for gene i, and summed over all genes and over all pairs of duplicate experiments. The overdispersion parameters θ_{i} were given by Eq. (1), and the parameter γ and the parameters D_{i} for each gene were chosen to maximize the likelihood. The plots in Figure 4 were obtained by fully optimizing the likelihood (see Data and Methods) with respect to D_{i} for each γ, and plotting the optimized likelihood values against γ. Table 2 compares the γ values from the two estimates. The estimated γ depended on the data. We computed γ for the three datasets; the values ranged from 0.2 to 1.0 (Figure 4). These estimates were consistent with those from the standard error (Figure 3).
Figure 4. Beta-binomial likelihood as a function of the parameter γ. (a) Caltech dataset; (b) Chiang dataset; (c) Bullard dataset. The vertical lines mark the positions of the maxima.
Comparison of beta-binomial and binomial distributions
Figure 5 shows a comparison of the false discovery rates (FDR) [23] and receiver operating characteristics (ROC) [24] for genes deemed differentially expressed by the binomial and beta-binomial distributions. For the Bullard dataset, the results were comparable for the two distributions; for the Caltech and Chiang datasets, the beta-binomial distribution was superior (for dataset details, see Data and Methods).
Figure 5. False discovery rate (FDR) and receiver operating characteristic (ROC) for the three datasets. (a) and (b) Caltech dataset; (c) and (d) Chiang dataset; (e) and (f) Bullard dataset. The three panels on the left show the FDR, plotted (y-axis) against the number of most significantly differentially expressed genes (x-axis). The three panels on the right show the ROC. Bi denotes the binomial distribution; BB denotes the beta-binomial distribution. The line for BB γ = 0 was obtained by setting γ = 0 and optimizing D_{i}; it corresponds to the standard beta-binomial distribution. In (b), the line for BB γ = 0 overlaps with the line for BB γ = 0.3.
We took the top 300 genes deemed most significantly differentially expressed by a t-test, and by the binomial and beta-binomial distributions, and overlaid them in a plot of fold change versus average tag count (see Figure 6 and Figure 7). Genes identified as significantly differentially expressed by the binomial distribution tended to have large tag counts, whereas many genes identified by the t-test had small tag counts. Some genes identified by the binomial distribution (marked by a triangle only) were not identified by the beta-binomial distribution, even though they had higher fold changes than other genes with similar tag counts. The large fluctuations of these genes are evident because they were also not called significantly differentially expressed by the t-test.
Figure 6. Gene expression fold change in the TDP43 deletion vs wild type (Chiang dataset). Gene expression fold change is plotted against the average tag count (x-axis in base-10 log; y-axis in base-2 log). The 300 most significantly differentially expressed genes by p-value are depicted by squares (t-test), diamonds (beta-binomial distribution), and triangles (binomial distribution). Black circles represent genes not among the top 300 for any method. The green and purple boxes and lines indicate the medians for RPKM and peak-of-proportion normalization. The data are averages of three deletion and two wild type experiments.
Figure 7. Venn diagram comparison. The overlap of the top 300 genes identified by the beta-binomial distribution (bb), the binomial distribution (bi), and the t-test (t) is shown in a Venn diagram. The number at the lower right of the rectangle indicates the total number of transcripts detected.
Conclusions
We have investigated the error of RNA-Seq gene expression measurements from repeated experiments. We have shown that sequence reads from the same biological sample sequenced in different lanes follow a binomial distribution, and that the library preparation steps prior to sequencing introduce larger variations in repeated experiments on the same biological specimen. We showed that the accuracy of repeated measurements improved with sequencing depth; however, the improvement with tag count was generally slower than predicted by the binomial distribution. We used a beta-binomial distribution to fit the inter-library overdispersion and introduced a parameterization of the overdispersion parameter that is consistent with the intuition that measurement accuracy should increase with sequencing depth. We optimized the overdispersion parameters using maximum-likelihood estimation, and demonstrated better performance, in the form of a lower FDR, using our modified beta-binomial model.
Using the proportion of counts to estimate the gene expression difference has advantages over using RPKM expression values. It has been shown recently that, contrary to naive expectation, the numbers of tags mapped to different positions within the same gene are highly non-uniform [18]. The Poisson rate at different positions can fluctuate by a few hundred fold. However, the pattern of variable rates along the gene is highly reproducible, even when comparing experiments performed on different tissues, and such rates depend on the nucleotide composition of the local sequence. The variable Poisson rate can be attributed mainly to the hexamer primers used in converting RNA to DNA [19]. These data suggest that uneven PCR amplification could be the cause of the observed overdispersion. When estimating gene expression differences using a proportion, the highly variable Poisson rates do not need to be estimated; they enter only indirectly through the dispersion. Not having to estimate these highly variable rates is therefore an advantage.
When the value of γ in Eq. (1) is zero, our model reverts to the beta-binomial distribution. Interestingly, the γ values estimated from the different datasets were not the same. This phenomenon is similar to the GC bias in sequencing data, which also depends on the experiment [25]. Therefore γ may be influenced by the experimental protocol used. The Caltech and Chiang datasets had similar values of γ; in these two datasets, the D_{i} values were similar for the same gene (data not shown). This is consistent with the previous finding [18] that the variability of the Poisson rates is similar in different experiments. It would be interesting to study how D_{i} may depend on the DNA sequence of the gene. Another possible parameterization of the overdispersion is by position within a gene, using Eq. (1). These possibilities will be explored in the future.
Methods
Peak of proportion histogram normalization
The normalization procedure using the peak of the histogram of proportion assumes that most genes remain unchanged in the two conditions being compared. In this normalization procedure, we fitted the highest peak in the histogram of proportion to a beta function. The maximum of the beta function determines the normalization proportion p_{n}.
In RPKM normalization, we first count the total number of tags mapped to any gene in the RNA-Seq experiment. The number of tags mapped to a particular gene is divided by the total number of tags sequenced (in units of millions of tags), and then divided by the length of the gene (in units of thousands of nucleotides).
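As a worked example of this arithmetic (the counts and gene length below are made up for illustration):

```python
def rpkm(gene_tags, total_tags, gene_length_nt):
    """RPKM: tags mapped to the gene, divided by total mapped tags in
    millions, divided by gene length in kilobases (hypothetical inputs)."""
    return gene_tags / (total_tags / 1e6) / (gene_length_nt / 1e3)

# 500 tags on a 2,000-nt gene in a run of 10 million mapped tags
value = rpkm(500, 10_000_000, 2000)
```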
Datasets used
The three datasets we used are listed in Table 1.
The Chiang dataset consisted of five independent libraries related to deletion of the TDP43 gene in the mouse: the data were derived from three independent clones of TDP43 knockout embryonic stem (ES) cells and two independent clones of control ES cells. Raw reads were mapped to the University of California Santa Cruz mm9 genome library by efficient large-scale alignment of nucleotide databases. A single gene deletion is an ideal test case for normalization procedures under the assumption that most genes do not change.
The Caltech dataset consisted of two cell lines: GM12878 (normal blood) and H1-hESC (embryonic stem cells), each with four libraries made independently from the same biological sample. Raw Illumina reads from the 2x75 datasets (RawData files on the download page, fasta format) were run through Bowtie, version 0.9.8.1, allowing up to 2 mismatches. The resulting mappings were stored (RawData2 files, Bowtie format), with up to ten matches per read to the genome, spiked controls, and UCSC knownGene splice junctions.
The Bullard dataset consisted of human brain reference RNA and human universal reference RNA as two library preparations. We used Bowtie, version 0.12.7, to align the reads to the genome (H. sapiens, NCBI 37.1 assembly). The Bowtie command used to implement this mapping strategy was ./bowtie -a -v 2 -t -m 1 --best --strata h_sapiens_37_asm.
Maximum-likelihood estimation (MLE)
Let n_{ip} and m_{ip} be the tags mapped to the ith gene in the pth pair of experiment and control, respectively. The likelihood function according to the beta-binomial distribution is

L = Π_{i} Π_{p} C(n_{ip} + m_{ip}, n_{ip}) B(n_{ip} + α_{ip}, m_{ip} + β_{ip})/B(α_{ip}, β_{ip}),    (2)

where α_{ip} and β_{ip} are the two parameters of the beta-binomial distribution, C(N, n) is the binomial coefficient, and B is the beta function. This is equivalent to using instead the parameters p_{ip} = α_{ip}/(α_{ip} + β_{ip}) and θ_{ip} = 1/(α_{ip} + β_{ip}). The proportion that maximizes the likelihood function can be obtained analytically. We further assume that θ_{ip} is independent of p, and use Eq. (1) to reparameterize θ_{i} in terms of the parameters D_{i} and γ. The parameters were determined by maximizing the likelihood.
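The per-gene beta-binomial log-likelihood under this reparameterization can be written down directly. The sketch below assumes α = p/θ and β = (1 - p)/θ, and checks that a vanishing θ recovers the binomial log-likelihood:

```python
import numpy as np
from scipy.special import betaln, gammaln

def bb_loglik(n, m, p, theta):
    """Beta-binomial log-likelihood of tag counts (n, m) for one gene,
    with alpha = p/theta and beta = (1 - p)/theta."""
    a, b = p / theta, (1 - p) / theta
    log_choose = gammaln(n + m + 1) - gammaln(n + 1) - gammaln(m + 1)
    return log_choose + betaln(n + a, m + b) - betaln(a, b)

# Sanity check: as theta -> 0 the beta-binomial collapses to the binomial.
ll_bb = bb_loglik(3, 7, 0.3, 1e-6)
ll_binom = (gammaln(11) - gammaln(4) - gammaln(8)
            + 3 * np.log(0.3) + 7 * np.log(0.7))
```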
Likelihood ratio test
According to the likelihood ratio test, the statistic 2[log L(p_{i}) - log L(p_{n})] follows a Χ^{2} distribution with one degree of freedom, where p_{i} is the proportion for gene i and p_{n} is the normalized proportion corresponding to no change in gene expression. This is the most convenient way to compute the p-value.
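The p-value computation from the likelihood-ratio statistic is then a one-liner (a sketch; one degree of freedom is assumed for the single proportion parameter, and the log-likelihood values below are arbitrary):

```python
from scipy.stats import chi2

def lrt_pvalue(loglik_alt, loglik_null, df=1):
    """p-value from the likelihood-ratio statistic
    2 * (log L(p_i) - log L(p_n)), referred to a chi-squared distribution."""
    stat = 2.0 * (loglik_alt - loglik_null)
    return chi2.sf(stat, df)

# No likelihood gain gives p = 1; a gain of ~1.92 sits at the 0.05 threshold.
p_null = lrt_pvalue(-5.0, -5.0)
p_edge = lrt_pvalue(-100.0, -101.92073)
```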
FDR and ROC
To determine the false discovery rate (FDR), we assumed that any gene deemed significantly differentially expressed at a given p-value when comparing two replicates sequenced from the same biological sample was a false discovery. We computed the FDR by dividing the number of falsely discovered genes at a given p-value by the number of significantly differentially expressed genes found when comparing the sample to the control at the same p-value.
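This replicate-based FDR estimate can be sketched as follows (the p-value lists are hypothetical):

```python
import numpy as np

def empirical_fdr(null_pvalues, obs_pvalues, cutoff):
    """FDR estimate described above: genes called at the cutoff in a
    replicate-vs-replicate comparison (all assumed false discoveries),
    divided by genes called at the same cutoff in sample-vs-control."""
    n_false = int(np.sum(np.asarray(null_pvalues) <= cutoff))
    n_called = int(np.sum(np.asarray(obs_pvalues) <= cutoff))
    return n_false / n_called if n_called else 0.0

fdr = empirical_fdr([0.01, 0.2, 0.5], [0.001, 0.01, 0.02, 0.5], cutoff=0.05)
```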
To determine the receiver operating characteristic (ROC), we first established a gold standard. Approximately one thousand genes in the Bullard dataset were previously assayed by RT-PCR in four independent experiments [26]. Differentially expressed genes were determined by t-test by Bullard et al. [13]; we used their results to draw the ROC curve comparing the binomial and beta-binomial distributions for the Bullard dataset. For the Caltech and Chiang datasets, we assumed that the t-test provided a gold standard. To reduce errors at small tag counts, we required a gene to have more than 20 mapped tags. For the Caltech data, the Benjamini-Hochberg adjustment was applied to the p-values calculated by the t-test, using a cutoff of 0.05 [23]. We could not use this p-value adjustment on the Chiang dataset, as far fewer genes had detectable differential expression between the wild type and knockout samples; we therefore applied a cutoff of 0.05 to the t-test p-value and required a fold change larger than two.
Computing the fold change
We related the fold change FC_{i} in the gene expression level to the optimized proportion p_{i} and obtained, by definition, FC_{i} = p_{i}/(1 - p_{i}). This ratio has to be calibrated against the normalization of the entire experiment. We define p_{n} as no change, so FC_{i} = [p_{i}/(1 - p_{i})]/[p_{n}/(1 - p_{n})], where p_{n} is the normalized proportion determined over the entire dataset. Infinite values of FC_{i} are avoided by adding a pseudocount to n_{i} and m_{i} so that 0 < p_{i} < 1.
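The calibrated fold change with a pseudocount can be sketched as follows (the pseudocount of 0.5 is an assumed choice for illustration, not specified in the text):

```python
def fold_change(n_i, m_i, p_n, pseudo=0.5):
    """Fold change FC_i = [p_i/(1-p_i)] / [p_n/(1-p_n)]; a pseudocount keeps
    0 < p_i < 1 so the fold change stays finite even when m_i = 0."""
    p_i = (n_i + pseudo) / (n_i + m_i + 2 * pseudo)
    odds = lambda q: q / (1 - q)
    return odds(p_i) / odds(p_n)

fc_equal = fold_change(10, 10, 0.5)   # balanced counts -> no change
fc_up = fold_change(20, 10, 0.5)      # roughly twofold up
```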
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
SL and GC designed the studies. GC wrote the Perl and R programs and performed data analysis and modeling. HL and YL assisted in data analysis and modeling. SL, GC, XH, JL, PM, and YJ developed the statistical model. SL and GC wrote the manuscript.
Acknowledgements
This work was partially supported by the NIH/NCI grant 5K25CA123344. HL was supported by a training fellowship from the Keck Center for Quantitative Biomedical Sciences of the Gulf Coast Consortia, on the Computational Cancer Biology Training Program from the Cancer Prevention and Research Institute of Texas (CPRIT No. RP101489).
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 13, 2012: Selected articles from The 8th Annual Biotechnology and Bioinformatics Symposium (BIOT2011). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/13/S13/S1
References

Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR: Accurate whole human genome sequencing using reversible terminator chemistry.
Nature 2008, 456:53-59.

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq.
Nat Methods 2008, 5:621-628.

Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, Penkett CJ, Rogers J, Bahler J: Dynamic repertoire of a eukaryotic transcriptome surveyed at singlenucleotide resolution.
Nature 2008, 453:1239-1243.

Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNASeq.
Bioinformatics 2009, 25:1105-1111.

Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van BM, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNASeq reveals unannotated transcripts and isoform switching during cell differentiation.
Nat Biotechnol 2010, 28:511-515.

Li M, Wang IX, Li Y, Bruzel A, Richards AL, Toung JM, Cheung VG: Widespread RNA and DNA sequence differences in the human transcriptome.
Science 2011, 333:53-58.

Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB: Alternative isoform regulation in human tissue transcriptomes.
Nature 2008, 456:470-476.

Katz Y, Wang ET, Airoldi EM, Burge CB: Analysis and design of RNA sequencing experiments for identifying isoform regulation.
Nat Methods 2010, 7:1009-1015.

‘t Hoen PA, Ariyurek Y, Thygesen HH, Vreugdenhil E, Vossen RH, de Menezes RX, Boer JM, van Ommen GJ, den Dunnen JT: Deep sequencingbased expression analysis shows major advances in robustness, resolution and interlab portability over five microarray platforms.
Nucleic Acids Res 2008, 36:e141.

Hannon GJ: RNA interference.
Nature 2002, 418:244-251.

Sauer B: Inducible gene targeting in mice using the Cre/lox system.
Methods 1998, 14:381-392.

Bullard JH, Purdom E, Hansen KD, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNASeq experiments.
BMC Bioinformatics 2010, 11:94.

Skellam JG: A probability distribution derived from the binomial distribution by regarding the probability of success as variable between the sets of trials.

Lee J, Mueller P, Liang S, Cai G, Ji Y: On Differential Gene Expression Using RNASeq Data.

Toung JM, Morley M, Li M, Cheung VG: RNAsequence analysis of human Bcells.
Genome Res 2011, 21:991-998.

Wang L, Feng Z, Wang X, Zhang X: DEGseq: an R package for identifying differentially expressed genes from RNAseq data.
Bioinformatics 2010, 26:136-138.

Li J, Jiang H, Wong WH: Modeling nonuniformity in shortread rates in RNASeq data.
Genome Biol 2010, 11:R50.

Hansen KD, Brenner SE, Dudoit S: Biases in Illumina transcriptome sequencing caused by random hexamer priming.
Nucleic Acids Res 2010, 38:e131.

Baggerly KA, Deng L, Morris JS, Aldaz CM: Differential expression in SAGE: accounting for normal betweenlibrary variation.
Bioinformatics 2003, 19:1477-1483.

Pham T, Piersma SR, Warmoes M, Jimenez CR: On the betabinomial model for analysis of spectral count data in labelfree tandem mass spectrometrybased proteomics.
Bioinformatics 2010, 26:363-369.

Chiang PM, Ling J, Jeong YH, Price DL, Aja SM, Wong P: Deletion of TDP43 downregulates Tbc1d1, a gene linked to obesity, and alters body fat metabolism.
Proc Natl Acad Sci U S A 2010, 107:16320-16324.

Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing.
J R Stat Soc Ser B 1995, 57:289-300.

Fawcett T: An introduction to ROC analysis.
Pattern Recognition Letters 2006, 27:861-874.

Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultrashort read data sets from highthroughput DNA sequencing.
Nucleic Acids Res 2008, 36:e105.

Canales R, L Y, Willey J, Austermiller B, Barbacioru C, Boysen C, Hunkapiller K, Jensen R, Knight CR, Lee K, Ma Y, Maqsodi B, Papallo A, Peters E, Poulter K, Ruppel P, Samaha R, Shi L, Yang W, Zhang L, Goodsaid FM: Evaluation of DNA microarray results with quantitative gene expression platforms.
Nat Biotechnol 2006, 24:1115-1122.

OOMPA package [http://genome.ucsc.edu/cgibin/hgTrackUi?db=hg18\&g=wgEncodeCaltechRnaSeq]

Wold/Caltech lab [http://genome.ucsc.edu/cgibin/hgTrackUi?db=hg18\&g=wgEncodeCaltechRnaSeq]