Abstract
Background
RNAseq, a nextgeneration sequencing based method for transcriptome analysis, is rapidly emerging as the method of choice for comprehensive transcript abundance estimation. The accuracy of RNAseq can be highly impacted by the purity of samples. A prominent, outstanding problem in RNAseq is how to estimate transcript abundances in heterogeneous tissues, where a sample is composed of more than one cell type and the inhomogeneity can substantially confound the transcript abundance estimation of each individual cell type. Although experimental methods have been proposed to dissect multiple distinct cell types, computationally "deconvoluting" heterogeneous tissues provides an attractive alternative, since it keeps the tissue sample as well as the subsequent molecular content yield intact.
Results
Here we propose a probabilistic modelbased approach, Transcript Estimation from Mixed Tissue samples (TEMT), to estimate the transcript abundances of each cell type of interest from RNAseq data of heterogeneous tissue samples. TEMT incorporates positional and sequencespecific biases, and its online EM algorithm only requires a runtime proportional to the data size and a small constant memory. We test the proposed method on both simulation data and recently released ENCODE data, and show that TEMT significantly outperforms current stateoftheart methods that do not take tissue heterogeneity into account. Currently, TEMT only resolves the tissue heterogeneity resulting from two cell types, but it can be extended to handle tissue heterogeneity resulting from multi cell types. TEMT is written in python, and is freely available at https://github.com/ucicbcl/TEMT webcite.
Conclusions
The probabilistic modelbased approach proposed here provides a new method for analyzing RNAseq data from heterogeneous tissue samples. By applying the method to both simulation data and ENCODE data, we show that explicitly accounting for tissue heterogeneity can significantly improve the accuracy of transcript abundance estimation.
Background
The rapidly advancing nextgeneration sequencing based transcriptome analysis tool, RNAseq, provides a comprehensive and accurate method for analyzing the entire RNA components of the transcriptome [1]. The efficiency and sensitivity of RNAseq make it a primary method for detecting alternativelyspliced forms and estimating their abundances [2,3]. However, estimating transcript abundances in heterogeneous tissues by RNAseq remains an unsolved, outstanding problem because of the confounding effect from different cell types [4]. Many tissue samples from native environments are heterogeneous. For example, tumor samples are usually composed of tumor cells and surrounding normal cells [5]. Therefore, reads from an RNAseq experiment of tumor samples will consist of contributions from both tumor and normal cells. Additionally, tumor tissues themselves are often heterogeneous, consisting of different subclones (e.g. breast cancer subtypes [6]), leading to even more complicated tissue environments.
Experimental methods have been proposed to address issues arising from contamination of different cell types, such as lasercapture microdissection [7], which allows dissection of morphologically distinguishable cell types. The mRNA content yield by this technology is consequently lowered, and needs to be compensated for, usually by molecular amplification. However, the nonlinearity induced by amplifying mRNA [8] has its own problems, and can make the expression profiles of distinct cell types less distinguishable, weakening the sensitivity of RNAseq technology. Other experimental approaches, including cell purification and enrichment, are comparatively expensive and laborious [9]. Therefore developing alternative in silico approaches to resolving the tissue heterogeneity problem, especially in cancer research, remains a major problem in RNAseq analysis [10].
Research in computational approaches to resolving the tissue heterogeneity problem of different biotechnologies has a fairly long history [1114]. The first attempt to computationally microdissect heterogeneous tissues for microarray expression data was based on a linear model [11], which estimated both celltype proportion and gene expression level. Prior information regarding "marker genes", which are genes uniquely expressed in each celltype, was incorporated into the linear model to identify distinct cell types. The linear model was extended with Bayesian prior densities of celltype proportions [13], and a posterior sampling approach was then constructed for celltypespecific expression profiling. A statistical testing method [14] was proposed for single nucleotide polymorphism (SNP) array based copy number alterations analysis from heterogeneous tissue samples. In this method, Bayesian differentiation between hemizygous deletion and homozygous deletion were used to infer the underlying normal cell proportion and copy number profiles of both normal cells and tumor cells. One common feature shared by these methods is that they all adopted probabilistic models, not only allowing prior information about different cell types to be smoothly incorporated into the models, but also taking advantages of the flexibility of probabilistic model to capture specific aspects of each data type.
To the best of our knowledge, no computational approaches have been proposed to resolve the tissue heterogeneity problem from RNAseq data in a probabilistic fashion. Typically, researchers apply transcriptional profiling tools designed for homogeneous tissue samples directly to RNAseq data from heterogeneous tissue samples. Subsequent estimation results are interpreted as transcriptional profiling of a particular single cell type of interest. Therefore, we ask whether it is possible to estimate trancriptive abundances of individual cell types from RNAseq of heterogeneous tissues, by decoupling the contributions from multiple cell types. We propose a probabilistic modelbased approach, Transcript Estimation from Mixed Tissue samples (TEMT) to address this question. Currently, TEMT requires two sets of singleend RNAseq reads. One read set is from a heterogeneous tissue sample composed of two cell types, while the other is from a pure tissue sample composed of one of the two cell types. TEMT incorporates prior information of cell type proportion and can calculate probabilities of RNAseq reads sampled from each cell type. Because TEMT implements an online EM algorithm [15], it has a time requirement proportional to the data size and a constant memory requirement. To further improve the estimation accuracy, TEMT also implements a bias module, which incorporates both positional bias [1618] and sequencespecific bias [19,20].
To assess the performance of TEMT, we analyzed a series of both simulation and real data from ENCODE [21], and compared the transcript relative abundances estimation from TEMT to those obtained from other methods that do not take the tissue heterogeneity into account. Our results show that explicitly accounting for tissue heterogeneity can significantly improve transcript abundance estimation accuracy.
Methods
In this section, we first introduce the generative mixture model of TEMT. Combined with cell type proportion as prior information, we propose a maximum a posteriori estimation approach for finding model parameters. Next, we explain how to incorporate a positional and sequencespecific bias module into the model. Finally, we introduce an online EM algorithm for parameter estimation, reducing the time complexity to be proportional to the data size and the space complexity to be constant.
Model
Basic definition
We focus on transcript abundance estimation. Denote as a set of reference transcripts, which we assume is known and complete. Let denote the length of transcript t in the set with , where T is the total number of transcripts in the reference set. Suppose we are interested in transcriptome analysis in two cell types: a and b. Let and denote the relative transcript abundance of transcript t in cell type a and b, respectively, with . We assume are properly normalized such that and .
We assume RNAseq reads are available in two samples: one consisting of cells of only type a, which we call the "pure sample", and the other consisting of cells of both type a and b with percentage from cell type a and from cell type b, which we call the "mixed sample." In the cancer transcriptome analysis, cell type a can represent normal cells as it is usually easy to obtain a pure tissue sample, while cell type b can represent tumor cells as most tumor tissue samples are contaminated by normal cells.
Because the pure sample consists of only cell type a, its relative transcript abundance is described by for all t. However, the relative abundance of transcript t within the mixed sample is a weighted sum of the transcript abundance of both cell type a and b
Denote the read set from the pure sample by and the read set from the mixed sample by. Our goal is to estimate the relative abundance of each transcript in the reference set T from the RNAseq read data and in both cell type a and b
Alignment representation
We first map reads to the reference transcript set and convert the raw read data into a corresponding alignment representation. Denote the alignment representation of the read set by =, where if read i from aligns to transcript t and 0 otherwise, and N^{p }is the total number of reads in read set . The alignment representation is similarly defined for read set from the mixed sample. Note that one read might map to multiple transcripts due to alternative splicing, sequence similarity shared by homologous genes, or other reasons. As a result, the summation of over all transcripts may be bigger than 1 for some i. These "ambiguous reads" introduce a major source of uncertainty into transcript abundance estimation.
Generative model
We model the sequencing of reads as a sampling process, randomly chooses a transcript t from the reference transcript set according to its relative abundance and effective length, and then generates a read from a random location of the chosen transcript. Under this model, the probability of a read originating from transcript t is
with s being either p for the pure sample or m for the mixed sample. Here, is the effective length of transcript t, which quantifies the number of positions at which a read can start within transcript t. Different methods have been proposed to model the effective length [19,22]. In TEMT, the effective length is modeled with consideration to the length distribution of RNAseq fragments [19]
We assume the fragment length x has a normal distribution with mean µ and variance , and is the normal probability density function of. By renormalizing , we obtain the discrete distribution of all possible fragment lengths. The effective length is then the expectation of the number of positions a read can start within transcript t, based on the discrete distribution of fragment length.
Suppose a read is generated uniformly from each location covered by the effective length of each transcript. Then the probability of observing read i as represented by its alignment map is
for s = p or m.
Assume each read is generated independently in both the pure and the mixed samples. The likelihood of observing the read set from the pure sample and from the mixed sample is then described by
We are interested in estimating the relative transcript abundances set , but since it can be uniquely defined by the read sampling probability set ,
We can directly estimate the read sampling probability set from the likelihood function Equation (5) instead. Note that, again for all t as it is the parameter of pure sample, but unlike the linear form in Equation (1), in terms of is given as a nonlinear form
Where, the factor induce the nonlinearity. But due to the averaging effect of the large number of transcripts, practically lies within 1 ± 0:05. So we approximate with the linear form
As it brings computational convenience in the following learning step.
Finally, we define
as the parameters of our model. The likelihood in Equation (5) can be then expressed as
Maximum a posteriori estimation
Several analysis have noticed the identifiability problem [12,13] in estimating cell type specific expression in heterogeneous tissue samples. Ideally, if the proportion information for some cell types is missing, we can then pool these cell types as one type, making the expression of each individual cell type inside unidentifiable. Previously, prior constraints have been used to resolve the problem [12,13]. In our model, the prior knowledge of cell type proportions is combined with the model likelihood, and we subsequently use maximum a posteriori (MAP) estimation to find the optimal parameters.
Specifically, we place a Beta distribution as the prior for cell proportions of type a and type b. The parameter quantify the location and sharpness of the prior. Practically, we found setting 10 times as the data size gave a good convergence rate and accuracy. Combining the prior with the likelihood given in Equation (11), the posterior distribution of the model is proportional to
Incorporating sequencing bias
Both positional [1618] and sequencespecific [19,20] sequencing biases have been observed in next generation sequencing data. These biases mainly result from nonuniformly distributed cDNA fragments during the RNAseq library preparation [20]. Under positional bias, reads positioning is not uniformly distributed across the effective length of the target transcript, but preferentially distributed around either the 5' end or the 3' end of the target transcript. Under sequencespecific bias, the sequences near the two ends of the fragments affect their probability to be sequenced. To account for these nonuniformity effects during transcript abundance estimation, we incorporate the bias module of [19] into our model.
In order to further describe the local alignment context, we define another two sets of variables. Specifically, for read i from either read set or , we denote as the starting position of the alignment within transcript t relative to the 5' end of the strand. We also denote , where , as the local sequence of transcript t with length L and centered at .Then we define the bias weight as
for s=p or m.
This bias weight is essentially the ratio of the probability of observing and under the bias model to the probability under the uniform model. If no bias exists, the weight reduces to 1. The bias reweighted Equation (4) is then:
To calculate the bias weight, we use the bin method and Markov chain for positional bias and sequencespecific bias respectively. Complete details can be found in the Supplementary (Additional file 1). The final unnormalized posterior distribution of the model is then described as
Additional file 1. Supplementary. Complete details for calculating positional and sequencespecific bias weights.
Format: PDF Size: 41KB Download file
This file can be viewed with: Adobe Acrobat Reader
Where and are the bias weights computed based on read set and . The directed graphical model of TEMT is shown in Figure 1. The estimated parameters are given by
Figure 1. The representative graphical model of TEMT.
Online EM algorithm for learning
We solve the maximum a posteriori problem in Equation (16) using the ExpectationMaximization (EM) [23] framework. For each read i from read set of pure sample, we denote the latent variable of the transcript alignment representation as , where if read i aligns to transcript t and 0 otherwise. But now , which means only one , indicating read i is actually originating from transcript t. Similarly, for each read i from read set of mixed sample, we denote the latent variable of the transcript alignment representation as , where if read i aligns to transcript t and is originating from cell type a within the mixed sample, and 0 otherwise. or 0 is similar defined for cell type b. Thus means read i is actually originating from only one transcript, and either from cell type a or b within the mixed sample. We also define the auxiliary variable , and as the conditional probability weight of each latent variable , and conditional on model parameters Θ and the observed read alignment representations . Then based on Jensen's inequality [24], the complete posterior distribution, which is also the lower bound of Equation (15) can be written as
In which C is a normalizing constant and the equality holds only if the conditional probabilities , , are the true posterior distributions of latent variables .
The EM framework maximizes Equation (17) by iteratively applying the expectation step and the maximization step to update both the conditional probabilities , , and model parameters Θ until convergence. The expectation step of typical batch EM algorithm has to fetch all the data points into memory, and calculates the conditional probabilities based on the average of all the data points. While this batch method guarantee's the loglikelihood function to monotonically increase, it also induces inefficiency in both time and space complexity. Considering the highthroughput nature of nextgeneration sequencing technology as well as its huge data size, we implemented the EM algorithm in an online fashion [15] to both lower the memory requirement and boost the convergence rate.
The main difference between the batch EM and the online EM is in the Estep. The Estep of the online EM algorithm first calculates the conditional probabilities of only one new data point, and then updates the conditional probabilities of all the current data points by interpolating between the conditional probabilities of all the previous data points and the conditional probabilities of the new data point, with a forgetting factor σ controlling the convergence rate.
It is shown in [15] that with the constraint 0.5 < σ ≤ 1, the online EM algorithm is asymptotically equivalent to stochastic gradient ascent, and is guaranteed to converge to the maximum likelihood estimator, which is extended to the maximum a posteriori estimator in our model.
Specifically, the online EM updates in our model is given by
Estep
In Equation (1820), we compute the conditional probabilities , , of just one new read i + 1 based on previous parameter estimation , , ; In Equation (2123), we compute the new conditional probabilities average , , by interpolating between the previous conditional probabilities average , , and , , . n is the index of iteration step and i is the index of data points. σ is the forgetting factor which controls the convergence rate, with the constraint 0.5 < σ ≤ 1.
Mstep
In the subsequent Mstep, parameters , , , are updated according to new conditional probabilities average , , .
Results
Next we test the performance of the proposed method on both simulation data and the recently released ENCODE data [21]. For both datasets, we used the following threestep protocol and parameters to construct the analysis:
1. We aligned the raw read set from either simulation or the ENCODE data to a given transcript set using bowtie0.12.7 [25]. For each read, we allowed 2 mismatches and reported at most 10 candidate alignments.
2. The abundance of each transcript in terms of estimated counts was estimated via both TEMT and a control model. Estimated counts is defined as the estimated number of reads generated from the target transcript. In TEMT, the prior of each cell type proportion was set to the same as the proportion used in simulation and ENCODE data respectively, and β^{a}, β^{b }was set to 10 times the size of the read set . μ = 200; σ = 80 were used as the mean and standard deviation of the RNAseq fragment length distribution. We chose eXpress0.9.4 [26] as the control model, as it is the stateoftheart method for transcript abundance estimation and also utilizes an onlineEM algorithm. Note that, to run TEMT, we need two read sets, in which one is for the pure sample and the other is for the mixed sample, as previously mentioned. In contrast, to run eXpress, we only need one read set from either the pure sample or the mixed sample. The forgetting factor for the online EM algorithms in both TEMT and eXpress was set to be σ = 0:85, and the errormodel in eXpress was disabled for comparison.
3. To measure the model accuracy, we used the Error Fraction (EF) measure introduced by [17] to quantify the discrepancy between the model estimates and the ground truth estimates. The Error Fraction is defined as the fraction of transcripts for which the estimates are significantly different (percent error >10% in our case) from the ground truth.
Simulation
Data preparation
To show the utility of TEMT, we first carried out a series of simulation studies. To obtain simulated read sets, we used FluxSimulator [27], a software for transcriptome and read generation by simulating the biochemical processes underlying the library preparation. FluxSimulator requires a reference transcript set to start the simulation process, so we manually downloaded 406 transcripts of 208 alternatively spliced genes in human from Alternative Splicing Structural Genomics Project (AS3D) [28], and used these 406 transcripts as the reference transcript set. We first simulated the transcript expression process twice producing two sets of relative transcript abundances, corresponding to cell type a and b respectively. Based on these two transcript abundance sets, we then simulated 6 pairs of 1 million 75bp singleend read sets corresponding to six different cell type b proportions from 40% up to 90%. The relative transcript abundances of cell type a and b were kept the same throughout these simulations. For each paired read set, one read set is for the pure sample composed of only cell type a, whereas the other read set is for the mixed sample composed of both cell type a and b, mixed with the cell type b proportion. Within the mixedsample read set, we also extracted the reads simulated purely from cell type b, which was used for control model eXpress.
Analysis
The simulated data are analyzed with the bias module both enabled and disabled. Surprisingly, the positional and sequencespecific bias module did not improve the accuracy of the transcript abundance estimation as measured by the Error Fraction of estimated counts in both TEMT and eXpress. This result may due to the stochasticity during the simulation of FluxSimulator. So we only present the results with the bias module disabled in both TEMP and eXpress in Figure 2.
Figure 2. Analysis results of simulated data of 6 different cell type b proportions with the bias module disabled. The xaxis is the different cell type b proportion, and the yaxis is the Error Fraction of the corresponding estimates. The green and blue lines are the estimates from TEMT for cell type a and cell type b, based on the two read sets of the cell type a pure sample and the mixed sample. The yellow and magenta lines are the estimates from eXpress for cell type a and cell type b, based on the two read sets of the cell type a pure sample and the cell type b pure sample. The red line is the direct estimates from eXpress for cell type b, based on the read set of the mixed sample.
We note that the estimates of cell type a from TEMT achieve roughly the same accuracy, compared with the estimates from eXpress based on the read set of the pure sample of cell type a. Also, this accuracy does not change significantly under the effect of different cell type b proportions. This is mainly due to the pure sample read set of cell type a within the input data for TEMT.
The accuracy of the estimates of cell type b from TEMT is also shown in Figure 2, which shows that TEMT generally outperforms the direct estimation method. To the best of our knowledge, there are no computational tools similar to our model that can estimate the relative transcript abundances of cell type b via RNAseq data generated from mixed samples. Typically, computational methods are applied directly to the noisy data of mixed samples and results are interpreted as the estimates of cell type b. To compare the estimates of cell type b from TEMT with direct estimates using the current method, we applied the control model eXpress directly to the read set of the mixed sample. The estimated counts from eXpress were then compared with the true counts from another 1 million simulated read set purely of cell type b, while keeping the same relative transcript abundance as the previous simulations. The corresponding Error Fractions are shown as the red line in Figure 2 regarding different cell type b proportions. Although the accuracy of cell type b estimates from TEMT is affected by different cell type b proportions, it is generally better than the direct estimates. This can be further illustrated in Figure 3, which shows that the direct estimated counts of cell type b from eXpress deviate more from the true counts as the cell type b proportion decrease, while the estimates of TEMT have much reduced deviation. We notice that as the cell type b proportion gradually decreases, the accuracy of the estimates of cell type b from TEMT also decreases. This is the result of the contamination effect from the cell type a within the mixed sample. A recent paper [4] also observed this similar phenomenon when studying copy number aberrations from heterogeneous tumor tissue.
Figure 3. Comparisons between indirect estimates from TEMT and direct estimates from eXpress for cell type b in terms of estimated counts. The xaxis is the estimated counts from the two models, and the yaxis is the true counts. Each point in the figure is a comparison between the estimated count and true count. The red points are the direct estimates from eXpress, while the blue points are the indirect estimates from TEMT. Figure (a)(f) are each comparison with cell type b proportions from 40% to 90%.
ENCODE data
Data preparation
Next we analyzed the recently released ENCODE data. Due to the lack of RNAseq data sampled from mixed tissue samples with known cell type proportions, we artificially generated the mixedsample read sets by mixing reads obtained from two different cell types. Specifically, we chose two Tier 1 cell lines, GM12878 and K562, and treated them as cell type a and cell type b respectively. The corresponding singleend RNAseq data of these two cell lines, GM78 1×75D A 1 (UCSC Accession: wgEncodeEH000125) and K562 1×75D A 1 (UCSC Accession: wgEncodeEH000126) from the Wold lab [29] at Caltech, were download from ENCODE (2012). The data downloaded from the same lab under similar protocols is intended to reduce the deviation resulting from experiments. We then randomly selected 10 million reads from GM12878 cells to form the read set of the pure sample, and 10 million reads from both GM12878 and K562 cells using different K562 cells proportions to form the read set of the mixed sample. Similar to the previous simulation study, we extracted the reads purely selected from K562 cells within the mixed sample, and used them for the eXpress control model. We studied 6 different K562 cells proportions from 40% to 90% in order to compare with the previous simulation study. 36908 human RefSeq [30] transcripts from UCSC known genes [31] were used as the transcript set for the ENCODE data.
Analysis
One major issue in studying the ENCODE data is that the ground truth of relative transcript abundance in each cell type is unknown. We used the estimates from eXpress based on the GM12878 and K562 pure samples as the ground truth. Again, the bias module was disabled for both TEMT and eXpress. The general result of ENCODE data is shown in Figure 4. Similar to the simulated data, the indirect estimates for K562 cells from TEMT generally outperforms the direct estimates from eXpress based on the read set of the mixed sample. The contamination effect from cell type a within the mixed sample observed in Figure 3 is also seen in the eXpress analysis of ENCODE data, while TEMT does not have this issue. Note that the measure of relative transcript abundances as shown in the red line of Figure 4 is no longer estimated counts, but reads per kilobase of transcript per million mapped reads (RPKM), as the total number of reads from K562 cells within the mixed sample is less than the total number of reads of the mixed sample, so that normalization is necessary for comparison. We notice TEMT underperforms direct estimates from eXpress when K562 cells proportion equals 90%. Possibly the contamination effect of GM12878 cells within the mixed sample is not severe enough at this point, as we can imagine the red line in Figure 4 will finally reach 0% Error Fraction when K562 cells proportion reaches 100%. On the other hand, since the estimates from eXpress based on the pure sample are considered the ground truth, the lower bound Error Fraction of K562 cells estimates from TEMT should be the same as the Error Fraction of GM12878 cells estimates, which is around 20% to 30% in Figure 4.
Figure 4. Analysis results of the ENCODE data of 6 different K562 cells proportions with the bias module disabled. The xaxis is the different K562 cells proportions, and the yaxis is the Error Fraction of the corresponding estimates. The green and blue lines are the estimates from TEMT for GM12878 and K562 cells, based on the read sets of the GM12878 cells pure sample and the mixed sample. The red line is the direct estimates from eXpress for K562 cells, based on the read set of the mixed sample.
Discussion
We formulated our model under the assumption that the heterogeneous tissue is only composed of two cell types, but in reality, a heterogeneous tissue might be much more complicated, consisting of multiple cell types. To relax this constraint, our model needs to be further extended to analyze more complex cases in which each cell type may have its own subtypes, e.g. breast cancer subtypes, leading to a more sophisticated heterogeneous tissue environment. Further dissecting cell subtype heterogeneity is the next step in refining our model. Moving from two cell types to arbitrarily many cell types is of great interest, since it may substantially facilitate transcriptome study of heterogeneous tissues.
One critical component necessary to make our model work is the prior information of cell type b proportion, which is necessary to resolve the identifiability problem of mixed samples. In real experiments, precise prior information regarding cell type proportions may be unavailable. One solution in the context of our model is to down weight the effect of the prior by decreasing the parameter β^{a}, β^{b}, which adds more uncertainty to the cell mixture proportion. However, this approach may decrease the performance of the model as the uncertainty in cell mixture proportion cannot be distinguished from the uncertainty in transcript abundance estimation. This observation suggests another direction to further improving our model which is to solely estimate cell type b proportion without the prior information. To fulfill this requirement, the identifiability problem needs to be resolved as mentioned in section 2.3, which turns out to be comparatively hard for RNAseq data. Unlike the heterozygous and homozygous deletions in [14], which can be utilized to differentiate between the SNP array data generated by normal cells and tumor cells, there are no such explicit differences between the reads generated by distinct cell types in RNAseq data, thus making the generative mixture model unconstrained. The "marker genes" method proposed by [11], which tries to distinguish distinct cell types by utilizing genes uniquely expressed in each cell type, provides a future potential direction to extend the current model.
Conclusion
In this article, we propose a probabilistic modelbased method TEMT to estimate transcript abundance of individual cell types based on RNAseq data from heterogeneous tissue samples. TEMT utilizes prior information to distinguish reads generated by each cell type within the heterogeneous tissue sample. Positional and sequencespecific biases are also incorporated to improve estimation accuracy. TEMT is able to process large datasets as the online EM algorithm is adopted to guarantee a time complexity proportional to the data size and a constant space complexity. Our experiments on both simulated datasets and ENCODE data shows that explicitly accounting for tissue heterogeneity can significantly improve the accuracy of transcript abundance estimation.
Competing interests
The authors declare that they have no conflict of interests.
Authors' contributions
Designed the experiments: LY and XX; Performed the experiments: LY; Wrote the paper: LY and XX; All authors contributed to the analysis, and approved the paper.
Acknowledgements
We gratefully acknowledge helpful discussions with Jake Biesinger, Daniel Newkirk and Ali Mortazavi. The work is partly supported by National Institute of Health grant R01HG006870.
Declarations
Publication of this article was supported by National Institute of Health grant R01HG006870.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 5, 2013: Proceedings of the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBseq 2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S5.
References

Marguerat S, Bähler J: RNAseq: from technology to biology.
Cellular and Molecular Life Sciences 2010, 67(4):569579. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Trapnell C, Williams B, Pertea G, Mortazavi A, Kwan G, Van Baren M, Salzberg S, Wold B, Pachter L: Transcript assembly and quantification by RNASeq reveals unannotated transcripts and isoform switching during cell differentiation.
Nature biotechnology 2010, 28(5):511515. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ren S, Peng Z, Mao J, Yu Y, Yin C, Gao X, Cui Z, Zhang J, Yi K, Xu W, et al.: RNAseq analysis of prostate cancer in the Chinese population identifies recurrent gene fusions, cancerassociated long noncoding RNAs and aberrant alternative splicings.
Cell Research 2012, 22(5):806821. PubMed Abstract  Publisher Full Text

Chan K, Jiang P, Zheng Y, Liao G, Sun H, Wong J, Siu S, Chan W, Chan S, Chan A, et al.: Cancer Genome Scanning in Plasma: Detection of TumorAssociated Copy Number Aberrations, SingleNucleotide Variants, and Tumoral Heterogeneity by Massively Parallel Sequencing.
Clinical Chemistry 2012. PubMed Abstract  Publisher Full Text

Navin N, Kendall J, Troge J, Andrews P, Rodgers L, McIndoo J, Cook K, Stepansky A, Levy D, Esposito D, et al.: Tumour evolution inferred by singlecell sequencing.
Nature 2011, 472(7341):9094. PubMed Abstract  Publisher Full Text

Comprehensive molecular portraits of human breast tumours. 2012.

EmmertBuck M, Bonner R, Smith P, Chuaqui R, Zhuang Z, Goldstein S, Weiss R, Liotta L, et al.: Laser capture microdissection.
Science 1996, 274(5289):9981001. PubMed Abstract  Publisher Full Text

Otsuka Y, Ichikawa Y, Kunisaki C, Matsuda G, Akiyama H, Nomura M, Togo S, Hayashizaki Y, Shimada H: Correlating purity by microdissection with gene expression in gastric cancer tissue.
Scandinavian Journal of Clinical & Laboratory Investigation 2007, 67(4):367379. PubMed Abstract  Publisher Full Text

Clarke R, Ressom H, Wang A, Xuan J, Liu M, Gehan E, Wang Y: The properties of highdimensional data spaces: implications for exploring gene and protein expression data.
Nature Reviews Cancer 2008, 8:3749. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Meyerson M, Gabriel S, Getz G: Advances in understanding cancer genomes through secondgeneration sequencing.
Nature Reviews Genetics 2010, 11(10):685696. PubMed Abstract  Publisher Full Text

Venet D, Pecasse F, Maenhaut C, Bersini H: Separation of samples into their constituents using gene expression data.
Bioinformatics 2001, 17(suppl 1):S279S287. PubMed Abstract  Publisher Full Text

Gusnanto A, Wood H, Pawitan Y, Rabbitts P, Berri S: Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from nextgeneration sequence data.
Bioinformatics 2012, 28:4047. PubMed Abstract  Publisher Full Text

Erkkilä T, Lehmusvaara S, Ruusuvuori P, Visakorpi T, Shmulevich I, Lähdesmäki H: Probabilistic analysis of gene expression measurements from heterogeneous tissues.
Bioinformatics 2010, 26(20):25712577. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yu G, Zhang B, Bova G, Xu J, Wang Y, et al.: BACOM: in silico detection of genomic deletion types and correction of normal cell contamination in copy number data.
Bioinformatics 2011, 27(11):14731480. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cappé O, Moulines E: Online EM algorithm for latent data models.

Bohnert R, Rätsch G: rQuant. web: a tool for RNASeqbased transcript quantitation.
Nucleic acids research 2010, 38(suppl 2):W348W351. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li B, Ruotti V, Stewart R, Thomson J, Dewey C: RNASeq gene expression estimation with read mapping uncertainty.
Bioinformatics 2010, 26(4):493500. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li J, Jiang H, Wong W: Method Modeling nonuniformity in shortread rates in RNASeq data.
Genome Biol 2010, 11(5):R25. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Roberts A, Trapnell C, Donaghey J, Rinn J, Pachter L, et al.: Improving RNASeq expression estimates by correcting for fragment bias.
Genome Biol 2011, 12(3):R22. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Hansen K, Brenner S, Dudoit S: Biases in Illumina transcriptome sequencing caused by random hexamer priming.
Nucleic acids research 2010, 38(12):e131e131. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dunham L, Kunaje A, Aldred S, Collins P, Davies C, Doyle F, Epstein C, Frietze S, Harrow J, Khatun J, Kaul R, Lajoie B, Landt S, Lee B, Pauli F, Rosenbloom K, Sabo P, Safi A, Sanyal A, Shoresh N, Simon J, Song L, Trinklein N, Altshuler R, Birney E, Brown J, Cheng C, Djebali S, Dong X, Ernst J, Furey T, et al.: An Integrated Encyclopedia of DNA Elements in the Human Genome.

Pachter L: Models for transcript quantification from RNASeq.

Dempster A, Laird N, Rubin D: Maximum likelihood from incomplete data via the EM algorithm.
Journal of the Royal Statistical Society. Series B (Methodological) 1977, 138.

Jensen J: Sur les fonctions convexes et les inégalités entre les valeurs moyennes.
Acta Mathematica 1906, 30:175193. Publisher Full Text

Langmead B, Trapnell C, Pop M, Salzberg S, et al.: Ultrafast and memoryefficient alignment of short DNA sequences to the human genome.
Genome Biol 2009, 10(3):R25. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Roberts A, Pachter L: Streaming fragment assignment for realtime analysis of sequencing experiments.
Nature Methods 2012. PubMed Abstract  Publisher Full Text

Sammeth M: [http://sammeth.net/confluence/display/SIM/Home] webcite

AS3D: [http://www.as3d.org/] webcite

Mortazavi A, Williams B, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq.
Nature methods 2008, 5(7):621628. PubMed Abstract  Publisher Full Text

Pruitt K, Tatusova T, Maglott D: NCBI reference sequences (RefSeq): a curated nonredundant sequence database of genomes, transcripts and proteins.
Nucleic acids research 2007, 35(suppl 1):D61D65. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hsu F, Kent W, Clawson H, Kuhn R, Diekhans M, Haussler D: The UCSC known genes.
Bioinformatics 2006, 22(9):10361046. PubMed Abstract  Publisher Full Text