Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA
Department of Statistics, George Washington University, Washington, DC, USA
Novartis Institutes for BioMedical Research, Cambridge, Massachusetts, USA
Section of Rheumatology, Yale School of Medicine, New Haven, Connecticut, USA
Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, Maryland, USA
Abstract
Background
RNA-Seq technology measures the transcript abundance by generating sequence reads and counting their frequencies across different biological conditions. To identify differentially expressed genes between two conditions, it is important to consider the experimental design as well as the distributional property of the data. In many RNA-Seq studies, the expression data are obtained as multiple pairs,
Results
We present a Bayesian hierarchical mixture model for RNA-Seq data that separately accounts for the variability within and between individuals arising from a paired data structure. The method assumes a Poisson distribution for the data, mixed with a gamma distribution to account for variability between pairs. The effect of differential expression is modeled by a two-component mixture model. The performance of this approach is examined using simulated and real data.
Conclusions
In this setting, our proposed model provides higher sensitivity than existing methods to detect differential expression. Application to real RNA-Seq data demonstrates the usefulness of this method for detecting expression alteration for genes with low average expression levels or shorter transcript length.
Background
Gene expression profiles are routinely collected to identify differentially expressed genes and pathways across various individuals and cellular states. Sequencing-based technologies offer more accurate quantification of expression levels compared to other technologies. Early sequence-based expression profiling measured transcript abundance by counting short segments, known as tags, generated from the 3’ end of a transcript. Tag-based methods include the Serial Analysis of Gene Expression (SAGE,
Sequence-based approaches quantify gene expression as a ‘digital’ count and require modeling suitable for a count random variable. The Poisson distribution has been central in modelling expression data
Many practical RNA sequencing studies collect data with a paired structure, where the global expression profiles are measured before and after a treatment is applied to the same individual. Appropriate modeling of such data requires taking this design structure as well as the distributional property of the data into account. The Poisson model has been used to test the effect of drugs when the observation occurs as paired data, such as predrug and postdrug counts
In this paper, we propose a Bayesian hierarchical approach to modeling paired count data that separately accounts for the within-individual and between-individual variability in a paired data structure. Our work adopts the Poisson-Gamma mixture model
The rest of this manuscript is organized as follows. The Data section introduces the biological problem and the data that motivated this study. The Methods section presents our parametric model and the Bayesian method used to identify genes with differential expression levels. The performance of the proposed model is then examined in the Simulations section, where two sets of simulation studies are conducted: (1) simulations based on the model assumptions, to investigate the accuracy of parameter estimation, and (2) simulations that mimic the motivating data set, to examine the robustness of the proposed method. Finally, the proposed method is applied to real data, with a detailed discussion of the results and comparisons with other methods.
Data
Qian et al. (Qian F. et al.: Identification of genes critical for resistance to infection by West Nile virus using RNA-Seq analysis, submitted) designed an RNA-Seq experiment to study human West Nile virus (WNV) infection. One objective of this study was to identify altered genes/transcripts from viral infection of primary human macrophages in comparison to uninfected samples. This study naturally has a paired design structure. A total of 10 healthy donors were recruited according to the guidelines of the human research protection program of Yale University and cells were isolated from fresh heparinized blood samples for infection with WNV (strain CT 2741, MOI=1, for 24 hours) as described previously
Methods
Bayesian mixture model for paired counts
We now describe our Bayesian hierarchical mixture model to identify differentially expressed genes/transcripts from paired RNA-seq data. As noted above, such data arise naturally from experiments measuring the biological change from treatments. We start with an overdispersed count model
It has been shown that the variability among technical replicates for RNA-Seq data can be captured by the Poisson distribution
This model allows us to obtain a simpler form of the predictive density,
Assuming independence between the baseline expression and treatment effect, we use a two-component mixture model to characterize the fold change distribution, where the expression change state of each gene is defined by a latent variable
Collecting all the components discussed so far, the model can be summarized in Figure
Diagram illustrating the hierarchical model for paired RNA-Seq data.
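The hierarchical structure described above can be sketched as a small generative simulation. This is a minimal illustration under stated assumptions, not the paper's exact equations: the function names, the parameter names (`prob_de`, `shape`, `mean_rate`, `de_mean`, `de_sd`, `null_sd`) and their default values are hypothetical, library sizes are omitted for brevity, and the log fold change is assumed to follow a two-component normal mixture.

```python
import math
import random


def sample_poisson(rng, mean):
    """Draw one Poisson variate with the given mean."""
    if mean > 500.0:
        # Normal approximation for large means; avoids exp() underflow.
        return max(0, round(rng.gauss(mean, math.sqrt(mean))))
    # Knuth's multiplication method for small and moderate means.
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_paired_counts(n_genes, n_pairs, prob_de=0.2, shape=2.0,
                           mean_rate=20.0, de_mean=1.0, de_sd=0.5,
                           null_sd=0.1, seed=0):
    """Simulate (before, after) counts per gene and individual.

    Assumed structure (illustrative only):
      is_de  ~ Bernoulli(prob_de)              latent expression state
      log_fc ~ Normal(de_mean, de_sd) if DE, else Normal(0, null_sd)
      rate   ~ Gamma(shape, scale=mean_rate / shape)   per individual
      before ~ Poisson(rate)
      after  ~ Poisson(rate * exp(log_fc))
    """
    rng = random.Random(seed)
    genes = []
    for _ in range(n_genes):
        is_de = rng.random() < prob_de
        if is_de:
            log_fc = rng.gauss(de_mean, de_sd)
        else:
            log_fc = rng.gauss(0.0, null_sd)
        pairs = []
        for _ in range(n_pairs):
            # Individual-specific baseline shared by both members of a pair.
            rate = rng.gammavariate(shape, mean_rate / shape)
            before = sample_poisson(rng, rate)
            after = sample_poisson(rng, rate * math.exp(log_fc))
            pairs.append((before, after))
        genes.append({"is_de": is_de, "log_fc": log_fc, "pairs": pairs})
    return genes
```

Because the gamma-distributed baseline is drawn once per individual and shared by the before and after counts, the pairing absorbs between-individual variability while the fold change acts only within a pair.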
To complete our model description, we need to specify prior assumptions for the unknown model parameters,
1. (
2. Each
3.
4.
5. Joint independence among all the parameters.
Parameter estimation via Markov chain Monte-Carlo (MCMC)
In this section, we describe the Gibbs sampling algorithm
DE classification and false discovery rate estimation
The MCMC algorithm generates random samples from the joint posterior distribution of all model parameters. These samples are then used to infer the status of differential expression. One way to select a set of interesting genes is to rank genes using estimated posterior-mean fold change
where
Bayes’ rule assigns a gene’s expression status according to the largest posterior probability. An alternative is to classify a gene as DE if its posterior probability of being non-null is greater than a threshold (
The method was implemented in R and is available at http://bioinformatics.med.yale.edu.
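The classification step described in this section can be sketched as follows. The sketch assumes that the posterior probability of being non-null has already been estimated for each gene from the MCMC samples, and it uses a common Bayesian estimate of the false discovery rate (the average posterior null probability among the selected genes); this estimator and the function name `classify_with_fdr` are assumptions for illustration, not a transcription of the paper's formula. With two components, a threshold of 0.5 corresponds to assigning each gene to its most probable state.

```python
def classify_with_fdr(post_prob_de, threshold=0.5):
    """Select genes whose posterior P(DE) exceeds the threshold.

    Returns the selected indices and the estimated false discovery rate,
    taken here as the mean posterior null probability of the selection.
    """
    selected = [i for i, p in enumerate(post_prob_de) if p > threshold]
    if not selected:
        # No discoveries, so no false discoveries by convention.
        return selected, 0.0
    est_fdr = sum(1.0 - post_prob_de[i] for i in selected) / len(selected)
    return selected, est_fdr


# Toy posterior probabilities for five genes (made-up values).
probs = [0.99, 0.90, 0.60, 0.40, 0.05]
selected, est_fdr = classify_with_fdr(probs, threshold=0.5)
```

Raising the threshold shrinks the selected set and lowers the estimated false discovery rate, which is how a target rate can be matched in practice.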
Results and discussion
Simulations
Simulations based on the model assumptions
The first part of the simulation was conducted to examine the performance of the proposed approach when the data are generated under the model assumptions. For 10,000 genes and 10 individuals, we simulate expression counts both before and after treatment according to Equation 1. Library sizes are sampled uniformly from 7 to 18 million and relative expected baseline expression
Results in Table
| Parameters | True parameter | Posterior mean |
| | 0.01 | 0.013 (0.002) |
| | 1.5 | 1.501 (0.015) |
| | 0.25 | 0.238 (0.015) |
| | 0.1 | 0.099 (0.001) |
Estimated fold change. The left panel shows the distribution of the estimated fold changes under EE and DE by the Bayes’ rule. The red lines are the true fold change distributions. The right panel displays the relationship between the estimated and true fold change.
Simulations based on the empirical data
In the second part of the simulation, we assume that the expression abundance is measured for 5,000 genes simultaneously before and after a given treatment. The number of individuals is set to 10 for the larger sample case (cases 1 and 4), 5 for the medium case (cases 2 and 5), and 3 for the smaller sample case (cases 3 and 6). The size of each library is randomly sampled from 1.8 to 3 million so that the simulated count distribution is compatible with the real data distribution. The infected set of the RNA-Seq data (see the Data section and Qian F. et al. for details) was used as the expected baseline count data to mimic the observed mean-specific dispersion. First, we sample 5,000 gene indices with replacement to get the expected baseline expression. Expression counts for the selected indices are summarized in a matrix whose rows correspond to the selected genes in the original data matrix and whose columns correspond to individuals. Then, the relative expression (
Among 5,000 genes, the first 4,000 are assumed to have no change (
Each case was repeated 100 times. We compare the performance of our approach with DESeq (version 1.8.3)
Table
| | Case 1 | Case 2 | Case 3 |
| Number of individuals | 10 | 5 | 3 |
| | -0.170 (0.037) | -0.169 (0.041) | -0.157 (0.041) |
| | 3.653×10^{-4} (3×10^{-5}) | 3.604×10^{-4} (4.421×10^{-5}) | 3.83×10^{-4} (6.090×10^{-5}) |
| | 0.984 (0.104) | 0.968 (0.115) | 0.955 (0.110) |
| | 0.151 (0.004) | 0.153 (0.005) | 0.156 (0.006) |
| | 0.972 (0.006) | 0.993 (0.003) | 0.953 (0.011) |
| | 0.030 (0.008) | 0.046 (0.011) | 0.068 (0.013) |
| | 0.024 (0.004) | 0.037 (0.005) | 0.049 (0.006) |
| Sensitivity | 0.928 (0.014) | 0.866 (0.020) | 0.802 (0.025) |
| Specificity | 0.995 (0.001) | 0.994 (0.002) | 0.991 (0.002) |

| | Case 4 | Case 5 | Case 6 |
| Number of individuals | 10 | 5 | 3 |
| | 0.007 (0.035) | 0.006 (0.038) | -0.002 (0.037) |
| | 3.634×10^{-4} (2.931×10^{-5}) | 3.532×10^{-4} (4.155×10^{-5}) | 3.450×10^{-4} (5.283×10^{-5}) |
| | 1.172 (0.048) | 1.151 (0.059) | 1.140 (0.050) |
| | 0.179 (0.003) | 0.183 (0.004) | 0.188 (0.005) |
| | 0.990 (0.002) | 0.979 (0.004) | 0.965 (0.007) |
| | 0.030 (0.008) | 0.044 (0.009) | 0.064 (0.012) |
| | 0.021 (0.004) | 0.031 (0.005) | 0.042 (0.006) |
| Sensitivity | 0.953 (0.011) | 0.906 (0.015) | 0.862 (0.020) |
| Specificity | 0.995 (0.001) | 0.992 (0.002) | 0.989 (0.002) |

Operating characteristics are based on the Bayes rule.
False discovery rate from the simulation. True and estimated false discovery rates are compared across different thresholds for the posterior probability. Solid lines are true values and dashed lines are estimated values averaged over all simulations. The left panel shows the results from simulation cases 1, 2, and 3, where the non-null fold change is empirically generated. Results for cases 4, 5, and 6 and for cases 7 and 8 are shown in the middle and right panels, respectively.
Simulation results. Operating characteristics for 8 simulation settings are plotted with red, green, and blue lines for the Bayes, DESeq, and edgeR methods, respectively.
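For reference, the operating characteristics reported for the simulations can be computed from the known simulation truth and each method's calls as in the following minimal sketch. The function name and the toy inputs are hypothetical.

```python
def operating_characteristics(truth, calls):
    """Compare DE calls against the simulated truth.

    truth, calls: sequences of booleans (True means DE), one per gene.
    Returns sensitivity, specificity, and the realized false discovery rate.
    """
    tp = sum(1 for t, c in zip(truth, calls) if t and c)
    fp = sum(1 for t, c in zip(truth, calls) if not t and c)
    fn = sum(1 for t, c in zip(truth, calls) if t and not c)
    tn = sum(1 for t, c in zip(truth, calls) if not t and not c)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    # With no discoveries there are no false discoveries.
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "fdr": fdr}


# Toy example: 3 truly DE genes and 5 truly EE genes.
truth = [True, True, True, False, False, False, False, False]
calls = [True, True, False, True, False, False, False, False]
result = operating_characteristics(truth, calls)
```

Averaging these quantities over repeated simulated data sets gives summaries of the kind shown in the tables and figures.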
We further considered a simulation scenario similar to the real data. As shown in the data application, the log-scaled fold change estimated from the data has a larger variance under the null component. We set the null component variance to 0.35 and repeated the simulation 50 times. For features in the non-null group, the log-fold change was sampled from a normal distribution with a mean of -0.45 and a variance of 4. The simulation was performed with a sample size of 10 (case 7) and a sample size of 5 (case 8). Averages of the parameter estimates
Applications
Differential expression analysis with the Bayesian modeling
In this section, we apply our method to the motivating data set described in the Data Section. Initial values of the model parameters are calculated directly from the data. The MCMC sampler is run for 4,000 iterations after discarding the first 8,000 iterations as burn-in. On average, the computational time was around 5 minutes per 100 iterations. The total number of iterations and the burn-in period are determined by monitoring trace plots of the MCMC samples (Figure
Trace of parameters regarding the mixture distribution. Trace plots of the mixture distribution parameters (a) and distributions of fold change estimates for genes classified into the EE and DE groups, respectively, by the Bayes’ rule (b).
Result of the Bayesian approach and comparison with other existing methods. Posterior probabilities against estimated fold change (a) and consistency between the Bayesian approach and existing approaches when the same number of top-ranked transcripts are chosen (b).
Comparisons with existing methods
In this section, we compare the DE analysis results from our approach with those from existing methods. DESeq and edgeR are each applied to the same data set, and the top 2,352 DE transcripts are selected by their p-values. edgeR shows higher consistency with our Bayesian model, with 63.5% overlap, than DESeq, with 34.3% overlap. Specifically, 832, 632, and 1,364 transcripts are detected uniquely by the Bayes, edgeR, and DESeq methods, respectively (Figure
Comparison of DE transcripts. Transcripts detected by all three methods are labeled in purple and plotted as log-scaled Bayesian estimated fold change against log-scaled average expression. The other three panels show DE transcripts detected by each of the three methods, labeled in red, green, and blue for the Bayes, DESeq, and edgeR methods, respectively.
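The overlap between methods when the same number of top-ranked transcripts is chosen can be computed as in this minimal sketch. The helper names and the toy transcript identifiers (`t1` to `t4`) are hypothetical; the sketch assumes one score per transcript for each method, ranked from largest to smallest for posterior probabilities and from smallest to largest for p-values.

```python
def top_ranked(scores, n, largest_first=True):
    """Return the identifiers of the n best-scoring transcripts."""
    ranked = sorted(scores, key=scores.get, reverse=largest_first)
    return set(ranked[:n])


def overlap_fraction(reference, other):
    """Fraction of the reference set that also appears in the other set."""
    if not reference:
        return 0.0
    return len(reference & other) / len(reference)


# Toy scores (made-up values).
posterior_prob = {"t1": 0.99, "t2": 0.95, "t3": 0.20, "t4": 0.10}
p_value = {"t1": 0.001, "t2": 0.500, "t3": 0.002, "t4": 0.900}

bayes_top = top_ranked(posterior_prob, n=2, largest_first=True)
pvalue_top = top_ranked(p_value, n=2, largest_first=False)
shared = overlap_fraction(bayes_top, pvalue_top)
unique_to_bayes = bayes_top - pvalue_top
```

Fixing the number of selected transcripts across methods makes the overlap comparable even though the methods rank on different scales.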
Example of a transcript uniquely selected by the proposed Bayesian model. Illustration of expression values from a transcript detected by the proposed method only.
DE proportion and transcript length. Proportion of DE transcripts over their average expression level. Transcripts are partitioned into 10 equal-sized bins by their expression levels. The proportion of transcripts inferred to be DE is plotted on the y-axis. Red, green, and blue lines are from Bayes, DESeq, and edgeR methods, respectively.
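The binning used for this figure can be sketched as follows: transcripts are sorted by average expression, split into equal-sized bins, and the proportion called DE is computed per bin. The function name and the toy inputs are hypothetical.

```python
def de_proportion_by_bin(mean_expression, is_de, n_bins=10):
    """Proportion of DE calls within equal-sized bins of expression level.

    mean_expression: average expression per transcript.
    is_de: matching booleans (True means the transcript was called DE).
    Bins are ordered from lowest to highest expression.
    """
    order = sorted(range(len(mean_expression)),
                   key=lambda i: mean_expression[i])
    n = len(order)
    proportions = []
    for b in range(n_bins):
        # Integer boundaries keep bin sizes within one transcript of each other.
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        if members:
            proportions.append(sum(1 for i in members if is_de[i]) / len(members))
        else:
            proportions.append(float("nan"))
    return proportions


# Toy example: 20 transcripts, the 10 most highly expressed are called DE.
expression = list(range(20))
calls = [value >= 10 for value in expression]
by_bin = de_proportion_by_bin(expression, calls, n_bins=4)
```

A method whose detection rate does not depend on expression level would give a roughly flat profile across bins; the same function applies to bins of transcript length.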
Bioinformatics annotations of the results
Pathway-level analysis is one effective way to summarize biological relevance of differentially expressed genes. We perform gene enrichment analysis using DAVID (
| Term | GO ID | Count | p-value |
Cluster 1 (score: 11.39)
| Defense response | GO:0006952 | 106 | 5.3E-14 |
| Response to wounding | GO:0009611 | 90 | 1.3E-11 |
| Inflammatory response | GO:0006954 | 63 | 1.0E-10 |
Cluster 2 (score: 5.43)
| Response to molecule of lipopolysaccharide | GO:0002237 | 23 | 9.4E-8 |
| Response to cytokine stimulus | GO:0034097 | 18 | 8.0E-5 |
| Response to bacterium | GO:0009617 | 31 | 3.5E-4 |
Cluster 3 (score: 5.19)
| Regulation of cytokine production | GO:0001817 | 41 | 2.9E-9 |
| Positive regulation of cytokine production | GO:0001819 | 20 | 8.0E-5 |
| Positive regulation of multicellular organismal process | GO:0051240 | 35 | 1.1E-3 |
Cluster 8 (score: 2.89)
| Regulation of apoptosis | GO:0042981 | 100 | 1.0E-5 |
| Regulation of programmed cell death | GO:0043067 | 100 | 1.5E-5 |
| Regulation of cell death | GO:0010941 | 100 | 1.8E-5 |
Cluster 10 (score: 2.72)
| Leukocyte activation | GO:0045321 | 41 | 9.2E-6 |
| Cell activation | GO:0001775 | 46 | 1.1E-5 |
| T cell activation | GO:0046649 | 26 | 2.0E-5 |

The Count column indicates the number of DAVID IDs associated with each pathway.
Conclusions
In this paper, we have presented a hierarchical mixture model for the identification of differential gene expression from RNA-Seq data motivated by a West Nile Virus study, which collected samples as multiple pairs,
The simulation study suggests that our Bayesian approach can have better power to detect differential gene expression. In the real data application, our proposed method is able to identify transcripts with large treatment effects but low expression levels, whereas these transcripts were not inferred to be differentially expressed by the other approaches. This is likely due to the more flexible and adaptable modeling of variance across individuals in our approach. Further examination of the characteristics of these top-ranked transcripts shows that the proportion of top-ranked transcripts in the short transcript group is consistent with the proportion in the long transcript group. On the other hand, the gene sets detected by the existing approaches show a bias towards longer transcripts, as has been noted in the literature before
We have assumed that the log-fold change arises from a mixture of two normal distributions. Under DE, the model does not restrict the mean of the log-fold change distribution to zero. As a result, our proposed model can be applied to data showing asymmetry between over- and under-expression. The simulation study under empirical fold change scenarios shows that the normal distributional assumption is robust. Other possible choices for the null genes include a point mass at 0
Appendix
Variability across individuals
The Poisson-Gamma setting (Equations 1 and 2) allows extra variance among count expression values
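As a worked illustration of this extra variance, consider a generic Poisson-Gamma hierarchy. The symbols $d$ (library size), $\lambda$ (relative expression), and $\alpha$, $\beta$ (gamma shape and rate) are assumptions introduced here for illustration and need not match the notation of Equations 1 and 2.

```latex
\begin{align*}
Y \mid \lambda &\sim \mathrm{Poisson}(d\lambda), \qquad
\lambda \sim \mathrm{Gamma}(\alpha, \beta), \\
\mathrm{E}[Y] &= d\,\mathrm{E}[\lambda] = \frac{d\alpha}{\beta}, \\
\mathrm{Var}[Y] &= \mathrm{E}\bigl[\mathrm{Var}(Y \mid \lambda)\bigr]
  + \mathrm{Var}\bigl(\mathrm{E}[Y \mid \lambda]\bigr)
  = \frac{d\alpha}{\beta} + \frac{d^{2}\alpha}{\beta^{2}}
  = \mathrm{E}[Y] + \frac{\mathrm{E}[Y]^{2}}{\alpha}.
\end{align*}
```

The second term is positive for any finite $\alpha$, so the marginal count is overdispersed relative to a Poisson variable with the same mean; marginally, $Y$ follows a negative binomial distribution.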
Modeling details
The joint density of
Let
Here, some details on the integral over
Therefore,
After integrating over the expected gene- and individual-specific relative baseline expression (
We use the non-informative prior distributions for the unknown model parameters specified in the Methods Section.
Parameter estimates by the Metropolis-Hastings algorithm (MCMC)
We infer the posterior distributions using the Gibbs sampling
Step 1
Update
where
If the proposal is accepted, we replace the old
Step 2
Update
Similarly,
Step 3
Update (
(Step 3-1) Generate
(Step 3-2) Then,
Define
The acceptance probability is
where
Step 4
Update
where
Step 5
Update the mixing proportions (
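The two kinds of update used in the steps above, an accept/reject move and a direct draw from a full conditional, can be sketched generically as follows. This is a toy illustration and not the paper's sampler: the random-walk proposal, the standard normal toy target, the Beta(1, 1) prior on the mixing proportion, and all function names are assumptions made for the example.

```python
import math
import random


def metropolis_step(current, log_target, rng, step_sd=1.0):
    """One random-walk Metropolis update for a scalar parameter."""
    proposal = rng.gauss(current, step_sd)
    log_ratio = log_target(proposal) - log_target(current)
    # Symmetric proposal, so the acceptance probability is min(1, ratio).
    accept_prob = math.exp(min(0.0, log_ratio))
    if rng.random() < accept_prob:
        return proposal, True
    return current, False


def update_mixing_proportion(indicators, rng, prior_a=1.0, prior_b=1.0):
    """Draw the DE proportion from its Beta full conditional.

    Assumes a Beta(prior_a, prior_b) prior and binary state indicators
    (truthy means DE).
    """
    n_de = sum(1 for z in indicators if z)
    n_ee = len(indicators) - n_de
    return rng.betavariate(prior_a + n_de, prior_b + n_ee)


def run_toy_chain(n_iter=5000, burn_in=1000, seed=0):
    """Run the Metropolis step on a standard normal toy target."""
    rng = random.Random(seed)

    def log_target(x):
        return -0.5 * x * x

    value, draws, accepted = 0.0, [], 0
    for iteration in range(n_iter):
        value, was_accepted = metropolis_step(value, log_target, rng)
        accepted += was_accepted
        if iteration >= burn_in:
            # Keep only post burn-in samples.
            draws.append(value)
    return draws, accepted / n_iter
```

In a full sampler these updates would be cycled over all parameters within each iteration, with the retained draws summarized after the burn-in period.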
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
LMC developed and implemented the proposed model, performed statistical analysis, and drafted the manuscript. JPF participated in model development and helped with manuscript preparation. WZ processed the WNV data and participated in data analysis. FQ, VB, and RRM performed the WNV experiment. HZ designed and coordinated the study and helped draft the manuscript. All authors read and approved the final manuscript.
Acknowledgements
This work was supported in part by Grant GM59507 from NIH, 5T15LM007056-25 from PHS/DHHS, UL1 RR024139 from Yale CTSA grant, and awards from the NIH (HHS N272201100019C, AI 070343, AI 089992).