Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA

Department of Stem Cell Transplantation and Cellular Therapy, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA

Department of Leukemia, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA

Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA

Department of Mathematics, The University of Texas at Austin, Austin, Texas 78712, USA

Abstract

Background

The cost of DNA sequencing has undergone a dramatic reduction in the past decade. As a result, sequencing technologies have been increasingly applied to genomic research. RNA-Seq is becoming a common technique for surveying gene expression based on DNA sequencing. As it is not clear how increased sequencing capacity has affected measurement accuracy of mRNA, we sought to investigate that relationship.

Result

We empirically evaluate the accuracy of repeated gene expression measurements using RNA-Seq. We identify library preparation steps prior to DNA sequencing as the main source of error in this process. Studying three datasets, we show that the accuracy indeed improves with the sequencing depth. However, the rate of improvement as a function of sequence reads is generally slower than predicted by the binomial distribution. We therefore used the beta-binomial distribution to model the overdispersion. The overdispersion parameters we introduced depend explicitly on the number of reads, so that the resulting statistical uncertainty is consistent with the empirical observation that measurement accuracy increases with the sequencing depth. The overdispersion parameters were determined by maximizing the likelihood. We show that our modified beta-binomial model has a lower false discovery rate than the binomial or the pure beta-binomial models.

Conclusion

We proposed a novel form of overdispersion guaranteeing that the accuracy improves with sequencing depth. We demonstrated that the new form provides a better fit to the data.

Background

To measure gene expression by RNA-Seq, RNA molecules are converted to DNA, sequenced, mapped to a gene database, and counted

It is common to study the changes in gene expression under a perturbation. The perturbation can be, for example, the deletion of a gene, which is important in characterizing the function of a new gene, or it can be the stimulation of cells by a ligand, which is important in deciphering a pathway. Many experimental techniques, such as RNA interference

Such uncertainty affects the ability to affirm which genes are differentially expressed between a sample and a control. We focus on estimating the change in gene expression because the absolute amounts of RNA, by themselves, as measured by the RPKM (reads per kilobase of exon model per million mapped reads) of the sequenced tag values

A fundamental question in RNA-Seq analysis is how the accuracy of the gene expression change measured by RNA-Seq depends on the sequencing depth

Results

Normalization by proportion

The use of a proportion is a convenient way to compare two samples. Let _{i}_{i}

In order to detect differential expression in two samples, we must determine the ratio of the counts in the two samples that corresponds to the same expression. One method, adopted in calculating the RPKM, assumes that the total number of tags sequenced, and equivalently the total amount of RNA, is a constant. The problem with RPKM normalization is that the total is dominated by the few genes that receive the most sequence reads. These genes may or may not remain constant under the two experimental conditions. One could also use housekeeping genes such as POLR2A (polymerase II) or GAPDH in a normalization procedure. The problem with relying on housekeeping genes is that the normalization depends on the choice of genes. Since the number of housekeeping genes is small, this normalization procedure is subject to fluctuation due to the relatively small tag counts on these genes. Bullard et al. have shown good results with an upper-quartile normalization method

The most conservative normalization procedure assumes that the maximum number of genes remains unchanged in the two experimental conditions. This corresponds to the maximum in the histogram ratio of tag counts _{i}_{i}_{n}

**Histogram of proportions and peak of histogram of proportion normalization**. The peak in the histogram corresponds to the largest density of genes. To determine the peak maximum, the histogram was fitted to a beta function. The blue curve shows the best fit with the maximum at _{n}

This peak of histogram normalization is expected to be the most reasonable procedure for the Chiang dataset

Normalization is performed according to the assumption that most of the genes do not change expression in the two experimental conditions. Although this convenient assumption is probably true in most cases, it has no ironclad biological justification.

The binomial distribution fits the variance within the same library but not across different libraries

We empirically studied errors in RNA-Seq experiments by examining the variance from replicated measurements. We first examined the fluctuation in reads mapped to a gene from duplicate experiments based on the same biological sample. The _{n}

**Histogram of p-values of gene expression differences from duplicate experiments on the same biological sample.** (a) Duplicate experiments were from the same DNA library sequenced in different lanes.
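To illustrate the kind of test behind such a p-value histogram, the following minimal sketch computes a two-sided exact binomial p-value for the tag counts of one gene in two lanes. The function name `binomial_two_sided_p`, the argument names, and the convention of summing all outcomes that are no more likely than the observed one are assumptions made for illustration; they are not taken from the original analysis.

```python
from math import exp, lgamma, log

def binomial_two_sided_p(x, y, p_n=0.5):
    """Two-sided exact binomial p-value for counts x (lane 1) and y (lane 2).

    p_n is the proportion expected for an unchanged gene (hypothetical name;
    0.5 assumes equal sequencing depth in both lanes). Requires 0 < p_n < 1.
    """
    n = x + y

    def pmf(k):
        # Binomial probability computed in log space to avoid overflow.
        log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        return exp(log_choose + k * log(p_n) + (n - k) * log(1 - p_n))

    probs = [pmf(k) for k in range(n + 1)]
    observed = probs[x]
    # Sum the probabilities of all outcomes at most as likely as the observed one
    # (small relative tolerance guards against floating-point ties).
    p_value = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, p_value)
```

If the counts from two lanes of the same library are binomially distributed, p-values computed this way over all genes should be approximately uniform, apart from the discreteness of small counts.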

Errors decreased with sequencing depth

We first addressed the uncertainty in the RNA-Seq measurement and how uncertainty was related to the sequencing depth empirically from repeated measurements. Specifically, from replicates of the biological sample, we calculated the standard deviation of the proportion. If the proportion satisfied the binomial distribution, we expected _{i}_{i}_{n}_{i}_{i}_{i}_{i}

**The variance of proportion versus the mean tag counts in base-10 log scale**. The variances of proportion were computed from replicates of the same biological samples. (a) Caltech dataset; (b) Chiang dataset; (c) Bullard dataset. Each point represents a gene averaged over replicates (see Table
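As a point of reference for this kind of plot: under a pure binomial model, the variance of a proportion estimated from n tags is p(1 - p)/n, so the points would fall on a line of slope -1 in log-log scale. The sketch below computes the sample variance of the proportion across replicate comparisons and the least-squares slope of log10(variance) against log10(mean count). The function names and the data layout (a list of count pairs per gene) are hypothetical and chosen only for illustration.

```python
from math import log10
from statistics import mean, variance

def proportion_variance(pairs):
    """pairs: list of (x, y) tag counts for one gene, one pair per comparison.

    Returns (mean total count, sample variance of the proportion x / (x + y)).
    """
    props = [x / (x + y) for x, y in pairs]
    totals = [x + y for x, y in pairs]
    return mean(totals), variance(props)

def loglog_slope(points):
    """Least-squares slope of log10(variance) against log10(mean count).

    points: list of (mean count, variance); non-positive values are skipped.
    """
    kept = [(log10(n), log10(v)) for n, v in points if n > 0 and v > 0]
    xs = [x for x, _ in kept]
    ys = [y for _, y in kept]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in kept)
    return sxy / sxx
```

A slope shallower than -1 (closer to zero) would indicate that the variance decreases with depth more slowly than the binomial model predicts.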

three datasets

| Data Set | A | B | |
| --- | --- | --- | --- |
| Caltech^{a} | Normal Blood | Embryonic Stem Cells | |
| | Rep1Gm12878CellLongpolyaBow0981x32 | PairedRep1H1hescCellPapErng32aR2x75 | |
| | Rep2Gm12878CellLongpolyaBow0981x32 | PairedRep2H1hescCellPapErng32aR2x75 | |
| | PairedRep1Gm12878CellLongpolyaBb12x75 | PairedRep3H1hescCellPapErng32aR2x75 | |
| | PairedRep2Gm12878CellLongpolyaBb12x75 | PairedRep4H1hescCellPapErng32aR2x75 | |
| Chiang^{b} | Knock-out of TDP-43 | Wild Type | |
| | GSM546932_A_sorted | GSM546935_B_sorted | |
| | GSM546933_D_sorted | GSM546936_C_sorted | |
| | GSM546934_E_sorted | | |
| Bullard^{c} | Brain | UHR library A | UHR library B |
| | SRR037457 | SRR037466 | SRR037470 |
| | SRR037458 | SRR037467 | SRR037471 |
| | | SRR037468 | SRR037472 |
| | | SRR037469 | |

^{a} from reference

^{b} from reference

^{c} from reference

Two estimations of

| Data Set | Pairs of Experiments used in calculation | Standard Error^{1} | MLE^{2} |
| --- | --- | --- | --- |
| Caltech | 6^{a} | 0.26 | 0.2 |
| Chiang | 3^{b} | 0.40 | 0.2 |
| Bullard | 12^{c} | 0.76 | 1.0 |

^{1} Obtained from slope in Figure

^{2} from maximizing likelihood Eq.(2)

^{a} from four libraries of same biological sample

^{b} from three knockout replicates and two wild type replicates

^{c} by comparing two different libraries having four and three replicates

Modified beta-binomial distribution

We used a beta-binomial distribution to describe the overdispersion in the data, as shown in Figure _{i}_{i}_{i}

Under this assumption, for 0 <_{i}_{i}_{i}
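For orientation, the following sketch contrasts the variance of a proportion under the binomial model with that under the standard (unmodified) beta-binomial model with fixed shape parameters. It uses the textbook beta-binomial variance formula; the symbols `alpha` and `beta` are the usual shape parameters and are not the read-dependent parameterization introduced in this paper. With fixed shape parameters the variance approaches the floor p(1 - p)/(alpha + beta + 1) at large depth instead of vanishing, which is the behaviour that a read-dependent overdispersion is meant to avoid.

```python
def binomial_prop_var(p, n):
    """Variance of the proportion x / n when x ~ Binomial(n, p)."""
    return p * (1 - p) / n

def beta_binomial_prop_var(alpha, beta, n):
    """Variance of the proportion x / n when x ~ BetaBinomial(n, alpha, beta).

    Textbook result: the binomial variance inflated by (s + n) / (s + 1),
    where s = alpha + beta and the mean proportion is alpha / s.
    """
    s = alpha + beta
    p = alpha / s
    return p * (1 - p) / n * (s + n) / (s + 1)
```

For example, with `alpha = beta = 50` the beta-binomial variance of the proportion levels off near 0.25/101 as `n` grows, whereas the binomial variance keeps decreasing as 1/n.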

Determining the parameters _{i}

Although _{i}_{i}_{i}_{i}_{i}_{i}

**Beta-binomial likelihood as a function of the parameter γ**. (a) Caltech dataset; (b) Chiang dataset; (c) Bullard dataset. The vertical lines mark the position of the maximum.

Comparison of beta-binomial and binomial distributions

Figure

**False discovery rate (FDR) and receiver operating characteristic (ROC) for three data sets.** (a) and (b) Caltech dataset; (c) and (d) Chiang dataset; (e) and (f) Bullard dataset. Three panels on the left indicate the FDR. FDR (on y-axis) is plotted against the number of most significantly differentially expressed genes (on x-axis). Three panels on the right indicate the ROC. Bi denotes binomial distribution; BB denotes beta-binomial distribution. The line for BB _{i}

We took the top 300 genes deemed most significantly differentially expressed by a t-test, and by binomial and beta-binomial distributions, and overlaid them in a plot of the fold change versus the average tag counts (see Figure

**Gene expression fold change in the TDP-43 deletion vs wild type genes (Chiang dataset).** Gene expression fold change is plotted against the average tag counts (x-axis in base-10 log; y-axis in base-2 log). The 300 most significantly differentially expressed genes by

**Venn Diagram comparison.** The Venn diagram shows the overlap of the top 300 genes identified by the beta-binomial distribution (bb), the binomial distribution (bi), and the t-test (t). The number in the lower right of the rectangle indicates the total number of transcripts detected.

Conclusions

We have investigated the error of RNA-Seq gene expression from repeated measurements. We have shown that the sequence reads from the same biological sample sequenced in different lanes follow a binomial distribution and that the library preparation steps prior to sequencing introduce larger variations between repeated experiments on the same biological specimen. We showed that the accuracy from repeated measurements improved with the sequencing depth. However, the improvement with the tag counts was generally slower than predicted by the binomial distribution. We used a beta-binomial distribution to fit the inter-library overdispersion and introduced a parameterization of the overdispersion parameter that is consistent with the intuition that measurement accuracy should increase with the sequencing depth. We optimized the overdispersion parameters using maximum-likelihood estimation. Our modified beta-binomial model achieved a lower FDR than the binomial and pure beta-binomial models.

Using the proportion of counts to estimate the gene expression difference has advantages over the RPKM expression. It has been shown recently that, contrary to a naive presumption, the number of tags mapped to different positions in the same gene is highly non-uniform

When the value of _{i}_{i}

Methods

Peak of proportion histogram normalization

The normalization procedure using the peak of the histogram of proportion assumes that most genes remain unchanged in the two conditions being compared. In this normalization procedure, we fitted the highest peak in the histogram of proportion to a beta function. The maximum of the beta function determines the normalization proportion _{n}.
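The procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the fitting routine used in this work: the beta density is fitted by the method of moments to the proportions near the tallest histogram bin (rather than by curve fitting to the histogram), and the names `peak_proportion`, `bins`, and `window` are hypothetical.

```python
from statistics import mean, variance

def peak_proportion(props, bins=50, window=0.2):
    """Estimate the normalization proportion from per-gene proportions in [0, 1].

    Locates the tallest histogram bin, fits a beta density by the method of
    moments to the proportions within `window` of that bin, and returns the
    mode of the fitted density.
    """
    counts = [0] * bins
    for p in props:
        counts[min(int(p * bins), bins - 1)] += 1
    k = counts.index(max(counts))
    center = (k + 0.5) / bins
    near = [p for p in props if abs(p - center) <= window]
    if len(near) < 2:
        return center                  # too few points to fit; fall back to bin center
    m, v = mean(near), variance(near)
    if v == 0:
        return center
    s = m * (1 - m) / v - 1            # method-of-moments estimate of a + b
    a, b = m * s, (1 - m) * s
    if a <= 1 or b <= 1:               # beta density has no interior mode
        return center
    return (a - 1) / (a + b - 2)       # mode of Beta(a, b)
```

When the two samples are sequenced to similar depth and most genes are unchanged, the returned value is expected to lie close to 0.5.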

In RPKM normalization, we first count the total number of tags mapped to any gene in the RNA-Seq experiment. The number of tags mapped to a particular gene is divided by the total number of tags sequenced (the unit is millions of tags), and then divided by the number of nucleotides in the gene (the unit is thousands).
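The arithmetic described above amounts to a single expression. In the sketch below the function and argument names are hypothetical; which total is used (all tags sequenced or all tags mapped to genes) is left to the caller.

```python
def rpkm(gene_tags, total_tags, gene_length_nt):
    """Reads per kilobase of gene length per million tags.

    gene_tags      : number of tags mapped to the gene
    total_tags     : total number of tags in the experiment
    gene_length_nt : gene length in nucleotides
    """
    tags_in_millions = total_tags / 1e6
    length_in_kb = gene_length_nt / 1e3
    return gene_tags / tags_in_millions / length_in_kb
```

For instance, 500 tags on a 2,000-nucleotide gene in an experiment with 10 million tags gives `rpkm(500, 10_000_000, 2000) == 25.0`.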

Datasets used

The three datasets we used are listed in Table

Maximum-likelihood estimation (MLE)

Let _{ip}_{ip}

where α_{ip}_{ip}_{ip}_{i}_{i}
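A generic way to carry out such a maximization is a grid search over the overdispersion parameter, sketched below. The beta-binomial log-likelihood is the standard one; everything else is an assumption for illustration. In particular, the mapping from the parameter (called `gamma` here) to the shape parameters is supplied by the caller, and `example_shapes` is a hypothetical mapping that should not be read as the parameterization defined in this paper.

```python
from math import lgamma

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_loglik(x, n, alpha, beta):
    """Log-probability of x tags out of n under BetaBinomial(n, alpha, beta)."""
    log_choose = lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
    return log_choose + log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta)

def grid_mle(data, shapes, grid):
    """Return the grid value of gamma with the largest total log-likelihood.

    data   : list of (x, n, p), x tags out of n with expected proportion p
    shapes : callable (gamma, n, p) -> (alpha, beta), supplied by the caller
    grid   : iterable of candidate gamma values
    """
    def total(gamma):
        return sum(beta_binomial_loglik(x, n, *shapes(gamma, n, p))
                   for x, n, p in data)
    return max(grid, key=total)

# Hypothetical mapping, for illustration only: alpha + beta grows as n ** gamma.
def example_shapes(gamma, n, p):
    s = n ** gamma
    return p * s, (1 - p) * s
```

A call such as `grid_mle(data, example_shapes, [0.1 * k for k in range(1, 21)])` would scan candidate values between 0.1 and 2.0 and return the one with the highest likelihood.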

Likelihood ratio test

According to the likelihood ratio test, ^{2} distribution, where _{i}_{n}
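The mechanics of a likelihood ratio test can be sketched as follows. For brevity the sketch uses a plain binomial likelihood rather than the beta-binomial one, and it assumes one degree of freedom (the gene-specific proportion is free under the alternative and fixed at the normalization proportion under the null). The function and argument names are hypothetical.

```python
from math import erfc, log, sqrt

def binomial_lrt_p(x, n, p_n):
    """p-value of a likelihood ratio test of H0: p = p_n, for x tags out of n.

    Requires n > 0 and 0 < p_n < 1.
    """
    def loglik(p):
        # Terms with zero counts contribute nothing (0 * log 0 is taken as 0).
        a = x * log(p) if x > 0 else 0.0
        b = (n - x) * log(1 - p) if n - x > 0 else 0.0
        return a + b

    p_hat = x / n                      # maximum-likelihood proportion
    stat = max(2.0 * (loglik(p_hat) - loglik(p_n)), 0.0)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return erfc(sqrt(stat / 2.0))
```

A gene whose observed proportion equals the normalization proportion yields a statistic of zero and a p-value of 1; the p-value shrinks as the observed proportion moves away from it or as the number of tags grows.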

FDR and ROC

To determine the false discovery rate (FDR), we assumed that any gene deemed to be significantly differentially expressed at a given

To determine the receiver operating characteristic (ROC), we first established a gold standard. Approximately one thousand genes in the Bullard dataset were previously assayed by RT-PCR in four independent experiments

Computing the fold change

We related the fold change in the gene expression level FC_{i}_{i}_{n}_{n}_{i}_{i}_{i}_{i}
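One way to see how a fold change follows from proportions is the sketch below. It assumes that the proportion for a gene is the count in one sample divided by the sum of the counts in both samples, so that the count ratio is p/(1 - p), and that an unchanged gene has the normalization proportion. Under these assumptions the normalized fold change is the ratio of the two odds. The names `fold_change`, `p_i`, and `p_n` are chosen for illustration, and the exact expression used in this work may differ.

```python
from math import log2

def fold_change(p_i, p_n):
    """Fold change implied by proportion p_i relative to the normalization p_n.

    Requires 0 < p_i < 1 and 0 < p_n < 1.
    """
    odds_gene = p_i / (1 - p_i)        # count ratio for the gene
    odds_norm = p_n / (1 - p_n)        # count ratio for an unchanged gene
    return odds_gene / odds_norm

def log2_fold_change(p_i, p_n):
    """Base-2 logarithm of the fold change."""
    return log2(fold_change(p_i, p_n))
```

With `p_n = 0.5`, a gene with `p_i = 0.8` corresponds to a four-fold change, that is, a base-2 log fold change of about 2.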

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

SL and GC designed the studies. GC wrote the Perl and R programs and performed data analysis and modeling. HL and YL assisted in data analysis and modeling. SL, GC, XH, JL, PM, and YJ developed the statistical model. SL and GC wrote the manuscript.

Acknowledgements

This work was partially supported by the NIH/NCI grant 5K25CA123344. HL was supported by a training fellowship from the Keck Center for Quantitative Biomedical Sciences of the Gulf Coast Consortia, on the Computational Cancer Biology Training Program from the Cancer Prevention and Research Institute of Texas (CPRIT No. RP101489).

This article has been published as part of