Departments of Oncology and Biostatistics, Mayo Clinic, Rochester, MN 55905, USA

Abstract

Background

Assessing the reliability of experimental replicates (or global alterations corresponding to different experimental conditions) is a critical step in analyzing RNA-Seq data. Pearson’s correlation coefficient

Results

Here we present a single-parameter test procedure for count data, the Simple Error Ratio Estimate (SERE), that can determine whether two RNA-Seq libraries are faithful replicates or globally different. Benchmarking shows that the interpretation of SERE is unambiguous regardless of the total read count or the range of expression differences among bins (exons or genes): a score of 1 indicates faithful replication (i.e., samples are affected only by Poisson variation of individual counts), a score of 0 indicates data duplication, and scores >1 correspond to true global differences between RNA-Seq libraries. In contrast, the interpretation of Pearson’s

Conclusions

SERE can therefore serve as a straightforward and reliable statistical procedure for the global assessment of pairs or large groups of RNA-Seq datasets by a single statistical parameter.

Background

Massively parallel shotgun RNA-Sequencing (RNA-Seq) has become the technology of choice for transcriptome analysis because of its potential to yield extensive biological information with digital precision. The development of effective statistical data analysis methods has been essential to the utility of RNA-Seq and has been a focus since the original reports on the technology

Pearson’s correlation coefficient

As an alternative, McIntyre et al. recently suggested a measure of concordance based on the Kappa statistic to compare RNA-Seq samples

Here we propose a new candidate statistic for RNA-Seq sample comparison based on the ratio of observed variation to what would be expected from an ideal Poisson experiment. We show that the Simple Error Ratio Estimate (SERE), unlike

Results and discussion

Candidate statistical measures

Pearson’s correlation coefficient

Sensitivity experiment

Figure

**Sensitivity and Calibration analysis of candidate statistics on simulated contamination and duplicated replicates RNA-Seq datasets.** One in silico replicate out of a pair was successively contaminated by reads from a biological replicate. Pearson’s

For the correlation and concordance measures, a value of 1 is usually viewed as the “ideal”. This value is achieved only for duplication, a situation in which the randomness inherent to read sampling is absent and a greater than expected congruence between the two samples is forced, resulting in an extreme case of “underdispersion”. Data from an actual ideal experiment (0% contamination) had on average correlation values of 0.89 and concordance values of 0.41. SERE, in contrast, yielded the expected baseline value of 1 for perfect in silico replicates (0% contamination) and detected contamination as early as 25%. Marked differences appeared when contamination reached 50%. The SERE measure also clearly marks the duplication comparison as unusual (SERE = 0). The sensitivity of both the correlation and concordance measures is much lower, making it difficult to distinguish contaminated samples from the ideal experiment.

Stability experiment

Another characteristic, stability, interrogates whether the behavior of the underlying statistic is independent of ancillary aspects of the experiment; the obvious such factor in RNA-Seq is the sequencing depth. Therefore, RNA-Seq perfect replicate datasets of different sizes were generated by drawing random reads from the universal read pool. We simulated two types of scenarios: In our first experiment (Figure
^{7} reads per sample down to 0.09 for pairs with only 0.5x10^{6} reads. All datasets represented perfect replicates by definition as they were generated in silico by sampling from a common pool. Therefore, low values of Pearson’s

**Total read count (sample size) dependence of candidate statistics comparing perfect replicate RNA-Seq datasets.** The Simple Error Ratio Estimate (SERE) was 1 when two replicate RNA-Seq datasets of different sizes were compared. Variation of SERE for repeat computations from independent replicate dataset pairs for each total read count demonstrated a stable 99% confidence interval (CI) of approximately +/- 0.01. The Pearson correlation coefficient fell as read counts decreased. Kappa also strongly depended on the total read count. All computations were performed on 200 model RNA-Seq datasets obtained by drawing reads randomly from a universal read set (described in Methods).

In our 2nd experiment (Figure

**Impact of unequal sample sizes.** Pairwise comparisons of perfect in silico replicate RNA-Seq datasets were made similarly to Figure
^{6} plus 9 × 10^{6}, 3 × 10^{6} plus 7 × 10^{6}, 5 × 10^{6} plus 5 × 10^{6}. Pearson’s

Performance of the statistics on empirical data

To put the above findings into perspective, we studied the candidate statistics on an empirical dataset which included technical and biological replicates, as well as samples from different experimental conditions (“control” vs. “SNL”, see Methods). Figure

**Benchmarking of the test statistics on empirical RNA-Seq data.** Three scenarios were investigated: technical replicates (different lanes from the same RNA-Seq library); biological replicates (different RNA-Seq libraries but same experimental condition); experimental differences (RNA-Seq libraries from different experimental conditions). Pearson’s

The SERE statistic can also be computed pairwise. For the 3 technical replicates of “control 1” for instance, the overall ratio for the three lanes is 1.005, with pairwise values of 1.003, 1.002, and 1.008. When the overall SERE statistic for a set of lanes is large we can use these individual comparisons to further sort out which lane(s) is the source of concern. A simple way to display this is to use SERE to create a cluster map. Figure

**Contains the supplementary table and figures.**

**SERE as a measure for clustering.** RNA-Seq datasets could be meaningfully clustered using SERE, indicating that it is a practical and useful test statistic when the similarity or global differences between many RNA-Seq datasets need to be characterized by a single global parameter.

The drawbacks of

This study focused on a global approach that is useful in both quality control and early analysis of RNA-Seq experiments. Therefore, an ideal measure for this task was defined as one that is easy to compute and has three features: sensitivity, calibration and stability. The SERE measure does well, but the correlation and concordance measures have serious flaws. Why?

Deficiencies in the correlation coefficient have long been known. Chambers et al.

By categorizing the data into bins, as performed by the Kappa statistic, one avoids the susceptibility to values on the extreme of the scale. However, the choice of the bin sizes becomes the driving factor for this statistic. Additional file

For the simulation study, we chose the unweighted Kappa. We took the same bin sizes as proposed by McIntyre et al., who used 0 counts as the smallest bin. Therefore, whenever the expression of an exon is so sparse that only a single read is detected among two or more samples (the exon is a singleton), the exon will be scored as “off the diagonal”, since it falls into the bin “0” for one sample and into “1-10” for the other. The fraction of singletons in our in silico samples with 5 million uniquely mapped reads (UMRs) is 11.56-11.96%, which alone limits the Kappa to a maximum of about 0.89. The total fraction of singletons tended to decrease as the total read count increased, and the calculated Kappa value rose accordingly, as seen in Figure

Computational simulation can be helpful in estimating the expected values for Pearson’s

The Simple Error Ratio Estimate (SERE)

The third candidate statistic appears to be a useful measurement to identify global differences between RNA-Seq data by fulfilling the set criteria of a good measure. A primary reason is that it compares the observed variation to an expected value, and the latter accounts for the impact of varying read depth. It is easy to compute and satisfies our three primary criteria.

Calibration

A “perfect” SERE of 1 indicates that samples differ exactly as would be expected due to Poisson variation. If RNA-Seq samples are truly different, this is identified by values > 1 (overdispersion). Values below 1 are also readily interpretable and indicate “underdispersion,” e.g. through artefactual duplication of data. A value of 0 would constitute perfect identity, such as might occur from accidentally duplicating a file name. Interestingly, detection of underdispersion has been important in detecting data falsification

Sensitivity

A constructed replicate with 25% contamination was correctly flagged as overdispersed by SERE. Once one dataset contains 50% of its reads from another biological replicate, the overdispersion becomes even more obvious. Thus, SERE is a suitable measure for detecting processing errors and other sources of variation.

Stability

In RNA-Seq experiments the read counts per exon in a sample vary, either due to rareness of the exon within the sample or due to total number of reads. The expected variation between lanes for that exon also changes. Because SERE explicitly accounts for this, comparing observed to expected counts, it is largely unaffected by these changes, regardless of the sequencing depth. This was confirmed by 200 in silico simulations performed for various numbers of reads, where SERE was 1 on average. However, each simulation is subject to variation and therefore will slightly deviate from 1 either in the direction of under- (<1) or overdispersion (>1). To characterize the range of this variation we calculated the confidence interval (CI) for all the simulations. As seen in Figure

As shown in Figure

Li et al. recently introduced the “Irreproducible Discovery Rate” (IDR) as a measure of reproducibility

Conclusions

SERE provides an efficient single-parameter statistical measure of reproducibility for RNA-Seq datasets. Unlike two other measures currently in use, Pearson’s correlation coefficient

Methods

Empirical RNA-Seq data

RNA-Seq read data used in the present analysis was taken from a previous study

Additional file

Mapping and annotation

RNA-Seq reads (50bp) were aligned to the rat reference genome (RGSC 3.4) by Bowtie

**Is a table listing the exon boundaries for the rat.**

**Is a master read count table listing the number of reads for each exon in each of the 14 lanes.**

A universal pool of RNA-Seq reads for the simulation experiments

All uniquely mapped reads from lanes 1 to 3 of the first “control” RNA-Seq sample were combined to create a universal pool of 22.9 × 10^{6} reads. The datasets for the in silico duplicates and replicates described below were generated from this pool. The in silico replicates created from the universal pool by random drawing differ, by definition, only through stochastic (Poisson) variation of the sampling process (see Results). Similarly, all 3 lanes from “control 2” were combined to create a second pool used as the “contaminant” in the contamination experiments.

In silico replicates: “Perfect” replicates

A set of RNA-Seq datasets faithfully representing Poisson variation only (perfect non-identical replicates) was generated by randomly choosing sets of 5 x10^{6} reads from the universal pool by using the “sample” function in R (Additional file

**Is an R script to create a hash index file by the ‘sample’ function in R that serves as input for** Additional file

**Is the JAVA script to create the in silico replicates.**

In silico contamination

To test whether the statistical measures were sensitive to actual differences, we contaminated one in silico replicate out of a pair with 0, 5, 10, 25, 50, 75 or 100% of reads from a biological replicate (“control 2”) via computer simulation. In detail, the first sample was created by randomly drawing 5 million reads from the universal pool of “control 1”, and the second by drawing x% of its reads from “control 1” and y% from “control 2”, where x + y = 100, for a total of 5 million reads. The procedure was repeated 200 times.
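The contamination scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each pool is represented as a list holding one exon label per read, reads are drawn without replacement, and the two samples of a pair are drawn independently from the same pool. The function names `draw_reads` and `contaminated_pair` are hypothetical and do not correspond to the scripts in the additional files.

```python
import random
from collections import Counter


def draw_reads(pool, n_reads, rng):
    """Draw n_reads reads without replacement and count them per exon."""
    return Counter(rng.sample(pool, n_reads))


def contaminated_pair(pool_a, pool_b, n_reads, contamination, rng):
    """Return (clean, mixed) per-exon read counts.

    clean: n_reads reads drawn from pool_a only.
    mixed: the given fraction of reads drawn from pool_b (the contaminant),
           the remainder from pool_a, so that both samples have n_reads reads.
    """
    n_from_b = round(n_reads * contamination)
    n_from_a = n_reads - n_from_b
    clean = draw_reads(pool_a, n_reads, rng)
    mixed = draw_reads(pool_a, n_from_a, rng) + draw_reads(pool_b, n_from_b, rng)
    return clean, mixed


# Toy pools: one exon label per read (assumed representation).
rng = random.Random(0)
pool_a = ["exon1"] * 50 + ["exon2"] * 50
pool_b = ["exon3"] * 100
clean, mixed = contaminated_pair(pool_a, pool_b, 40, 0.25, rng)
```

Setting `contamination` to 0 gives a pair of perfect replicates, and setting it to 1 gives a comparison between two different libraries.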

Processing of the empirical data

For Pearson’s correlation coefficient and Kappa, the 3 lanes of each of the two “control” samples and the 4 lanes of each of the two “SNL” samples were compared in a pairwise fashion, resulting in a total of 18 technical replicate comparisons (see Figure
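The count of 18 technical replicate comparisons follows from enumerating all lane pairs within each library, which can be checked with a short Python sketch. The labels “SNL 1” and “SNL 2” and the function name `technical_pairs` are hypothetical and are used here only for illustration.

```python
from itertools import combinations

# Number of lanes sequenced per library (3 per control, 4 per SNL).
lanes_per_sample = {"control 1": 3, "control 2": 3, "SNL 1": 4, "SNL 2": 4}


def technical_pairs(lanes_per_sample):
    """List every pair of lanes that belong to the same library."""
    pairs = []
    for sample, n_lanes in lanes_per_sample.items():
        lane_ids = [f"{sample} lane {k}" for k in range(1, n_lanes + 1)]
        pairs.extend(combinations(lane_ids, 2))
    return pairs


pairs = technical_pairs(lanes_per_sample)
print(len(pairs))  # 3 + 3 + 6 + 6 = 18
```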

Simple Error Ratio Estimate (SERE)

Given a set of N exons and M lanes, let $y_{ij}$ denote the number of reads covering the $i$th exon in the $j$th lane. Let $L_j$ be the total read count for lane $j$ and $E_i$ the total for exon $i$.

The expected variation for each observation $y_{ij}$ is

The denominator is (

Averaging over all N exons we have:

The SERE estimate is

Simple algebra shows that for a singleton count, i.e., an exon that appears only once in only one of the lanes, SERE equals exactly 1. That is, singletons shrink the overall SERE estimate towards 1, whether or not the samples are actually replicates. Therefore, we modify the average in equation 2 to sum over only the non-singleton counts. The R code to calculate SERE is provided in Additional file
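The computation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' R code: it assumes that the expected count for an exon in a lane is the exon total multiplied by the lane total and divided by the grand total, that each squared deviation is divided by this expected count, and that the sum is averaged over the retained exons with M − 1 degrees of freedom per exon before the square root is taken. The function name `sere` and the `drop_singletons` flag are hypothetical.

```python
import math


def sere(counts, drop_singletons=True):
    """Sketch of a SERE-like ratio of observed to expected (Poisson) variation.

    counts: list of rows, one per exon; each row holds the read counts of
            that exon in each of the M lanes.
    """
    n_lanes = len(counts[0])
    lane_totals = [sum(row[j] for row in counts) for j in range(n_lanes)]
    if min(lane_totals) == 0:
        raise ValueError("every lane needs at least one read")
    grand_total = sum(lane_totals)

    chi_square = 0.0
    n_used = 0
    for row in counts:
        exon_total = sum(row)
        if exon_total == 0:
            continue  # exon not observed in any lane
        if drop_singletons and exon_total == 1:
            continue  # singletons would pull the estimate towards 1
        for j, observed in enumerate(row):
            expected = exon_total * lane_totals[j] / grand_total
            chi_square += (observed - expected) ** 2 / expected
        n_used += 1

    if n_used == 0:
        raise ValueError("no exons left after filtering")
    return math.sqrt(chi_square / (n_used * (n_lanes - 1)))


duplicated = sere([[5, 5], [10, 10], [3, 3]])                # identical lanes
singletons = sere([[1, 0], [0, 1]], drop_singletons=False)   # singletons only
different = sere([[10, 0], [0, 10]])                         # disjoint expression
```

Under these assumptions, the duplicated lanes give a value of 0, the singleton-only table with equal lane totals gives 1, and the lanes with disjoint expression give a value well above 1.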

**Is the R code to calculate SERE.**

The unmodified measure of equation 2 is the measure of Poisson over-dispersion most often used in generalized linear models, see for instance the classic textbook of McCullagh and Nelder _{i}^{2} (equation 2) whenever there are small values for _{j}; extending their method shows that (_{i}^{2} will be distributed as a chi-square random variable with (_{ij} is the true fraction of exon ^{2} will follow a chi-squared distribution with

**Is an R script to calculate the confidence intervals for SERE.**

Pearson’s correlation coefficient

For a pair of lanes, we calculated the RPKM
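As an illustration, RPKM and Pearson's correlation coefficient for a pair of lanes can be computed as in the Python sketch below. It uses the standard definition of RPKM (reads per kilobase of exon per million mapped reads). Whether the values were transformed before the correlation was computed is not recoverable from the text, so no transformation is applied here; the function names `rpkm` and `pearson` are hypothetical.

```python
import math


def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Toy example: three exons in two lanes with 1 million mapped reads each.
lengths = [1000, 500, 2000]
lane_1 = [rpkm(c, l, 1_000_000) for c, l in zip([10, 4, 30], lengths)]
lane_2 = [rpkm(c, l, 1_000_000) for c, l in zip([12, 3, 28], lengths)]
r = pearson(lane_1, lane_2)
```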

Cohen’s simple Kappa statistic

The read counts were normalized to RPKM and divided into 9 bins: 0, 1-10, 11-20, 21-40, 41-80, 81-160, 161-320, 321-1000 and greater than 1000 RPKM, as suggested by McIntyre et al. To compare a pair of replicates, a 9 × 9 table of counts was constructed, with each exon adding one count to the cell defined by its bins in the two replicates (see Additional file
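The binning and the unweighted Kappa can be sketched in Python as follows. This is not the R code from the additional file. It assumes that the upper edge of each bin is inclusive, so that a value of exactly 0 falls into the first bin and any value above 0 and up to 10 falls into the second. The names `EDGES`, `rpkm_bin` and `cohen_kappa` are hypothetical.

```python
from bisect import bisect_left

# Upper edges of the first eight bins; values above 1000 fall into the ninth.
EDGES = [0, 10, 20, 40, 80, 160, 320, 1000]


def rpkm_bin(value):
    """Index (0 to 8) of the bin that an RPKM value falls into."""
    return bisect_left(EDGES, value)


def cohen_kappa(rpkm_a, rpkm_b):
    """Unweighted Cohen's Kappa for two lanes of per-exon RPKM values."""
    n_bins = len(EDGES) + 1
    n_exons = len(rpkm_a)
    table = [[0] * n_bins for _ in range(n_bins)]
    for a, b in zip(rpkm_a, rpkm_b):
        table[rpkm_bin(a)][rpkm_bin(b)] += 1

    # Observed agreement: fraction of exons on the diagonal.
    p_observed = sum(table[k][k] for k in range(n_bins)) / n_exons
    # Chance agreement from the row and column margins.
    row_sums = [sum(row) for row in table]
    col_sums = [sum(table[i][j] for i in range(n_bins)) for j in range(n_bins)]
    p_expected = sum(r * c for r, c in zip(row_sums, col_sums)) / n_exons ** 2
    if p_expected == 1.0:
        raise ValueError("Kappa is undefined when all exons share one bin")
    return (p_observed - p_expected) / (1.0 - p_expected)


lane = [0, 5, 15, 30, 2000]
kappa_identical = cohen_kappa(lane, lane)
kappa_chance = cohen_kappa([0, 0, 5, 5], [0, 5, 0, 5])
```

In this sketch an exon with an RPKM of 0 in one lane and a small positive RPKM in the other always lands off the diagonal, which illustrates how sparse exons lower the statistic.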

**Is the R code for the Kappa statistic on RPKM scale.**

Abbreviations

RNA-Seq: RNA-Sequencing; SERE: Simple error ratio estimate;

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

SKS, RK and MG performed the analyses and prepared the figures. TMT and ASB conceived the research and wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This research was supported by NINDS.