Skip to main content

RNA-seq: technical variability and sampling

Abstract

Background

RNA-seq is revolutionizing the way we study transcriptomes. mRNA can be surveyed without prior knowledge of gene transcripts. Alternative splicing of transcript isoforms and the identification of previously unknown exons are being reported. Initial reports of differences in exon usage, and splicing between samples as well as quantitative differences among samples are beginning to surface. Biological variation has been reported to be larger than technical variation. In addition, technical variation has been reported to be in line with expectations due to random sampling. However, strategies for dealing with technical variation will differ depending on the magnitude. The size of technical variance, and the role of sampling are examined in this manuscript.

Results

In this study three independent Solexa/Illumina experiments containing technical replicates are analyzed. When coverage is low, large disagreements between technical replicates are apparent. Exon detection between technical replicates is highly variable when the coverage is less than 5 reads per nucleotide and estimates of gene expression are more likely to disagree when coverage is low. Although large disagreements in the estimates of expression are observed at all levels of coverage.

Conclusions

Technical variability is too high to ignore. Technical variability results in inconsistent detection of exons at low levels of coverage. Further, the estimate of the relative abundance of a transcript can substantially disagree, even when coverage levels are high. This may be due to the low sampling fraction and if so, it will persist as an issue needing to be addressed in experimental design even as the next wave of technology produces larger numbers of reads. We provide practical recommendations for dealing with the technical variability, without dramatic cost increases.

Background

RNA-seq (high throughput sequencing of the transcriptome) has the potential to transform the way we study gene structure and expression [1]. The ability to identify novel exons and splice sites [2–6] is just the beginning. Although there have been claims that RNA-seq is more sensitive [7] and has a larger dynamic range [3] than a microarray, these claims are now being challenged [8]. For all the optimism surrounding RNA-seq, there is growing evidence that estimating gene expression in an RNA-seq environment is not straightforward. In addition, technical variance, sequencing bias and mapping bias have been reported [9–11].

The power and promise of RNA-seq technology demand our attention. New papers on experimental design [12] will lead the way to more thoughtful experimentation. One important component in careful study planning is an understanding of the different sources of variability so they can be accounted for in the experimental design [13]. Variability can occur at many levels, both biological and technical. Biological variability is unaffected by the technology, as genotypes, individuals and even cell types vary regardless of how they are measured. Bullard et al. (2010) [14] examined the effects of library construction, flow cell and lane on detection of differential expression. The data used there was commercial grade RNA and the samples compared expected to be very divergent. They concluded that the biological variation was larger than technical variation, underscoring the importance of including biological replicates in the study design.

The observation that biological variability is large does not imply that technical variability should be ignored in the experimental design of a study. High levels of noise in a technology must be considered during experimental design [13]. Marioni et al. (2008) [15] demonstrate clearly that technical variability is not different from what is expected due to random noise (sampling error). Yet, the magnitude of the sampling error is not explicitly discussed. One particular feature of deep sequencing is that even though there are millions of reads, the proportion of the mRNA that is actually sequenced is low. Given a Solexa/Illumina RNA-seq library of 10 nM concentration with a mean insert size of 250 bp and a volume of 400 uL, the number of molecules in a typical RNA library is estimated to be 2.408 × 1012. On the Solexa/Illumina GAIIx, approximately 30 million of the total possible molecules are sampled in a given lane. This represents approximately 0.0013% of the total number of available molecules. Even with the anticipated increase in the next generation of technology, the sampling fraction is will be less than 0.004%. The impact of such a low sampling fraction on the statistical properties of RNA-seq is the focus of this manuscript. We address the following questions: 1) Does a substantial amount of technical variability exist? 2) Is the impact of technical variability the same for all levels of coverage?

The study presented here considers technical variation in three experiments using Solexa/Illumina technology. Although the number of exons consistently detected is improved by increasing the number of mappable reads, the detection of exons with low levels of coverage is inconsistent among technical replicates. Inconsistent detection of exons is most pronounced for exons with average coverage of less than 5 reads per nucleotide. In addition, there can be substantial disagreement in the estimated level of expression among technical replicates, even when the coverage is high. Although the technical variability is not unexpected, technical variability cannot be ignored in RNA-seq experiments.

Methods

Experiments

Experiment 1: Three independent samples of D. melanogaster female heads were collected with each sample representing a unique pool of biological material. Each sample was prepared according to manufacturer's instructions and then the same library was run on two lanes of a Solexa/Illumina flow cell, resulting in two technical replicates for each biological replicate, runs were 36 base-pair paired end. Experiment 2: Three independent samples of D. simulans male heads were collected with each sample representing a unique pool of biological material. Each sample was prepared according to manufacturer's instructions and then the same library was run on two lanes of a Solexa/Illumina flow cell, resulting in two technical replicates for each biological replicate, runs were 36 base-pair paired end. (Data will be deposited to the SRA) Experiment 3: One sample of D. melanogaster cell line c167 was run on 5 lanes yielding 5 technical replicates for 1 biological replicate (from modENCODE experiment GSE17107) [16]. Data from these experiments were 36 bp reads. The experimental design for experiments 1 - 3 are outlined in Figure 1. All samples were run using Solexa/Illumina paired end procedures. The relationship of lanes (same or independent flow cells) is unknown for all three experiments. The first two experiments were conducted by Joe Dunham and Michele Arbeitman at USC and sequenced at Oregon State core facility while the third experiment was conducted by Susan Celniker at Lawrence Livermore national laboratory (SRA009944).

Figure 1
figure 1

Experimental Design. A figure showing the design of the three experiments evaluated here. Biological replicates are separate individuals used for library construction. Technical replicates for the D. melanogaster female heads and D. simulans male heads data are a single library run on multiple lanes. For D. melanogaster c167 cell lines the exact nature of the technical replication is uncertain.

Simulation

The number of molecules present in the library is estimated to be 2.408 × 1012. This estimate was derived using the Illumina protocol. Specifically, the starting material was assumed to be 100 ng of mRNA that then resulted in 500 ng of library with an insert size of 250bp in a volume of 400 uL for a final library concentration of 10 nM. The number of pmoles is therefore 4 [(10 nmol/1L) (1L/1000 mL)(1 mL/1000 uL)(400 uL)(1000 pmol/1 nmol)]. The number of molecules is found by multiplying Avogadro's number which is molecules/moles and adjusting for units [4 pmole)/1000/1000/1000/1000*6.02E + 023] to give 2.408 × 1012 molecules. On the current Solexa/Illumina technology, the GAIIx, approximately 30 million of the total possible molecules are sampled in a given lane. This represents approximately 0.0013% (30,000,000/2.408 × 1012) of the total number of available molecules for analysis (Figure 2). To understand how random sampling may produce variation in technical replicates in an RNA-seq experiment, a small simulation of two technical replicates was conducted. We assume that the approximately 2.408 × 1012 molecules of mRNA were transcribed from 20,000 genes. Gene expression for the 20,000 genes was modeled using a gamma density function with shape parameter 2 and scale parameter 1. This ensures that many genes will have a small number of molecules assigned to them and some genes will have a large number of molecules assigned to them. The range of the gamma density function was examined over the interval from 0 to 17. The integers from 0 to 17 represent bins of genes with similar relative levels of expression. To simulate relative expression for each of the i = 1, 2, ..., 20,000 individual genes, a random integer (Si) from a discrete uniform distribution between 0 and 17 were drawn and the value of the gamma density tiwas calculated at Si. The number of total molecules of mRNA assigned to gene i, gi, was, Ni= ti/Σ(ti) * 2.408 × 1012 rounded to the nearest integer. Σ(gi) = Z. Due to the rounding, Z was approximately, but not exactly equal to2.408 × 1012. For each gi with Ni molecules a unique integer between 1 and Z was assigned, resulting in a unique association between an integer and a gene. To simulate the sampling of molecules of mRNA for each technical replicate, 30 million random integers between 1 and Z were sampled.

Figure 2
figure 2

Library construction and sequencing. Beginning with 100 ng of mRNA the manufacturer's protocol is used to estimate a sampling fraction.

Exon definitions

For this paper exons, rather than transcripts, were used. Estimating levels for individual transcript isoforms in RNA-seq data can be quite complex as the observed data can be a mixture of isoforms [17, 18]. Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms. To focus on the issue of technical replication, isoform estimation was not undertaken. Exons as defined by Flybase (http://www.flybase.org) version 5.4 [19] were mapped onto the genome. Overlapping exons in the same genome region were combined. A single result per genomic region was recorded for each exon in that region (Additional File 1, Additional File 2). For example, for exons with alternative start/end sites the longest genomic region was used to represent an exon. The mapping between exons and the genome region was done via a perl script (Additional File 3). 54,607 of the 60,277 genomic position exons represent uncomplicated annotation regions. However, to ensure that results are not dependent on the exon definition, the observed data were used to derive exons. Consensus contigs were defined as contiguous regions with a per nucleotide read depth greater than or equal to 3 for all lanes within a particular experiment (Additional File 4, 5). This definition results in the evaluation of genomic regions for which RNA-seq data are available for all replicates of a particular experiment. Consensus contigs and their positions for these data are listed in Additional Files 6, 7, 8.

Mapping reads

Reads spanning junctions pose a problem. Each lane of data was first evaluated with TopHat [20] to create a set of read-supported junctions. Bowtie [21] was used as the main tool for read alignment. Alignments were run with the "best" option to ensure that reported singleton alignments have the optimal score. The "tryhard" flag was also used to induce Bowtie to spend more time backtracking for hard-to-align reads. Reads aligning more than 10 times were removed from consideration. First, a paired-end alignment was run. Unaligned reads from this step were aligned unpaired. Remaining unaligned reads were mapped to the junctions from TopHat. For D. simulans samples, reads were also aligned to D. simulans genomic reference sequence [22]. Bowtie is fast but cannot handle in/dels or reads that align partially to the reference. Reads remaining unaligned at this point were aligned with LAST [9]. Results were combined into a single alignment file in SAM format and reported versus single reference (the Flybase v5 D.mel reference [19]). Analysis of read coverage was done with SAMtools [23]. For each lane the resulting pileup file was the basis for all further calculations and the remaining unmapped reads were discarded.

Detection and Quantitation

Each exon was projected onto each pileup file. If at least one read maps to a particular exon then that exon was said to be detected. RPKM which is the (number of reads in a window/length of the window)*C/mappable reads), where C is an estimate of the maximum number of reads [3] was used to estimate the quantity of each exon.

Analysis

To understand variability among replicates it is necessary to understand how well replicates agree, not how well they correlate [24–26]. Agreement between categorical variables can be calculated as the proportion of observations that share the same category divided by the total number of observations [25]. If the number of observations in each category is unequal then two measurements may agree by chance more frequently than estimated using a simple agreement. Kappa calculates the agreement between two replicates and adjusts for chance agreement [25]. The Kappa statistic gives a measure of agreement and the value 1 represents perfect agreement [25]. The most basic question is whether exons can be consistently detected between replicates. The agreement of technical and biological replicates on detection of exons was calculated using both simple agreement and Kappa statistics.

For ordinal scales, a weighted Kappa allows for distance in categories to be used in the assessment of agreement and is interpreted in the same way as the Kappa [25]. The RPKM was grouped into 3 levels undetected/low-medium/high where high levels were defined as RPKM > 20. Alternatively RPKM was grouped into 9 levels using a log scale as follows: Zero reads, RPKM less than 10, 20, 40, 80, 160, 320, 1000 and greater than 1000 were the ordinal categories used. Simple agreement (two replicates having the same level of coverage) and weighted kappa statistics were calculated. Examination of the frequency table between the technical replicates can give insight into whether the disagreements are solely a function of the level of digital gene expression.

Correlation coefficients have often been used to describe agreement. However, correlation does not reflect agreement [24–26]. Spearman's correlation coefficient quantifies how well the relationship between two variables, (Y,X), can be described using a monotonic function. If one estimate is exactly twice the size of the other estimate then the correlation will be 1. However, these two measures do not agree. In addition to the scale factor, correlations depend upon the range of values observed. A large range of values will produce a higher correlation than a narrow range of values. In comparisons of whole genome expression the range is extremely large. A further problem is the large number of data points makes the linear trend unlikely to be influenced by strong disagreement for some exons.

Bland-Altman [24, 26] proposed a method for visualizing data and providing insights into agreement. These plots are known as MVA or MA (minus versus average) plots in the gene expression literature and have been utilized intensively to assess presence of abundance-dependent bias [27]. The plot is defined as follows: on the X axis is the average of the two measures (in this case the technical replicates) and on the Y axis is the difference between the two measures. These plots were constructed for exons for all combinations of technical replicates where there are non-zero measurements in both replicates. An absolute difference of 10 has a different implication for agreement if the RPKM is 20 than if the RPKM is 300. For that reason, Bland-Altman plots were also made on the natural log scale. That is, the natural log of the RPKM was calculated for each technical replicate where the scale is considered to be additive rather than multiplicative and then used for computing the average and the difference.

Coverage plots [28] were constructed to examine whether exon length was associated with missing values. To formally test whether exon abundance and/or length is related to whether an exon is present in all or only a portion of the technical replicates, an analysis of the dataset for the cell line c 167 was conducted. Only exons that were observed in at least one technical replicate were considered. The dichotomous response of "in all technical replicates" versus "not in all technical replicates" was modeled using logistic regression. Formally, Let Yi = 1 if the ith exon is not present in all technical replicates, 0 otherwise. The model

was fit, where ai is the abundance of the ith exon and li is the length of the ith exon and i = 1,..,n where n is the number of exons observed in at least one technical replicate. The average RPKM, taken across all technical replicates was used as the measure of abundance for that exon and the length of the exon in base pairs was used as the measure of exon length.

Results

For the three experiments, the number of mappable reads per lane varies among the technical replicates and the number of exons detected increases with the number of mappable reads (Table 1). The exons that are detected vary between technical and biological replicates (Table 1). Many exons are observed/present in one technical replicate but unobserved/absent in the other technical replicate. There was no exception to this observation in any comparison made (Table 2). The agreement (Kappa) in detection between technical replicates ranges from 0.63-0.81. Lanes with higher coverage have better agreement in detection, but even with the higher coverage rates, the number of exons detected varies. The comparison with the least discrepancy had ~3,600 exons missing in one of the two technical replicates and several had more than 5,000 missing (Table 2). Among biological replicates discrepancies were larger than for technical replicates in that same experiment (Table 2). The simulation study shows that this variability in detection can be explained due to random noise from a very small sampling fraction (0.0013%). A second simulation (not shown here) based upon the multinomial model came to the same conclusion. In the simulation, and in the real data examined here, discrepancies in detection are largely due to exons with coverage per nucleotide of less than 5 in the technical replicate in which they are detected (Table 2, Additional File 9). This indicates that higher coverage increases the consistency of detection across technical replicates. Unfortunately, the vast majority of exons are observed at low coverage (Table 1).

Table 1 Mappable reads per lane in each of the three experiments.
Table 2 Agreement between technical replicates and biological replicates for RPKM measured on FB 5.4 exons (n = 60,277).

Additionally, the length of the exon may contribute to the discrepancy. To examine this, coverage plots were constructed (Additional File 4). Coverage plots have a solid line if the exon was detected in the corresponding biological and technical replicate and a blank space if the exon was not present. The plots were conducted by sorting the exons by length. Subsequent plots were made for exons surrounding a given percentile of exon length. For exons that are detected inconsistently among replicates there seems to be no trend associated with the length. Plots of the 10th percentile appear to be similar to the 20th, 50th and 90th percentile (Additional Files 10). The coverage plots do not take mRNA abundance into account. To test whether exon length and abundance, and the interaction between them contribute to differences in detection among technical replicates a logistic regression model was used. The interaction between length and abundance was significant (estimate, -0.0136; p < 0.001). This means that the relationship between length and abundance differs between highly abundant and infrequently observed exons. For infrequently observed exons, longer exons are more likely to be consistently detected than shorter exons. This pattern is not observed in highly abundant exons. It must be noted that this is a trend, and that some long exons are missing among some technical replicates and some short exons are present in all. Although main effects are often difficult to describe in the presence of interactions, in this case, it is clear that regardless of the exon length lower levels of abundance are less likely to be consistently detected (estimate -0.25; p < 0.001).

Within each lane, for each exon, the average number of reads per nucleotide (APN) and the standard deviation were computed. The coefficient of variation (CV, standard deviation/mean) versus the APN (Figure 3) was plotted for each exon. Coverage below APN 5 is highly variable. The CV for an exon with a coverage of less than 5 reads per nucleotide is often greater than 2 and can be greater than 20. The same pattern holds for all samples examined (Figure 3), indicating that this is not an artifact of the number of mappable reads. In fact, when reads from the c167 experiment are combined and the same plot is constructed, the same pattern is seen (data not shown). Although the maximum CV is smaller for the data combined across 5 lanes, than for a single lane of data, the trend is still apparent. This, together with the results on consistency of detection, indicates that an average coverage of 5 reads is a useful target when planning an RNA-seq study. For a lane with almost 29 million mappable reads, not quite 60% of the exons in Flybase were observed at this level of coverage.

Figure 3
figure 3

Coefficient of variation (CV) plotted on Y axis and average depth per nucleotide (APN) on X axis. Points with average depth of greater than 1000 are not displayed. Panel A is D. simulans BR2 TR2. Panel B is D. melanogaster female heads BR2 TR1. Panel C is TR1 for cell line c 167. Note that despite the difference in the number of mappable reads, the pattern of CV against the mean remains the same. CVs are very large when the average expression is low. Individual points represent exonic regions (Flybase 5.4) cubic smoothing line fit using R's smooth.spline function.

There is great interest in understanding the performance of RNA-seq and to that end many studies have used the correlation coefficient as a means of assessing concordance. We plot the RPKMs of technical replicates here against each other (Figure 4A,B) to demonstrate that these plots are similar to those previously reported [15]. As expected, Pearson and Spearman correlation coefficients are very high. Spearman correlation coefficients are in excess of 0.95. However, correlation is not a measure of agreement [24–26]. By categorizing expression into several ordinal groups and then comparing two technical replicates (Table 2,3, Additional File 9), it becomes immediately obvious that there are areas of disagreement between the technical replicates, and these are not limited to exons with low coverage (Table 2,3). For one particular example given (Table 3), the Spearman correlation is 0.95 and the weighted Kappa statistic is 0.80. Although the actual agreement will depend explicitly upon the ordinal values chosen and the scale used for expression, weighted Kappa statistics for a range of possible categorizations were computed and found to be always lower than 0.9, often close to 0.6 and occasionally close to 0.5. This means that the use of RPKM as a measure of expression can be a log (or even two) different at all levels of expression. This is not sensitive to the RPKM normalization, as normalization factors [29] among technical replicates are close to 1. Indeed, this discrepancy is apparent in the simulated data, where normalization is not necessary. Sampling variance can result in differences in estimated expression of 1 log or greater. This finding may explain some of the studies which report discordance between estimates of expression from RNA-seq and known concentrations [8].

Figure 4
figure 4

Scatterplot of technical replicates. Points where RPKM is 1000 or less are displayed (A). The red line is the 45 degree line. Left panel is D. simulans male heads BR2, middle panel is D. melanogaster female heads BR2 and right panel is D. melanogaster cell line c167 Tr3 vs TR4. Spearman correlation values are (0.95, 0.99, 0.96), respectively. Scatterplot of technical replicates on the log scale (log(RPKM+1)) for RPKM values of less than 1000) (B). The red line is the 45 degree line. Left panel is D. simulans male heads BR2, middle panel is D. melanogaster female heads BR2 and right panel is D. melanogaster cell line c167 Tr3 vs TR4. Spearman correlation values are (0.95, 0.99, 0.96), respectively.

Table 3 Agreement between technical replicates for biological replicate 2 D. melanogaster female heads.

The MVA plot [27], initially proposed by Bland and Altman [24, 26] to assess agreement between two methods of measuring an endpoint, is an intuitive examination of agreement that helps diagnose the magnitude and functional form of disagreements. The Bland-Altman plot of the natural log-transformed data for the technical replicates of D. simulans biological replicate 3 (Figure 5, and Additional file 11) clearly shows that at lower levels of expression there is larger disagreement between technical replicates. However, although the absolute disagreement is a function of abundance, the moving average smoother indicates that average disagreement or bias between technical replicates is consistently linear (rather than nonlinear) over the abundance range. This is in contrast to what is observed for microarray data where a nonlinear bias (disagreement) is typically observed as a function of mean abundance making nonlinear normalization necessary [30, 31]. This is also consistent with the findings of Bullard et al. (2010) [14] who find that no nonlinear normalizations are needed.

Figure 5
figure 5

Bland-Altman plot showing level of agreement between technical replicates for natural log transformed RPKM D. simulans biological replicate 3. On the Y axis is the difference between technical replicates and on the X axis is the average between technical replicates. Green lines are the average of all differences +/- 1.96 (standard deviation of the differences). The red line is drawn at zero. The blue line is a loess fit. The discrepancy between technical replicates is a function of the estimated expression level. The horizontal line is drawn at an average coverage per nucleotide of 5. Bland-Altman plots for all the remaining comparisons among technical replicates are in Additional file 11.

Does the choice of the Flybase definition of an exon impact the results? If the empirically defined common contig is used instead of the Flybase exon, then by definition we are only examining regions seen in all technical replicates. Common contigs show greater levels of disagreement in the estimation of amount of expression between technical replicates than the Flybase exons. Like the Flybase exons, areas of lower coverage are more likely to disagree (Additional File 9). The disagreement across technical replicates decreases as the coverage increases for all the data examined here (Table 2, Additional file 9).

Consistent with other reports [14, 30] biological variation is larger than technical variation. The number of exons detected in both replicates is lower between biological replicates (Table 2). Some of these differences may be due to the flow cell/lane assignments, which are unknown for all of these experiments. Consistent with previous reports, the level of coverage makes a large difference and some of the biggest disagreements are in the D. simulans experiments where 1 biological replicate has ~5 million reads where the other biological replicates had ~28 million reads. These samples must have been run not only on different flow cells but also on different iterations of the technology. Based on previous findings, the finding that these samples have the largest differences and lowest agreement is not surprising.

Discussion

The number of exons detected, was approximately 64% (out of 60,277 exons covering genome positions total) in lanes with 5-7 million reads and 84% in lanes with approximately 27 million reads. The results of this study indicate exons are not consistently detected among technical replicates when the average coverage for that exon is less than 5 reads per nucleotide. Additionally, the coefficient of variation can be very high for exons with coverage of less than 5 reads per nucleotide. Between lanes, exons that are detected in both lanes with one of those being at low coverage, disagree in the estimates of abundance (RPKM) between technical replicates, leading us to conclude that when analyzing data for differential expression, one could consider identifying exons with coverage less than 5 for closer examination. These findings are consistent across different versions of the technology, different species, and different laboratories (E1 and E2 vs E3). The number of exons covered with an average of 5 reads per nucleotide--only 21% of all exons in lanes with 5-7 million reads and 58-60% in lanes with 27 million reads--is substantially smaller than the number of exons detected.

Although agreement improves once coverage is above 5, it is not perfect and disagreement can be large even when coverage is high. Why are there such large discrepancies in detection and in estimates of expression across technical replicates? The underlying assumption for all gene expression technology, whether microarray, SAGE, Q-PCR or RNA-seq, is that the final measurement of gene expression for a particular sample is proportional to the underlying population of mRNA extracted from that sample. The variability in exon detection, and the disagreements in estimates of expression are perplexing. A possible explanation would be the choice of the particular normalization. However, the results are virtually identical if we use an average per nucleotide or RPKM. A normalization constant [29] was also estimated and found to be close to 1 among technical replicates, indicating that these results cannot be explained by improper normalization. There are at least two obvious explanations for observing this disagreement: insufficient mixing and a very low sampling fraction.

If any of the solutions generated during preparation for cluster generation are insufficiently mixed, then the distribution of reads across samples is expected to be uneven. This would be similar to what Student saw in his experiment with yeast cells [32]. In this experiment, insufficient mixing was partially attributed to the tendency of yeast cells to "stick together in groups which was not altogether abolished even by vigorous shaking." Student developed the negative binomial in response to this observation. In the negative binomial distribution the variance is larger than in a Poisson distribution. The standard Poisson has a variance equal to the mean. Poisson distributions that allow for over dispersion are general, and the negative binomial is one particular type of over dispersion. The overdispersed Poisson has been proposed for modeling RNA-seq data [12, 14, 15, 33, 34].

For both the D. melanogaster and D. simulans experiments, technical replicates were derived from an individual library analyzed on multiple lanes. Thus, discordance between the technical replicates must have occurred after library synthesis, presumably during generation of the templates on the flow cell. We can envision other experimental plans in which technical replicates are derived from an individual sample, prior to library synthesis. In this latter case, discordance between technical replicates could arise during or after library preparation. Modifications of the protocol have been developed that omit potential bias due to amplification [35]. An inspection of the cluster generation protocol (Figure 2), carried out after library synthesis, reveals additional places where variation might be introduced. The initial step of the protocol entails at least one large dilution step. Any uneven distribution of molecules in the library at this point, whether due to incomplete mixing or DNA-DNA interactions assisted by cations [36], could lead to differences in the molecules loaded onto a lane. Single stranded libraries may have better mixing. If there is aggregation of molecules in solution, then the current protocol could be modified to a dilution series with particular care taken to ensure molecules are mixed well and neither aggregate nor degrade in the process. Non-uniformity in solution is not limited to cDNA given that RNA molecules, used as input into library synthesis, can form intra- and intermolecular interactions. RNA solutions containing high concentrations of divalent metal ions are particularly prone to forming RNA aggregates [37]. As an extreme example, more than half of the RNA in a pool of random RNA sequences was found to aggregate together through self-complementarity and precipitate out of solution under conditions favoring intramolecular interactions [38]. Although it is not possible to determine from these experiments if unequal mixing is a problem, removing or reducing the potential for this problem should be a relatively straightforward modification of the current protocols.

The other possibility is that discrepancies among technical replicates are due to the sampling fraction, 0.0013% for 30 million reads (Figure 2). This fraction is so low that the behavior of random sampling in this particular scenario should be examined. The simulation study performed here indicates that data are consistent with this hypothesis. Marioni (2008) [15] also reported data consistent with randomness. Agreement between technical replicates in the simulated data (Figure 6A,B,C) shows the same patterns as the three sets of real data presented here. In the simulated data, there are also examples of genes with substantial disagreement among technical replicates. This indicates that the observed disagreement can be explained by the low sampling fraction, and that the results are consistent with the expectations of random variation. What is initially surprising is the size of the random fluctuation. However, when the sampling fraction is considered, the results are less surprising.

Figure 6
figure 6

Bland-Altman plot for simulated data. The data were log transformed and the average of the two technical replicates is on the X axis and the difference between technical replicates is on the Y axis. (A) Simulated replicates 1 versus 2. (B) Simulated replicates 1 versus 3. (C) Simulated replicates 2 versus 3. Green lines are the average of all differences +/- 1.96 (standard deviation of the differences). The red line is drawn at zero. The blue line is a lowess fit.

This study demonstrates that the current nature of the technology is to expect variation between single lanes both in exon detection and estimation of gene expression using RPKM. Technical variation, while smaller than the biological variation, cannot be ignored and should be accounted for in the study design.

The fundamental principles of statistical experimental design are thus just as important as ever, even with modern technology. R.A. Fisher defined these fundamentals to be randomization, replication and blocking [13, 39]. The optimal experimental design strategy will depend on the objectives of the study at hand. Randomization and blocking are important to incorporate in order to avoid confounding of biological with systematic experimental effects. Likely experimental effects in mRNA Seq data include lane, flow cell or library preparation batch. Randomized block designs have been shown to be an efficient use of resources while ensuring legitimate comparisons to be made. Multiplex designs [12] represent another good solution which would ensure more complete coverage of exons across biological replicates and eliminate lane as a potential confounding variable in small experiments. Consider the following simple example of three biological replicates for treatments and three biological replicates for controls for a Drosophila experiment. If each sample is run on a single lane and more coverage is needed, then another full set of 6 lanes is necessary. In contrast, a single multiplexed run where all 6 samples are run on all 6 lanes has been proposed [12]. Exons seen in some technical replicates but not others are clearly a result of technical variation. In this design the total coverage remains almost the same as in the simple design. The added benefit is that small numbers of extra lanes can be run if more coverage is desired.

A replication strategy must be chosen as well with an appropriate balance between biological replicates and technical replicates [40–43]. While, increasing the number of biological replicates increases the precision and generalizability of a study more than increasing the number of technical replicates; for studies where low abundant mRNAs are the focus, increasing technical replication may also be important.

Conclusions

RNA-seq experiments need to replicate the results both technologically and biologically as the technical variation in exon presence and absence as well as amount of coverage is not negligible. Consistent with random variation due to low sampling fractions, technical variation is most pronounced in the variability of detection of exons when coverage is low. However, disagreements between estimates of expression can occur at all levels of coverage.

References

  1. Shendure J: The beginning of the end for microarrays?. Nat Methods. 2008, 5: 585-587. 10.1038/nmeth0708-585.

    CAS  PubMed  Google Scholar 

  2. Asmann Y, Klee E, Thompson EA, Perez E, Middha S, Oberg A, Therneau T, Smith D, Poland G, Wieben E, Kocher JP: 3' tag digital gene expression profiling of human brain and universal reference RNA using Illumina Genome Analyzer. BMC Genomics. 2009, 10: 531-10.1186/1471-2164-10-531.

    PubMed  PubMed Central  Google Scholar 

  3. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5: 621-628. 10.1038/nmeth.1226.

    CAS  PubMed  Google Scholar 

  4. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras J, Stephens M, Gilad Y, Pritchard JK: Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010, 464: 768-772. 10.1038/nature08872.

    CAS  PubMed  PubMed Central  Google Scholar 

  5. Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, Schmidt D, O'Keefe S, Haas S, Vingron M, Lehrach H, Yaspo ML: A Global View of Gene Activity and Alternative Splicing by Deep Sequencing of the Human Transcriptome. Science. 2008, 321: 956-960. 10.1126/science.1160342.

    CAS  PubMed  Google Scholar 

  6. Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10: 57-63. 10.1038/nrg2484.

    CAS  PubMed  PubMed Central  Google Scholar 

  7. Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, Pritchard JK: Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics. 2009, 25: 3207-3212. 10.1093/bioinformatics/btp579.

    CAS  PubMed  PubMed Central  Google Scholar 

  8. Willenbrock H, Salomon J, Søkilde R, Barken KB, Hansen TN, Nielsen FC, Møller S, Litman T: Quantitative miRNA expression analysis: Comparing microarrays with next-generation sequencing. RNA. 2009, 15: 2028-2034. 10.1261/rna.1699809.

    CAS  PubMed  PubMed Central  Google Scholar 

  9. Frith MC, Wan R, Horton P: Incorporating sequence quality data into alignment improves DNA read mapping. Nucleic Acids Res. 2010, 38: e100-10.1093/nar/gkq010.

    PubMed  PubMed Central  Google Scholar 

  10. Oshlack A, Wakefield M: Transcript length bias in RNA-seq data confounds systems biology. Biol Direct. 2009, 4: 14-10.1186/1745-6150-4-14.

    PubMed  PubMed Central  Google Scholar 

  11. Philippe N, Boureux A, Bréhélin L, Tarhio J, Commes T, Rivals E: Using reads to annotate the genome: influence of length, background distribution, and sequence errors on prediction capacity. Nucleic Acids Res. 2009, 37: e104-10.1093/nar/gkp492.

    PubMed  PubMed Central  Google Scholar 

  12. Auer P, Doerge RW: Statistical Design and Analysis of RNA Sequencing Data Genetics. 2010, 185: 405-416.

    Google Scholar 

  13. Fisher RA: The Design of Experiments. 1960, Hafner, New York

    Google Scholar 

  14. Bullard J, Purdom E, Hansen K, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinf. 2010, 11: 94-10.1186/1471-2105-11-94.

    Google Scholar 

  15. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18: 1509-1517. 10.1101/gr.079558.108.

    CAS  PubMed  PubMed Central  Google Scholar 

  16. Celniker SE, Dillon LAL, Gerstein MB, Gunsalus KC, Henikoff S, Karpen GH, Stein L, White KP, Waqterston RH, modENCODE Consortium: Unlocking the secrets of the genome. Nature. 2009, 459: 927-930. 10.1038/459927a.

    CAS  PubMed  PubMed Central  Google Scholar 

  17. Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. 2010, 28: 511-5. 10.1038/nbt.1621.

    CAS  PubMed  PubMed Central  Google Scholar 

  18. Hiller D, Jiang H, Xu W, Wong WH: Identifiability of isoform deconvolution from junction arrays and RNA-Seq. Bioinformatics. 2009, 25: 3056-3059. 10.1093/bioinformatics/btp544.

    CAS  PubMed  PubMed Central  Google Scholar 

  19. Drysdale R, FlyBase Consortium: FlyBase: a database for the Drosophila research community. Methods Mol Biol. 2008, 420: 45-59. 10.1007/978-1-59745-583-1_3.

    CAS  PubMed  Google Scholar 

  20. Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25: 1105-1111. 10.1093/bioinformatics/btp120.

    CAS  PubMed  PubMed Central  Google Scholar 

  21. Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10: R25-10.1186/gb-2009-10-3-r25.

    Article  PubMed  PubMed Central  Google Scholar 

  22. Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, Pachter L, Myers E, Langley CH: Population Genomics: Whole-Genome Analysis of Polymorphism and Divergence in Drosophila simulans. PLoS Biol. 2007, 5: e310-10.1371/journal.pbio.0050310.

    PubMed  PubMed Central  Google Scholar 

  23. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.

    PubMed  PubMed Central  Google Scholar 

  24. Bland JM, Altman DG: Misleading Statistics: errors in textbooks, software and manuals. International Journal of Epidemiology. 1988, 17: 201-203.

    Google Scholar 

  25. Fleiss JL: Statistical methods for rates and proportions. 1981, John Wiley, New York, 38-46. 2

    Google Scholar 

  26. Bland JM, Altman DG: Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986, 307-310.

    Google Scholar 

  27. Dudoit S, Yang YH, Callow MJ, Speed TP: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica. 2002, 12: 111-139.

    Google Scholar 

  28. Oberg AL, Mahoney DW, Eckel-Passow JE, Malone CJ, Wolfinger RD, Hill EG, Cooper LT, Onuma OK, Spiro C, Therneau TM, Bergen HR: Statistical Analysis of Relative Labeled Mass Spectrometry Data from Complex Samples Using ANOVA. J Proteome Res. 2008, 7: 225-233. 10.1021/pr700734f.

    CAS  PubMed  PubMed Central  Google Scholar 

  29. Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology. 2010, 11: R25-10.1186/gb-2010-11-3-r25.

    PubMed  PubMed Central  Google Scholar 

  30. Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003, 19: 185-193. 10.1093/bioinformatics/19.2.185.

    CAS  PubMed  Google Scholar 

  31. Cunningham JM, Oberg AL, Borralho PM, Kren BT, French AJ, Wang L, Bot BM, Morlan BW, Silverstein KAT, Staggs R, Zeng Y, Lamblin AF, Hilker CA, Fan J-B, Steer CJ, Thibodeau SN: Evaluation of a new high-dimensional miRNA profiling platform. BMC Genomics. 2009, 2: 57-10.1186/1755-8794-2-57.

    Google Scholar 

  32. Student: On the error of counting with a haemacytometer. Biometrika. 1907, 5: 351-360.

    Google Scholar 

  33. Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biology. 2010, 11: R106-10.1186/gb-2010-11-10-r106.

    CAS  PubMed  PubMed Central  Google Scholar 

  34. Robinson and Smyth, Biostatistics: 2008, 9,2, pp. 321-332, and Bioinformatics (2007), vol. 23 n° 21, pages 2881-2887)

  35. Kozarewa I, Ning Z, Quail MA, Sanders MJ, Berriman M, Turner DJ: Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat Methods. 2009, 6: 291-295. 10.1038/nmeth.1311.

    CAS  PubMed  PubMed Central  Google Scholar 

  36. Bloomfield VA: DNA condensation. Curr Opin Struct Biol. 1996, 6: 334-341. 10.1016/S0959-440X(96)80052-2.

    CAS  PubMed  Google Scholar 

  37. Vlassov A: Assay of Random RNA Oligomerization in Buffers with High Concentrations of Divalent Metal Ions. ChemInform. 2005, 36: [Accessed April 28, 2010]

    Google Scholar 

  38. Bartel DP, Szostak JW: Isolation of new ribozymes from a large pool of random sequences. Science. 1993, 261: 1411-1418. 10.1126/science.7690155.

    CAS  PubMed  Google Scholar 

  39. Fisher RA: Statistical Methods for Research Workers. 1932, London, Oliver and Boyd

    Google Scholar 

  40. Oberg AL, Vitek O: Statistical design of quantitative mass spectrometry-based proteomic experiments. J Proteome Res. 2009, 8 (5): 2144-56. 10.1021/pr8010099.

    CAS  PubMed  Google Scholar 

  41. Kerr MK, Churchill GA: Experimental design for gene expression microarrays. Biostatistics. 2001, 2 (2): 183-201. 10.1093/biostatistics/2.2.183.

    CAS  PubMed  Google Scholar 

  42. Dobbin K, Simon R: Comparison of microarray designs for class comparison and class discovery. Bioinformatics. 2002, 18: 1438-1445. 10.1093/bioinformatics/18.11.1438.

    CAS  PubMed  Google Scholar 

  43. Simon RM, Korn EL, McShane LM, Radmacher MD, Wright GW, Zhao Y: Design and Analysis of DNA Microarray Investigations. 2003, Springer-Verlag

    Google Scholar 

Download references

Acknowledgements

This work was supported by National Institutes of Health [R01-GM077618, R01-GM077618-S1, R01GM076643, N01-AI40065]; National Science Foundation [PGRP 0821954, MRI CNS-0821622]. Michelle Arbeitman and Joe Dunham are kindly acknowledged for the sharing of the D. melanogaster female head and D. simulans male head data.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Lauren M McIntyre.

Additional information

Authors' contributions

LMM designed the study, did analyses and wrote the paper, KKL did analyses, designed and executed simulation study, contributed to writing of the paper, AMM designed experiments and writing of the paper, VA carried out all bioinformatics pipeline design and execution, ALO designed experiments and contributed to the paper, LJY designed experiments and contributed the paper, and SVN designed experiments and contributed to writing of the paper. All authors read and approved the final manuscript.

Electronic supplementary material

12864_2010_10200_MOESM1_ESM.PDF

Additional file 1:Overlapping exons combined into single genomic region. D. melanogaster ovo gene (Flybase ID BFgn0003028) used as an example of combining overlapping exons into a single genome region for mapping purposes. Format PDF. View with Adobe. (PDF 93 KB)

12864_2010_10200_MOESM2_ESM.TSV

Additional file 2:Chromosome postions for overlapping exons combined into single genomic regions. Format TSV. View with Wordpad. (TSV 3 MB)

Additional file 3:Perl script for mapping between exons and a genome region. Format PL. View with Wordpad. (PL 2 KB)

12864_2010_10200_MOESM4_ESM.PL

Additonal File 4:Perl script that converts contiguous sequences in a SAMtools consensus pileup to FASTA, BED, or coordinate tables. Format PL. View with Wordpad. (PL 5 KB)

Additional file 5:Perl script to find common sequences from contig BEDs. Format PL. View with Wordpad. (PL 3 KB)

12864_2010_10200_MOESM6_ESM.BED

Additional file 6:St_2a_D.melanogaster_common_contigs.bed. BED file containing chromosome postion, start, end and common_contig_ID for D. melanogaster experiment (Experiment 1). Format BED. View with Wordpad. (BED 704 KB)

12864_2010_10200_MOESM7_ESM.BED

Additional file 7:St_2b_Dsimulans_common_contigs.bed. BED file containing chromosome postion, start, end and common_contig_ID for D. simulans (Experiment 2). Format BED. View with Wordpad. (BED 800 KB)

12864_2010_10200_MOESM8_ESM.BED

Additional file 8:St_2c_167_common_contigs.bed. BED file containing chromosome postion, start, end and common_contig_ID for c167 experiment (Experiment 3). Format BED. View with Wordpad. (BED 759 KB)

12864_2010_10200_MOESM9_ESM.XLSX

Additional file 9:Agreement between technical replicates and biological replicates for RPKM measured on common contigs. A comparison in the agreement of the estimation of the amount of expression. Agreement in whether common contigs are expressed, expressed at a low level (0 < RPKM < 20), and expressed at a high level (RPKM > 20) and agreement on a 9 level ordinal scale as follows: RPKM less than 10, 20, 40, 80, 160, 320, 1000 and greater than 1000. Agreement was measured using a kappa coefficient. The number of common contigs where disagreement is greater than 2 logs and greater than one log are also given. Format XLS. View with Excel. (XLSX 10 KB)

12864_2010_10200_MOESM10_ESM.PDF

Additional file 10:Length of the Exon does not explain disagreement in technical replicates. Coverage plots of the c167 cell lines data. The Y axis is the average coverage across all technical replicates. A bar is drawn if the exon is present (at any coverage level) in that technical replicate. The 10th percentile represents that bottom 10% of the exons in length while the 90th percentile represents the top 10% of exons by length. Format PDF. View with Adobe (PDF 745 KB)

Additional file 11:Bland Altman plots for each biological replicate. (PDF 4 MB)

Authors’ original submitted files for images

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

McIntyre, L.M., Lopiano, K.K., Morse, A.M. et al. RNA-seq: technical variability and sampling. BMC Genomics 12, 293 (2011). https://doi.org/10.1186/1471-2164-12-293

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1471-2164-12-293

Keywords