Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA

Mayo Vaccine Research Group, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA

Statistical Genetics, Sage Bionetworks, 1100 Fairview Ave N, M1-C108, Seattle, WA, 98109, USA

Program in Translational Immunovirology and Biodefense, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA

Department of Medicine, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA

Abstract

Background

mRNA expression data from next generation sequencing platforms are obtained in the form of counts per gene or exon. Counts have classically been assumed to follow a Poisson distribution, in which the variance is equal to the mean. The Negative Binomial distribution, which allows for over-dispersion (i.e., for the variance to be greater than the mean), is also commonly used to model count data.

Results

In mRNA-Seq data from 25 subjects, we found that technical variation generally followed a Poisson distribution, as has been reported previously, whereas biological variability was over-dispersed relative to the Poisson model. The mean-variance relationship across all genes was quadratic, in keeping with a Negative Binomial (NB) distribution. Over-dispersed Poisson and NB distributional assumptions demonstrated marked improvements in goodness-of-fit (GOF) over the standard Poisson model assumptions, but with evidence of over-fitting in some genes. Modeling of experimental effects improved GOF for high variance genes but increased the over-fitting problem.

Conclusions

These conclusions will guide the development of analytical strategies for accurately modeling the variance structure in these data and for sample size determination, which in turn will aid in identifying true biological signals that inform our understanding of biological systems.

Background

Next generation sequencing is a tool that is revolutionizing scientific research with its unprecedented depth of coverage, accuracy, precision, and ability to link gene expression with phenotype. The Illumina Genome Analyzer (GA), originally developed by Solexa, enables interrogation of mRNA expression via the mRNA Sequencing protocol. There are several reports on quality assessments of next generation sequencing and comparisons with microarray gene expression.

First we describe some distributional background. The Poisson distribution is commonly assumed when modeling count data. This distribution considers each individual piece of mRNA to be a random draw from a large collection of pieces of mRNA with some probability vector describing the relative proportion across all possible mRNA pieces. A piece could refer to a particular exon or gene according to the researcher’s interest. The Poisson distribution appears to describe well the variation observed between two technical replicates of the same specimen, i.e., two aliquots of the same library allocated to two lanes on a flow cell.

Biological replication adds another level of variability to the observed data. Biological variability arises from inter-individual differences, for example between human or animal subjects, which cause the probability vector describing the distribution of mRNA strands to differ between subjects. Thus, when count data are observed in multiple biological replicates, the observed variance is the sum of the technical and biological components. As a result, the observed variance is larger than expected under the Poisson distribution; that is, the variance is larger than the mean. This scenario is termed “over-dispersion”.

In the simplest case, variance increases as a linear function of the mean, i.e., the variance is a constant multiplied by the mean, Var(y) = kμ. We denote this as the over-dispersed (OD) Poisson throughout; its model parameters can be estimated via quasi-likelihood methods. A more sophisticated model assumes that the within-specimen (technical) variation follows a Poisson distribution and that the between-specimen mean values follow a gamma distribution. This gives rise to the negative binomial (NB) distribution, in which the variance increases as a quadratic function of the mean, i.e., Var(y) = μ + φμ^{2}.
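The quadratic form can be sketched from the law of total variance. The notation here is ours rather than the original text's: λ denotes a subject's underlying mean count for a gene, assumed to be gamma distributed across subjects with mean μ and variance φμ^{2}, and y given λ is assumed to be Poisson.

```latex
\begin{align}
\operatorname{Var}(y)
  &= \operatorname{E}\bigl[\operatorname{Var}(y \mid \lambda)\bigr]
   + \operatorname{Var}\bigl(\operatorname{E}[y \mid \lambda]\bigr) \\
  &= \operatorname{E}[\lambda] + \operatorname{Var}(\lambda) \\
  &= \mu + \phi\mu^{2}
\end{align}
```

Under these assumptions the first term is the technical (Poisson) component and the second is the biological component. Setting φ = 0 recovers the Poisson case, and the coefficient of variation of λ is the square root of φ.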

Our goal in the present work was to characterize the mean-variance relationship in mRNA-Seq data in order to guide the choice of distributional assumptions. We first evaluated technical variability in gene-level counts to ensure consistency with what others have reported. Next, we evaluated the variance structure between biological replicates within a treatment group, considering the functions Var(y) = μ, kμ, and μ + φμ^{2} with and without normalization and blocking factors. We believe this work will be useful to others in analyzing and interpreting similar data.

Methods

Subjects

Twenty-five study subjects representing the extremes of the humoral immune response to rubella vaccine (12 high antibody responders with a median titer of 145 IU/mL and 13 low responders with a median titer of 10 IU/mL) were selected from a large population-based, age-stratified random sample of 738 healthy children and young adults (age 11 to 19 years) from Olmsted County, Minnesota. Clinical and demographic characteristics of the population-based sample have been previously reported.

PBMC culture, stimulation and RNA isolation

Subjects’ PBMC (peripheral blood mononuclear cells) were thawed and stimulated (or left unstimulated) with live rubella virus (multiplicity of infection, MOI = 5, 48 hrs). Total RNA was extracted from stabilized cells using RNeasy Plus mini kit (Qiagen, Valencia, CA). RNA concentration and quality were assessed by Nano Chip kit analysis on an Agilent 2100 Bioanalyzer (Agilent, Palo Alto, CA). Fifty samples from 25 subjects were completed for culture, RNA extraction and RNA quality control with adequate concentration and purity (lack of DNA contamination), as well as good RNA integrity and lack of RNA degradation.

Library preparation and sequencing

Libraries were prepared using the mRNA-Seq Sample Prep Kit (Illumina, San Diego, CA) following the manufacturer’s instructions. Briefly, polyadenylated RNA was isolated from 1 μg of total RNA using two rounds of hybridization to oligo-dT magnetic beads. The mRNA samples were chemically fragmented, reverse transcribed and converted into double stranded cDNA. Unique Illumina adaptors were ligated to the DNA fragments after end repair (to produce blunt ends) and A-base tailing. Fragments of approximately 200 bp were gel purified and amplified by PCR. The libraries were validated and quantified on an Agilent 2100 Bioanalyzer (Agilent, Palo Alto, CA) using DNA 1000 Chip kits. Sequencing was carried out on the Genome Analyzer GAIIx (Illumina, San Diego, CA). Samples were sequenced as single end reads using Illumina’s Single Read Cluster Generation kit (v2) and 51 Cycle SBS Sequencing Kit (v3) following the manufacturer’s instructions.

The first five flow cells were processed using Sequencing Control Studio (SCS) v 2.01 and the last eight flow cells were processed with SCS v 2.4 which allowed for higher cluster densities and higher pass filter rates. The images from the sequencing cycles were processed using the Illumina Pipeline Software v1.5. Specifically, images were converted to signal intensities using Illumina Pipeline’s FireCrest program. Base calling from intensity values was performed, and the quality scores for every base were calculated using the Bustard program. Illumina’s alignment tool, ELAND, was used to align the sequence reads to genome build 36 and exon junction databases. Illumina’s CASAVA tool version 1.0 was used to summarize the alignment results using only reads that mapped to a unique genome location and CASAVA results were imported into Genome Studio to generate the count tables for genes, exons, and exon-junctions. Sequencing pass/fail quality was determined by cluster densities, percent of clusters passing filters, percent of reads aligning to the reference, and percent error rate of the alignment.

Statistical experimental design

Specimens were randomly allocated to flow cells and lanes as follows (Figure


**Study design. A**) Cartoon depicting the allocation of subject samples to flow cells. One high (H) and one low (L) responder were allocated to each flow cell. Within a flow cell, each subject was randomly allocated to lanes 1-4 or 5-8, ensuring that H/L response was balanced over lanes across all flow cells. Finally, for each subject, two technical replicates of their stimulated and unstimulated specimens were randomly allocated to the first two or second two lanes such that stimulation status was balanced over lanes. **B**) Flow diagram demonstrating the full initial sample set and reasons for excluded lanes of data for the final analysis data set.

Statistical methods

The endpoint used for analysis was total reads or counts per gene; for the analysis of technical variation this is counts per lane, while for biological variation this is counts summed over two technical replicate lanes. We evaluated the suitability of the Poisson, OD Poisson, and NB assumptions for modeling biological variability. The Poisson distribution assumes the variance is equal to the mean, Var(y) = μ; the OD Poisson assumes Var(y) = kμ; and the NB distribution assumes Var(y) = μ + φμ^{2}, where μ is the mean count, k is a constant multiplier, and φ is the dispersion parameter.

Model goodness-of-fit (GOF) was assessed via quantile-quantile (QQ) plots of per-gene Pearson statistics.
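As an illustration of a per-gene Pearson statistic, the sketch below divides the sum of squared deviations from the sample mean by the variance implied by each distributional assumption. It is a minimal sketch, not the authors' code: the function names and the example counts are hypothetical, and k and φ are treated as known rather than estimated. The supplementary figure description suggests that, for n subjects in a null model, such statistics are compared against a chi-square distribution with n − 1 degrees of freedom.

```python
def variance(mu, model, k=1.0, phi=0.0):
    """Variance implied by each distributional assumption at mean mu."""
    if model == "poisson":
        return mu
    if model == "od_poisson":
        return k * mu
    if model == "nb":
        return mu + phi * mu ** 2
    raise ValueError(f"unknown model: {model}")


def pearson_gof(counts, model, k=1.0, phi=0.0):
    """Pearson statistic for one gene: sum((y - mean)^2) / V(mean)."""
    mu = sum(counts) / len(counts)
    if mu == 0:
        raise ValueError("gene has no counts; statistic is undefined")
    squared_deviations = sum((y - mu) ** 2 for y in counts)
    return squared_deviations / variance(mu, model, k=k, phi=phi)


# Hypothetical counts for one gene in five subjects (not study data).
counts = [80, 95, 100, 110, 115]

poisson_stat = pearson_gof(counts, "poisson")   # 750 / 100 = 7.5
nb_stat = pearson_gof(counts, "nb", phi=0.05)   # 750 / 600, about 1.25
print(poisson_stat, nb_stat)
```

With these made-up counts the Poisson statistic is six times the NB statistic, which shows how allowing a quadratic variance term shrinks large GOF values for over-dispersed genes.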

**Additional file 1, Figure S1.** Evaluation of asymptotic GOF distributional assumptions. QQ plot of GOF statistics from simulated null (i.e., no differential expression) NB data. Data for genes were simulated with means equal to the mean vector in the unstimulated data presented herein and dispersion parameters equal to the edgeR moderated dispersion estimates. GOF statistics were calculated for each gene as described in the methods, here using the sample mean and true dispersion parameter. Sample sizes of A) n = 1000 and B) n = 23 were simulated in order to understand whether the asymptotic chi-square distribution was appropriate. The theoretical distributions are chi-square with A) 999 degrees of freedom and B) 22 degrees of freedom. From the right hand tails we see that the observed distribution does not have values quite as extreme as those in the theoretical distribution. However, the observed distributions are very close to the theoretical distributions, as demonstrated by most points lying on the identity line. We conclude that the chi-square distribution is approximately correct for the data presented herein.

**Additional file 1, Figure S2.** Technical reproducibility and functional form of bias. Counts were scaled by total lane counts. A) Representative scatter plot of technical replicate 1 versus technical replicate 2 for one subject. Spearman correlation was 0.9941 for this pair. Axes are on the log base 2 scale. B) MVA plot for the same pair of technical replicates. The vertical axis is the difference between the counts in the two replicates on the log2 scale and the horizontal axis is the average of the two counts on the log2 scale; there is one point for each gene observed in at least one replicate. The shading indicates the density of points in that area, with darker shading representing higher density. If two replicates yielded identical results, all points would lie on the y = 0 horizontal line (indicated on the plots for reference). A locally weighted moving average smoother is shown to demonstrate the average bias as a function of average count.

**Additional file 1, Figure S3.** Individual QQ plots assessing distribution of technical replicates. QQ plots for all 24 subjects for whom data were received, assuming Poisson variation in pairs of technical replicates. Vertical axes indicate observed quantiles and horizontal axes indicate theoretical quantiles.


The generalized linear model (GLM) framework was used to fit per-gene models to test for differential expression between high and low response groups using the log link function.

Model fits were evaluated with no normalization, with total count per lane-pair, or with 75^{th} percentile count per lane-pair as a normalization constant.
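The sketch below illustrates how such normalization constants could enter a log-link model as offsets. It is a hedged illustration rather than the authors' implementation: all function names and data are hypothetical, the 75th percentile is taken over all genes in the table (whether zero-count genes were excluded is not stated here), and the fold-change estimate is the closed-form maximum likelihood solution for a two-group Poisson model only, not for the NB model.

```python
import math


def percentile(values, q):
    """q-th percentile (0 to 100), interpolating linearly between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty list is undefined")
    pos = (len(ordered) - 1) * q / 100.0
    lower = math.floor(pos)
    upper = math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def log_offset(gene_counts, method="upper_quartile"):
    """Log of the normalization constant for one lane-pair."""
    if method == "none":
        return 0.0
    if method == "total":
        size = sum(gene_counts)
    elif method == "upper_quartile":
        size = percentile(gene_counts, 75)
    else:
        raise ValueError(f"unknown method: {method}")
    if size <= 0:
        raise ValueError("normalization constant must be positive")
    return math.log(size)


def poisson_log_fold_change(counts_a, offsets_a, counts_b, offsets_b):
    """Two-group Poisson log-link MLE of the log rate ratio (group A vs group B)."""
    rate_a = sum(counts_a) / sum(math.exp(o) for o in offsets_a)
    rate_b = sum(counts_b) / sum(math.exp(o) for o in offsets_b)
    return math.log(rate_a) - math.log(rate_b)


# Hypothetical gene-count vectors for four lane-pairs (not study data).
lane_pairs = {
    "high_1": [0, 10, 20, 30, 40],
    "high_2": [0, 20, 40, 60, 80],
    "low_1": [0, 10, 20, 30, 40],
    "low_2": [0, 20, 40, 60, 80],
}
offsets = {name: log_offset(c) for name, c in lane_pairs.items()}

# One hypothetical gene: counts in the two high and two low lane-pairs.
lfc = poisson_log_fold_change(
    [30, 60], [offsets["high_1"], offsets["high_2"]],
    [15, 30], [offsets["low_1"], offsets["low_2"]],
)
print(round(lfc, 4))  # log(2), about 0.6931
```

In this toy example the second lane-pair in each group was sequenced twice as deeply, and the offsets absorb that difference so that the estimated log fold-change reflects only the group effect.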

All statistical computing was performed in R.

Convergence rates were excellent for all models, with the worst case being non-convergence for four genes when a blocking factor was included in the model. The data have been deposited in the Gene Expression Omnibus and are available (with anonymous gene names since the biological findings have not yet been published) via the following link:

Results

Subjects and assays

To minimize variation, all 25 study subjects were Caucasian females. Median rubella-specific antibody response in the 12 high response subjects was 145 IU/mL (min 115, max 325) and in the 13 low response subjects was 10 IU/mL (min 3, max 14).

Figure

At least one count was observed for 17,337 genes. Total counts per lane ranged from 3.7 million to 10.7 million (Figure


**Distributions of counts. A**) Histogram of total reads per lane for 46 lanes (unstimulated specimens) on the scale of millions of reads. **B**) Frequency histogram of average counts per gene per lane on the log10 scale. **C**) Cumulative percent of average counts per lane as a function of the percent of genes contributing. Lines for both high (red) and low (blue) responders were drawn but are not distinguishable.

Technical variation

We assessed technical variation as the variation between the same specimen pipetted onto two lanes of a flow cell. Technical reproducibility in these data was good as demonstrated by scatter and minus versus average (MVA) plots (Additional file

QQ plots assuming Poisson-type variation were made for all pairs of technical replicates using Pearson GOF statistics; all 24 are available in Additional file

Biological variation

We define biological variation as the variation between multiple subjects in the same group, e.g., high versus low responders, cancer versus normal, treatment versus not, etc. This variation may be small for cell lines and moderate for genetically identical animals, but can be very large for human samples. Analysis methods need to be able to cope with and appropriately model this variation.

In order to understand the functional form of the mean-variance relationship, we created a scatter plot of the sample biological variation

**R function to plot variance as a function of the mean.**


**R function to create QQ plots of Pearson GOF statistics assuming the NB distribution.**



**Assessing presence and magnitude of over-dispersion. A**) The horizontal axis indicates the mean scaled count within each of the high/low response groups on the square root scale (labeled on the raw scale) and the vertical axis indicates the variation on the standard deviation (i.e. square root of the variance) within each group. Each gene is thus represented by two points, one for each response group. The green line corresponds to the Poisson assumptions, the blue line corresponds to OD Poisson assumptions, and the red line corresponds to NB assumptions, with lines constructed as described in the text. **B**) Local estimates of


**Distribution of GOF statistics.** Residual QQ plots of model fits normalized with the 75^{th} percentile count and no blocking factor. Tick-marks along the top indicate deciles. The top 5% of GOF statistics are indicated in alternate colors, with the top 1% being red and the next 4% being blue. **A**) Standard Poisson, **B**) NB with a global estimate of φ, **C**) NB with per-gene estimates of φ, **D**) NB with local estimates of φ. **E**-**H** are as in A-D but zoomed in on the bottom left corner of the plots.

We questioned whether a one-size-fits-all variance structure was appropriate. It is biologically plausible (and many would argue likely) that the relationship, i.e. the precise multiplier _{10}(average count per lane-pair) of approximately 2 and extending to approximately 4.5 on the horizontal axis (Figure

Experimental variation

As in Bullard et al.

Potential sources of experimental variation examined here were flow cell, lane-pair and library preparation batch and all of these resulted in lower maximum observed Pearson chi-square values which can be seen on the QQ plots (Figure ^{th} percentile offset was included, there was no clear relationship between flow cell effect estimates and run order (Figure


**Distribution of GOF statistics when experimental factors are included in the model.** QQ plots of model fits with the NB distribution, local estimates of φ, and the 75^{th} percentile count offset, including blocking factors as indicated. Tick-marks along the top indicate deciles. The top 5% of GOF statistics are indicated in alternate colors, with the top 1% being red and the next 4% being blue. **A**) lane-pair, **B**) library preparation batch, **C**) flow cell. Panel **D** is the same as A, but zoomed in on the bottom left corner of the plot; no zoom is needed for panels B and C.


**Distribution of flow cell effects.** Box plots of contrast coefficient estimates indicating the difference of flow cells 2 – 13 from flow cell 1 sorted by run order. The flow cells represented by the left four (blue) boxes were analyzed with SCS v 2.01 while the right-most eight (red) were analyzed with SCS v 2.4. **A**) Results from models without an offset to account for differences in total counts per lane. **B**) Results from models including the 75^{th} percentile offset.

While it is important for a model to explain all sources of variation, this must be balanced against over-fitting the data. This is especially the case for studies with very small sample sizes, which are typical of next generation sequencing studies given their substantial monetary and time costs.

Characterizing genes with poor model fit

We investigated whether the genes with small counts were those with the smallest (indicative of over-fitting) or largest (indicative of under-fitting, or not explaining enough variation) GOF statistics when using NB distributional assumptions with local variance estimate and no blocking factors. Interestingly, filtering out low count genes, even up to a total count of 10,000 (average count 435), had only a minor impact on the distributions (data not shown). The GOF statistics for genes averaging fewer than 5 counts per subject were distributed throughout the range of GOF statistics (red points on Figure


**Understanding causes of poor model fit. A**) GOF statistics for genes with an average count per subject <5 are shown in red on the QQ plot from NB, locally estimated variance with no blocking factor models. **B**) Dot plot demonstrating the large variance in a gene with an extremely large GOF statistic.

Discussion

We set out to evaluate technical, biological, and experimental variation in gene expression measured by mRNA-Seq counts from n = 25 subjects. Technical variation in these data was generally in keeping with Poisson distributional assumptions, as has been reported by some

Over-dispersion, i.e., variance larger than the mean, is a common phenomenon in observed count data. Under the NB variance function Var(y) = μ + φμ^{2}, the square root of φ corresponds to the subject-to-subject coefficient of variation (CV) of the gamma distributed Poisson mean parameters. The CV corresponding to the global estimate of φ for the data presented herein is 36%. In genetically similar rats, we have observed a CV of approximately 9% for technical variation and 22% for biological replicates (data not shown). This is in keeping with the expectation that biological variation is larger than technical variation, and that human variation is larger than variation in genetically inbred animal models.
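To make the scale of these CVs concrete, the short sketch below converts a CV into the implied dispersion (φ = CV^{2}) and reports how much larger the NB variance is than the Poisson variance at a few mean counts. The function names and the chosen mean counts are illustrative assumptions, and treating the 22% rat CV as the square root of an NB dispersion is our simplification rather than something stated in the text.

```python
def phi_from_cv(cv):
    """Dispersion implied by a subject-to-subject CV, since CV = sqrt(phi)."""
    return cv ** 2


def variance_inflation(mu, phi):
    """Ratio of NB variance (mu + phi * mu^2) to Poisson variance (mu)."""
    return (mu + phi * mu ** 2) / mu


phi_human = phi_from_cv(0.36)  # global estimate reported for these data
phi_rat = phi_from_cv(0.22)    # assumption: rat biological CV used as sqrt(phi)

for mu in (10, 100, 1000):     # illustrative mean counts, not study values
    print(
        mu,
        round(variance_inflation(mu, phi_human), 2),
        round(variance_inflation(mu, phi_rat), 2),
    )
```

Because the inflation factor equals 1 + φμ, the departure from Poisson variance is modest for low-count genes and grows linearly with the mean, reaching roughly 14-fold at a mean of 100 counts for a CV of 36%.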

There are several potential sources of experimental variation in these studies including PCR library preparation batch, flow cell, lane, and for large studies, machine and software upgrades. Some effects can be addressed through normalization while others must be modeled directly. We found that normalization strategy had little impact on GOF statistics. Bullard et al.

Bullard et al. found experimental variation due to library PCR preparation batch to be greater than that due to flow cell

Estimation of the over-dispersion parameter has been the subject of much research; see for example the summary in

Our study had several strengths and limitations. We report on technical and biological variation in mRNA-Seq data from a relatively large sample set and provide insight into biological variation that will be useful to many researchers. The data herein were from true human biological replicates rather than contrived cell line replicates. We expect that the conclusion of a quadratic mean-variance relationship for biological replication will extend to other experimental settings. However, the precise estimates of over-dispersion are expected to be study-specific; it is plausible they would be smaller in inbred mouse models, and smaller yet in cell lines. The present study was not designed to assess bias in estimating gene expression, fold-changes, or the sensitivity and specificity of fold-change detection. While the modeling strategy utilized does not account for the degrees of freedom used in estimating the dispersion parameter, thousands of data points are used in this estimation, so the estimate should be very stable and precise.

Conclusions

We found that the within-gene variance structure is over-dispersed relative to the Poisson distribution. As a result, hypothesis tests based on Poisson distributional assumptions will be too liberal (reject more often than they should) and power estimates based on these assumptions will be over-estimated for most genes, i.e., sample sizes will be too small for the estimated power. Local estimates

Competing interests

The authors wish to declare that GAP is the chair of a data safety monitoring board for Merck for non-rubella novel vaccines in clinical trials.

Authors’ contributions

ALO helped design the sequencing study, conceived the statistical questions, directed the statistical work and wrote the manuscript. BMB helped refine the study questions, performed all computing and participated in critical interpretation of the data. DEG helped design the sequencing study, helped direct the statistical computing, helped refine the study questions and participated in critical interpretation of the data. GAP helped design the sequencing study. TMT participated in critical interpretation of the data. All authors read and approved the final manuscript.

Acknowledgments

We are grateful for the contributions of Dr. Iana Haralambieva, Mr. Sumit Middha and Mr. Bruce Eckloff, and for the assistance of Ms. Rhonda Larsen with manuscript preparation. This work was supported by the National Institutes of Health (NIH) [N01-AI40065, AI48793, AI-89859] and by the National Center for Research Resources (NCRR) [1 UL1 RR024150], a component of the National Institutes of Health, and the NIH Roadmap for Medical Research. Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH.