Abstract
Background
Illumina BeadArray technology includes non specific negative control features that allow a precise estimation of the background noise. As an alternative to the background subtraction proposed in BeadStudio which leads to an important loss of information by generating negative values, a background correction method modeling the observed intensities as the sum of the exponentially distributed signal and normally distributed noise has been developed. Nevertheless, Wang and Ye (2012) display a kernelbased estimator of the signal distribution on Illumina BeadArrays and suggest that a gamma distribution would represent a better modeling of the signal density. Hence, the normalexponential modeling may not be appropriate for Illumina data and background corrections derived from this model may lead to wrong estimation.
Results
We propose a more flexible modeling based on a gamma distributed signal and a normal distributed background noise and develop the associated background correction, implemented in the Rpackage NormalGamma. Our model proves to be markedly more accurate to model Illumina BeadArrays: on the one hand, it is shown on two types of Illumina BeadChips that this model offers a more correct fit of the observed intensities. On the other hand, the comparison of the operating characteristics of several background correction procedures on spikein and on normalgamma simulated data shows high similarities, reinforcing the validation of the normalgamma modeling. The performance of the background corrections based on the normalgamma and normalexponential models are compared on two dilution data sets, through testing procedures which represent various experimental designs. Surprisingly, we observe that the implementation of a more accurate parametrisation in the modelbased background correction does not increase the sensitivity. These results may be explained by the operating characteristics of the estimators: the normalgamma background correction offers an improvement in terms of bias, but at the cost of a loss in precision.
Conclusions
This paper addresses the lack of fit of the usual normalexponential model by proposing a more flexible parametrisation of the signal distribution as well as the associated background correction. This new model proves to be considerably more accurate for Illumina microarrays, but the improvement in terms of modeling does not lead to a higher sensitivity in differential analysis. Nevertheless, this realistic modeling makes way for future investigations, in particular to examine the characteristics of preprocessing strategies.
Background
Illumina BeadArray platform is a microarray technology offering highly replicable measurements of gene expression in a biological sample. Each probe is measured on average of thirty to sixty beads randomly distributed on the surface of the array, avoiding spatial artifacts and the reported probe intensity is the robust mean of the bead measurements. Fluorescence intensity measured on each bead is subject to several sources of noise (nonspecific binding, optical noise, …). Thus the intensities produced by the microarray require a background correction in order to account measurement error. For that purpose, Illumina microarray design includes a set of non specific negative control probes which provides an estimate of the background noise distribution.
In genomewide microarrays, the observed intensity of a probe is usually modeled as the sum of a signal and a background noise. Namely, let X be the observed intensity of a given probe, we assume that
where S is the true signal which counts for the abundance of the probe complementary sequence in the target sample and is independent of the background noise B. Only X is observed but the quantity of interest is the signal S. Therefore, a background correction adjusting the effect of noise on the true signal is necessary to enhance the biological validity of the results. In this context the knowledge of both signal and noise distributions provides a background correction procedure: the signal S is estimated by the conditional expectation of S given the observation X=x and given the distributions of B and S. Under parametric assumptions on B and S, the problem is limited to the estimation of the parameters. Besides, in many experimental contexts involving measurement error the normal distribution of the noise is assumed. Specific arguments for microarray data find their origins in analytical chemistry (see e.g. [1]).
Background correction of Affymetrix and twocolor microarray data has been widely developed in literature (see [2] for a review and a comparison). Irizarry et al[3] proposed a parametric model for Affymetrix based on a exponential distribution of the signal, called normexp model. Several estimation procedures have been developed for this model. The first parameter estimation, still popular today, is the Robust Multiarray Average (RMA) procedure. Maximum Likelihood Estimation (MLE), incorporating the negative controls, has been later proposed and is considered to be more sensitive to the true parameter values (see [4]). These procedures can be found in Bioconductor^{a} packages including limma[5].
Illumina design differs from those of Affymetrix and twocolor microarrays by including a set of negative probes which do not specifically target any regular probe. Aside from non specific hybridization, these negative probes do not hybridize and then have signals close to zero. Thus their observed intensity is X=B. As all probes from a given array correspond to the same biological sample and are subject to the same technical steps during the analysis process, the noise is generally assumed identically distributed on an array and the negative probes provide a sample from its distribution.
The background correction implemented in Illumina BeadStudio software is the subtraction of the estimated mean of the negative probe distribution. However, it creates a large amount of probes with negative intensities unusable in further analysis. The deletion of these probes is considered in some studies as an opportunity to gain statistical power when the number of strongly differentially expressed genes is large, but it can lead to an important loss of information. Ding et al[6] illustrate this phenomenon in their mice leukemia study: a large amount of corrected values are negative only in one group suggesting that the corresponding probes have discriminating ability. This issue is confirmed by Dunning et al[7] on spikein data.
To avoid this problem, parametric models have been used on Illumina data with parameter estimations taking into account the specific design of Illumina microarrays. In this context, the normexp model has been first adapted. Ding et al[6] use the Maximum Likelihood Estimation (MLE) based on a MonteCarlo Markov chain approximation and compare their method to an Illuminaadapted RMA procedure using an ad hoc rule of thumbs to estimate the parameters. Xie et al[8] go into details in normexp method comparison on experimental and simulated data. Lin et al[9] present a variance stabilizing transformation (VST) on a model involving both additive and multiplicative noises, which simultaneously denoise and transform the data. Replacing the classical logtransformation, VST produces less directly interpretable results and tends to produces very small fold changes, as underlined by Shi et al[10] who propose an original approach to compare methods offering different biasprecision tradeoff by aligning the innate offset generated by each preprocessing strategy. They conclude in favor of the normexp model with robust ’nonparametric’ parameters associated to a quantilenormalization using control and regular probes. Besides, Chen et al[11] propose a gamma parametrisation of the background noise distribution to handle with the departure from normality observed on negative probes, associated with an exponential distribution of the signal. They emphasize that gammaexponential model can provide an improvement in terms of differential analysis, but might not be an adequate parametrisation in some cases.
The spread of each background correction among the Illumina users is hard to evaluate since many authors do not mention precisely the preprocessing steps performed in their study. Nevertheless, the normexp model that will be especially examined in this paper is included in several widely used packages such as lumi and limma of Bioconductor^{a}.
Despite its popularity the normexp model does not properly fit Illumina microarray data. This issue was raised by Wang and Ye [12], who estimate the density of the signal on an Illumina microarray with a kernelbased deconvolution procedure. The shape of the estimated signal density does not present the characteristics of an exponential distribution and a gamma modeling seems more appropriate. We confirm these findings by implementing the kernelbased estimator by Wang and Wang [13] available in the R package decon. (The results are displayed in Additional file 1, Section 1). The signal density estimate exhibits a heavy tail which can not be fitted by an exponential distribution density. Nevertheless, kernelbased density estimators does not appear efficient to recover breakpoints in the density, and presents instabilities, which limits the interpretation of the signal density estimate in the microarray context. In this paper, we emphasize that the normalexponential model is not flexible enough to model the signalnoise decomposition on Illumina microarrays by showing that the distance between the reconstructed density from the estimated parameters and the distribution of the observed intensities is large.
Additional file 1. Supplementary Material 1 provides a description of the simulations, computing details and additional figures.
Format: PDF Size: 543KB Download file
This file can be viewed with: Adobe Acrobat Reader
We propose an alternative model thereafter called “normalgamma model” which addresses this lack of fit. In our model, the normal noise distribution is assumed and the signal on one array is assumed to be gamma distributed. As the exponential distribution is a special case of the gamma distribution, this model extends the normexp model. The potential of such generalization was already suggested by Xi et al[8] in their discussion. We derive the necessary estimation procedure by likelihood maximization. The good quality of fit is attested on two types of Illumina microarrays. The associated background correction is compared to methods based on the normexp model in terms of quality of estimation of the signal and checked for robustness on simulated data. The characteristics of the background correction procedures are compared on a set of spikein data, and a parallel is drawn with the same characteristics studied on normalgamma simulated data. Finally, the normexp and normalgamma background corrections are compared on two dilution data sets.
The paper is organized as follows. The experimental and simulated data sets as well as the estimation procedures are presented in Section “Methods”: the notations and the general modelbased background correction formula are gathered in Section “General modelbased background correction formula”; the previous models developed for Illumina microarray background correction, including the normexp model, are summarized in Section “Previous modelings”; Section “A new modeling: the normalgamma model” presents the proposed alternative parametric model built with normal noise and gamma distributed signal, as well as a parametric estimation procedure and its associated background correction. The performances of this new model are evaluated on simulated, spikein and dilution data sets in Section “Results and discussion”. The impact of this more flexible parametrisation on background correction as well as the perspectives for further preprocessing analyses are discussed in Section “Conclusions”. The normalgamma parameter estimation and the associated background correction are implemented in the Rpackage NormalGamma. The scripts used to produce the tables and figures are available in Additional file 2.
Additional file 2. Supplementary Material 2 gathers the scripts used to produce the tables and figures.
Format: TXT Size: 106KB Download file
Methods
Materials
Experimental data sets
● (E_{1})Nowac data[14]. The gene expression profile in peripheral blood from ten controls in the Norwegian Woman And Cancer study has been analysed on Human HT6 v4 Expression BeadChips. The whole probe set including 48,000 bead types has been considered, as well as a restricted set of 25,519 bead types according to Illumina annotation files. Details on laboratory experiments are given in Additional file 1: Section 2.1. The data are provided in Additional file 3.
Additional file 3. Supplementary Material 3 is a zip file which contains three text files with the observed intensities of the ten microarrays from data set (E_{1}).
Format: ZIP Size: 1.2MB Download file
● (E_{2})Leukemia mice data[6]. Total RNA from samples of spleen cells from four mice have been analysed on Mouse6 v1 BeadChips. Experiment description and data are available in [6]
● (E_{3})Spikein data[15]. HumanWG6 v2 BeadChips have been customized to include 34 bead types, refered as ’spikes’, whose corresponding target sequence is absent from the human genome in addition to the 48,000 regular probes. The 34 spikes were introduced at 12 different concentrations (0pm, 0.01pm, 0.03pm, 0.1pm, 0.3pm, 1pm, 3pm, 10pm, 30pm, 100pm, 300pm, 1000pm) in a human biological sample. Each sample corresponding to a spike concentration has been analysed on four arrays. The data are available at http://rafalab.jhsph.edu/.
● (E_{4})MAQC data[16]. Two pure samples, Universal Reference RNA (HBRR) and Human Brain Reference RNA (UHRR) were mixed in four different proportions (100%/0%, 75%/25%, 25%/75%, 0%/100%). Five replications of each sample have been analysed on HumanWG6 v1 BeadChips. The data are available on GEO (access number GSE5350).
● (E_{5})Dilution data[17]. The pure samples UHRR and HBRR were mixed at different proportions (100%/0%, 99%/1%, 95%/5%, 90%/10%, 75%/25%, 50%/50%, 25%/75%, 10%/90%, 0%/100%). Each mixed sample has been analysed with four different starting RNA quantities (250ng, 100ng, 50ng, 10ng). Six HumanWG6 v3 BeadChips were used.
Simulated data sets
For each data set, N=100 random arrays including a vector X^{ℓ}of length n_{reg}=25000 corresponding to the regular probe intensities and a vector X^{0,ℓ}of length n_{neg}=1000 corresponding to the negative probe intensities are generated. The values of the nine parameter sets as well as the details of the simulations are given in Additional file 1: Section 2.2
● (S_{1})Normalgamma and normexp models. For each repetition ℓ=1,…,N, X^{ℓ}is generated as the sum of a gamma and a normaldistributed sample, and X^{0,ℓ}is drawn from a normal distribution. Six sets of parameters are computed from two microarrays in data sets (E_{1}) and (E_{2}), based on normexp and normalgamma models in order to get realistic values (sets 16). The normexp parameters are actually degenerated normalgamma parameters where the shape is equal to 1.
● (S_{2})Mixture noise distribution. A mixture of normal and χ^{2}distributions with different proportions (0, 0.1, 0.25, 0.5, 0.75,1) is considered for the background noise. These distributions model a departure from normality with a heavier right tail for larger values of p. The mixture densities are presented in Additional file 1: Section 5. The signal is generated from a gamma distribution with parameter values from set 1.
● (S_{3})Replicates. We mimic replicate measurements from a biological sample by simulating N arrays with the same signal sample generated from a gamma distribution. The background noise and negative probe intensities are independently drawn from a normal distribution for each array. The values of the parameters are computed from the first array in (E_{3}) (set 7). Replicates from the normalexponential model are drawn in the same way with parameter values estimated on the same array with two normexp estimates (sets 8 and 9).
● (S_{4})Replicates with empirical background noise. Similarly to (S_{3}), the signal drawn from a gamma distribution with parameter values from set 7 is identical on each array. In order to get a realistic noise distribution, the negative probe and background noise intensities are sampled from the global set of quantilenormalised negative probe intensities measured in the experimental data set (E_{3}).
General modelbased background correction formula
Notations
Throughout this article, the background correction is processed on one single array corresponding to one biological sample. For a given probe j, we denote by X_{j} the observed intensity, S_{j} the nonobservable underlying signal and B_{j} its background noise. For a negative control probe, S_{j}is assumed to be 0. Let J and J_{0}be respectively the index of regular and negative probes on the array. We denote by f_{X}, f_{S} and f_{B}the densities of respectively the observed intensity, the unknown signal of interest and the background noise.
We denote by the density of the normal distribution with mean μand variance σ^{2}and by ϕand Φ the density and cumulative distribution function of the normal distribution with mean 0 and variance 1. We denote by the density of the exponential distribution with mean α and the density of the gamma distribution with scale parameter θand shape parameter k. The exponential distribution is a special case with k=1 and θ=α.
Given a parametric density and a procedure of estimation of its parameters, we call plugin density, this density for the estimated parameters.
Modelbased background correction
The model based background correction (BgC) incorporates information from both signal and noise distributions. Under the additive model (1) assuming independence of S and B, f_{X} is the convolution product of f_{S}and f_{B}. For an observed probe intensity x, the signal S is estimated by the conditional expectation of S given the observation X=x and the densities f_{B}and f_{S}(more details can be found in [8]):
Previous modelings
The normal model for negative probes
The design of Illumina BeadArrays provides a sample of the background distribution through the negative probes. We have compared the density histogram of negative probes to the plugin normal density obtained by using robust estimators of the parameters on data sets (E_{1}) and (E_{2}).
The results of this comparison are presented in Additional file 1: Section 3.1. As expected, the empirical distribution is essentially normal but we notice a slightly heavier right tail. It may be interpreted as intensities of wrongly designed negative probes which partially hybridize with some material present in the biological sample.
The normalexponential model
The normexp model is a parametric model for the noisesignal decomposition on one array. We recall it briefly and refer for example to [8] for more details. For every probe j,
where ⊥ denotes the independence between variables. The parameters (μσα) depend on the given array.
For computational reason, the X_{j}’s are usually and often implicitly assumed to be independent. The existence of pathways between genes violates this assumption. Nevertheless, as a small proportion of genes are involved, results are reliable. According to the convolution structure (see Section “Modelbased background correction”), the density of the X_{j}’s is:
where . Denoting Θ = (μ,σ,α), from (2), the background corrected intensity for an observed intensity x is:
Normalexponential model fit
We consider the data sets (E_{1}) and (E_{2}). For each array, the following procedure is implemented:
● Computation of the estimators of the normexp model with the methods described in [8]:
1. Maximum Likelihood Estimation (MLE) using both regular and negative probes,
2. Robust Multiarray Analysis (RMA) adapted from Affymetrix method,
3. NP estimation obtained by the method of moments applied to negative and regular probes,
4. Bayesian estimation. Note that the bayesian estimation results are not presented as they are nearly identical to MLE, as pointed out by Xie et al[8].
● For each parameter estimation method, plot of the plugin density .
● Plot of an irregular density histogram of all regular probe intensities of the array using the Rpackage histogram available on the CRAN with default irregular setting (see [18]). Even though adaptive irregular histograms are not commonly used to describe microarray data, they have been proved to offer a better approximation in a general framework. Moreover, they appear especially relevant to estimate microarray distributions which present high irregularities.
Figure 1 shows the results for this procedure on one array from (E_{1}) after removal of imperfectly designed probes (more arrays are presented in Additional file 1: Section 3.2). Apart from the RMA method, the estimated density does not fit the density histogram and even the RMA estimator is not satisfying from a statistical point of view. One can remark that RMA underestimates the high expressions while the other methods tend to overestimate their contributions.
Figure 1. Normalexponential fit. NormalExponential estimation for one array from (E_{1}) after removal of imperfectly designed probes: irregular density histogram of all regular probe intensities and the plugin normexp density of the regular probes with MLE, RMA and NP parameter estimates.
Besides Xie et al[8] show that the MLE and the NP estimation provide satisfying estimators of the parameters on normexp simulated data. Thus the difference between the histogram of the observed intensities and the plugin density does not come from a poor estimation of the parameters but results from an unsuitable parametric model.
A new modeling: the normalgamma model
The poor fitting of the normexp model shown above, as well as the preliminary observations based on nonparametric estimation procedures, call for a more suitable parametric model for Illumina BeadArrays. According to Section “Previous modelings” the normal assumption for the negative probes appears relevant. We consider the gamma distribution as an extension of the exponential distribution to model the signal intensities. Besides, as a scale mixture of exponential distributions (see [19]), the gamma distribution is a natural generalization which helps to take into account different probe hybridization behaviors which could count for different exponential life times. This defines a more flexible parametric model called the normalgamma model that we propose to apply to Illumina BeadArrays.
The normalgamma model
The normalgamma model is defined as follows. For every probe j:
The parameters (μ,σ,k,θ) depend on the given array. This model offers more flexibility than the normexp model but requires the estimation of one more parameter.
According to the convolution structure (see Section “Modelbased background correction”), the density of X_{j} is the convolution product of the densities of S_{j}and B_{j}, namely:
This density does not have any analytic expression as the normexp density (3). Nevertheless, good and fast numerical approximations can be computed using the Fast Fourier Transform (fft) and tail approximations to ensure stability. Our implementation based on fft is detailed in Additional file 1: Section 7.
Parameter estimation in the normalgamma model
The parameters (μ,σ,k,θ) of the normalgamma distribution are estimated by the Maximum Likelihood Estimator (MLE):
where
is the likelihood from the two sets of observations X={X_{j},j∈J} and X^{0}={X_{j},j∈J_{0}} measured on regular and negative probes, respectively. Thanks to the fftbased approximation of the maximum likelihood estimation can be numerically computed using classical minimization algorithms (see Additional file 1: Section 7).
Background corrected intensity for the normalgamma model
Denoting now Θ = (μ,σ,k,θ), we derive from (2) the back ground corrected intensity for an observed intensity x:
using the equality valid for every s>0. This formula allows fftbased computations for the background correction.
Inference of negative probes from Illumina detection pvalues
Most publicly available data sets do not present the negative probe intensities. Nevertheless, for each regular probe, Illumina provides a detection pvalue equal to the proportion of negative probes which have intensities greater than that probe on a given array. Following the idea from Shi et al[20] we propose to infer the negative probe intensities from the detection pvalues (see details in Additional file 1: Section 8.1). For the normexp and the normalgamma models, the estimates of the parameters and reconstructed signals obtained with the true and inferred negative probe intensities are compared on the ten arrays from (E_{1}). We observe that the error resulting from inference of the negative probe is negligible, with a relative error of order 10^{−3} to 10^{−4} on parameter estimation and 10^{−4}to 10^{−5} on signal estimation (see Additional file 1: Section 8.2).
Results and discussion
Fit on Illumina BeadArray data
Similarly to Section “Previous modelings”, we compare the irregular density histogram of the regular probe intensities with the plugin normexp densities using RMA, MLE, NP methods and the plugin normalgamma density using a Maximum Likelihood Estimate of (μ,σ,k,θ) on the data sets (E_{1}) and (E_{2}). The results, similar along the arrays, are illustrated in Figure 2 on one array from (E_{1}) (more plots are presented in Additional file 1: Section 3.2). We do not add the MLE and NP plugin density estimates for the normexp model which have already been shown not to fit the data.
Figure 2. Normalexponential and normalgamma fit. NormalGamma estimation for one array from (E_{1}) after removal of imperfectly designed probes: irregular density histogram of all regular probe intensities, plugin normexp density with RMA estimate and plugin normalgamma density with MLE estimate.
Thanks to the larger flexibility of the normalgamma model, we observe that the distance between the MLE plugin normalgamma density and the histogram of the intensities is smaller than the corresponding distance using the normexp model with any estimation procedure. This graphical result is confirmed numerically using the L_{1}distance between the histogram and the reconstructed density defined by
where represents one plugin density estimate using either the normexp model or the normalgamma model and represents the irregular density histogram obtained with the Rpackage histogram. Table 1 presents the mean of the relative deviation for the normexp estimators with respect to the deviation for the normalgamma estimator:
where is a normexp estimator of the regular probes density, and is the normalgamma estimator for individual i. The mean is computed over the ten arrays from (E_{1}) (with and without the non specific binding probes) and over the four arrays from (E_{2}).
Table 1. Deviation between reconstructed intensities and observation histogram
The mean absolute deviation is 3 times smaller in favor of the normalgamma density with respect to the normexp density using the RMA estimate, and 4 to 8 with respect to the normexp model using the MLE or NP estimates.
Quality of estimation on simulated data
The quality of estimation of the normalgamma model is assessed on the simulation data set (S_{1}). The first two sets of parameters are non degenerate normalgamma parameters, more realistic for modeling Illumina microarrays as shown in Section “Fit on Illumina BeadArray data”. They are used to evaluate the MLE normalgamma parameter estimation and validate the associated background correction, and to quantify the improvement brought by the new normalgamma background correction. The last four sets are actually degenerate normalgamma parameters where the shape parameter k is set to 1, corresponding to normexp data, which enables to assess the potential loss of precision in parameter estimation brought by a more flexible modeling.
Parameter estimation
For each repetition ℓ=1,…,N we compute the normalgamma MLE of the parameters. Table 2 presents the relative L_{1}error for each parameter β∈{μ,σ,k,θ}:
Table 2. RelativeL_{1}error for each parameter in the normalgamma model using MLE estimates
The parameter estimation is of excellent quality for the gaussian distribution and of good quality for the gamma distribution.
To check wether the introduction of a fourth parameter in our model leads to a loss of precision in the parameter estimation, we compare the relative L_{1}errors of the MLE parameter estimation in the normalgamma and normexp models using the parameter sets 3 to 6, corresponding to normexp data. The results summarized in Table 3 indicate that the quality is unchanged for the variance parameter and that we pay a price of order 2 for μ and θ. Nevertheless, since the relative errors in these cases have order 10^{−2}, this loss is negligible.
Table 3. Error in parameter estimation
Background corrected intensity
We now study the performance of the normalgamma background correction (BgC) obtained in (8) with respect to the existing BgC methods in terms of quality of estimation of the signal on the simulated data set (S_{1}). We compare the following BgC methods, detailed in [8]:
0. Normalgamma BgC in (8) with true parameters,
1. Normalgamma BgC in (8) with MLE parameters,
2. Normalexponential BgC in (4) with MLE parameters (referred to as normexpMLE),
3. Normalexponential BgC in (4) with RMA parameters (referred to as normexpRMA),
4. Normalexponential BgC in (4) with NP parameters (referred to as normexpNP),
5. Background subtraction:
These methods are further denoted by for i=0,…,5. For methods 1 to 4, the BgC is a twostep procedure: the parameters are estimated and then plugged respectively into (8) for method 1, and into (4) for methods 2 to 4. From a practical point of view, as the parameters are unknown, is unavailable. Nevertheless, as the result of a procedure with a perfect first estimation step, it allows a comparison to quantify the performance of the second step.
For each parameter set and for each BgC method , i=0,…,5 we compute the Mean Absolute Deviation (MAD):
where denotes for each simulated array ℓthe estimated parameters corresponding to the used methods with the following conventions: 1/ is the true parameters for i=0; 2/ for i=2,…,4 corresponding to the exponential distribution; 3/ represents the median over X^{0,ℓ}for i=5. Simulation results are summarized in Table 4 for the parameter sets 16. The five first columns correspond to the excess risk ratio:
and the last column indicates the reference risk .
Table 4. Excess risk ratio of background corrected rawscale intensities
The normalgamma BgC provides the same quality when the parameters are known or estimated. This holds when the data are generated either from a normalgamma or a normexp model. NormexpNP shows good behaviors when the data come from a normexp model but has a risk increase of order 60% if the data come from a normalgamma model. NormexpMLE provides good results for normalexponential data but fails when the data come from a normalgamma model. Not surprisingly, as already pointed by Xie et al[8], normexpRMA has a poor behavior. The background subtraction method with a maximal quality loss of 32% offers an acceptable alternative in terms of risk. Indeed, most of the intensities being small, putting them to 0 does not affect significantly the MAD value. Let us recall, however, that from a practical point of view, the major disadvantage of background subtraction is the elimination of a considerable number of probes.
In practical experiments, the data are usually transformed before the analysis. To address this issue, the MAD is computed on logtransformed intensities (see Additional file 1: Section 4 for details), and the excess risk ratio is displayed in Table 5. The normalgamma BgC presents a smaller error of estimation than the methods based on the normexp model. The MAD from normexpMLE generates the highest value, and normexpRMA and normexpNP show a a similar moderate excess risk. Nevertheless the differences between the BgC methods are less pronounced than the one observed at the raw scale. However, with an excess risk ratio between 16% and 33%, the normexp methods notably under perform the BgC based on the normalgamma model in terms of signal estimation.
Table 5. Excess risk ratio of background corrected logtransformed intensities
The MAD computation offers a global comparison of the various BgC methods in terms of signal estimation. We refine this analysis by examining the absolute deviation (AD) of the estimated signal for each signal intensity at the raw and log scales, respectively defined as:
The first row of Figure 3 displays the logarithm of the AD at the raw scale, as a function of the logsignal intensity. We observe that normexpMLE presents a larger AD for all values of the signal. On small intensities, the normalgamma BgC outperforms the other methods, whereas normexpRMA and normexpNP present a smaller deviation on moderate intensities. For high values of the signal, normalgamma shows a smaller error of estimation together with normexpNP.
Figure 3. Absolute deviation of the signal estimation on simulated data. Logarithm of the Absolute Deviation of estimated signal on raw scale (first row), Absolute Deviation of logtransformed estimated signal (second row) and signal logdensity (third row). Normalgamma BgC (purple) and normexp BgC with MLE (pink), RMA (blue) and NP (green) parameters.
The absolute deviation on logtransformed intensities is presented on the second row of Figure 3. The normalgamma BgC still presents the smallest error of estimation on weak intensities, but is outperformed by the other methods on moderate intensities. The four methods present similar AD values on high intensities. Besides, we observe than the error of estimation for all methods increases as the signal become weaker.
Robustness
In Section “Previous modelings”, we have underlined the slightly heavier right tail of the negative probe distribution. To ensure that the estimation remains acceptable under the assumption of an imperfect noise parametrisation, we compare the robustness of the normalgamma method with normexpNP, stated as the most robust by [8] and which we found competitive (see Section “Modelbased background correction”). The errors of estimation computed from the simulation data set (S_{2}) are presented in Additional file 1: Section 5. Both methods are robust with respect to nonnormal noise distribution, and the normalgamma BgC still offers a better quality of estimation than normexpNP when the noise distribution departs from normality.
In conclusion, the normalgamma background correction globally offers a better quality in signal estimation with respect to the normexp methods. Nevertheless, this improvement depends on the scale considered and does not steadily hold over the range of intensities.
Operating characteristics
Beyond the quality of estimation of the signal, the performance of a BgC procedure in practical experiments depends on its characteristics in terms of bias and variance. In this section, we compare the operating characteristics of the normalgamma and normexp BgC both on simulated and spikein data. The results are gathered in Figure 4. The background subtraction leading to probe deletion is not further considered. The data from (E_{3}) are background corrected with the methods 1 to 4 described in Section “Background corrected intensity”. Quantile normalization based on both regular and negative probe intensities is applied, followed by logtransformation. The same procedures are implemented on the simulation data sets (S_{3}) and (S_{4}).
Figure 4. Operating characteristics of the BgC methods on spikein and simulated data. Row 1: average spike intensities (left) and standard deviation of spike replicates (right) for all nonzero spike concentrations. Row 2 to 4: average intensity (left) and standard deviation of replicates (right) as a function of signal intensity. Row 2: normalgamma simulation in data set (S_{3}) (parameter set 7); Row 3: gamma signal and empirical background noise distribution (data set (S_{4})); Row 4 normalexponential simulation in data set (S_{3}) (parameter set 9).
Biasprecision tradeoff
The quality of a preprocessing method in microarray experiments can be characterised by its ability to distinguish between distinct values of the signal. Most of the procedures underestimate the signal foldchanges. This bias in foldchange estimation, called compression, has a negative impact on differential analysis. But the efficiency of a preprocessing method also depends on its precision, characterised by the variations of the corrected intensity for a given value of the signal. The tradeoff between bias and precision is an indicator of the performance of a procedure. This issue can be understood by the example of a ttest statistic for a given probe differentially expressed between two groups: an important compression attenuates the difference of average intensities between the two groups, whereas a poor precision generates a high variance term in the denominator, which reduces the value of the test statistic.
The compression and precision obtained with the four BgC methods on the data set (E_{3}) are presented on the first row of Figure 4. The first column displays the average intensity over the 34 spike bead types for each spike concentration. A similar saturation effect is observed for large concentrations with the four methods: for concentrations larger than 100pm, the relationship between logintensity and logconcentration is not linear. Moreover, as the concentration decreases, a compression of the signal is observed with all the methods, but is significantly less pronounced with the normalgamma BgC.
The second column presents the average standard deviation between replicates over all spike bead types. We observe that the improvement in bias brought by the normalgamma model is at the cost of a poorer precision. More generally, the precision increases with the compression for the four methods.
Innate offset
Shi et al[10] highlight the role played by the ”innate offset”, defined as the typical intensity assigned to the nonexpressed genes by a preprocessing procedure, in the unbalanced biasvariance tradeoff of the various BgC methods: the strategies which show the smallest innate offset usually generate less bias but present a poorer precision. On spikein data sets, the innate offset is defined as the mean of the intensities measured on spike probes with concentration zero. The results displayed in Table 6 confirm the observations from Shi et al[10]: as underlined above, the normalgamma BgC exhibits the smallest precision associated with the smallest bias with a slope from the linear regression of intensities on logconcentrations close to 1. The largest offset combined with the highest precision and the largest bias are observed for normexpMLE.
Table 6. Innate offset and operating characteristics
Shi et al[10] propose to compare the preprocessing methods in a more equal way by adding an offset to the backgroundcorrected quantilenormalised intensities before logtransformation, in order to align the innate offsets of the various preprocessing strategies. Our results presented in Additional file 1: Section 6.1, indicate that the characteristics between the four BgC present more similarity after equalizing the innate offsets, but a slight difference remains between normexpMLE and the other BgC methods on small intensities. Nevertheless, prior to the offset equalisation, the methods studied in this paper do not present the large range of biasprecision tradeoffs observed in the preprocessing strategies considered in [10]. In this context, the equalisation of the innate offsets does not appear sufficient to completely erase the differences between BgC methods.
Operating characteristics on simulated data
In order to reinforce the validation of the normalgamma parametrisation for the noisesignal distribution, we compare the operating characteristics obtained on spikein data to the ones provided by the normalgamma simulated data from set (S_{3}). The spike concentration, used as references to assess the bias and precision of the procedures on spikein data, is replaced by the true value of the signal. The results are displayed on the second row of Figure 4. The first column presents the average intensity as a function of the signal logintensity, and the standard deviation of the replicates is shown in the second column. The trends are very similar to the ones observed on spikein data, with a small difference for the variance with normexpRMA. Besides, we observe that the compression in small intensities generated by the four BgC methods is purely a statistical effect. However, the signal attenuation in high intensities observed on spikein data is not present on simulated data. Indeed, it has already been suggested that this phenomenon could come from saturation in light intensity on microarrays.
Furthermore, we address the departure from normality observed on the negative probe distribution by simulating microarrays with a gamma distributed signal and a nonnormal background noise (data set (S_{4})). In order to get a realistic noise distribution, the background noise and the negative probe intensities are sampled from the quantilenormalised negative probe intensities from all arrays in (E_{3}) (see details in Additional file 1: Section 2.2). The operating characteristics of the four BgC methods are presented on the third row of Figure 4. We observe that the slight difference between normalgamma simulations and spikein data with normexpRMA, observed on the second row of Figure 4, is partially corrected by generating a nonnormal background noise.
The same quantities are computed based on normalexponential simulated data with parameter sets 8 and 9. The results are displayed on the fourth row of Figure 4 for set 9, and on Figure G in Additional file 1 for set 8. The comparison between the operating characteristics of the four BgC methods are absolutely not consistent with the observations from the spikein data. In particular, the normalexponential simulated data provide almost identical bias and precision curves for normalgamma, normexpNP and normexpMLE, whereas these methods exhibit notable differences on spikein data.
The parallel drawn between the operating characteristics of the four BgC methods on spikein and simulated data confirms that the gamma model represents a much more accurate parametrisation for the signal distribution than the usual exponential model.
Differential expression analysis
The BgC methods are compared from a practical point of view through a differential expression analysis performed on the dilution data set (E_{4}), based on the hierarchical linear model approach from Smyth [21] implemented in the limma package. This procedure provides pvalues from a moderated tstatistic. A first analysis is run on the two pure samples (proportions 100%/0% and 0%/100%) to define the ”true” differentially expressed (DE) and nondifferentially expressed probes. A second differential expression analysis performed on the two mixed samples (proportion 75%/25% and 25%/75%) is used to assess the performance of the BgC methods. The moderated ttest statistic implemented in the limma package includes a variance term representing the variation of the gene intensity across all arrays, as well as hyperparameters computed from the whole data set intensities. Therefore, in order to get independent results, the two analyses are performed on separate linear models.
The estimate proportion of DE probes in pure samples computed with a convex decreasing density procedure [22] is 28% for all methods. Thus, in order to be conservative, we define the probes with the 20% smallest pvalues as ”true DE”, and the probes with the 40% highest pvalues as ”true nonDE”. Moderated tstatistic values are then computed from the comparison of the two mixed samples. The pvalues from true DE and nonDE probes are ordered. The area under the ROC curve (AUC) is used to quantify the sensitivity of each BgC method, the largest value of the AUC corresponding to the highest sensitivity. The four methods present similar AUC values but the normalgamma BgC is slightly less competitive.
A similar analysis is run with the addition of an offset prior to logtransformation. Figure 5 displays the AUC values as a function of the added offset with each BgC method. The values observed for an offset equal to zero correspond to the sensitivity when a simple logtransformation is applied. As already highlighted by Shi et al[10], we observe that the addition of a moderate offset increases the sensitivity of all BgC methods. For any value of the offset, the normexp methods outperform the normalgamma. The results obtained with the different normexp BgC methods are very similar, but we note that the highest sensitivity is achieved with normexpRMA for offsets smaller than 50, and with normexpMLE for offsets larger than 50.
Figure 5. AUC as a function of added offset. AUC from moderated ttest for mixed sample differential analysis in data set (E_{4}) (proportion 25%/75% and 75%/25%) for different values of offset.
The BgC methods can also be compared regarding their ability to order a set of measured intensities corresponding to increasing or decreasing probe concentrations. This framework can refer, for example, to a longitudinal study where the gene expression is repeatedly measured at different times. The correlation between the mixture proportion and the intensity is analysed on the dilution data set (E_{5}). For the true DE probes, the intensity is expected to be increasing or decreasing with the proportion.
The dilution data sets (E_{4}) and (E_{5}) are based on the same pure biological samples. Therefore, true DE and nonDE probes defined on (E_{4}) can be considered in the analysis of the data from (E_{5}). The BeadChips used in experiments (E_{4}) and (E_{5}) are different, but some bead types are present on both devices. By mapping the annotation files from both BeadChips, the sets of probes respectively defined as DE and nonDE on (E_{4}), and present on (E_{5}) are extracted.
For each probe, the Spearman correlation coefficient is computed between the vector of mixture proportions and the observed intensities. This provides a test statistitic based on the ranking of the background corrected intensities, which allows a comparison of the BgC methods independently from the scale at which the data are analysed, provided that the transformation applied to the data is increasing. In particular, the results are not affected by the addition of an offset. The correlation coefficient is computed separately on microarrays with starting RNA quantities 250ng, 100ng, 50ng and 10ng. The coefficient is expected to be close to 1 in absolute values for the DE probes, and close to 0 for the nonDE. The probes are ranked according to their correlation coefficient value, and the resulting AUCs for each starting RNA quantity are displayed in Table 7. We observe that the normalgamma BgC is slightly but steadily less sensitive that the other methods. The AUCs observed with the different normexp estimates are very similar and do not allow to assess the superiority of one method over the others.
Table 7. AUC from Spearman correlation test
Conclusions
In many microarray experiments, background noise correction is an important issue in order to improve the measurement precision. Modelbased background correction procedures have been developed as an alternative to the default background subtraction from Illumina BeadStudio which has proved to remove informative probes. The usual normalexponential model considered for the noisesignal distribution has already been pointed out as inappropriate for Illumina BeadArrays [12]. Our observations confirm this result by highlighting the poor fitting of the normexp reconstructed densities on observed intensities with three different parameter estimates. We propose an alternative model based on a more flexible parametrisation of the signal which is assumed to follow a gamma distribution, as well as the associated background correction. The reconstructed density offers a better fit of the distribution of the observed intensities, validating this new model as more appropriate for Illumina microarrays. Moreover, the estimators based on the normalgamma model are likely to apply to other microarray technologies including Affymetrix and single color Agilent, as an extension of the normexp model.
We compare the performance of the background correction procedures based on the normalgamma and normalexponential models on simulated and experimental data sets. Our simulation study indicates that the normalgamma model brings an overall improvement in terms of signal estimation, characterised by a smaller average difference between the true signal and the background corrected intensity. But surprisingly, the differential expression analysis run on two dilution data sets shows that the improvement in terms of parametrisation does not have a positive impact on practical experiments, the normalgamma correction exhibiting a slightly poorer sensitivity than the normexp methods. This result may be explained in two ways.
On one side, the operating characteristics of the background correction procedures are compared on a set of spikein data, which allow to connect the probe intensity with the concentration of the target gene in the biological sample. We note that the normalgamma model generates less bias than the normexp methods, but at the cost of a loss in precision. With the addition of an offset prior to the logtransformation, which provides balance in the biasprecision tradeoff of the different methods, the operating characteristics appear similar, suggesting comparable performance.
On the other side, we examine the error in signal estimation as a function of the signal on logscale simulated data. The normalgamma model outperforms the other methods on small intensities, but is less competitive on moderate intensities. Due to the marked compression of the recovered intensity when the signal decreases, the improvement in terms of signal estimation for the small intensities has a weak effect on the differential expression analysis. Thus, the smaller average error of estimation observed with the normalgamma background correction does not result in a higher sensitivity in practical experiments.
Besides, the parallel drawn between the operating characteristics of the different background corrections obtained, on the one hand with spikein data and on the other hand with normalgamma simulated data, highlights high similarities. The simulations from the normalgamma model recover subtile differences between background correction procedures, whereas simulations from the normexp model totally fail to reproduce the trends observed on spikein data. These considerations enhance the validation of the normalgamma model for Illumina microarrays, and illustrate the potential of the normalgamma simulations for the comparison of preprocessing procedures. Furthermore, the similarities between the observations from spikein and simulated data are increased by sampling the background noise from the empirical negative probe distribution which suggests that an improvement in modeling could be brought by a nonnormal parametrisation of the background noise.
In conclusion, this paper addresses the lack of fit of the usual normalexponential model by proposing a more flexible parametrisation of the signal distribution as well as the associated background correction. This new model proves to be considerably more accurate for Illumina microarrays, but our results indicate that the improvement in terms of modeling does not lead to a higher sensitivity in differential analysis. Nevertheless, this realistic modeling makes way for future investigations, in particular to examine the characteristics of preprocessing strategies.
Endnote
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
The NOWAC data were provided by EL, Principal Investigator of TICE project. Statistical and computational aspects were developed by SP and YR. All authors read and approved the final manuscript.
Acknowledgements
The authors thank Gregory Nuel for fruitful discussions on numerical issue.
Fundings
Grant: ERC2008AdG 232997TICE “Transcriptomics in cancer epidemiology”.
References

Rocke DM, Durbin B: A model for measurement error for gene expression arrays.
J Comput Biol 2001, 8(6):557569. PubMed Abstract  Publisher Full Text

Ritchie ME, Silver J, Oshlack A, Holmes M, Diyagama D, Holloway A, Smyth GK: A comparison of background correction methods for twocolour microarrays.
Bioinformatics 2007, 23(20):27002707. PubMed Abstract  Publisher Full Text

Irizarry RA, Hobbs B, Collin F, BeazerBarclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data.
Biostatistics 2003, 4(2):249264. PubMed Abstract  Publisher Full Text

Silver JD, Ritchie ME, Smyth GK: Microarray background correction: maximum likelihood estimation for the normalexponential convolution.
Biostatistics 2009, 10(2):352363. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Smyth G: Limma: linear models for microarray data. In Bioinformatics and Computational Biology Solutions using R and Bioconductor. Edited by Gentleman R. New York: Springer; 2005:397420.

Ding LH, Xie Y, Park S, Xiao G, Story MD: Enhanced identification and biological validation of differential gene expression via Illumina wholegenome expression arrays through the use of the modelbased background correction methodology.
Nucleic Acids Res 2008, 36(10):e58. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dunning MJ, BarbosaMorais NL, Lynch AG, Tavaré S, Ritchie ME: Statistical issues in the analysis of Illumina data.
BMC Bioinformatics 2008, 9:85. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Xie Y, Wang X, Story M: Statistical methods of background correction for Illumina BeadArray data.
Bioinformatics 2009, 25(6):751757. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lin SM, Du P, Huber W, Kibbe WA: Modelbased variancestabilizing transformation for Illumina microarray data.
Nucleic Acids Res 2008, 36(2):e11. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Shi W, Oshlack A, Smyth GK: Optimizing the noise versus bias tradeoff for Illumina whole genome expression BeadChips.
Nucleic Acids Res 2010, 38(22):e204. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chen M, Xie Y, Story M: An exponentialgamma convolution model for background correction of Illumina BeadArray data.
Commun Stat Theory Methods 2011, 40(17):30553069. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wang XF, Ye D: The effects of error magnitude and bandwidth selection for deconvolution with unknown error distribution.
J Nonparametric Stat 2012, 24:153167. Publisher Full Text

Wang XF, Wang B: Deconvolution estimation in measurement error models: the R package decon.

Lund E, Dumeaux V, Braaten T, Hjartåker A, Engeset D, Skeie G, Kumle M: Cohort profile: the Norwegian women and cancer study–NOWAC–Kvinner og kreft.
Int J Epidemiol 2008, 37:3641. PubMed Abstract  Publisher Full Text

McCall MNN, Irizarry RAA: Consolidated strategy for the analysis of microarray spikein data.

The MicroArray Quality Control (MAQC) project shows inter and intraplatform reproducibility of gene expression measurements

Lynch AG, Hadfield J, Dunning MJ, Osborne M, Thorne NP, Tavaré S: The cost of reducing starting RNA quantity for Illumina BeadArrays: a beadlevel dilution experiment.
BMC Genomics 2010, 11:540. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Rozenholc Y, Mildenberger T, Gather U: Constructing irregular histograms by penalized likelihood.
Comput Stat Data Anal 2010, 54(12):33133323. Publisher Full Text

Gleser LJ: The gamma distribution as a mixture of exponential distributions.

Shi W, de Graaf CA, Kinkel SA, Achtman AH, Baldwin T, Schofield L, Scott HS, Hilton DJ, Smyth GK: Estimating the proportion of microarray probes expressed in an RNA sample.
Nucleic Acids Res 2010, 38(7):21682176. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments.

Langaas M, Lindqvist BH, Ferkingstad E: Estimating the proportion of true null hypotheses, with application to DNA microarray data.
J R Stat Soc Ser B 2005, 67(4):555572. Publisher Full Text