Department of Community Medicine, Faculty of Health Sciences, University of Tromsø, 9037 Tromsø, Norway

Department of Applied Mathematics, MAP5, 45 rue des Saints-Pères, University Paris Descartes, 75006 Paris, France

Abstract

Background

Illumina BeadArray technology includes non-specific negative control features that allow a precise estimation of the background noise. As an alternative to the background subtraction proposed in BeadStudio, which generates negative values and thereby discards a substantial amount of information, a background correction method modeling the observed intensities as the sum of an exponentially distributed signal and a normally distributed noise has been developed. Nevertheless, Wang and Ye (2012) present a kernel-based estimator of the signal distribution on Illumina BeadArrays and suggest that a gamma distribution would provide a better model of the signal density. Hence, the normal-exponential model may not be appropriate for Illumina data, and background corrections derived from this model may lead to inaccurate estimates.

Results

We propose a more flexible modeling based on a gamma distributed signal and a normal distributed background noise and develop the associated background correction, implemented in the R-package

Conclusions

This paper addresses the lack of fit of the usual normal-exponential model by proposing a more flexible parametrisation of the signal distribution as well as the associated background correction. This new model proves to be considerably more accurate for Illumina microarrays, but the improvement in terms of modeling does not lead to a higher sensitivity in differential analysis. Nevertheless, this realistic modeling makes way for future investigations, in particular to examine the characteristics of pre-processing strategies.

Background

The Illumina BeadArray platform is a microarray technology offering highly replicable measurements of gene expression in a biological sample. Each probe is measured on an average of thirty to sixty beads randomly distributed over the surface of the array, which avoids spatial artifacts, and the reported probe intensity is the robust mean of the bead measurements. The fluorescence intensity measured on each bead is subject to several sources of noise (non-specific binding, optical noise, …). Thus the intensities produced by the microarray require a background correction in order to account for measurement error. For that purpose, the Illumina microarray design includes a set of non-specific negative control probes which provides an estimate of the background noise distribution.

In genome-wide microarrays, the observed intensity of a probe is usually modeled as the sum of a signal and a background noise. Namely, let

X_j = S_j + B_j   (1)

where X_j denotes the observed intensity of probe j, S_j the non-observable signal of interest and B_j the background noise.

Background correction of Affymetrix and two-color microarray data has been widely developed in the literature (see^{a} packages including

The Illumina design differs from those of Affymetrix and two-color microarrays by including a set of negative probes which do not specifically target any sequence. Apart from non-specific hybridization, these negative probes do not hybridize and thus have signals close to zero. Their observed intensity is therefore composed of background noise only.

The background correction implemented in the Illumina BeadStudio software is the subtraction of the estimated mean of the negative probe distribution. However, it creates a large number of probes with negative intensities, which are unusable in further analysis. The deletion of these probes is considered in some studies as an opportunity to gain statistical power when the number of strongly differentially expressed genes is large, but it can lead to an important loss of information. Ding

To avoid this problem, parametric models have been used on Illumina data with parameter estimations taking into account the specific design of Illumina microarrays. In this context, the normexp model has been first adapted. Ding

How widely each background correction is used among Illumina users is hard to evaluate, since many authors do not precisely report the pre-processing steps performed in their study. Nevertheless, the normexp model, which will be specifically examined in this paper, is included in several widely used packages such as^{a}.

Despite its popularity the normexp model does not properly fit Illumina microarray data. This issue was raised by Wang and Ye

**Supplementary Material 1 provides a description of the simulations, computing details and additional figures.**


We propose an alternative model, thereafter called the “normal-gamma” model.

The paper is organized as follows. The experimental and simulated data sets as well as the estimation procedures are presented in Section “Methods”: the notations and the general model-based background correction formula are gathered in Section “General model-based background correction formula”; the previous models developed for Illumina microarray background correction, including the normexp model, are summarized in Section “Previous modelings”; Section “A new modeling: the normal-gamma model” presents the proposed alternative parametric model built with normal noise and gamma distributed signal, as well as a parametric estimation procedure and its associated background correction. The performances of this new model are evaluated on simulated, spike-in and dilution data sets in Section “Results and discussion”. The impact of this more flexible parametrisation on background correction as well as the perspectives for further pre-processing analyses are discussed in Section “Conclusions”. The normal-gamma parameter estimation and the associated background correction are implemented in the R-package

**Supplementary Material 2 gathers the scripts used to produce the tables and figures.**


Methods

Materials

Experimental data sets

● (E1) **Nowac data**

**Supplementary Material 3 is a zip file which contains three text files with the observed intensities of the ten microarrays from data set (E1).**


● (E2) **Leukemia mice data**

● (E3) **Spike-in data**

● (E4) **MAQC data**

● (E5) **Dilution data**

Simulated data sets

For each data set and each repetition ℓ, a vector **X**^{ℓ} of length n_reg = 25000 corresponding to the regular probe intensities and a vector **X**^{0,ℓ} of length n_neg = 1000 corresponding to the negative probe intensities are generated. The values of the nine parameter sets as well as the details of the simulations are given in Additional file

● (S1) **Normal-gamma and normexp models**. For each repetition ℓ, **X**^{ℓ} is generated as the sum of a gamma and a normal-distributed sample, and **X**^{0,ℓ} is drawn from a normal distribution. Six sets of parameters are computed from two microarrays in data sets (E1) and (E2), based on the normexp and normal-gamma models in order to get realistic values (sets 1-6). The normexp parameters are actually degenerate normal-gamma parameters where the shape is equal to 1.

● (S2) **Mixture noise distribution**. A mixture of normal and χ² distributions with different proportions (0, 0.1, 0.25, 0.5, 0.75, 1) is considered for the background noise. These distributions model a departure from normality, with a heavier right tail for larger values of the mixture proportion.

● (S3) **Replicates**. We mimic replicate measurements from a biological sample by simulating arrays which share the same gamma-distributed signal but have independent noise, with parameter values estimated on one array from data set (E3) (set 7). Replicates from the normal-exponential model are drawn in the same way with parameter values estimated on the same array with two normexp estimates (sets 8 and 9).

● (S4) **Replicates with empirical background noise**. Similarly to (S3), the signal drawn from a gamma distribution with parameter values from set 7 is identical on each array. In order to get a realistic noise distribution, the negative probe and background noise intensities are sampled from the global set of quantile-normalised negative probe intensities measured in the experimental data set (E3).
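As a rough illustration of the (S1) scheme, one array can be drawn as follows. This is a Python sketch (the paper's own implementation is an R package), and the parameter values are arbitrary placeholders, not the paper's sets 1-6:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder parameter values (illustrative only, not the paper's sets 1-6)
mu, sigma = 400.0, 60.0   # normal background noise: mean, sd
k, theta = 0.8, 300.0     # gamma-distributed signal: shape, scale
n_reg, n_neg = 25000, 1000

# Regular probes: observed intensity = gamma signal + normal noise
signal = rng.gamma(shape=k, scale=theta, size=n_reg)
noise = rng.normal(loc=mu, scale=sigma, size=n_reg)
X = signal + noise

# Negative control probes carry background noise only (signal assumed 0)
X0 = rng.normal(loc=mu, scale=sigma, size=n_neg)
```

Setting the shape k to 1 recovers a normal-exponential array, which is how the degenerate normexp sets are obtained.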

General model-based background correction formula

Notations

Throughout this article, the background correction is processed on one single array corresponding to one biological sample. For a given probe j, we denote by X_j the observed intensity, S_j the non-observable underlying signal and B_j its background noise. For a negative control probe, S_j is assumed to be 0. Let J and J_0 be respectively the index sets of the regular and negative probes on the array. We denote by f_X, f_S and f_B the densities of respectively the observed intensity, the unknown signal of interest and the background noise.

We denote by φ( · ; μ, σ²) the density of the normal distribution with mean μ and variance σ², and by Φ the standard normal cumulative distribution function.

Given a parametric density and a procedure of estimation of its parameters, we call

Model-based background correction

The model-based background correction (BgC) incorporates information from both the signal and the noise distributions. Under the additive model (1), assuming independence of S_j and B_j, the density f_X is the convolution product of f_S and f_B. For an observed probe intensity x, the background corrected intensity is defined as the conditional expectation of the signal given the observation, E[S | X = x], which is computed from f_B and f_S (more details can be found in

Previous modelings

The normal model for negative probes

The design of Illumina BeadArrays provides a sample of the background distribution through the negative probes. We have compared the density histogram of the negative probes to the fitted normal density on the arrays of data sets (E1) and (E2).

The results of this comparison are presented in Additional file

The normal-exponential model

The normexp model is a parametric model for the noise-signal decomposition on one array. We recall it briefly. For every probe j,

S_j ∼ Exp(θ),  B_j ∼ N(μ, σ²),  S_j ⊥ B_j,   (2)

where ⊥ denotes the independence between variables and the exponential distribution is parametrised by its mean θ. The parameters (μ, σ, θ) are estimated from the observed intensities.

For computational reasons, the X_j's are usually, and often implicitly, assumed to be independent. The existence of pathways between genes violates this assumption; nevertheless, as only a small proportion of genes is involved, results remain reliable. According to the convolution structure (see Section “Model-based background correction”), the density of the X_j's is:

f_X(x) = (1/θ) exp( (μ − x)/θ + σ²/(2θ²) ) Φ( (x − μ)/σ − σ/θ ),   (3)

where Φ denotes the standard normal cumulative distribution function.

Normal-exponential model fit

We consider the data sets (E1) and (E2). For each array, the following procedure is implemented:

● Computation of the estimators

1. Maximum Likelihood Estimation (MLE) using both regular and negative probes,

2. Robust Multiarray Analysis (RMA) estimation adapted from the Affymetrix method,

3. NP estimation obtained by the method of moments applied to negative and regular probes,

4. Bayesian estimation. Note that the Bayesian estimation results are not presented as they are nearly identical to MLE, as pointed out by Xie

● For each parameter estimation method, plot of the

● Plot of an irregular density histogram of all regular probe intensities of the array using the R-package histogram available on the CRAN with default irregular setting (see

Figure “Normal-exponential fit” displays the results for one array from (E1) after removal of imperfectly designed probes (more arrays are presented in Additional file

Normal-exponential fit

**Normal-exponential fit.** Normal-exponential estimation for one array from (E1) after removal of imperfectly designed probes: irregular density histogram of all regular probe intensities and the

Besides Xie

A new modeling: the normal-gamma model

The poor fitting of the normexp model shown above, as well as the preliminary observations based on non-parametric estimation procedures, call for a more suitable parametric model for Illumina BeadArrays. According to Section “Previous modelings” the normal assumption for the negative probes appears relevant. We consider the gamma distribution as an extension of the exponential distribution to model the signal intensities. Besides, as a scale mixture of exponential distributions (see

The normal-gamma model

The normal-gamma model is defined as follows. For every probe j,

S_j ∼ Γ(k, θ),  B_j ∼ N(μ, σ²),  S_j ⊥ B_j.

The parameters (k, θ) are respectively the shape and the scale of the gamma distribution; the normexp model corresponds to the degenerate case k = 1.

According to the convolution structure (see Section “Model-based background correction”), the density of X_j is the convolution product of the densities of S_j and B_j, namely:

f_X(x) = ∫_0^∞ f_S(s) f_B(x − s) ds.

This density does not have an analytic expression, unlike the normexp density (3). Nevertheless, good and fast numerical approximations can be computed using the Fast Fourier Transform (fft), together with tail approximations to ensure stability. Our implementation based on
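The fft idea can be illustrated with the following sketch (this is our own toy version, not the package's actual code; the parameters are placeholders, and the shape k is taken above 1 so that the gamma density stays bounded on the grid — shapes below 1 require the tail approximations mentioned above):

```python
import numpy as np
from scipy.stats import gamma, norm
from scipy.signal import fftconvolve

# Placeholder parameters; k > 1 keeps gamma.pdf finite at every grid point
mu, sigma = 400.0, 60.0
k, theta = 2.0, 300.0

dx = 0.5
a = -500.0
grid = np.arange(a, 4000.0, dx)

f_S = gamma.pdf(grid, a=k, scale=theta)    # signal density (0 for s < 0)
f_B = norm.pdf(grid, loc=mu, scale=sigma)  # noise density

# Discrete convolution: entry m approximates f_X at x = 2*a + m*dx
conv = fftconvolve(f_S, f_B) * dx
x_out = 2 * a + dx * np.arange(conv.size)

def normgam_density(x):
    """FFT-based numerical approximation of the normal-gamma density f_X."""
    return np.interp(x, x_out, conv)
```

One fft of the whole grid gives the density at every point at once, which is what makes likelihood evaluation over 25000 probes feasible.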

Parameter estimation in the normal-gamma model

The parameters (μ, σ, k, θ) are estimated by maximum likelihood:

(μ̂, σ̂, k̂, θ̂) = argmax L(μ, σ, k, θ; **X**, **X**^{0}),

where

L(μ, σ, k, θ; **X**, **X**^{0}) = ∏_{j ∈ J} f_X(X_j) × ∏_{j ∈ J_0} φ(X_j; μ, σ²)

is the likelihood from the two sets of observations **X** = {X_j, j ∈ J} and **X**^{0} = {X_j, j ∈ J_0} measured on regular and negative probes, respectively. Thanks to the fft-based approximation of f_X, the likelihood can be computed and maximised numerically.

Background corrected intensity for the normal-gamma model

Denoting now by f_X the normal-gamma convolution density, the background corrected intensity associated with an observed intensity x is the conditional expectation

E[S | X = x] = ( ∫_0^∞ s f_S(s) f_B(x − s) ds ) / f_X(x),   (8)

using the equality f_X(x) = ∫_0^∞ f_S(s) f_B(x − s) ds.
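A direct (non-fft) numerical sketch of the corrected intensity (8), with placeholder parameters: for large x it approaches roughly x − μ − σ²/θ, and it always stays positive, unlike plain background subtraction.

```python
import numpy as np
from scipy.stats import gamma, norm

mu, sigma = 400.0, 60.0   # placeholder noise parameters
k, theta = 2.0, 300.0     # placeholder signal parameters

def normgam_bgc(x, upper=6000.0, n=60001):
    """Background corrected intensity E[S | X = x], computed by numerical
    integration of the two convolution integrals (the grid step cancels
    in the ratio)."""
    s = np.linspace(0.0, upper, n)
    w = gamma.pdf(s, a=k, scale=theta) * norm.pdf(x - s, loc=mu, scale=sigma)
    return float(np.sum(s * w) / np.sum(w))
```

Because the gamma prior forces S ≥ 0, even an observed intensity below the noise mean is mapped to a small positive signal rather than a negative value.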

Inference of negative probes from Illumina detection p-values

Most publicly available data sets do not present the negative probe intensities. Nevertheless, for each regular probe, Illumina provides a detection p-value equal to the proportion of negative probes which have intensities greater than that probe on a given array. Following the idea from Shi, the negative probe intensities can thus be inferred from the detection p-values; we evaluate this reconstruction on data set (E1). We observe that the error resulting from the inference of the negative probes is negligible, with a relative error of order 10^{−3} to 10^{−4} on parameter estimation and 10^{−4} to 10^{−5} on signal estimation (see Additional file
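The inversion can be sketched as follows (our own hypothetical helper, not Illumina's or any package's API): a regular probe with detection p-value p is exceeded by a fraction p of the negative probes, so it sits at the empirical (1 − p) quantile of the negative probe distribution, and interpolating intensity against 1 − p recovers approximate negative probe quantiles.

```python
import numpy as np

def infer_negative_quantiles(intensities, detection_pvals, n_neg=1000):
    """Reconstruct approximate negative probe intensities from detection
    p-values: invert the relation "p-value p <=> (1 - p) quantile of the
    negative probe distribution" by interpolation over the regular probes."""
    order = np.argsort(intensities)
    x = np.asarray(intensities, dtype=float)[order]
    q = 1.0 - np.asarray(detection_pvals, dtype=float)[order]  # quantile levels
    levels = (np.arange(n_neg) + 0.5) / n_neg
    return np.interp(levels, q, x)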

Results and discussion

Fit on Illumina BeadArray data

Similarly to Section “Previous modelings”, we compare the irregular density histogram of the regular probe intensities with the reconstructed normal-gamma density on the arrays of data sets (E1) and (E2). The results, similar across the arrays, are illustrated in Figure “Normal-exponential and normal-gamma fit” for one array from (E1) (more plots are presented in Additional file

Normal-exponential and normal-gamma fit

**Normal-exponential and normal-gamma fit.** Normal-gamma estimation for one array from (E1) after removal of imperfectly designed probes: irregular density histogram of all regular probe intensities,

Thanks to the larger flexibility of the normal-gamma model, we observe that the MLE reconstructed density is much closer to the histogram. To quantify this improvement, we consider the L1-distance between the histogram and the reconstructed density, defined by

∫ | ĥ(x) − f̂_X(x) | dx,

where ĥ denotes the irregular density histogram and f̂_X the reconstructed density. The table below reports this deviation averaged over the ten arrays from (E1) (with and without the non-specific binding probes) and over the four arrays from (E2).

Average deviation between the normexp reconstructed density and the histogram, divided by the deviation between the normal-gamma reconstructed density and the histogram (first row: MLE normexp estimator; second row: RMA estimator; third row: NP normexp estimator). The fourth row gives the mean deviation of the normal-gamma estimator as a reference. The mean is computed over the ten arrays from (E1) with (first column) or without (second column) the non-specific binding probes, and over the four arrays from (E2) (third column).

|  | **Human (all probes)** | **Human (remove bad probes)** | **Mice** |
| --- | --- | --- | --- |
| nexp MLE | 7.09 | 5.14 | 4.83 |
| nexp RMA | 2.96 | 3.18 | 2.71 |
| nexp NP | 7.69 | 5.50 | 5.29 |
| Abs Dev normgam | 0.17 | 0.21 | 0.20 |

The mean absolute deviation of the normal-gamma density is about 3 times smaller than that of the normexp density with the RMA estimate, and 4 to 8 times smaller than with the MLE or NP normexp estimates.
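A simplified version of this criterion can be written as follows (equal-width bins are used instead of the paper's irregular histograms, which come from the R package histogram):

```python
import numpy as np
from scipy.stats import norm

def l1_deviation(data, density, bins=100):
    """L1 distance between a density histogram of `data` and a fitted density,
    approximated by summing |histogram height - density at bin midpoint|
    times the bin width over all bins."""
    heights, edges = np.histogram(data, bins=bins, density=True)
    mids = 0.5 * (edges[:-1] + edges[1:])
    return float(np.sum(np.abs(heights - density(mids)) * np.diff(edges)))
```

A well-fitting density yields a small deviation and a misspecified one a large deviation, mirroring the normgam/normexp contrast reported above.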

Quality of estimation on simulated data

The quality of estimation of the normal-gamma model is assessed on the simulation data set (S1). The first two sets of parameters are non-degenerate normal-gamma parameters, more realistic for modeling Illumina microarrays as shown in Section “Fit on Illumina BeadArray data”. They are used to evaluate the MLE normal-gamma parameter estimation, to validate the associated background correction, and to quantify the improvement brought by the new normal-gamma background correction. The last four sets are actually degenerate normal-gamma parameters where the shape parameter is equal to 1, i.e. normexp parameters.

Parameter estimation

For each repetition ℓ, the parameters are estimated by maximum likelihood; the table below reports the relative L1-error for each parameter.

|  | **μ** | **σ** | **k** | **θ** |
| --- | --- | --- | --- | --- |
| set 1 | 7.1e-4 | 5.6e-3 | 9.3e-3 | 1.7e-2 |
| set 2 | 1.3e-3 | 5.5e-3 | 1.0e-2 | 1.8e-2 |
| set 3 | 3.5e-3 | 1.6e-2 | 6.9e-3 | 8.3e-3 |
| set 4 | 4.5e-3 | 1.3e-2 | 8.9e-3 | 9.8e-3 |
| set 5 | 2.1e-3 | 7.6e-3 | 2.6e-2 | 1.7e-2 |
| set 6 | 3.5e-3 | 7.2e-3 | 3.9e-2 | 2.4e-2 |

The parameter estimation is of excellent quality for the gaussian distribution and of good quality for the gamma distribution.

To check whether the introduction of a fourth parameter in our model leads to a loss of precision in the parameter estimation, we compare the relative L1 errors of the MLE parameter estimation in the normal-gamma and normexp models using the parameter sets 3 to 6, corresponding to normexp data. The results are summarized in the table below; as the error ratios remain moderate and the relative errors themselves are of order 10^{−2}, this loss is negligible.

Ratio between the relative L1 errors of the MLE estimation in the normal-gamma and in the normexp models for (μ, σ, θ).

|  | **μ** | **σ** | **θ** |
| --- | --- | --- | --- |
| set 3 | 1.1 | 1.0 | 1.6 |
| set 4 | 1.2 | 1.0 | 1.8 |
| set 5 | 2.1 | 1.0 | 2.3 |
| set 6 | 1.9 | 1.0 | 1.9 |

Background corrected intensity

We now study the performance of the normal-gamma background correction (BgC) obtained in (8) with respect to the existing BgC methods, in terms of quality of estimation of the signal, on the simulated data set (S1). We compare the following BgC methods, detailed in

0. Normal-gamma BgC in (8) with true parameters,

1. Normal-gamma BgC in (8) with MLE parameters,

2. Normal-exponential BgC in (4) with MLE parameters (referred to as normexp-MLE),

3. Normal-exponential BgC in (4) with RMA parameters (referred to as normexp-RMA),

4. Normal-exponential BgC in (4) with NP parameters (referred to as normexp-NP),

5. Background subtraction: subtraction of the estimated mean of the negative probe intensities (the BeadStudio default).

These methods are further denoted by their index m = 0, 1, …, 5.

For each parameter set and for each BgC method m, the Mean Absolute Deviation (MAD) between the estimated and the true signal is computed over the repetitions, the estimation being based on **X**^{ℓ} and **X**^{0,ℓ} for ℓ = 1, …, L. The table below reports, for methods 1 to 5, the MAD relative to the normal-gamma BgC with estimated parameters, and the last column indicates the reference risk obtained with the true parameters.

Mean Absolute Deviation (MAD) of the background corrected intensities for methods 1 to 5 on the simulated data set (S1), relative to the normal-gamma BgC. Column 1: normal-gamma, column 2: normexp-MLE, column 3: normexp-RMA, column 4: normexp-NP, column 5: background subtraction. The MAD of the theoretical normal-gamma deconvolution with the true parameters is given as a reference in column 6.

|  | **R(1)** | **R(2)** | **R(3)** | **R(4)** | **R(5)** | **MAD** |
| --- | --- | --- | --- | --- | --- | --- |
| set 1 | 1.00 | 4.16 | 1.77 | 1.52 | 1.16 | 2.34 |
| set 2 | 1.00 | 4.10 | 1.90 | 1.66 | 1.20 | 11.7 |
| set 3 | 1.00 | 1.00 | 4.69 | 1.00 | 1.00 | 4.57 |
| set 4 | 1.00 | 1.00 | 3.71 | 1.00 | 1.02 | 31.4 |
| set 5 | 1.00 | 1.00 | 2.11 | 1.00 | 1.15 | 2.95 |
| set 6 | 1.00 | 1.00 | 1.46 | 1.00 | 1.35 | 17.2 |

The normal-gamma BgC provides the same quality whether the parameters are known or estimated. This holds when the data are generated either from a normal-gamma or a normexp model. Normexp-NP behaves well when the data come from a normexp model but shows a risk increase of order 60% if the data come from a normal-gamma model. Normexp-MLE provides good results for normal-exponential data but fails when the data come from a normal-gamma model. Not surprisingly, as already pointed out by Xie

In practical experiments, the data are usually transformed before the analysis. To address this issue, the MAD is computed on log-transformed intensities (see Additional file

Mean Absolute Deviation (MAD) of the background corrected intensities for methods 1 to 4 on the simulated data set (S1), computed on log-transformed intensities. Column 1: normal-gamma, column 2: normexp-MLE, column 3: normexp-RMA, column 4: normexp-NP.

|  | **R(1)** | **R(2)** | **R(3)** | **R(4)** |
| --- | --- | --- | --- | --- |
| set 1 | 1.00 | 1.32 | 1.18 | 1.17 |
| set 2 | 1.00 | 1.28 | 1.16 | 1.16 |
| set 3 | 1.00 | 1.00 | 2.98 | 1.00 |
| set 4 | 1.00 | 1.00 | 2.45 | 1.00 |
| set 5 | 1.00 | 1.00 | 1.81 | 1.00 |
| set 6 | 1.00 | 1.00 | 1.39 | 1.00 |

The MAD computation offers a global comparison of the various BgC methods in terms of signal estimation. We refine this analysis by examining the absolute deviation (AD) of the estimated signal for each signal intensity at the raw and log scales, respectively defined as:

The first row of Figure

Absolute deviation of the signal estimation on simulated data

**Absolute deviation of the signal estimation on simulated data.** Logarithm of the Absolute Deviation of estimated signal on raw scale (first row), Absolute Deviation of log-transformed estimated signal (second row) and signal log-density (third row). Normal-gamma BgC (purple) and normexp BgC with MLE (pink), RMA (blue) and NP (green) parameters.

The absolute deviation on log-transformed intensities is presented on the second row of Figure

Robustness

In Section “Previous modelings”, we underlined the slightly heavier right tail of the negative probe distribution. To ensure that the estimation remains acceptable under an imperfect noise parametrisation, we compare the robustness of the normal-gamma method with normexp-NP, stated as the most robust of the normexp estimators. The results on the mixture-noise data set (S2) are presented in Additional file

In conclusion, the normal-gamma background correction globally offers a better quality in signal estimation with respect to the normexp methods. Nevertheless, this improvement depends on the scale considered and does not steadily hold over the range of intensities.

Operating characteristics

Beyond the quality of estimation of the signal, the performance of a BgC procedure in practical experiments depends on its characteristics in terms of bias and variance. In this section, we compare the operating characteristics of the normal-gamma and normexp BgC both on simulated and spike-in data. The results are gathered in Figure “Operating characteristics of the BgC methods on spike-in and simulated data”. The arrays from (E3) are background corrected with the methods 1 to 4 described in Section “Background corrected intensity”. Quantile normalization based on both regular and negative probe intensities is applied, followed by log-transformation. The same procedures are implemented on the simulation data sets (S3) and (S4).

Operating characteristics of the BgC methods on spike-in and simulated data

**Operating characteristics of the BgC methods on spike-in and simulated data.** Row 1: average spike intensities (left) and standard deviation of spike replicates (right) for all non-zero spike concentrations. Rows 2 to 4: average intensity (left) and standard deviation of replicates (right) as a function of signal intensity. Row 2: normal-gamma simulation in data set (S3) (parameter set 7); Row 3: gamma signal and empirical background noise distribution (data set (S4)); Row 4: normal-exponential simulation in data set (S3) (parameter set 9).

Bias-precision trade-off

The quality of a pre-processing method in microarray experiments can be characterised by its ability to distinguish between distinct values of the signal. Most of the procedures underestimate the signal fold-changes. This bias in fold-change estimation, called compression, has a negative impact on differential analysis. But the efficiency of a pre-processing method also depends on its precision, characterised by the variations of the corrected intensity for a given value of the signal. The trade-off between bias and precision is an indicator of the performance of a procedure. This issue can be understood by the example of a t-test statistic for a given probe differentially expressed between two groups: an important compression attenuates the difference of average intensities between the two groups, whereas a poor precision generates a high variance term in the denominator, which reduces the value of the test statistic.
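A toy numerical illustration of this trade-off (the numbers are arbitrary, chosen only to show the two effects):

```python
import math

def t_stat(mean_diff, sd, n):
    """Two-sample t statistic with equal group sizes n and a common
    within-group standard deviation sd."""
    return mean_diff / (sd * math.sqrt(2.0 / n))

n = 10
t_ref = t_stat(1.0, 0.5, n)         # true log fold-change 1.0, sd 0.5
t_compressed = t_stat(0.6, 0.5, n)  # compression shrinks the numerator
t_noisy = t_stat(1.0, 0.8, n)       # poor precision inflates the denominator
```

Both effects lower the statistic, which is why neither low bias nor high precision alone guarantees better sensitivity.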

The compression and precision obtained with the four BgC methods on the data set (E3) are presented on the first row of Figure “Operating characteristics of the BgC methods on spike-in and simulated data”.

The second column presents the average standard deviation between replicates over all spike bead types. We observe that the improvement in bias brought by the normal-gamma model is at the cost of a poorer precision. More generally, the precision increases with the compression for the four methods.

Innate offset

Shi

Innate offset, average standard deviation of spike replicates, and slope of the linear regression of the spike average intensity on the log-concentration.

| **BgC** | **Innate offset** | **Stand. Dev.** | **Slope** |
| --- | --- | --- | --- |
| normexp MLE | 23.4 | 0.095 | 0.74 |
| normexp NP | 12.4 | 0.100 | 0.80 |
| normexp RMA | 6.9 | 0.110 | 0.86 |
| normal-gamma | 1.5 | 0.200 | 0.99 |

Shi

Operating characteristics on simulated data

In order to reinforce the validation of the normal-gamma parametrisation for the noise-signal distribution, we compare the operating characteristics obtained on spike-in data to the ones provided by the normal-gamma simulated data from set (S3). The spike concentration, used as a reference to assess the bias and precision of the procedures on spike-in data, is replaced by the true value of the signal. The results are displayed on the second row of Figure

Furthermore, we address the departure from normality observed on the negative probe distribution by simulating microarrays with a gamma distributed signal and a non-normal background noise (data set (S4)). In order to get a realistic noise distribution, the background noise and the negative probe intensities are sampled from the quantile-normalised negative probe intensities from all arrays in (E3) (see details in Additional file

The same quantities are computed based on normal-exponential simulated data with parameter sets 8 and 9. The results are displayed on the fourth row of Figure

The parallel drawn between the operating characteristics of the four BgC methods on spike-in and simulated data confirms that the gamma model represents a much more accurate parametrisation for the signal distribution than the usual exponential model.

Differential expression analysis

The BgC methods are compared from a practical point of view through a differential expression analysis performed on the dilution data set (E4), based on the hierarchical linear model approach from Smyth

The estimated proportion of DE probes in pure samples, computed with a convex decreasing density procedure

A similar analysis is run with the addition of an offset prior to log-transformation. Figure

AUC as a function of added offset

**AUC as a function of added offset.** AUC from moderated t-test for mixed sample differential analysis in data set (E4) (proportions 25%/75% and 75%/25%) for different values of offset.

The BgC methods can also be compared regarding their ability to order a set of measured intensities corresponding to increasing or decreasing probe concentrations. This framework can refer, for example, to a longitudinal study where gene expression is repeatedly measured at different times. The correlation between the mixture proportion and the intensity is analysed on the dilution data set (E5). For the true DE probes, the intensity is expected to increase or decrease with the proportion.

The dilution data sets (E4) and (E5) are based on the same pure biological samples. Therefore, true DE and non-DE probes defined on (E4) can be considered in the analysis of the data from (E5). The BeadChips used in experiments (E4) and (E5) are different, but some bead types are present on both devices. By mapping the annotation files from both BeadChips, the sets of probes respectively defined as DE and non-DE on (E4) and present on (E5) are extracted.

For each probe, the Spearman correlation coefficient is computed between the vector of mixture proportions and the observed intensities. This provides a test statistic based on the ranking of the background corrected intensities, which allows a comparison of the BgC methods independently of the scale at which the data are analysed, provided that the transformation applied to the data is increasing. In particular, the results are not affected by the addition of an offset. The correlation coefficient is computed separately on microarrays with starting RNA quantities 250ng, 100ng, 50ng and 10ng. The coefficient is expected to be close to 1 in absolute value for the DE probes, and close to 0 for the non-DE probes. The probes are ranked according to their correlation coefficient value, and the resulting AUCs for each starting RNA quantity are displayed in the table below.

AUC from the Spearman correlation test between the proportion and the intensity in the dilution data set (E5), for the four BgC methods and the four RNA starting quantities.

|  | **Normal-gamma** | **Normexp-MLE** | **Normexp-RMA** | **Normexp-NP** |
| --- | --- | --- | --- | --- |
| 250ng | 0.9778 | 0.9812 | 0.9813 | 0.9820 |
| 100ng | 0.9774 | 0.9807 | 0.9809 | 0.9808 |
| 50ng | 0.9805 | 0.9834 | 0.9832 | 0.9841 |
| 10ng | 0.9782 | 0.9818 | 0.9787 | 0.9816 |
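The ranking score and its AUC can be sketched as follows (our own helper names, not from any package; the AUC is computed as the Mann-Whitney statistic, with ties counted half):

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_auc(intensities, proportions, is_de):
    """Score each probe by |Spearman correlation| between its intensities
    (one row of `intensities`) and the mixture proportions, then compute the
    AUC separating DE from non-DE probes (boolean labels `is_de`)."""
    scores = np.array([abs(spearmanr(row, proportions)[0])
                       for row in intensities])
    pos, neg = scores[is_de], scores[~is_de]
    diff = pos[:, None] - neg[None, :]
    # AUC = P(score of a DE probe > score of a non-DE probe), ties counted half
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())
```

Because only ranks enter the score, any increasing transformation of the corrected intensities, including an added offset, leaves the AUC unchanged.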

Conclusions

In many microarray experiments, background noise correction is an important issue in order to improve the measurement precision. Model-based background correction procedures have been developed as an alternative to the default background subtraction from Illumina BeadStudio which has proved to remove informative probes. The usual normal-exponential model considered for the noise-signal distribution has already been pointed out as inappropriate for Illumina BeadArrays

We compare the performance of the background correction procedures based on the normal-gamma and normal-exponential models on simulated and experimental data sets. Our simulation study indicates that the normal-gamma model brings an overall improvement in terms of signal estimation, characterised by a smaller average difference between the true signal and the background corrected intensity. But surprisingly, the differential expression analysis run on two dilution data sets shows that the improvement in terms of parametrisation does not have a positive impact on practical experiments, the normal-gamma correction exhibiting a slightly poorer sensitivity than the normexp methods. This result may be explained in two ways.

On the one hand, the operating characteristics of the background correction procedures are compared on a set of spike-in data, which allows one to connect the probe intensity with the concentration of the target gene in the biological sample. We note that the normal-gamma model generates less bias than the normexp methods, but at the cost of a loss in precision. With the addition of an offset prior to the log-transformation, which balances the bias-precision trade-off of the different methods, the operating characteristics appear similar, suggesting comparable performance.

On the other hand, we examine the error in signal estimation as a function of the signal on log-scale simulated data. The normal-gamma model outperforms the other methods on small intensities, but is less competitive on moderate intensities. Due to the marked compression of the recovered intensity when the signal decreases, the improvement in terms of signal estimation for the small intensities has a weak effect on the differential expression analysis. Thus, the smaller average error of estimation observed with the normal-gamma background correction does not result in a higher sensitivity in practical experiments.

Besides, the parallel drawn between the operating characteristics of the different background corrections, obtained on the one hand with spike-in data and on the other hand with normal-gamma simulated data, highlights strong similarities. The simulations from the normal-gamma model recover subtle differences between background correction procedures, whereas simulations from the normexp model totally fail to reproduce the trends observed on spike-in data. These considerations strengthen the validation of the normal-gamma model for Illumina microarrays, and illustrate the potential of normal-gamma simulations for the comparison of pre-processing procedures. Furthermore, the similarities between the observations from spike-in and simulated data are increased by sampling the background noise from the empirical negative probe distribution, which suggests that a further improvement in modeling could be brought by a non-normal parametrisation of the background noise.

In conclusion, this paper addresses the lack of fit of the usual normal-exponential model by proposing a more flexible parametrisation of the signal distribution as well as the associated background correction. This new model proves to be considerably more accurate for Illumina microarrays, but our results indicate that the improvement in terms of modeling does not lead to a higher sensitivity in differential analysis. Nevertheless, this realistic modeling makes way for future investigations, in particular to examine the characteristics of pre-processing strategies.

Endnote

^{a}

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

The NOWAC data were provided by EL, Principal Investigator of TICE project. Statistical and computational aspects were developed by SP and YR. All authors read and approved the final manuscript.

Acknowledgements

The authors thank Gregory Nuel for fruitful discussions on numerical issues.

Fundings

Grant: ERC-2008-AdG 232997-TICE “Transcriptomics in cancer epidemiology”.