Biostatistics Epidemiology Research Design Core, Center for Clinical and Translational Sciences, The University of Texas Health Science Center at Houston, UT Professional Building, 6410 Fannin Street, Houston, TX 77030, USA

Department of Statistical Science, Southern Methodist University, 3225 Daniel Ave., Dallas, TX 75275, USA

Department of Computer Science, New Mexico Institute of Mining and Technology, Socorro, NM 87801, USA

Department of Pathology, University of Texas, Southwestern Medical Center, 6000 Harry Hines Blvd., Dallas, TX 75390, USA

Specpro, Vicksburg, MS 39180, USA

Department of Biological Sciences, The University of Southern Mississippi, 118 College Dr., Hattiesburg, MS 39406, USA

Abstract

Introduction

Affymetrix GeneChip® high-density oligonucleotide arrays are widely used in biological and medical research because of their production reproducibility, which facilitates the comparison of results between experimental runs. In order to obtain high-level classification and cluster analyses that can be trusted, it is important to perform various pre-processing steps on the probe-level data to control for variability in sample processing and array hybridization. Many proposed preprocessing methods are parametric, in that they assume the background noise in microarray data is a random sample from a statistical distribution, typically a normal distribution. The quality of the final results depends on the validity of such assumptions.

Results

We propose a Distribution Free Convolution Model (DFCM) to circumvent observed deficiencies in meeting and validating distribution assumptions of parametric methods. Knowledge of array structure and the biological function of the probes indicate that the intensities of mismatched (MM) probes that correspond to the smallest perfect match (PM) intensities can be used to estimate the background noise. Specifically, we obtain the smallest q2 percent of the MM intensities that are associated with the lowest q1 percent PM intensities, and use these intensities to estimate background.

Conclusion

Using the Affymetrix Latin Square spike-in experiments, we show that the background noise generated by microarray experiments typically is not well modeled by a single overall normal distribution. We further show that the signal is not exponentially distributed, as is also commonly assumed. On two spike-in data sets and one real data set, DFCM achieves better sensitivity and specificity, as measured by ROC curves and area under the curve (AUC), than MAS 5.0, RMA, RMA with no background correction (RMA-noBG), GCRMA, PLIER, and dChip (MBEI) for preprocessing of Affymetrix microarray data, showing that our nonparametric method is a superior alternative for background correction of Affymetrix data.

Introduction

Affymetrix GeneChip® arrays are widely used in biological and medical research to estimate gene expression levels. Each gene is interrogated using 11–20 probe pairs (depending on the platform), each of which consists of a perfect match (PM) and a mismatch (MM) probe. PM probes are sequences of 25 nucleotides that are intended to be a perfect complement to a subsequence of the target transcript of interest (gene). A MM probe is also 25 nucleotides in length, with the same composition as the corresponding PM probe, except that the middle base (13th) is changed to its Watson-Crick complement. The MM probes were originally designed to differ at one base pair so that their intensities could be subtracted from those of the PM as a measure of non-specific hybridization.

In order to estimate gene expression values and perform high-level analyses, such as classification and clustering, probe-level pre-processing of the data is necessary. Typically, there are three steps of preprocessing: background correction, normalization, and summarization, although not necessarily in that order. It has been argued that background correction is the most crucial step of probe-level processing.

One popular method, Robust Multichip Average (RMA), uses an exponential-normal convolution model for background correction, quantile normalization for the normalization step, and a median polish algorithm to summarize probe-level values into a single expression value per gene.

The exponential-normal convolution model is given by X = S + Y, where X is the observed PM intensity for a probe on the array, S is the true signal, assumed to have an exponential distribution with rate parameter α, and Y is the background noise, assumed to have a normal distribution with mean μ and standard deviation σ.
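Under these assumptions, RMA's background correction replaces each observed intensity x with the posterior mean of the signal. Writing a = x − μ − σ²α and b = σ, the standard closed form (as used by the Bioconductor implementation) is:

```latex
E[S \mid X = x] \;=\; a + b\,
\frac{\phi(a/b) - \phi\!\left(\frac{x-a}{b}\right)}
     {\Phi(a/b) + \Phi\!\left(\frac{x-a}{b}\right) - 1},
\qquad a = x - \mu - \sigma^{2}\alpha,\quad b = \sigma,
```

where φ and Φ denote the standard normal density and distribution function, respectively.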

The exponential-normal convolution model is built on the reasonable assumption that fluorescence intensities from a microarray experiment are composed of both signal and noise, and that the noise is ubiquitous throughout the signal distribution. A convolution model of a signal distribution and a noise distribution is a natural choice in such a situation. The choice of a normal distribution for the background noise and an exponential distribution for the signal was likely made for two reasons. First, density estimates of raw PM intensities from the Affymetrix Latin Square spike-in data sets show a right-skewed curve with what looks like a long exponential tail (see Figure
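This visual impression is easy to reproduce by simulation; the parameter values below are illustrative assumptions chosen only to mimic the right-skewed shape, not fitted estimates from any of the data sets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions): noise mean/sd and signal rate
mu, sigma, alpha = 100.0, 20.0, 0.01

y = rng.normal(mu, sigma, 100_000)        # background noise  Y ~ N(mu, sigma^2)
s = rng.exponential(1 / alpha, 100_000)   # signal            S ~ Exp(rate alpha)
x = s + y                                 # observed intensity X = S + Y

# The convolution is strongly right-skewed, like raw PM intensity densities
skewness = ((x - x.mean()) ** 3).mean() / x.std() ** 3
```

Plotting a kernel density estimate of `x` reproduces the long exponential-looking right tail seen in the raw PM densities.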

Smoothed density estimates of raw PM intensities for 10 randomly selected arrays from the Affymetrix Latin Square spike-in experiments

**Smoothed density estimates of raw PM intensities for 10 randomly selected arrays from the Affymetrix Latin Square spike-in experiments**. HG-U95Av2 (a) and HG-U133A (b). Each colored line represents a different experiment. A convolution of a normal distribution and an exponential distribution seems reasonable for these data.

Quantile-quantile plots and density plots for HGU95 Spike-In data

**Quantile-quantile plots and density plots for HGU95 Spike-In data**. (a) Quantile-quantile plot with quantiles of the standard normal distribution on the horizontal axis and quantiles of the noise distribution as estimated by the exponential-normal convolution model. If the normality assumption is correct, the plotted values should lie on a straight line. (b) Quantiles of an exponential distribution versus the background corrected probe-level intensities from the exponential-normal model. Again, any departures from a straight line indicate a lack of fit for the exponential distribution. (c) Density estimates of the log base 2 PM intensities for the original (uncorrected) probe-level intensities (solid line) and the estimated background using the exponential-normal model (dashed line). (d) Density estimates of the log base 2 PM intensities from the original data versus a simulated convolution of a normal distribution and an exponential distribution. The parameters for the normal and exponential distributions were obtained using estimates given by the Bioconductor implementation of RMA.

QQ-plot of estimated background noise (a, c) and density plots (b, d) of original data versus background corrected data for nine arrays from the HG-U133A spike-in data set and the control arrays from the GoldenSpike data set

**QQ-plot of estimated background noise (a, c) and density plots (b, d) of original data versus background corrected data for nine arrays from the HG-U133A spike-in data set and the control arrays from the GoldenSpike data set**.

The objective of this paper is to introduce a new background correction method, called Distribution Free Convolution Model (DFCM). The model has the same form as the exponential-normal convolution model (X = S + Y), except that no distributional assumptions are made on the signal (S) or the noise (Y). The mean and variance for the noise distribution are estimated using MM probe intensities in a novel way that is more consistent with their biological and structural characteristics. The signal is given by the PM intensities once the estimated background has been subtracted (as explained in Methods); therefore, there is no need to estimate parameters for the signal. We compare DFCM against RMA, RMA with no background correction (RMA – noBG)

Methods

Estimating the distribution-free convolution model

Let X = S + Y, where X = observed PM intensities, S = true intensity signal, and Y = background noise. The DFCM algorithm for background correction proceeds as follows:

1. Obtain the lowest q1 percent PM intensities. q1 is typically a small number (less than 30%). Information on the estimation of q1 is given in the next section.

2. Obtain lowest q2 percent (typically 90% or 95%) of MM intensities associated with the PMs obtained in step 1. These MM intensities are a measure of background noise, and will be termed "noise" in the sequel.

3. Use a nonparametric density estimate of the lowest q2 percent of the MM intensities to find the mode of the noise distribution. By default, the DFCM algorithm uses an Epanechnikov kernel estimate. Consider this mode as an estimate of the mean of the noise distribution. We denote this estimate as

4. Estimate the standard deviation of the background noise by calculating the sample standard deviation of the noise for values that are smaller than

5. Obtain background-corrected PM intensity values for the k-th probe (k = 1,..., K) in the i-th probe set (gene), where PM_{ki} denotes the non-corrected PM intensity value.

Therefore, for large enough values of PM_{ki}, we correct for background by subtracting the estimated mean of the background noise. For small values of PM_{ki}, the background-corrected PM intensities are determined by a linear interpolation, where the slope is a function of the background mean and standard deviation. The two equations give the same result at the boundary between the two cases.
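The five steps above can be sketched in Python. This is a sketch under stated assumptions: the bandwidth rule for the kernel estimate, the mode + 2·SD threshold, and the exact interpolation slope are our illustrative choices, since the paper's equations are not reproduced in this excerpt.

```python
import numpy as np

def epanechnikov_mode(x, grid_size=512):
    """Step 3: mode of x from an Epanechnikov kernel density estimate."""
    x = np.asarray(x, dtype=float)
    # Rule-of-thumb bandwidth (an assumption; the paper does not state one)
    bw = 1.06 * x.std() * len(x) ** (-0.2)
    grid = np.linspace(x.min(), x.max(), grid_size)
    u = (grid[:, None] - x[None, :]) / bw
    kern = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
    return grid[np.argmax(kern.mean(axis=1) / bw)]

def dfcm_background(pm, mm, q1=0.25, q2=0.90):
    """Steps 1-4: estimate the mean (mode) and SD of the background noise."""
    pm, mm = np.asarray(pm, float), np.asarray(mm, float)
    noise = mm[pm <= np.quantile(pm, q1)]            # MMs paired with lowest-q1 PMs
    noise = noise[noise <= np.quantile(noise, q2)]   # keep lowest q2 fraction
    mode = epanechnikov_mode(noise)
    below = noise[noise < mode]                      # SD from values below the mode
    sd = below.std(ddof=1) if below.size > 1 else noise.std(ddof=1)
    return mode, sd

def dfcm_correct(pm, mode, sd, k=2.0):
    """Step 5 (one plausible reading): subtract the mode above a threshold of
    mode + k*sd; below it, interpolate linearly so the two rules agree at the
    threshold. The constant k is an illustrative assumption."""
    pm = np.asarray(pm, float)
    thresh = mode + k * sd
    corrected = np.where(pm > thresh, pm - mode, (pm / thresh) * (thresh - mode))
    return np.maximum(corrected, 0.0)
```

At PM = thresh, both branches give thresh − mode, so the corrected intensities are continuous across the boundary.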

After background correction, any method of normalization or summarization can be used. For the purposes of this paper, quantile normalization and median polish summarization are used for DFCM.
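Of the downstream steps, quantile normalization can be sketched in a few lines (this simplified version breaks ties arbitrarily, unlike production implementations):

```python
import numpy as np

def quantile_normalize(X):
    """Quantile normalization of a probes x arrays matrix: every array (column)
    is given the same distribution, the mean of the per-array order statistics."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank of each value in its column
    mean_profile = np.sort(X, axis=0).mean(axis=1)     # average k-th smallest across arrays
    return mean_profile[ranks]
```

After this step every column has an identical empirical distribution, which is the defining property of quantile normalization.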

Choosing q1 and q2

Background noise is estimated using a select set of MM probe signals that are not likely to include effects of non-specific hybridization to the PM target transcript or cross-hybridization to a related target transcript. In choosing q1 and q2, we are attempting to choose PM values that are small enough so that non-specific MM hybridization to a PM target is not a problem, and then choosing a subset of MM values that are unlikely to be affected by cross-hybridization.

The parameter q1 can be thought of as a measure of the percentage of PM probes that recognize genes that are not expressed in the data set, based on the assumption that in any given biological sample there will be a subset of genes that are not expressed at a detectable level. Any signal detected by the MM probes for these genes cannot be due to hybridization with the target transcript, because the low PM values indicate that the gene is not expressed.

To choose the parameter q1, we developed an algorithm which calculates q1 such that the proportion of MM intensities greater than the PM intensities for the smallest q1% of the data is approximately 50%. We believe that one of the reasons that MM intensities are sometimes greater than their corresponding PMs is non-specific hybridization. Therefore, in estimating q1, we obtain a measure of the percentage of non-specific hybridization in the data set.
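A simple grid-search version of this selection rule might look as follows; the grid and the exact search strategy are assumptions, since the paper's implementation is not reproduced in this excerpt.

```python
import numpy as np

def choose_q1(pm, mm, target=0.50, grid=None):
    """Pick q1 so that, among probe pairs whose PM is in the lowest q1 fraction,
    the proportion with MM > PM is as close to `target` (50%) as possible."""
    pm, mm = np.asarray(pm, float), np.asarray(mm, float)
    if grid is None:
        grid = np.arange(0.05, 0.51, 0.01)
    best_q, best_gap = grid[0], np.inf
    for q in grid:
        idx = pm <= np.quantile(pm, q)
        gap = abs((mm[idx] > pm[idx]).mean() - target)
        if gap < best_gap:
            best_q, best_gap = q, gap
    return float(best_q)
```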

Data sets used for comparison

The two Affymetrix Latin-Square spike-in data sets (HG-U133A and HG-U95Av2) each contain several spiked-in transcripts in known locations on a set of chips. These data sets, and a detailed description of the Latin Square design, are available at

The HG-U133A experiment differs from the HG-U95Av2 experiment in several important ways. First, the HG-U133A experiment consists of 42 specific transcripts that are spiked in at 14 concentrations ranging from 0 pM to 512 pM, again arranged in a Latin Square design. Therefore, there is a finer gradation of concentrations used than in the HG-U95Av2 experiment. Also, there are three transcripts spiked-in at each concentration and three replicate arrays for each experiment, thus a total of 42 arrays. For convenience, we will call the triples of probe sets that recognize transcripts spiked-in at the same concentration "groups".

Recently, the HG-U133A data has also been examined for the presence of additional spike-ins

We use a third spike-in experiment to examine the distributional assumptions of the exponential-normal convolution model

While this data set was used to examine the distributional assumptions of the exponential-normal convolution model, these data were not used for evaluation of the relative performance of DFCM versus other algorithms due to controversy surrounding the use of the GoldenSpike dataset for method comparison. It has been observed that the GoldenSpike experiment uses technical replicates of a single experiment, rather than biological replicates. Thus, random variability in the experiment is confounded with real signal

Examining distributional assumptions

In order to test the validity of the noise and signal distributional assumptions, we compared the background noise distribution estimated by the exponential-normal convolution model with the standard normal distribution in both quantile-quantile (QQ) plots and density plots using the Affymetrix Latin Square spike-in data sets. All calculations were done using the Bioconductor suite in the R software package for statistical analysis

Quantile-quantile (QQ) plots are designed to compare the distributions of two data sets, usually a "gold standard" and a test data set. Sometimes, the gold standard consists of simulated values from a distribution of interest (

We also examined the assumption of a normal background distribution using three normality tests: Shapiro-Wilk, Anderson-Darling, and Kolmogorov-Smirnov

Results of the tests of normality of the background noise as estimated by the exponential-normal convolution model.

| Data Set | Test | Rejection Rate | Min P-value^a | Med P-value | Max P-value |
| --- | --- | --- | --- | --- | --- |
| HG-U95Av2 | AD | 962 | 0 | 0.0008 | 0.4738 |
| HG-U95Av2 | KS | 796 | 0 | 0.0082 | 0.8261 |
| HG-U95Av2 | SW | 999 | 0 | 0.0036 | 0.1186 |
| HG-U133A | AD | 850 | 0 | 0.0064 | 0.5915 |
| HG-U133A | KS | 594 | 0 | 0.0307 | 0.7700 |
| HG-U133A | SW | 962 | 0 | 0.0031 | 0.3010 |
| GoldenSpike | AD | 885 | 0 | 0.0035 | 0.3559 |
| GoldenSpike | KS | 639 | 0 | 0.0259 | 0.7490 |
| GoldenSpike | SW | 987 | 0 | 0.0016 | 0.1502 |

The rejection rates (number of p-values less than 0.05) are much higher than expected, indicating that the background noise is not likely to be normally distributed.

a P-values are not identically 0, but are 0 to at least five decimal places.

**AD **= Anderson-Darling test, **KS **= Kolmogorov-Smirnov test, **SW **= Shapiro-Wilk test.
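For reference, the three goodness-of-fit tests in the table can be run with scipy.stats. The sketch below applies them to simulated data; the exponential sample stands in for a skewed background estimate, and the sample size and seed are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_sample = rng.normal(0.0, 1.0, 500)
skewed_sample = rng.exponential(1.0, 500)   # stand-in for a non-normal background

_, sw_norm = stats.shapiro(normal_sample)   # Shapiro-Wilk p-values
_, sw_skew = stats.shapiro(skewed_sample)

# Kolmogorov-Smirnov against a normal fitted to the sample; estimating the
# parameters first makes the standard KS p-values conservative (too large),
# one reason KS rejects least often in the table
z = (skewed_sample - skewed_sample.mean()) / skewed_sample.std(ddof=1)
_, ks_skew = stats.kstest(z, "norm")

ad = stats.anderson(skewed_sample, "norm")  # Anderson-Darling: statistic plus
                                            # critical values, no p-value
```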

ROC curves and Area Under the Curve (AUC)

In order to compare the performance of DFCM versus currently available methods, we examined ROC curves and AUC for the two Latin Square spike-in data sets mentioned previously. We tested the performance of DFCM against RMA, RMA-noBG, GCRMA, MAS, dChip, and PLIER. All data files were preprocessed together for each method. For the Affymetrix data sets, we compared pairs of experiments that were separated by the same number of permutations of the Latin Square (where d = number of permutations), and obtained average true and false positive rates for each preprocessing method for each value of d, d = 1,..., 7. In these Latin Square designs, d can be thought of as the log2 fold difference in spike-in transcript levels for a majority of the transcripts. For example, for the HG-U133A data set, experiments 1 and 2, 2 and 3, 3 and 4, and so on, are pairs separated by d = 1.

We compared experiments with d = 1 through d = 7, since d = 8 is equivalent to d = 6, d = 9 equivalent to d = 5, and so on. AUC calculations were done for a cutoff of 100 false positives for the HG-U95Av2 experiment, and 200 false positives for the HG-U133A experiment. These cutoff points correspond to a false positive rate of approximately 0.8% for both experiments.
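The cutoff AUC described above can be sketched as follows; ranking genes by a differential-expression score and integrating true positives over the first `max_fp` false positives is one plausible implementation, not Affymetrix's or the authors' published code.

```python
import numpy as np

def roc_auc_at_fp(scores, is_spike, max_fp):
    """Partial AUC up to a fixed number of false positives, normalized to [0, 1].

    scores   : per-gene evidence of differential expression (larger = stronger)
    is_spike : boolean array, True for genes known to be spiked in
    max_fp   : false-positive cutoff (e.g. 100 for HG-U95Av2, 200 for HG-U133A)
    """
    order = np.argsort(-np.asarray(scores, float))  # strongest evidence first
    is_spike = np.asarray(is_spike, bool)
    tp = fp = area = 0
    for i in order:
        if is_spike[i]:
            tp += 1
        else:
            fp += 1
            area += tp                  # width-1 rectangle per false positive
            if fp == max_fp:
                break
    return area / (max_fp * is_spike.sum())
```

A perfect ranking (all spike-ins ahead of all other genes) gives 1.0; a worst-case ranking gives 0.0.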

Again, the GoldenSpike data was not used for methods comparison due to serious design flaws, described fully in

Results

Testing distributional assumptions for the convolution model

As described in Methods, we tested the validity of the noise and signal distributional assumptions by comparing the background noise distribution estimated by the exponential-normal convolution model with the standard normal distribution in both quantile-quantile (QQ) plots and density plots, using the Affymetrix spike-in data sets and the Bioconductor suite in the R software package for statistical analysis

Figure

Figure

Figure

Figure

The results shown for the HG-U95Av2 spike-in data apply to the HG-U133A spike-in data, with one notable exception. Figure

Table

Downstream performance of DFCM

The quantile-quantile plots provide evidence that the exponential-normal convolution model does not fit the data. These observations led us to develop DFCM as a means of background correction that does not rely on specific distributional assumptions. The estimation of noise and signal using DFCM is described in the Methods section. In order to determine if DFCM leads to an improvement of background correction and signal estimation, we applied DFCM and the other methods to each of the Affymetrix Latin Square data sets, and evaluated their performance characteristics using ROC/AUC analysis. The GoldenSpike data set was not used for this comparison because of the controversy around this data set

In order to compare the performance of DFCM with other commonly used methods, we examined ROC curves and AUC analyses for the two Latin Square spike-in data sets mentioned previously. We tested the performance of DFCM against RMA, RMA-noBG, GCRMA, MAS, dChip, and PLIER. All data files were preprocessed together for each method using a PowerMac G5 running R Cocoa GUI with R version 2.8.1

Figure

ROC Curves for HG-U95Av2 Latin Square Spike-In Experiment for d = 1 (a) and d = 2 (b)

**ROC Curves for HG-U95Av2 Latin Square Spike-In Experiment for d = 1 (a) and d = 2 (b)**. For other values of d, the differences among methods are very small. Curves for MAS 5.0 and PLIER (and dChip for plot b) do not appear on the graphs because their false positive and true positive rates are too small for the scale given. The scale goes from 0.4 to 0.9 in order to magnify differences among the methods.

Figure

ROC Curves for HG-U133A Latin Square Spike-In Experiment for d = 1 (a) and d = 2 (b)

**ROC Curves for HG-U133A Latin Square Spike-In Experiment for d = 1 (a) and d = 2 (b)**. For other values of d, the differences among methods are very small. The lines for MAS 5.0 and PLIER (for plot a) and MAS 5 (for plot b) do not appear on the graphs because their false positive and true positive rates are too small for the scale given. The scale goes from 0.6 to 0.9 in order to magnify differences among the methods.

Table

Average AUC for DFCM, RMA, RMA-noBG, MAS 5.0, dChip, and PLIER for detection of spiked-in genes in the Affymetrix Latin Square spike-in experiments, according to the value of d, which is related to the log base 2 fold change between experiments.

**Average AUC for the HG-U95Av2 Latin Square Spike-In Data Set**

| d | DFCM | RMA | RMA-noBG | MAS 5 | dChip | PLIER |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | **0.732** | 0.715 | 0.721 | 0.063 | 0.572 | 0.062 |
| 2 | 0.871 | 0.869 | **0.918** | 0.167 | 0.803 | 0.316 |
| 3 | **0.936** | 0.935 | 0.935 | 0.484 | 0.886 | 0.629 |
| 4 | **0.997** | 0.994 | 0.986 | 0.798 | 0.948 | 0.769 |
| 5 | **1.000** | 0.999 | 0.999 | 0.916 | 0.980 | 0.853 |
| 6 | **1.000** | **1.000** | **1.000** | 0.967 | 0.987 | 0.876 |
| 7 | **1.000** | **1.000** | **1.000** | 0.981 | 0.999 | 0.876 |

**Average AUC for the HG-U133A Latin Square Spike-In Data Set**

| d | DFCM | RMA | RMA-noBG | MAS 5 | dChip | PLIER |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | **0.768** | 0.738 | 0.734 | 0.060 | 0.600 | 0.365 |
| 2 | **0.858** | 0.831 | 0.812 | 0.307 | 0.709 | 0.752 |
| 3 | **0.935** | 0.904 | 0.908 | 0.561 | 0.811 | 0.883 |
| 4 | 0.934 | **0.964** | **0.964** | 0.837 | 0.913 | 0.951 |
| 5 | **0.983** | 0.990 | 0.983 | 0.939 | 0.971 | 0.985 |
| 6 | **0.999** | 0.998 | 0.996 | 0.968 | 0.989 | 0.994 |
| 7 | **0.999** | **0.999** | **0.999** | 0.978 | 0.972 | 0.996 |

Larger values of d correspond to larger fold changes among spiked-in transcripts between pairs of experiments. Eighteen spike-in probe sets were used for the HG-U95Av2 experiment, and 64 spiked-in probe sets were used for the HG-U133A experiment, to calculate true and false positives, as discussed in Methods. To calculate the AUCs, the number of false positives was set to 100 for the HG-U95Av2 experiment, and 200 for the HG-U133A experiment. Numbers in bold indicate the best value for each row.

Clustering and classification comparisons using real data

It has been argued that comparisons based on spike-in data do not necessarily translate to data derived from real biological specimens

Table

Number of GO terms with p-values less than 10^{-10 }for four pre-processing algorithms, according to CLASSIFI on the GSE2350 data. Larger numbers indicate better performance.

| Normalization | DFCM | RMA | None | MAS 5 |
| --- | --- | --- | --- | --- |
| Loess | 86 | 87 | 88 | 57 |
| Quantile | 48 | 47 | 50 | 60 |
| Scale | 83 | 80 | 76 | 24 |

To examine the effect of normalization on the results, quantile normalization, scale normalization (as defined for the MAS 5.0 algorithm), or loess normalization was used in combination with each of the background correction methods discussed in this paper. All methods (except for MAS 5.0) used median polish summarization. Differentially expressed genes were selected using two-sample t-tests. The methods GCRMA, dChip, and PLIER could not be used because their background correction, normalization, and summarization algorithms cannot be separated easily.

Discussion

The RMA convolution model for background correction of microarray data from Affymetrix platforms is very popular. This model assumes that the observed fluorescence intensity is composed of an exponentially distributed signal with underlying normally distributed noise. The idea of a combination of signal and noise is quite reasonable, but the analysis presented here indicates that the distributional assumptions are not always correct. In order to examine the assumption of normally distributed background noise, we performed background correction using the convolution model and plotted the estimated background intensities versus a normal distribution using a quantile-quantile plot for three spike-in data sets. The plots indicate that the normality assumption may not hold for all of the spike-in data sets examined. To confirm this, we examined the data with three well-known goodness-of-fit tests. The KS test, in particular, is known to have extremely low power

We also examined the background corrected intensities, which are purported to represent the true signal, against the exponential distribution. QQ plots and goodness of fit tests show that the background corrected signal is clearly not exponentially distributed for any of the data studied here. These observations lend credibility to the notion that preprocessing approaches should not rely heavily on distributional assumptions.

There is some evidence that the gene distributions within groups are normally distributed after preprocessing with MBEI and MAS 5.0

Recently, it has been argued that the assumption that intensity values from a microarray study are random samples from any statistical distribution is seriously flawed

DFCM uses the lowest q2 percent of the MM intensities corresponding to the smallest q1 percent of PM intensities to estimate background noise. The original intent of the MM probes was to provide a measure of non-specific hybridization that could be subtracted from the PM intensities, leaving the true signal. MAS 5.0 was developed under this assumption. It should be noted that the use of PM values alone could be justified by the noisiness and lack of validity of the MM measurement

There is a biochemically defensible rationale for estimating background noise using DFCM. By selecting the lowest q1 percent of PM intensities, we ensure that non-specific hybridization will not be an issue. However, there could still be some cross-hybridization, which is eliminated by taking only the smallest q2 percent of MM intensities. One could simply use the lowest q1 percent of the PM intensities; however, selecting the value for q1 would be difficult and somewhat arbitrary, and choosing a relatively high value for q1 might include some real signal from some of the PMs. The corresponding MM intensities should be lower and closer to background, since in theory they should not hybridize to the real target.

The algorithm for choosing the value of q1 is very stable (see Methods), almost always choosing the same value of q1 for a given platform. For example, two experiments completed on the HGU95 platform will have very similar values of q1 (approximately 0.25). In other words, the values of q1 are more platform dependent than they are experiment-dependent. This fact supports the notion that different normalization procedures are required for different platforms

One way to think of 100 − q2 is as an estimate of the chance that an MM probe is cross-hybridizing to another target transcript (or that most of its signal is from non-specific hybridization). A reasonable value for q2 is then given by subtracting this estimate of cross-hybridization potential from 100%. In the examples that follow, q2 = 90%. In practice, the value of q2 was found to have little effect on the background correction (see Figure

Boxplots for MM intensities for q1 ranging from 0.05 to 0.5, and q2 = 0.80 (left top), 0.90 (right top), 0.95 (left bottom), and 0.99 (right bottom)

**Boxplots for MM intensities for q1 ranging from 0.05 to 0.5, and q2 = 0.80 (left top), 0.90 (right top), 0.95 (left bottom), and 0.99 (right bottom)**. Our algorithm determined 0.25 to be the optimal value for q1. The value of q2 does not appreciably change the distribution of the MM intensities.

Conclusion

We have shown that microarray data from three different Affymetrix data sets (GoldenSpike, HG-U95Av2, and HG-U133A) do not meet the assumptions of the exponential-normal convolution model for background correction. This model is used in the Bioconductor software package in conjunction with quantile normalization and median polish summarization to comprise the RMA method. In all cases examined, the estimated background noise did not follow a normal distribution, nor did the resulting estimated signal follow a simple exponential distribution. To circumvent these problems, we devised a distribution-free method for subtracting background noise (DFCM). This method tended to perform better than many popular algorithms across a variety of experiments and array platforms.

This finding has four important implications. First, it is important to account for non-specific hybridization. We attempted to do so by using MM intensities to obtain an estimate of background noise. MAS 5.0 uses ideal mismatch to account for non-specific hybridization, but given the strong correlation between PM and MM values, the method is likely subtracting signal from the PM intensities, resulting in poor sensitivity and specificity. In other words, the method does not really account for non-specific hybridization, since the MM values do not perform as designed.

GCRMA uses the probe sequence information given by MM probes, and it works well for the HG-U95Av2 data but not for the HG-U133A data. The performance discrepancy may be explained in part by the improved technology and better knowledge of the human genome at the time of the creation of the HG-U133A chip. The rest of the explanation lies with deficiencies in estimating the various components of the GCRMA model. For example, parameters for non-specific hybridization are difficult to estimate reliably, since the signal and noise contributions to an observed intensity cannot be distinguished for most data. Estimating probe affinity is also quite difficult in practice. It has been reported that the top 2% of probes can contain up to 50% of the total signal

Second, any background correction method based on assumptions that the background noise is normally distributed and that the real signal is exponentially distributed may not be valid for any given array platform. Testing the distributional assumptions for real data is impossible, since we cannot know what is background and what is signal. Third, it is clear that we need to develop an understanding of the reasons certain methods perform better on certain platforms, and the role that non-specific hybridization and cross-hybridization play in the observed intensities from microarray data. Finally, the fact that different methods perform better (or worse) on different platforms indicates that no one method may be a panacea for all preprocessing needs. However, in order to test this conjecture, more spike-in data sets on a variety of platforms are necessary, as well as performance measures for use on real data sets. Automated methods for choosing the best method to analyze a particular microarray data set would be an important contribution.

Appendix

Please see Additional file


Competing interests

The authors declare that they have no competing interests.

Authors' contributions

ZC devised the algorithms and performed the study; MM and RHS supervised the study, obtained funding, and drafted the manuscript; QL helped to design the algorithms; YD assisted in the study; MK developed and ran the code for the CLASSIFI algorithm. All authors have read and approved the final manuscript.

Acknowledgements

The authors wish to thank William R. Schucany for a careful reading of this manuscript during its preparation, and gratefully acknowledge support for this work from Dr. Milton Packer, Director, Department of Clinical Sciences, University of Texas Southwestern Medical Center. This research was supported by the National Institutes of Health contracts N01-AI40076 and N01-AI40041 to RHS and by grant R15-AG16192 to Monnie McGee. The authors have no potential conflicts of interest directly relevant to the contents of the manuscript.

This article has been published as part of