This article is part of the supplement: The 2008 International Conference on Bioinformatics & Computational Biology (BIOCOMP'08)
A distribution-free convolution model for background correction of oligonucleotide microarray data
1 Biostatistics Epidemiology Research Design Core, Center for Clinical and Translational Sciences, The University of Texas Health Science Center at Houston, UT Professional Building, 6410 Fannin Street, Houston, TX 77030, USA
2 Department of Statistical Science, Southern Methodist University, 3225 Daniel Ave., Dallas, TX 75275, USA
3 Department of Computer Science, New Mexico Institute of Mining and Technology, Socorro, NM 87801, USA
4 Department of Pathology, University of Texas, Southwestern Medical Center, 6000 Harry Hines Blvd., Dallas, TX 75390, USA
5 Specpro, Vicksburg, MS 39180, USA
6 Department of Biology Science, The University of Southern Mississippi, 118 College Dr., Hattiesburg, MS 39406, USA
BMC Genomics 2009, 10(Suppl 1):S19 doi:10.1186/1471-2164-10-S1-S19
Published: 7 July 2009
Affymetrix GeneChip® high-density oligonucleotide arrays are widely used in biological and medical research because of production reproducibility, which facilitates the comparison of results between experiment runs. To obtain trustworthy high-level classification and cluster analyses, it is important to perform various pre-processing steps on the probe-level data to control for variability in sample processing and array hybridization. Many proposed pre-processing methods are parametric, in that they assume that the background noise in microarray data is a random sample from a statistical distribution, typically a normal distribution. The quality of the final results depends on the validity of such assumptions.
We propose a Distribution Free Convolution Model (DFCM) to circumvent observed deficiencies in meeting and validating the distribution assumptions of parametric methods. Knowledge of array structure and the biological function of the probes indicates that the intensities of mismatched (MM) probes that correspond to the smallest perfect match (PM) intensities can be used to estimate the background noise. Specifically, we take the probe pairs with the lowest q1 percent of PM intensities, obtain the smallest q2 percent of the MM intensities associated with them, and use these MM intensities to estimate the background.
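The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `select_background_mm`, the toy intensities, and the cutoffs q1 = 50 and q2 = 80 are hypothetical, and summarizing the selected intensities by their mean and standard deviation is an assumption, since the text does not state which estimator is applied to them.

```python
import math
import statistics


def select_background_mm(pm, mm, q1, q2):
    """Pick the MM intensities used for background estimation.

    pm, mm : paired PM and MM intensities (same probe pair at the same index)
    q1     : percent of probe pairs kept, ranked by lowest PM intensity
    q2     : percent of those pairs' MM intensities kept, smallest first

    Returns (selected, mean, sd). The mean/sd summary is an illustrative
    assumption, not the estimator specified by the source text.
    """
    if len(pm) != len(mm):
        raise ValueError("pm and mm must be paired sequences of equal length")
    if not pm:
        raise ValueError("at least one probe pair is required")
    if not (0 < q1 <= 100 and 0 < q2 <= 100):
        raise ValueError("q1 and q2 must be percentages in (0, 100]")

    n = len(pm)
    # Step 1: indices of the probe pairs with the lowest q1 percent of PM values.
    n_low_pm = max(1, math.floor(n * q1 / 100.0))
    low_pm_idx = sorted(range(n), key=lambda i: pm[i])[:n_low_pm]

    # Step 2: among their MM partners, keep the smallest q2 percent.
    mm_sorted = sorted(mm[i] for i in low_pm_idx)
    n_low_mm = max(1, math.floor(len(mm_sorted) * q2 / 100.0))
    selected = mm_sorted[:n_low_mm]

    # Step 3 (assumed summary): location and spread of the selected values.
    mean = statistics.mean(selected)
    sd = statistics.pstdev(selected)
    return selected, mean, sd


# Toy data (hypothetical): ten probe pairs.
pm_toy = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
mm_toy = [5, 50, 7, 9, 8, 60, 70, 80, 90, 95]
selected, bg_mean, bg_sd = select_background_mm(pm_toy, mm_toy, q1=50, q2=80)
print(selected, bg_mean, bg_sd)
```

Taking the MM partners of the weakest PM probes first and then trimming the largest MM values is what keeps probes with likely real signal or cross-hybridization out of the background sample; no distributional form is assumed at any step.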
Using the Affymetrix Latin Square spike-in experiments, we show that the background noise generated by microarray experiments is typically not well modeled by a single overall normal distribution. We further show that the signal is not exponentially distributed, as is also commonly assumed. Consistent with these findings, DFCM has better sensitivity and specificity, as measured by ROC curves and area under the curve (AUC), than MAS 5.0, RMA, RMA with no background correction (RMA-noBG), GCRMA, PLIER, and dChip (MBEI) for preprocessing of Affymetrix microarray data. These results hold for the two spike-in data sets and one real data set that were analyzed, and the comparisons show that our nonparametric methods are a superior alternative for background correction of Affymetrix data.
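For readers unfamiliar with the AUC criterion used in these comparisons, the sketch below shows one standard way to compute it when the truth is known, as it is for spike-in data. It is an illustration only: the function name `roc_auc`, the per-gene scores, and the labels are hypothetical, and the text does not describe how the ROC curves in the study were actually computed.

```python
def roc_auc(scores, is_spiked):
    """Area under the ROC curve via pairwise comparison.

    scores    : per-gene evidence of differential expression (higher = stronger)
    is_spiked : True for genes known to be spiked in, False otherwise

    The AUC equals the probability that a randomly chosen spiked-in gene
    scores higher than a randomly chosen non-spiked gene (ties count 1/2).
    """
    pos = [s for s, y in zip(scores, is_spiked) if y]
    neg = [s for s, y in zip(scores, is_spiked) if not y]
    if not pos or not neg:
        raise ValueError("need at least one spiked and one non-spiked gene")

    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores): two spiked-in and two background genes.
auc = roc_auc([0.9, 0.8, 0.3, 0.2], [True, False, True, False])
print(auc)
```

An AUC of 1.0 means every spiked-in gene outranks every non-spiked gene, and 0.5 corresponds to random ranking, so a preprocessing method with a higher AUC separates true signal from background more cleanly.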