Delineation of amplification, hybridization and location effects in microarray data yields better-quality normalization
1 Delft Bioinformatics Lab, Delft University of Technology, Mekelweg 4, Delft, 2628 CD, The Netherlands
2 Department of Tissue Regeneration, University of Twente, PO box 217, Enschede, 7500AE, The Netherlands
3 Centre for Molecular and Biomolecular Informatics (CMBI), Radboud Universiteit Nijmegen, PO box 9101, Nijmegen, 6500HB, The Netherlands
4 Department of Applied Biology, Radboud Universiteit Nijmegen, PO box 9101, Nijmegen, 6500HB, The Netherlands
5 Department of Molecular Pharmacology, Merck Research Laboratories, PO Box 20, Oss, 5340BH, The Netherlands
6 Physiological Genomics Group, BU Biosciences, TNO Quality of Life, PO Box 360, Zeist, 3700AJ, The Netherlands
BMC Bioinformatics 2010, 11:156 doi:10.1186/1471-2105-11-156
Published: 26 March 2010
Oligonucleotide arrays have become one of the most widely used high-throughput tools in biology. Due to their sensitivity to experimental conditions, normalization is a crucial step when comparing measurements from these arrays. Normalization is, however, far from a solved problem. Frequently, we encounter datasets with significant technical effects that currently available methods are not able to correct.
We show that carefully decomposing probe-specific amplification, hybridization and array location effects enables a normalization that substantially improves the analysis of these data. Identifying the technical sources of variation between arrays allowed us to build statistical models that estimate, from the properties of each probe, how its signal is affected. This enables a model-based normalization that is probe-specific, in contrast to the normalization of signal intensity distributions performed by many current methods. In addition, we propose a novel way of handling background correction that uses background information to weight probes during summarization. In tests, the proposed method detects differentially expressed genes considerably better than earlier methods, even on spike-in datasets, which are experimentally tightly controlled and replicated.
When only a limited number of arrays is available, or when arrays are run in different batches, technical effects have a large influence on the measured expression of genes. We show that detailed modelling and correction of these technical effects improves the analysis in these situations.
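The two ideas summarized above, probe-specific model-based correction and weighted summarization, can be illustrated with a minimal sketch. It is not the model used in the paper. It assumes a plain linear least-squares fit of each probe's log-ratio against a reference array on a few probe properties, followed by a weighted mean over the probes of a probe set. The function names, the feature matrix `probe_features` and the `weights` vector are hypothetical and chosen for illustration only.

```python
import numpy as np


def fit_technical_effect(log_ratio, probe_features):
    """Least-squares fit of per-probe log-ratios on probe properties.

    Assumption: the technical effect is linear in the probe properties.
    """
    # Design matrix: an intercept column plus one column per probe property.
    X = np.column_stack([np.ones(len(log_ratio)), probe_features])
    coef, *_ = np.linalg.lstsq(X, log_ratio, rcond=None)
    return coef


def normalize_array(log_signal, log_reference, probe_features):
    """Subtract the modelled, probe-specific technical effect from one array."""
    coef = fit_technical_effect(log_signal - log_reference, probe_features)
    X = np.column_stack([np.ones(len(log_signal)), probe_features])
    return log_signal - X @ coef


def weighted_summary(log_signal, weights):
    """Weighted mean over the probes of one probe set.

    Assumption: weights are non-negative and derived elsewhere,
    for example from background information.
    """
    return float(np.sum(weights * log_signal) / np.sum(weights))
```

In this simplified form, any biological difference that happens to correlate with the chosen probe properties would be removed together with the technical effect, so the sketch is only meaningful when most probes are not differentially expressed between the array and the reference.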