Open Access Methodology article

Correlation test to assess low-level processing of high-density oligonucleotide microarray data

Alexander Ploner1*, Lance D Miller2, Per Hall1, Jonas Bergh3 and Yudi Pawitan1

Author Affiliations

1 Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden

2 Genome Institute of Singapore, Singapore

3 Department of Oncology and Pathology, Cancer Center Karolinska, Radiumhemmet, Karolinska Institutet and University Hospital, Stockholm

BMC Bioinformatics 2005, 6:80  doi:10.1186/1471-2105-6-80

Published: 31 March 2005

Abstract

Background

A number of competing techniques currently exist for low-level processing of oligonucleotide array data. The choice of technique has a profound effect on subsequent statistical analyses, but there is no method for assessing whether a particular technique is appropriate for a specific data set without reference to external data.

Results

To detect insufficient normalization between arrays, we analyzed coregulation between genes, measured as statistical correlation. In a large collection of genes, a randomly chosen pair of genes should have zero correlation on average, which provides the basis for a correlation test. For every data set that we evaluated, and for each of the three most commonly used low-level processing procedures (MAS5, RMA and MBEI), housekeeping-gene normalization failed the test. For a real clinical data set, RMA and MBEI showed significant correlation for absent genes. We also found that a second round of normalization on the probe set level improved normalization significantly throughout.
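The idea behind the correlation test can be illustrated with a minimal sketch. It is not the authors' procedure: the function names `random_pair_correlations` and `mean_correlation_z`, the number of sampled pairs, and the simulated data are all assumptions made for illustration. The z-score is only approximate, because sampled pairs that share a gene are not independent.

```python
import numpy as np


def random_pair_correlations(expr, n_pairs=2000, seed=0):
    """Pearson correlations for randomly drawn gene pairs (hypothetical helper).

    expr: 2-D array, rows = genes (probe sets), columns = arrays.
    """
    rng = np.random.default_rng(seed)
    n_genes = expr.shape[0]
    i = rng.integers(0, n_genes, size=n_pairs)
    j = rng.integers(0, n_genes, size=n_pairs)
    keep = i != j  # drop self-pairs, whose correlation is trivially 1
    i, j = i[keep], j[keep]
    # Centre each gene and scale it to unit length, so that the dot
    # product of two rows equals their Pearson correlation.
    z = expr - expr.mean(axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return np.sum(z[i] * z[j], axis=1)


def mean_correlation_z(corrs):
    """Approximate z-score for H0: mean pairwise correlation is zero."""
    return corrs.mean() / (corrs.std(ddof=1) / np.sqrt(corrs.size))


# Simulated data (not from the paper): independent genes, then the same
# genes with an array-wide shift left in, mimicking poor normalization.
rng = np.random.default_rng(1)
clean = rng.normal(size=(1000, 30))
shifted = clean + rng.normal(scale=2.0, size=(1, 30))

r_clean = random_pair_correlations(clean)
r_shifted = random_pair_correlations(shifted)
print(r_clean.mean(), mean_correlation_z(r_clean))
print(r_shifted.mean(), mean_correlation_z(r_shifted))
```

In this simulation the mean correlation of the independent genes should sit close to zero, while the array-wide shift pushes the mean correlation of random pairs well above zero, which is the signature of insufficient normalization that the test is meant to pick up.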

Conclusion

Previous evaluations of low-level processing in the literature have been limited to artificial spike-in and mixture data sets. In the absence of a known gold standard, the correlation criterion allows us to assess the appropriateness of low-level processing of a specific data set and the success of normalization for subsets of genes.