Software (Open Access)

A method for detecting and correcting feature misidentification on expression microarrays

I-Ping Tu*1,6, Marci Schaner2, Maximilian Diehn2, Branimir I Sikic3, Patrick O Brown2,5, David Botstein4 and Michael J Fero*1,2

1Functional Genomics Facility, Stanford University School of Medicine, Stanford, CA, USA

2Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, USA

3Oncology Division, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA

4Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA

5Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, CA, USA

6Institute of Statistical Science, Academia Sinica, Taipei, Taiwan, R.O.C

* Contributed equally

BMC Genomics 2004, 5:64 doi:10.1186/1471-2164-5-64

Published: 9 September 2004

Abstract

Background

Much of the microarray data published at Stanford is based on mouse and human arrays produced under controlled and monitored conditions at the Brown and Botstein laboratories and at the Stanford Functional Genomics Facility (SFGF). Nevertheless, as large datasets based on the Stanford Human array began to accumulate, a small but significant number of discrepancies were detected, and these required a serious attempt to track down the original source of error. Because the process environment was controlled, sufficient data were available to accurately track the entire process leading up to the final expression data. In this paper, we describe our statistical methods for detecting inconsistencies in microarray data that arise from process errors, and discuss our technique for locating and fixing these errors.

Results

To date, the Brown and Botstein laboratories and the Stanford Functional Genomics Facility have together produced 40,000 large-scale (10,000–50,000 feature) cDNA microarrays. By applying the heuristic described here, we have checked most of these arrays for misidentified features and have confidently applied fixes to the data where needed. Of the 265 million features checked in our database, problems were detected and corrected on 1.3 million.

Conclusion

Process errors in any genome-scale, high-throughput production regime can lead to subsequent errors in data analysis. We show the value of tracking multi-step, high-throughput operations by using this knowledge to detect and correct misidentified data on gene expression microarrays.

