Research article (Open Access)

An assessment of recently published gene expression data analyses: reporting experimental design and statistical factors

Peyman Jafari1 and Francisco Azuaje2*

Author Affiliations

1 Department of Biostatistics and Epidemiology, Faculty of Health, Shiraz University of Medical Sciences, Shiraz, Iran

2 School of Computing and Mathematics and Computer Science Research Institute, University of Ulster, BT37 0QB, UK

BMC Medical Informatics and Decision Making 2006, 6:27  doi:10.1186/1472-6947-6-27

Published: 21 June 2006

Abstract

Background

The analysis of large-scale gene expression data is a fundamental approach to functional genomics and the identification of potential drug targets. Results derived from such studies cannot be trusted unless the studies are adequately designed and reported. The purpose of this study is to assess current practices in the reporting of experimental design and statistical analyses in gene expression-based studies.

Methods

We reviewed hundreds of MEDLINE-indexed papers involving gene expression data analysis, which were published between 2003 and 2005. These papers were examined on the basis of their reporting of several factors, such as sample size, statistical power and software availability.

Results

Among the examined papers, we concentrated on 293 papers consisting of applications and new methodologies. These papers did not report approaches to sample size and statistical power estimation. Explicit statements on data transformation and descriptions of the normalisation techniques applied prior to data analyses (e.g. classification) were not reported in 57 (37.5%) and 104 (68.4%) of the methodology papers, respectively. With regard to papers presenting biomedically relevant applications, 41 (29.1%) of these papers did not report on data normalisation and 83 (58.9%) did not describe the normalisation technique applied. Clustering-based analysis, the t-test and ANOVA represent the most widely applied techniques in microarray data analysis. Remarkably, only 5 (3.5%) of the application papers included statements or references to assumptions about variance homogeneity for the application of the t-test and ANOVA. There is still a need to promote the reporting of software packages applied or their availability.
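As a quick consistency check on the figures above, the sketch below recomputes each reported percentage from its count. The abstract does not state how the 293 papers split between methodology and application papers; the group sizes of 152 and 141 used here are an assumption inferred from the reported counts and percentages, and all names in the code are illustrative.

```python
# Total number of papers stated in the Results paragraph.
TOTAL_PAPERS = 293

# ASSUMPTION: group sizes are not given in the abstract. These values are
# inferred because they reproduce the reported percentages and sum to 293.
N_METHODOLOGY = 152
N_APPLICATION = 141

# (description, reported count, assumed group size, reported percentage)
REPORTED = [
    ("methodology: no data transformation statement", 57, N_METHODOLOGY, 37.5),
    ("methodology: normalisation technique not described", 104, N_METHODOLOGY, 68.4),
    ("application: data normalisation not reported", 41, N_APPLICATION, 29.1),
    ("application: normalisation technique not described", 83, N_APPLICATION, 58.9),
    ("application: variance homogeneity addressed", 5, N_APPLICATION, 3.5),
]


def percent(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


def check(rows, tolerance=0.05):
    """Map each description to True if the recomputed percentage matches."""
    return {
        label: abs(percent(count, total) - stated) <= tolerance
        for label, count, total, stated in rows
    }


if __name__ == "__main__":
    print("group sizes sum to total:", N_METHODOLOGY + N_APPLICATION == TOTAL_PAPERS)
    for label, ok in check(REPORTED).items():
        print(f"{label}: {'consistent' if ok else 'MISMATCH'}")
```

Under these assumed group sizes, every reported percentage is reproduced to one decimal place, which suggests the figures are internally consistent.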

Conclusion

Recently published gene expression data analysis studies may lack key information required for properly assessing their design quality and potential impact. There is a need for more rigorous reporting of important experimental factors such as statistical power and sample size, as well as the correct description and justification of statistical methods applied. This paper highlights the importance of defining a minimum set of information required for reporting on statistical design and analysis of expression data. By improving practices of statistical analysis reporting, the scientific community can facilitate quality assurance and peer-review processes, as well as the reproducibility of results.