Open Access Highly Accessed Methodology article

Genome comparison using Gene Ontology (GO) with statistical testing

Zhaotao Cai, Xizeng Mao, Songgang Li and Liping Wei*

Author Affiliations

Center for Bioinformatics, National Laboratory of Protein Engineering and Plant Genetic Engineering, College of Life Sciences, Peking University, Beijing 100871, P.R. China

BMC Bioinformatics 2006, 7:374  doi:10.1186/1471-2105-7-374

Published: 11 August 2006

Abstract

Background

Automated comparison of the complete sets of genes encoded in two genomes can provide insight into the genetic basis of differences in biological traits between species. Gene Ontology (GO) is used as a common vocabulary to annotate genes for comparison. Current approaches calculate the fold difference, unweighted or weighted, between two species at high-level GO functional categories. However, to ensure the reliability of the differences detected, it is important to evaluate their statistical significance. It is also useful to search for differences at all levels of GO.

Results

We propose a statistical approach to find reliable differences between the complete sets of genes encoded in two genomes at all levels of GO. The genes are first assigned GO terms from BLAST searches against genes with known GO assignments, and for each GO term the abundance of genes in the two genomes is compared using a chi-squared test followed by false discovery rate (FDR) correction. We applied this method to find statistically significant differences between two cyanobacteria, Synechocystis sp. PCC6803 and Anabaena sp. PCC7120. We then studied how the set of identified differences varies when different BLAST cutoffs are used. We also studied how the results vary when only subsets of the genes are used, in the comparison of human vs. mouse and in that of Saccharomyces cerevisiae vs. Schizosaccharomyces pombe.
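
The per-term test described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each GO term is tested on a 2x2 table (genes with and without the term in each genome), uses Pearson's chi-squared statistic with one degree of freedom and no continuity correction, and assumes the Benjamini-Hochberg procedure for the FDR step, since the abstract does not say which FDR method was used. The function names, the input dictionaries of per-term gene counts, and the `alpha` threshold are hypothetical.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (1 degree of freedom,
    no continuity correction) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence of a difference.
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-squared distribution with 1 df.
    return stat, erfc(sqrt(stat / 2.0))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def compare_genomes(counts_a, counts_b, total_a, total_b, alpha=0.05):
    """Test every GO term seen in either genome and return the terms whose
    adjusted p-value is at most alpha (hypothetical threshold).

    counts_a, counts_b: dicts mapping a GO term to the number of genes
    annotated with it in genome A and genome B.
    total_a, total_b: number of annotated genes in each genome.
    """
    terms = sorted(set(counts_a) | set(counts_b))
    pvalues = []
    for term in terms:
        a = counts_a.get(term, 0)
        c = counts_b.get(term, 0)
        _, p = chi2_2x2(a, total_a - a, c, total_b - c)
        pvalues.append(p)
    adjusted = benjamini_hochberg(pvalues)
    return {t: q for t, q in zip(terms, adjusted) if q <= alpha}
```

Because a gene annotated with a GO term is usually also counted under that term's ancestors, tests at different GO levels are not independent; the sketch ignores this and treats each term separately.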

Conclusion

There is a surprising lack of statistical approaches for comparing complete genomes at all levels of GO. With the rapid increase in the number of sequenced genomes, we hope that the approach we proposed and tested can make a valuable contribution to comparative genomics.