Open Access, Highly Accessed research article

In silico prioritisation of candidate genes for prokaryotic gene function discovery: an application of phylogenetic profiles

Frank PY Lin1, Enrico Coiera1, Ruiting Lan2 and Vitali Sintchenko1,3

1Centre for Health Informatics, University of New South Wales, Sydney, Australia

2School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, Australia

3Centre for Infectious Diseases and Microbiology, Western Clinical School, University of Sydney, Sydney, Australia


BMC Bioinformatics 2009, 10:86 doi:10.1186/1471-2105-10-86

Published: 17 March 2009

Abstract

Background

In silico candidate gene prioritisation (CGP) aids the discovery of gene functions by ranking genes according to an objective relevance score. While several CGP methods have been described for identifying human disease genes, corresponding methods for prokaryotic gene function discovery are lacking. Here we present two prokaryotic CGP methods, based on phylogenetic profiles, to assist with this task.

Results

Using gene occurrence patterns in sample genomes, we developed two CGP methods (statistical and inductive CGP) to assist with the discovery of bacterial gene functions. Statistical CGP exploits the differences in gene frequency between phenotypic groups, while inductive CGP applies supervised machine learning to identify gene occurrence patterns across genomes. Three rediscovery experiments were designed to evaluate the CGP frameworks. The first experiment attempted to rediscover peptidoglycan genes with 417 published genome sequences. Both CGP methods achieved best areas under the receiver operating characteristic curve (AUC) of 0.911 in the Escherichia coli K-12 (EC-K12) genome and 0.978 in the Streptococcus agalactiae 2603 (SA-2603) genome, with an average improvement in precision of >3.2-fold and a maximum of >27-fold using statistical CGP. In the rediscovery of the peptidoglycan metabolism genes, a median AUC of >0.95 could still be achieved with as few as 10 example genomes in each group. In the second experiment, a maximum of 109-fold improvement in precision was achieved in the rediscovery of anaerobic fermentation genes in EC-K12. The last experiment attempted to rediscover genes from 31 metabolic pathways in SA-2603, where 14 pathways achieved AUC >0.9 and 28 pathways achieved AUC >0.8 with the best inductive CGP algorithms.
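To make the idea of statistical CGP and its evaluation more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that a phylogenetic profile is a 0/1 vector of gene presence across genomes, that the relevance score is a plain difference in gene frequency between two phenotypic groups (the abstract does not name the statistic actually used), and that fold improvement in precision is measured against the precision expected from a random selection of genes. All function names, gene names and data below are hypothetical.

```python
from typing import Dict, Sequence, Set


def frequency_difference_scores(
    profiles: Dict[str, Sequence[int]],
    has_phenotype: Sequence[bool],
) -> Dict[str, float]:
    """Score each gene by how much more often it occurs in genomes
    that show the phenotype than in genomes that do not.

    Assumption: a simple frequency difference stands in for the
    statistic used by the paper, which the abstract does not specify.
    """
    pos = [i for i, flag in enumerate(has_phenotype) if flag]
    neg = [i for i, flag in enumerate(has_phenotype) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one genome in each phenotypic group")
    scores = {}
    for gene, row in profiles.items():
        freq_pos = sum(row[i] for i in pos) / len(pos)
        freq_neg = sum(row[i] for i in neg) / len(neg)
        scores[gene] = freq_pos - freq_neg
    return scores


def auc(scores: Dict[str, float], known_genes: Set[str]) -> float:
    """AUC as the probability that a known gene outranks an unknown one
    (rank-sum formulation; ties count as one half)."""
    pos = [s for g, s in scores.items() if g in known_genes]
    neg = [s for g, s in scores.items() if g not in known_genes]
    if not pos or not neg:
        raise ValueError("need both known and unknown genes")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def precision_fold_improvement(
    scores: Dict[str, float], known_genes: Set[str], top_k: int
) -> float:
    """Precision among the top_k ranked genes divided by the baseline
    precision of picking genes at random (an assumed baseline)."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    precision = sum(g in known_genes for g in ranked) / top_k
    baseline = len(known_genes & set(scores)) / len(scores)
    return precision / baseline


# Toy data (invented): 4 genes across 6 genomes; the first three
# genomes show the phenotype of interest, the last three do not.
profiles = {
    "geneA": [1, 1, 1, 0, 0, 0],
    "geneB": [1, 1, 0, 0, 0, 0],
    "geneC": [1, 1, 1, 1, 1, 1],
    "geneD": [0, 1, 0, 1, 0, 1],
}
has_phenotype = [True, True, True, False, False, False]
known = {"geneA", "geneB"}  # genes to be "rediscovered"

scores = frequency_difference_scores(profiles, has_phenotype)
toy_auc = auc(scores, known)
toy_fold = precision_fold_improvement(scores, known, top_k=2)
print(scores, toy_auc, toy_fold)
```

In a rediscovery experiment of the kind described above, the set of known genes would be withheld during scoring and used only to evaluate the resulting ranking.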

Conclusion

Our results demonstrate that the two CGP methods can assist with the study of functionally uncategorised genomic regions and discovery of bacterial gene-function relationships. Our rediscovery experiments also provide a set of standard tasks against which future methods may be compared.

