This article is part of the supplement: IEEE 7th International Conference on Bioinformatics and Bioengineering at Harvard Medical School
Research
Improving the specificity of exon prediction using comparative genomics
Department of Statistics, Purdue University, 150 N. University Street, West Lafayette, IN 47906, USA
BMC Genomics 2008, 9(Suppl 2):S13 doi:10.1186/1471-2164-9-S2-S13
Published: 16 September 2008Abstract
Background
Computational gene prediction tools routinely generate large volumes of predicted coding exons (putative exons). One common limitation of these tools is the relatively low specificity due to the large amount of non-coding regions.
Methods
A statistical approach is developed that largely improves the gene prediction specificity. The key idea is to utilize the evolutionary conservation principle relative to the coding exons. By first exploiting the homology between genomes of two related species, a probability model for the evolutionary conservation pattern of codons across different genomes is developed. A probability model for the dependency between adjacent codons/triplets is added to differentiate coding exons and random sequences. Finally, the log odds ratio is developed to classify putative exons into the group of coding exons and the group of non-coding regions.
Results
The method was tested on pre-aligned human-mouse sequences where the putative exons are predicted by GENSCAN and TWINSCAN. The proposed method is able to improve the exon specificity by 73% and 32% respectively, while the loss of the sensitivity ≤ 1%. The method also keeps 98% of RefSeq gene structures that are correctly predicted by TWINSCAN when removing 26% of predicted genes that are in non-coding regions. The estimated number of true exons in TWINSCAN's predictions is 157,070. The results and the executable codes can be downloaded from http://www.stat.purdue.edu/~jingwu/codon/ webcite
Conclusion
The proposed method demonstrates an application of the evolutionary conservation principle to coding exons. It is a complementary method which can be used as an additional criteria to refine many existing gene predictions.



