Open Access Highly Accessed Research article

Identification of homologs in insignificant blast hits by exploiting extrinsic gene properties

Jos Boekhorst1* and Berend Snel1,2

Author Affiliations

1 Bioinformatics, Department of Biology, Faculty of Science, Utrecht University, Padualaan 8, 3584 CH, The Netherlands

2 Academic Biomedical Centre, Utrecht University, Yalelaan 1, 3584 CL Utrecht, The Netherlands

BMC Bioinformatics 2007, 8:356  doi:10.1186/1471-2105-8-356

Published: 21 September 2007

Abstract

Background

Homology is a key concept in both evolutionary biology and genomics. Detection of homology is crucial in fields such as the functional annotation of protein sequences and the identification of taxon-specific genes. Basic homology searches are still frequently performed with pairwise search methods such as BLAST. Vast improvements in the identification of homologous proteins have been made by more advanced methods that use sequence profiles. However, additional improvement could be made by exploiting sources of genomic information other than the primary sequence or tertiary structure.

Results

We test the hypothesis that two extrinsic gene properties, gene length and gene order, can help differentiate spurious sequence similarity from homology in the gray zone. Shared gene order and similar size dramatically increase the chance that a query-hit pair is homologous: gray zone query-hit pairs of similar size and with conserved gene order are homologous in 99% of all cases, while for query-hit pairs without gene order conservation and with different sizes this is only 55%.
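
As an illustration only, the two extrinsic properties can be computed for a query-hit pair roughly as follows. This is a minimal sketch, not the procedure used in the study: the `Gene` record, the 0.8 length-ratio cutoff, the one-gene neighbourhood window, and the `homolog_pairs` set of already accepted homologs are all assumptions introduced here, since the abstract does not define how similar size or conserved gene order were measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; field names are assumptions for this sketch."""
    gene_id: str
    genome: str
    position: int  # index of the gene along its genome
    length: int    # protein length in amino acids (must be > 0)


def similar_length(a, b, min_ratio=0.8):
    """True if the shorter protein is at least min_ratio of the longer one.

    The 0.8 cutoff is an arbitrary illustrative value.
    """
    return min(a.length, b.length) / max(a.length, b.length) >= min_ratio


def conserved_gene_order(query, hit, genes_by_pos, homolog_pairs, window=1):
    """True if a neighbour of the query is an accepted homolog of a neighbour of the hit.

    genes_by_pos maps (genome, position) to Gene; homolog_pairs is a set of
    frozensets of gene ids that are already trusted as homologous.
    """
    offsets = [d for d in range(-window, window + 1) if d != 0]
    for dq in offsets:
        q_nb = genes_by_pos.get((query.genome, query.position + dq))
        if q_nb is None:
            continue
        for dh in offsets:
            h_nb = genes_by_pos.get((hit.genome, hit.position + dh))
            if h_nb is not None and frozenset((q_nb.gene_id, h_nb.gene_id)) in homolog_pairs:
                return True
    return False


def extrinsic_support(query, hit, genes_by_pos, homolog_pairs):
    """Return (similar_length, conserved_gene_order) for one gray zone query-hit pair."""
    return (
        similar_length(query, hit),
        conserved_gene_order(query, hit, genes_by_pos, homolog_pairs),
    )
```

A gray zone pair for which both flags are true would fall in the class the abstract reports as homologous in 99% of cases; a pair for which both are false would fall in the 55% class.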

Conclusion

We have shown that using gene length and gene order drastically improves the detection of homologs within the BLAST gray zone. Our findings suggest that the use of such extrinsic gene properties can also improve the performance of homology detection by more advanced methods, and our study thereby underscores the importance of true data integration for fully exploiting genomic information.