This article is part of the supplement: Proceedings of the 6th International Conference of the Brazilian Association for Bioinformatics and Computational Biology (X-meeting 2010)

Open Access Proceedings

A singular value decomposition approach for improved taxonomic classification of biological sequences

Anderson R Santos1, Marcos A Santos2, Jan Baumbach3, John A McCulloch1, Guilherme C Oliveira4, Artur Silva5, Anderson Miyoshi1 and Vasco Azevedo1*

Author Affiliations

1 Department of General Biology, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Av. Antônio Carlos, 6627, MG, 31.270-901, Brazil

2 Computer Science Department, Instituto de Ciências Exatas, Universidade Federal de Minas Gerais, Belo Horizonte, Av. Antonio Carlos, 6627, 31.270-901, MG, Brazil

3 Max Planck Institute for Informatics, Campus E2 1, Saarbrücken, Germany

4 CEBio and Laboratory of Cellular and Molecular Parasitology, Instituto René Rachou, Oswaldo Cruz Foundation, Belo Horizonte, Av. Augusto de Lima 1715, 30190-002, MG, Brazil

5 Genome and Proteome Network of the State of Pará, Universidade Federal do Pará, Belém, R. Augusto Corrêa, 66.075-110, PA, Brazil


BMC Genomics 2011, 12(Suppl 4):S11  doi:10.1186/1471-2164-12-S4-S11

Published: 22 December 2011



Singular value decomposition (SVD) is a powerful technique for information retrieval; it helps uncover relationships between elements that are not prima facie related. SVD was initially developed to reduce the time needed for information retrieval and analysis of very large data sets in the complex internet environment. Since information retrieval from large-scale genome and proteome data sets has a similar level of complexity, SVD-based methods could also facilitate data analysis in this research area.


We found that SVD applied to amino acid sequences reveals relationships among them and provides a basis for producing clusters and cladograms; the evolutionary relatedness of species shown by these correlates well with Linnaean taxonomy. The choice of a reasonable number of singular values is crucial for SVD-based studies. We found that fewer singular values are needed to produce biologically significant clusters when SVD is employed. Subsequently, we developed a method to determine the lowest number of singular values and the fewest clusters needed to guarantee biological significance; this method was validated by comparison with Linnaean taxonomic classification.
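As an illustration of the general idea only, the following minimal Python sketch builds a peptide-frequency matrix from amino acid sequences, truncates its SVD to a chosen number of singular values, and compares sequences in the reduced space. It is not the pipeline used in this study: the dipeptide features, the cosine similarity measure and the function names kmer_counts, reduced_coordinates and cosine_similarity are all assumptions made for the example.

```python
from itertools import product

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_counts(seq, k=2):
    """Count overlapping k-mers (dipeptides by default) in one sequence."""
    index = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}
    vec = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing non-standard residues
            vec[j] += 1
    return vec


def reduced_coordinates(seqs, rank, k=2):
    """Return one row per sequence in the rank-truncated SVD space."""
    # Rows are peptides, columns are sequences (an assumed layout).
    A = np.column_stack([kmer_counts(s, k) for s in seqs])
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the `rank` largest singular values and their right vectors.
    return (np.diag(s[:rank]) @ Vt[:rank]).T


def cosine_similarity(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.where(norms == 0, 1, norms)
    return Xn @ Xn.T


if __name__ == "__main__":
    # Hypothetical toy sequences, not data from the study.
    toy = ["MKVLAAGIVGLLLAQ", "MKVLAAGIVALLLAQ",
           "MSTNPKPQRKTKRNT", "MSTNPKPQRKTKRNA"]
    print(np.round(cosine_similarity(reduced_coordinates(toy, rank=2)), 3))
```

The resulting similarity matrix could be passed to any standard clustering routine; how many singular values to keep (the rank argument here) is the question the method described above addresses.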


By using SVD, we can reduce uncertainty concerning the appropriate rank value necessary to perform accurate information retrieval analyses. In tests, the clusters that we produced with SVD perfectly matched those expected from Linnaean taxonomy.