Methodology article (Open Access)

A classification approach for genotyping viral sequences based on multidimensional scaling and linear discriminant analysis

Jiwoong Kim(1,2), Yongju Ahn(1,3), Kichan Lee(1), Sung Hee Park(1) and Sangsoo Kim(1)*

Author Affiliations

1 Department of Bioinformatics & Life Sciences, Soongsil University, Seoul, 156-743, Korea

2 Current Address: Equispharm Co., Ltd, Suwon, 443-766, Korea

3 Current Address: Macrogen Inc., Seoul, 153-023, Korea

BMC Bioinformatics 2010, 11:434  doi:10.1186/1471-2105-11-434

Published: 21 August 2010

Abstract

Background

Accurate classification into genotypes is critical to understanding the evolution of divergent viruses. Here we report a new approach, MuLDAS, which classifies a query sequence using statistical genotype models learned from known sequences. MuLDAS therefore uses the full spectrum of well-characterized sequences as references, typically on the order of hundreds, to estimate the significance of each genotype assignment.

Results

MuLDAS starts by aligning the query sequence to the reference multiple sequence alignment and then calculating the distance matrix among all the sequences. The sequences are then mapped to a principal coordinate space by multidimensional scaling, and the coordinates of the reference sequences are used as features in developing linear discriminant models that partition the space by genotype. The genotype of the query is then given as the maximum a posteriori estimate. MuLDAS tests the model confidence by leave-one-out cross-validation and also provides some heuristics for detecting 'outlier' sequences that fall far outside or in between genotype clusters. We tested our method by classifying HIV-1 and HCV nucleotide sequences downloaded from NCBI GenBank, achieving overall concordance rates of 99.3% and 96.6%, respectively, with the benchmark test datasets retrieved from the respective databases of Los Alamos National Laboratory.
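The pipeline described above (distance matrix, multidimensional scaling, linear discriminant analysis, maximum a posteriori assignment) can be illustrated with a minimal sketch. This is not the MuLDAS implementation. It assumes pre-aligned sequences of equal length, a simple p-distance (fraction of mismatched sites), classical MDS by eigendecomposition, and a shared-covariance LDA with class-frequency priors. All function names, the toy sequences, and the choice of two dimensions are hypothetical.

```python
import numpy as np

def p_distance_matrix(seqs):
    """Pairwise fraction of mismatched sites (assumes equal-length, aligned sequences)."""
    n = len(seqs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mismatches = sum(a != b for a, b in zip(seqs[i], seqs[j]))
            dist[i, j] = dist[j, i] = mismatches / len(seqs[i])
    return dist

def classical_mds(dist, k):
    """Classical MDS (principal coordinates): embed an n x n distance matrix in k dimensions."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:k]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))

def lda_posterior(train_x, train_y, query_x):
    """Shared-covariance LDA; returns the class labels and the query's posterior over them."""
    classes = np.unique(train_y)
    n, d = train_x.shape
    pooled = np.zeros((d, d))
    means, priors = [], []
    for c in classes:
        xc = train_x[train_y == c]
        mu = xc.mean(axis=0)
        means.append(mu)
        priors.append(len(xc) / n)
        pooled += (xc - mu).T @ (xc - mu)
    pooled /= n - len(classes)
    inv = np.linalg.pinv(pooled)
    # Linear discriminant score per class: x' S^-1 mu - 0.5 mu' S^-1 mu + log(prior)
    scores = np.array([query_x @ inv @ mu - 0.5 * mu @ inv @ mu + np.log(p)
                       for mu, p in zip(means, priors)])
    scores -= scores.max()  # stabilize the exponentials
    post = np.exp(scores)
    return classes, post / post.sum()

def classify_query(ref_seqs, ref_genotypes, query_seq, k=2):
    """Embed references and query together, fit LDA on the references, return the MAP genotype."""
    dist = p_distance_matrix(list(ref_seqs) + [query_seq])
    coords = classical_mds(dist, k)
    classes, post = lda_posterior(coords[:-1], np.array(ref_genotypes), coords[-1])
    return str(classes[np.argmax(post)]), {str(c): float(p) for c, p in zip(classes, post)}

# Toy, made-up reference set: two well-separated "genotypes".
refs = ["AAAAAAAAAAAA", "AAAAAAAAAAAC", "AAAAAAAAAACA", "AAAAAAAAACAA",
        "GGGGGGGGGGGG", "GGGGGGGGGGGT", "GGGGGGGGGGTG", "GGGGGGGGGTGG"]
labels = ["A"] * 4 + ["B"] * 4
genotype, posterior = classify_query(refs, labels, "GGGGGGGGTGGG")
print(genotype, posterior)
```

In this sketch the query is embedded together with the references, so the MDS coordinates must be recomputed for every query. A real pipeline would also need an evolutionary distance model, a choice of the number of dimensions, and the cross-validation and outlier checks mentioned above, none of which are shown here.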

Conclusions

The highly accurate genotype assignment, coupled with several measures for evaluating the results, makes MuLDAS useful in analyzing the sequences of rapidly evolving viruses such as HIV-1 and HCV. A web-based genotype prediction server is available at http://www.muldas.org/MuLDAS/.