This article is part of the supplement: Biodiversity Informatics
Rapid DNA barcoding analysis of large datasets using the composition vector method
1 Department of Biology, The Chinese University of Hong Kong, Hong Kong, PR China
2 Molecular Biotechnology Programme, The Chinese University of Hong Kong, Hong Kong, PR China
3 Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, NC28223, USA
BMC Bioinformatics 2009, 10(Suppl 14):S8 doi:10.1186/1471-2105-10-S14-S8
Published: 10 November 2009
Sequence alignment is the rate-limiting step in constructing profile trees for DNA barcoding purposes. We recently demonstrated the feasibility of using unaligned rRNA sequences as barcodes based on a composition vector (CV) approach without sequence alignment (Bioinformatics 22:1690). Here, we further explored the grouping effectiveness of the CV method in large DNA barcode datasets (COI, 18S and 16S rRNA) from a variety of organisms, including birds, fishes, nematodes and crustaceans.
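To make the alignment-free idea concrete, the following is a minimal Python sketch of a composition vector comparison. It is an illustration under stated assumptions and not the authors' implementation: it assumes the commonly described CV formulation, in which each k-mer frequency is corrected by a background frequency predicted from the (k-1)-mers and (k-2)-mers, and the distance between two sequences is (1 - C)/2, where C is the cosine correlation of their vectors. The function names, the default k = 5, and the restriction to k-mers actually observed in a sequence are simplifications chosen for this example.

```python
from collections import Counter
from math import sqrt


def kmer_freqs(seq, k):
    """Relative frequencies of all overlapping k-mers in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def composition_vector(seq, k=5):
    """Map each observed k-mer to (p - p0) / p0.

    p  is the observed k-mer frequency.
    p0 is the frequency predicted from the (k-1)-mer and (k-2)-mer
       frequencies (assumed Markov-type background; requires k >= 3).
    Simplification: k-mers that never occur in seq are left out.
    """
    if k < 3:
        raise ValueError("k must be at least 3")
    seq = seq.upper()
    p_k = kmer_freqs(seq, k)
    p_k1 = kmer_freqs(seq, k - 1)
    p_k2 = kmer_freqs(seq, k - 2)
    vec = {}
    for kmer, p in p_k.items():
        p0 = p_k1[kmer[:-1]] * p_k1[kmer[1:]] / p_k2[kmer[1:-1]]
        vec[kmer] = (p - p0) / p0
    return vec


def cv_distance(vec_a, vec_b):
    """Distance (1 - C) / 2, with C the cosine correlation of two vectors."""
    dot = sum(v * vec_b.get(kmer, 0.0) for kmer, v in vec_a.items())
    norm_a = sqrt(sum(v * v for v in vec_a.values()))
    norm_b = sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Correlation is undefined for an all-zero vector; treat it as 0.
        return 0.5
    return (1.0 - dot / (norm_a * norm_b)) / 2.0
```

Because each sequence is reduced to a vector independently, no pairwise or multiple alignment is needed; the resulting distance matrix could then be passed to any neighbor-joining (NJ) implementation to build the tree.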
Our results indicate that the grouping of taxa at the genus/species levels based on the CV/NJ approach is invariably consistent with the trees generated by traditional approaches, although in some cases the clustering among higher groups might differ. Furthermore, the CV method is always much faster than the K2P method routinely used in constructing profile trees for DNA barcoding. For instance, the alignment of 754 COI sequences (average length 649 bp) from fishes took more than ten hours to complete, while the whole tree construction process using the CV/NJ method required no more than five minutes on the same computer.
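For comparison, the K2P (Kimura two-parameter) distance used in the traditional pipeline is computed from sequences that have already been aligned, which is where the alignment cost arises. The sketch below uses the standard K2P formula, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of transitions and transversions. The function name and the choice to skip gaps and ambiguous bases are assumptions made for this illustration, not details taken from the article.

```python
from math import log

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def k2p_distance(seq_a, seq_b):
    """K2P distance between two pre-aligned sequences of equal length.

    Sites with gaps or ambiguous bases are skipped (an assumption of
    this sketch). math.log raises ValueError when the sequences are too
    divergent for the formula to be defined.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in BASES or y not in BASES:
            continue
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * log(1 - 2 * p - q) - 0.25 * log(1 - 2 * q)
```

Unlike the CV distance, this function cannot be applied until every sequence in the dataset has been placed in a common alignment.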
The CV method groups DNA barcode sequences effectively when compared with K2P analysis of aligned sequences. It also reduced the time required for analysis by more than 15-fold, making it a far superior method for analyzing large datasets. We conclude that the CV method is fast and reliable for analyzing large datasets for DNA barcoding purposes.