Research article

Alignment and clustering of phylogenetic markers - implications for microbial diversity studies

James R White (1,3), Saket Navlakha (2,3), Niranjan Nagarajan (4), Mohammad-Reza Ghodsi (2,3), Carl Kingsford (2,3) and Mihai Pop (2,3)*

Author Affiliations

1 Applied Mathematics and Scientific Computation Program, University of Maryland - College Park, College Park, MD, 20742, USA

2 Department of Computer Science, University of Maryland - College Park, College Park, MD, 20742, USA

3 Center for Bioinformatics and Computational Biology, University of Maryland - College Park, College Park, MD, 20742, USA

4 Computational and Mathematical Biology Program, Genome Institute of Singapore, 138672, Singapore

BMC Bioinformatics 2010, 11:152  doi:10.1186/1471-2105-11-152

Published: 24 March 2010

Abstract

Background

Molecular studies of microbial diversity have provided many insights into the bacterial communities inhabiting the human body and the environment. A common first step in such studies is a survey of conserved marker genes (primarily 16S rRNA) to characterize the taxonomic composition and diversity of these communities. To date, however, the analysis methods employed in these studies vary considerably.

Results

Here we provide a critical assessment of current analysis methodologies that cluster sequences into operational taxonomic units (OTUs) and demonstrate that small changes in algorithm parameters can lead to significantly varying results. Our analysis provides strong evidence that the species-level diversity estimates produced using common OTU methodologies are inflated due to overly stringent parameter choices. We further describe an example of how semi-supervised clustering can produce OTUs that are more robust to changes in algorithm parameters.

Conclusions

Our results highlight the need for systematic and open evaluation of data analysis methodologies, especially as targeted 16S rRNA diversity studies increasingly rely on high-throughput sequencing technologies. All data and results from our study are available through the JGI FAMeS website (http://fames.jgi-psf.org/).