Open Access Highly Accessed Research article

Beyond classification: gene-family phylogenies from shotgun metagenomic reads enable accurate community analysis

Samantha J Riesenfeld (1,*) and Katherine S Pollard (1,2)

Author Affiliations

1 Gladstone Institute of Cardiovascular Disease, University of California, San Francisco, CA, 94158, USA

2 Division of Biostatistics and Institute for Human Genetics, University of California, San Francisco, CA, 94158, USA

BMC Genomics 2013, 14:419  doi:10.1186/1471-2164-14-419

Published: 22 June 2013

Abstract

Background

Sequence-based phylogenetic trees are a well-established tool for characterizing the diversity of both macroorganisms and microorganisms. Phylogenetic methods have recently been applied to shotgun metagenomic data from microbial communities, particularly with the aim of classifying reads. However, the accuracy of gene-family phylogenies that characterize evolutionary relationships among short, non-overlapping sequencing reads has not been thoroughly evaluated.

Results

To quantify errors in metagenomic read trees, we developed MetaPASSAGE, a software pipeline that generates in silico bacterial communities, simulates a sample of shotgun reads from a gene family represented in the community, orients or translates the reads, and produces a profile-based alignment of the reads from which a gene-family phylogenetic tree can be built. We applied MetaPASSAGE to a variety of RNA and protein-coding gene families, built trees using a range of phylogenetic methods, and compared the resulting trees using topological and branch-length error metrics. We identified read length as one of the major sources of error. Because phylogenetic methods use a reference database of full-length sequences from the gene family to guide the construction of alignments and trees, we found that error can also be substantially reduced by increasing the size and diversity of the reference database. Finally, UniFrac analysis, which compares metagenomic samples based on a summary statistic computed over all branches in a read tree, is highly robust to the level of error we observe.
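
To make the idea of a "summary statistic computed over all branches in a read tree" concrete, the following is a minimal sketch of the commonly used unweighted form of UniFrac on a toy tree. It is not taken from MetaPASSAGE or from the authors' analysis: the function name, the dictionary-based tree representation, the read names, and the branch lengths are all assumptions made for illustration. The statistic is the fraction of branch length leading to reads from only one of the two samples, out of all branch length leading to reads from either sample.

```python
def unweighted_unifrac(parent, length, leaf_samples, sample_a, sample_b):
    """Unweighted UniFrac distance between two samples on a rooted tree.

    parent:       dict mapping each non-root node to its parent node
    length:       dict mapping each non-root node to the length of the
                  branch connecting it to its parent
    leaf_samples: dict mapping each leaf (read) to the sample it came from
    (This representation is a hypothetical one chosen for brevity.)
    """
    # For each branch (identified by its child node), collect the samples
    # that have at least one read below it.
    samples_below = {}
    for leaf, sample in leaf_samples.items():
        if sample not in (sample_a, sample_b):
            continue
        node = leaf
        while node in parent:  # walk up until the root is reached
            samples_below.setdefault(node, set()).add(sample)
            node = parent[node]

    # Branch length leading to reads from exactly one sample.
    unique = sum(length[n] for n, s in samples_below.items() if len(s) == 1)
    # Branch length leading to reads from either sample.
    total = sum(length[n] for n in samples_below)
    return unique / total if total else 0.0


# Toy read tree: root -> X, Y; X -> r1, r2; Y -> r3, r4.
toy_parent = {"X": "root", "Y": "root",
              "r1": "X", "r2": "X", "r3": "Y", "r4": "Y"}
toy_length = {"X": 0.5, "Y": 0.5,
              "r1": 1.0, "r2": 1.0, "r3": 1.0, "r4": 1.0}
toy_reads = {"r1": "A", "r2": "B", "r3": "A", "r4": "A"}

distance = unweighted_unifrac(toy_parent, toy_length, toy_reads, "A", "B")
print(distance)
```

Because the statistic sums over every branch, an error in the placement of a single read changes only a small share of the total, which is one intuitive reason such a summary could tolerate moderate tree error.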

Conclusions

Bacterial community diversity can be quantified using phylogenetic approaches applied to shotgun metagenomic data. As sequencing reads get longer and more genomes across the bacterial tree of life are sequenced, the accuracy of this approach will continue to improve, opening the door to more applications.

Keywords:
Phylogenetics; Metagenomics; Simulations