Open Access Highly Accessed Research article

Beyond classification: gene-family phylogenies from shotgun metagenomic reads enable accurate community analysis

Samantha J Riesenfeld1* and Katherine S Pollard1,2

Author Affiliations

1 Gladstone Institute of Cardiovascular Disease, University of California, San Francisco, CA, 94158, USA

2 Division of Biostatistics and Institute for Human Genetics, University of California, San Francisco, CA, 94158, USA

BMC Genomics 2013, 14:419  doi:10.1186/1471-2164-14-419

Published: 22 June 2013

Additional files

Additional file 1: Table S1:

Sequence accession numbers. 16S rRNA sequences were obtained from the RDP on September 1, 2009 as part of a hand-curated alignment; a larger set of 1,071 16S rRNA sequences, used only for the Fast UniFrac analysis, was downloaded from the RDP on November 9, 2010. Amino acid sequences for the rpoB, rpsB, and dnaG families were obtained via AMPHORA, and the corresponding DNA sequences were downloaded from NCBI GenBank on August 22, 2009 (rpoB), May 17, 2011 (rpsB), and June 10, 2011 (dnaG). For lolC, amino acid sequences were downloaded from UniProt on February 16, 2011, and the corresponding DNA sequences were downloaded from EMBL-EBI on March 2, 2011.

Format: XLSX Size: 66KB

Open Data

Additional file 2: Figure S1:

Trends in topological error are similar across gene families. For all gene families, topological error in read trees is inversely related to both reference database size and read length, and grows with the number of reads. In each panel, the nRF measure is averaged over 30 simulations for each combination of simulation parameters. Vertical error bars show one standard deviation above and below the mean. (Data for the rpoB family are shown in Figure 2.)
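The summary used in these panels (a mean over replicate simulations, with error bars one standard deviation above and below it) can be sketched as follows. This is only an illustration, not the authors' code: the function name `error_bar` and the example values are hypothetical, and the use of the sample standard deviation is an assumption, since the caption does not say which estimator was used.

```python
from statistics import mean, stdev


def error_bar(values):
    """Return (mean, lower, upper) for one parameter combination.

    `values` holds one error measurement (e.g. nRF) per replicate
    simulation. The bar spans one standard deviation on each side of
    the mean. ASSUMPTION: sample standard deviation (n - 1 denominator).
    """
    m = mean(values)
    s = stdev(values)
    return m, m - s, m + s


# Hypothetical nRF values from three replicate simulations.
replicate_errors = [0.2, 0.4, 0.6]
centre, lower, upper = error_bar(replicate_errors)
print(centre, lower, upper)
```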

Format: PDF Size: 162KB

Additional file 3: Figure S2:

DF distributions varied across gene families, but trends were similar. Trends in the variation of DF quartiles with respect to reference database size, mean read length, and phylogenetic method were very similar across gene families, despite differences in their actual values. Each panel shows the mean values of the DF median, first quartile, and third quartile, averaged over 30 simulations for each parameter combination with 200 reads. Vertical error bars show one standard deviation above and below the mean. (Data for the rpoB family are shown in Figure 3.)

Format: PDF Size: 197KB

Additional file 4: Figure S3:

Quantifying error using the nBS measure showed patterns similar to those seen with the nRF measure. While the absolute error measured by nBS differed from that of nRF (Figure 2), the patterns across parameter values were very similar. In each panel, the error measure is averaged over 30 simulations for each combination of simulation parameters. Vertical error bars show one standard deviation above and below the mean. Data for the rpoB family are shown in both panels. Similar trends were observed for other gene families.

Format: PDF Size: 113KB

Additional file 5: Figure S4:

Phylogenetic diversity of large reference database is weakly inversely correlated with topological error. The phylogenetic diversity of each reference database was determined by summing all branch lengths in a phylogenetic tree inferred via RAxML from the sequences in that database. Due to their construction (see Methods), our simulated reference databases all have greater diversity than is likely to be present in real reference databases. Each point is the mean of the nRF error over 10 simulations, for 400-bp mean read length and 200 reads. The shaded region represents the 95% confidence interval.
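As a rough illustration of the diversity measure described above, the sketch below sums the branch lengths of a tree supplied as a Newick string, such as one exported from a tree-inference program. It is not the authors' implementation: the function name `phylogenetic_diversity` and the example tree are hypothetical, and the regex-based parsing assumes unquoted labels that contain no colons.

```python
import re

# A branch length in Newick follows a colon, e.g. "A:0.1" or "):0.05".
# ASSUMPTION: labels are unquoted and contain no colons, so every
# colon in the string introduces a branch length.
_BRANCH_LENGTH = re.compile(r":\s*([0-9]*\.?[0-9]+(?:[eE][+-]?[0-9]+)?)")


def phylogenetic_diversity(newick: str) -> float:
    """Sum all branch lengths (tip and internal) in a Newick tree string."""
    return sum(float(length) for length in _BRANCH_LENGTH.findall(newick))


# Hypothetical three-taxon tree, used only to show the calculation.
example_tree = "((A:0.1,B:0.2):0.05,C:0.3);"
print(phylogenetic_diversity(example_tree))
```

For trees with quoted labels or embedded comments, a dedicated Newick parser would be a safer choice than a regular expression.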

Format: PDF Size: 208KB

Additional file 6: Figure S5:

Tip branch lengths have greater error than internal branch lengths. DF quartiles of tip branches are more extreme than those of internal branches and are affected more by read length, especially in the case of the small reference database. Each panel shows the mean values of the DF median, first quartile, and third quartile, averaged over 30 simulations for each parameter combination with 200 reads, for the rpoB family. Vertical error bars show one standard deviation above and below the mean.

Format: PDF Size: 149KB

Additional file 7: Figure S6:

RAxML performs similarly regardless of whether it is given a fixed reference tree. Despite the more restricted optimization landscape offered by a fixed reference tree, there was little detectable difference in performance in our simulations. Here, data for the 16S rRNA gene family are shown. We plot the mean nRF (left) and mean DF quartiles for simulations with 200 reads (right), over 30 simulations for each combination of simulation parameters. Vertical error bars show one standard deviation above and below the mean.

Format: PDF Size: 143KB

Additional file 8: Supplementary Methods:

Details about software implementation and the methods used in the simulations, phylogenetic inference, error evaluation, and analyses.

Format: DOC Size: 75KB