Table 1

Gene families used in simulations
Gene name   Median length   Rate of amino acid evolution   Number of sequences   Source of sequences   Source of profile
16S rRNA    1535 bp         NA (Highly conserved)          427                   RDP                   RDP (INFERNAL)
rpoB        1296 aa         73.51                          460                   AMPHORA + GenBank     AMPHORA (HMMER)
rpsB        226 aa          51.96                          411                   AMPHORA + GenBank     AMPHORA (HMMER)
dnaG        395 aa          112.53                         456                   AMPHORA + GenBank     AMPHORA (HMMER)
lolC        411 aa          184.04                         442                   UniProt + GenBank     PhyloFacts (HMMER)

Each family of gene sequences was limited to its unique representatives among AMPHORA taxa (see Methods). Rate of amino acid evolution was determined by summing all branch lengths in a phylogenetic tree inferred via RAxML from the protein sequences; smaller values indicate fewer substitutions and greater conservation. The 16S rRNA gene requires a nucleotide model of evolution, so its rate is not comparable to the amino acid rates; the gene is well known to be highly conserved overall, with interspersed variable regions. 16S rRNA sequences were obtained from the Ribosomal Database Project (RDP) [20]. A larger set of 1,071 16S rRNA sequences was used only for the Fast UniFrac analysis (see Additional file 1: Table S1). Amino acid sequences for the rpoB, rpsB, and dnaG families were obtained via AMPHORA [14], while the corresponding DNA sequences were downloaded from NCBI GenBank [21]. For lolC, family members were determined by PhyloFacts [22] (family accession bpg052966 as of February 16, 2011); amino acid sequences were downloaded from UniProt [23], and the corresponding DNA sequences were downloaded from EMBL-EBI [24]. Additional file 1: Table S1 provides download dates and sequence accession numbers.
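
The rate of amino acid evolution reported in the table is a sum of branch lengths over a tree. The following is a minimal sketch of that summation, not the authors' actual script: it assumes the tree is available as a Newick string with unquoted taxon labels, and the function name `total_branch_length` and the toy tree are hypothetical illustrations.

```python
import re

# A branch length in Newick is a number that follows a colon,
# optionally in scientific notation (e.g. ":0.12" or ":3e-4").
# Assumption: taxon labels are unquoted and contain no colons.
_BRANCH_LENGTH = re.compile(r":\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def total_branch_length(newick: str) -> float:
    """Return the sum of all branch lengths found in a Newick string."""
    return sum(float(value) for value in _BRANCH_LENGTH.findall(newick))


# Hypothetical four-taxon tree, used only to illustrate the calculation.
toy_tree = "((A:0.10,B:0.20):0.05,(C:0.30,D:0.40):0.15);"
print(total_branch_length(toy_tree))
```

For the toy tree the six branch lengths add up to approximately 1.2 substitutions per site. A tree with more taxa or longer branches gives a larger total, which is why the values are most meaningful when compared across families with similar numbers of sequences, as in the table.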

Riesenfeld and Pollard BMC Genomics 2013 14:419 doi:10.1186/1471-2164-14-419