Department of Pediatrics, Washington University School of Medicine, Saint Louis, USA

Molecular Microbiology and Microbial Pathogenesis Program, Washington University School of Medicine, Saint Louis, USA

Institute of Hygiene, University of Muenster, Muenster, Germany

Department of Mathematics, Washington University, St. Louis, USA

Microbial Evolution Laboratory, National Food Safety and Toxicology Center, Michigan State University, East Lansing, USA

Abstract

Background

Results

The phylogeny of

Conclusions

Segment-dependent phylogenies most likely are legacies of a complex recombination history. However,

Background

For many years, our understanding of the phylogeny of

Multilocus sequence typing, which uses allelic variations in a sample of housekeeping genes distributed around the chromosome, and whole genome sequencing have been increasingly used to study

We attempted to produce a more cogent picture of the emergence of

**Table S1**. Strains Used.


Results

Phylogenetic topology of

In most topologies (Figure


**Phylogenetic Topologies**. Various phylogenetic topologies are assigned to Segments 1, 2, 3, and 4 (rows) by SD, NJ, ME, and MP methods (columns). Congruent topologies are displayed within conjoined panels. 'O' represents the outgroup,

**Figure S1**. Topologies generated by various analyses from each Segment.


The choice of segment influenced the inferred topology to a greater extent than did the method used to construct the phylogeny. This is surprising, because phylogeny should be a property of organisms, and not vary as a function of the DNA segment scrutinized. Most likely, circumchromosomal datasets produce net topologies weighted by the differing evolutionary and recombination histories of components of the chromosome. In other words, the phylogenetic history of

Inter-Group recombination

Next, we used GENECONV


**Recombination between groups**. Strains studied in the analyses of Segments 1 and 2 (Panel A) or Segments 3 and 4 (Panel B) are listed along the x- and y-axes, assembled as groups. Red (Segment 1 and Segment 3) and blue (Segment 2 and Segment 4) numbers within boxes correspond to fragments identified by GENECONV as having been transferred by recombination. The fragments from Segments 1, 2, 3, and 4 are portrayed in Figures S2A, S2B, S2C, and S2D, respectively (Additional File

**Figure S2**. Fragments identified as being subjected to conversion.


We used three increasingly stringent tiers of analysis to determine if the exchanges between Groups occurred randomly (portrayed in Figure


**Inter-group conversions, portrayed by tiers**. Groups are portrayed in white circles. Bidirectional arrows between groups reflect over- (white) or under- (black) represented conversions, if p-values are < 0.05 (Tiers 1 and 2) or < 0.10 (Tier 3). The thickness of each white arrow is proportional to its observed:expected ratio. The thickness of each black arrow is proportional to the expected:observed ratio, but expected values of 0 are assigned an arbitrary value of 1 and arrow thicknesses are capped at an expected:observed ratio of 7.5:1. Adjacent to the arrows are the observed and expected conversions, chi-square values, and p-values. Further details regarding expected and observed inter-group conversions are in Table S2 (Additional File

**Table S2**. Conversion events identified by GENECONV.


Intra-group recombination was more frequent than inter-group exchange, relative to the number of pairing opportunities. Among the 258 intra-group and 772 inter-group strain-to-strain pairing opportunities, GENECONV identified 40 (expected 34), 26 (expected 18), and 10 (expected 5) intra-group and 95 (expected 101), 47 (expected 55), and 9 (expected 14) inter-group recombination events for tier 1, 2, and 3 exchanges, respectively. The chi-square values and two-tailed approximate P values for the tier 1, 2, and 3 inter- vs. intra-group comparisons are 1.415 (P = 0.23), 4.719 (P = 0.03), and 6.786 (P = 0.009), respectively.
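The chi-square values reported above can be checked with a short calculation. The following is a minimal sketch, not the authors' own script: it assumes a Pearson goodness-of-fit statistic over the two categories (intra-group and inter-group) with one degree of freedom, and it uses the rounded expected counts exactly as printed in the text. The variable and function names are illustrative.

```python
import math

# (observed, expected) counts for each tier, copied from the text above.
# Expected counts are the rounded values reported there (an assumption of this sketch).
tiers = {
    1: {"intra": (40, 34), "inter": (95, 101)},
    2: {"intra": (26, 18), "inter": (47, 55)},
    3: {"intra": (10, 5), "inter": (9, 14)},
}


def pearson_chi_square(cells):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum((obs - exp) ** 2 / exp for obs, exp in cells)


def p_value_one_df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


results = {}
for tier, counts in tiers.items():
    chi2 = pearson_chi_square(counts.values())
    results[tier] = (chi2, p_value_one_df(chi2))
    print(f"Tier {tier}: chi-square = {chi2:.3f}, P = {results[tier][1]:.3f}")
```

Under these assumptions the sketch gives chi-square values that agree with the reported 1.415, 4.719, and 6.786 to three decimal places.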

Discussion

Our data prompt two questions: First, how can the robust recombination that occurred in

The durability of the nonrandom exchange of DNA between groups could determine the fate of

The disproportionately high intra-group recombination rates strengthen the case for highly restricted recombination networks between sets of organisms, as suggested by other investigators. For example, the patterns in Figure

The appropriateness of defining bacterial species based on net DNA homology has been questioned

Our study has several limitations. It is possible that the predominantly human origin of our strain set introduced biases. However, isolation of the

Conclusion

It is currently problematic to use circumchromosomal sequence data to develop an unambiguous emergence topology for

Methods

Strains

For our initial strain set, we selected 16 strains from ECOR groups A, B1, D, and B2, five fully sequenced

Choosing, Validating, and sequencing Extended Segments

We used a subset of

Segments 1 and 2 were sequenced (from nucleotide positions 1,084,356 to 1,110,604 and 2,368,707 to 2,393,879, respectively) in eight ECOR strains (two each from groups A, B1, B2, and D) (Additional File

Sequenced amplicons for each strain were assembled into contigs using the SeqMan Pro program (Lasergene v.3 DNASTAR software suite). Regions that failed to amplify and multi-nucleotide insertions or deletions were not included in the final concatenated assembly. Single nucleotide indels and SNPs occurring in only one strain were verified by visualizing the original trace data. The sequences from the amplicons that were successfully sequenced in every strain and for which there was orthologous sequence in the published genomes were concatenated using Lasergene's EditSeq program and aligned by ClustalW in Molecular Evolutionary Genetics Analysis (MEGA) software v.4.0

**Table S3**. Sequence alignment.


We constructed phylogenetic models using Neighbor Joining (NJ), Minimum Evolution (ME) and Maximum Parsimony (MP) analyses in MEGA v.4.0 software

Statistics

We used the Pearson chi-square statistic in a permutation-like simulation test to determine the statistical significance of the differences between observed and expected inter-group recombination frequencies. For expected counts, we assumed that each of the 166 (Segments 1 and 2) or 292 (Segments 3 and 4) inter-strain pairings is equally likely to be involved in a gene conversion. The relative probability of a between-group gene conversion for each segment is therefore proportional to the product of the numbers of strains in the two corresponding groups. Expected and simulated counts are conditional on the total number of observed counts in each segment, and observed and expected numbers are summed over segments for each pair of groups. For example, if there are 10, 20, 30, and 40 total inter-group conversions in the four segments, respectively, and if Group X has five studied strains and Group Y has six studied strains for Segments 1 and 2, and four and five, respectively, for Segments 3 and 4, then there would be (10+20) × (5 × 6)/166 + (30+40) × (4 × 5)/292 (approximately 10.2) expected gene conversions between Groups X and Y. The Pearson chi-square statistic, which is a higher-dimensional analog of the Cochran-Mantel-Haenszel (CMH) test statistic ^{6 }simulated count sets. In each simulation, the observed recombination events for each segment were randomly reassigned to pairs of groups according to the expected probabilities for that segment, specifically by simulating the values of a multinomial distribution for each segment. The simulated counts were summed across the four segments and the Pearson test score recomputed. The p-value for biased between-group recombination rates across segments is estimated as the proportion of simulations for which the randomized test score was greater than or equal to the observed test score.
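The procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code. The first line is the worked example from the text. Everything else uses hypothetical toy inputs: three groups (X, Y, Z), made-up group sizes, made-up observed counts, and a small number of simulations, so the denominators differ from the 166 and 292 pairings of the real data set. All function and variable names are illustrative.

```python
import random
from itertools import combinations

# Worked example from the text: expected conversions between Groups X and Y.
worked_example = (10 + 20) * (5 * 6) / 166 + (30 + 40) * (4 * 5) / 292


def pair_probabilities(group_sizes):
    """Probability of each group pair, assuming every inter-group
    strain pairing is equally likely (weight = product of group sizes)."""
    weights = {(a, b): group_sizes[a] * group_sizes[b]
               for a, b in combinations(sorted(group_sizes), 2)}
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


def expected_counts(segment_totals, segment_probs):
    """Expected counts per group pair, summed over segments."""
    expected = {}
    for total, probs in zip(segment_totals, segment_probs):
        for pair, p in probs.items():
            expected[pair] = expected.get(pair, 0.0) + total * p
    return expected


def pearson(counts, expected):
    """Pearson chi-square score of counts against expected values."""
    return sum((counts.get(pair, 0) - e) ** 2 / e for pair, e in expected.items())


def simulate_p_value(observed, segment_totals, segment_probs, n_sim=2000, seed=0):
    """Proportion of simulated scores that are >= the observed score."""
    rng = random.Random(seed)
    expected = expected_counts(segment_totals, segment_probs)
    observed_score = pearson(observed, expected)
    hits = 0
    for _ in range(n_sim):
        simulated = {}
        # One multinomial draw per segment, conditional on that segment's total.
        for total, probs in zip(segment_totals, segment_probs):
            pairs = list(probs)
            draws = rng.choices(pairs, weights=[probs[p] for p in pairs], k=total)
            for pair in draws:
                simulated[pair] = simulated.get(pair, 0) + 1
        if pearson(simulated, expected) >= observed_score:
            hits += 1
    return observed_score, hits / n_sim


# Hypothetical toy inputs (not data from the study).
sizes_seg12 = {"X": 5, "Y": 6, "Z": 4}
sizes_seg34 = {"X": 4, "Y": 5, "Z": 6}
segment_totals = [10, 20, 30, 40]
segment_probs = ([pair_probabilities(sizes_seg12)] * 2
                 + [pair_probabilities(sizes_seg34)] * 2)
observed = {("X", "Y"): 55, ("X", "Z"): 25, ("Y", "Z"): 20}

expected = expected_counts(segment_totals, segment_probs)
score, p_value = simulate_p_value(observed, segment_totals, segment_probs)
print(f"Worked example: {worked_example:.2f} expected X-Y conversions")
print(f"Toy data: Pearson score = {score:.2f}, simulated p-value = {p_value:.4f}")
```

Drawing each segment separately keeps the simulated counts conditional on the per-segment totals, as the text requires, while the test score is computed only on the counts summed across segments.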

The chi-square test was used to test the significance of the observed difference in inter- and intra-group recombination frequency. The total observed recombination events and possible recombination opportunities (inter-group and intra-group) were enumerated for each tier in each of the two categories. Group E was not included in the analysis because of the paucity of group E strains studied, as noted above.

Authors' contributions

SRL designed analytical strategies, selected strains, performed most analyses, and wrote the text. SAS proposed mathematical techniques and performed statistical modeling. TSW helped SRL formulate the hypothesis that phylogeny as derived from multilocus sequence typing was inaccurate. PIT reviewed and approved analytical strategies, interpreted the data, assisted in writing the manuscript, and obtained funding for this project. TSW died before this manuscript was submitted; all other authors read and approved the final manuscript.

Acknowledgements

We thank Christine Musser and Alison Griffith for manuscript preparation assistance; Katie E. Hyma, Steve Moseley, Eduardo Groisman, and Peter Tarr for comments; Nurmohammad Shaikh and Harry Stevens for processing data; and Lucinda Fulton and Rachel Abbott for assistance with our sequencing. We wish to acknowledge a Reviewer who suggested an alternative explanation for our findings, which we have incorporated into the Discussion. This work was supported by NIH Grants R56AI063282 (to P.I.T.), 5T32AI007172 (to S.R.L.), and 5P30 DK052574 (to the Washington University Digestive Diseases Research Core Center); Contract N01-AI-30058 (to T.S.W.) (strains are deposited at the Michigan State University STEC Center, supported by this contract); and the Melvin E. Carnahan Professorship of Pediatrics (to P.I.T.).