New analysis for consistency among markers in the study of genetic diversity: development and application to the description of bacterial diversity

Pavoine, Sandrine; Bailly, Xavier

doi:10.1186/1471-2148-7-156

Methodology article
Open access
Published: 03 September 2007


BMC Evolutionary Biology volume 7, Article number: 156 (2007)


Abstract

Background

The development of post-genomic methods has dramatically increased the amount of qualitative and quantitative data available to understand how ecological complexity is shaped. Yet, new statistical tools are needed to use these data efficiently. In support of sequence analysis, diversity indices were developed to take into account both the relative frequencies of alleles and their genetic divergence. Furthermore, a method for describing inter-population nucleotide diversity has recently been proposed and named the double principal coordinate analysis (DPCoA), but this procedure can only be used with one locus. In order to tackle the problem of measuring and describing nucleotide diversity with more than one locus, we developed three versions of multiple DPCoA by using three ordination methods: multiple co-inertia analysis, STATIS, and multiple factorial analysis.

Results

This combination of methods allows i) testing and describing differences in patterns of inter-population diversity among loci, and ii) defining the best compromise among loci. These methods are illustrated by the analysis of both simulated data sets, which include ten loci evolving under a stepping stone model and a locus evolving under an alternative population structure, and a real data set focusing on the genetic structure of two nitrogen fixing bacteria, which is influenced by geographical isolation and host specialization. All programs needed to perform multiple DPCoA are freely available.

Conclusion

Multiple DPCoA allows the evaluation of the impact of various loci in the measurement and description of diversity. This method is general enough to handle a large variety of data sets. It complements existing methods such as the analysis of molecular variance or other analyses based on linkage disequilibrium measures, and is very useful to study the impact of various loci on the measurement of diversity.

Background

The exponential increase in sequencing capacity is modifying the way genetic diversity is assessed. For instance, multilocus sequencing (MLS) now allows the estimation of genetic relatedness among microorganisms for both housekeeping genes and accessory genes such as virulence or symbiotic determinants [1]. Thus, several publications have reported complex MLS schemes studying more than ten genes located in different genomic regions and involved in various metabolic pathways. These studies have indicated the influence of various parameters, such as recombination rate [2] or epidemiological traits [3], on the diversification of bacterial populations. Furthermore, recent progress in sequencing technologies suggests that ever more sequence data will be available to study questions related to community ecology in the near future [4]. New statistical methodologies should therefore be developed to deal with the complexity of the data sets that will be produced. One of the main problems raised by the increase in sequence information is the assessment of congruence among population structures depicted by different molecular markers [5]. In bacterial lineages, especially those in which sex is common, the diversity of each locus could be shaped by the gain/loss of genes, gene flow boundaries and specific selective pressures [6]. The problems that can arise from the overall analysis of an MLS data set in which loci do not share congruent evolutionary constraints include, among others, misleading inferences of genetic relatedness and phylogenetic relationships [7] or overestimation of linkage disequilibrium [8].

Bacterial isolates that are characterized by MLS usually belong to several genetic groups (i.e. species or populations), which can be defined according to the sampling strategy or according to more refined methodologies [9]. For each locus of an MLS data set, the different sequence types recovered are called alleles. In this context, the properties of the data set can be summarized by two sets of matrices. The first set includes G matrices {F₁,..., F_g,..., F_G}, in which G is the number of loci. Each of these matrices contains the frequencies of the different alleles recovered at a given locus among the populations under study. The dimensions of these matrices are thus (ρ₁, r), ..., (ρ_g, r), ..., (ρ_G, r), in which ρ_g is the number of alleles observed at locus g and r is the number of populations delineated. The second set also includes G matrices, {D₁,..., D_g,..., D_G}, each of which contains the pairwise genetic distances between the alleles observed at locus g. Usually, the information contained within these two sets of matrices is analyzed independently, using population genetic statistics (i.e. diversity indices and differentiation measures) and phylogenetic methods respectively. Yet, while it is possible to perform analyses over all loci in either a population genetic or a phylogenetic framework, few methodologies are available to assess the congruence of the information obtained from different loci. In particular, a comparison of the patterns revealed by differentiation measures among the populations sampled, i.e. population structure, is a problematic issue.

Multivariate analysis is an interesting methodological way to approach this problem. For instance, Moazami-Goudarzi and Laloë [5] have proposed a two-step procedure to test the dissimilarity in population structures revealed by different microsatellite loci. Although this analysis can be used to test the similarity of population differentiations inferred from a set of markers, it can be noted that: i) it cannot be used to describe population structures, and ii) genetic divergences among alleles are not taken into account, although these can be quite informative. Consequently, further improvements should be considered since alternative statistical approaches are available [10]. In this context, the aim of this study is to propose a new procedure called multiple double principal coordinate analysis (mDPCoA). The mDPCoA aims at comparing the inter-population structures provided by the different markers of an MLS scheme. Firstly, a pattern of population differences is obtained for each MLS marker using a double principal coordinate analysis (DPCoA), a recently developed ordination method that takes into account both the frequency of alleles and their genetic divergence [11] (see Eckburg et al. [12] and Bik et al. [13] for applications of this method to the analysis of bacterial diversity). Secondly, population patterns are compared using three different methods: the Multiple Co-inertia Analysis [14], STATIS [15], and the Multiple Factorial Analysis [16]. Finally, a permutation procedure can be used to test the pairwise correlation among MLS markers. These analysis pipelines have been applied to both simulated and published MLS data sets to check the accuracy and the relevance of the procedures. The results obtained illustrate the ability of this methodology to make inferences on various features of the populations under study.

Results

Algorithms of multiple Double Principal Coordinate Analysis

Computations were performed using new functions and functions implemented in the ade4 [17] and ape [18] packages written in the R software [19] [see Additional file 1]. A manual describing the use of the different functions is supplied [see Additional file 2].

Let {F₁,..., F_g,..., F_G} be the set of matrices of type alleles × populations, containing the frequencies of alleles in the populations for the G loci, {D₁,..., D_g,..., D_G} be the set of matrices containing the distances among alleles, B_r be the diagonal matrix containing the population weights (the weight of a population is the proportion of individuals drawn from this population), and $B_{ρ_{g}}$ be the diagonal matrix containing the allele weights for the g^th locus (the weight of an allele is its frequency over all the populations studied). The matrices of distances must be Euclidean [20], which can be obtained with, for example, either the Lingoes [21] or the Cailliez [22] correction.

For a single locus g, the analysis of the among-population diversity corresponds to a DPCoA, which consists of three main steps:

1. Defining a Euclidean space composed of the principal axes of the distances among the alleles. The coordinates of the alleles in this space are in R_g such that: $- Q_{g}^{t} D_{g} Q_{g} = R_{g} R_{g}^{t}$ , where $Q_{g} = I_{ρ_{g}} - B_{ρ_{g}} 1_{ρ_{g}} 1_{ρ_{g}}^{t}$ is a projector which performs weighted centering, with $I_{ρ_{g}}$ the ρ_g × ρ_g identity matrix and $1_{ρ_{g}}$ a ρ_g × 1 vector of ones. That is to say, $Q_{g}^{t} D_{g} Q_{g}$ is the matrix centered by rows and columns;

2. Positioning, in this space, the populations at the centroid of the alleles they possess. The coordinates of the populations, in this space, are in C_g such that: $C_{g} = B_{r}^{- 1} F_{g}^{t} R_{g}$ ;

3. Proceeding to the singular value decomposition of the triplet (C_g, $I_{μ_{g}}$ , B_r), where μ_g is the number of principal axes for the alleles of the g^th locus. This third step leads to a set of positive eigenvalues, in a diagonal (ν_g × ν_g) matrix Ψ_g, and to a basis of orthonormal eigenvectors, in a (r × ν_g) matrix V_g, defining the new Euclidean space. The eigenvectors constitute the principal axes of the distances among populations. In this new space, which is the DPCoA space, the coordinates of the alleles are in X_g = R_g V_g, and the coordinates of the populations in Y_g = C_g V_g.
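Step 2 of the list above (placing each population at the centroid of its alleles) can be illustrated with a minimal Python sketch. It assumes that F_g stores joint frequencies, i.e. each entry is the proportion of all sampled individuals carrying a given allele in a given population, so that the column sums give the population weights on the diagonal of B_r. The function name and the toy numbers are hypothetical, the allele coordinates are not centered, and this is not the ade4 implementation.

```python
def population_centroids(freq, allele_coords):
    """Sketch of C_g = B_r^{-1} F_g^t R_g for one locus.

    freq[k][i]: joint frequency of allele k in population i (assumed layout).
    allele_coords[k]: coordinates of allele k on the principal axes (row k of R_g).
    """
    n_alleles = len(freq)
    n_pops = len(freq[0])
    n_axes = len(allele_coords[0])
    # Population weights (diagonal of B_r), taken here as the column sums of F_g.
    weights = [sum(freq[k][i] for k in range(n_alleles)) for i in range(n_pops)]
    centroids = []
    for i in range(n_pops):
        # Frequency-weighted mean of the allele coordinates within population i.
        row = [
            sum(freq[k][i] * allele_coords[k][a] for k in range(n_alleles)) / weights[i]
            for a in range(n_axes)
        ]
        centroids.append(row)
    return weights, centroids


# Toy example: three alleles placed on a single axis, two populations.
F = [[0.3, 0.0],
     [0.3, 0.1],
     [0.0, 0.3]]
R = [[0.0], [1.0], [3.0]]
weights, C = population_centroids(F, R)
```

In this toy case the first population carries the first two alleles in equal proportions and therefore sits halfway between them.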

A consideration of the set of all the loci leads thus to G triplets $(Y_{1}, I_{ν_{1}}, B_{r}), ..., (Y_{g}, I_{ν_{g}}, B_{r}), ..., (Y_{G}, I_{ν_{G}}, B_{r})$

Our objective being to evaluate the consistency among the patterns of inter-population diversity provided by each locus, considering evolutionary distances among alleles, we had to find a Euclidean space allowing the direct comparison among the individual DPCoA analyses. We evaluated three alternative solutions taken from the K-table multivariate analysis: the multiple co-inertia analysis (MCoA) [14], STATIS [15] and the multiple factorial analysis (MFA) [16].

DPCoA and Multiple Co-inertia analysis

The Multiple Co-inertia Analysis applied to the triplets $(Y_{1}, I_{ν_{1}}, B_{r}), ..., (Y_{g}, I_{ν_{g}}, B_{r}), ..., (Y_{G}, I_{ν_{G}}, B_{r})$ can be viewed as follows:

The main step is the definition of a set of axes $u_{g}^{[k]}$ , for 1 ≤ k ≤ K and 1 ≤ g ≤ G, normalized in each space $ℝ^{ν_{g}}$ , which will serve to position the populations according to each individual locus, and K unique variables v^[k], for 1 ≤ k ≤ K, B_r-normalized in ℝ^r, which may be used to synthesize the information provided by the G loci. This definition is done by maximizing

$\sum_{g = 1}^{G} π_{g} {〈 Y_{g} u_{g} | v 〉}_{B_{r}}^{2}$ , given that

${〈 v^{[k]} | v^{[l]} 〉}_{B_{r}} = 0$ and ${〈 u_{g}^{[k]} | u_{g}^{[l]} 〉}_{I_{ν_{g}}} = 0$ for all k, l (1 ≤ k < l ≤ K), and all g (1 ≤ g ≤ G).

The value π_g is a weight attributed to the triplet (Y_g, $I_{ν_{g}}$ , B_r) so as to homogenize the impact of each triplet in the multiple analysis. We use π_g equal to the inverse of the inertia of the triplet (Y_g, $I_{ν_{g}}$ , B_r), i.e. the inverse of the sum of all its eigenvalues. Let U_g be the matrix $[u_{g}^{[1]} | ... | u_{g}^{[k]} | ... | u_{g}^{[K]}]$ and V the matrix [v^[1]|...|v^[k]|...|v^[K]]. The individual analyses can be projected on the MCoA space. In this space, it is possible to compare the coordinates of the populations according to the consensus of the information provided by the different loci with the coordinates of the populations obtained from each locus. While V contains the consensual coordinates of the populations, the coordinates at which the g^th locus positions the populations are obtained from $L_{Y_{g}} = \sqrt{π_{g}} Y_{g} U_{g}$ . Because $Y_{g} = B_{r}^{- 1} F_{g}^{t} X_{g}$ , the matrix $L_{X_{g}} = \sqrt{π_{g}} X_{g} U_{g}$ positions the alleles of the g^th locus, so that each population is at the centroid of its allelic composition. However, to compare the individual analyses with the compromise, it is better to B_r-normalize $L_{Y_{g}}$ and $L_{X_{g}}$ because V is by definition B_r-normalized.

DPCoA and STATIS

The STATIS analysis applied to $(Y_{1}, I_{ν_{1}}, B_{r}), ..., (Y_{g}, I_{ν_{g}}, B_{r}), ..., (Y_{G}, I_{ν_{G}}, B_{r})$ implies the calculation of a degree of correlation among the triplets, the so-called Rν coefficient. The matrix

E_{g} = \frac{B_{r}^{1 / 2} Y_{g} Y_{g}^{t} B_{r}^{1 / 2}}{‖ B_{r}^{1 / 2} Y_{g} Y_{g}^{t} B_{r}^{1 / 2} ‖}

is at the core of our application of STATIS because it is symmetrical and its dimensions are the same for all the triplets, whereas the dimensions of Y_g change. The definition of Rν is

Rν (Y_{g}, Y_{h}) = \frac{Covν (Y_{g}, Y_{h})}{\sqrt{Vaν (Y_{g})} \sqrt{Vaν (Y_{h})}},

where

Vaν (Y_{g}) = Trace (Y_{g} Y_{g}^{t} B_{r} Y_{g} Y_{g}^{t} B_{r})

Covν (Y_{g}, Y_{h}) = Trace (Y_{g} Y_{g}^{t} B_{r} Y_{h} Y_{h}^{t} B_{r})
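The trace formulas above translate directly into code. The following Python sketch is only an illustration under stated assumptions: the helper names are hypothetical, the population coordinates are toy values rather than DPCoA outputs, and the population weights are taken as uniform. It shows that two tables with different numbers of columns can still be compared, because only the r × r products are used.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


def weighted_cross_product(y, weights):
    """Return Y Y^t B_r, where B_r is the diagonal matrix of population weights."""
    yyt = matmul(y, transpose(y))
    n = len(weights)
    return [[yyt[i][j] * weights[j] for j in range(n)] for i in range(n)]


def cov_v(yg, yh, weights):
    """Trace(Y_g Y_g^t B_r Y_h Y_h^t B_r)."""
    return trace(matmul(weighted_cross_product(yg, weights),
                        weighted_cross_product(yh, weights)))


def rv(yg, yh, weights):
    """Covariance-type term divided by the square roots of the two variance-type terms."""
    return cov_v(yg, yh, weights) / (cov_v(yg, yg, weights) ** 0.5
                                     * cov_v(yh, yh, weights) ** 0.5)


# Toy coordinates for four equally weighted populations.
w = [0.25, 0.25, 0.25, 0.25]
locus_a = [[1.0], [-1.0], [1.0], [-1.0]]           # separates populations 1,3 from 2,4
locus_b = [[1.0], [1.0], [-1.0], [-1.0]]           # separates populations 1,2 from 3,4
locus_c = [[1.0, 0.5], [-1.0, 0.5], [1.0, -0.5], [-1.0, -0.5]]  # two axes

same = rv(locus_a, locus_a, w)        # identical patterns
unrelated = rv(locus_a, locus_b, w)   # orthogonal patterns
partial = rv(locus_a, locus_c, w)     # shared first axis only
```

Identical patterns give a coefficient of 1 and orthogonal patterns a coefficient of 0, with partially shared structure falling in between.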

The pairwise calculation of Rν leads to a square matrix describing the correlations among the loci. With its eigenvalue decomposition, it is possible to describe the correlation pattern, called the interstructure. Its first eigenvector α = (α₁,..., α_g,..., α_G) is positive and maximizes the quantity $\sum_{g = 1}^{G} \sum_{h = 1}^{G} α_{g} α_{h} Rν (Y_{g}, Y_{h})$ where $\sum_{g = 1}^{G} α_{g}^{2} = 1$ . STATIS uses these properties to define a matrix

E = \sum_{g = 1}^{G} α_{g} \frac{B_{r}^{1 / 2} Y_{g} Y_{g}^{t} B_{r}^{1 / 2}}{‖ B_{r}^{1 / 2} Y_{g} Y_{g}^{t} B_{r}^{1 / 2} ‖}

whose eigenanalysis, E = UΛU^t, leads to the best compromise of the population pattern over the G loci. Note that $‖ B_{r}^{1 / 2} Y_{g} Y_{g}^{t} B_{r}^{1 / 2} ‖^{2} = Vaν (Y_{g})$ . According to this compromise, the coordinates of the populations are in $B_{r}^{- 1 / 2} U Λ^{1 / 2}$ . Following Lavit et al. [15], the G individual population patterns corresponding to the loci considered independently can be obtained. The coordinates of the i^th population according to the g^th locus are the elements of the i^th row of $Y_{g} Y_{g}^{t} B_{r}^{1 / 2} U Λ^{- 1 / 2}$ . Given that $Y_{g} = B_{r}^{- 1} F_{g}^{t} X_{g}$ , the rows of the matrix $X_{g} Y_{g}^{t} B_{r}^{1 / 2} U Λ^{- 1 / 2}$ position the alleles of the g^th locus, so that each population is at the centroid of its allelic composition.
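To make the role of the interstructure weights more concrete, the sketch below extracts the first eigenvector of a small Rν matrix. It uses power iteration, which is a substitute chosen here for brevity and not the eigenvalue decomposition routine used by the authors. The matrix values are invented for illustration and the function name is hypothetical.

```python
def first_eigenvector(matrix, iterations=500):
    """Approximate the dominant eigenvector of a symmetric positive matrix.

    Power iteration: repeatedly multiply by the matrix and renormalize so that
    the squared entries sum to one.
    """
    n = len(matrix)
    vec = [1.0 / n ** 0.5] * n
    for _ in range(iterations):
        new = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in new) ** 0.5
        vec = [x / norm for x in new]
    return vec


# Invented correlations among three loci: the first two agree, the third is atypical.
rv_matrix = [[1.0, 0.9, 0.2],
             [0.9, 1.0, 0.3],
             [0.2, 0.3, 1.0]]
alpha = first_eigenvector(rv_matrix)
```

With these values the atypical third locus receives the smallest weight, so it contributes least to the compromise.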

DPCoA and Multiple Factorial Analysis

The MFA is the Principal Component Analysis (PCA) of the global matrix

Y_{TOT} = [π₁Y₁|...|π_gY_g|...|π_GY_G]:

Y_{TOT}^{t} B_{r} Y_{TOT} = U Λ U^{t} .

The global coordinates of the populations synthesizing the information given by all the loci are in Y_{TOT} U. The coordinates at which the g^th locus positions the populations are in

π_{g} Y_{g} Y_{g}^{t} B_{r} Y_{TOT} U Λ^{- 1 / 2} .

Because $Y_{g} = B_{r}^{- 1} F_{g}^{t} X_{g}$ , the matrix $π_{g} X_{g} Y_{g}^{t} B_{r} Y_{TOT} U Λ^{- 1 / 2}$ positions the alleles of the g^th locus, so that each population is at the centroid of its allelic composition.

Relationships between the multiple DPCoA and the measurement of diversity

For the next two paragraphs, consider only one locus, locus g. The DPCoA is centered around a diversity index called "nucleotide diversity" by Nei and Li [23], or "quadratic entropy" by Rao [24], which is at the core of the Analysis of Molecular Variance (AMOVA) [25–27]:

H_{g} (p_{i}) = \sum_{k = 1}^{ρ_{g}} \sum_{l = 1}^{ρ_{g}} p_{k i} p_{l i} d_{k l}^{all, g} = p_{i}^{t} D^{all, g} p_{i}

In this formula, g designates the g^th locus, ρ_g is the number of different alleles observed for that locus, $p_{i} = {(p_{1 i}, ..., p_{k i}, ..., p_{ρ_{g} i})}^{t}$ is the vector containing the relative frequencies of the alleles in the i^th population, so that p_ki is the frequency of allele k in the i^th population, and $d_{k l}^{all, g}$ is the distance between alleles k and l of the g^th locus. The DPCoA uses a decomposition of this diversity component defined by Rao [27]:

H_{TOTAL, g}({μ_i},{p_i}) = H_{INTRA, g}({μ_i},{p_i}) + H_{INTER, g}({μ_i},{p_i}),

where

H_{TOTAL, g} ({μ_{i}}, {p_{i}}) = H_{g} (\sum_{i = 1}^{r} μ_{i} p_{i}),

H_{INTRA, g} ({μ_{i}}, {p_{i}}) = \sum_{i = 1}^{r} μ_{i} H_{g} (p_{i}),

and

H_{INTER, g} ({μ_{i}} : {p_{i}}) = \sum_{i = 1}^{r} \sum_{j = 1}^{r} μ_{i} μ_{j} d^{pop, g} (p_{i}, p_{j}),

where $d^{pop, g} (p_{i}, p_{j}) = 2 H_{g} (\frac{p_{i} + p_{j}}{2}) - H_{g} (p_{i}) - H_{g} (p_{j})$ .
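The three diversity components defined above can be checked numerically on a toy example. The Python sketch below assumes a symmetric allele distance matrix with a zero diagonal; the function names and the numbers are illustrative only. It computes the total, within-population and among-population components from their definitions, so that the total can be compared with the sum of the other two.

```python
def quadratic_entropy(p, d):
    """H(p) = sum_k sum_l p_k p_l d_kl for a frequency vector p."""
    n = len(p)
    return sum(p[k] * p[l] * d[k][l] for k in range(n) for l in range(n))


def d_pop(pi, pj, d):
    """2 H((p_i + p_j) / 2) - H(p_i) - H(p_j)."""
    mid = [(a + b) / 2 for a, b in zip(pi, pj)]
    return 2 * quadratic_entropy(mid, d) - quadratic_entropy(pi, d) - quadratic_entropy(pj, d)


def decompose(mu, pops, d):
    """Return (total, intra, inter) for population weights mu and frequency vectors pops."""
    r = len(pops)
    # Pooled allele frequencies: sum_i mu_i p_i.
    pooled = [sum(m * p[k] for m, p in zip(mu, pops)) for k in range(len(d))]
    total = quadratic_entropy(pooled, d)
    intra = sum(m * quadratic_entropy(p, d) for m, p in zip(mu, pops))
    # Double sum over all ordered pairs of populations, as in the definition above.
    inter = sum(mu[i] * mu[j] * d_pop(pops[i], pops[j], d)
                for i in range(r) for j in range(r))
    return total, intra, inter


# Toy data: three alleles, two populations.
D = [[0.0, 1.0, 4.0],
     [1.0, 0.0, 2.0],
     [4.0, 2.0, 0.0]]
populations = [[0.5, 0.5, 0.0],
               [0.0, 0.2, 0.8]]
mu = [0.6, 0.4]
total, intra, inter = decompose(mu, populations, D)
```

Because the index is quadratic in the frequencies, the total equals the within-population component plus the among-population component for any symmetric distance matrix.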

In the first step of the DPCoA, all the points (i.e. alleles and populations) are in a space called "common space" [11]. In this common space, the inertia (i.e. variance) of the allele points weighted by p_i is equal to H_g(p_i), the diversity of population i according to locus g. The inertia of all the allele points weighted by $\sum_{i = 1}^{r} μ_{i} p_{i}$ is equal to H_{TOTAL, g}, the total diversity of the data set. Finally, the inertia of all the population points weighted by μ = (μ₁,..., μ_i,..., μ_r) is equal to H_{INTER, g}, the component of diversity among populations [11]. At the end of the DPCoA analysis, all the points are projected in a subspace which optimizes the representation of the differences among populations. In this subspace, only H_{INTER, g} is maintained, which is thus the focus of the analysis: optimally displaying the diversity among populations.

Consequently, the multiple DPCoA allows us to optimize the description of diversity among populations obtained with several loci. The first goal of this method is to describe the differences in population patterns across the loci, hence studying the congruence among loci. Another objective may be to erase these differences and provide a compromise population pattern revealed by the majority of the loci. The DPCoA-STATIS is advocated for this purpose. Concerning the measurement of diversity, when several loci are considered to measure diversity, the sum or average of the diversity components over the loci is currently used as a global measure of diversity [see for example [28, 29]]. With such processes, the weights given to the loci for the sum or averaging are uniform. We have just shown that STATIS provides optimal locus weights for the calculation of the component of diversity among populations. The great advantage of these multivariate analyses is that visualization of the differences among loci is possible so that one can assess the relevance of using average information over loci, whether these means are weighted or not.

Associated tests

We performed both Mantel and Rν tests to evaluate the significance of the differences in population patterns among loci. For each locus, distances among populations are calculated with the inter-population diversity H_{INTER, g}({μ_i}:{p_i}) according to Nei and Li [23] and Rao [24, 27]. As noted above, this statistic is at the core of the DPCoA. As we apply formula (H_{INTER, g}) in a pairwise fashion, the distance between population i and population j for locus g is μ_i μ_j d^{pop, g}(p_i, p_j). We chose μ_i μ_j d^{pop, g}(p_i, p_j) rather than simply d^{pop, g}(p_i, p_j) to take into account differential sample sizes, exactly as in the ordination procedures. The Mantel test calculates correlations among the raw distance measures, while the Rν test compares principal coordinates obtained by PCoA. Rν correlations are generally higher than Mantel correlations, partly because their values lie between 0 and 1, whereas Mantel correlation values lie between -1 and 1.
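For readers unfamiliar with the Mantel test, a minimal permutation version is sketched below in Python. The source does not state the number of permutations or whether the test is one-sided, so the choices here (999 permutations, one-sided test on the Pearson correlation, fixed random seed) are assumptions, and the distance matrices are toy values rather than the pairwise population distances described above.

```python
import random


def upper_triangle(m):
    """Flatten the strict upper triangle of a square matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mantel(d1, d2, permutations=999, seed=0):
    """Return the observed correlation and a one-sided permutation p-value."""
    rng = random.Random(seed)
    n = len(d1)
    x = upper_triangle(d1)
    observed = pearson(x, upper_triangle(d2))
    count = 1  # the observed arrangement counts as one permutation
    idx = list(range(n))
    for _ in range(permutations):
        rng.shuffle(idx)
        # Permute rows and columns of d2 simultaneously (relabel the populations).
        permuted = [[d2[idx[i]][idx[j]] for j in range(n)] for i in range(n)]
        if pearson(x, upper_triangle(permuted)) >= observed - 1e-12:
            count += 1
    return observed, count / (permutations + 1)


# Toy example: five populations on a line; the second locus sees squared distances.
d_locus1 = [[abs(i - j) for j in range(5)] for i in range(5)]
d_locus2 = [[(i - j) ** 2 for j in range(5)] for i in range(5)]
r_obs, p_value = mantel(d_locus1, d_locus2)
```

Rows and columns are permuted together so that each permutation corresponds to a relabelling of the populations, which preserves the internal structure of each distance matrix.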

Application to simulated and real data sets

We used the following procedure to test the methodologies presented above on simulated and real data sets. First, pairwise correlations among loci were assessed by Mantel and/or Rν tests to define groups of consistent loci. At this step, atypical loci can be identified. Then mDPCoA was performed to describe both the compromise population structure and the differences among groups of loci. Finally, we described the connections between the observed structures and ecological, evolutionary or functional data.

Application to a simulated data set

Simulation process

In order to assess the efficiency of the present method, simulated sequence data sets, which illustrate various population structures, were obtained assuming linkage equilibrium among loci. Assuming recombination, the different markers can indeed have different histories and thus different population structures. Moreover, if every marker has an independent history, finding similarities and differences among their genetic structures would be more difficult. Using SIMCOAL 2.0 [30] we considered a one-dimensional stepping stone model with eight populations of constant size [31]. The eight populations evolved 10⁶ generations after emerging from a single ancestral population. For each population, 60 individuals were sampled out of 10000 individuals. In this context, we simulated DNA sequence evolution of ten loci of 300 base pairs under a Jukes and Cantor model [32] assuming a mutation rate of 5 × 10^-6. The stepping stone model allows migration between adjacent populations: for example, at time t, the population 4 can exchange individuals with populations 3 or 5, but not with other populations. We chose the following migration rates: 5 × 10^-2, 10^-2, 5 × 10^-3, 10^-3, 5 × 10^-4, 10^-4, 5 × 10^-5, 10^-5, 5 × 10^-6. We also simulated an eleventh locus that reveals a different population structure. For this locus, we assumed no migration between odd populations (i.e. populations 1, 3, 5, 7) and even populations (i.e. populations 2, 4, 6, 8) and a migration rate of 10^-3 among odd or even populations, with other parameters kept unchanged. Such a simulation resulted in two clades of alleles which are obviously divergent, the first clade being specific to some populations (e.g. odd ones), the second clade being specific to other populations (e.g. even ones). Such genetic structure can be observed in case of either balancing/disruptive selection [e.g. [33]] or horizontal transfer of an outlier allele [e.g. [7]].
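The adjacency pattern of the one-dimensional stepping stone model can be written as a migration matrix. The Python sketch below only illustrates which pairs of populations exchange migrants; it assumes that the stated rate applies to each pair of neighbours, and it does not reproduce the SIMCOAL input format or its interpretation of the rates. The function name is hypothetical.

```python
def stepping_stone_matrix(n_pops, rate):
    """Migration matrix in which only adjacent populations exchange migrants.

    Assumption: the same rate is used for every pair of neighbours.
    """
    return [[rate if abs(i - j) == 1 else 0.0 for j in range(n_pops)]
            for i in range(n_pops)]


# The nine migration rates listed in the text, applied to eight populations.
MIGRATION_RATES = [5e-2, 1e-2, 5e-3, 1e-3, 5e-4, 1e-4, 5e-5, 1e-5, 5e-6]
matrices = {m: stepping_stone_matrix(8, m) for m in MIGRATION_RATES}

# Population 4 of the text is index 3 here (zero-based indexing).
row_pop4 = matrices[1e-3][3]
```

In this layout, the row for population 4 is non-zero only in the columns of populations 3 and 5, and the two end populations each have a single neighbour.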

We applied the mDPCoA approach first to the complete data set, second to the allele distances only, and third to the allele frequencies only. We evaluated the intensity of inter-population structure by measuring the AMOVA ϕ_ST parameter [25].

Results

The correlations among locus 11 and the ten other loci are very low and not significant as expected (Figure 1). Thus, we correctly identified the atypical locus. These correlations decrease when migration rate decreases. Test statistics based on both the Mantel correlation and the Rν correlation between the atypical locus and other loci clearly behave in a similar way, and results are hardly changed when removing allele frequencies or distances.

Regarding the correlation tests among the 10 loci submitted to the stepping stone model, the inter-population structure measured by the AMOVA ϕ_ST parameter increases slightly when the migration rate decreases from 5 × 10^-2 to 5 × 10^-4 and then increases very quickly (Figure 2). Values of the Mantel correlation, the percent of significant tests according to the Mantel correlation and the percent of significant tests according to the Rν correlation are three parameters correlated with ϕ_ST, especially when using both allele frequencies and allele divergences. The raw value of the Rν correlation is steadier. These results show that a non-significant correlation may be due to either an absence of genetic structure (e.g. no differentiation among populations) or reliable differences in the inter-population structures revealed by the different loci. The graphical analysis, complemented by ϕ_ST values, will help to decide between the two alternatives.

Regarding the mDPCoA, we present below the results of the DPCoA-MCoA approach, which we expected to provide a description of the difference between the first ten loci and the eleventh, atypical locus (Figure 3; to limit the size of Figure 3, only the results for migration rates 10^-2, 10^-3, 10^-4 and 10^-5 are shown since intermediate migration rates revealed intermediate inter-population structure). Indeed, for migration rates higher than 10^-2, where no inter-population structure was highlighted in the previous paragraph, the atypical locus takes the first axis of the compromise analysis, which therefore distinguishes odd from even populations. With a migration rate of 10^-3, the stepping stone model interacts with the structure provided by locus 11; the first 10 loci, which follow a stepping stone model, take the first axis and locus 11 roughly takes the second axis. With a migration rate lower than 10^-3, the first two axes of the DPCoA-MCoA only represent the stepping stone model. Whatever the migration rate, the projection of the individual loci on the DPCoA-MCoA factorial axes emphasizes the special status of locus 11 (Figure 3). This last result is also emphasized by specific results of the DPCoA-STATIS approach, such as the interstructures. With a migration rate equal to 5 × 10^-4 or lower, the structure is very clear with either complete or incomplete data on allele composition.

Application to the description of Sinorhizobium species diversity

The data set

In order to test the efficiency of the procedures we proposed, we needed a real data set which should give simple and explicit results but which could also encompass the features of complex MLS data sets. We chose to focus on nitrogen fixing bacteria belonging to the genus Sinorhizobium (Rhizobiaceae) associated with the plant genus Medicago (Fabaceae). The data set we chose is a combination of two data sets fully available online from GenBank and published in two recent papers [8, 34]. The complete sampling procedure is described in the two papers and summarized in an additional file [see Additional file 3]. Based on the sampling scheme, we delineated six populations according to geographical origin (France: F, Tunisia Hadjeb: TH, Tunisia Enfidha: TE), the host plant (M. truncatula or similar symbiotic specificity: T, M. laciniata: L), and the taxonomical status of bacteria (S. meliloti: mlt, S. medicae: mdc). Each population will be called hereafter according to the three above criteria, e.g. THLmlt is the population sampled in Tunisia at Hadjeb from M. laciniata nodules which include S. meliloti isolates. S. medicae interacts with M. truncatula while S. meliloti interacts with both M. laciniata (S. meliloti bv. medicaginis) and M. truncatula (S. meliloti bv. meliloti) [35, 36]. The numbers of individuals are respectively 46 for FTmdc, 43 for FTmlt, 20 for TETmdc, 24 for TETmlt, 20 for TELmlt, 42 for THTmlt and 20 for THLmlt [see Additional files 4, 5, 6, 7].

Four different intergenic spacers (IGS), IGS_NOD, IGS_EXO, IGS_GAB, and IGS_RKP, distributed on the different replication units of the model strain 1021 of S. meliloti bv. meliloti (Figure 4), had been sequenced to characterize each bacterial isolate (DNA extraction and sequencing procedures are described in an additional file [see Additional file 3]). It is noteworthy that the IGS_NOD marker is located within the nod gene cluster and that specific alleles at these loci determine the ability of S. meliloti strains to interact with either M. laciniata or M. truncatula [37].

For each locus, we selected a model of evolution using the software PHYML [38] and its R interface provided by ape [18, 19]. This software compares the models by likelihood ratio tests. When several models were not significantly different according to a χ² test, we selected the model with the smallest number of parameters. From this procedure, we selected Felsenstein's model F84 [39, 40] for IGS_NOD, IGS_EXO, IGS_GAB, and Felsenstein's model F81 [40, 41] for IGS_RKP. Then, using the ape package, a set of matrices $\{D_{IGS_{NOD}}, D_{IGS_{EXO}}, D_{IGS_{GAB}}, D_{IGS_{RKP}}\}$ containing pairwise genetic distances between alleles observed at each locus was computed according to these selected models, and Neighbor-Joining trees with bootstrap values were obtained from these distance matrices to illustrate the data sets (Figure 5).

We applied the multiple DPCoA to this data set, and compared the results to those obtained with STRUCTURE [42, 43]. STRUCTURE estimates population structure using genotype data. The basic hypotheses are linkage equilibrium within subpopulations (or possibly weak linkage [44]) and Hardy-Weinberg equilibrium (if the organism under study is not haploid).

Results

Mantel and Rν tests demonstrated that the locus IGS_NOD provides a very specific ordination of populations, while the three other markers, IGS_RKP, IGS_EXO and IGS_GAB, were significantly congruent (Table 1).

Table 1 Pairwise correlations among loci with the complete real data set


With DPCoA-MCoA (Figure 6), the first axis, which expresses 94% of the diversity among populations, separates the two bacterial species, S. meliloti and S. medicae, while the second axis, with 6% of the diversity among populations, distinguishes the impact of the host plants, M. laciniata and M. truncatula. The DPCoA-STATIS analysis reveals a very similar pattern (Figure 7). Consistent with this, the STRUCTURE analysis defined two main clusters including respectively S. meliloti and S. medicae, without any trace of admixture between the two species. However, these results are a compromise among the information provided by IGS_RKP, IGS_GAB, IGS_EXO and IGS_NOD. Although the four markers effectively delineate the two bacterial species, they express this segregation differently. The DPCoA-MCoA indeed revealed that the segregation between S. meliloti and S. medicae is supported by more than 90% of the population variation for the three most coherent markers, i.e. IGS_RKP, IGS_GAB and IGS_EXO, while it only concerns a minor part of the population variation observed for IGS_NOD. The discrimination between the impact of the two host plants, i.e. M. truncatula and M. laciniata, which appears on axis 2, is the main structure for the IGS_NOD marker. The interstructure obtained by using STATIS (Figure 7A), i.e. the eigenanalysis of the Rν matrix, illustrated the special status of IGS_NOD.

It is noteworthy that based on DPCoA-MCoA, the secondary structure is due to a host-plant effect (e.g. IGS_GAB) and/or a geographical origin effect (e.g. IGS_EXO) discriminating between French and Tunisian populations of S. meliloti. Interestingly, the effect of geographical distance on the population structure of S. meliloti is not detected by compromise analyses. Because both STATIS and MFA aim at pointing out similarities among loci, these approaches failed at highlighting the secondary structure observed using DPCoA-MCoA (Figure 7B and Figure 8).

There is a clear relationship between the patterns of population differences and the distribution of allelic diversity (Figure 6B). For instance, the two bacterial species did not share any alleles, even for the IGS_NOD locus. Furthermore, the populations associated with M. laciniata did not share any alleles with the populations associated with M. truncatula for the IGS_NOD locus, resulting in three independent allelic pools belonging respectively to S. medicae and the two biovars of S. meliloti. In addition, the distance between the IGS_NOD alleles associated with M. laciniata and those associated with M. truncatula is very high, almost as high as the distance which separates S. meliloti and S. medicae on IGS_EXO. The particular polymorphism pattern observed for IGS_NOD might be explained by both the host-plant selective pressure that acts on nod genes and the events of horizontal transfer that affect the nod gene cluster [34].

Relative effects of distances and frequencies

In order to estimate the relative impacts of allele frequencies and distances on the above results, we applied the DPCoA-MCoA taking into account either sequence divergences without allele frequencies or allele frequencies without sequence divergences (Figure 9). When only sequence divergences are kept, as in the complete analysis, IGS_EXO, IGS_GAB, and IGS_RKP are significantly correlated, sharing a strong separation between the species S. medicae and S. meliloti (correlations vary from 0.81 to 0.93 according to Mantel and are greater than 0.999 according to Rν; significance of the correlation tests was assessed at a 0.05 threshold). Regarding the DPCoA-MCoA factorial maps, the population structure is maintained on axis 1, which in that case expresses 96% of the inter-population diversity. IGS_NOD stands out by presenting very distinct alleles according to the host plant. On the second axis, with 4% of the inter-population diversity, the differences between populations according to host plants are maintained for IGS_GAB as a secondary structure. Yet, the secondary structures of both IGS_RKP and IGS_EXO become hardly interpretable. When only the allele frequencies are kept, the two species S. medicae and S. meliloti are highly differentiated for all the loci; consequently, all the pairwise correlations between loci are significant according to the Mantel statistic (correlations greater than 0.83), and all are significant according to the Rν statistic except IGS_EXO-IGS_NOD (0.61) and IGS_RKP-IGS_NOD (0.63). Regarding the DPCoA-MCoA factorial maps, the first axis of all the loci represents the inter-species separation. The difference among populations according to their host plant measured on IGS_NOD is relegated to axis 2, which represents 12% of the inter-population diversity. Along this axis, the three other loci, IGS_EXO, IGS_GAB, and IGS_RKP, distinguish the French population from the Tunisian populations.

The conclusions that can be drawn from these analyses of the effects of distances and frequencies on the inter-population diversity are as follows. In all of the analyses, the most peculiar locus remains IGS_NOD. The high separation of populations according to their host plant is due to distinct and distant alleles for IGS_NOD and allele distances for IGS_GAB. The differences among IGS_GAB, IGS_RKP, and IGS_EXO are due to differentiation patterns among S. meliloti populations. Finally, the distinction between the French and the Tunisian populations mostly relies on allele frequency data.

Discussion

The mDPCoA approach provides a useful tool for: (i) identifying atypical loci by both tests and factorial maps; (ii) describing differences in population structures between groups of congruent loci by factorial maps; (iii) including evolutionary distances among alleles, which is seldom done.

Missing data

In all the analyses we performed, the weight of a population is the number of individuals sampled from this population divided by the total number of individuals sampled. Given that we consider several loci, this definition of the weights supposes that we have identified the allelic composition of each individual for all loci. In case of missing allelic data, i.e. if the allelic content of some individuals is missing for one or several loci, one should in principle define different weight systems depending on the loci. For the g-th locus, the weight of population i would be the number of characterized individuals from population i divided by the total number of characterized individuals. This would lead to G different systems of weights, i.e. one per locus. Unfortunately, neither STATIS nor the MCoA nor the MFA can support different population weights. Consequently, one has to assume the same set of population weights over loci even though some data are missing. To overcome this problem, the weight of a population may be taken as the number of individuals sampled from this population divided by the total number of individuals sampled, whether or not the allelic information for all the loci and for all the individuals is available.
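The two weighting schemes discussed above can be contrasted on a toy example. The sketch below is only an illustration: the data layout, the population and locus names, and the function names `global_weights` and `locus_weights` are assumptions made for this example and are not taken from the mdpcoa.R functions.

```python
from typing import Dict, List, Optional

# Hypothetical layout: population -> list of individuals; each individual maps
# a locus name to an allele label, or to None when that locus was not characterized.
Individuals = List[Dict[str, Optional[str]]]

sample: Dict[str, Individuals] = {
    "popA": [
        {"locus1": "a1", "locus2": "b1"},
        {"locus1": "a1", "locus2": None},  # locus2 is missing for this individual
        {"locus1": "a2", "locus2": "b2"},
        {"locus1": "a2", "locus2": "b2"},
    ],
    "popB": [{"locus1": "a3", "locus2": "b3"} for _ in range(6)],
}


def global_weights(data: Dict[str, Individuals]) -> Dict[str, float]:
    """Single weight system: sampled individuals per population / all sampled individuals."""
    total = sum(len(inds) for inds in data.values())
    return {pop: len(inds) / total for pop, inds in data.items()}


def locus_weights(data: Dict[str, Individuals], locus: str) -> Dict[str, float]:
    """Locus-specific weights: characterized individuals per population / all characterized."""
    counts = {
        pop: sum(1 for ind in inds if ind.get(locus) is not None)
        for pop, inds in data.items()
    }
    total = sum(counts.values())
    return {pop: n / total for pop, n in counts.items()}


if __name__ == "__main__":
    print("global:", global_weights(sample))
    print("locus2:", locus_weights(sample, "locus2"))
```

In this toy data set the weight of popA is 4/10 under the single weight system and 3/9 when only the individuals characterized at locus2 are counted; the single system is the one that STATIS, the MCoA and the MFA can accommodate.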

Another common case of missing data is the absence of information on nucleotide divergence among alleles. In that case, we suggest setting the distance between any two different alleles to 1, so that the DPCoA is equal to the non-symmetric correspondence analysis [11, 45]. Furthermore, the inertia of the allelic points per population in the DPCoA "common space" is then equal to the gene diversity index H, introduced by Nei [28], and the inertia of the population points is equal to the gene diversity among populations defined by Nei [28] in his decomposition of gene diversity. The inertia among population points in the best compromise plot and DPCoA-STATIS is a measure of gene diversity among populations averaged over the G loci, where the weights given to the loci are not simply uniform but set optimally for synthesizing what is common to the loci. This process gives less weight to outliers and reflects the distances among populations as they are seen by the majority of the loci.
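The link between unit distances and gene diversity can be checked numerically. The sketch below assumes that the dissimilarity entering Rao's quadratic entropy is exactly 1 for distinct alleles and 0 otherwise; other scaling conventions (for instance halved squared distances) would only rescale the values by a constant. The function names and the toy frequencies are hypothetical.

```python
from typing import List, Sequence


def quadratic_entropy(p: Sequence[float], delta: Sequence[Sequence[float]]) -> float:
    """Rao's quadratic entropy: expected dissimilarity between two alleles drawn from p."""
    n = len(p)
    return sum(p[i] * p[j] * delta[i][j] for i in range(n) for j in range(n))


def unit_dissimilarity(n: int) -> List[List[float]]:
    """Dissimilarity of 1 between distinct alleles, 0 on the diagonal."""
    return [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]


def gene_diversity(p: Sequence[float]) -> float:
    """Nei's gene diversity, 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(x * x for x in p)


# Toy example: two populations with equal weights and three alleles.
weights = [0.5, 0.5]
freqs = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
delta = unit_dissimilarity(3)

# Pooled allele frequencies over the populations.
pooled = [sum(w * f[k] for w, f in zip(weights, freqs)) for k in range(3)]

total = quadratic_entropy(pooled, delta)                       # total diversity
within = sum(w * quadratic_entropy(f, delta) for w, f in zip(weights, freqs))
among = total - within                                         # diversity among populations

if __name__ == "__main__":
    print(total, gene_diversity(pooled), within, among)
```

With unit dissimilarities the quadratic entropy of each frequency vector coincides with `gene_diversity`, and the difference between total and mean within-population diversity plays the role of the among-population component in Nei's decomposition.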

Effects of frequencies and distances

The effect of frequencies and distances comprises two components: the effect due to sampling error and the effect due to population structure. The effects of sampling error on the components of nucleotide diversity within and between populations have been studied elsewhere [23, 46], and might be the object of further research in the context of the mDPCoA.

The relative effects of frequencies and distances on the analysis of population structure depend on the degree of differentiation among the populations under study. In case of low differentiation, population structure is usually due to variations in allelic frequencies. For instance, differences among French and Tunisian populations of S. meliloti that are highlighted by IGS_EXO, IGS_GAB and IGS_RKP are due to allelic frequencies. Conversely, as the number of alleles shared by the different populations decreases, taking into account the information provided by sequence divergence becomes crucial to describe their relationships efficiently. For instance, the specific inter-population structure of IGS_NOD is mainly due to sequence divergence.
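A small numerical illustration of this point uses Rao's dissimilarity coefficient between populations, which is closely related to the inter-population distances analysed by the DPCoA (the exact scaling used by the DPCoA is not reproduced here). The allele dissimilarities and population compositions below are hypothetical values chosen for the example, and the function names are assumptions.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def expected_dissimilarity(p: Sequence[float], q: Sequence[float], delta: Matrix) -> float:
    """Expected dissimilarity between one allele drawn from p and one drawn from q."""
    return sum(p[i] * q[j] * delta[i][j] for i in range(len(p)) for j in range(len(q)))


def rao_dissimilarity(p: Sequence[float], q: Sequence[float], delta: Matrix) -> float:
    """Between-population term minus the mean of the two within-population terms."""
    between = expected_dissimilarity(p, q, delta)
    within = 0.5 * (expected_dissimilarity(p, p, delta) + expected_dissimilarity(q, q, delta))
    return between - within


# Three alleles: the first two are nearly identical, the third is distant (toy values).
divergence = [[0.0, 0.1, 1.0],
              [0.1, 0.0, 1.0],
              [1.0, 1.0, 0.0]]
# Frequencies only: every pair of distinct alleles is equally dissimilar.
unit = [[0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 0.0]]

# Three populations that share no allele.
pop1, pop2, pop3 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]

if __name__ == "__main__":
    for name, delta in (("frequencies only", unit), ("with divergence", divergence)):
        print(name,
              rao_dissimilarity(pop1, pop2, delta),
              rao_dissimilarity(pop1, pop3, delta))
```

When no allele is shared, the frequency-only version places all populations at the same distance from each other, whereas the version using sequence divergence recovers that the first two populations carry closely related alleles.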

Pertinence of the correlation tests

Both correlation tests (Mantel and Rν) can be non-significant for two reasons: either because of an absence of population structure or because the two loci compared reveal different population structures. As highlighted in a previous section, the estimated ϕ_ST parameter and the factorial maps obtained by one of the three versions of the mDPCoA (with MCoA, STATIS or the MFA) can be used to choose between the two alternatives. Concerning the relative interest of the two tests, the Rν test proved more powerful when applied to our simulated data set, so we advocate its use.
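For readers unfamiliar with these two statistics, the sketch below shows one common way of computing them from two inter-population distance matrices. It is a minimal illustration under stated assumptions: populations are given uniform weights, the Rν coefficient is computed from doubly centred matrices of halved squared distances, and the Mantel test uses a one-sided permutation scheme. The function names and the toy distances are hypothetical, and the code is not the implementation distributed with the paper.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def upper_triangle(d: Matrix) -> List[float]:
    n = len(d)
    return [d[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x: List[float], y: List[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(d1: Matrix, d2: Matrix, n_perm: int = 999, seed: int = 0) -> Tuple[float, float]:
    """Mantel correlation and one-sided permutation p-value."""
    rng = random.Random(seed)
    n = len(d1)
    x = upper_triangle(d1)
    observed = pearson(x, upper_triangle(d2))
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)
        # Permute rows and columns of the second matrix simultaneously.
        d2p = [[d2[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
        if pearson(x, upper_triangle(d2p)) >= observed:
            hits += 1
    return observed, hits / (n_perm + 1)


def gower_center(d: Matrix) -> Matrix:
    """Doubly centred matrix of -d^2/2 (uniform weights assumed)."""
    n = len(d)
    a = [[-0.5 * d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in a]
    grand = sum(row) / n
    return [[a[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]


def rv_coefficient(d1: Matrix, d2: Matrix) -> float:
    """RV coefficient between the two centred cross-product matrices."""
    w1, w2 = gower_center(d1), gower_center(d2)

    def inner(a: Matrix, b: Matrix) -> float:
        return sum(a[i][j] * b[i][j] for i in range(len(a)) for j in range(len(a)))

    return inner(w1, w2) / math.sqrt(inner(w1, w1) * inner(w2, w2))


# Toy example: four populations on a line; the second locus doubles every distance.
positions = [0.0, 1.0, 3.0, 6.0]
dist_a = [[abs(p - q) for q in positions] for p in positions]
dist_b = [[2.0 * v for v in row] for row in dist_a]

if __name__ == "__main__":
    print(mantel(dist_a, dist_b), rv_coefficient(dist_a, dist_b))
```

Both statistics are insensitive to a global rescaling of the distances, so the two toy matrices are perfectly correlated; with real loci, a low value may reflect either the absence of structure or genuinely different structures, as discussed above.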

Relative advantages and disadvantages of the three proposed analyses – choice of a method

The three methods are alike in their procedure because they are all based on a compromise. However, they differ in the way the compromise is obtained. With the MCoA, the compromise is built during the definition of the factorial axes. It maximizes the average correlation among the individual analyses and the compromise. With STATIS, the compromise is obtained before going to the core of the multivariate ordination analysis. Here, the compromise maximizes the correlations among the patterns of inter-population diversity provided by the loci. With the MFA, the pieces of information given by the loci are simply added to each other by creating a large table juxtaposing the information on the loci; this last method is thus the simplest. In contrast, MCoA and STATIS first compare the patterns of inter-population diversity provided by the loci: MCoA does so in order to visualize the differences among loci in a single space, whereas STATIS erases these differences to find a best compromise over the loci.

Unfortunately, the representation of the differences among loci with STATIS is not optimal [15] because STATIS focuses on similarities instead of dissimilarities among loci. Consequently, in comparison with alternative methods, it is theoretically less suited to explaining and efficiently describing the differences in population patterns among loci. The description of the differences among population patterns is thus more precise using MCoA and MFA. Conversely, the main advantage of STATIS over other methods is that it provides a simpler compromise pattern.

The choice among the three methods therefore depends on the goal of the underlying study. If the objective is to obtain the best compromise over the loci, then we advocate the use of the DPCoA with STATIS. However, if the objective is to obtain a detailed comparison among the population patterns provided by the G loci, then we encourage the use of the DPCoA with the MCoA.
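To make the STATIS compromise more concrete, the sketch below derives locus weights from the first eigenvector of a matrix of Rν coefficients, which is the usual way the STATIS interstructure is turned into a compromise. The Rν values are invented for illustration (three coherent loci and one atypical locus), the eigenvector is obtained by a simple power iteration, and the function names are assumptions rather than part of any existing package.

```python
import math
from typing import List

Matrix = List[List[float]]


def first_eigenvector(matrix: Matrix, n_iter: int = 500, tol: float = 1e-12) -> List[float]:
    """Dominant eigenvector of a symmetric matrix with positive entries (power iteration)."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(n_iter):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
        if max(abs(a - b) for a, b in zip(v, w)) < tol:
            return w
        v = w
    return v


def statis_weights(rv_matrix: Matrix) -> List[float]:
    """Locus weights proportional to the first eigenvector, rescaled to sum to 1."""
    v = first_eigenvector(rv_matrix)
    total = sum(v)
    return [x / total for x in v]


# Hypothetical Rv coefficients among four loci; the fourth locus is atypical.
toy_rv = [
    [1.0, 0.9, 0.9, 0.3],
    [0.9, 1.0, 0.9, 0.3],
    [0.9, 0.9, 1.0, 0.3],
    [0.3, 0.3, 0.3, 1.0],
]

if __name__ == "__main__":
    print(statis_weights(toy_rv))
```

The atypical locus receives a smaller weight than the three coherent loci, which is why the resulting compromise reflects the pattern shared by the majority of the loci and is less informative about the differences among them.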

Complementarity between mDPCoA and other analyses

The mDPCoA could be associated with other tools to study population structure, including the AMOVA, which forms the basis of the DPCoA, Linkage Disequilibrium (LD) statistics, and also recent approaches such as STRUCTURE or CLONAL FRAME.

The AMOVA averages molecular variability over loci to test the existence of differences between populations or groups of populations in terms of both allele frequencies and nucleotide distances among alleles. The Mantel and Rν statistics associated with the mDPCoA use the same information to test the differences between the inter-population structures inferred by several loci.

Both linkage disequilibrium (LD) measures and the mDPCoA aim at assessing whether there is a significant association among the polymorphism patterns observed for different molecular markers. However, LD approaches and the mDPCoA differ in several ways. Without discrepancies among the population structures, the mDPCoA would fail to detect that different loci evolve independently, even if these are in linkage equilibrium at the population scale. Conversely, in the Sinorhizobium spp. data set, the mDPCoA detected that the IGS_NOD pattern of population differences was drastically different from the ones obtained with IGS_RKP, IGS_GAB and IGS_EXO, suggesting a horizontal gene transfer of nod genes between S. meliloti bv. meliloti and S. medicae. Because of the differentiation between S. meliloti and S. medicae, LD measures would have failed to detect such a transfer event. Linkage disequilibrium measures and the mDPCoA therefore appear as complementary tools to study the influence of sex during the evolution of bacterial lineages.

The mDPCoA is above all a descriptive method, as it does not rely on any assumptions about models of evolution such as linkage equilibrium or selective neutrality. Nevertheless, this analysis pipeline can raise questions that can then be investigated using complementary analyses. For instance, demonstrating differences among population structures obtained from different loci raises questions regarding the definition of population boundaries, or the genealogy of both genes and individuals. A consensus population structure can be inferred without any a priori knowledge using STRUCTURE, and its efficiency can be confirmed and illustrated using the correlation tests and the graphical outputs of the mDPCoA. CLONAL FRAME is an explanatory method, estimating clonal relationships and looking for key recombination events with a view to finding the mechanisms involved in microevolution [47]. It can be used to gain insights into the history of an atypical locus. Finally, the detection of selection traces and mechanistic experiments can be of great interest to explain mDPCoA results. These different approaches thus complement the mDPCoA, and conversely, the mDPCoA complements these approaches. For instance, both STRUCTURE and CLONAL FRAME imply working on MLS analyses, and the choice of the finite set of loci used in these analyses may be crucial. Each method can be improved by looking at the results returned by the two others. A joint interpretation of the results of the alternative methods may thus lead to a deeper analysis of particular loci and a better understanding of the data.

Conclusion

All three methods proposed can be used for a better description of inter-population genetic diversity measured over more than one locus. They invite a new reflection on the role of means in measures of diversity: can we work on average information over loci, or do we first need to examine the differences among the patterns of diversity given by the loci? Sometimes, the differences among loci are so high that the compromise obtained by the multivariate analyses will be unstable and the use of averaged information can hamper interpretation. This issue is related to a question raised decades ago: can we build a unique, very synthetic measure of biodiversity, or do we have to resign ourselves to defining several conflicting measures? As it is based on multivariate analyses, the multiple DPCoA in its three forms can be used to analyze large data sets. It allows a comparison of genetic diversity measured on various loci. It complements existing tools such as AMOVA and linkage disequilibrium measures. It is used here on molecular data because it is in genetics that the question of congruence among markers was raised several years ago. We illustrated this procedure using a limited but complex sequence database. The method will have to be tested on other data sets, yet the results are already very promising. Moreover, the mDPCoA is potentially more general than presented here, since it can be extended to any data set where pairs of matrices comprise a matrix of abundance or presence/absence and a matrix of dissimilarities. Further applications in ecology could thus be considered, such as the description of inter-community diversity based on both genotypic and phenotypic features.

Abbreviations

AMOVA: Analysis of MOlecular Variance
bv.: biovar
DPCoA: Double Principal Coordinate Analysis
FTmdc: Population sampled at Sainte Colombe l'Eglise in France from M. truncatula nodules which include S. medicae isolates
FTmlt: Population sampled at Sainte Colombe l'Eglise in France from M. truncatula nodules which include S. meliloti bv. meliloti isolates
IGS: Intergenic spacers
LD: Linkage disequilibrium
MCoA: Multiple Co-inertia Analysis
mDPCoA: multiple Double Principal Coordinate Analysis
MFA: Multiple Factorial Analysis
MLS: Multilocus Sequencing
PCA: Principal Component Analysis
STATIS: from the French expression "structuration des tableaux à trois indices de la statistique", which means: structuring of tables characterized by three statistical modes
TELmlt: Population sampled in Tunisia at Enfidha from M. laciniata nodules which include S. meliloti bv. medicaginis isolates
TETmdc: Population sampled in Tunisia at Enfidha from M. truncatula nodules which include S. medicae isolates
TETmlt: Population sampled in Tunisia at Enfidha from M. truncatula nodules which include S. meliloti bv. meliloti isolates
THLmlt: Population sampled in Tunisia at Hadjeb from M. laciniata nodules which include S. meliloti bv. medicaginis isolates
THTmlt: Population sampled in Tunisia at Hadjeb from M. truncatula nodules which include S. meliloti bv. meliloti isolates

References

1. Cooper JE, Feil EJ: Multilocus sequence typing: what is resolved? Trends in Microbiology. 2004, 12: 373-377. 10.1016/j.tim.2004.06.003.
2. Hanage WP, Fraser C, Spratt BG: The impact of homologous recombination on the generation of diversity in bacteria. Journal of Theoretical Biology. 2006, 239: 210-219. 10.1016/j.jtbi.2005.08.035.
3. Fraser C, Hanage WP, Spratt BG: Neutral microepidemic evolution of bacterial pathogens. Proceedings of the National Academy of Sciences of the United States of America. 2005, 102: 1968-1973. 10.1073/pnas.0406993102.
4. Metzker ML: Emerging technologies in DNA sequencing. Genome Research. 2005, 15: 1767-1776. 10.1101/gr.3770505.
5. Moazami-Goudarzi K, Laloë D: Is a multivariate consensus representation of genetic relationships among populations always meaningful? Genetics. 2002, 162: 473-484.
6. Hanage WP, Fraser C, Spratt BG: Fuzzy species among recombinogenic bacteria. BMC Biology. 2005, 3: 6. 10.1186/1741-7007-3-6.
7. Falush D, Torpdahl M, Didelot X, Conrad DF, Wilson DJ, Achtman M: Mismatch induced speciation in Salmonella: model and data. Philosophical Transactions of the Royal Society of London Series B - Biolog. 2006, 361: 2045-2053. 10.1098/rstb.2006.1925.
8. Bailly X, Olivieri I, De Mita S, Cleyet-Marel JC, Béna G: Recombination and selection shape the molecular diversity pattern of nitrogen-fixing Sinorhizobium sp. associated to Medicago. Molecular Ecology. 2006, 15: 2719-2734.
9. Falush D, Wirth T, Linz B, Pritchard JK, Stephens M, Kidd M, Blaser MJ, Graham DY, Vacher S, Perez-Perez GI, Yamaoka Y, Megraud F, Otto K, Reichard U, Katzowitsch E, Wang X, Achtman M, Suerbaum S: Traces of human migrations in Helicobacter pylori populations. Science. 2003, 299: 1582-1585. 10.1126/science.1080857.
10. Escoufier Y: Le traitement des variables vectorielles. Biometrics. 1973, 29: 750-760. 10.2307/2529140.
11. Pavoine S, Dufour AB, Chessel D: From dissimilarities among species to dissimilarities among communities: a double principal coordinate analysis. Journal of Theoretical Biology. 2004, 228: 523-537. 10.1016/j.jtbi.2004.02.014.
12. Eckburg PB, Bik EM, Bernstein CN, Purdom E, Dethlefsen L, Sargent M, Gill SR, Nelson KE, Relman DA: Diversity of the human intestinal microbial flora. Science. 2005, 308: 1635-1638. 10.1126/science.1110591.
13. Bik EM, Eckburg PB, Gill SR, Nelson KE, Purdom EA, Francois F, Perez-Perez G, Blaser MJ, Relman DA: Molecular analysis of the bacterial microbiota in the human stomach. Proceedings of the National Academy of Sciences of the United States of America. 2006, 103: 732-737. 10.1073/pnas.0506655103.
14. Chessel D, Hanafi M: Analyses de la co-inertie de K nuages de points. Revue de Statistique Appliquée. 1996. [http://www.numdam.org/item?id=RSA_1996__44_2_35_0]
15. Lavit C, Escoufier Y, Sabatier R, Traissac P: The ACT (Statis method). Computational Statistics and Data Analysis. 1994, 18: 97-119. 10.1016/0167-9473(94)90134-1.
16. Escofier B, Pagès J: Multiple factor analysis: results of a three-year utilization. Multiway data analysis. Edited by: Coppi R and Bolasco S. 1989, Elsevier Science Publishers B.V., North-Holland, 277-285.
17. Chessel D, Dufour AB, Thioulouse J: The ade4 package -I- One-table methods. R News. 2004, 4: 5-10. [http://cran.r-project.org/doc/Rnews/Rnews_2004-1.pdf]
18. Paradis E, Strimmer K, Claude J, Jobb G, Opgen-Rhein R, Dutheil J, Noel Y, Bolker B: ape: Analyses of Phylogenetics and Evolution. 2005, R package version 1.7.
19. Ihaka R, Gentleman R: R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics. 1996, 5: 299-314. 10.2307/1390807.
20. Gower JC: Euclidean distance geometry. Mathematical Scientist. 1982, 7: 1-14.
21. Lingoes JC: Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika. 1971, 36: 195-203. 10.1007/BF02291398.
22. Cailliez F: The analytic solution of the additive constant problem. Psychometrika. 1983, 48: 305-310. 10.1007/BF02294026.
23. Nei M, Li WH: Mathematical model for studying genetic variation in terms of restriction endonucleases. Proceedings of the National Academy of Sciences of the United States of America. 1979, 76: 5269-5273. 10.1073/pnas.76.10.5269.
24. Rao CR: Diversity and dissimilarity coefficients: a unified approach. Theoretical Population Biology. 1982, 21: 24-43. 10.1016/0040-5809(82)90004-1.
25. Excoffier L, Smouse PE, Quattro JM: Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics. 1992, 131: 479-491.
26. Pavoine S, Dolédec S: The apportionment of quadratic entropy: a useful alternative for partitioning diversity in ecological data. Environmental and Ecological Statistics. 2005, 12: 125-138. 10.1007/s10651-005-1037-2.
27. Rao CR: Rao's axiomatization of diversity measures. Encyclopedia of Statistical Sciences. Edited by: Kotz S and Johnson NL. 1986, New York, Wiley and Sons, 614-617.
28. Nei M: Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences of the United States of America. 1973, 70: 3321-3323. 10.1073/pnas.70.12.3321.
29. Nei M: Molecular evolutionary genetics. 1987, New York, NY, USA, Columbia University Press.
30. Laval G, Excoffier L: SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics. 2004, 12: 2485-2487. 10.1093/bioinformatics/bth264.
31. Kimura M: Stepping Stone model of population. Annual Report of the National Institute of Genetics. 1953, 3: 62-63.
32. Jukes T, Cantor C: Evolution of protein molecules. Mammalian protein metabolism. Edited by: Munro HN. 1969, New York, Academic Press, 21-132.
33. Charlesworth D, Mable BK, Schierup MH, Bartolomé C, Awadalla P: Diversity and Linkage of Genes in the Self-Incompatibility Gene Family in Arabidopsis lyrata. Genetics. 2003, 164: 1519-1535.
34. Bailly X, Olivieri I, Brunel B, Cleyet-Marel JC, Béna G: Horizontal gene transfer and homologous recombination drive the evolution of the nitrogen-fixing symbionts of Medicago species. Journal of Bacteriology. 2007, 189: 5223-5236. 10.1128/JB.00105-07.
35. Bena G, Lyet A, Huguet T, Olivieri I: Medicago - Sinorhizobium symbiotic specificity evolution and the geographic expansion of Medicago. Journal of Evolutionary Biology. 2005, 18: 1547-1558.
36. Villegas MDC, Rome S, Maure L, Domergue O, Gardan L, Bailly X, Cleyet-Marel JC, Brunel B: Nitrogen-fixing sinorhizobia with Medicago laciniata constitute a novel biovar (bv. medicaginis) of S. meliloti. Systematic and Applied Microbiology. 2006, 29: 526-538. 10.1016/j.syapm.2005.12.008.
37. Barran LR, Bromfield ES, Brown DC: Identification and cloning of the bacterial nodulation specificity gene in the Sinorhizobium meliloti - Medicago laciniata symbiosis. Canadian Journal of Microbiology. 2002, 48: 765-771. 10.1139/w02-072.
38. Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology. 2003, 52: 696-704. 10.1080/10635150390235520.
39. Felsenstein J, Churchill GA: A Hidden Markov model approach to variation among sites in rate of evolution. Molecular Biology and Evolution. 1996, 13: 93-104. [http://mbe.oxfordjournals.org/cgi/content/abstract/13/1/93]
40. McGuire G, Prentice MJ, Wright F: Improved error bounds for genetic distances from DNA sequences. Biometrics. 1999, 55: 1064-1070. 10.1111/j.0006-341X.1999.01064.x.
41. Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution. 1981, 17: 368-376. 10.1007/BF01734359.
42. Falush D, Stephens M, Pritchard JK: Inference of population structure using multilocus genotype data: dominant markers and null alleles. Molecular Ecology Notes. 2007, published online, doi: 10.1111/j.1471-8286.2007.01758.x.
43. Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155: 945-959.
44. Falush D, Stephens M, Pritchard JK: Inference of population structure: Extensions to linked loci and correlated allele frequencies. Genetics. 2003, 164: 1567-1587.
45. Lauro N, D'Ambra L: L'analyse non symétrique des correspondances. Data Analysis and Informatics, III. Edited by: Diday E, Jambu M, Lebart L, Pages J and Tomassone R. 1984, North-Holland, Elsevier, 433-446.
46. Lynch M, Crease TJ: The analysis of population survey data on DNA sequence variation. Molecular Biology and Evolution. 1990, 7: 377-394. [http://mbe.oxfordjournals.org/cgi/content/abstract/7/4/377]
47. Didelot X, Falush D: Inference on bacterial microevolution using multilocus sequence data. Genetics. 2007, 175: 1251-1266. 10.1534/genetics.106.063305.

Acknowledgements

The authors are grateful to Pr. I Olivieri, Pr. JPW Young and two anonymous reviewers for their useful comments about this study. We also thank R. Lower, and the American Journal Experts who helped us to improve the quality of this manuscript. This paper takes place in a research project on "Biodiversity, perception and use" funded by the French Institute of Biodiversity. Within this more general context, we develop and discuss methodologies for measuring biodiversity on multi-marker data sets at various scales, from individuals' gene loci to species' functional traits.

Author information

Authors and Affiliations

Unité de Conservation des espèces, restauration et suivi des populations (UMR MNHN-UPMC-CNRS 5173), Muséum National d'Histoire Naturelle, 55 rue Buffon, 75005, Paris, France
Sandrine Pavoine
Department of Biology, University of York, Post Office Box 373, York, YO10 5YW, UK
Xavier Bailly

Corresponding author

Correspondence to Sandrine Pavoine.

Additional information

Authors' contributions

SP developed the methodology and applied it to the data. XB performed the simulations and characterized Sinorhizobium populations. He interpreted the results. Both authors contributed equally to the discussion. Both authors read and approved the final draft.

Electronic supplementary material

12862_2007_447_MOESM1_ESM.R

Additional file 1: Functions in R to perform multiple DPCoA. The file is called "mdpcoa.R". It can be read by the R software which can be downloaded free of charge, and one can refer to the Additional file 2 for explanation on how to use it. (R 8 KB)

12862_2007_447_MOESM2_ESM.pdf

Additional file 2: Instructions for performing multiple DPCoA in R. The file is called "Instruction.pdf". It describes in step by step detail how to use R to perform a multiple DPCoA using the real data set in this paper. (PDF 96 KB)

12862_2007_447_MOESM3_ESM.pdf

Additional file 3: Description of the real data set. The complete sampling procedure is given together with a description of within-population diversity. (PDF 80 KB)

12862_2007_447_MOESM4_ESM.aa

Additional file 4: DNA sequences for IGS_NOD. Sequences are in "FASTA" format. The file is named "NOD.aa". See Additional file 2 for an explanation of how to use this file. (AA 103 KB)

12862_2007_447_MOESM5_ESM.aa

Additional file 5: DNA sequences for IGS_EXO. Sequences are in "FASTA" format. The file is named "EXO.aa". See Additional file 2 for an explanation of how to use this file. (AA 126 KB)

12862_2007_447_MOESM6_ESM.aa

Additional file 6: DNA sequences for IGS_GAB. Sequences are in "FASTA" format. The file is named "GAB.aa". See Additional file 2 for an explanation of how to use this file. (AA 76 KB)

12862_2007_447_MOESM7_ESM.aa

Additional file 7: DNA sequences for IGS_RKP. Sequences are in "FASTA" format. The file is named "RKP.aa". See Additional file 2 for an explanation of how to use this file. (AA 73 KB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Pavoine, S., Bailly, X. New analysis for consistency among markers in the study of genetic diversity: development and application to the description of bacterial diversity. BMC Evol Biol 7, 156 (2007). https://doi.org/10.1186/1471-2148-7-156

Received: 17 January 2007
Accepted: 03 September 2007
Published: 03 September 2007
DOI: https://doi.org/10.1186/1471-2148-7-156
