MRC Centre for Outbreak Analysis and Modelling, Department of Infectious Disease Epidemiology, Imperial College Faculty of Medicine, St Mary's Campus, Norfolk Place, London W2 1PG, UK

Université de Lyon, Université Lyon1, UMR 5558 - LBBE "Biométrie et Biologie évolutive" Bât. Grégor Mendel, 43 bd du 11 novembre 1918, 69622 Villeurbanne cedex, France

Abstract

Background

The dramatic progress in sequencing technologies offers unprecedented prospects for deciphering the organization of natural populations in space and time. However, the size of the datasets generated also poses some daunting challenges. In particular, Bayesian clustering algorithms based on pre-defined population genetics models such as the STRUCTURE or BAPS software may not be able to cope with this unprecedented amount of data. Thus, there is a need for less computer-intensive approaches. Multivariate analyses seem particularly appealing as they are specifically devoted to extracting information from large datasets. Unfortunately, currently available multivariate methods still lack some essential features needed to study the genetic structure of natural populations.

Results

We introduce the

Conclusions

Analysis of simulated data revealed that our approach generally performs better than STRUCTURE at characterizing population subdivision. The tools implemented in DAPC for the identification of clusters and graphical representation of between-group structures make it possible to unravel complex population structures. Our approach is also faster than Bayesian clustering algorithms by several orders of magnitude, and may be applicable to a wider range of datasets.

Background

The study of the genetic structure of biological populations has attracted a growing interest from a wide array of fields, such as population biology, molecular ecology, and medical genetics. One of the most widely applied approaches is the inference of population structuring with Bayesian clustering methods such as STRUCTURE

Unfortunately, the reliance of Bayesian clustering methods on explicit models also comes at a cost. Model-based approaches rely on assumptions such as the type of population subdivision, which are often difficult to verify and can restrict their applicability. Furthermore, estimation of a large number of parameters

Multivariate analyses have been used for decades to extract various types of information from genetic data and have attracted renewed interest in the field

However, PCA lacks some essential features for investigating the genetic structure of biological populations. First, it does not provide a group assessment, and would require


**Fundamental difference between PCA and DA**. (a) The diagram shows the essential difference between Principal Component Analysis (PCA) and Discriminant Analysis (DA). Individuals (dots) and groups (colours and ellipses) are positioned on the plane using their values for two variables. In this space, PCA searches for the direction showing the largest total variance (dotted arrow), whereas DA maximizes the separation between groups (plain arrow) while minimizing variation within groups. As a result, PCA fails to discriminate the groups (b), while DA adequately displays group differences.
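The contrast described in this caption can be reproduced on toy data. The following is a minimal sketch, assuming two simulated groups whose difference lies along a low-variance axis; the data, the variable names and the helper `f_statistic` are hypothetical and only serve to illustrate the principle.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical groups: large shared variance along x, small group offset along y.
n = 200
group = np.repeat([0, 1], n)
X = np.column_stack([
    rng.normal(0.0, 5.0, 2 * n),                # high-variance axis, no group signal
    rng.normal(0.0, 0.5, 2 * n) + 2.0 * group,  # low-variance axis carrying the group difference
])
X = X - X.mean(axis=0)

# PCA: leading eigenvector of the total covariance matrix.
total_cov = X.T @ X / len(X)
pca_axis = np.linalg.eigh(total_cov)[1][:, -1]

# DA: leading eigenvector of W^{-1} B (within- and between-group covariances).
means = np.array([X[group == g].mean(axis=0) for g in (0, 1)])
within_dev = X - means[group]
W = within_dev.T @ within_dev / len(X)
B = total_cov - W
evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
da_axis = np.real(evecs[:, np.argmax(np.real(evals))])
da_axis /= np.linalg.norm(da_axis)

def f_statistic(scores):
    """Ratio of between-group to within-group variance of 1-D scores."""
    gm = np.array([scores[group == g].mean() for g in (0, 1)])[group]
    between = np.mean((gm - scores.mean()) ** 2)
    within = np.mean((scores - gm) ** 2)
    return between / within

f_pca = f_statistic(X @ pca_axis)
f_da = f_statistic(X @ da_axis)
print(f"F along PCA axis: {f_pca:.3f}; F along DA axis: {f_da:.3f}")
```

In this setting the first PCA axis follows the high-variance direction and carries almost no group information, whereas the discriminant axis yields a much larger ratio of between-group to within-group variance.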

This is precisely the rationale of Discriminant Analysis (DA)

Unfortunately, DA suffers from considerable restrictions which often preclude its application to genetic data. First, the method requires the number of variables (alleles) to be less than the number of observations (individuals). This condition is generally not fulfilled in Single Nucleotide Polymorphism (SNP) or re-sequencing datasets. Second, it is hampered by correlations between variables, which necessarily occur in allele frequencies due to the constant-row sum constraint [i.e., compositional data,

In this paper, we introduce the Discriminant Analysis of Principal Components (DAPC), a new methodological approach which retains all assets of DA without being burdened by its limitations. DAPC relies on data transformation using PCA as a prior step to DA, which ensures that variables submitted to DA are perfectly uncorrelated, and that their number is less than that of analysed individuals. Without implying a necessary loss of genetic information, this transformation allows DA to be applied to any genetic data. Like PCA, our approach can be applied to very large datasets, such as hundreds of thousands of SNPs typed for thousands of individuals. Moreover, the contributions of alleles to the structures identified by DAPC can help identify regions of the genome driving genetic divergence among groups. Along with the assignment of individuals to clusters, our method provides a visual assessment of between-population genetic structures, making it possible to infer complex patterns such as hierarchical clustering or clines.

Whenever group priors are unknown, we use K-means clustering of principal components to identify groups of individuals

We apply DAPC to both simulated and empirical datasets. We use simulations to assess the ability of our approach to infer the right genetic clusters, and compare our results to those obtained with STRUCTURE

Results

Analysis of simulated datasets

As a benchmark, we first compared the results of DAPC to those obtained by STRUCTURE using simulations. Data were simulated with EASYPOP


**Diagram of migration models used in simulations**. The four panels represent (a) an island model, (b) a hierarchical island model, (c) a hierarchical stepping stone model, and (d) a stepping stone model with 24 populations. Red disks represent random mating sub-populations (demes) and arrows the interconnecting migration routes (black arrows represent greater gene flow than grey ones). Dotted lines indicate archipelagos (b) or a contact zone (c).

Parameters of simulations.

| | **Island model** | **Hierarchical island model** | **Hierarchical stepping stone** | **Stepping stone** |
|---|---|---|---|---|
| Number of populations | 6 | 6 (3, 2, 1) | 12 (6, 6) | 24 |
| Population size | 200 | 200 | 100 | 50 |
| Sample size^{(1)} | 100 | 100 | 50 | 25 |
| Migration rate | 0.005 | 0.05/0.005^{(2)} | 0.01/0.001^{(2)} | 0.02 |
| Mutation rate | 10^{-4} | 10^{-4} | 10^{-4} | 10^{-4} |
| Number of loci | 30 | 30 | 30 | 30 |
| Possible allelic states | 50 | 50 | 50 | 50 |

This table indicates the parameters used to simulate data under four different models (see Figure 2). ^{(1)}Sample size refers to the number of individuals per population retained in the analyses.

^{(2)}The first migration rate refers to between-population migration, whereas the second refers to migration between the higher hierarchical levels.

Summary statistics of the simulations.

| Model | Statistic | **Median** | **Quantile 5%** | **Quantile 95%** |
|---|---|---|---|---|
| **Island model** | F_{ST} | 0.1 | 0.07 | 0.13 |
| | H_{S} | 0.42 | 0.36 | 0.46 |
| | number of alleles/locus | 5 | 3 | 8 |
| **Hierarchical island model** | F_{ST} | 0.05 | 0.03 | 0.08 |
| | H_{S} | 0.41 | 0.33 | 0.49 |
| | number of alleles/locus | 5 | 2 | 8 |
| **Hierarchical stepping stone** | F_{ST} | 0.37 | 0.09 | 0.56 |
| | H_{S} | 0.3 | 0.2 | 0.38 |
| | number of alleles/locus | 6 | 3 | 9 |
| **Stepping stone** | F_{ST} | 0.42 | 0.12 | 0.64 |
| | H_{S} | 0.27 | 0.13 | 0.36 |
| | number of alleles/locus | 6 | 4 | 9 |

This table reports usual genetic summary statistics computed on the simulated datasets. F_{ST} refers to the mean pairwise F_{ST} computed using Nei's estimator. H_{S} refers to the gene diversity (expected heterozygosity under random mating).

Ten independent replicates were obtained for each model. Each dataset was analysed by both STRUCTURE and DAPC. Accuracy of the results obtained with STRUCTURE depended critically on the underlying population genetic model behind the simulated data (Table

Results of the analyses of simulated data.

| | **Island model** | **Hierarchical island model** | **Hierarchical stepping stone** | **Stepping stone** |
|---|---|---|---|---|
| Number of populations (true K) | 6 | 6 | 12 | 24 |
| K inferred by DAPC | 6 ([6,6]) | 6 ([6,8]) | 11 ([8,12]) | 17.5 ([13,21]) |
| K inferred by STRUCTURE | 6 ([2,7]) | 3 ([2,6]) | 2 ([2,2]) | 2 ([2,5]) |
| % of correct assignment by DAPC | 98.2% ([96.3%,99%]) | 87.5% ([73.9%,91.2%]) | 89.7% ([87.9%,97.2%]) | 83.9% ([80%,88.7%]) |
| % of correct assignment by STRUCTURE | 98.6% ([98%,99.2%]) | 93.1% ([89.2%,95.5%]) | NA^{(1)} | NA^{(1)} |

This table reports the results of analyses of simulated data (see Figure 2) by DAPC and STRUCTURE. ^{(1)}NA is indicated when the percentage of successful assignment could not be computed with STRUCTURE. In these cases, the 'optimal'

The same datasets were analysed by DAPC using the


**Inference of the number of clusters in simulated data**. These four panels report examples of outputs from single simulations of the function

Then, DAPC was performed (function

Successful detection of the correct number of genetic clusters is undoubtedly a desirable feature. However, this information alone is not sufficient to describe the apportionment of genetic diversity within a population. What is additionally needed to gain real insights about the system under study is a representation of the relatedness between clusters. DAPC is particularly well suited for this task, as it finds principal components which best summarize the differences between clusters while neglecting within-cluster variation (Figure


**Scatterplots of DAPC of simulated data**. These scatterplots show the first two principal components of the DAPC of data simulated according to four different models (a: island model; b: hierarchical islands model; c: hierarchical stepping stone and d: stepping stone; see Figure 2). Clusters are shown by different colours and inertia ellipses, while dots represent individuals.

Analysis of empirical data

Human microsatellite data

DAPC was applied to the microsatellite genotypes from the Human Genome Diversity Project-Centre d'Etude du Polymorphisme Humain (HGDP-CEPH)

Two analyses were run for this dataset. First, we used DAPC to investigate the genetic structure of the 79 sampled populations. We retained 1,000 principal components of PCA during the preliminary variable transformation, which accounted for most (approximately 94%) of the total genetic variability. It is worth noting that despite the respectable size of this dataset (1350 individuals and 8170 alleles), DAPC was run in less than a minute on a standard desktop computer. The eigenvalues of the analysis (Figure


**Colorplot of the DAPC of extended HGDP-CEPH data**. This colorplot represents the first three principal components (PC) of the DAPC of extended HGDP-CEPH data, using populations as prior clusters. Each dot corresponds to a sampled population. Each principal component is recoded as intensities of a given colour channel of the RGB system: red (first PC), green (second PC), and blue (third PC). These channels are mixed to form colours representing the genetic similarity of populations. The inset indicates the eigenvalues of the analysis, with colour channels used to represent PCs indicated on the corresponding eigenvalues.
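The colour recoding described in this caption can be sketched as follows. This is a minimal illustration, assuming each principal component is linearly rescaled to the interval [0, 1] before being used as a colour channel; the function name `pcs_to_rgb` and the toy scores are hypothetical, and the exact rescaling used to draw the figure may differ.

```python
import numpy as np

def pcs_to_rgb(scores):
    """Rescale the first three columns of `scores` to [0, 1] and read them as R, G, B."""
    pcs = np.asarray(scores, dtype=float)[:, :3]
    lo, hi = pcs.min(axis=0), pcs.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero for constant columns
    return (pcs - lo) / span

# Hypothetical scores of three populations on the first three principal components.
scores = np.array([[-2.0, 0.5, 1.0],
                   [ 0.0, 0.5, 3.0],
                   [ 2.0, 1.5, 2.0]])
rgb = pcs_to_rgb(scores)

# Convert each row to a hexadecimal colour code usable by most plotting tools.
hex_codes = ["#%02x%02x%02x" % tuple(int(round(255 * c)) for c in row) for row in rgb]
print(hex_codes)
```

Populations with similar scores on the three components receive similar colours, which is what allows genetic similarity to be read directly from a map.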

While largely consistent with previous well-established findings, these results are based on the clustering of individuals into geographically predefined populations. This has the possible drawback that higher levels of genetic clustering could be overlooked. To evaluate this hypothesis, we looked for the best supported number of clusters using our approach based on the K-means algorithm. Inspection of the BIC values ranging from one to 100 clusters clearly showed that a subdivision into four clusters should be considered (Figure


**Inference of the number of clusters in the extended HGDP-CEPH data**. This graph shows the output of the function


**Colorplot of the DAPC of extended HGDP-CEPH data based on four inferred clusters**. This colorplot represents the three principal components (PC) of the DAPC of extended HGDP-CEPH data, using the four clusters inferred by

Seasonal influenza (H3N2) hemagglutinin data

To illustrate the versatility of our approach, we selected a radically different dataset for the second example. We analysed the population structure of seasonal influenza A/H3N2 viruses using hemagglutinin (HA) sequences. Changes in the HA gene are largely responsible for immune escape of the virus (antigenic drift), and allow seasonal influenza to persist by mounting yearly epidemics peaking in winter

Assessing the genetic evolution of a pathogen through successive epidemics is of considerable epidemiological interest. In the case of seasonal influenza, we would like to ascertain how genetic changes accumulate among strains from one winter epidemic to the next. For this purpose, we retrieved all sequences of H3N2 hemagglutinin (HA) collected between 2001 and 2007 available from Genbank ^{st }of July 2005 and the 30^{th }of June 2006.

DAPC was used to investigate the pattern of genetic diversity in these data. We retained 150 principal components of PCA in the preliminary data transformation step, which altogether contained more than 90% of the total genetic variation. The first two principal components of DAPC were sufficient to summarize the temporal evolution of the virus (Figure


**Scatterplots of the DAPC of seasonal influenza (H3N2) data**. This scatterplot shows the first two principal components of the DAPC of seasonal influenza (H3N2) hemagglutinin data, using years of sampling as prior clusters. Groups are shown by different colours and inertia ellipses, while dots represent individual strains.

It has recently been suggested that seasonal influenza epidemics are seeded each year from a reservoir in Southeast Asia


**Contributions of alleles to the second principal component of the DAPC of seasonal (H3N2) influenza data**. The height of each bar is proportional to the contribution (Equation 10) of the corresponding allele to the second principal component of the analysis, which isolated the strains from the 2006 influenza epidemic from all others (see Figure 8). For the sake of clarity, only alleles whose contribution is above an arbitrary threshold (grey horizontal line) are indicated. Alleles are labeled by their position in the original alignment and the corresponding nucleotide, separated by a dot. Positions 384 and 906 correspond to residues 144 and 318, respectively, in the complete hemagglutinin (HA) protein CDS. Polymorphism at position 384 leads to a mutation from Asparagine to Lysine, present in 32.1% of strains sampled in 2006 while virtually absent before 2006. Polymorphism at position 906 is synonymous.

Discussion and Conclusions

In this paper, we introduced a new multivariate method, the Discriminant Analysis of Principal Components (DAPC), for the analysis of the genetic structure of populations. This approach can be used to define clusters of individuals and to unravel possibly complex structures existing among clusters, such as hierarchical clustering and clinal differentiation, while being orders of magnitude faster than existing Bayesian clustering methods. For simulated data, DAPC proved as accurate as STRUCTURE in detecting hidden population clusters within simple island population models. Moreover, DAPC was better suited to unravelling the underlying structure in more complex population genetics models. Another major advantage of DAPC over Bayesian clustering approaches is the possibility to generate a graphical representation of the relatedness between the inferred clusters. Applied to two highly contrasting empirical datasets, our method was able to identify non-trivial and meaningful biological patterns.

One of the main assets of DAPC is its great versatility. Indeed, DAPC does not rely on a particular population genetics model, and is thus free of assumptions about Hardy-Weinberg equilibrium or linkage disequilibrium. As such it should be useful for a variety of organisms, irrespective of their ploidy and rate of genetic recombination. Also, contrary to Bayesian clustering methods, DAPC can be applied to very large datasets within negligible computational time (all analyses presented in this paper took less than a minute to run on a standard computer). Moreover, the method is not restricted to genetic data, and can be applied to any quantitative data such as morphometric data. This feature is particularly interesting as it allows for partialling out the effects of undesirable covariates, such as different sequencing protocols, or trivial genetic structures that could obscure lesser, more interesting patterns. This can be achieved by analyzing the residuals of a preliminary model including the covariates as predictors instead of the raw data.
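The idea of analysing residuals instead of raw data can be illustrated with a short sketch. It assumes that the preliminary model is an ordinary least-squares regression of each variable on the covariates, which is only one possible choice; the function name `partial_out`, the simulated data and the batch indicator are hypothetical.

```python
import numpy as np

def partial_out(X, covariates):
    """Return residuals of each column of X after least-squares regression on covariates."""
    C = np.column_stack([np.ones(len(X)), covariates])   # design matrix with an intercept
    coef, *_ = np.linalg.lstsq(C, X, rcond=None)
    return X - C @ coef

rng = np.random.default_rng(4)
batch = np.repeat([0.0, 1.0], 30)                  # hypothetical sequencing-protocol indicator
X = rng.random((60, 5)) + 0.8 * batch[:, None]     # data with an artificial batch shift
R = partial_out(X, batch)

# After partialling out, the two batches have the same (zero) mean for every variable.
print(np.abs(R[batch == 1].mean(axis=0)).max())
```

The residual matrix `R` would then be passed to the analysis in place of the raw data, so that the covariate can no longer drive the structure that is displayed.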

A major concern pertaining to all clustering approaches is the risk of inferring artefactual discrete groups in populations where genetic diversity is distributed continuously. Such spurious clusters are particularly likely to arise under spatially heterogeneous sampling of populations

We chose to analyse two contrasting datasets to illustrate the versatility of our approach. The HGDP-CEPH dataset has been repeatedly analysed using a variety of methods

In contrast, the seasonal influenza analysis highlights features that go beyond simple genetic clustering. The DAPC scatterplot reveals that the virus is genetically structured into clusters which are arranged along a temporal cline, and shows a marked discontinuity between two successive years. Examination of allele loadings further reveals that this abrupt change is due to the appearance of new alleles in the global population, one of which induced a change in the amino-acid sequence, and may therefore have been subject to natural selection.

Although DAPC is a promising tool for the analysis of genetic data, further methodological developments should be considered to improve our approach. K-means has proved very efficient here as in previous studies for identifying genetic clusters

Irrespective of these methodological adjustments, we can see applications of DAPC beyond the mere study of the genetic structure of populations. One field where the method may be particularly relevant is association studies. In this context, population structuring ('

Association studies aim at identifying genetic features that differ between two or more groups of individuals. In other words, the aim is to identify the alleles that best discriminate a set of pre-defined clusters. DAPC seems perfectly adapted to this task, as it finds linear combinations of alleles (the discriminant functions) which best separate the clusters. Alleles with the largest contributions to this discrimination are therefore those which are the most markedly different across groups, which could represent cases and controls. A simple plot of allele contributions (Figure

To conclude, DAPC appears as a fast, powerful and flexible tool to unravel the makeup of genetically structured populations. However, we have no doubt that the application of this method goes way beyond the illustrations provided in this paper. We hope that its implementation in the free software R

Methods

Measuring between-group differentiation

Discriminant Analysis (DA), DAPC, and K-means clustering all rely on the same statistical model to quantify between-group differentiation, which is in fact a classical ANOVA model. Below, we introduce this general model using concepts and notations further used in the specific presentation of DAPC and K-means clustering.

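Let **y** ∈ ℝ^{n} be the vector of a centred variable with *n* observations (y_{1},...,y_{n}) distributed into *g* groups. Let **D** be the diagonal matrix containing uniform weights for the observations (1/*n*), and **H** = [h_{ij}] the *n* × *g* matrix of dummy vectors coding group membership, such that h_{ij} = 1 if observation *i* belongs to group *j*, and h_{ij} = 0 otherwise. We define **P** = **H**(**H**^{T}**DH**)^{-1}**H**^{T}**D** as the projector onto the dummy vectors of **H**, which can be used to replace each observation y_{i} by the mean value of the group to which it belongs. The ANOVA model relies on the following decomposition of **y**:

$$\mathbf{y} = \mathbf{P}\mathbf{y} + (\mathbf{I} - \mathbf{P})\mathbf{y} \tag{1}$$

where **I** is the identity matrix of dimension *n*. Because **y** is centred, the vectors **Py** and (**I** − **P**)**y**, which are orthogonal for the inner product defined by **D**, have squared norms that are variances, leading to the decomposition:

$$\mathrm{var}(\mathbf{y}) = b(\mathbf{y}) + w(\mathbf{y}) \tag{2}$$

where $\mathrm{var}(\mathbf{y}) = \mathbf{y}^T\mathbf{D}\mathbf{y}$ is the total variance, $b(\mathbf{y}) = (\mathbf{P}\mathbf{y})^T\mathbf{D}(\mathbf{P}\mathbf{y})$ the variance between groups, and $w(\mathbf{y}) = ((\mathbf{I}-\mathbf{P})\mathbf{y})^T\mathbf{D}((\mathbf{I}-\mathbf{P})\mathbf{y})$ the variance within groups. To measure the extent to which groups differ for **y**, we use the ratio of between-group and within-group variances, also known as the F statistic:

$$F(\mathbf{y}) = \frac{b(\mathbf{y})}{w(\mathbf{y})} \tag{3}$$

This quantity takes positive values only, with larger values indicating stronger differences between groups. Alternatively, one could use the proportion of variance explained by the model, which is also known as the correlation ratio of **y**, defined as:

$$\eta^2(\mathbf{y}) = \frac{b(\mathbf{y})}{\mathrm{var}(\mathbf{y})} \tag{4}$$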

In fact, both quantities can be used as a measure of group separation in DA and DAPC, and would yield identical results (discriminant functions) up to a constant. In the remainder of this paper, we shall refer to the F statistic only.
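The variance decomposition and the link between the two measures of group separation can be checked numerically. The sketch below assumes uniform observation weights and uses a hypothetical toy variable with six observations in two groups; all variable names are illustrative.

```python
import numpy as np

# Hypothetical toy data: 6 observations in 2 groups.
y = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
groups = np.array([0, 0, 0, 1, 1, 1])
n = len(y)
y = y - y.mean()                                 # centre the variable

D = np.eye(n) / n                                # uniform observation weights
H = np.eye(groups.max() + 1)[groups]             # matrix of dummy vectors (one column per group)
P = H @ np.linalg.inv(H.T @ D @ H) @ H.T @ D     # projector onto the dummy vectors

def sq_norm_D(v):
    """Squared norm for the inner product defined by D."""
    return float(v @ D @ v)

total = sq_norm_D(y)
between = sq_norm_D(P @ y)
within = sq_norm_D((np.eye(n) - P) @ y)

F = between / within
eta2 = between / total
print(total, between + within)     # total variance equals between + within
print(F, eta2 / (1 - eta2))        # F is an increasing function of the explained proportion
```

Since the F statistic equals the explained proportion divided by its complement, maximizing one quantity is equivalent to maximizing the other, which is why both criteria lead to the same discriminant functions.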

Discriminant Analysis of Principal Components

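Let **X** be an *n* × *p* matrix containing the relative frequencies of *p* alleles (in columns) for *n* individuals (in rows). For instance, for a diploid organism and a locus with three alleles (A_{1}, A_{2}, A_{3}), a homozygote genotype A_{1}/A_{1} is coded as [1, 0, 0], while a heterozygote A_{2}/A_{3} is coded as [0, 0.5, 0.5]. We denote **X**^{j} the *j*^{th} allele-column of **X**. Missing data are replaced with the mean frequency of the corresponding allele, which avoids adding artefactual between-group differentiation. Without loss of generality, we assume that each column of **X** is centred to mean zero. Classical (linear) discriminant analysis seeks linear combinations of alleles with the form:

$$\sum_{j=1}^{p} v_j \mathbf{X}^j = \mathbf{X}\mathbf{v} \tag{5}$$

(**v** = [v_{1}...v_{p}]^{T} being a vector of *p* allele loadings) which best separate the groups as measured by the F statistic. That is, DA chooses **v** so that F(**Xv**) is maximum.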
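The coding of genotypes described above can be sketched for a single locus. This is a minimal illustration, assuming genotypes are given as tuples of 0-based allele indices and missing genotypes as `None`; the function name `code_genotypes` and the toy genotypes are hypothetical.

```python
import numpy as np

def code_genotypes(genotypes, n_alleles):
    """Code genotypes at one locus as relative allele frequencies.

    `genotypes` is a list of tuples of 0-based allele indices, or None for a
    missing genotype. Missing rows are filled with the mean allele frequencies.
    """
    X = np.full((len(genotypes), n_alleles), np.nan)
    for i, g in enumerate(genotypes):
        if g is None:
            continue
        X[i] = np.bincount(g, minlength=n_alleles) / len(g)
    col_means = np.nanmean(X, axis=0)            # mean frequency of each allele
    missing = np.isnan(X)
    X[missing] = np.broadcast_to(col_means, X.shape)[missing]
    return X

# A1/A1, A2/A3, a missing genotype, and A1/A2 at a locus with three alleles.
genos = [(0, 0), (1, 2), None, (0, 1)]
X = code_genotypes(genos, n_alleles=3)
X_centred = X - X.mean(axis=0)                   # columns centred to mean zero
print(X)
```

Because the missing genotype is set to the column means, it sits at the origin after centring and therefore adds nothing to the differences between groups.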

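Linear combinations of alleles (Equation 5) optimizing this criterion are called discriminant functions. They are obtained by the eigenanalysis of a **D**-symmetric matrix (Equation 6) defined from **X**, **P** and **W**, where **P** is the previously defined projector onto the dummy vectors of **H**, and **W** is the matrix of covariances within groups, computed as:

$$\mathbf{W} = \mathbf{X}^T(\mathbf{I} - \mathbf{P})^T\mathbf{D}(\mathbf{I} - \mathbf{P})\mathbf{X} \tag{7}$$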

This solution requires **W **to be invertible, which is not the case when the number of alleles

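To circumvent this issue, DAPC uses a data transformation based on PCA prior to DA. Rather than analyzing **X** directly, we first compute the principal components of PCA, **XU**, satisfying:

$$\mathbf{X}^T\mathbf{D}\mathbf{X}\mathbf{U} = \mathbf{U}\boldsymbol{\Lambda} \tag{8}$$

where **U** is a matrix whose columns are eigenvectors of **X**^{T}**DX**, and **Λ** the diagonal matrix of corresponding non-null eigenvalues. Note that when the number of alleles (*p*) is larger than the number of individuals (*n*), the eigenanalysis of **XX**^{T}**D** can be used to obtain **U** and **Λ**.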
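The use of the smaller matrix when alleles outnumber individuals can be sketched as follows. The example assumes uniform weights, so that the matrix being diagonalised is symmetric, and uses randomly generated data; the dimensions and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 50                        # hypothetical sizes: more alleles than individuals
X = rng.random((n, p))
X = X - X.mean(axis=0)               # centre each column
D = np.eye(n) / n                    # uniform observation weights

# Eigenanalysis of the small n x n matrix X X^T D instead of the p x p matrix X^T D X.
# With uniform weights X X^T D = X X^T / n is symmetric, so eigh applies.
evals, A = np.linalg.eigh(X @ X.T @ D)
order = np.argsort(evals)[::-1]
evals, A = evals[order], A[:, order]
keep = evals > 1e-10                 # keep non-null eigenvalues only
evals, A = evals[keep], A[:, keep]

# Map back to allele space: columns of U are unit-norm eigenvectors of X^T D X.
U = X.T @ D @ A
U /= np.linalg.norm(U, axis=0)
Lambda = np.diag(evals)

print(np.allclose(X.T @ D @ X @ U, U @ Lambda))   # U and Lambda satisfy the PCA eigen-equation
```

Both matrices share the same non-null eigenvalues, so the cost of the decomposition is driven by the number of individuals rather than by the number of alleles.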

DA is then performed on the matrix of principal components. At this step, less-informative principal components may be discarded, although this is not mandatory. Replacing **X **with **XU **into Equation 6, the solution of DAPC is given by the eigenanalysis of the **D**-symmetric matrix:

The first obtained eigenvector **v** maximizes the between-group variance b(**XUv**) under the constraint that the within-group variance w(**XUv**) = 1, which amounts to maximizing the F-statistic of **XUv**. This maximum is attained for the eigenvalue λ associated with **v** (F(**XUv**) = λ). Hence, **v** can be used to compute the linear combinations of principal components of PCA (**XU**) which best discriminate the populations in the sense of the F-statistic.

However, it can be noticed that these linear combinations of principal components ((**XU**)**v**) can also be interpreted as linear combinations of alleles (**X**(**Uv**)), in which the allele loadings are the entries of the vector **Uv**. This has the advantage of allowing one to quantify the contribution of a given allele to a particular structure. Denoting z_{j} the loading of the *j*^{th} allele (the *j*^{th} entry of **Uv**) for the principal component **XUv**, the contribution of this allele can be computed as:

$$\frac{z_j^2}{\sum_{k=1}^{p} z_k^2} \tag{10}$$
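The two steps of the method and the computation of allele contributions can be put together in a short sketch. It is an illustration under stated assumptions rather than the reference implementation: the discriminant step is solved as the generalized eigenproblem Bv = λWv through a Cholesky factor of the within-group covariance matrix, weights are uniform, and the function name `dapc`, its arguments and the simulated data are hypothetical.

```python
import numpy as np

def dapc(X, groups, n_pca):
    """Sketch of DAPC: PCA, then discriminant analysis of the retained components."""
    n = X.shape[0]
    X = X - X.mean(axis=0)                       # centre each allele column
    labels = np.unique(groups, return_inverse=True)[1]
    n_groups = labels.max() + 1

    # PCA step: right singular vectors of X are eigenvectors of X^T D X (D = I/n).
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    U = Vt[:n_pca].T
    Z = X @ U                                    # principal components XU

    # DA step: between-group (B) and within-group (W) covariances of the components.
    means = np.array([Z[labels == k].mean(axis=0) for k in range(n_groups)])
    fitted = means[labels]
    B = fitted.T @ fitted / n
    W = (Z - fitted).T @ (Z - fitted) / n

    # Solve B v = lambda W v through a Cholesky factor of W (W = L L^T).
    L_inv = np.linalg.inv(np.linalg.cholesky(W))
    evals, A = np.linalg.eigh(L_inv @ B @ L_inv.T)
    order = np.argsort(evals)[::-1][: min(n_groups - 1, n_pca)]
    V = L_inv.T @ A[:, order]                    # scaled so that v^T W v = 1

    loadings = U @ V                             # allele loadings Uv
    contrib = loadings**2 / np.sum(loadings**2, axis=0)
    return {"scores": Z @ V, "F": evals[order], "loadings": loadings, "contrib": contrib}

rng = np.random.default_rng(2)
groups = np.repeat([0, 1, 2], 20)
X = rng.random((60, 40))
X[:, 0] += groups                                # hypothetical variable carrying the group signal
res = dapc(X, groups, n_pca=5)
print(res["F"])
print(res["contrib"][:, 0].argmax())
```

In this toy example the variable that was shifted between groups receives the largest contribution to the first discriminant function, and each eigenvalue equals the F statistic of the corresponding scores.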

Prior clustering using K-means

Whenever groups are not known in advance, it is possible to define them using a clustering algorithm. K-means is a natural choice to do so since it uses the same model as DA and a similar measure of group differentiation. K-means relies on the model in equation (1) which decomposes the total variance of a variable into between-group and within-group components. This model can be extended to the multivariate case by summing variance components over the different variables. To differentiate univariate and multivariate variances, we use upper case notations for variances of multivariate data. Note, however, that these quantities are in both cases squared norms of vectors or matrices (considering the Frobenius norm in the multivariate case). Applied to the previously-defined matrix of principal components of PCA (**XU**), this decomposition becomes:

$$\mathrm{VAR}(\mathbf{X}) = B(\mathbf{XU}) + W(\mathbf{XU}) \tag{11}$$

with VAR(**X**) = tr(**Λ**), B(**XU**) = tr(**U**^{T}**X**^{T}**P**^{T}**DPXU**), and W(**XU**) = tr(**U**^{T}**WU**). The Bayesian Information Criterion (BIC) used to choose the best clustering model is then defined as:

where **(X) **is the residual variance (^{2 }(results not shown). This result is consistent with previous findings which advocated the use of BIC for selecting the best number of groups in K-means clustering of genetic data
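The selection of the number of clusters can be illustrated with a short sketch. Two assumptions are made here and should be read as such: the BIC is taken in the common form n log(WSS/n) + k log(n), where WSS is the within-group sum of squares, which may differ from the exact expression used in the paper; and K-means is implemented as plain Lloyd iterations with a deterministic farthest-first initialisation, chosen only to keep the example reproducible. The function names and simulated data are hypothetical.

```python
import numpy as np

def kmeans_wss(Z, k, n_iter=100):
    """Lloyd's algorithm with farthest-first initialisation; returns (labels, WSS)."""
    centres = [Z[0]]
    for _ in range(1, k):
        d2 = np.min([((Z - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(Z[d2.argmax()])           # next centre: point farthest from the others
    centres = np.array(centres)
    for _ in range(n_iter):
        d2 = ((Z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = np.array([Z[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
                        for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels, ((Z - centres[labels]) ** 2).sum()

def bic_curve(Z, k_max):
    """BIC for k = 1..k_max, using the assumed form n*log(WSS/n) + k*log(n)."""
    n = len(Z)
    return np.array([n * np.log(kmeans_wss(Z, k)[1] / n) + k * np.log(n)
                     for k in range(1, k_max + 1)])

# Hypothetical principal components: three well-separated clusters in two dimensions.
rng = np.random.default_rng(3)
true_centres = np.array([[0, 0], [10, 0], [0, 10]])
Z = np.vstack([c + rng.normal(0, 1, (50, 2)) for c in true_centres])

bic = bic_curve(Z, k_max=6)
print(np.round(bic, 1))
print(np.round(-np.diff(bic), 1))    # decrease in BIC gained by each additional cluster
```

With this form of the criterion the curve often keeps decreasing slowly after the true number of clusters, so the informative feature is the point where the decrease levels off rather than the global minimum.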

Clustering analyses using STRUCTURE

We used STRUCTURE

Implementation and examples

The methodological approach presented in the paper is implemented in the

Authors' contributions

TJ developed and implemented the method. FB performed the simulations. All authors contributed to analyzing and interpreting the data, and to writing the manuscript. All authors read and approved the final manuscript.

Authors' information

TJ is a post-doctoral research associate in biometry at Imperial College London, UK. His main focus is on developing statistical tools for analysing genetic data, with an emphasis on multivariate methods. FB is an associate professor in population genetics at Imperial College London, UK. His work ranges from theoretical to applied population genetics, with an emphasis on human populations and their pathogens. SD is an assistant professor in evolutionary biology and biostatistics at the Université Claude Bernard - Lyon 1, France. His interests range from empirical studies to theoretical work in population biology, ecology, and evolution.

Acknowledgements

We are grateful to our colleagues who generated the HGDP-CEPH dataset and those who made H3N2 hemagglutinin sequences publicly available on Genbank. We thank Dave Hunt, Daniel Falush, Jukka Corander, and two anonymous reviewers for providing useful comments on a previous version of the manuscript. We thank R-Forge (