Abstract
Background
Population inference is an important problem in genetics used to remove population stratification in genomewide association studies and to detect migration patterns or shared ancestry. An individual’s genotype can be modeled as a probabilistic function of ancestral population memberships, Q, and the allele frequencies in those populations, P. The parameters, P and Q, of this binomial likelihood model can be inferred using slow sampling methods such as Markov Chain Monte Carlo methods or faster gradient based approaches such as sequential quadratic programming. This paper proposes a leastsquares simplification of the binomial likelihood model motivated by a Euclidean interpretation of the genotype feature space. This results in a faster algorithm that easily incorporates the degree of admixture within the sample of individuals and improves estimates without requiring trialanderror tuning.
Results
We show that the expected value of the leastsquares solution across all possible genotype datasets is equal to the true solution when part of the problem has been solved, and that the variance of the solution approaches zero as its size increases. The Leastsquares algorithm performs nearly as well as Admixture for these theoretical scenarios. We compare leastsquares, Admixture, and FRAPPE for a variety of problem sizes and difficulties. For particularly hard problems with a large number of populations, small number of samples, or greater degree of admixture, leastsquares performs better than the other methods. On simulated mixtures of real population allele frequencies from the HapMap project, Admixture estimates sparsely mixed individuals better than Leastsquares. The leastsquares approach, however, performs within 1.5% of the Admixture error. On individual genotypes from the HapMap project, Admixture and leastsquares perform qualitatively similarly and within 1.2% of each other. Significantly, the leastsquares approach nearly always converges 1.5 to 6times faster.
Conclusions
The computational advantage of the leastsquares approach along with its good estimation performance warrants further research, especially for very large datasets. As problem sizes increase, the difference in estimation performance between all algorithms decreases. In addition, when prior information is known, the leastsquares approach easily incorporates the expected degree of admixture to improve the estimate.
Background
The inference of population structure from the genotypes of admixed individuals poses a significant problem in population genetics. For example, genome wide association studies (GWAS) compare the genetic makeup of different individuals in order to extract differences in the genome that may contribute to the development or suppression of disease. Of particular interest are single nucleotide polymorphisms (SNPs) that reveal genetic changes at a single nucleotide in the DNA chain. When a particular SNP variant is associated with a disease, this may indicate that the gene plays a role in the disease pathway, or that the gene was simply inherited from a population that is more (or less) predisposed to the disease. Determining the inherent population structure within a sample removes confounding factors before further analysis and reveals migration patterns and ancestry [1]. This paper deals with the problem of inferring the proportion of an individual’s genome originating from multiple ancestral populations and the allele frequencies in these ancestral populations from genotype data.
Methods for revealing population structure are divided into fast multivariate analysis techniques and slower discrete admixture models [2]. Fast multivariate techniques such as principal components analysis (PCA) [28] reveal subspaces in the genome where large differences between individuals are observed. For case–control studies, the largest differences commonly due to ancestry are removed to reduce false positives [4]. Although PCA provides a fast solution, it does not directly infer the variables of interest: the population allele frequencies and individual admixture proportions. On the other hand, discrete admixture models that estimate these variables typically require much more computation time. Following a recent trend toward faster gradientbased methods, we propose a faster simpler leastsquares algorithm for estimating both the population allele frequencies and individual admixture proportions.
Pritchard et al. [9] originally propose a discrete admixture likelihood model based on the random union of gametes for the purpose of population inference. In particular, their model assumes HardyWeinberg equilibrium within the ancestral populations (i.e., allele frequencies are constant) and linkage equilibrium between markers within each population (i.e., markers are independent). Each individual in the current sample is modeled as having some fraction of their genome originating from each of the ancestral populations. The goal of population inference is to estimate the ancestral population allele frequencies, P, and the admixture of each individual, Q, from the observed genotypes, G. If the population of origin for every allele, Z, is known, then the population allele frequencies and the admixture for each individual have a Dirichlet distribution. If, on the other hand, P and Q are known, the population of origin for each individual allele has a multinomial distribution. Pritchard et al. infer populations by alternately sampling Z from a multinomial distribution based on P and Q; and P and Q from Dirichlet distributions based on Z. Ideally, this Markov Chain Monte Carlo sampling method produces independent identically distributed samples (P,Q) from the posterior distribution P(P,QG). The inferred parameters are taken as the mean of the posterior. This algorithm is implemented in an opensource software tool called Structure[9].
The binomial likelihood model proposed by Pritchard et al. was originally used for datasets of tens or hundreds of loci. However, as datasets become larger, especially considering genomewide association studies with thousands or millions of loci, two problems emerge. For one, linkage disequilibrium introduces correlations between markers. Although Falush et al. [10] extended Structure to incorporate loose linkage between loci, larger datasets also pose a computational challenge that has not been met by these samplingbased approaches. This has led to a series of more efficient optimization algorithms for the same likelihood model with uncorrelated loci. This paper focuses on improving computational performance, leaving the treatment of correlated loci to future research.
Tang et al. [11] proposed a more efficient expectation maximization (EM) approach. Instead of randomly sampling from the posterior distribution, the FRAPPE EM algorithm [11] starts with a randomly initialized Z, then alternates between updating the values of P and Q for fixed Z, and maximizing the likelihood of Z for fixed P and Q. Their approach achieves similar accuracy to Structure and requires much less computation time. Wu et al. [12] specialized the EM algorithm in FRAPPE to accommodate the model without admixture, and generalized it to have different mixing proportions at each locus. However, these EM algorithms estimate an unnecessary and unobservable variable Z, something that more efficient algorithms could avoid.
Alexander et al. [13] proposed an even faster approach for inferring P and Q using the same binomial likelihood model but bypassing the unobservable variable Z. Their closesource software, Admixture, starts at a random feasible solution for P and Q and then alternates between maximizing the likelihood function with respect to P and then maximizing it with respect to Q. The likelihood is guaranteed not to decrease at each step eventually converging at a local maximum or saddle point. For a moderate problem of approximately 10000 loci, Admixture achieves comparable accuracy to Structure and requires only minutes to execute compared to hours for Structure[13].
Another feature of Structure’s binomial likelihood model is that it allowed the user to input prior knowledge about the degree of admixture. The prior distribution for Q takes the form of a Dirichlet distribution with a degree of admixture parameter, α, for every population. For α = 0, all of an individual’s alleles originate from the same ancestral population; for α > 0, individuals contain a mixture of alleles from different populations; for α = 1, every assignment of alleles to populations is equally likely (i.e., the noninformative prior); and for α → ∞, all individuals have equal contributions from every ancestral population. Alexander et al. replace the population degree of admixture parameter in Structure with two parameters, λ and γ, that when increased also decrease the level of admixture of the resulting individuals. However, the authors admit that tuning these parameters is nontrivial [14].
This paper contributes to population inference research by (1) proposing a novel leastsquares simplification of the binomial likelihood model that results in a faster algorithm, and (2) directly incorporating the prior parameter α that improves estimates without requiring trialanderror tuning. Specifically, we utilize a two block coordinate descent method [15] to alternately minimize the criterion for P and then for Q. We adapt a fast nonnegative leastsquares algorithm [16] to additionally include a sumtoone constraint for Q and an upperbound for P. We show that the expected value for the estimates of P (or Q) across all possible genotype datasets are equal to the true values when Q (or P) are known and that the variance of this estimate approaches zero as the problem size increases. Compared to Admixture, the leastsquares approach provides a slightly worse estimate of P or Q when the other is known. However, when estimating P and Q from only the genotype data, the leastsquares approach sometimes provides better estimates, particularly with a large number of populations, small number of samples, or more admixed individuals. The leastsquares approximation provides a simpler and faster algorithm, and we provide it as Matlab scripts on our website.
Results
First, we motivate a leastsquares simplification of the binomial likelihood model by deriving the expected value and covariance of the leastsquares estimate across all possible genotype matrices for partially solved problems. Second, we compare leastsquares to sequential quadratic programming (Admixture’s optimization algorithm) for these cases. Third, we compare Admixture, FRAPPE, and leastsquares using simulated datasets with a factorial design varying dataset properties in G. Fourth, we compare Admixture and leastsquares using real population allele frequencies from the HapMap Phase 3 project. Finally, we compare the results of applying Admixture and leastsquares to real data from the HapMap Phase 3 project where the true population structure is unknown.
The algorithms we discuss accept as input the number of populations, K, and the genotypes, g_{li}∈{0,1,2}, representing the number of copies of the reference allele at locus l for individual i. Then, the algorithms attempt to infer the population allele frequencies, p_{lk} = [0,1], for locus l and population k, as well as the individual admixture proportions, q_{ki} = [0,1] where ∑_{k}q_{ki} = 1. In all cases, 1 ≤ l ≤ M, 1 ≤ i ≤ N, and 1 ≤ k ≤ K. Table 1 summarizes the matrix notation.
Table 1. Matrix notation
Empirical estimate and upper bound on total variance
To validate our derived bounds on the total variance (Equations 13, 17, 18 and 19), we generate simulated genotypes from a known target for p = [0.1, 0.7]^{T}. We simulate N individual genotypes using the full matrix Q with each column drawn from a Dirichlet distribution with shape parameter α. We repeat the experiment 10000 times producing an independent and identically distributed genotype each time. Each trial produces one estimate for p. We then compute the mean and covariance of the estimates of p and compare them to those predicted in the bounds. For α = 1 and N = 100,
The bound using the sample covariance of q in Equation 13 provides the following:
The bound using the properties of the Dirichlet distribution in Equation 17 provides a bound of 0.01. As the number of samples increases, the difference between the bound and the asymptotic bound for the Dirichlet distributed q will approach zero.
Figure 1 plots the total variance (trace of the covariance) matrix for a variety of values for N and α using the same target value for p. Because the expected value of the estimate is equal to the true value of p, the total variance is analogous to the sum of the squared error (SSE) between the true p and its estimate. Clearly, the total variance decreases with N. For N = 10000, the root mean squared error falls below 1%.
Figure 1. Bound on total variance. Solid and dashed lines correspond to the empirical estimate of the total variance and the upper bound for total variance, respectively.
Intuitively, the error in the leastsquares estimate for P and Q decreases as the number of individuals and the number of loci increases, respectively. Figure 1 supports this notion, suggesting that on very large problems for which the gradient based and expectation maximization algorithms were designed, the error in the leastsquares estimate approaches zero.
Comparing leastsquares approximation to binomial likelihood model
Given estimates of the population allele frequencies, early research focused on estimating the individual admixture [17]. We also note that the number of iterations and convergence properties confound the comparison of iterative algorithms. To avoid these problems and emulate a practical research scenario, we compare leastsquares to sequential quadratic programming (used in Admixture) when P or Q are known a priori. In this scenario, each algorithm converges in exactly one step making it possible to compare the underlying updates for P and Q independently. For N = 100, 1000, and 10000; and α = 0.1, 1, and 2; we consider a grid of twodimensional points for p, where p_{i} = {0.05, 0.15, …, 0.95}. For each trial, we first generate a random Q such that every column is drawn from a Dirichlet distribution with shape parameter, α. Then, we randomly generate a genotype using Equation 11. We compute the leastsquares solution using Equation 27 and use Matlab’s builtin function ‘fmincon’ to minimize the negative of the loglikelihood in Equation 7, similar to Admixture’s approach. We repeat the process for 1000 trials and aggregate the results.
Figure 2 illustrates the root mean squared error in estimating p given the true value of Q. Both algorithms present the same pattern of performance as a function of p = [p_{1}, p_{2}]. Values of p near 0.5 present the most difficult scenarios. Positively correlated values (e.g., p_{1} = p_{2}) present slightly less error than negatively correlated values (e.g., p_{1} = 1 – p_{2}). Table 2 summarizes the performance over all values of p for varying N and α. In all cases, fmincon performs slightly better than leastsquares and both algorithms approach zero error as N increases. We repeat this analysis for known values for P and estimate q using the two approaches. Figure 3 illustrates the difference in performance for the two algorithms as we vary q_{1} between 0.05 and 0.95 with q_{2} =1 – q_{1}. Again, fmincon performs slightly better in all cases but both approach zero as M increases. In the next section we show that the additional error introduced by the leastsquares approximation to the objective function remains small relative to the error introduced by the characteristics of the genotype data.
Figure 2. Precision of bestcase scenario for estimating P. Root mean squared error for different values of p using (a) Admixture’s Sequential Quadratic Programming or (b) the leastsquares approximation.
Figure 3. Precision of bestcase scenario for estimating Q. Solid and dashed lines correspond to Admixture’s Sequential Quadratic Programming optimization and the leastsquares approximation, respectively.
Table 2. Root mean squared error in P for known Q and K= 2
Simulated experiments to compare leastsquares to Admixture and FRAPPE
In the previous sections, we consider the bestcase scenario where the true value of P or Q is known. In a realistic scenario, the algorithms must estimate both P and Q from only the genotype information. Table 3 summarizes the results of a fourway analysis of variance with 2way interactions among experimental factors. By far the factor with the most impact on performance is the number of individuals, N. The degree of admixture, α, and the number of populations, K, accounts for the second and third most variation, respectively. These three factors and twoway interactions between them account for the vast majority of variation. In particular, the choice of algorithm accounts for less than about 1% of the variation in estimation performance. That is, when estimating population structure from genotype data, the number of samples, the number of populations, and the degree of admixture play a much more important role than the choice between leastsquares, Admixture, and FRAPPE and leastsquares. However, as shown in Figure 4, when considering the computation time required by the algorithm, the choice of algorithm contributes about 40% of the variation including interactions. Therefore, for the range of population inference problems described in this study, the choice of algorithm plays a very small role in the estimation of P and Q but a larger role in computation time.
Figure 4. Computational timing comparison. Box plots show the median (red line) and interquartile range (blue box) for computation time on a logarithmic scale using (a) N=1000, α=0.5, and varying K; (b) K=4, α=0.5, and varying N; and (c) K=4, N=1000, and varying α.
Table 3. Sources of variation in root mean squared error
Further exploration reveals that the preferred algorithm depends on K, N, and α. Table 4 lists the root mean squared error for the estimation of Q for all combinations of parameters across n = 50 trials. Out of the 36 scenarios, Admixture, leastsquares, and FRAPPE perform significantly better than their peers 13, six, and zero times, respectively; they perform insignificantly worse than the best algorithm 30, 17, and 10 times, respectively. The leastsquares algorithm appears to perform well on the more difficult problems with combinations of large K, small N, or large α. Table 5 lists the root mean squared error for estimating P. For N = 100, the algorithms do not perform significantly differently. For N = 10000, all algorithms perform with less than 2.5% root mean squared error (RMSE). In all, Admixture performs significantly better than its peers 11 times out of 36. However, Admixture never performs significantly worse than its peers. Leastsquares and FRAPPE perform insignificantly worse than Admixture 17 and 20 times out of 36, respectively. Table 6 summarizes the timing results. Least square converges significantly faster 34 out of 36 times with an insignificant difference for the remaining two scenarios. FRAPPE converges significantly slower in all scenarios. With two exceptions, the leastsquares algorithm provides a 1.5 to 5times speedup.
Table 4. Root mean squared error for Q
Table 5. Root mean squared error for P
Table 6. Computation time
Comparison on admixtures derived from the HapMap3 dataset
Tables 7 and 8 lists the performance and computation time for the leastsquares approach and Admixture using a convergence threshold of ε = 1.0e4 and ε = 1.4e3, respectively. Each marker in the illustrations represents one individual. A short black line emanating from each marker indicates the offset from the original (correct) position. For all simulations, the leastsquares algorithms perform within 0.1% of Admixture for estimating the true population allele frequencies in P. For wellmixed populations in Simulation 1 and 2, the leastsquares algorithms perform comparably well or even better than Admixture. However, for less admixed data in Simulations 3 – 6, Admixture provides better estimates of the true population proportions depicted in the scatter plots. In all cases, the leastsquares algorithms perform within 1.5% of Admixture and between about 2 and 3times faster than Admixture.
Table 7. Simulation experiments (1–3) using realistic population allele frequencies from the HapMap phase 3 project
Table 8. Simulation experiments (4–6) using realistic population allele frequencies from the HapMap phase 3 project
The apparent advantage of Admixture involves individuals on the periphery of the unit simplex defining the space of Q. In Table 7, this corresponds to individuals on the boundary of the right triangle defined by the xaxis, yaxis, and y = 1 – x diagonal line. For Simulation 1, the original Q contains very few individuals on the boundary, Admixture estimates far more on the boundary, and the leastsquares was closer to the ground truth. For Simulation 2 – 6, the ground truth contains more individuals on the boundary, Admixture correctly estimates these boundary points but the leastsquares algorithms predict fewer points on the boundary. Simulation 6 provides the most obvious example where Admixture estimates individuals exactly on the boundary and leastsquares contains a jumble of individuals near but not exactly on the line.
Real dataset from the HapMap phase 3 project
Over 20 repeated trials, Admixture converged in an average of 42.1 seconds with standard deviation of 9.1 seconds, and the leastsquares approach converged in 33.6 seconds with a standard deviation of 9.8 seconds. Figure 5 illustrates the inferred population proportions for one run. The relative placement of individuals from each known population is qualitatively similar. The two methods differ at extreme points such as those values of q_{1}, q_{2}, or 1 – q_{1} – q_{2} that are near zero. The Admixture solution has more individuals on the boundary and the leastsquares approach has fewer. Although we cannot estimate the error of these estimates because the real world data has no ground truth, we can compare their results quantitatively. The Admixture and the leastsquares solution differed by an average of 1.2% root mean squared difference across the 20 trials. We estimate α = 0.12 from the Admixture solution’s total variance using Equation 31. This roughly corresponds to the simulated experiment with three populations, 100 samples, and a degree of admixture of 0.1. In that case, Admixture and leastsquares exhibited a very small root mean squared error of 0.62% and 0.74%, respectively (Table 4).
Figure 5. Comparison on HapMap Phase 3 dataset. Inferred population membership proportions using (a) Admixture and (b) leastsquares with α=1. Each point represents a different individual among the four populations: ASW, CEU, MEX, and YRI. The axes represent the proportion of each individual’s genome originating from each inferred population. The proportion belonging to the third inferred population is given by q_{3 }= 1 – q_{1 }– q_{2}.
Discussion
This work contributes to the population inference literature by providing a novel simplification of the binomial likelihood model that improves the computational efficiency of discrete admixture inference. This approximation results in an inference algorithm based on minimizing the squared distance between the genotype matrix G and twice the product of the population allele frequencies and individual admixture proportions: 2PQ. This Euclidean distancebased interpretation aligns with previous results employing multivariate statistics. For example, researchers have found success using principal component analysis to reveal and remove stratification [24] or even to reveal clusters of individuals in subpopulations [57]. Recently, McVean [5] proposed a genealogical interpretation of principal component analysis and uses it to reveal information about migration, geographic isolation, and admixture. In particular, given two populations, individuals cluster along the first principal component. Admixture proportion is the fractional distance between the two population centers. However, these cluster centers must known or inferred in order to estimate ancestral population allele frequencies. The leastsquares approach infers these estimates efficiently and directly.
Typically, discrete admixture models employ a binomial likelihood function rather than a Euclidean distancebased one. Pritchard et al. detail one such model and use a slow sampling based approach to infer the admixed ancestral populations for individuals in a sample [9]. Recognizing the performance advantage of maximizing the likelihood rather than sampling the posterior, Tang et al. proposed an expectation maximization algorithm and Alexander et al. [13] proposed a sequential quadratic programming (SQP) approach using the same likelihood function [9]. We take this approach a step further by simplifying the model proposed by Pritchard et al. to introduce a leastsquares criterion. By justifying the leastsquares simplification, we connect the fast and practical multivariate statistical approaches to the theoretically grounded binomial likelihood model. We validate our approach on a variety of simulated and real datasets.
First, we show that if the true value of P (or Q) is known, the expected value of the least squares solution for Q (or P) across all possible genotype matrices is equal to the true value, and the variance of this estimate decreases with M (or N). In this bestcase scenario, we show that SQP provides a slightly better estimate than the leastsquares solution for a variety of problem sizes and difficulty. For more common scenarios where the algorithms must estimate P and Q using only the genotype information in G, we show that for particularly difficult problems with small N, large K, or large α, the leastsquares approach often performs better than its peers. For about onethird of the parameter sets, Admixture performs significantly better than leastsquares and FRAPPE but all algorithms approach zero error as N becomes very large. In addition, the error introduced by the choice of algorithms was relatively small compared to other characteristics of the experiment such as sample size, number of populations, and the degree of admixture in the sample. That is, improving accuracy has more to do with improving the dataset than with selecting the algorithm, suggesting that algorithm selection may depend on other criteria such as its speed. In nearly all cases, the leastsquares method computes its solution faster, typically a 1.5 to 5times faster. At the current problem size involving about 10000 loci, this speed improvement may justify the use of leastsquares algorithms. For a single point estimate, researchers may prefer a slightly more accurate algorithm at the cost of seconds or minutes. For researchers testing several values of K and α and using multiple runs to gauge the fitness of each parameter set, or those estimating standard errors [13], the speed improvement could be the difference between hours and days of computation. As the number of loci increase to hundreds of thousands or even millions, speed may be more important. The leastsquares approach offers an alternative simpler and faster algorithm for population inference that provides qualitatively similar results.
The key speed advantage of the leastsquares approach comes from a single nonnegative leastsquares update that minimizes a quadratic criterion for P and then for Q per iteration. Admixture, on the other hand, minimizes several quadratic criteria sequentially as it fits the true binomial model. Although the leastsquares algorithm completes each update in less time and is guaranteed to converge to a local minimum or straddle point, predicting the number of iterations to convergence presents a challenge. We provide empirical timing results and note that selecting a suitable stopping criterion for these iterative methods can change the timing and accuracy results. For comparison, we use the same stopping criterion with published thresholds for Admixture and FRAPPE[13], and a threshold of MN×10^{10} for leastsquares.
This work is motivated in part by the desire to analyze larger genotype datasets. In this paper, we focus on the computational challenges of analyzing very large numbers of markers and individuals. However, linkage disequilibrium introduces correlations between loci that cannot be avoided in very large datasets. Large datasets can be pruned to diminish the correlation between loci. For example, Alexander et al. prune the HapMap phase 3 dataset from millions of SNPs down to around 10000 to avoid correlations. In this study, we assume linkage equilibrium and therefore uncorrelated markers and limit our analysis to datasets less than about 10000 SNPs. Incorporating linkage disequilibrium in gradientbased optimizations of the binomial likelihood model remains an open problem.
Estimating the number of populations K from the admixed samples continues to pose a difficult challenge for clustering algorithms in general and population inference in particular. In practice, experiments can be designed to include individual samples that are expected to be distributed close to their ancestors. For example, Tang et al. [11] suggested using domain knowledge to collect an appropriate number of pseudoancestors that reveal allele frequencies of the ancestral populations. The number of groups considered provides a convenient starting point for K. Lacking domain knowledge, computational approaches can be used to try multiple reasonable values for K and evaluating their fitness. For example, Pritchard et al. [9] estimated the posterior distribution of K and select the most probable K. Another approach is to evaluate the consistency of inference for different values of K. If the same value of K leads to very different inferences of P and Q from different random starting points, the inference can be considered inconsistent. Brunet et al. [18] proposed this method of model selection called consensus clustering.
For realistic population allele frequencies, P, from the HapMap Phase 3 dataset and very little admixture in Q, Admixture provides better estimates of Q. The key advantage of Admixture appears to be for individuals containing nearly zero contribution from one or more inferred populations, whereas the leastsquares approach performs better when the individuals are wellmixed. Visually, both approaches reveal population structure. Using the two approaches to infer three ancestral populations from four HapMap Phase 3 sampling populations reveals qualitatively similar results.
We believe the computational advantage of the leastsquares approach along with its good estimation performance warrants further research especially for very large datasets. For example, we plan to adapt and apply the leastsquares approach to datasets utilizing microsatellite data rather than SNPs and consider the case of more than two alleles per locus. Researchers have incorporated geospatial information into samplingbased [19] and PCAbased [8] approaches. Multiple other extensions to samplingbased or PCAbased algorithms have yet to be incorporated into faster gradientbased approaches.
Conclusion
This paper explores the utility of a leastsquares approach for the inference of population structure in genotype datasets. Whereas previous Euclidean distancebased approaches received little theoretical justification, we show that a leastsquares approach is the result of a firstorder approximation of the negative loglikelihood function for the binomial generative model. In addition, we show that the error in this approximation approaches zero as the number of samples (individuals and loci) increases. We compare our algorithm to stateoftheart algorithms, Admixture and FRAPPE, for optimizing the binomial likelihood model, and show that our approach requires less time and performs comparably well. We provide both quantitative and visual comparisons that illustrate the advantage of Admixture at estimating individuals with little admixture, and show that our approach infers qualitatively similar results. Finally, we incorporate a degree of admixture parameter that improves estimates for known levels of admixture without requiring additional parameter tuning as is the case for Admixture.
Methods
The algorithms we discuss accept the number of populations, K, and an M × N genotype matrix, G as input:
where g_{li} ∈ {0,1,2} representing the number of copies of the reference allele at the lth locus for the ith individual, M is the number of markers (loci), and N is the number of individuals. Given the genotype matrix, G, the algorithms attempt to infer the population allele frequencies and the individual admixture proportions. The matrix P contains the population allele frequencies:
where 0 ≤ p_{lk} ≤ 1 representing the fraction of reference alleles out of all alleles at the lth locus in the kth population. The matrix Q contains the individual admixture proportions:
where 0 ≤ q_{ik} ≤ 1 represents the fraction of the ith individual’s genome originating from the kth population and for all i, ∑_{k}q_{ki} = 1. Table 1 summarizes the matrix notation we use.
Likelihood function
Alexander et al. model the genotype (i.e., the number of reference alleles at a particular locus) as the result of two draws from a binomial distribution [13]. In the generative model, each allele copy for one individual at one locus has an equal chance, m_{li}, of receiving the reference allele:
The loglikelihood of the parameters P and Q from the original Structure binomial model and ignoring an additive constant is the following [13]:
To see the effect on gradientbased optimization, we also present the derivative of the likelihood with respect to a particular m_{li}:
In order to achieve a leastsquares criterion, we must approximate this derivative with a line. Figure 6 plots this derivative with respect to m_{li} for the three possible values of g_{li} (0, 1, or 2). To avoid biasing the approximation to high or low values of m_{li}, we approximate the derivative with its firstorder Taylor approximation in the neighborhood of m_{li} = 1/2. More complex optimizations might update the neighborhood of the Taylor approximation during the optimization. In the interest of simplicity, we select one neighborhood for all iterations, genotypes, individuals, and loci. The following leastsquares objective function has the approximated derivative in the above equation:
Figure 6. Firstorder approximation for slope of loglikelihood of m. Solid and dashed lines correspond to the true and approximated slope, respectively. The red, green, and blue lines correspond to g = 0, g = 1, and g = 2, respectively.
The righthandside of Equation 9 provides the leastsquares criterion. Figure 6 shows the deviation between the linear approximation and the true slope. Values match closely for 0.35 ≤ m_{li} ≤ 0.65 but as m_{li} approaches zero or one the true slope diverges for two of the three genotypes. Therefore, we have the following leastsquares optimization problem:
Bounded error for the leastsquares approach
We justify the leastsquares approach by showing that the expected value across all genotypes is equal to the true value in the binomial likelihood model, and that the covariance approaches zero as the size of the data increases. In order to analyze the least squares performance across all possible genotype matrices, we consider the generative model for G. Given the true ancestral population allele frequencies, P, and the proportion of each individual’s alleles originating from each population, Q, the genotype at locus l for individual i is a binomial random variable, g_{li}:
If M was directly observable, we could solve for P or Q given the other using P = MQ^{#} or Q = P^{#}M, where # is the MoorePenrose pseudoinverse. However, we only observe the elements of G which is only partially informative of M. First we consider the uncertainty in estimating P. Each g_{li} is an independent random variable with the following mean and bound on the variance:
Mean and total variance of the estimate of p
For ease of notation, we focus on one locus at index l in one row of P,
, one row of G, g = [g_{l1}, g_{12}, …, g_{1N}]^{T}, and estimate the mean, covariance, and provide a bound on the total variance of its estimate:
Intuitively, QQ^{T} scales linearly with N and we expect the bound on the trace to decrease linearly with N. If the columns, q, of Q are independent and identically distributed, QQ^{T} approaches N×E[qq^{T}], resulting in a bound that decreases linearly with N:
To put this bound in more familiar terms we consider q drawn from a Dirichlet distribution with shape parameter α, resulting in the following:
Asymptotically, QQ^{T} approaches N×E[qq^{T}] and (QQ^{T})^{1} approaches:
resulting in the following asymptotic bound on the total variance:
Mean and total variance of the estimate for q
The same analysis can be repeated for one individual at index i in one column of Q,
and one column of G, g = [g_{li}, g_{2i}, …, g_{Li}]^{T}:
Intuitively, P^{T}P increases linearly with M, and we expect the bound on the total variance to decrease linearly with M. Similarly, if the rows, p, of P are independent and identically distributed, P^{T}P approaches M×E[pp^{T}], resulting in an asymptotic bound that decreases linearly with M:
Incorporating degree of admixture, α
Pritchard et al. [13] use a prior distribution to bias their solution toward those with a desired level of admixture. This prior on the columns of Q takes the form of a Dirichlet distribution:
Because all the shape parameters (α) are equal, this prior assumes that all ancestral populations are equally represented in the current sample. The log of this prior probability is the following ignoring an additive constant:
The derivative of the log prior with respect to q and its firstorder approximation at the mean of q_{k} = 1/K is the following:
The following penalty function combines the columns of Q into a single negative loglikelihood function with the approximated derivative in the above equation:
The righthandside of Equation 23 acts as a penalty term for the leastsquares criterion in Equation 9. Figure 7 shows the difference between the real and approximated slope. For q near its mean of 1/K, the approximation fits closely but for extreme values of q the true slope diverges. Combining the terms in Equations 9 and 23 and including problem constraints, we have the following leastsquares optimization problem:
Figure 7. Firstorder approximation for slope of loglikelihood of q. Solid and dashed lines correspond to the true and approximated slope, respectively, for K = 2. The blue, green, red, and orange lines correspond to α = 0.1, α = 0.5, α = 1, and α = 2, respectively.
Optimization algorithm
The nonconvex optimization problem in Equation 10 can be approached as a twoblock coordinate descent problem [15,20]. We initialize Q with nonnegative values such that each column sums to one. Then, we alternate between minimizing the criterion function with respect to P with fixed Q:
and then minimizing with respect to Q with fixed P:
This process is repeated until the change in the criterion function is less than ε at which point we consider the algorithm to have converged. The Admixture algorithm suggests a threshold of ε = 1e4 but we have found that a larger threshold often suffices. Unless otherwise stated, we use a threshold that depends on the size of the problem: ε = MN×10^{10}, corresponding to 1e4 when M = 10000 and N = 100.
Leastsquares solution for P
Van Benthem and Keenan [16] propose a fast nonnegatively constrained active/passive set algorithm that avoids redundant calculations for problems with multiple righthandsides. Without considering the constraints on P, Equation 25 can be classically solved using the pseudoinverse of Q:
However, some of the elements of P may be less than zero. In the active/passive set approach, if elements of P are negative, they are clamped at zero and added to the active set. The unconstrained solution is then applied to the remaining passive elements of P. If the solution happens to be nonnegative, the algorithm finishes. If not, negative elements are added to the active set and elements in the active set with a negative gradient (will decrease the criterion by increasing) are added back to the passive set. The process is repeated until the passive set is nonnegative and the active set contains only elements with a positive gradient at zero. We extend the approach of Van Benthem and Keenan to include an upper bound at one. Therefore, we maintain two active sets: those clamped at zero and those clamped at one and update both after the unconstrained optimization of the passive set at each iteration. We provide Matlab source code that implements this algorithm on our website.
Leastsquares solution for Q
When solving for Q it is convenient to reformulate Equation 26 into simpler terms:
The unconstrained solution for this equation is the following:
When prior information is known about the sparseness, we use α in the equations above. When no prior information is known, we use α = 1 corresponding to the uninformative prior and resulting in the ordinary pseudoinverse solution. In order to incorporate the sumtoone constraint on the columns of Q, we employ the method of Lagrange multipliers using Equation 11 in the work of Settle and Drake substituting the identity matrix for the noise matrix, N[21]. For completeness, we include the solution below:
As before, some elements of Q may be negative. In that case, we utilize the active set method to clamp elements of Q at zero and update active and passive sets at each iteration until convergence as described above. We adapt the Matlab script by Van Benthem and Keenan so that the unconstrained solution uses Equation 30 instead of the standard pseudoinverse and provide it on our website.
Simulated experiments to compare the proposed approach to Admixture and FRAPPE
We generate simulated genotype data for a variety of problems using M = 10000 markers, and varying N between 100, 1000, and 10000; K between 2, 3, and 4; and α between 0.1, 0.5, 1, and 2, for a total of 36 parameter sets. For each combination of N, K, and α, we generate the ground truth P from a uniform distribution, and Q from a Dirichlet distribution parameterized by α. Then, we draw a random genotype for each individual using the binomial distribution in Equation 11. We estimate P and Q using only the genotype information and the true number of populations, K. We repeat the experiment 50 times drawing new, P, Q, and G matrices each time. Finally, we record the performance of Admixture using the published tight convergence threshold of ε = 1e4[13] and a loose convergence threshold of ε = MN×10^{4}; the leastsquares algorithm using an uninformative prior (α = 1) and ε = MN×10^{4}, and the FRAPPE EM algorithm using the published threshold of ε = 1. For reference, we also include the leastsquares algorithm with informative prior (known α) with convergence threshold of ε = MN×10^{4}_{.} In all experiments, Admixture’s performances with the two convergence thresholds were nearly identical and we only report the results for ε = MN×10^{4}, resulting in shorter computation times. We used a fourway analysis of variance (ANOVA) with a fixed effects model to reveal which factors (including algorithm) contribute more or less to the estimation error and computation time.
Statistical significance of root mean squared error and computation time
For each combination of K, N, and α, we perform a KruskalWallis test to determine if Admixture, LeastSquares, and FRAPPE perform significantly differently at a Bonferroni adjusted significance level of 0.05/(36 parameter sets) = 0.0014. If there is no significant difference, we consider their performances equal. If there is a significant difference, we perform pairwise Mann–Whitney Utests to determine significant differences between specific algorithms. We use a Bonferroni adjusted significance level of 0.05/(36 parameter sets)/(3 pairwise comparisons) = 4.6e4. The ‘Summary’ columns contain the order of performance among the algorithms such that every algorithm to the left of a ‘<’ symbol performs better than every algorithm to the right. An ‘=’ symbol indicates that the adjacent algorithms do not perform significantly differently.
Comparison on admixtures derived from the HapMap3 dataset
In the original Admixture paper [13], the authors simulate admixed genotypes from population allele frequencies derived from the HapMap Phase 3 dataset [22]. We follow their example to compare the algorithms with more realistic population allele frequencies. Rather than drawing P from a uniform distribution, we estimate the population allele frequencies for unrelated individuals in the HapMap Phase 3 dataset using individuals from the following groups: Han Chinese in Beijing, China (CHB), Utah residents with ancestry from Northern and Western Europe (CEU), and Yoruba individuals in Ibadan, Nigeria (YRI) [22]. We use the same 13928 SNPs provided in the sample data on the Admixture webpage [23]. We randomly simulate 1000 admixed individuals: q ~ Dirichlet(α_{1}, α_{2}, α_{3}). When the Dirichlet parameters are not equal, we use the degree of admixture, α, for LSα that results in the same total variance as the combination of α_{1}, α_{2}, and α_{3}:
Real dataset from the HapMap phase 3 project
In the original Admixture paper [13], the authors use Admixture to infer three hypothetical ancestral populations from four known populations in the HapMap Phase 3 dataset, including individuals with African ancestry in the American Southwest (ASW), individuals with Mexican ancestry in Los Angeles (MEX), and the same CEU CEU and YRI individuals from the previous example. We ran each algorithm 20 times on the dataset using a convergence threshold of ε = 1e4, recording the convergence times for each trial.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
RMP conceived of the leastsquares approach to inferring population structure, designed the study, and drafted the document. MDW initiated the SNP data analysis project, acquired funding to sponsor this effort, and directed the project and publication. All authors read and approved the final manuscript.
Acknowledgements
This work was supported in part by grants from Microsoft Research, National Institutes of Health (Bioengineering Research Partnership R01CA108468, P20GM072069, Center for Cancer Nanotechnology Excellence U54CA119338, and 1RC2CA148265), and Georgia Cancer Coalition (Distinguished Cancer Scholar Award to Professor M. D. Wang).
References

Beaumont M, Barratt EM, Gottelli D, Kitchener AC, Daniels MJ, Pritchard JK, Bruford MW: Genetic diversity and introgression in the Scottish wildcat.
Mol Ecol 2001, 10:319336. PubMed Abstract  Publisher Full Text

Novembre J, Ramachandran S: Perspectives on human population structure at the cusp of the sequencing era.

Menozzi P, Piazza A, CavalliSforza L: Synthetic maps of human gene frequencies in Europeans.
Science 1978, 201:786792. PubMed Abstract  Publisher Full Text

Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genomewide association studies.
Nat Genet 2006, 38:904909. PubMed Abstract  Publisher Full Text

McVean G: A genealogical interpretation of principal components analysis.
PLoS Genet 2009, 5:e1000686. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Patterson N, Price AL, Reich D: Population structure and eigenanalysis.
PLoS Genet 2006, 2:e190. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lee C, Abdool A, Huang CH: PCAbased population structure inference with generic clustering algorithms.

Novembre J, Stephens M: Interpreting principal component analyses of spatial population genetic variation.
Nat Genet 2008, 40:646649. PubMed Abstract  Publisher Full Text

Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data.
Genetics 2000, 155:945959. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Falush D, Stephens M, Pritchard JK: Inference of population structure using multilocus genotype data linked loci and correlated allele frequencies.
Genetics 2003, 164:15671587. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tang H, Peng J, Wang P, Risch NJ: Estimation of individual admixture: Analytical and study design considerations.
Genet Epidemiol 2005, 28:289301. PubMed Abstract  Publisher Full Text

Wu B, Liu N, Zhao H: PSMIX: an R package for population structure inference via maximum likelihood method.
BMC Bioinforma 2006, 7:317. BioMed Central Full Text

Alexander DH, Novembre J, Lange K: Fast modelbased estimation of ancestry in unrelated individuals.
Genome Res 2009, 19:1655. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Alexander D, Lange K: Enhancements to the ADMIXTURE algorithm for individual ancestry estimation.
BMC Bioinforma 2011, 12:246. BioMed Central Full Text

Kim H, Park H: Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method.
SIAM Journal in Matrix Analysis and Applications 2008, 30:713730. Publisher Full Text

Van Benthem MH, Keenan MR: Fast algorithm for the solution of largescale nonnegativityconstrained least squares problems.
J Chemom 2004, 18:441450. Publisher Full Text

Hanis CL, Chakraborty R, Ferrell RE, Schull WJ: Individual admixture estimates: disease associations and individual risk of diabetes and gallbladder disease among Mexican Americans in Starr County, Texas.
Am J Phys Anthropol 1986, 70:433441. PubMed Abstract  Publisher Full Text

Brunet JP, Tamayo P, Golub TR, Mesirov JP: Metagenes and molecular pattern discovery using matrix factorization.
Proc Natl Acad Sci U S A 2004, 101:4164. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Guillot G, Estoup A, Mortier F, Cosson JF: A spatial statistical model for landscape genetics.
Genetics 2005, 170:12611280. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bertsekas DP: Nonlinear programming. Belmont, Mass.: Athena Scientific; 1995.

Settle JJ, Drake NA: Linear mixing and the estimation of ground cover proportions.
Int J Remote Sens 1993, 14:11591177. Publisher Full Text

Altshuler D, Brooks LD, Chakravarti A, Collins FS, Daly MJ, Donnelly P, Gibbs RA, Belmont JW, Boudreau A, Leal SM: A haplotype map of the human genome.
Nature 2005, 437:12991320. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

ADMIXTURE: fast ancestry estimation.
[http://www.genetics.ucla.edu/software/admixture/download.html webcite]