Department of Biomathematics, UCLA, Los Angeles, California, USA

Department of Human Genetics, UCLA, Los Angeles, California, USA

Department of Statistics, UCLA, Los Angeles, California, USA

Abstract

Background

The estimation of individual ancestry from genetic data has become essential to applied population genetics and genetic epidemiology, and software for calculating ancestry estimates is now a standard part of the geneticist's analytic arsenal.

Results

Here we describe four enhancements to ADMIXTURE, a high-performance tool for estimating individual ancestries and population allele frequencies from SNP (single nucleotide polymorphism) data. First, ADMIXTURE can be used to estimate the number of underlying populations through cross-validation. Second, individuals of known ancestry can be exploited in supervised learning to yield more precise ancestry estimates. Third, by penalizing small admixture coefficients for each individual, one can encourage model parsimony, often yielding more interpretable results for small datasets or datasets with large numbers of ancestral populations. Finally, by exploiting multiple processors, large datasets can be analyzed even more rapidly.

Conclusions

The enhancements we have described make ADMIXTURE a more accurate, efficient, and versatile tool for ancestry estimation.

1 Background

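Our program ADMIXTURE estimates individual ancestries by efficiently computing maximum likelihood estimates in a parametric model. The model treats the genotype $n_{ij}$ of individual $i$ at SNP $j$, coded as the number of copies of allele 1, as binomially distributed with two trials and success probability $p_{ij} = \sum_k q_{ik} f_{kj}$, where $q_{ik}$ is the fraction of individual $i$'s ancestry attributable to population $k$ and $f_{kj}$ is the frequency of allele 1 at SNP $j$ in population $k$. ADMIXTURE maximizes the log-likelihood

$$L(Q, F) = \sum_i \sum_j \left\{ n_{ij} \ln\Big[\sum_k q_{ik} f_{kj}\Big] + (2 - n_{ij}) \ln\Big[\sum_k q_{ik} (1 - f_{kj})\Big] \right\}$$

of the model using block relaxation, alternating updates of the parameter matrices $Q = \{q_{ik}\}$ and $F = \{f_{kj}\}$.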
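To make the model concrete, the following is a minimal sketch of the log-likelihood evaluation, assuming genotypes are coded as allele-1 counts in {0, 1, 2} with missing entries stored as NaN. The function name `admixture_loglik`, the array layout, and the clipping constant are assumptions of this sketch, not part of ADMIXTURE itself.

```python
import numpy as np

def admixture_loglik(G, Q, F, eps=1e-12):
    """Binomial admixture log-likelihood, up to an additive constant.

    G : (n, m) allele-1 counts in {0, 1, 2}; np.nan marks missing entries.
    Q : (n, K) ancestry fractions, each row summing to 1.
    F : (K, m) allele-1 frequencies for each population and SNP.
    """
    # p_ij = sum_k q_ik f_kj, clipped away from 0 and 1 for numerical safety.
    P = np.clip(Q @ F, eps, 1.0 - eps)
    observed = ~np.isnan(G)
    g = np.where(observed, G, 0.0)
    terms = g * np.log(P) + (2.0 - g) * np.log1p(-P)
    # Missing entries simply drop out of the sum.
    return float(terms[observed].sum())
```

Because the rows of `Q` sum to one, `1 - P` equals the second bracketed sum in the log-likelihood, so a single matrix product suffices.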

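The advanced features of ADMIXTURE described here allow the user to automate the choice of the number of underlying populations $K$ through cross-validation, to exploit individuals of known ancestry in supervised learning, to encourage model parsimony by penalizing small admixture coefficients, and to analyze large datasets more rapidly on multiple processors.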

2 Implementation

Cross-validation

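The choice of the number of ancestral populations $K$ can be guided by cross-validation, which judges each candidate $K$ by how well the fitted model predicts genotypes withheld from estimation.

Our $v$-fold cross-validation procedure partitions the observed genotypes $n_{ij}$ into $v$ folds. For each fold in turn, the genotypes in that fold are masked, the model is refit to the remaining data, and each masked $n_{ij}$ is predicted from the refit parameters. Prediction error is averaged across all masked entries over all folds. Minimizing this estimated prediction error on a grid of $K$ values then suggests the most suitable $K$.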
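The masking scheme can be sketched as follows. This is an illustration under stated assumptions rather than ADMIXTURE's implementation: `fit_em` is a plain EM routine standing in for ADMIXTURE's block relaxation optimizer, the binomial deviance is assumed as the prediction error measure, and all function names are hypothetical.

```python
import numpy as np

def binomial_deviance(n, mu, eps=1e-9):
    """Deviance of genotype n in {0, 1, 2} against its predicted mean mu."""
    mu = np.clip(mu, eps, 2.0 - eps)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Terms of the form 0 * log(0) are taken to be 0.
        a = np.where(n > 0, n * np.log(n / mu), 0.0)
        b = np.where(n < 2, (2.0 - n) * np.log((2.0 - n) / (2.0 - mu)), 0.0)
    return 2.0 * (a + b)

def fit_em(G, K, iters=200, seed=0):
    """Stand-in fitter (plain EM, NOT ADMIXTURE's optimizer); NaN = masked."""
    rng = np.random.default_rng(seed)
    n, m = G.shape
    obs = ~np.isnan(G)
    g = np.where(obs, G, 0.0)
    Q = rng.dirichlet(np.ones(K), size=n)      # (n, K), rows sum to 1
    F = rng.uniform(0.1, 0.9, size=(K, m))     # (K, m)
    for _ in range(iters):
        P = np.clip(Q @ F, 1e-9, 1.0 - 1e-9)
        A = np.where(obs, g / P, 0.0)                  # allele-1 weights
        B = np.where(obs, (2.0 - g) / (1.0 - P), 0.0)  # allele-0 weights
        num = F * (Q.T @ A)
        den = num + (1.0 - F) * (Q.T @ B)
        Q_new = Q * (A @ F.T + B @ (1.0 - F).T)
        Q = Q_new / np.maximum(Q_new.sum(axis=1, keepdims=True), 1e-12)
        F = np.clip(num / np.maximum(den, 1e-12), 1e-6, 1.0 - 1e-6)
    return Q, F

def cv_error(G, K, fit=fit_em, v=5, seed=0):
    """Average prediction error over all masked entries and all v folds."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(~np.isnan(G))
    fold = rng.permutation(rows.size) % v      # balanced random fold labels
    total = 0.0
    for f in range(v):
        r, c = rows[fold == f], cols[fold == f]
        masked = G.astype(float)               # astype returns a copy
        masked[r, c] = np.nan                  # hide this fold's genotypes
        Q, F = fit(masked, K)
        mu = 2.0 * (Q @ F)                     # predicted mean genotype
        total += binomial_deviance(G[r, c], mu[r, c]).sum()
    return total / rows.size
```

Calling `cv_error(G, K)` for each `K` on a grid and taking the minimizer mirrors the selection rule described above. Any fitting routine with the same `(G, K) -> (Q, F)` signature can be passed as `fit`.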

Supervised learning of admixture coefficients

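ADMIXTURE's strategy of simultaneously estimating individual ancestry fractions $Q$ and population allele frequencies $F$ is well suited to unsupervised analysis, in which nothing is assumed about the ancestral populations. When reference individuals of known ancestry are available, they can instead be exploited in supervised learning.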

Ancestries can be estimated more accurately in supervised analysis because there is less uncertainty in allele frequencies. Interpretation of results is simplified, and run times are shorter owing to the reduced number of parameters to estimate. Both the number of iterations until convergence and the computational complexity per iteration decrease. However, we caution that supervised analysis is only suitable when the reference individuals can be assigned to ancestral populations with certainty and ancestral populations are fairly homogeneous. For exploratory analyses, unsupervised analysis is more appropriate and therefore remains the default in ADMIXTURE.
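One way to picture the reduced parameter count is to hold the ancestry rows of reference individuals fixed during estimation. The sketch below assumes that each reference individual is assigned entirely to a single population and that such rows are reset after every update of the ancestry matrix. The function name `clamp_reference_rows` and the label convention (-1 for individuals of unknown ancestry) are hypothetical.

```python
import numpy as np

def clamp_reference_rows(Q, labels):
    """Fix the ancestry rows of reference individuals.

    Q      : (n, K) current ancestry fractions.
    labels : (n,) integer array; labels[i] is the population index of
             reference individual i, or -1 if i's ancestry is unknown.
    """
    Q = np.array(Q, dtype=float)        # work on a copy
    labels = np.asarray(labels)
    idx = np.flatnonzero(labels >= 0)   # reference individuals only
    Q[idx] = 0.0
    Q[idx, labels[idx]] = 1.0           # full ancestry from the labeled population
    return Q
```

Only the rows with label -1 remain free parameters, which is consistent with the shorter run times noted above.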

Penalized estimation and model parsimony

As noted in our later comparison of supervised and unsupervised learning, datasets culled from closely related populations typed at a modest number of SNPs can pose substantial challenges in ancestry estimation. For instance, overfitting tends to yield ancestry estimates with inflated amounts of admixture. The Bayesian solution to this problem is to impose an informative prior to steer parameter estimates away from danger when data are sparse. Thus, STRUCTURE imposes Dirichlet prior distributions on ancestry parameters and estimates a hyperparameter $\alpha$ that controls the strength of the prior.

A suitable alternative in our optimization framework is to perform penalized estimation. Rather than maximizing the log-likelihood, we maximize an objective function consisting of the log-likelihood minus an approximate $\ell_0$ penalty, which encourages not only shrinkage but also aggressive parsimony. In particular, the approximate $\ell_0$ penalty drives small admixture coefficients to zero. Parsimony is desirable because it leads to more easily interpretable and probably more realistic parameter estimates. Estimation is performed by maximizing the penalized objective function.
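As an illustration of how such a penalty behaves, the sketch below uses a logarithmic surrogate for the $\ell_0$ norm. The functional form, the function names, and the tuning constants `lam` (penalty strength) and `eps` (sharpness of the approximation) are assumptions of this sketch and may differ from the penalty ADMIXTURE applies.

```python
import numpy as np

def approx_l0_penalty(Q, lam=1.0, eps=0.01):
    """Smooth surrogate for lam * (number of nonzero entries of Q).

    Each coefficient q in [0, 1] contributes log(1 + q/eps) / log(1 + 1/eps),
    which is 0 at q = 0, 1 at q = 1, and rises steeply near 0 for small eps.
    """
    Q = np.asarray(Q, dtype=float)
    return float(lam * np.sum(np.log1p(Q / eps) / np.log1p(1.0 / eps)))

def penalized_objective(loglik, Q, lam=1.0, eps=0.01):
    """Log-likelihood minus the penalty; this is the quantity to maximize."""
    return loglik - approx_l0_penalty(Q, lam, eps)
```

Under this surrogate an unadmixed row such as (1, 0) is penalized less than an evenly admixed row such as (.5, .5), so weakly supported admixture must earn its place through a gain in log-likelihood.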

Determination of the penalty tuning constants

Exploiting multiple processors

Very large datasets (millions of SNPs, thousands of individuals) can reduce even ADMIXTURE's efficient algorithms to a crawl. Since our original publication, we have tuned our core algorithm and improved its speed by a factor of two. We have also implemented a parallel execution mode that lets ADMIXTURE exploit multiple processors. This new option employs the OpenMP framework.

analyzes the data file

Results and Discussion

The effectiveness of cross-validation

Figure _{ST }
_{ST }


**Cross-validation (CV) of three datasets derived from the HapMap 3 resource using v = 5 folds**. After subsampling 13,928 markers to minimize linkage disequilibrium, we separately cross-validated datasets containing unrelated individuals from the (a) CEU, (b) CEU, ASW, and YRI, and (c) CEU, ASW, YRI, and MEX HapMap 3 subsamples. Plots display CV error versus $K$.

Supervised analysis can yield more precise estimates

To explore the benefits of supervised analysis, we generated a number of artificial datasets and evaluated the empirical precision of parameter estimates compared to the true _{ST }
_{1 }of reference allele frequencies for population 1 was .046 for unsupervised analysis but .040 for supervised. In general, it appears that errors in estimating _{ST }
_{ST }
_{ST }

The flip side of the systematic overestimation of the separation between populations is that ancestry fraction estimates suffer from bias. In particular, individuals will be ascribed a greater degree of admixture than they actually possess. As the figure shows, individuals with low $q_{i1}$, reflecting a small degree of ancestry from population 1, have upward-biased estimates $\hat{q}_{i1}$, while individuals with high $q_{i1}$ exhibit a downward bias. The net effect is an apparent bias towards ancestry fractions of .5. Supervised analysis appears not to suffer from this bias.


**Errors in estimating ancestral allele frequencies lead to bias in estimating ancestry fractions (Q), with many individuals ascribed too much admixture**. The plot shows an estimate of the relationship

In our opinion, the apparent bias in unsupervised ancestry estimates should not be cause for alarm. The bias becomes much less prominent for larger datasets or datasets where the ancestral populations are better differentiated. Performing the same simulation with an $F_{ST}$

Hence, supervised analysis, when applicable, can yield more precise estimates that are less susceptible to the biases seen in unsupervised analysis. Another benefit of supervised analysis is that it runs considerably faster. For the 10 simulated datasets with 10,000 markers, supervised analysis took an average of 5.15 seconds, while unsupervised analysis averaged 27.5 seconds, a speedup of more than a factor of five.

The effects of penalized estimation

The bias in ancestry estimates observed in Figure _{ST }


**Penalized estimation can reduce the bias in ancestry estimates that appears for small marker sets or closely related ancestral populations**. We applied penalized estimation to the simulated dataset of 10,000 SNP markers from admixed individuals from two populations differentiated by $F_{ST}$

Conclusion

ADMIXTURE is a full-featured, highly efficient, and easy-to-use tool for ancestry estimation from SNP datasets. The four enhancements described here add flexibility to both exploratory and focused studies of genetic ancestry. Cross-validation enables rational choice of the number of ancestral populations. Supervised analysis mode can yield more accurate ancestry estimates when the number and makeup of contributing populations are certain. Parallelizing the code reduces run times and allows more ambitious analyses involving more people and SNPs. Finally, penalizing weak evidence for admixture promotes model parsimony and yields ancestry fractions more in line with users' expectations.

Availability and requirements

**Project name:** ADMIXTURE

**Project home page:**

**Software.zip, a zip archive containing Mac OS X and Linux executables, is a snapshot of the ADMIXTURE software at the time of submission of this manuscript**. The current version is maintained at


**Operating systems:** Linux, Mac OS X

**Programming languages:** C++

**Other requirements:** None

**License:** Binaries freely available; source code proprietary

**Any restrictions to use by non-academics:** None

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

DHA and KL devised the algorithms for penalized estimation, cross-validation, supervised analysis, and parallel execution. DHA implemented the software. DHA and KL designed the experiments, which DHA then executed and analyzed. DHA and KL composed the manuscript. The authors have approved the final manuscript.

Acknowledgements and Funding

We thank John Novembre and Marc Suchard for helpful suggestions. This work was supported by Grant T32GM008185 to D.H.A. from the National Institute of General Medical Sciences and by Grants GM53275 and MH59490 to K.L. from the United States Public Health Service.