Abstract
Background
The ever increasing sizes of population genetic datasets pose great challenges for population structure analysis. The TracyWidom (TW) statistical test is widely used for detecting structure. However, it has not been adequately investigated whether the TW statistic is susceptible to type I error, especially in large, complex datasets. Nonparametric, Principal Component Analysis (PCA) based methods for resolving structure have been developed which rely on the TW test. Although PCAbased methods can resolve structure, they cannot infer ancestry. Modelbased methods are still needed for ancestry analysis, but they are not suitable for large datasets. We propose a new structure analysis framework for large datasets. This includes a new heuristic for detecting structure and incorporation of the structure patterns inferred by a PCA method to complement STRUCTURE analysis.
Results
A new heuristic called EigenDev for detecting population structure is presented. When tested on simulated data, this heuristic is robust to sample size. In contrast, the TW statistic was found to be susceptible to type I error, especially for large population samples. EigenDev is thus bettersuited for analysis of large datasets containing many individuals, in which spurious patterns are likely to exist and could be incorrectly interpreted as population stratification. EigenDev was applied to the iterative pruning PCA (ipPCA) method, which resolves the underlying subpopulations. This subpopulation information was used to supervise STRUCTURE analysis to infer patterns of ancestry at an unprecedented level of resolution. To validate the new approach, a bovine and a large human genetic dataset (3945 individuals) were analyzed. We found new ancestry patterns consistent with the subpopulations resolved by ipPCA.
Conclusions
The EigenDev heuristic is robust to sampling and is thus superior for detecting structure in large datasets. The application of EigenDev to the ipPCA algorithm improves the estimation of the number of subpopulations and the individual assignment accuracy, especially for very large and complex datasets. Furthermore, we have demonstrated that the structure resolved by this approach complements parametric analysis, allowing a much more comprehensive account of population structure. The new version of the ipPCA software with EigenDev incorporated can be downloaded from http://www4a.biotec.or.th/GI/tools/ippca webcite.
Background
As genotyping platforms incorporate more markers, and the costs for genotyping keep falling, ever larger and more complex datasets are being analyzed. The computationally efficient nonparametric methods for analysis of genotypic datasets are thus increasingly being used to reveal population structure. Resolution of population structure reveals evolutionary relationships between groups of individuals. Furthermore, population structure must be accounted for in genomewide association studies to reduce spurious associations resulting from ancestral differences between cases and controls [1].
Principal component analysis (PCA) is a widely used nonparametric method for population structure analysis, which uses a covariance matrix for eigenanalysis. The amount and axes of variation among individuals are captured in the eigenvalues and eigenvectors, respectively. Previously, we developed a PCA framework for population structure analysis which extended the use of PCA beyond its usual application for visualizing the population structure trend by employing an iterative process to simplify the pattern of population structure. The iterative methods used by others, e.g. [2,3] rely on the available ethnogeographical population labels for subjectively grouping individuals, unlike our objective approach.
Our framework, which we dubbed iterative pruning PCA (ipPCA) uses a clustering algorithm to assign individuals into subpopulations without imposing any prior assumptions [4]. ipPCA resolves all subpopulations in a population dataset, and thus reports the total number of primal subpopulations K in addition to assigning individuals contained within them. The term "population" is synonymous with dataset for ipPCA, which is the entire collection of individuals available for analysis. The term "subpopulation" defines a group of individuals assigned by ipPCA in which no further significant substructure is present. ipPCA operates by systematically separating individuals into two clusters using a clustering algorithm based on the Euclidean distances between projected data points and the cluster centroids. The decision to separate individuals requires testing of whether significant structure is present within the dataset (or nested dataset for subsequent iterations of the algorithm). To test for homogeneity among groups of individuals, we previously proposed using the test statistic as implemented in the EIGENSTRAT/SmartPCA algorithm, which reports the probability of structure according to TracyWidom (TW) distribution [5]. If no significant structure exists, then the individuals under testing belong to a subpopulation, thus terminating the iterative clustering process. The ipPCA framework is summarized in Figure 1. Using datasets of simulated and real data, we showed how ipPCA can correctly assign individuals to subpopulations and infer K. However, the accuracy of ipPCA may be affected by the stopping criterion. An inappropriate termination criterion leads to under or overestimation of the number of subpopulations. Moreover, individual assignment errors in early iterations will be compounded and carried forward to later iterations.
Figure 1. Outline of the ipPCA framework. The framework consists of three main components. First, the genetic data are encoded, zeromeans centered and normalized. Then, individuals are projected onto a space spanned by the principal components of the input data matrix. Next, a structure metric is calculated to decide whether to advance to the clustering step or to terminate the algorithm. When the metric does not cross the threshold, a homogenous subpopulation is resolved and subsequently the algorithm terminates. Otherwise, the individuals are bisected. The algorithm iterates until all individuals have been assigned into terminal subpopulations.
Parametric algorithms for clustering individuals into subpopulations, e.g., STRUCTURE, frappe, ADMIXTURE, and BAPS, differ from ipPCA in one crucial aspect, namely the method of assigning individuals into subpopulation clusters. The aforementioned parametric algorithms infer ancestral proportions for each individual separately, and group individuals with similar patterns of inferred ancestry. ipPCA and other nonparametric approaches cannot infer ancestry. These techniques attempt to group individuals with similar genetic profiles together. Hence, parametric approaches still offer important information not seen by nonparametric analyses. Large and highly structured population datasets are however intractable for parametric analysis because the number of K ancestral clusters is limited. This is due to the limited number of available samples used to estimate subpopulation allele frequencies. In order to better observe the inherent population structure, a "supervised" structure analysis, with resampled individuals, should be performed. The choice of individuals for such supervised analysis is arbitrary and typically guided by available ethno/geographical labels. Nonetheless, careful selection is needed to ensure that individuals being compared have similar ancestries, otherwise the signals of ancestries important for differentiating some groups of individuals may be too weak.
In this paper, we propose a modification to ipPCA by introducing a new stopping criterion called EigenDev for the iterative clustering process which is more robust to spurious patterns in large datasets. The new algorithm is termed EigenDevipPCA. To distinguish between the two algorithms in the ipPCA framework, we refer to the previously proposed algorithm which uses the TW statistic as the termination criterion as TWipPCA in the subsequent sections. Furthermore, we suggest a new protocol which uses the information from EigenDevipPCA to guide parametric analysis. Using real datasets, we demonstrate how this approach can reveal new and structureinformative patterns of ancestry not detectable with unsupervised STRUCTURE analysis.
Methods
New ipPCA terminating criterion
The TracyWidom (TW) test statistic, which is implemented in the EIGENSTRAT/SmartPCA algorithm [5], is used as a stopping criterion for the TWipPCA algorithm. Although this stopping criterion has been found to work well for some datasets, we found that when much larger datasets containing roughly >1000 individuals were analyzed, the TWipPCA resolved far more subpopulations than were expected. We therefore suspected that in some cases when sampling is large, the subpopulations resolved may be spurious, i.e., type I error. Indeed, as pointed out in [5], the relative sample sizes of the underlying subpopulations affect the TW test statistic.
Besides the type I error we found when using the TW statistical test for structure, there are other drawbacks which motivated us to develop an alternative terminating criterion. The first issue is computational difficulty. To obtain the final value of the TW test statistic, too many unknown parameters need to be estimated. No best estimators for these parameters are available, so choices of estimators affect the result. Instead of using the pvalues of TW test statistics as thresholds, we propose a new terminating criterion for determining whether the data are structured. The new criterion is based on the eigenvalues of the data matrix and is termed the EigenDev heuristic. The EigenDev heuristic follows the same assumption as the TW theory, namely, if the first eigenvalue of the data matrix is significantly larger than the remaining eigenvalues, then substructure exists. However, we extend this observation beyond merely testing the significance of the first eigenvalue to take into account the remaining variance of the data. This allows us to observe structure in higher dimensions. We were inspired to develop EigenDev from the Eigenvalue Grads heuristic, which is applied in the signal processing domain [6]. This work showed that if the data contain only noise and no signal, i.e., nonstructured, then there is an excellent linear fit for the eigenvalues ranked in descending order. In population genetic data, the noise represents the natural genetic variation within a (sub)population.
To test for population structure, the EigenDev statistic is calculated from the genotypic data. This calculation first requires that a data matrix is constructed from encoded, zeromeans and normalized genotypic data, as described in [5]. This matrix contains rows corresponding to individuals and columns corresponding to alleles. Thus, biallelic SNP markers are encoded by entries in two columns, one for each allele, and STRs by the total number of alleles for that marker locus in the dataset. The presence of an allele is encoded as 1 and its absence as 0. For missing data, i.e., markers with no genotypic call, they are encoded as all 0's.
Given the zeromeans, normalized genotype data matrix X (according to [5]) containing m samples with n allele columns per sample, we construct the sample covariance matrix
The EigenDev value can then be computed from
where
and
where , are the first p eigenvalues of C ranked in descending order. The quantity in Eq. (1) could be negative in some cases. To militate against this possibility, the encoded entries are normalized to have zero mean. This step is important to remove the signal from the common elements, leaving only the differences (genetic variance) between individuals for eigenanalysis. In all empirical studies on both simulated and real data, we found that 90% of the variance in the data always results in a positive value and the convexity constraint in question has never been violated. To account for the rare cases when negative values are encountered, we have included a checking step in the algorithm to detect and report negative values. If negative values are found, the parameter p can be adjusted to ensure a positive quantity in the square root. Recall that p < min{m, n} is the number of eigenvalues used to compute the EigenDev statistic. We also stabilize the variance using log transformation. If the EigenDev value is large, the group of individuals being analyzed would comprise more than one subpopulation and ipPCA progresses to bisect the group; otherwise, the EigenDevipPCA algorithm terminates when the EigenDev value falls below a threshold.
Results
Testing
To test the EigenDev concept, several datasets were analyzed:
1. A simulated dataset composed of 10,000 individuals from the same population, each containing 10,000 SNP markers was used for testing the fit of TW distribution. It was generated using the GENOME tool [7] with the following parameters and the following tree file:
pop 1 10000 N tree.txt C 20 S 500
tree.txt:
0 10000
11
1 10000
Starting at 10,000 founder population individuals, GENOME generates the first generation with the same size as the founder. Each individual has 20 chromosomes and each chromosome contains 500 SNPs.
2. The second dataset was simulated using the same GENOME parameters as the first dataset but with different tree file:
tree.txt:
0 5000 5000
11 21
40 5000
11 12
80 5000 5000
11 21
100 10000
to generate two subpopulations of size 5,000 individuals each.
3. The third dataset is the Bovine HapMap Project collection of 497 individuals obtained from 19 different breeds, genotyped for 27203 SNPs. It is publicly available from [8].
4. The fourth dataset is publicly available from [9]. It contains 3945 individuals comprising 185 different ethno/geographical labels, typed for 1327 markers (consisting of 848 microsatellites, 476 indels, and 3 SNPs) from [10].
The ipPCA encoded input matrices from the simulated and real complex datasets are also available for download from http://www4a.biotec.or.th/GI/tools/ippca webcite.
Testing metrics for population structure
To test how TW is affected by sampling, a simulated dataset with no substructure was sampled randomly at 20 different sample sizes from 10 to 200 individuals. The corresponding probabilityprobability (pp) plots for testing the fit of the TW distribution are shown in Figure 2. It is observed that the TW distribution is violated for most of the sample sizes; good fit is observed only for the sample of 70 individuals. Therefore, the deviation from TW distribution will give a false detection (type I error), particularly for large sample sizes. On the other hand, the TW test is very sensitive for detecting structure, since it is based on a nonlinear phase change. It is not susceptible to type II error provided sufficient data are available [5]. However, the nonlinearity of the phase change means that an allornothing situation exists where the likelihood of type I cannot be controlled, even across a wide range of pvalue thresholds.
Figure 2. Testing the fit of the TW distribution. A population of size 10,000 individuals with 5,000 markers was simulated using the coalescent model. The pp plots were generated for sample sizes of 10 to 200 individuals.
To test the performance of the EigenDev heuristic, we simulated the receiver operating characteristic (ROC) curves for three different sample sizes of 100, 200, and 500 individuals from the second simulated dataset, as shown in Figure 3. To obtain the curves, the EigenDev threshold was varied between 0.077 and 0.387. It is observed that the threshold value increases with samples size, and that EigenDev performs better when the sample size is large. An EigenDev threshold of 0.21 was used for analysis of real datasets. This value is an average of the thresholds needed to achieve a 10% false positive rate for the three sample sizes. This value is a good compromise between detecting and resolving all structure present, with minimal spurious structure at typical sample sizes in real datasets.
Figure 3. The empirical receiver operating characteristic curve of the EigenDev heuristic. A structured population of 10,000 individuals of 5,000 markers containing two subpopulations (5,000 each) was simulated using the coalescent model. The ROCs were generated for sample sizes of 100, 200, and 500 individuals.
Guiding parametric analysis with ipPCA
STRUCTURE [11] can be used to perform unsupervised clustering using ancestral components information. However, the high computational complexity of STRUCTURE, especially in finding the maximum posterior probabilities for the number of K ancestral clusters limits practically to K = 20 or fewer. Therefore, highly complex datasets must be divided into subdatasets, which are then analyzed separately by STRUCTURE. Conventionally, this is done in an arbitrary fashion using prior information, e.g., ethnogeographical population labels. However, the prior information could bias the clustering results. To address this issue, we propose using the unsupervised clustering feature of ipPCA to assist in narrowing the search space for STRUCTURE in a more efficient fashion. In practice, subpopulations assigned by ipPCA can be selected for subsequent STRUCTURE analysis. We call this approach ipPCAguided STRUCTURE. We applied this method to the Bovine HapMap dataset [8], which is the expanded dataset from the one previously analyzed by us [4]. The result was similar to that reported earlier, i.e. EigenDevipPCA resolved 18 subpopulations, each of which are largely composed of individuals of the same breed, except for one subpopulation containing Angus (ANG) and Red Angus (RGU) individuals (the EigenDevipPCA results can be viewed from the ipPCA download webpage).
STRUCTURE was used with the default parameters and 10,000 burn in and 10,000 run iterations. Individuals from the Gir (GIR), Brahman (BRM), and Nelore (NEL) breeds resolved as three separate subpopulations by EigenDevipPCA) were selected for STRUCTURE analysis to determine whether differences in inferred ancestry exist between these breeds. Furthermore, these three breeds were chosen because they are B.indicus breeds, and thus more closely related to each other than the other B.taurus breeds in the dataset. STRUCTURE analysis at K = 3 on these selected individuals, as shown in Figure 4, revealed breeddistinctive patterns of ancestry not previously reported.
Figure 4. ipPCAguided STRUCTURE analysis on selected individuals from Bovine HapMap dataset. STRUCTURE analyses were performed on individuals from B. indicus breeds (GIR, BRM, and NEL). Results with K = 2 and K = 3 are shown.
Analysis of a large human dataset by ipPCA
The dataset from Tishko et.al. [10] contains a large number of individuals (3945). Furthermore, these individuals comprise 185 ethnolinguistic distinguishing labels suggesting a large number of genetically distinct groups. The dataset was analyzed by EigenDevipPCA, which assigned 49 subpopulations (Figure 5). The assigned subpopulations were largely consistent with the patterns reported earlier [10], in which geographically disparate groups of individuals are genetically distinct, and within Africa, major cultural and linguistic groups are also genetically distinct (see Additional file 1 for more information). In contrast, ipPCA using the TW stopping criterion (TWipPCA) assigned 109 subpopulations. Comparison of the subpopulations which differed between the two methods showed that on the whole, subpopulations assigned by TWipPCA were subclusters of larger subpopulations assigned by EigenDevipPCA. For instance, all Indian individuals (15 ethnic labels) were assigned to two subpopulations (SP2 and SP7) by EigenDevipPCA, whereas Indians were assigned to 11 subpopulations by TWipPCA (see Additional file 1).
Figure 5. Population assignments of the Tishkoff et al dataset using the EigenDevipPCA method. 49 assigned subpopulations are labeled SP1 to SP49. The height of the bars are proportional to the number of assigned individuals in each subpopulation. The population labels of the assigned individuals are shown to the right of each bar with the number of individuals with the same label in parentheses. To aid visualization of the individual assignment, the 185 population labels were grouped into 14 color groups reflecting geographical regions. Color gradients within the color group denote different population labels. For the complete color scheme, see Figure s3 in the Additional file 1.
Additional file 1. The detailed analysis and further discussion of the EigenDevipPCA results for the Tishkoff et al. dataset.
Format: PDF Size: 265KB Download file
This file can be viewed with: Adobe Acrobat Reader
ipPCAguided STRUCTURE analysis
African American is a term used to describe US nationals with selfidentified African ancestry, the majority of whom are descended from West African individuals who came to the US via the slave trade. The term African American though is very broad, as it encompasses individuals descended from African ancestors from a broad geographical range, and some also have recent nonAfrican ancestry. African American individuals were assigned into four subpopulations by EigenDevipPCA, namely SP4, SP5, SP15 and SP16. Subpopulations SP4 and SP5 contain the majority of African Americans together with predominantly West and Central African NigerKhordofanian speaking ethnic groups. Five African Americans were assigned to SP15, which contains predominantly Afroasiatic Cushitic speaking Bejans from Sudan. Two African Americans were assigned to SP16, which contains predominantly East Africans of mixed NiloSaharan Sudanic and Afroasiatic Cushitic speaking ethnic groups.
We then used the information from EigenDevipPCA to guide STRUCTURE. All the individuals assigned to SP4, SP5, SP15 and SP16, which included all AfricanAmerican individuals, were analyzed by STRUCTURE from K = 2 to K = 5 (see part A in Figure 6). At K = 3 or greater, each of the four subpopulations assigned by EigenDevipPCA showed distinctive patterns of ancestry, although there appeared to be some overlap between SP15 and SP16 individuals. When focusing on the AfricanAmerican individuals, distinctive ancestry patterns can also be observed, in particular when comparing SP4 and SP5 assigned individuals (see part B in Figure 6).
Figure 6. ipPCAguided STRUCTURE analysis on selected individuals from the Tishkoff et.al. dataset. A) All individuals assigned to SP4, SP5, SP15 and SP16 (see Figure 5), which included all AfricanAmerican individuals, were analyzed by STRUCTURE from K = 2 to K = 5. Individuals were sorted according to the ipPCA assignments. Major ethnolinguistic labels for individuals within each subpopulation are also shown (see Figure 5 for complete listing). B) Expanded view of AfricanAmerican individuals from A).
Discussion
TW and EigenDev stopping criteria
Analysis of population genetic structure requires first a method for detecting whether significant structure exists in the dataset (or nested dataset for further iterations of ipPCA). The current method to obtain this information is to test for deviation from the TracyWidom distribution of the largest eigenvalue computed from PCA. A pvalue lower than 10^{12 }is considered an acceptable threshold for significance in rejecting the null hypothesis that the data belong to a homogenous (sub)population, and thus are structured [5]. The first experiment with a simulated dataset with no structure revealed that significant deviation from the expected distribution is found, particularly with large sampling (>70 individuals). We infer from this result that when the sample size is large, the TW method suffers from type I error because of this deviation from the TW distribution. Simply using lower pvalue thresholds may not give better results, since there is a very small range of pvalue that is practical [5]. When applied to real datasets, homogenous (sub)populations sampled at high density may be incorrectly construed as possessing structure. In TWipPCA, this would lead to a group of individuals being assigned into separate subpopulations, when they should actually be considered belonging to a single (sub)population.
To alleviate the drawbacks of the TW test statistic, we propose a new termination criterion called EigenDev statistic that is simpler to compute, has no hidden parameters and is shown to be more robust to type I error. For simplicity, one could choose a single EigenDev value to be applied as a universal stopping criterion for ipPCA, which needs to be determined empirically. We determined a threshold of 0.21 from data simulation, which was also appropriate for the real datasets analyzed in this paper.
Analyses of Bovine HapMap dataset
The subpopulation assignment by EigenDevipPCA supports the accepted notion that cattle breeds have distinctive genetic profiles. The finding that ANG and RGU were assigned together in the same subpopulation suggests that these breeds are genetically indistinguishable for the markers available, which was also reported by other methods [12]. However, the finding that GIR, BRM, and NEL breeds are resolved as separate subpopulations by EigenDevipPCA is novel, since the earlier unsupervised STRUCTURE analysis in [12] on the entire dataset could not distinguish these breeds. ipPCAguided STRUCTURE analysis on the Bovine HapMap dataset demonstrated differences in ancestries among these breeds, consistent with the assignments by EigenDevipPCA. Among these indicine breeds, there is evidence (high heterozygosity and unique SNPs) to suggest that BRM is genetically distinct from others, including GIR and NEL [12]. These results beg the question, why STRUCTURE analysis, when done in a EigenDevipPCA guided manner, can reveal differences among these breeds which is not apparent in the unsupervised STRUCTURE analysis? The likeliest explanation is that the overall number of informative markers is low among these indicine breeds in comparison with the others (only 19% of the loci having minor allele frequencies greater than 0.3) [12]. In other words, the allele frequencies among the indicine breeds are highly correlated in comparison with the taurine breeds. Groups of individuals with highly correlated allele frequencies in comparison with other groups tend to be merged by STRUCTURE [11].
Analyses of a large human dataset
The 49 subpopulations assigned by EigenDevipPCA each contain individuals largely sharing the same ethnolinguistic label/affiliation, in accordance with [10,13]. Of note, the 426 Indian individuals were assigned to two subpopulations by EigenDevipPCA. This grouping is consistent with the parametric analysis of these individuals in [13], which showed weak evidence of structure. Hence, the greater degree of stratification resolved by TWipPCA compared with EigenDevipPCA is likely to be spurious. The spurious structure resolved by TWipPCA is thus attributable to the large sample size (426), which is well above the threshold encountered for type I error from the analysis of simulated data.
Among the African individuals, subpopulations were assigned by EigenDevipPCA revealing stratification patterns not described previously. For instance, NigerKhordofanian speaking nonPygmy individuals from West and Central Africa could not be distinguished genetically in [10], but were assigned to SP3, SP4 and SP5 subpopulations by EigenDevipPCA. The assignment of the majority African Americans to SP4 and SP5 by EigenDevipPCA (Figure 5) suggests they have West and Central African NigerKhordofanian ancestors, in agreement with [10]. On the other hand, the assignment of African Americans to different subpopulations by EigenDevipPCA is suggestive of significant structure among these individuals. Supervised STRUCTURE runs performed in [10] to elucidate African American ancestry could only reveal a subtle clinal pattern of variation among the African Americans. The EigenDevipPCA guided STRUCTURE analysis, however, shows clear differences in ancestry between SP4 and SP5 African Americans. The SP15 and SP16 assigned African Americans also show ancestry distinct from the SP4 and SP5 assigned individuals, although given the small number of individuals assigned to SP15 and SP16, it is not possible to observe significant ancestry differences between these two groups.
The EigenDevipPCA assignment of some African Americans to SP15 and SP16 was unexpected. The contemporary African individuals in these subpopulations are predominantly from Saharan and East Africa. A recent study of African American ancestry concluded that some individuals have a major ancestral component which is neither West African NigerKhordofanian, nor European [14]. The possibility that this anomalous ancestry is of Saharan or East African may also be reflected in mtDNA haplotypes, since some African Americans have anomalous haplotypes of unknown African origin [15,16]. The discrepancy between EigenDevipPCA guided STRUCTURE and supervised STRUCTURE performed in [10] is due to the choice of individuals in the analysis. When individuals with inappropriately diverged allele frequencies from others are used, key ancestral differences will be missed, the same as was shown in analysis of Bovine data.
Conclusion
We describe EigenDevipPCA for analyzing population structure. This approach assigns individuals to subpopulations and determines the total number of subpopulations present. This algorithm incorporates a novel heuristic called EigenDev for detecting substructure, which is applied to the iterative clustering process. EigenDev is robust to population sampling, allowing us to analyze large complex datasets with higher accuracy. The subpopulations assigned by EigenDevipPCA reveals overall genetic relatedness among groups of individuals, which can then be used to guide STRUCTURE. Other parametric algorithms such as Admixture and frappe could also be used in the same way. Therefore, the combination of EigenDevipPCA and STRUCTURE are complementary and can be used together to perform a powerful population stratification analysis. The software both in Matlab source code (m file) and executable versions on Windows and Linux (64 bit) are available for download at http://www4a.biotec.or.th/GI/tools/ippca webcite.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
TL, AI, PJS and ST wrote the manuscript. TL, AI and ST constructed the computational improvement scheme of the new algorithm. AA, PJS and JP conceived the ideas to reanalyze the mixed complex datasets. TL, AI, PW and CN conducted all the experiments presented in this work. TL, AA, PJS, JP and ST analyzed the results. AI, PW and CN wrote the EigenDevipPCA program and made it available in executable formats using a Matlab compiler. All authors have read and approved the final manuscript.
Acknowledgements and Funding
The authors thank King Mongkut's Institute of Technology Ladkrabang (KMITL), the National Electronics and Computer Technology Center (NECTEC) and the National Center for Genetic Engineering and Biotechnology (BIOTEC) for financial support. In particular, TL and ST acknowledge the funding supported by the National Science and Technology Development Agency (NSTDA). AA was supported by BIOTEC postdoctoral fellowship. JP was supported by BIOTEC platform technology grant and the Thailand Research Fund (TRF) new researcher grant (TRG5380028). PJS was funded by the Bill & Melinda Gates Foundation through the Grand Challenges Explorations Initiative. ST received the support from BIOTEC platform technology and TRF Career Development grant (RSA54). We thank the reviewers and editors for their constructive comments, which improve the quality and presentation of the manuscript. Finally, we would like to thank the authors and the donors who made the real datasets used in this paper available.
References

Marchini J, Cardon LR, Phillips MS, Donnelly P: The effects of human population structure on large genetic association studies.
Nat Genet 2004, 36(5):5127. PubMed Abstract  Publisher Full Text

Tian C, Plenge R, Ransom M, Lee A, Villoslada P, Selmi C, Klareskog L, Pulver A, Qi L, Gregersen P, Seldin M: Analysis and application of European genetic substructure using 300 K SNP information.
PLoS Genet 2008, 4:e4. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Paschou P, Lewis J, Javed A, Drineas P: Ancestry informative markers for finescale individual assignment to worldwide populations.
J Med Genet 2010, 47(12):83547. PubMed Abstract  Publisher Full Text

Intarapanich A, Shaw PJ, Assawamakin A, Wangkumhang P, Ngamphiw C, Chaichoompu K, Piriyapongsa J, Tongsima S: Iterative pruning PCA improves resolution of highly structured populations.
BMC Bioinformatics 2009, 10:382. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Patterson N, Price A, Reich D: Population structure and eigenanalysis.
PLoS Genet 2006, 2(12):e190. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Luo J, Zhang Z: Using Eigenvalue Grads Method to Estimate the Number of Signal Source. In Proceedings of the 5th International Conference on Signal Processing (WCCCICSP 2000). Volume 1. Beijing, China; 2000::223225.

Liang L, Zollner S, Abecasis GR: GENOME: a rapid coalescentbased whole genome simulator.
Bioinformatics 2007, 23(12):15657. PubMed Abstract  Publisher Full Text

The BovineHapMap dataset [http://bfgl.anri.barc.usda.gov/cgibin/hapmap/affy2/BulkDownloads] webcite

The Tishkoff et. al. dataset [http:/ / www.sciencemag.org/ content/ vol0/ issue2009/ images/ data/ 1172257/ DC1/ 1172257_dataset.zip] webcite

Tishkoff SA, Reed FA, Friedlaender FR, Ehret C, Ranciaro A, Froment A, Hirbo JB, Awomoyi AA, Bodo JM, Doumbo O, Ibrahim M, Juma AT, Kotze MJ, Lema G, Moore JH, Mortensen H, Nyambo TB, Omar SA, Powell K, Pretorius GS, Smith MW, Thera MA, Wambebe C, Weber JL, Williams SM: The genetic structure and history of Africans and African Americans.
Science 2009, 324(5930):103544. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pritchard JK, Stephens M, Donnelly P: Inference of Population Structure Using Multilocus Genotype Data.
Genetics 2000, 155:94559. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Consortium TBH: GenomeWide Survey of SNP Variation Uncovers the Genetic Structure of Cattle Breeds.
Science 2009, 324(5926):52832. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Rosenberg N, Mahajan S, GonzalezQuevedo C, Blum M, NinoRosales L, Ninis V, Das P, Hegde M, Molinari L, Zapata G, Weber J, Belmont J, Patel P: Low levels of genetic divergence across geographically and linguistically diverse populations from India.
PLoS Genet 2006, 2(12):e215. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bryc K, Auton A, Nelson MR, Oksenberg JR, Hauser SL, Williams S, Froment A, Bodo JM, Wambebe C, Tishkoff SA, Bustamante CD: Genomewide patterns of population structure and admixture in West Africans and African Americans.
Proc Natl Acad Sci USA 2010, 107(2):78691. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Salas A, Richards M, Lareu MV, Scozzari R, Coppa A, Torroni A, Macaulay V, Carracedo A: The African diaspora: mitochondrial DNA and the Atlantic slave trade.
Am J Hum Genet 2004, 74(3):45465. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ely B, Wilson JL, Jackson F, Jackson BA: AfricanAmerican mitochondrial DNAs often match mtDNAs found in multiple African ethnic groups.
BMC Biol 2006, 4:34. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text