Email updates

Keep up to date with the latest news and content from BMC Proceedings and BioMed Central.

This article is part of the supplement: Genetic Analysis Workshop 17: Unraveling Human Exome Data

Open Access Open Badges Proceedings

Principal components ancestry adjustment for Genetic Analysis Workshop 17 data

Jing Jin1, Jane E Cerise1, Sun Jung Kang2, Eun Jung Yoon1, Seungtai Yoon3, Nancy R Mendell1 and Stephen J Finch1*

Author Affiliations

1 Department of Applied Mathematics and Statistics, State University of New York at Stony Brook, Stony Brook, NY 11733, USA

2 Henri Begleiter Neurodynamics Laboratory, Department of Psychiatry and Behavioral Sciences, SUNY Downstate Medical Center, 450 Clarkson Avenue, Box 1203, Brooklyn, NY 11203, USA

3 Seaver Autism Center and Department of Psychiatry, Mount Sinai School of Medicine, One Gustave L. Levy Place, Box 1668, New York, NY 10029, USA

For all author emails, please log on.

BMC Proceedings 2011, 5(Suppl 9):S66  doi:10.1186/1753-6561-5-S9-S66

The electronic version of this article is the complete one and can be found online at:

Published:29 November 2011

© 2011 Jin et al; licensee BioMed Central Ltd.

This is an open access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Statistical tests on rare variant data may well have type I error rates that differ from their nominal levels. Here, we use the Genetic Analysis Workshop 17 data to estimate type I error rates and powers of three models for identifying rare variants associated with a phenotype: (1) by using the number of minor alleles, age, and smoking status as predictor variables; (2) by using the number of minor alleles, age, smoking status, and the identity of the population of the subject as predictor variables; and (3) by using the number of minor alleles, age, smoking status, and ancestry adjustment using 10 principal component scores. We studied both quantitative phenotype and a dichotomized phenotype. The model with principal component adjustment has type I error rates that are closer to the nominal level of significance of 0.05 for single-nucleotide polymorphisms (SNPs) in noncausal genes for the selected phenotype than the model directly adjusting for population. The principal component adjustment model type I error rates are also closer to the nominal level of 0.05 for noncausal SNPs located in causal genes for the phenotype. The power for causal SNPs with the principal component adjustment model is comparable to the power of the other methods. The power using the underlying quantitative phenotype is greater than the power using the dichotomized phenotype.


One limitation of genome-wide association studies is that population stratification can be a confounding variable. Population stratification occurs when there are systematic ancestry differences in allele frequencies between case subjects and control subjects. If not taken into account, population stratification can cause false-positive and/or false-negative findings [1] and can produce spurious associations [2]. Principal components analysis can be used to correct for population stratification by applying methods that infer genetic ancestry [3]. Population stratification is mainly due to the demographic history of a population, natural selection, and random fluctuations resulting from admixture. In this paper we examine the statistical properties of analysis procedures used in genome-wide association studies by adjusting principal components (PCs) across the whole genome. Another approach is to use local PC adjustment [4], but the Genetic Analysis Workshop 17 (GAW17) genotype data are not sufficiently extensive to consider this strategy.

The GAW17 data set is composed of mini-exome simulated data using 697 unrelated subjects from the 1000 Genomes Project. The quantitative phenotypes Q1 and Q2 are generated as normally distributed phenotypes. We document the p-value of the test of the coefficient of a genotype with and without adjusting for population stratification in selected genes known not to cause the phenotypes Q1 and Q2. We compare the power of the regression coefficient test when using PCs for ancestry adjustment with the power when using the seven populations given as ancestry controls for all the genes known to cause phenotypes Q1 and Q2. We study two types of phenotype, quantitative and dichotomized, test all the single-nucleotide polymorphisms (SNPs) that cause Q1 and Q2, and examine selected noncausal SNPs for these two traits.


A SNP that causes a trait is one that is specified in the function used to simulate the trait [5,6]. Any other SNP is called noncausal. SNPs on chromosomes 12, 21, and 22 are used as SNPs not causing Q1. SNPs on chromosomes 21 and 22 are used as SNPs not causing Q2. Table 1 lists the distribution of the minor allele frequencies (MAFs) of the SNPs in the genes studied.

Table 1. Distribution of minor allele frequencies of SNPs in the genes studied

We dichotomize the quantitative measures Q1 and Q2 so that the top 25% of each of the 200 replicates is scored as affected (1) and others as unaffected (0).The independent variables in these analyses are selected from the number of minor alleles in the ith SNP genotype (SNPi), the participant’s age (Age) and smoking status (Smoking), six indicator variables of the populations (POP1, …, POP6), and the 10 ancestry-adjusted PC scores (GPC1, …, GPC10). We use the FamCC software [7] to calculate these 10 PCs. All 24,487 SNPs are used in the calculations.

We use the PLINK software [8] to fit three logistic regression models to assess the association between each SNP in the genes studied and the dichotomized phenotype. The ith SNP is considered associated with the phenotype when the permutation p-value of the coefficient of SNPi reported in the PLINK logistic regression analysis is less than 0.05. Because Q1 is affected by age and smoking, the models considered are the following: (1) the SNP model, in which each SNP is adjusted for age and smoking; (2) the population adjustment model, in which each SNP is adjusted for the populations, age, and smoking; and (3) the PC adjustment model, in which each SNP is adjusted for age, smoking, and ancestry adjustment PCs. The models are defined as follows:

SNP model:


Population adjustment model:


PC adjustment model:


For the population adjustment model, only six indicators are needed to represent seven populations. The Luhya population is the reference population for the dichotomized phenotype, and the CEU population (European-descended residents of Utah) is the reference population for the quantitative phenotype. Because Q2 is not associated with either age or smoking, the covariates Age and Smoking are not used in the models for Q2. We also fit the three models to the continuous phenotypes Q1 and Q2 using PLINK. Each model is fitted to the 200 replicates provided.


The type I error rate (i.e., false-positive rate) for noncausal genes is the fraction of p-values from noncausal SNPs with permutation p-value less than 0.05. Table 2 contains the type I error rates for Q1 and Q2. The PC adjustment model has a type I error rate closer to 0.05 than the type I error rates for the SNP model and the population adjustment model. For Q2, the type I error rates are relatively close to the nominal value of 0.05 for each model.

Table 2. Type I error rates for Q1 and Q2 using all noncausal SNPs in noncausal genes

Tables 3 and 4 contain the results for Q1 and Q2 using all causal and noncausal SNPs in causal genes that determine that trait. For noncausal SNPs in causal genes for both Q1 and Q2, the PC adjustment model has permutation type I error rates that are closest to 0.05, although the type I error rates are slightly above the nominal value of 0.05. In Q1 the PC adjustment model has the lowest power for causal SNPs, possibly because of better control of the type I error rate. For Q2, where all null type I error rates are relatively close to the nominal rate of 0.05, the power for causal SNPs is roughly the same for the three models.

Table 3. Type I error rates and power for Q1 using all SNPs in causal genes

Table 4. Type I error rates and power for Q2 using all SNPs in causal genes


Because the disease status of interest is dichotomous in many studies, we study these dichotomized phenotypes. Chromosomes 21 and 22 have no causal SNPs for both Q1 and Q2. Therefore we define the SNPs on these two chromosomes as noncausal SNPs. Because other GAW17 participants have reported highly significant association between SNPs on chromosome 12 and Q1, we include the SNPs on chromosome 12 in our set of noncausal SNPs. For the genes reported here, the PC adjustment model has an empirical type I error rate that is apparently closer to the nominal level for SNPs in genes not causing the phenotype and for noncausal SNPs in causal genes, especially for genes determining Q1. The p-values for Q2 are much closer to the nominal level of 0.05 for each of the three models. This may be due to the way that the Q2 phenotype was generated. Although the PC adjustment model successfully controls the type I error rate, considering the actual population of origin does not. For noncausal SNPs in causal genes, the type I error rates in the PC adjustment model of the continuous Q1 measure are slightly higher than the nominal level in two of the four SNP strata. This may be due to the association between the noncausal SNPs and causal SNPs within the gene resulting from linkage disequilibrium. It may also result from multiple testing.

The power of the PC adjustment model is relatively strong and increases as the MAF increases, as expected. The power of regression modeling for the quantitative phenotype is greater than the power of logistic regression modeling of the dichotomized phenotype for both Q1 and Q2.

In this study, we compare the PC adjustment model with a model including population of origin as a factor. The PC adjustment model has both a type I error rate closer to the nominal level of 0.05 and high power. This is because, in general, PCs calculated using all SNPs contain more information about demographic history, natural selection, and random fluctuation in admixture than the population to which a participant is assigned. That is, participants’ genes may still hold genetic information that distinguishes them from the population from which they originated.

The data used here were simulated rather than real. We set our significance level to 0.05 because the number of replicates is 200. As a result, the expected number of null rejections is 10, which allows for meaningful statistical comparison. We also studied a nominal significance level of 0.01 (data not shown) and found similar control of the type I error rate except for SNPs with MAF < 0.005, where the type I error rate was 0.036, somewhat higher than expected. We could not study the type I error rate using typical genome-wide significance levels, such as 10−8.


The PC adjustment model with permutation p-value controls the type I error rate in the GAW17 Q1 and Q2 phenotypes. The power of the regression analysis of the quantitative phenotype is greater than the power of the analysis of the dichotomized phenotype. There is a slight decrease in power for the PC adjustment model even when MAF < 0.005.

Competing interests

The authors declare that there are no competing interests.

Authors’ contributions

JJ carried out all analyses of the dichotomized trait, participated in the analysis of the quantitative trait, and drafted the manuscript. JEC carried out the analyses of the quantitative trait. EJY and SY participated in generating and analyzing the dichotomized phenotype. SJK conceived of the study and participated in its design. NRM and SJF participated in the design, coordinated the study, and helped draft the manuscript. All authors read and approved the final manuscript.


NRM was supported by National Institutes of Health (NIH) grant MH071523. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.

This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at


  1. Zhu X, Tang H, Risch N: Admixture mapping and the role of population structure for localizing disease genes.

    Adv Genet 2008, 60:547-569. PubMed Abstract | Publisher Full Text OpenURL

  2. Campbell CD, Ogburn EL, Lunetta KL, Lyon HN, Freedman ML, Groop LC, Altshuler D, Ardlie KG, Hirschhorn JN: Demonstrating stratification in a European American population.

    Nat Genet 2005, 37:868-872. PubMed Abstract | Publisher Full Text OpenURL

  3. Price AL, Zaitlen NA, Reich D, Patterson N: New approaches to population stratification in genome-wide association studies.

    Nat Rev Genet 2010, 11:459-463. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  4. Kang SJ, Larkin EK, Song Y, Barnholtz-Sloan J, Baechle D, Feng T, Zhu X: Assessing the impact of global versus local ancestry in association studies.

    BMC Proc 2009, 3(suppl 7):S107. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  5. Dering C, Pugh E, Ziegler A: Statistical analysis of rare sequence variants: an overview of collapsing methods.

    Genet Epidemiol 2011, X(suppl X):X-X. OpenURL

  6. Dasgupta A, Sun YV, Konig IR, Bailey-Wilson JE, Malley JD: A brief review of regression-based and machine learning methods in genetic epidemiology: the GAW17 experience.

    Genet Epidemiol 2011, X(suppl X):X-X. OpenURL

  7. Zhu X, Li S, Cooper RS and Elston RC: A unified association analysis approach for family and unrelated samples correcting for stratification.

    Am J Hum Genet 2008, 82:352-365. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  8. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, et al.: PLINK: a tool set for whole-genome association and population-based linkage analysis.

    Am J Hum Genet 2007, 81:559-575. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL