Department of Biostatistics, Boston University School of Public Health, 801 Massachusetts Avenue, Boston MA, 02118, USA

Abstract

Background

There are many ways to perform adjustment for population structure. It remains unclear what the optimal approach is and whether the optimal approach varies by the type of samples and substructure present. The simplest and most straightforward approach is to adjust for the continuous principal components (PCs) that capture ancestry. Through simulation, we explored the issue of which ancestry informative PCs should be adjusted for in an association model to control for the confounding nature of population structure while maintaining maximum power. A thorough examination of selecting PCs for adjustment in a case-control study across the possible structure scenarios that could occur in a genome-wide association study has not been previously reported.

Results

We found that when the SNP and phenotype frequencies do not vary over the sub-populations, all methods of selection provided similar power and appropriate Type I error for association. When the SNP is not structured and the phenotype has large structure, then selection methods that do not select PCs for inclusion as covariates generally provide the most power. When there is a structured SNP and a non-structured phenotype, selection methods that include PCs in the model have greater power. When both the SNP and the phenotype are structured, all methods of selection have similar power.

Conclusions

Standard practice is to include a fixed number of PCs in genome-wide association studies. Based on our findings, we conclude that if power is not a concern, then selecting the same set of top PCs for adjustment for all SNPs in logistic regression is a strategy that achieves appropriate Type I error. However, standard practice is not optimal in all scenarios and to optimize power for structured SNPs in the presence of unstructured phenotypes, PCs that are associated with the tested SNP should be included in the logistic model.

Background

The principal components (PCs) of genome-wide genotype data can be used to detect and adjust for population structure in genetic association analyses

Numerous methods have been proposed to adjust for structure once PCs are computed (Table

Methods of ancestry informative PC selection.

| **Method** | **First Author** |
|---|---|
| Adjusting for a fixed number of PCs | Price |
| Tracy-Widom statistic | Patterson |
| Regression of outcome on PCs | Novembre |
| Reduction in inflation of genomic control lambda | Yu |
| PC-Finder | Li |
| 10% rule | Jewell |
| PCs + cluster | Li |

Optimal criteria for selecting PCs to include in the model are not known. Suppose the population from which we draw our sample consists of two sub-populations. We can expect one of four scenarios (tabulated below). A structured phenotype has unequal probability of being a case in the two sub-populations (K_{1} ≠ K_{2}), while a non-structured phenotype has equal probability of being a case in the two sub-populations (K_{1} = K_{2}). Likewise, a structured SNP (sSNP) has unequal risk allele frequency in the two sub-populations (p_{1} ≠ p_{2}), and a non-structured SNP (nsSNP) has equal risk allele frequency in the two sub-populations (p_{1} = p_{2}). By testing the phenotype for association with the PCs, we can determine if the phenotype is structured (scenarios A and C of Table

Scenarios of population structure that could occur across the genome.

| | **Structured Phenotype (K_{1} ≠ K_{2})** | **Non-Structured Phenotype (K_{1} = K_{2})** |
|---|---|---|
| **Structured Genotype (p_{1} ≠ p_{2})** | A | B |
| **Non-Structured Genotype (p_{1} = p_{2})** | C | D |

p_{1 }and p_{2 }are the allele frequencies in population 1 and 2, respectively. K_{1 }and K_{2 }are the frequency of disease in population 1 and 2, respectively.

In a genetic association study, our primary interest is not in finding and describing the genetic structure in the sample, but in determining if the population structure in the sample has a confounding effect on the SNP association analyses and if adjustment for this confounding is necessary. We performed simulation studies to investigate the Type I error and power of associations between case/control status and a SNP when adjusting for PCs selected using samples of independent individuals. We compared the following methods of selecting PCs (label):

(1) No PC adjustment (None)

(2) 10% Rule (10% Rule)

(3) PCs significantly related to the outcome at significance level α = 0.001, 0.01, or 0.05 (Sig001, Sig01, Sig05, respectively)

(4) PCs significantly related to the SNP at α = 0.01 or 0.05 (SNP01, SNP05, respectively)

(5) PCs significant according to the Tracy-Widom statistic at α = 0.05 (TW)

(6) Top PCs (2 or 10) determined according to eigenvalue (Top2, Top10, respectively)

(7) Simulated true population, i.e., the gold standard (Pop)
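As an illustration of method (4), the following is a minimal sketch of selecting the PCs that are associated with a tested SNP. It is not the procedure used in this study: it assumes the association is tested through the Pearson correlation between the genotype (coded 0/1/2) and each PC, with a normal approximation to the t statistic, which is reasonable for samples of about 1,000 individuals. The function names and the toy data are hypothetical.

```python
import math
from statistics import NormalDist, fmean

def pearson_r(x, y):
    """Pearson correlation between two equal-length, non-constant sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pcs_associated_with_snp(genotypes, pcs, alpha):
    """Return the indices of PCs whose correlation with the SNP has p < alpha.

    Assumption: a two-sided test of zero correlation, using a normal
    approximation to the t statistic (adequate for large samples only).
    """
    n = len(genotypes)
    selected = []
    for index, pc in enumerate(pcs):
        r = pearson_r(genotypes, pc)
        if abs(r) >= 1.0:
            p_value = 0.0  # perfect correlation
        else:
            t = r * math.sqrt((n - 2) / (1.0 - r * r))
            p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
        if p_value < alpha:
            selected.append(index)
    return selected

# Toy data: the first PC separates two sub-populations of 100 individuals,
# the second PC is unrelated to ancestry and to the SNP.
pc1 = [-1.0] * 100 + [1.0] * 100
pc2 = [1.0, 1.0, -1.0, -1.0] * 50
genotypes = [0, 1, 1, 0] * 25 + [1, 2, 2, 1] * 25  # structured SNP

print(pcs_associated_with_snp(genotypes, [pc1, pc2], alpha=0.01))
```

In this toy example only the ancestry-separating PC (index 0) is selected. Selecting PCs related to the outcome (method 3) has the same shape, with the case/control indicator in place of the genotype and a test appropriate for a binary outcome.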

We tested for association with the simulated case/control outcome using logistic regression and compared Type I error and power of associations between the outcome and SNP when adjusting for selected principal components of ancestry. Finally, to provide a practical example of the methods of PC selection, we performed all methods of PC selection using dichotomized height data from the Framingham Heart Study.

Methods

We simulated independent genome-wide SNPs by generating ancestral population allele frequencies for 10,000 SNPs (p_{j}, j = 1,...,10000) from a uniform (0.05, 0.50)-distribution. We then created two sub-populations (i = 1,2) of 500 individuals, each descending from the ancestral population according to F_{st}. We simulated the allele frequencies p_{ij }(i = 1,2; j = 1,..., 10000) in the two sub-populations according to a beta distribution
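The allele frequency step can be sketched as follows. The parameterization of the beta distribution is an assumption here: the sketch uses the Balding-Nichols form, Beta(p(1 - F_{st})/F_{st}, (1 - p)(1 - F_{st})/F_{st}), which has mean p and variance F_{st} p(1 - p). Genotypes are drawn as two independent alleles per SNP (Hardy-Weinberg proportions), which is also an assumption. Function names and the seed are hypothetical.

```python
import random

def simulate_subpop_freqs(n_snps, n_subpops, fst, rng):
    """Draw ancestral and sub-population allele frequencies.

    Assumption: Balding-Nichols beta parameters, so each sub-population
    frequency has mean p and variance fst * p * (1 - p).
    """
    scale = (1.0 - fst) / fst
    ancestral = [rng.uniform(0.05, 0.50) for _ in range(n_snps)]
    subpops = [
        [rng.betavariate(p * scale, (1.0 - p) * scale) for p in ancestral]
        for _ in range(n_subpops)
    ]
    return ancestral, subpops

def simulate_genotypes(freqs, n_individuals, rng):
    """Sample 0/1/2 genotypes as the sum of two independent allele draws."""
    return [
        [(rng.random() < p) + (rng.random() < p) for p in freqs]
        for _ in range(n_individuals)
    ]

rng = random.Random(2011)  # arbitrary seed, for reproducibility only
ancestral, subpops = simulate_subpop_freqs(
    n_snps=10_000, n_subpops=2, fst=0.01, rng=rng
)
# Five individuals instead of 500, only to keep the demonstration fast.
genotypes_pop1 = simulate_genotypes(subpops[0], n_individuals=5, rng=rng)
```

With F_{st} = 0.01 the sub-population frequencies stay close to the ancestral ones; raising `fst` to 0.1 spreads them roughly ten times as much in variance.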

F_{st} is a measure of population differentiation. F_{st} = 0.01 is representative of human population structure seen within continents, while F_{st} = 0.1 is representative of structure seen between continents. We simulated data with F_{st} values of 0.01 and 0.1. 1,000 replicates of independent genome-wide SNP data were generated for 2 populations of 500 individuals.

We combined the two generated sub-populations and performed PCA using the smartpca program in EIGENSOFT

Simulation parameters.

| **Description** | **Possible Values** |
|---|---|
| Population differentiation (F_{st}) | 0.01, 0.1 |
| Population prevalence of disease (K) | 0.10 |
| Frequency of disease in sub-population 1 (K_{1}) | 0.10 - 0.19 |
| Number of cases in sub-population 1 | 250 - 475 |
| Overall risk allele frequency (p) | 0.20 |
| Risk allele frequency in sub-population 1 (p_{1}) | 0.10 - 0.30 |
| Odds Ratio (log additive model) | 1.0, 1.2, 1.5 |

Relationship between number of cases and population prevalence of disease in each sub-population.

| **# of Cases in Population 1** | **# of Cases in Population 2** | **Fold increase in # of cases** | **K_{1}** | **K_{2}** |
|---|---|---|---|---|
| 250 | 250 | 1 | 0.10 | 0.10 |
| 300 | 200 | 1.5 | 0.12 | 0.08 |
| 375 | 125 | 3 | 0.15 | 0.05 |
| 400 | 100 | 4 | 0.16 | 0.04 |
| 475 | 25 | 19 | 0.19 | 0.01 |
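The rows of the table above can be reproduced with a short calculation. This is one reading that matches the tabulated values, not a formula stated in the text: it assumes 500 cases in total, and that an even split of 250 cases per sub-population corresponds to the overall prevalence K = 0.10, so that K_{i} = K × (cases in sub-population i) / 250. The function name is hypothetical.

```python
K = 0.10           # overall population prevalence of disease
TOTAL_CASES = 500  # cases sampled across both sub-populations

def subpop_prevalences(cases_pop1):
    """Return (cases in pop 2, fold increase, K1, K2) for a given case split.

    Assumption: prevalence scales linearly with the share of cases,
    with an even split corresponding to K in both sub-populations.
    """
    cases_pop2 = TOTAL_CASES - cases_pop1
    balanced = TOTAL_CASES / 2
    k1 = K * cases_pop1 / balanced
    k2 = K * cases_pop2 / balanced
    fold = cases_pop1 / cases_pop2
    return cases_pop2, fold, round(k1, 2), round(k2, 2)

for cases_pop1 in (250, 300, 375, 400, 475):
    print(cases_pop1, *subpop_prevalences(cases_pop1))
```

Under this reading the most extreme scenario, 475 versus 25 cases, gives a 19-fold difference in cases and prevalences of 0.19 and 0.01.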

We compared the methods for selecting PCs for adjustment described above. We used logistic regression to test the association between the test SNP and case status, adjusting for latent ancestry defined by the PCs. We compared the proportion of replicates significant at α = 0.05, using 1,000 replicates for each set of parameters to investigate Type I error and power. With 1,000 replicates, the 95% confidence interval around a nominal Type I error of 0.05 is 0.036 to 0.064.
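The quoted interval follows from the normal approximation to the binomial proportion. The sketch below assumes that approximation (the text does not state how the interval was computed); the function name is hypothetical.

```python
import math

def type1_error_interval(alpha=0.05, n_replicates=1000, z=1.959964):
    """95% interval for an empirical rejection rate under the null.

    Assumption: normal approximation to the binomial, with standard
    error sqrt(alpha * (1 - alpha) / n_replicates).
    """
    half_width = z * math.sqrt(alpha * (1.0 - alpha) / n_replicates)
    return alpha - half_width, alpha + half_width

low, high = type1_error_interval()
print(f"{low:.3f} to {high:.3f}")
```

Empirical Type I error rates that fall outside this interval are therefore unlikely to be explained by simulation noise alone.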

To determine the effects of the PC selection methods when a sample has a more complex structure, we simulated two populations that each diverged with an F_{st} of 0.01 from an ancestral population, as before. We treated these two sub-populations as ancestral populations and then simulated two sub-populations diverging from each of them, again with an F_{st} of 0.01. The resulting sample had four sub-populations. Due to computational limitations, a single replicate of independent genome-wide SNP data was generated for this scenario for PCA. As before, 1,000 replicates were used to evaluate Type I error and power, simulating the genotype and phenotype (conditional on genotype for power) for each replicate. We varied the phenotypic and genotypic structure of the sub-populations, from having no structure to more extreme structure.

Results and Discussion

Simulation Study

Type I error for the methods of selection with two subpopulations with F_{st }of 0.01 is provided in Figure

**Empirical Type I error results**. Two sub-populations of 500 individuals each, F_{st }= 0.01. K_{1 }and p_{1 }are the population prevalence of disease and risk allele frequency, respectively, in sub-population 1. K_{2 }and p_{2 }are the population prevalence of disease and risk allele frequency, respectively, in sub-population 2. The x-axis is the various methods of selecting PCs for inclusion in the model of association and the symbols in the plot represent the phenotypic structure. The y-axis is the proportion of logistic regression models adjusting for the selected PCs for which the SNP p-values are significant at a significance level of 0.05.

Figure _{st }between the sub-populations was 0.1.

**Empirical power results**. Two sub-populations of 500 individuals each, F_{st }= 0.01, simulated log additive odds ratio of 1.2. K_{1 }and p_{1 }are the population prevalence of disease and risk allele frequency, respectively, in sub-population 1. K_{2 }and p_{2 }are the population prevalence of disease and risk allele frequency, respectively, in sub-population 2. The x-axis is the various methods of selecting PCs for inclusion in the model of association and the symbols in the plot represent the phenotypic structure. The y-axis is the proportion of logistic regression models adjusting for the selected PCs for which the SNP p-values are significant at a significance level of 0.05.

We next expanded this simulation to larger differences in allele frequencies between the two sub-populations. With large F_{st} (0.10), we expect many SNPs with greater than a 0.2 difference in allele frequency between sub-populations, and thus we believe it may be more common to observe large allele frequency differences between populations than large phenotypic differences between populations. We found that as the difference in the risk allele frequency between the two sub-populations increases, the difference in power between adjusting and not adjusting for PCs becomes greater (see Supplemental Figure 1). When the SNP is not structured (p_{1} = p_{2} = 0.5) and the phenotype has large structure (phenotypic ratio = 4), we found slightly higher power when selecting PCs by the 10% rule compared to not selecting any PCs, contrary to the findings presented in Figure

**Supplemental Figure 1**. Empirical Type I error and power for increasing risk allele frequency differences.

Finally, we increased the sample size for the genome-wide SNPs simulation to 5,000 individuals from each sub-population. Increasing the sample size allowed us to determine if our observed results were affected by the simulated sample size of 500 cases and 500 controls. We found the same patterns with the larger sample size as we did when we used the 500 individuals from each sub-population (results not shown).

In general, we observed similar patterns when the data consisted of four sub-populations as with the two sub-populations scenario already presented (see additional file

**Supplemental Figure 2**. Empirical Type I error and power with 4 sub-populations.

**Supplemental Figure 3**. Power for 2 sub-populations and positive confounding.

Overall, we find that for some scenarios, the optimal choice of PCs to adjust for in a genome-wide association study using logistic regression is SNP-dependent (see the table of optimal selection methods below). When the SNP is not structured and the phenotype has large structure (e.g., K_{1} = 0.19, K_{2} = 0.01), selection methods that result in no PC adjustment (SNP01, 10% Rule, None) have optimal power. Conversely, when the phenotype is not structured and the SNP is structured, we achieve optimal power when PCs are included in the model, e.g., for the selection methods that include a fixed number of PCs or PCs associated with the SNP in the model (TW, Top2, Top10, SNP01, SNP05). With logistic regression, when adjusting for non-confounding covariates (covariates only associated with the outcome) there is a loss in precision

Scenarios that could occur across the genome with the optimal method of selection.

| | **Structured Phenotype (K_{1} ≠ K_{2})** | **Non-Structured Phenotype (K_{1} = K_{2})** |
|---|---|---|
| **Structured Genotype (p_{1} ≠ p_{2})** | Any method of selection except no PC adjustment | Selecting a fixed number of PCs or PCs associated with the SNP |
| **Non-Structured Genotype (p_{1} = p_{2})** | Selecting PCs associated with the SNP (α = 0.01), 10% Rule, no PC adjustment | Any method of selection |

p_{1 }and p_{2 }are the allele frequencies in population 1 and 2, respectively. K_{1 }and K_{2 }are the frequency of disease in population 1 and 2, respectively.

We explored the bias and standard error of our models to better understand the dependency of power on the SNP structure (Figure

**Bias and standard error (SE)**. Two sub-populations of 500 individuals each, F_{st }= 0.01, simulated log additive odds ratio of 1.2. Bias was computed as the estimated effect minus the true simulated beta. For the non-structured phenotype, each sub-population had a population prevalence of disease of 0.1. For the structured phenotype the population prevalence of disease was 0.16 in sub-population 1 and 0.04 in sub-population 2. The non-structured SNP had a frequency of 0.2 in each population, and the structured SNP had a frequency of 0.1 in sub-population 1 and 0.3 in sub-population 2. None indicates no PCs were adjusted for in the model and Pop indicates that the known population was adjusted for in the model.

Example of Principal Component Selection Criteria with Height

Average adult height is taller in northern Europe than in southern Europe. By our definition, height is a structured phenotype, i.e., it varies by ancestry. Lactose intolerance also varies across Europe from North to South. The genetic polymorphism in the LCT (Lactase) gene that causes lactose intolerance, and the SNPs in LD with this polymorphism, appear to be associated with height in non-homogeneous samples of individuals of European descent

• rs1042725 and rs6060369: positive control SNPs.

• rs2322659: a structured SNP in the lactase gene.

• rs2290305: a non-structured SNP, not associated with PC1 (p-value = 0.425).

Table

Height association results with methods for selecting PCs.

Entries are beta (p-value) of SNP.

| **Method for selecting PCs** | **Selected PCs** | **rs2322659** (sSNP*) | **rs2290305** (nSNP**) | **rs1042725** (positive control) | **rs6060369** (positive control) |
|---|---|---|---|---|---|
| No PC Adjustment | NA | -0.406 (< 0.001) | -0.086 (0.338) | -0.251 (0.002) | 0.221 (0.007) |
| Top 2 PCs | PC1, PC2 | -0.043 (0.646) | -0.064 (0.491) | -0.149 (0.073) | 0.336 (< 0.001) |
| Top 10 PCs | PC1 - PC10 | -0.057 (0.563) | -0.054 (0.571) | -0.132 (0.121) | 0.322 (< 0.001) |
| Tracy-Widom statistic | PC1 - PC81 | -0.062 (0.566) | -0.057 (0.581) | -0.174 (0.061) | 0.318 (0.001) |
| Associated with the outcome at α = 0.05 | PC1, PC2, PC4, PC8, PC21, PC25, PC28, PC39, PC47, PC49, PC56, PC64, PC77 | -0.03 (0.762) | -0.049 (0.617) | -0.164 (0.059) | 0.313 (0.001) |
| Associated with the outcome at α = 0.01 | PC1, PC4, PC21, PC25, PC28, PC49, PC77 | -0.015 (0.875) | -0.053 (0.578) | -0.162 (0.059) | 0.322 (< 0.001) |
| Associated with the outcome at α = 0.001 | PC1, PC4, PC28 | 0.014 (0.884) | -0.047 (0.619) | -0.163 (0.055) | 0.322 (< 0.001) |
| Associated with the SNP at α = 0.05 | varied by SNP | -0.041 (0.677) | -0.092 (0.312) | -0.148 (0.075) | 0.333 (< 0.001) |
| Associated with the SNP at α = 0.01 | varied by SNP | -0.038 (0.703) | -0.086 (0.338) | -0.164 (0.170) | 0.313 (< 0.001) |
| 10% Rule | PC1# | -0.037 (0.692) | -0.086 (0.338) | -0.148 (0.075) | 0.333 (< 0.001) |
| PC-Finder | NA | -0.406 (< 0.001) | -0.086 (0.338) | -0.251 (0.002) | 0.221 (0.007) |

Regression estimate (p-value) of the association between outcome and SNP adjusting for the selected PCs are displayed in the table. *sSNP is a structured SNP. This SNP is in the lactase gene, and varies in frequency across Europe; ** nSNP is a non-structured SNP. This SNP does not vary among subgroups in the FHS sample; # PC1 was selected except for the nSNP (rs2290305), in which case, no PCs were selected by the 10% rule.
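The 10% rule can be illustrated with the estimates reported for rs2322659. The sketch below assumes the usual change-in-estimate form of the rule: a covariate is retained if adjusting for it changes the SNP effect estimate by more than 10% relative to the unadjusted estimate. Whether the rule was applied to the regression coefficient or to the odds ratio is not stated here, so the use of the coefficient is an assumption, and the function name is hypothetical.

```python
def passes_ten_percent_rule(beta_unadjusted, beta_adjusted, threshold=0.10):
    """Return True if adjustment changes the SNP estimate by more than `threshold`.

    Assumption: relative change is measured on the regression
    coefficient, against the unadjusted estimate.
    """
    if beta_unadjusted == 0:
        raise ValueError("relative change is undefined for a zero unadjusted estimate")
    relative_change = abs(beta_adjusted - beta_unadjusted) / abs(beta_unadjusted)
    return relative_change > threshold

# rs2322659 (structured SNP): unadjusted estimate and PC1-adjusted
# estimate, taken from the "No PC Adjustment" and "10% Rule" rows.
print(passes_ten_percent_rule(-0.406, -0.037))
```

For this SNP the estimate shrinks by roughly 90% once PC1 is included, so PC1 is retained, in line with the table.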

Conclusions

We performed a simulation study in which we generated multiple sets of genome-wide SNPs. The goal was to investigate Type I error and power of associations between case-control status and a SNP when adjusting for ancestry informative PCs selected by a variety of rules. A second aim of this study was to examine more critically the effects of the amount of phenotypic structure and genotypic structure on the association analysis, as well as investigate the bias and precision of the associations.

We did not specifically address the issue of which SNPs to include in the PCA. Using all available SNPs in a PCA provides the maximal information about ancestry, but highly correlated SNPs or unusual chromosomal phenomena such as known inversion polymorphisms or genomic regions known to play a role in susceptibility to a disease can affect the results from a PCA

All simulations were performed using distinct sub-populations. Admixed individuals are commonly used in GWAS. While we did not explicitly simulate admixed individuals, we know based on previous work

We focused our exploration on linear PC adjustment models. We did not investigate adjusting for clusters identified in the individual genotype data because previous work has suggested that linear adjustments are adequate for the population structure typical of European populations

Our findings suggest that to optimize power under certain scenarios, the choice of covariate PCs in a genome-wide association study using logistic regression with a dichotomous outcome should be SNP-dependent. Our findings only apply to case-control or dichotomous outcome analyses using logistic regression. These results may appear to conflict with Xing and Xing

For linear regression using continuous phenotypes, one can check phenotypes for association with the PCs. If a top PC is significantly associated with the phenotype of interest then the trait-genotype association model should include PCs as covariates to adjust for population structure. Unlike logistic regression, adjusting for covariates associated with the trait in linear regression always improves the precision of the effect estimate by reducing the residual variance

Standard practice is to include a fixed number of PCs in association models for GWAS. Here, we conclude that if power is not a concern, then selecting the same set of PCs for adjustment for all SNPs in logistic regression is a strategy that achieves appropriate Type I error. However, standard practice is not optimal in all scenarios, and to optimize power for structured SNPs in the presence of unstructured phenotypes, PCs that are associated with the tested SNP should be included in the logistic model. The gain in power we observed in our simulations was approximately 5 percentage points for adjusting only when the SNP is structured over always adjusting for the ancestry informative PCs. We note that some of the differences in power may disappear if we correct for Type I error, but this is not done in practice. It may be easier and more intuitive to adjust for the same set of PCs across all SNP associations.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

GMP and KLL conceived and designed the study, carried out statistical analyses, and interpreted the data. GMP drafted the manuscript. KLL critically revised the manuscript. All authors read and approved the final manuscript.

Acknowledgements

A portion of this research was conducted using the Linux Clusters for Genetic Analysis (LinGA) computing resource funded by the Robert Dawson Evans Endowment of the Department of Medicine at Boston University School of Medicine and Boston Medical Center and contributions from individual investigators.