Department of Pharmacogenomics, Johnson & Johnson Pharmaceutical Research and Development, LLC, Raritan, New Jersey 08869 USA

Department of Epidemiology, Johnson & Johnson Pharmaceutical Research and Development, LLC, Titusville, New Jersey 08560 USA

Abstract

Population stratification (PS) represents a major challenge in genome-wide association studies. Using the Genetic Analysis Workshop 16 Problem 1 data, which include samples of rheumatoid arthritis patients and healthy controls, we compared two methods that can be used to evaluate population structure and correct PS in genome-wide association studies: the principal-component analysis method and the multidimensional-scaling method. While both methods identified similar population structures in this dataset, principal-component analysis performed slightly better than the multidimensional-scaling method in correcting for PS in genome-wide association analysis of this dataset.

Background

In the past few years, the genome-wide association (GWA) approach has become a widely used tool for identifying genetic loci related to disease risk. Population stratification (PS) is a major challenge in GWA studies (GWAS) because it can generate false positives that reflect ancestry-related genetic differences rather than genes associated with a disease. Among the methods developed for correcting PS in GWAS, the principal-component analysis (PCA) method

The objectives of this study were: 1) to compare the population structures identified by PCA and multidimensional scaling (MDS) in the rheumatoid arthritis (RA) dataset of Genetic Analysis Workshop 16 (GAW16); and 2) to evaluate the performance of these two approaches for correcting PS in GWA analyses.

Methods

GAW16 Problem 1 data

GAW16 Problem 1 data, provided by the North American Rheumatoid Arthritis Consortium (NARAC), contained genome-wide data on 868 RA cases and 1,194 controls. Genotype data on 545,080 single-nucleotide polymorphisms (SNPs) were available for analysis.

Genotype data quality control

Quality control of genotype data was conducted at both the individual level and the SNP level. At the individual level, a call rate of at least 0.95 was required. Sex discrepancies were examined using the heterozygosity rate of the X chromosome. At the SNP level, a call rate of at least 0.90, a minor allele frequency of at least 0.01, and a

Principal-component analysis

PCA was performed using the computer program EIGENSOFT 2.0 ^{2 }< 0.2; iii) each SNP was regressed on the previous two SNPs, and the residual entered into the PCA. SNP loadings on all components deemed significant by the Tracy-Widom statistic

Multidimensional scaling

MDS analysis was performed using PLINK 1.03 ^{2 }< 0.2. Pairwise IBS distance was calculated using all autosomal SNPs that remained after pruning. The five nearest neighbors of each individual were identified based on the pairwise IBS distance. IBS distance to each of the five nearest neighbors was then transformed into a
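The IBS-based steps described in this section can be illustrated with a minimal sketch. It assumes genotypes coded as 0, 1, or 2 copies of one allele (with `None` for a missing call) and defines the IBS distance as one minus the proportion of alleles shared identical-by-state; the function names and the toy data are hypothetical and are not taken from PLINK.

```python
def ibs_distance(g1, g2):
    """Return 1 - (proportion of alleles shared IBS) for two individuals.

    Assumption: genotypes are coded 0/1/2; None marks a missing call and
    the SNP is skipped when either individual is missing.
    """
    shared = 0
    compared = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        shared += 2 - abs(a - b)  # 2, 1, or 0 alleles shared at this SNP
        compared += 2
    if compared == 0:
        raise ValueError("no SNPs genotyped in both individuals")
    return 1.0 - shared / compared


def nearest_neighbours(genotypes, k):
    """For each individual, return the indices of its k nearest neighbours."""
    result = []
    for i, gi in enumerate(genotypes):
        dists = [(ibs_distance(gi, gj), j)
                 for j, gj in enumerate(genotypes) if j != i]
        dists.sort()
        result.append([j for _, j in dists[:k]])
    return result


# Hypothetical toy data: 4 individuals x 4 SNPs.
toy_genotypes = [
    [0, 1, 2, 2],
    [0, 1, 2, 1],
    [2, 1, 0, 0],
    [2, None, 0, 1],
]
toy_neighbours = nearest_neighbours(toy_genotypes, k=1)
```

In the study itself, five neighbours per individual were used (`k=5`); the toy example uses `k=1` only because it contains four individuals.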

Genome-wide association analyses

Three GWA analyses were performed using PLINK 1.03

Results and discussion

Genotype data quality control

All individuals had call rates >0.95 at the individual level. An examination of sex led to the exclusion of seven individuals due to incorrect or ambiguous sex information when compared with phenotype data. At the SNP level, 5,449 SNPs with call rates <0.90, 23,205 SNPs with minor allele frequencies <0.01, 1,389 SNPs with

PCA, MDS, and population structure

In the first round of PCA, 59 components met the criteria for statistical significance using the Tracy-Widom statistic. Nearly half of the SNPs that deviated from their expected normal quantiles by a distance of at least 1 (4,413 of 9,980) were in the HLA region (chr6: 25,000,000-33,500,000 bp), a region that has been reported to show higher genetic heterogeneity across different populations. After removing these deviating SNPs and pruning the remaining SNPs based on LD information, 81,636 autosomal SNPs were included in the second round of PCA. This analysis resulted in eight significant components using the Tracy-Widom statistic. A small number of individuals (

In the MDS analysis, 81,652 autosomal SNPs were used to calculate the pairwise IBS distance after SNP pruning. A small number of individuals (

The Pearson correlation coefficients between each of the eight significant principal components and each of the eight leading MDS dimensions are summarized in Table

Correlation between first eight principal components and first eight MDS dimensions

| **Top eight principal components** (rows) / **Top eight MDS dimensions** (columns) | **dim1** | **dim2** | **dim3** | **dim4** | **dim5** | **dim6** | **dim7** | **dim8** |
|---|---|---|---|---|---|---|---|---|
| evec1 | **0.998 ^{a}** | 0.01 | -0.02 | 0.01 | 0.005 | 0.01 | -0.002 | -0.0004 |
| evec2 | 0.00 | **0.98** | 0.17 | 0.03 | -0.01 | -0.01 | 0.01 | 0.002 |
| evec3 | -0.04 | 0.17 | **-0.96** | -0.11 | -0.01 | 0.01 | 0.01 | -0.01 |
| evec4 | -0.02 | 0.00 | -0.12 | **0.89** | 0.06 | 0.17 | -0.07 | -0.03 |
| evec5 | 0.02 | -0.01 | -0.05 | 0.38 | -0.20 | -0.43 | 0.20 | 0.03 |
| evec6 | 0.01 | 0.01 | 0.02 | 0.02 | 0.13 | 0.00 | 0.08 | -0.17 |
| evec7 | -0.02 | -0.001 | -0.03 | -0.01 | 0.17 | -0.07 | -0.17 | 0.31 |
| evec8 | 0.02 | -0.01 | 0.02 | 0.02 | -0.26 | -0.18 | -0.32 | 0.06 |

^{a}Bold font indicates the absolute value of correlation coefficient > 0.8.
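A correlation table of this kind can be computed with a few lines of code. The sketch below is illustrative only: it assumes each principal component and each MDS dimension is available as a list of per-individual scores in the same individual order, and the function names and toy values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def cross_correlation(pcs, dims):
    """Rows index principal components; columns index MDS dimensions."""
    return [[pearson(pc, dim) for dim in dims] for pc in pcs]


def strong_pairs(table, threshold=0.8):
    """Return (row, column) indices where |r| exceeds the threshold."""
    return [(i, j)
            for i, row in enumerate(table)
            for j, r in enumerate(row)
            if abs(r) > threshold]


# Hypothetical toy scores for 5 individuals: dimension 1 is a rescaled
# copy of component 1 and dimension 2 is component 2 with its sign flipped.
toy_pcs = [[0.1, -0.2, 0.3, -0.4, 0.2],
           [0.5, 0.1, -0.3, -0.2, -0.1]]
toy_dims = [[0.2, -0.4, 0.6, -0.8, 0.4],
            [-0.5, -0.1, 0.3, 0.2, 0.1]]
toy_table = cross_correlation(toy_pcs, toy_dims)
```

Because the sign of a component or dimension is arbitrary, the absolute value of the coefficient is what matters, which is why the footnote above uses an absolute-value threshold.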

**Population structures identified by PCA and MDS**. A, The first six principal components are plotted against one another with RA status distinguished by shading; B, the first six MDS dimensions are plotted against one another with RA status distinguished by shading.

GWAS results

The quantile-quantile (Q-Q) plots of the

**Q-Q plots of p-values from three GWA analyses**. SNPs in HLA region are excluded to enhance readability.
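For readers who want to reproduce this kind of plot, the following sketch computes the coordinates of a Q-Q plot on the -log10 scale after excluding SNPs in the HLA region (chr6: 25,000,000-33,500,000 bp, as defined above). The plotting positions (i - 0.5)/n are one common convention and are an assumption here, as are the function names and the toy results.

```python
import math

# HLA region boundaries as stated in the text.
HLA_CHROM = "6"
HLA_START = 25_000_000
HLA_END = 33_500_000


def outside_hla(chrom, pos):
    """True if a SNP lies outside the HLA region."""
    return not (chrom == HLA_CHROM and HLA_START <= pos <= HLA_END)


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, smallest p first.

    Assumption: expected quantiles under the null use the plotting
    positions (i - 0.5) / n for the i-th smallest of n p-values.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("no p-values supplied")
    if any(not (0 < p <= 1) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    points = []
    for i, p in enumerate(sorted(p_values), start=1):
        expected = (i - 0.5) / n
        points.append((-math.log10(expected), -math.log10(p)))
    return points


# Hypothetical toy results: (chromosome, position, p-value).
toy_results = [
    ("1", 1_000_000, 0.5),
    ("6", 30_000_000, 1e-20),  # inside the HLA region, so excluded
    ("6", 40_000_000, 0.01),
    ("2", 5_000_000, 0.9),
    ("3", 7_000_000, 0.2),
]
toy_p = [p for chrom, pos, p in toy_results if outside_hla(chrom, pos)]
toy_points = qq_points(toy_p)
```

Points that lie well above the diagonal across most of the range suggest inflation of the test statistics, which is the pattern expected when PS is left uncorrected.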

Results from the three GWA analyses were also plotted against their chromosome locations (Figure

| **Analysis method** | **p-value** | **Rank in non-HLA SNPs** |
|---|---|---|
| Trend test without adjustment | 5.42 × 10^{-12} | 3 |
| Logistic regression adjusted for principal components | 0.000018 | 18 |
| Logistic regression adjusted for MDS dimensions | 0.000022 | 59 |
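As an illustration of the unadjusted analysis, the sketch below computes a Cochran-Armitage trend test for a single SNP from genotype counts. It assumes that the "trend test" above refers to this test with additive weights 0, 1, 2 and a 1-degree-of-freedom chi-square reference distribution; the function name and the counts are hypothetical and do not come from the GAW16 data.

```python
import math


def trend_test(case_counts, control_counts, weights=(0, 1, 2)):
    """Cochran-Armitage trend test on a 2 x 3 genotype table.

    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    Assumption: additive weights 0, 1, 2 for the three genotypes.
    """
    R = sum(case_counts)      # total cases
    S = sum(control_counts)   # total controls
    N = R + S
    col_totals = [r + s for r, s in zip(case_counts, control_counts)]
    # Trend statistic: sum of w_i * (r_i * S - s_i * R).
    T = sum(w * (r * S - s * R)
            for w, r, s in zip(weights, case_counts, control_counts))
    sum_wn = sum(w * n for w, n in zip(weights, col_totals))
    sum_w2n = sum(w * w * n for w, n in zip(weights, col_totals))
    var_T = (R * S / N) * (N * sum_w2n - sum_wn ** 2)
    if var_T == 0:
        raise ValueError("variance is zero; the SNP is not informative")
    chi2 = T * T / var_T
    # Upper tail of a chi-square with 1 df: P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical genotype counts (AA, Aa, aa) for one SNP.
toy_chi2, toy_p_value = trend_test((10, 20, 30), (30, 20, 10))
null_chi2, null_p_value = trend_test((10, 20, 30), (10, 20, 30))
```

This test has no covariates, so it cannot absorb ancestry differences; the two adjusted analyses instead fit a logistic regression that includes the principal components or the MDS dimensions as covariates.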

**Results of GWA analyses**. The y axis is in square root scale to enhance readability.

Although PCA performed slightly better than MDS in correcting PS in the GWA analysis of this dataset, it would be inappropriate to conclude that PCA is the preferred approach in all GWAS. In general, MDS is a more flexible method than PCA. First, PCA requires that the underlying data follow a multivariate normal distribution, while MDS imposes no such restriction. Second, PCA requires computation of a covariance matrix first, while MDS can be applied to any kind of distance or similarity. Pairwise IBS distance is only one example of the many distance measures to which MDS can be applied. As a special case, MDS can also be applied to the covariance matrix used in PCA. In this case, the performance of MDS in correcting PS will be equivalent to that of PCA.
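The equivalence in this special case can be checked numerically. The sketch below assumes classical (Torgerson) MDS applied to squared Euclidean distances between individuals, which is a different input from the IBS distances used in this study. It shows that double-centering the squared distance matrix reproduces the individual-by-individual cross-product matrix of the column-centered genotypes, so an eigendecomposition of either matrix yields the same coordinates. All names and the toy genotype matrix are hypothetical.

```python
def center_columns(X):
    """Subtract each column (SNP) mean from the data matrix."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def gram(Xc):
    """Individual-by-individual cross-product matrix Xc * Xc^T."""
    return [[sum(a * b for a, b in zip(r, s)) for s in Xc] for r in Xc]


def squared_distances(X):
    """Pairwise squared Euclidean distances between individuals."""
    return [[sum((a - b) ** 2 for a, b in zip(r, s)) for s in X] for r in X]


def double_center(D2):
    """Classical MDS step: B = -1/2 * J * D2 * J with J = I - (1/n) 11^T."""
    n = len(D2)
    row_mean = [sum(row) / n for row in D2]
    grand_mean = sum(row_mean) / n
    # D2 is symmetric, so column means equal row means.
    return [[-0.5 * (D2[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)]
            for i in range(n)]


# Hypothetical toy genotypes: 4 individuals x 3 SNPs coded 0/1/2.
X = [[0, 1, 2],
     [1, 1, 0],
     [2, 0, 1],
     [0, 2, 2]]
B_mds = double_center(squared_distances(X))
B_pca = gram(center_columns(X))
max_diff = max(abs(a - b)
               for row_a, row_b in zip(B_mds, B_pca)
               for a, b in zip(row_a, row_b))
```

With any other distance measure, such as IBS distance, the double-centered matrix generally differs from this cross-product matrix, which is one reason the two methods can give slightly different corrections.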

Conclusion

In this paper, we compared the performance of PCA and MDS in identifying population structure and correcting for PS in GWAS using data provided to GAW16 participants by the NARAC. While the two methods identified similar population structures in this dataset, PCA performed slightly better than MDS in correcting for PS in the GWA analyses of this dataset.

List of abbreviations used

GAW16: Genetic Analysis Workshop 16; GWA: Genome-wide association; GWAS: GWA studies; IBS: Identity-by-state; LD: Linkage disequilibrium; MDS: Multidimensional scaling; NARAC: North American Rheumatoid Arthritis Consortium; PCA: Principal-component analysis; PS: Population stratification; Q-Q: Quantile-quantile; RA: Rheumatoid arthritis; SNP: Single-nucleotide polymorphism.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

DW participated in discussions of the analytical approach, carried out the majority of the data analyses, and drafted the manuscript. YS participated in discussions of the analytical approach and carried out some of the analyses related to multi-dimensional scaling. PS, JAB, and MAW contributed to acquisition of the data and editorial revision of the manuscript. QL participated in discussions of the analytical approach.

Acknowledgements

The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences.

This article has been published as part of