Department of Biostatistics and Computational Biology, State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, 200433, Shanghai, China
School of Biosciences, The University of Birmingham, Edgbaston, B15 2TT, Birmingham, UK
BioSS Unit, Scottish Crop Research Institute, Invergowrie, DD2 5DA, Dundee, Scotland, UK
Abstract
Background
The theoretical basis of genome-wide association studies (GWAS) is statistical inference of linkage disequilibrium (LD) between any polymorphic marker and a putative disease locus. Most widely implemented methods for such analyses are vulnerable to several key demographic factors, and they deliver poor statistical power for detecting genuine associations together with a high false positive rate. Here, we present a likelihood-based statistical approach that properly accounts for the non-random nature of case–control samples with regard to the genotypic distribution at the loci in the populations under study, and that offers the flexibility to test for genetic association in the presence of confounding factors such as population structure and non-randomness of samples.
Results
We implemented this novel method, together with several popular methods from the GWAS literature, to reanalyze recently published Parkinson’s disease (PD) case–control samples. The real data analysis and computer simulations show that, compared with its rivals, the new method confers not only significantly improved statistical power for detecting associations but also robustness to the difficulties stemming from non-random sampling and genetic structure. In particular, the new method detected 44 significant SNPs within 25 chromosomal regions of size < 1 Mb, whereas only 6 SNPs in two of these regions had previously been detected by trend-test-based methods. It discovered two SNPs located 1.18 Mb and 0.18 Mb from the PD candidates,
Conclusions
We developed a novel likelihood-based method that provides adequate estimation of LD and other population model parameters from case and control samples, eases the integration of samples from multiple genetically divergent populations, and thus confers statistically robust and powerful analyses of GWAS. On the basis of simulation studies and analysis of real datasets, we demonstrated significant improvement of the new method over the non-parametric trend test, which is the most widely implemented test in the GWAS literature.
Background
Rapid advancement in high-throughput sequencing techniques has greatly inspired the wave of genome-wide association studies (GWAS) to unravel the genetic basis underlying complex traits in plants, animals and humans
In contrast to the problem arising from population stratification, the consequences of using non-random samples in association studies are usually neglected. We recently investigated the effect of using non-random samples in LD analyses and observed that estimates of LD parameters can be severely biased and that the statistical power for testing their significance can be substantially reduced
Methods
We consider case and control samples from
Derivation of equation (1) implies that the conditional probabilities of a marker allele given an allele at the disease locus are constant in both cases and controls across different subpopulations, that is
Method 1 proposed in the present study uses information from the conditional probability distribution of genotypes at the disease locus given any genotype at the tested marker (Table
from the numerator of Armitage’s trend test statistic. All three methods are detailed in the following text.
Conditional probability distribution of (a) marker genotypes on a given disease genotype, (b) disease genotypes on a given marker genotype and (c) marker genotypes given a genotype at the disease locus under the penetrance model of the disease gene in case/control samples. f_{i} is the penetrance that an individual in the population is affected with disease given its genotype at the disease locus is i (i = 1, 2 and 3 for genotypes AA, Aa and aa respectively).
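a. Marker genotypes given a disease genotype.
b. Disease genotypes given a marker genotype.
c. Marker genotypes given a disease genotype under the penetrance model, listed separately for Cases and Controls.
[Formula entries of panels a to c are not recoverable from the source.]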
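The conditional distributions summarized in the table can be illustrated with the standard two-locus result for a randomly mating population. The sketch below is not the authors' tabulation: the symbols `p` (marker allele frequency), `q` (disease allele frequency) and `D` (LD coefficient), as well as all function names, are assumptions introduced here because the original notation is not available. The last function derives the marginal marker genotype distribution in cases and controls from the penetrances f_{i}.

```python
def allele_conditionals(p, q, D):
    """Return (P(M | A), P(M | a)) for marker allele M and disease alleles A, a."""
    return p + D / q, p - D / (1.0 - q)


def genotype_dist(freq):
    """Hardy-Weinberg proportions for two, one and zero copies of an allele."""
    return [freq * freq, 2.0 * freq * (1.0 - freq), (1.0 - freq) ** 2]


def marker_given_disease(p, q, D):
    """Rows: disease genotypes AA, Aa, aa. Columns: marker genotypes MM, Mm, mm."""
    Q, R = allele_conditionals(p, q, D)
    return [
        [Q * Q, 2 * Q * (1 - Q), (1 - Q) ** 2],
        [Q * R, Q * (1 - R) + R * (1 - Q), (1 - Q) * (1 - R)],
        [R * R, 2 * R * (1 - R), (1 - R) ** 2],
    ]


def disease_given_marker(p, q, D):
    """Bayes inversion. Rows: marker genotypes MM, Mm, mm. Columns: AA, Aa, aa."""
    forward = marker_given_disease(p, q, D)
    prior = genotype_dist(q)
    table = []
    for j in range(3):
        joint = [prior[i] * forward[i][j] for i in range(3)]
        total = sum(joint)
        table.append([x / total for x in joint])
    return table


def marker_in_cases_controls(p, q, D, f):
    """Marker genotype distributions (MM, Mm, mm) in cases and in controls.

    f = (f_AA, f_Aa, f_aa) are the penetrances of the disease genotypes.
    """
    forward = marker_given_disease(p, q, D)
    prior = genotype_dist(q)
    case_w = [prior[i] * f[i] for i in range(3)]
    ctrl_w = [prior[i] * (1.0 - f[i]) for i in range(3)]
    cases = [sum(case_w[i] * forward[i][j] for i in range(3)) / sum(case_w)
             for j in range(3)]
    ctrls = [sum(ctrl_w[i] * forward[i][j] for i in range(3)) / sum(ctrl_w)
             for j in range(3)]
    return cases, ctrls
```

With `D = 0` every row of `marker_given_disease` reduces to the Hardy-Weinberg proportions of the marker, which is the null hypothesis of no LD.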
In Method 1, we first consider a case–control sample of size
The cases and controls collected from the population can be classified according to their genotypes at marker loci, while the sample size
The marker allele frequency
where
are the conditional probabilities that any case or control individual with the
and
The coefficients
. Significance of the disequilibrium parameter
Mathematical forms of the coefficients in normal equations (4) and (5).
It is important to note that the likelihood function under the null hypothesis can be simplified to be
When the cases and controls are collected independently from
where the superscript is used to denote the parameters for each subpopulation. To calculate the above likelihood function, we proposed firstly to work out the population specific parameters
Method 2 was modified from the Armitage’s trend test
to be the test statistic, which follows the chi-square distribution with 1 d.f. In equation (8), the denominator is the sampling variance of the numerator under the null hypothesis, i.e. that there is no LD in either subpopulation.
Method 3 is essentially Armitage’s trend test, which is the most commonly implemented approach in the GWAS literature for case–control designs. The test statistic is built upon the number of genotypes,
Under the null hypothesis of no association between the marker and the disease locus, this follows a
under an additive genetic inheritance model
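For reference, the textbook form of the Cochran-Armitage trend test under additive scoring can be sketched as follows. This is the standard statistic rather than a reproduction of the equation given in the paper; the function name, the argument layout and the ordering of genotype counts by number of copies of the tested allele (0, 1, 2) are assumptions made for the illustration.

```python
import math


def armitage_trend_test(cases, controls, weights=(0.0, 1.0, 2.0)):
    """Cochran-Armitage trend test on a 2 x 3 genotype table.

    cases, controls: genotype counts ordered by copies of the tested allele.
    weights: genotype scores; (0, 1, 2) corresponds to an additive model.
    Returns (chi-square statistic with 1 d.f., p-value).
    """
    r = list(cases)
    s = list(controls)
    n = [ri + si for ri, si in zip(r, s)]
    R, S = sum(r), sum(s)
    N = R + S

    sum_wr = sum(w * ri for w, ri in zip(weights, r))
    sum_wn = sum(w * ni for w, ni in zip(weights, n))
    sum_w2n = sum(w * w * ni for w, ni in zip(weights, n))

    # Numerator: weighted difference between case and control genotype counts.
    numerator = N * sum_wr - R * sum_wn
    # Sampling variance of the numerator under the null of no association.
    variance = R * S * (N * sum_w2n - sum_wn ** 2) / N
    if variance <= 0:
        raise ValueError("degenerate table: the marker shows no variation")

    chi2 = numerator ** 2 / variance
    # Upper tail of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

For example, `armitage_trend_test((10, 20, 30), (30, 20, 10))` yields a statistic of 20.0, while identical genotype counts in cases and controls yield 0 and a p-value of 1.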
Reanalysis of the Parkinson’s disease datasets
We implemented the three methods described above to reanalyze the PD dataset recently published by Simon-Sanchez et al.
Simulation model and method
To investigate the statistical properties and limitations of the method developed in the present study, we considered three schemes for sampling cases and controls from computer-simulated randomly mating populations. In the first two sampling schemes, schemes A and B, we fixed the penetrance parameters
Results
Reanalysis of the Parkinson’s disease datasets
To assess the population structure in the stage I dataset, PCA was carried out using whole-genome genotype data and illustrated in Additional file
Genome-wide association scan. Graphic presenting association results from (a) stage I, (b) stage II and (c) two-stage combined case and control samples. Each of the three datasets was analyzed using Method 1 (black circles), Method 2 (red circles) and Method 3 (blue circles). The red horizontal dashed lines indicate the Bonferroni significance threshold of
Top two principal components from principal component analysis (PCA) of the stage I dataset.
Significance and bootstrap posterior probabilities (BPP) for the 44 significant SNPs detected by Method 1 (M 1) from the stage I dataset. Shadowed are the regions at which the genetic association was tested by Method 2 (M 2) and Method 3 (M 3) at the same significance level. *Distance (kb) from the previous significant SNP in the same chromosome region.

| Locus | SNP name | Dist (kb)* | Significance, M 1 | Significance, M 2 | Significance, M 3 | BPP (%), M 1 | BPP (%), M 2 | BPP (%), M 3 |
|---|---|---|---|---|---|---|---|---|
| 1p13.2-13.3 | rs17654531 | - | 1.9 × 10^{-9} | 3.2 × 10^{-6} | 1.2 × 10^{-5} | 37 | 22 | 14 |
| | rs10857899 | 328 | 2.7 × 10^{-8} | 3.1 × 10^{-6} | 3.1 × 10^{-6} | 57 | 25 | 27 |
| 2p23.3 | rs7564397 | - | 9.7 × 10^{-8} | 0.013 | 0.033 | 55 | 0 | 0 |
| 2q21.2 | rs1474406 | - | 4.3 × 10^{-8} | 2.3 × 10^{-3} | 0.001 | 57 | 1 | 3 |
| 2q36.1 | rs1447108 | - | 5.5 × 10^{-8} | 2.5 × 10^{-4} | 4.4 × 10^{-4} | 59 | 4 | 3 |
| 3p24.3 | rs1605527 | - | 2.0 × 10^{-8} | 1.0 × 10^{-4} | 9.4 × 10^{-5} | 53 | 9 | 10 |
| 4p15.2 | rs6820719 | - | 1.6 × 10^{-9} | 0.23 | 0.30 | 74 | 0 | 0 |
| | rs7676830 | 23 | 8.6 × 10^{-10} | 0.12 | 0.15 | 77 | 0 | 0 |
| | rs12649499 | 11 | 4.8 × 10^{-10} | 0.20 | 0.26 | 77 | 0 | 0 |
| 4q21 | rs11931074 | - | 3.9 × 10^{-8} | 5.1 × 10^{-8} | 4.8 × 10^{-8} | 56 | 54 | 54 |
| | rs356220 | 2 | 7.7 × 10^{-11} | 3.4 × 10^{-8} | 7.0 × 10^{-8} | 81 | 56 | 52 |
| | rs3857059 | 34 | 5.3 × 10^{-8} | 4.0 × 10^{-8} | 3.6 × 10^{-8} | 56 | 55 | 56 |
| | rs2736990 | 3 | 6.3 × 10^{-12} | 2.9 × 10^{-9} | 5.7 × 10^{-9} | 88 | 71 | 67 |
| 6q27 | rs2072638 | - | 1.1 × 10^{-11} | 0.014 | 0.012 | 86 | 0 | 0 |
| 7p14-p13 | rs859522 | - | 1.8 × 10^{-8} | 9.7 × 10^{-6} | 3.4 × 10^{-5} | 62 | 21 | 14 |
| 7q21 | rs3779331 | - | 6.6 × 10^{-8} | 0.028 | 0.01 | 56 | 0 | 0 |
| 7q21.11 | rs10246477 | - | 9.3 × 10^{-8} | 2.3 × 10^{-5} | 5.3 × 10^{-5} | 56 | 13 | 10 |
| 8p23.2 | rs7013027 | - | 5.8 × 10^{-8} | 4.3 × 10^{-6} | 1.9 × 10^{-6} | 56 | 23 | 29 |
| | rs4875773 | 63 | 1.6 × 10^{-8} | 0.02 | 0.044 | 63 | 0 | 0 |
| 8p22 | rs7828611 | - | 8.4 × 10^{-8} | 1.2 × 10^{-4} | 6.2 × 10^{-4} | 55 | 6 | 3 |
| | rs2736050 | 1 | 9.9 × 10^{-10} | 1.0 × 10^{-5} | 2.0 × 10^{-4} | 74 | 18 | 5 |
| | rs2009817 | 3 | 2.0 × 10^{-9} | 1.3 × 10^{-5} | 2.1 × 10^{-4} | 72 | 16 | 5 |
| 8q24.23-24.3 | rs4556079 | - | 4.8 × 10^{-8} | 5.0 × 10^{-6} | 4.8 × 10^{-6} | 60 | 20 | 22 |
| | rs11781101 | 14 | 7.3 × 10^{-8} | 5.4 × 10^{-6} | 5.3 × 10^{-6} | 56 | 21 | 22 |
| | rs7004938 | 12 | 3.1 × 10^{-8} | 3.0 × 10^{-6} | 3.0 × 10^{-6} | 59 | 24 | 25 |
| | rs11783351 | 1 | 7.7 × 10^{-8} | 5.0 × 10^{-6} | 5.5 × 10^{-6} | 53 | 21 | 21 |
| 9q21.31 | rs2378554 | - | 6.6 × 10^{-8} | 2.0 × 10^{-6} | 2.9 × 10^{-5} | 54 | 29 | 13 |
| 10p11.21 | rs2492448 | - | 3.8 × 10^{-8} | 1.6 × 10^{-6} | 3.8 × 10^{-6} | 61 | 29 | 24 |
| | rs11591754 | 12 | 4.8 × 10^{-10} | 2.5 × 10^{-7} | 1.7 × 10^{-6} | 80 | 43 | 30 |
| | rs7923172 | 102 | 7.0 × 10^{-8} | 1.1 × 10^{-5} | 1.4 × 10^{-5} | 54 | 17 | 16 |
| | rs4934704 | 23 | 7.3 × 10^{-8} | 1.2 × 10^{-5} | 1.5 × 10^{-5} | 54 | 17 | 16 |
| | rs10827492 | 97 | 9.7 × 10^{-8} | 1.3 × 10^{-5} | 1.7 × 10^{-5} | 52 | 16 | 16 |
| 10q24.3 | rs17115100 | - | 2.7 × 10^{-8} | 6.9 × 10^{-6} | 2.5 × 10^{-5} | 37 | 19 | 13 |
| 11p15.2 | rs11605276 | - | 3.4 × 10^{-11} | 0.079 | 0.19 | 86 | 0 | 0 |
| | rs10500796 | 45 | 1.9 × 10^{-8} | 0.18 | 0.30 | 61 | 0 | 0 |
| 11q13 | rs1726764 | - | 6.6 × 10^{-8} | 0.088 | 0.20 | 53 | 0 | 0 |
| 12p13 | rs10849446 | - | 6.7 × 10^{-9} | 1.1 × 10^{-4} | 3.7 × 10^{-5} | 68 | 6 | 12 |
| 16p13.3 | rs11648673 | - | 5.5 × 10^{-8} | 1.3 × 10^{-5} | 4.8 × 10^{-7} | 56 | 15 | 38 |
| 17q21 | rs169201 | - | 1.0 × 10^{-7} | 6.5 × 10^{-6} | 1.2 × 10^{-7} | 57 | 19 | 49 |
| | rs199533 | 39 | 4.1 × 10^{-8} | 2.8 × 10^{-6} | 5.0 × 10^{-8} | 60 | 24 | 55 |
| 17q24.3 | rs558076 | - | 6.6 × 10^{-8} | 1.0 × 10^{-4} | 2.5 × 10^{-5} | 57 | 7 | 14 |
| | rs817097 | 42 | 5.0 × 10^{-8} | 8.1 × 10^{-6} | 6.2 × 10^{-6} | 56 | 18 | 18 |
| 20p12.1 | rs6041636 | - | 9.9 × 10^{-9} | 0.16 | 0.24 | 66 | 0 | 0 |
| 21q22.3 | rs2070535 | - | 5.0 × 10^{-8} | 0.060 | 0.096 | 54 | 0 | 0 |
To assess variation of the predicted genetic associations, we carried out bootstrap sampling with replacement from the stage I dataset (1,000 replicates) and calculated the empirical posterior probability at each of the 44 significant SNPs. Table
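The bootstrap assessment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that cases and controls are resampled separately with replacement, and that the reported percentage is the proportion of replicates in which a SNP remains significant. The function name `bootstrap_support` and the callable `p_value_fn` are hypothetical.

```python
import random


def bootstrap_support(cases, controls, p_value_fn, alpha,
                      n_replicates=1000, seed=0):
    """Percentage of bootstrap replicates in which a SNP is significant.

    cases, controls: per-individual genotypes at one SNP (any encoding
        accepted by p_value_fn).
    p_value_fn: hypothetical callable (cases, controls) -> p-value; any of
        the association tests could be plugged in here.
    alpha: significance threshold applied to each replicate.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_replicates):
        # Assumption: cases and controls are resampled separately, so each
        # replicate keeps the original numbers of cases and controls.
        boot_cases = rng.choices(cases, k=len(cases))
        boot_controls = rng.choices(controls, k=len(controls))
        if p_value_fn(boot_cases, boot_controls) < alpha:
            hits += 1
    return 100.0 * hits / n_replicates
```

A SNP whose significance depends on a handful of individuals will fall below the threshold in many replicates and receive low support, whereas a stable signal stays significant in most of them.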
Before reporting our analysis of the stage II dataset, it is worth stressing that the 345 SNPs originally genotyped were selected only from the previous analysis using Method 3
Association scans from stage II and two-stage combined samples.
When the two datasets (stage I and stage II) were combined, 90 SNPs were detected as significant at the Bonferroni-corrected
A total of 25 candidate genes have been reported so far to predispose individuals to Parkinson’s disease (OMIM database, entry 168600). We explored the extent to which these candidate genes can be revealed in the present genetic association study. Listed in Figure
Significance of Parkinson’s disease candidate genes. The most significant SNP within the ±2.5 Mb chromosome region surrounding each of the 25 Parkinson’s disease (PD) candidate genes. In parentheses is the physical distance (Mb) from the SNP to the corresponding PD candidate gene.
Simulation study
Table
Scheme A simulation under dominant and recessive genetic models.
Population genetic parameters for 10 simulated populations and statistical inference of model parameters from 200 cases and 200 controls repeatedly sampled from the simulated populations.

| Pop. | | | | | | Method 1 | Method 1 | Method 3 | Method 3 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.5 | 0.5 | 0 | - | 0.004 ± 0.012 | 1.9 ± 2.5 | 6.9 | 1.0 ± 1.3 | 4.2 |
| 2 | 0.3 | 0.7 | 0 | - | 0.005 ± 0.011 | 2.0 ± 2.8 | 7.3 | 1.0 ± 1.4 | 4.5 |
| 3 | 0.7 | 0.3 | 0 | - | 0.002 ± 0.011 | 1.9 ± 2.7 | 6.7 | 1.0 ± 1.5 | 5.0 |
| 4 | 0.5 | 0.5 | 0.15 | 0.50 ± 0.05 | 0.148 ± 0.015 | 184.4 ± 42.8 | 100 | 73.3 ± 14.0 | 100 |
| 5 | 0.5 | 0.5 | 0.10 | 0.50 ± 0.09 | 0.097 ± 0.018 | 73.9 ± 26.5 | 99.7 | 33.3 ± 10.6 | 96.6 |
| 6 | 0.5 | 0.5 | 0.05 | 0.50 ± 0.20 | 0.043 ± 0.020 | 18.1 ± 12.0 | 36.8 | 8.8 ± 5.6 | 10.8 |
| 7 | 0.3 | 0.7 | 0.07 | 0.72 ± 0.12 | 0.064 ± 0.026 | 68.4 ± 25.4 | 99.6 | 29.6 ± 10.2 | 91.5 |
| 8 | 0.3 | 0.7 | 0.05 | 0.70 ± 0.15 | 0.047 ± 0.023 | 33.2 ± 17.6 | 77.3 | 15.1 ± 7.5 | 38.2 |
| 9 | 0.7 | 0.3 | 0.07 | 0.28 ± 0.14 | 0.062 ± 0.028 | 54.8 ± 23.4 | 96.8 | 26.3 ± 9.6 | 85.2 |
| 10 | 0.7 | 0.3 | 0.05 | 0.31 ± 0.20 | 0.042 ± 0.024 | 27.8 ± 15.6 | 66.1 | 13.7 ± 6.9 | 31.0 |
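Sampling cases and controls from a simulated randomly mating population, as used for the table above, might be sketched as below. This is an illustration under assumptions rather than the authors' simulation code: the parameter names `p` (marker allele frequency), `q` (disease allele frequency), `D` (LD coefficient) and `f` (penetrances), and the rejection-style sampling loop, are introduced here for the example.

```python
import random


def simulate_case_control(p, q, D, f, n_cases, n_controls, seed=0):
    """Draw marker genotype counts for cases and controls.

    f = (f_AA, f_Aa, f_aa): penetrances of the disease genotypes.
    Returns two lists indexed by the number of marker M alleles (0, 1, 2).
    """
    # Haplotypes coded as (copies of marker allele M, copies of disease allele A).
    haplotypes = [(1, 1), (1, 0), (0, 1), (0, 0)]
    freqs = [p * q + D, p * (1 - q) - D, (1 - p) * q - D, (1 - p) * (1 - q) + D]
    if min(freqs) < 0:
        raise ValueError("D is outside the range allowed by p and q")

    prevalence = q * q * f[0] + 2 * q * (1 - q) * f[1] + (1 - q) ** 2 * f[2]
    if (n_cases > 0 and prevalence <= 0) or (n_controls > 0 and prevalence >= 1):
        raise ValueError("penetrances make the requested sample unobtainable")

    rng = random.Random(seed)
    cases, controls = [0, 0, 0], [0, 0, 0]
    got_cases = got_controls = 0
    while got_cases < n_cases or got_controls < n_controls:
        # Random mating: an individual is a pair of independent haplotypes.
        h1, h2 = rng.choices(haplotypes, weights=freqs, k=2)
        m_copies = h1[0] + h2[0]
        a_copies = h1[1] + h2[1]
        affected = rng.random() < f[2 - a_copies]
        if affected and got_cases < n_cases:
            cases[m_copies] += 1
            got_cases += 1
        elif not affected and got_controls < n_controls:
            controls[m_copies] += 1
            got_controls += 1
    return cases, controls
```

Calling the function repeatedly with different seeds, for instance with 200 cases and 200 controls per replicate, gives the repeated samples on which means and standard deviations of estimates and test statistics can be computed.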
We explored the influence of using case and control samples collected from genetically divergent populations (or cohorts) on performance of the three methods. Table
Population genetic parameters defining two genetically divergent populations and empirical statistical powers of Methods 1–3 (M 1–3) for detecting significance of linkage disequilibrium between a polymorphic marker and a putative disease locus. The empirical power was calculated from 1,000 repeated samples of 1,000 cases and 1,000 controls as the proportion of the test statistic surpassing the Bonferroni threshold 5 × 10^{-5}. The admixed samples were made up of 57% cases and 76% controls from Population 1 and the rest from Population 2.

| Pop. | | | | | | | Population 1, M 1 | Population 1, M 2 | Population 1, M 3 | Population 2, M 1 | Population 2, M 2 | Population 2, M 3 | Admixed samples, M 1 | Admixed samples, M 2 | Admixed samples, M 3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.40 | 0.10 | 0.00 | 0.70 | 0.10 | 0.00 | 0.1 | 0.0 | 0.0 | 1.6 | 0.0 | 0.0 | 1.2 | 0.0 | 25.3 |
| 2 | 0.45 | 0.10 | 0.00 | 0.70 | 0.10 | 0.00 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.6 | 0.0 | 12.6 |
| 3 | 0.50 | 0.10 | 0.00 | 0.70 | 0.10 | 0.00 | 0.3 | 0.0 | 0.0 | 1.4 | 0.0 | 0.0 | 1.2 | 0.0 | 3.7 |
| 4 | 0.55 | 0.10 | 0.00 | 0.70 | 0.10 | 0.00 | 0.2 | 0.0 | 0.0 | 2.1 | 0.0 | 0.0 | 1.1 | 0.0 | 0.9 |
| 5 | 0.60 | 0.10 | 0.00 | 0.70 | 0.10 | 0.00 | 0.0 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.3 |
| 6 | 0.65 | 0.10 | 0.00 | 0.70 | 0.10 | 0.00 | 0.1 | 0.1 | 0.1 | 0.9 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 |
| 7 | 0.40 | 0.10 | 0.00 | 0.50 | 0.10 | 0.02 | 0.1 | 0.0 | 0.0 | 94.3 | 44.8 | 45.6 | 91.1 | 2.9 | 50.8 |
| 8 | 0.45 | 0.10 | 0.00 | 0.50 | 0.10 | 0.02 | 0.0 | 0.0 | 0.0 | 93.4 | 45.7 | 47.2 | 90.8 | 1.4 | 28.4 |
| 9 | 0.40 | 0.10 | 0.02 | 0.50 | 0.10 | 0.00 | 99.5 | 93.9 | 94.7 | 1.1 | 0.0 | 0.0 | 99.4 | 70.1 | 90.0 |
| 10 | 0.45 | 0.10 | 0.02 | 0.50 | 0.10 | 0.00 | 99.7 | 95.4 | 95.5 | 1.1 | 0.0 | 0.0 | 99.3 | 69.3 | 77.4 |
| 11 | 0.40 | 0.10 | 0.02 | 0.50 | 0.10 | 0.02 | 99.6 | 95.0 | 95.1 | 93.2 | 43.7 | 45.7 | 100.0 | 99.7 | 100.0 |
| 12 | 0.45 | 0.10 | 0.02 | 0.50 | 0.10 | 0.02 | 99.6 | 95.2 | 95.6 | 93.1 | 47.5 | 49.0 | 100.0 | 99.7 | 100.0 |
| 13 | 0.40 | 0.10 | 0.02 | 0.50 | 0.10 | 0.02 | 99.4 | 95.1 | 95.3 | 92.2 | 45.6 | 47.0 | 100.0 | 4.2 | 6.1 |
| 14 | 0.45 | 0.10 | 0.02 | 0.50 | 0.10 | 0.02 | 99.1 | 93.9 | 94.0 | 94.2 | 45.8 | 47.8 | 100.0 | 3.0 | 1.4 |
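The spurious significance that an unstructured test can show in admixed samples, even without LD in either population, follows from simple mixing arithmetic. The sketch below works through the first row of the table under one assumption that should be read with care: the first parameter listed for each population (0.40 and 0.70) is taken to be the marker allele frequency. The function name is hypothetical.

```python
def admixed_allele_freq(freqs, proportions):
    """Marker allele frequency in a sample pooled from several populations.

    freqs: marker allele frequency in each population.
    proportions: fraction of the pooled sample drawn from each population.
    """
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("mixing proportions must sum to 1")
    return sum(f * w for f, w in zip(freqs, proportions))


# Assumed marker allele frequencies in Population 1 and Population 2 (row 1).
p1, p2 = 0.40, 0.70

# Mixing proportions from the table caption: 57% of cases and 76% of
# controls come from Population 1, the rest from Population 2.
case_freq = admixed_allele_freq([p1, p2], [0.57, 0.43])
control_freq = admixed_allele_freq([p1, p2], [0.76, 0.24])

# With no LD in either population, marker frequencies within a population
# are the same in cases and controls, so the whole difference below is
# produced by the unequal mixing proportions alone.
spurious_difference = case_freq - control_freq
```

Under these assumptions the pooled cases carry the marker allele at frequency 0.529 and the pooled controls at 0.472. The gap shrinks as the two population frequencies approach each other, which is consistent with the decline of the Method 3 figures from row 1 to row 6.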
While the true penetrance parameters at a disease locus are unknown in practice, we proposed incorporating penetrance parameters with predefined values of (1, 1, 0), (1, 0, 0) or (1, ½, 0) into the analysis. This is mainly to ease the problem of overparameterization and to let the penetrance differ among the disease genotypes, even though the true but unknown penetrance could be far less than 1 for any single-locus genotype that contributes to the genetic variation of a common polygenic disease trait. We investigated through computer simulation how the use of misspecified penetrance values would influence the performance of the association tests. The simulation considered the scenario in which the disease genotypes had very low levels of penetrance. Table
Means and standard deviations (s.d.) of estimates of empirical statistical power (ρ) and the test statistic based on 200 cases and 200 controls from 1000 repeated computer simulations. The left panel lists values of the simulation parameters and the right the estimates. ρ is estimated as proportion (%) of significant tests at the Bonferroni threshold 5 × 10^{-5} in 1000 simulations.

| Pop. | | | | | | | | Method 1*, test statistic | Method 1*, ρ (%) | Method 1**, test statistic | Method 1**, ρ (%) | Method 3, test statistic | Method 3, ρ (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.5 | 0.5 | 0 | 0.1 | 0.05 | 0 | 0.50 ± 0.02 | 2.0 ± 6.1 | 2 | 1.9 ± 2.9 | 0 | 0.9 ± 1.3 | 0 |
| 2 | 0.3 | 0.7 | 0 | 0.1 | 0.05 | 0 | 0.30 ± 0.02 | 2.2 ± 5.9 | 2.1 | 1.8 ± 2.9 | 0.2 | 1.0 ± 1.3 | 0 |
| 3 | 0.7 | 0.3 | 0 | 0.2 | 0.1 | 0 | 0.70 ± 0.02 | 1.5 ± 3.9 | 0.5 | 2.0 ± 2.8 | 0 | 1.0 ± 1.3 | 0 |
| 4 | 0.5 | 0.5 | 0.15 | 0.2 | 0.1 | 0 | 0.48 ± 0.02 | 57.8 ± 21.8 | 99.2 | 52.5 ± 19.0 | 98 | 24.1 ± 9.2 | 79.1 |
| 5 | 0.5 | 0.5 | 0.1 | 0.1 | 0 | 0 | 0.49 ± 0.02 | 74.5 ± 24.1 | 99.7 | 68.7 ± 20.9 | 99.6 | 35.0 ± 11.0 | 97.7 |
| 6 | 0.5 | 0.5 | 0.05 | 0.1 | 0 | 0 | 0.50 ± 0.02 | 20.0 ± 12.3 | 42.6 | 19.6 ± 11.9 | 41.5 | 9.3 ± 5.8 | 11.8 |
| 7 | 0.3 | 0.7 | 0.07 | 0.3 | 0.1 | 0 | 0.29 ± 0.02 | 16.4 ± 12.7 | 32.4 | 14.2 ± 10.9 | 25 | 6.5 ± 4.9 | 4.5 |
| 8 | 0.3 | 0.7 | 0.05 | 0.3 | 0.1 | 0 | 0.29 ± 0.02 | 9.5 ± 8.9 | 12.6 | 8.2 ± 7.7 | 8.8 | 3.7 ± 3.6 | 1 |
| 9 | 0.7 | 0.3 | 0.07 | 0.1 | 0 | 0 | 0.70 ± 0.02 | 102.6 ± 29.7 | 100 | 93.5 ± 26.1 | 99.9 | 44.7 ± 12.0 | 99.6 |
| 10 | 0.7 | 0.3 | 0.05 | 0.1 | 0 | 0 | 0.70 ± 0.02 | 53.7 ± 21.1 | 96.6 | 50.5 ± 19.3 | 96.2 | 24.0 ± 9.0 | 80.1 |

* when the true simulated parameters were used in the association test.
** when the penetrance parameters
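The empirical power ρ reported in the table is a proportion of significant replicates, which can be computed along the following lines. The sketch assumes that each replicate yields a test statistic referred to a chi-square distribution with 1 d.f. and that significance is judged at the threshold 5 × 10^{-5}; the function names are illustrative.

```python
import math


def chi2_1df_p_value(stat):
    """Upper-tail probability of a chi-square variate with 1 d.f."""
    return math.erfc(math.sqrt(stat / 2.0))


def empirical_power(test_statistics, alpha=5e-5):
    """Percentage of replicates whose test statistic is significant at alpha."""
    stats = list(test_statistics)
    if not stats:
        raise ValueError("at least one replicate is required")
    significant = sum(1 for s in stats if chi2_1df_p_value(s) < alpha)
    return 100.0 * significant / len(stats)
```

When the marker is not in LD with the disease locus, the same quantity estimates the false positive rate rather than the power.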
Discussion
We have shown that Armitage’s trend test
We have solved three major problems in developing the methodology. Firstly, the genotype at the disease locus is not observable. Parameter estimation was therefore formulated on the principles of statistical analysis with missing data
which follows a chi-square distribution with 1 d.f. Comparison of equation (11) with equation (8) shows that the two test statistics share the same numerator, and the denominator of
Structured association using logistic regression.
It should be pointed out that the full model involves a total of six unknown parameters and thus presents an overparameterization problem for statistical analysis under the model. To ease the problem, we first proposed to estimate the marker allele frequency,
where
Predicting marker allele frequencies from control samples.
Simulation results from using biased estimates of marker allele frequency.
Although the population genetic model has focused on the most prominent LD measure,
Conclusions
We have developed a novel likelihood-based statistical approach to model linkage disequilibrium between any genetic marker locus and a putative disease locus in a randomly mating population, and to infer the disequilibrium parameter and other population genetic parameters from case and control samples drawn from that population. The approach was implemented to reanalyze large SNP datasets of Parkinson’s disease case and control samples collected from multiple human cohorts. Its statistical properties and limitations were investigated through simulation studies. Based on the simulations and the analysis of the Parkinson’s disease case and control samples, we demonstrate that the likelihood-based approach outperforms the trend test and logistic regression methods, which are widely implemented in the GWAS literature, with increased statistical power and fewer false positive inferences.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
ZL conceived and designed the study. ZL and MW developed the theoretical analysis. MW, LW, NJ and TJ implemented the simulation and analyzed the PD datasets. ZL and MW wrote the paper. All authors read and approved the final manuscript.
Acknowledgements
We thank Dr. Thomas Gasser of Neurodegenerative Diseases and German Center for Neurodegenerative Diseases (Germany) and Dr. Andrew B Singleton at National Institute on Aging (NIH, USA) for allowing us to reanalyze the Parkinson’s disease datasets. We thank two anonymous reviewers for their comments and suggestions which have been useful for improving presentation of the paper. This study was supported by research grants from the Leverhulme Trust (UK) and The National Basic Research Program of China (2012CB316505). ZL is also supported by China’s National Natural Science Foundation.