Department of Mathematical Sciences, Michigan Technological University, Houghton, MI, USA

School of Computer Science and Technology, Heilongjiang University, Harbin, PR China

Laboratory of Neurogenetics, National Institute on Aging, NIH, Bethesda, MD, USA

Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, UK

Department of Neurology, Johns Hopkins University, Baltimore, MD, USA

Department of Mathematics, Heilongjiang University, Harbin, PR China

Abstract

Background

Amyotrophic lateral sclerosis (ALS) is a fatal, degenerative neuromuscular disease characterized by a progressive loss of voluntary motor activity. About 95% of ALS patients have the "sporadic form", meaning that their disease is not associated with a family history of ALS. To date, the genetic factors underlying the sporadic form of ALS are poorly understood.

Methods

We proposed a two-stage approach based on seventeen biologically plausible models to search for two-locus combinations that have significant joint effects on the disease in a genome-wide association study (GWAS). We used a two-stage strategy to reduce the computational burden associated with performing an exhaustive two-locus search across the genome. In the first stage, all SNPs were screened using a single-marker test. In the second stage, all pairs made from the 1000 SNPs with the lowest p-values from the first stage were evaluated under each of the 17 two-locus models.

Results

We applied the two-stage approach to a GWAS data set of sporadic ALS from the SNP Database at the NINDS Human Genetics Resource Center DNA and Cell Line Repository.

Conclusion

The proposed two-stage analytical method can be used to search for joint effects of genes in GWAS. The two-stage strategy decreased the computational time and the multiple testing burden associated with GWAS. We also observed that the loci identified by our two-stage strategy cannot be detected by single-locus tests.

Background

Amyotrophic lateral sclerosis (ALS) is a fatal, progressive neurodegenerative disease that attacks nerve cells in the brain and spinal cord, resulting in muscle weakness and atrophy. Although ALS is listed as a rare disease with a prevalence of approximately 1 per 10,000, it is the most common adult-onset form of motor neuron disease.

The identification of susceptibility genes for sporadic ALS has been slow. The search for sporadic ALS genes has generated a large number of candidate-gene association studies.

Recently, Schymick et al. made the first attempt to identify genetic factors that might be relevant in the pathogenesis of sporadic ALS by using a well-designed GWAS ^{-7}. After adjustment by a permutation procedure, none of these SNPs reached the significance level of 0.05. This finding suggests that the ALS phenotype is not driven by a single powerful locus. By testing one marker at a time, the first stage analysis made the implicit assumption that susceptibility loci can be identified through their independent, marginal contributions to the trait variability. More recently, other GWAS in ALS have been conducted by different research groups.

In this article, we used seventeen two-locus models to analyze the previously published genome-wide association data for ALS. We found that three SNPs were significantly associated with sporadic ALS. After we observed the significant two-locus combinations, we further estimated the impact (relative risk and odds ratio) of each of the two-locus combinations on sporadic ALS. It has been recognized that the traditional method will overestimate the odds ratio or relative risk in GWAS.

Methods

In this section, we will give details of the data set and describe a new analytical method to analyze this data set.

The Data Set from GWAS for Sporadic ALS

Schymick et al. have made their data set publicly available through the website of the National Institute of Neurological Disorders and Stroke (NINDS) Human Genetics Resource Center at the Coriell Institute

Statistical Analysis

Two-locus Analysis Based on Seventeen Two-locus Models

In this article, we used seventeen two-locus models to analyze the genome-wide association data. For each SNP, we called one allele a high-risk allele if its frequency in cases was larger than the frequency in controls. For SNP A with alleles A, a and SNP B with alleles B, b, Figure

**Eight two-locus epistatic models**. A and B are the high-risk alleles in the two markers. α and β are the penetrances. ∩: two-locus genotypes with high-risk genotypes at both SNP A and SNP B are high-risk genotypes. ∪: two-locus genotypes with a high-risk genotype at SNP A or SNP B (or both) are high-risk genotypes.

**Nine two-locus multiplicative models**. A and B are the high-risk alleles in the two markers. The symbol in each cell denotes the relative risk of this cell. φ = θ^{2}, ρ = θ^{3} and γ = θ^{4}.

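Under each of the epistatic models, the nine two-locus genotypes were divided into two groups: a high-risk genotype group and a low-risk genotype group. For example, under the model Dom∩Dom, the high-risk group was $G_H = \{aAbB, aABB, AAbB, AABB\}$ and the low-risk group was $G_L = \{aabb, aabB, aaBB, aAbb, AAbb\}$. Let $n_H$ and $n_L$ ($m_H$ and $m_L$) denote the numbers of cases (controls) carrying high-risk and low-risk genotypes. We used the $\chi^2$ test statistic given by

$$T_{epi} = \frac{N\,(n_H m_L - n_L m_H)^2}{(n_H + n_L)(m_H + m_L)(n_H + m_H)(n_L + m_L)}$$

to test for association of two-locus joint effects, where $N = n_H + n_L + m_H + m_L$ is the sample size.

For the nine multiplicative models, we constructed a two-locus association test as follows. Let $P(g_1, g_2)$ denote the penetrance of the two-locus genotype $(g_1, g_2)$, where $g_1$ and $g_2$ are the genotypes in the first and second markers, respectively. Let $\beta_0$ denote the logarithm of the penetrance of genotypes with a relative risk of 1 in the models (see the figure of the nine two-locus multiplicative models) and let $\beta_1 = \log\theta$. The models can then be written as $\log P(g_1, g_2) = \beta_0 + \beta_1 (x_1 + x_2)$, where $x_1$ is the numerical code of $g_1$ and is given by

$$x_1 = \begin{cases} 0, 1, 1 \\ 0, 0, 1 \\ 0, 1, 2 \end{cases} \quad \text{for } g_1 = aa,\ aA,\ AA$$

for a dominant, recessive or multiplicative model, respectively; $x_2$ is similarly defined as the numerical code of $g_2$. Under the log linear model $\log P = \beta_0 + \beta_1 x$ with $x = x_1 + x_2$, $\beta_1 = 0$ means that all the genotypes have the same penetrance, which implies that there is no association. Testing for association is therefore equivalent to testing $H_0$: $\beta_1 = 0$. For the $i$^{th} individual, let $y_i$ denote the trait value (1 for diseased individual and 0 for normal individual) and $x_i$ denote the numerical code of the two-locus genotype ($x_i = x_{i1} + x_{i2}$). The score test statistic is

$$T_{score} = \frac{N \left[ \sum_{i=1}^{N} (y_i - \bar{y})(x_i - \bar{x}) \right]^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2 \, \sum_{i=1}^{N} (x_i - \bar{x})^2}$$

where N is the sample size, $\bar{y}$ is the mean of $y_1, \ldots, y_N$, and $\bar{x}$ is the mean of $x_1, \ldots, x_N$. Under the null hypothesis, $T_{score}$ follows a $\chi^2$ distribution with 1 df. Note that under each of the two-locus epistatic models, if we code $x_i = 1$ for a high-risk genotype and $x_i = 0$ for a low-risk genotype, then $T_{epi} = T_{score}$.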
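The equivalence between the score test and the 2 × 2 χ^{2} test can be checked numerically. The sketch below assumes the score statistic has the usual form N[Σ(y_i − ȳ)(x_i − x̄)]^{2} / [Σ(y_i − ȳ)^{2} Σ(x_i − x̄)^{2}] and that the χ^{2} test is the Pearson statistic without continuity correction; the function names and the counts are illustrative, not taken from the study.

```python
def one_locus_code(copies, mode):
    """Numerical code of a genotype carrying 0, 1 or 2 high-risk alleles."""
    if mode == "dom":      # at least one high-risk allele
        return 1 if copies >= 1 else 0
    if mode == "rec":      # two high-risk alleles
        return 1 if copies == 2 else 0
    if mode == "mul":      # number of high-risk alleles
        return copies
    raise ValueError(f"unknown mode: {mode}")


def score_statistic(x, y):
    """Assumed form of the score statistic (N times the squared correlation)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return n * s_xy ** 2 / (s_xx * s_yy)


def pearson_2x2(n_h, n_l, m_h, m_l):
    """Pearson chi-square for high/low-risk counts in cases (n) and controls (m)."""
    total = n_h + n_l + m_h + m_l
    numerator = total * (n_h * m_l - n_l * m_h) ** 2
    denominator = (n_h + n_l) * (m_h + m_l) * (n_h + m_h) * (n_l + m_l)
    return numerator / denominator


# Hypothetical counts: 60 high-risk and 40 low-risk cases, 35 and 65 controls.
n_h, n_l, m_h, m_l = 60, 40, 35, 65
x = [1] * n_h + [0] * n_l + [1] * m_h + [0] * m_l   # 1 = high-risk genotype
y = [1] * (n_h + n_l) + [0] * (m_h + m_l)           # 1 = case
t_score = score_statistic(x, y)
t_epi = pearson_2x2(n_h, n_l, m_h, m_l)
print(round(t_score, 4), round(t_epi, 4))           # the two values agree
```

For a multiplicative model, the two-locus code is the sum of the one-locus codes, for example `one_locus_code(2, "mul") + one_locus_code(1, "dom")` for a cell with relative risk θ^{3}.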

The method to search for significant two-locus combinations for each of the seventeen models has the following two steps:

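**Step 1:** We applied a single-marker test to each SNP and retained the $K$ SNPs with the smallest p-values. For a given SNP, let $n_1, n_2, n_3$ and $m_1, m_2, m_3$ denote the numbers of the three genotypes in cases and controls, respectively. The 2 df genotypic test statistic is given by

$$T = \sum_{j=1}^{3} \frac{(m\,n_j - n\,m_j)^2}{n\,m\,(n_j + m_j)}$$

where $n = n_1 + n_2 + n_3$ and $m = m_1 + m_2 + m_3$ are the numbers of cases and controls.

**Step 2:** Under each of the seventeen two-locus models, we applied a two-locus association test to each of the $K(K-1)/2$ two-locus combinations of the $K$ SNPs retained in step 1.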
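The two steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes genotypes are already coded as the number of high-risk alleles (0, 1 or 2), uses the Pearson χ^{2} form of the 2 df genotypic test, and runs only one of the seventeen models (Dom∩Dom) in step 2. All function names and the toy data are hypothetical.

```python
import math
from itertools import combinations


def genotypic_p_value(case_counts, control_counts):
    """p-value of the Pearson chi-square test on a 2 x 3 genotype table (2 df)."""
    n, m = sum(case_counts), sum(control_counts)
    stat = 0.0
    for n_j, m_j in zip(case_counts, control_counts):
        if n_j + m_j > 0:  # skip unobserved genotypes (still referred to 2 df here)
            stat += (m * n_j - n * m_j) ** 2 / (n * m * (n_j + m_j))
    return math.exp(-stat / 2.0)  # chi-square survival function for 2 df


def dom_and_dom_p_value(g1, g2, status):
    """p-value of the 1 df chi-square test under a Dom∩Dom grouping."""
    table = [[0, 0], [0, 0]]  # table[status][1 if high-risk else 0]
    for a, b, y in zip(g1, g2, status):
        table[y][1 if (a >= 1 and b >= 1) else 0] += 1
    (m_l, m_h), (n_l, n_h) = table
    den = (n_h + n_l) * (m_h + m_l) * (n_h + m_h) * (n_l + m_l)
    if den == 0:
        return 1.0  # degenerate table carries no evidence
    stat = (n_h + n_l + m_h + m_l) * (n_h * m_l - n_l * m_h) ** 2 / den
    return math.erfc(math.sqrt(stat / 2.0))  # chi-square survival function for 1 df


def two_stage_search(genotypes, status, n_keep, pair_test=dom_and_dom_p_value):
    """genotypes: {snp: [0/1/2, ...]}; status: [1 for case, 0 for control]."""
    stage1 = []
    for snp, g in genotypes.items():
        cases, controls = [0, 0, 0], [0, 0, 0]
        for g_i, y_i in zip(g, status):
            (cases if y_i == 1 else controls)[g_i] += 1
        stage1.append((genotypic_p_value(cases, controls), snp))
    kept = [snp for _, snp in sorted(stage1)[:n_keep]]      # step 1
    pairs = [(pair_test(genotypes[a], genotypes[b], status), a, b)
             for a, b in combinations(kept, 2)]              # step 2
    return kept, sorted(pairs)


# Toy data, purely illustrative.
status = [1, 1, 1, 1, 0, 0, 0, 0]
genotypes = {
    "snpA": [2, 1, 1, 2, 0, 0, 1, 0],
    "snpB": [1, 2, 1, 0, 0, 1, 0, 0],
    "snpC": [0, 1, 2, 1, 1, 0, 2, 1],
}
kept, pairs = two_stage_search(genotypes, status, n_keep=2)
print(kept, pairs)
print(math.comb(1000, 2))  # number of pairs examined when 1,000 SNPs are retained
```

Retaining 1,000 SNPs leaves 499,500 pairs per model, which is what makes the second step tractable compared with an exhaustive search over all SNP pairs.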

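A permutation procedure was used to adjust for multiple tests and multiple models. In each permutation, we randomly shuffled the case and control labels and repeated step 1 and step 2 based on the permuted data. We performed the permutation procedure $B$ times. For the $i$^{th} model and $l$^{th} two-locus combination ($i = 1, \ldots, 17$), let $p_{il}$ and $p_{il}^{(b)}$ denote the p-values of the two-locus test based on the original data and the $b$^{th} permuted data, respectively. Let

$$p_{\min}^{(b)} = \min_{i,\,l}\ p_{il}^{(b)}.$$

Then, for the $i$^{th} model and $l$^{th} two-locus combination, $P_{il}$, the p-value adjusted for multiple tests and multiple models, was given by

$$P_{il} = \frac{\#\{\,b : p_{\min}^{(b)} \le p_{il}\,\}}{B}.$$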
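A minimal sketch of this adjustment is given below. It assumes the common "minimum p-value" form, in which the adjusted p-value is the share of permutations whose smallest p-value (over all models and all pairs) is at most the observed p-value. The callable `analyse` is a hypothetical stand-in for rerunning step 1 and step 2 on relabelled data.

```python
import random


def permutation_min_p(status, analyse, n_perm, seed=0):
    """Smallest p-value from each of n_perm label-shuffled reanalyses.

    `analyse(status)` is a hypothetical callable that reruns step 1 and
    step 2 and returns every two-locus p-value (all models, all pairs).
    """
    rng = random.Random(seed)
    shuffled = list(status)
    minima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)                  # permute case/control labels
        minima.append(min(analyse(shuffled)))
    return minima


def adjusted_p_values(observed, minima):
    """Share of permutations whose minimum p-value is <= each observed p-value."""
    return {key: sum(1 for p_min in minima if p_min <= p) / len(minima)
            for key, p in observed.items()}


# Hand-made numbers, only to show the bookkeeping.
observed = {("Dom∩Dom", "snp1", "snp2"): 0.01, ("Rec∩Rec", "snp1", "snp3"): 0.5}
minima = [0.02, 0.005, 0.3, 0.6]
adjusted = adjusted_p_values(observed, minima)
print(adjusted)
```

Because each permutation repeats both steps, the adjustment accounts for the selection in step 1 as well as for the number of pairs and models tested in step 2.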

A New Method to Estimate Penetrance

When a study identifies a locus or locus-combination that shows evidence of association with a disease, it is common to estimate the impact of this locus or locus-combination on the phenotype of interest. This impact is often expressed as an odds ratio. Estimation of the odds ratio is also helpful for planning successful replication studies.

It is recognized that the traditional estimate of the odds ratio is biased upward because it is typically computed for a locus that has already been selected as significant for association.

We use the following notation:

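the data $D = \{n_1, \ldots, n_9;\ m_1, \ldots, m_9\}$: the counts of the nine two-locus genotypes in cases and controls that constitute the significant signal for association

$(p_1, \ldots, p_9)$: the population frequencies of the genotypes

Because ALS is a rare disease with a prevalence of approximately 1 per 10,000, we can estimate $p_i$ from the sampled controls. Thus, we assume that $p_i$ = (number of $i$^{th} genotype in controls)/$m$ is known, where $m$ is the number of controls. Suppose, for example, that the 5^{th}, 6^{th}, 8^{th} and 9^{th} genotype combination {(aA, bB), (AA, bB), (aA, BB), (AA, BB)} is the high-risk genotype combination, and the combination of the other genotypes is the low-risk genotype combination. Let $p_H = p_5 + p_6 + p_8 + p_9$ denote the population frequency of the high-risk genotype combination. Then, the penetrance is determined by the disease prevalence $F$ and the relative risk $R$ of the high-risk combination: $f_L = F / (R\,p_H + 1 - p_H)$ for the low-risk combination and $f_H = R\,f_L$ for the high-risk combination.

Thus, we have only one unknown parameter, $R$. Let $S$ denote the event of significant single-marker association at level $\alpha_1$ in step 1 and significant joint association at level $\alpha_2$ in step 2. We calculate the likelihood

$$L(R) = \Pr(D \mid S, R) = \frac{\Pr(D, S \mid R)}{\Pr(S \mid R)}$$

where the data $D = \{n_1, \ldots, n_9;\ m_1, \ldots, m_9\}$. Since the data D constitute, by definition, a significant result, D implies S; hence $\Pr(D, S \mid R) = \Pr(D \mid R)$ and $L(R) = \Pr(D \mid R) / \Pr(S \mid R)$. The numerator is the multinomial probability of the observed counts,

$$\Pr(D \mid R) \propto \prod_{i=1}^{9} q_i^{\,n_i}\, p_i^{\,m_i},$$

where $q_i = R\,p_i / (R\,p_H + 1 - p_H)$ if the $i$^{th} genotype is a high-risk genotype and $q_i = p_i / (R\,p_H + 1 - p_H)$ if the $i$^{th} genotype is a low-risk genotype. The denominator is estimated by simulation. Since $R$ is given and the $p_i$ are known, we can generate the two-locus genotypes for $n$ cases and $m$ controls. If the p-values of the single-marker tests are less than $\alpha_1$ and the p-value of the two-locus test is less than $\alpha_2$, the data set is said to be significant for association. We repeat the process to generate the data sets many times (1 million was used in this article). The proportion of significant data sets is the estimate of $\Pr(S \mid R)$.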
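The simulation step can be sketched as follows, under several stated assumptions: the nine genotypes are ordered so that the 5th, 6th, 8th and 9th are high-risk (a Dom∩Dom grouping), case genotype frequencies are R·p_i (high-risk) or p_i (low-risk) divided by R·p_H + 1 − p_H, control frequencies are p_i, and significance means that both single-marker 2 df tests and the two-locus 1 df test fall below their thresholds. The two-locus threshold `alpha2` below is a placeholder, and the number of replicates is far below the 1 million used in the article.

```python
import math
import random

HIGH_RISK = (4, 5, 7, 8)  # 0-based positions of the 5th, 6th, 8th and 9th genotypes


def case_frequencies(p, rr):
    """Genotype frequencies in cases for relative risk rr (rare-disease setting)."""
    p_h = sum(p[i] for i in HIGH_RISK)
    scale = rr * p_h + 1.0 - p_h
    return [(rr if i in HIGH_RISK else 1.0) * p[i] / scale for i in range(9)]


def draw_counts(freqs, size, rng):
    """Multinomial draw of nine genotype counts."""
    counts = [0] * 9
    for i in rng.choices(range(9), weights=freqs, k=size):
        counts[i] += 1
    return counts


def chi_square(cases, controls):
    """Pearson chi-square statistic for a 2 x k table of counts."""
    n, m = sum(cases), sum(controls)
    return sum((m * a - n * b) ** 2 / (n * m * (a + b))
               for a, b in zip(cases, controls) if a + b > 0)


def is_significant(cases, controls, alpha1, alpha2):
    # Genotype index i encodes SNP A as i % 3 and SNP B as i // 3.
    for locus in (lambda i: i % 3, lambda i: i // 3):
        c3 = [sum(cases[i] for i in range(9) if locus(i) == g) for g in range(3)]
        k3 = [sum(controls[i] for i in range(9) if locus(i) == g) for g in range(3)]
        if math.exp(-chi_square(c3, k3) / 2.0) >= alpha1:      # 2 df p-value
            return False
    c_high = sum(cases[i] for i in HIGH_RISK)
    k_high = sum(controls[i] for i in HIGH_RISK)
    stat = chi_square([c_high, sum(cases) - c_high], [k_high, sum(controls) - k_high])
    return math.erfc(math.sqrt(stat / 2.0)) < alpha2           # 1 df p-value


def estimate_pr_s(p, rr, n_cases, n_controls, alpha1, alpha2, reps, seed=0):
    """Proportion of simulated data sets that pass both significance filters."""
    rng = random.Random(seed)
    q = case_frequencies(p, rr)
    hits = sum(is_significant(draw_counts(q, n_cases, rng),
                              draw_counts(p, n_controls, rng), alpha1, alpha2)
               for _ in range(reps))
    return hits / reps


# Control counts for SNP1 (rows of three) by SNP2, read from the genotype table.
controls = [23, 37, 7, 50, 56, 11, 45, 24, 16]
p = [c / sum(controls) for c in controls]
# alpha2 = 1e-6 is a placeholder, not a value from the article.
print(estimate_pr_s(p, rr=3.4, n_cases=275, n_controls=269,
                    alpha1=0.0023, alpha2=1e-6, reps=2000))
```

Evaluating `estimate_pr_s` on a grid of relative risks and dividing the multinomial probability of the observed counts by it gives the conditional likelihood to be maximized.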

When the relative risk

Following Zollner and Pritchard, when there are more than two genotype groups in the models, such as those in the figure of the nine two-locus multiplicative models, the odds ratio of a genotype group is defined relative to all the other genotypes. For example, suppose there are three groups: the low risk genotype group $G_L$ = {aabb}, the middle risk genotype group $G_M$ = {aabB, aaBB, aAbb, AAbb}, and the high risk genotype group $G_H$ = {aAbB, aABB, AAbB, AABB}. The odds ratio of the high risk group, OR^{H}, is the odds of $G_H$ divided by the odds of $G_M \cup G_L$ = {aabb, aabB, aaBB, aAbb, AAbb}. The odds ratio of the low risk genotype group, OR^{L}, is the odds of $G_L$ divided by the odds of $G_M \cup G_H$ = {aabB, aaBB, aAbb, AAbb, aAbB, aABB, AAbB, AABB}. The odds ratio estimation method is then the same as in the case of two genotype groups.
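As a worked example of the traditional (unadjusted) calculation, the sketch below computes OR^{H} and OR^{L} from the case/control counts reported for SNP1 and SNP2 in the Results. It assumes, for illustration, a Dom∩Dom grouping in which carriers of the high-risk allele at both SNPs (C at SNP1, G at SNP2) form the high-risk group; the helper names are hypothetical.

```python
SNP1 = ("TT", "TC", "CC")  # column order; high-risk allele C
# Rows are SNP2 genotypes (high-risk allele G), columns follow SNP1.
CASES = {"AA": (11, 14, 3), "AG": (29, 73, 29), "GG": (23, 65, 28)}
CONTROLS = {"AA": (23, 37, 7), "AG": (50, 56, 11), "GG": (45, 24, 16)}


def group_counts(table, in_group):
    """Total counts inside and outside a genotype group."""
    inside = outside = 0
    for g2, row in table.items():
        for g1, count in zip(SNP1, row):
            if in_group(g1, g2):
                inside += count
            else:
                outside += count
    return inside, outside


def odds_ratio(in_group):
    """Odds of the group in cases divided by the odds of the group in controls."""
    case_in, case_out = group_counts(CASES, in_group)
    ctrl_in, ctrl_out = group_counts(CONTROLS, in_group)
    return (case_in / case_out) / (ctrl_in / ctrl_out)


def high_risk(g1, g2):
    return "C" in g1 and "G" in g2  # assumed Dom∩Dom grouping


or_high = odds_ratio(high_risk)
or_low = odds_ratio(lambda g1, g2: not high_risk(g1, g2))
print(round(or_high, 2), round(or_low, 2))  # 3.69 0.27
```

With only two genotype groups, OR^{L} is the reciprocal of OR^{H}. The values obtained here are close to the unadjusted estimates reported for SNP1 and SNP2 (3.70 and 0.27).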

We used this new proposed method to estimate the odds ratio for each of the two-locus combinations that showed significant association with ALS in our two-locus analysis. Based on the estimated penetrance, we used a simulation method to estimate the sample size required to replicate the findings with 80% power.
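A simulation of this kind can be sketched as below for a model with two genotype groups. The sketch makes assumptions that the text does not specify: equal numbers of cases and controls, a single 1 df χ^{2} test in the replication study, and a placeholder significance level of 0.05. Its output is therefore not comparable with the sample sizes reported in the Results; it only illustrates the procedure.

```python
import math
import random


def two_group_power(p_high, rr, n_per_group, alpha, reps, rng):
    """Simulated power of a 1 df chi-square test with n_per_group cases and controls."""
    q_high = rr * p_high / (rr * p_high + 1.0 - p_high)  # high-risk frequency in cases
    hits = 0
    for _ in range(reps):
        n_h = sum(rng.random() < q_high for _ in range(n_per_group))
        m_h = sum(rng.random() < p_high for _ in range(n_per_group))
        n_l, m_l = n_per_group - n_h, n_per_group - m_h
        den = (n_h + m_h) * (n_l + m_l) * n_per_group * n_per_group
        if den == 0:
            continue  # degenerate table, counted as not significant
        stat = 2 * n_per_group * (n_h * m_l - n_l * m_h) ** 2 / den
        if math.erfc(math.sqrt(stat / 2.0)) < alpha:
            hits += 1
    return hits / reps


def required_sample_size(p_high, rr, alpha=0.05, target=0.8,
                         step=20, reps=500, max_n=5000, seed=1):
    """Smallest n per group on a grid whose simulated power reaches the target."""
    rng = random.Random(seed)
    for n in range(step, max_n + 1, step):
        if two_group_power(p_high, rr, n, alpha, reps, rng) >= target:
            return n
    return None


# p_high and rr are illustrative inputs; alpha = 0.05 is a placeholder.
n_needed = required_sample_size(p_high=0.4, rr=3.0)
print(n_needed)
```

Because an overestimated relative risk inflates the simulated power, plugging in the unadjusted estimate yields a smaller required sample size than the adjusted estimate does.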

Results

We applied the two-step, two-locus analysis to the genome-wide association data set for sporadic ALS. The analysis was done for all SNPs with a genotype call rate greater than or equal to 95% (549,062 SNPs left). SNPs on the sex chromosomes were excluded from the analysis. In the first step, we retained the 1,000 SNPs with the smallest p-values, which corresponded to using a p-value cut-off of α_{1} = 0.0023. Then we tested all of the two-locus combinations of these 1,000 SNPs under each of the seventeen two-locus models. The significant two-locus combinations involved three SNPs. In the single-marker analysis, SNP 1 was ranked 1^{st} with a p-value of 6.8 × 10^{-7}, SNP 2 was ranked 10^{th} with a p-value of 2.2 × 10^{-5}, and SNP 3 was ranked 2^{nd} with a p-value of 1.7 × 10^{-6}.

Information on the three SNPs. HRA: high-risk allele.

| **SNP** | **dbSNP ID** | **Chromosome Location** | **Gene** | **Two alleles** | **Allele frequency: Controls** | **Allele frequency: Cases** | **HRA** |
|---|---|---|---|---|---|---|---|
| SNP1 | rs4363506 | 10q26.13 | Intergenic | T | 0.656 | 0.505 | C |
| | | | | C | 0.344 | 0.495 | |
| SNP2 | rs3733242 | 4q21.1 | SHROOM3 | A | 0.467 | 0.341 | G |
| | | | | G | 0.533 | 0.659 | |
| SNP3 | rs16984239 | 2p24 | Intergenic | C | 0.887 | 0.786 | A |
| | | | | A | 0.113 | 0.214 | |

(number of cases)/(number of controls) in each of the two-locus genotypes.

| **SNP** | **Genotype** | **SNP1: TT** | **SNP1: TC** | **SNP1: CC** |
|---|---|---|---|---|
| SNP2 | AA | 11/23 | 14/37 | 3/7 |
| | AG | 29/50 | 73/56 | 29/11 |
| | GG | 23/45 | 65/24 | 28/16 |
| SNP3 | CC | 33/95 | 95/89 | 37/30 |
| | CA | 29/20 | 52/25 | 22/4 |
| | AA | 1/3 | 5/3 | 1/0 |

To estimate the impact of the two two-locus combinations on sporadic ALS, we first estimated the penetrance of the two-locus genotypes for each combination under the corresponding model. Based on the estimated penetrance, we estimated the relative risk, the odds ratio and the sample size required to replicate the significant findings with 80% power. Following Zollner and Pritchard, we obtained the 95% CI of the estimates by assuming that twice the log-likelihood ratio follows a χ^{2} distribution with 1 df. The estimates from both the proposed method (adjusted estimates) and the traditional method (unadjusted estimates) are summarized in the following table.

Penetrance, relative risk and odds ratio of the two-locus combinations.

| **Two-locus combination** | | **SNP1 and SNP2** | **SNP1 and SNP3** |
|---|---|---|---|
| Penetrance | Unadjusted | | |
| | Adjusted | | |
| R and 95% CI | Unadjusted | 3.70, (2.85, 4.85) | 2.55, (2.10, 3.15) |
| | Adjusted | 3.40, (2.40, 4.60) | 2.35, (1.85, 2.95) |
| OR^{H} and 95% CI | Unadjusted | 3.70, (2.85, 4.85) | 3.37, (2.66, 4.34) |
| | Adjusted | 3.40, (2.40, 4.60) | 3.05, (2.27, 4.01) |
| OR^{L} and 95% CI | Unadjusted | 0.27, (0.21, 0.35) | 0.31, (0.23, 0.40) |
| | Adjusted | 0.29, (0.22, 0.42) | 0.34, (0.25, 0.47) |
| SS and 95% CI | Unadjusted | 680, (480, 1040) | 680, (460, 1040) |
| | Adjusted | 800, (500, 1500) | 810, (520, 1520) |

Note: There were two genotype combinations for SNP1 and SNP2, OR^{H} (OR^{L}): the odds ratio of the high-risk (low-risk) genotype group. SS: the sample size required to reach 80% power. Adjusted (Unadjusted): based on the penetrance estimated using the method proposed in this article (the traditional method). ^{-4}.

Discussion

In this study we proposed a new analytical method that considered joint effects of genes to analyze a data set from the GWAS in sporadic ALS previously performed by Schymick et al.

Population stratification may lead to false-positive results. We also checked for population stratification in this data set using the following method. We randomly chose 5,000 SNPs and obtained their p-values from a single-marker test. If population stratification existed in this data set, there should be more small p-values among the 5,000 than expected under the uniform distribution. We used the one-sided Kolmogorov test statistic to test whether the 5,000 p-values followed a uniform distribution. We repeated the procedure 10 times. The Kolmogorov test results showed that the p-values followed a uniform distribution in all 10 replications, which indicated that there was no population stratification in this data set. The lack of population stratification in the data set was consistent with the results of Schymick et al.
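The check described above can be sketched as follows. The one-sided statistic D+ = max_i (i/n − p_(i)) is large when small p-values are over-represented. The p-value uses Smirnov's large-sample approximation exp(−2nD+^{2}), which is an assumption of this sketch rather than a detail given in the text, and the simulated p-values are placeholders.

```python
import math
import random


def one_sided_ks(p_values):
    """One-sided Kolmogorov statistic against Uniform(0, 1) and approximate p-value."""
    ordered = sorted(p_values)
    n = len(ordered)
    d_plus = max((i + 1) / n - p for i, p in enumerate(ordered))
    return d_plus, min(1.0, math.exp(-2.0 * n * d_plus ** 2))


def stratification_check(all_p_values, n_snps=5000, n_rep=10, seed=0):
    """Repeat the test on n_rep random subsets of n_snps single-marker p-values."""
    rng = random.Random(seed)
    return [one_sided_ks(rng.sample(all_p_values, n_snps)) for _ in range(n_rep)]


# Placeholder p-values drawn from the null; real input would be the GWAS p-values.
rng = random.Random(42)
null_p = [rng.random() for _ in range(20000)]
results = stratification_check(null_p)
for d_plus, p_value in results:
    print(round(d_plus, 4), round(p_value, 3))
```

An excess of small p-values, as expected under stratification, pushes the empirical distribution above the uniform line and produces a small approximate p-value.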

Significant associations claimed by association studies often fail to be replicated. One possible reason is the overestimation of the effect in terms of the odds ratio or relative risk of the claimed variants. The overestimation of the effect leads to the underestimation of the sample size required to replicate the finding. In this article, we proposed a new method to estimate the effect of claimed variants. Based on the study of Zollner and Pritchard

Currently, several methods are available to test associations by taking joint effects of genes into account, such as the combinatorial searching method (CSM) and the multifactor dimensionality reduction (MDR) method. The two-locus models used in this article respect an order among the genotypes: penetrance of $H_1H_2$ ≥ penetrance of $H_1L_2$ ≥ penetrance of $L_1L_2$, where $H_1$ ($L_1$) and $H_2$ ($L_2$) are the high-risk (low-risk) genotypes in the first and second marker, respectively. The CSM and MDR ignore the order of genotypes and therefore can group any two genotypes together, in essence searching for the "best" one among 21,146 different partitions of the two-locus genotypes. By searching over irrelevant two-locus genotype combinations, the CSM and MDR do not gain more information but increase the noise level, and thus lose power.

Conclusion

The proposed two-stage analytical method can be used to search for two-locus joint effects of genes in GWAS. The two-stage strategy significantly decreased the computational time and the multiple testing burden associated with GWAS. We also observed that the three SNPs identified by our two-stage strategy cannot be detected by single-locus tests.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

QS and SZ designed the study. ZZ contributed the two-locus data analysis under the direction of SZ. SZ performed the penetrance estimation. JCS & BJT assisted in data interpretation and approved the final manuscript. QS and SZ contributed to the writing of the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This work was supported by the National Institutes of Health (NIH) grant R01 GM069940 and the Overseas-Returned Scholars Foundation of Department of Education of Heilongjiang Province (1152 HZ01). This work was supported in part by the Intramural Research Program of the National Institute on Aging (project Z01 AG000949-02).
