Abstract
We compare the SNPbased and genebased association studies using 697 unrelated individuals. The BenjaminiHochberg procedure was applied to control the false discovery rate for all the multiple comparisons. We use a linear model for the singlenucleotide polymorphism (SNP) based association study. For the genebased study, we consider three methods. The first one is based on a linear model, the second is similarity based, and the third is a new twostep procedure. The results of power using a subset of SNPs show that the SNPbased association test is more powerful than the genebased ones. However, in some situations, a genebased study is able to detect the associated variants that were neglected in a SNPbased study. Finally, we apply these methods to a replicate of the quantitative trait Q1 and the binary trait D (D = 1, affected; D = 0, unaffected) for a genomewide gene search.
Background
Our aims are (1) to compare singlenucleotide polymorphism (SNP) based and genebased association studies and (2) to apply both methods to a genomewide search for associated genes. To check the effect of covariates, we use linear models for the numerical traits Q1, Q2, and Q4 and a logistic model for the binary trait D. We modify the original trait value by adjusting for significant covariate effects. Then in the SNPbased association study, we delete alias SNPs and use a linear or logistic model to find the pvalue for each SNP.
In the genebased association study, we consider three methods. The first two methods are two major types of multilocus association study methods, and the third method is a new method that we propose. Specifically, the first method is based on multilinear regression [1]. The second method is similarity based and is useful for binary traits only [2]. The pvalues are calculated using the cumulantbased estimation procedure [3]. These two multilocus association study methods can have reduced power as a result of increased degrees of freedom [4]. To solve this problem, we propose a twostep procedure that first classifies and collapses multilocus genotypes and then uses the classic T and Pearson chisquare statistics to perform an association test. To reduce computational time, the overall pvalue is obtained using a selfadjusted permutation procedure, as described by Knuth [5].
From our analysis, we conclude that the detection of association signals for rare variants is a challenging problem. The SNPbased association study is more powerful in general. However, the genebased study can pick up some association signals neglected by a SNPbased study.
Methods
Data
We use the genotype data from 697 unrelated individuals in the Genetic Analysis Workshop 17 (GAW17) data set. For the first aim (comparing SNPbased and genebased association studies), we consider all four traits and use all 200 replicates. For computational convenience, we include only a subset of genes, which is formed by the 9 true associated genes (39 SNPs) for trait Q1, the 13 true associated genes (72 SNPs) for trait Q2, the additional 15 genes (51 SNPs) for the binary trait D, and 50 other randomly selected nonassociated genes (731 SNPs). The total number of genes is 86, and they contain 893 SNPs. For the second aim (applying both methods to a genomewide search for associated genes), we use the complete list of all the genes and SNPs, which includes 3,205 genes (24,487 SNPs). Because Q1 has the most significant association and the binary trait D has the weakest association, we consider only a fixed replicate data set for these two traits as representation.
Effect of covariates
We considered the three covariates Sex, Age, and Smoking before conducting association studies. We use linear models for the quantitative traits Q1, Q2, and Q4 and a logistic model for the binary trait D. We also check assumptions in the models to make sure they are appropriate. For each trait and replicates, we first fit the corresponding models that include all three covariates to test whether they are significant. Then we include only the significant covariates in the model and calculate residuals. These residuals are used as new trait values to perform association studies.
SNPbased association study
Because there are numerous rare SNPs, it is likely that SNPs at two different loci will be in complete linkage disequilibrium. That is, R^{2} = D′ = 1. We call these alias SNPs; they do not contribute additional information to an association study. Therefore, to increase power, we use only one copy of the alias SNPs. The new data set generated here is called the cleaned data. In the subset of SNPs for our first aim (comparing SNPbased and genebased association studies), the cleaned data include 83 genes and 773 SNPs. In the complete data set for our second aim (applying both methods in a genomewide search for associated genes), the cleaned data include 2,987 genes and 15,076 SNPs. Note that the cleaned data here are used to control the false discovery rate (FDR) only.
We assume an additive genetic model. For each trait and SNP in the cleaned data, we perform simple linear (or logistic) regression. The pvalue to test the null hypothesis that the slope is 0 versus the alternative that the slope is not 0 is recorded. To adjust for multiple testing, we use the BenjaminiHochberg (BH) procedure to control FDR [6]. That is, let m be the number of SNPs in the cleaned data, and let P_{(}_{i}_{)} be the ith ordered pvalue. Then for a prespecified 0 <α < 1, define:
as the BH threshold. When the tests are independent, this procedure guarantees that FDR ≤ α.
Genebased association study
In the genebased association study, we consider all 86 genes and 893 SNPs. Likewise, we use the BH procedure to control FDR. For each gene, the multilocus genotype is defined as the composition of genotypes at all SNPs included in the gene. If the gene contains L SNPs, then in theory the total number of possible genotypes for the gene is 3^{L}. However, because of rare variants and linkage disequilibrium in nearby SNPs, the actual observed number of multilocus genotypes is much less.
The first genebased association study method is referred as GL, which is based on a linear (or logistic) model. That is, traits are regressed on all the SNPs included in the gene. The pvalue to test the null hypothesis that all coefficients are 0 versus the alternative that at least one coefficient is not 0 is recorded.
The second genebased method is referred as GS, which is similarity based and is used for the binary trait D only. The test statistic D_{s} is defined as:
where is a vector of the estimated frequencies of multilocus genotypes in the affected group, is a vector of those in the unaffected group, and A = (a_{ij}) is the genotype similarity matrix [4]. Here the similarity score a_{ij} between the ith and jth distinct genotypes is defined to be 1 minus the weighted average of absolute differences between numbers of minor alleles at each SNP of a gene. The weight is the inverse of the minor allele frequency (MAF), which accounts for the assumption that genomes that share minor alleles are more similar than those that share common ones [7]. According to Tong et al. [3], the distribution of D_{s} under the null hypothesis can be approximated by a chisquare distribution or the difference between two chisquare distributions.
The third genebased method is referred as G2, which has two steps. In the first step, we define a null multilocus genotype to be the one with genotype g = 0 at all the SNP loci, where g is the number of minor alleles. We then compare each distinct multilocus genotype with the null genotype and classify it into one of the following three groups: (1) C = 0 (not distinguishable from the null genotype); (2) C = −1 (negatively contributed to the trait); and (3) C = 1 (positively contributed to the trait). In the second step, we compare the means of the trait values for the three groups. For each gene, we estimate two pvalues (one for the C = 1 group compared with the C = 0 group and one for the C = −1 group compared with the C = 0 group) using a permutation procedure.
In a permutation, we first randomly assign an observed trait value to each individual and then follow the twostep procedure to find the mean differences for comparison with the observed ones. To improve the efficiency of computation, we use a selfadjusted procedure to decide the number of permutations needed [5]. Specifically, we require that the standard deviation of the estimated pvalue, denoted , be at most 10% of it. That is, let R be the number of permutations, and let S be the number of times that the mean difference in a permutation is greater than or equal to the observed mean difference. Then . The permutation stops when , which is equivalent to S ≥ 100R/(100 + R). The minimum number of permutations is set at 100.
Comparison of association studies
Because the number of hypotheses in a SNPbased study is different from the number of hypotheses in a genebased study, we have to redefine the false positive rate (FPR), the FDR, and the power for a SNPbased study to make them comparable with the genebased values. Specifically, consider all the SNPs in a gene. If at least one minor allele contributes significantly to the trait at a predefined significance level, then this gene is detected. Then for each replicate data set, the list of significant genes is determined. In this list of genes, some are true associated variants and some are not. The observed FPR is calculated as the number of reported false positives divided by the total number of true null genotypes. The observed FDR is the number of reported false positives divided by the total number of reported positives. The power (or alternatively, true discovery rate) is the number of reported true positives divided by the total number of true alternatives. For the genebased studies, the list of significant genes is determined directly given a predefined significance level. Likewise, the observed FPR, FDR, and power are calculated. The observed FDR versus power receiver operating characteristic (ROC) plot is presented to compare the overall performance of these methods.
Results and discussion
Effect of covariates
We estimate the parameter values in linear models separately for each of the 200 replicate data sets. The linear or logistic model assumptions seem appropriate. The effects of covariates are consistent in the 200 replicates. For trait Q1, Age and Smoking are significant, which explain 15% of the variation. The value of Q1 increases with age and is higher in smokers. No covariates contribute to trait Q2. For trait Q4, all three covariates are significant and explain 80% of the variation. The value of Q4 is smaller in females than it is in males, decreases when age increases, and is smaller in smokers. For the binary trait D, Age and Smoking are significant. The probability of the event {D = 1} increases when age increases and is higher in smokers.
Comparison of SNPbased and genebased association studies
We first check the effect of true associated SNPs using a linear model. The results show that 35% of the variation in Q1 can be explained by the 39 true associated SNPs in 9 genes; 22% of the variation in Q2 can be explained by the 72 true associated SNPs in 13 genes; there are no true associated SNPs for Q4. The variation in D explained by true associated genes cannot be calculated because the original values of the liability used to determine D are unknown.
Table 1 lists the number of positive genes for each method applied to each trait as well as the observed FDR. From this table, we see that the observed FDRs can be much larger than the values to be controlled. For example, in Q1 when α = 0.25 and the method is SNPbased, the observed FDR is 0.849, although in theory we should have FDR <α if all genes are independent. There are two possible reasons for this result. First, the total number of tests is only 86, which is not large compared to the number of associated genes. Second, these tests are not independent. Although the BH procedure to control FDR is not satisfied, we find that in the genebased studies the situation can be better than in the SNPbased one. The reason may be that independence between genes is easier to satisfy than independence between SNPs. Several investigators have discussed controlling the FDR in correlated tests [811], but we do not discuss the issue here because of space limitations.
Table 1. Number of positive genes and estimated FDR
We made a power comparison for traits Q1, Q2, and D. The results are similar. We show results only for trait D in the following discussion for conciseness. Figure 1 compares the power of the SNP and genebased methods for the binary trait D. It is obvious that the SNPbased method is most powerful on average for an association test. The GL and G2 methods are similar and are better than the GS method. However, no method is superior in all the situations. Table 2 gives the power comparison for each true associated gene. We list only the genes with power greater than or equal to 0.3 when the observed FPR is 0.1. From this table, we see that the power to detect true associated genes for trait D is generally low. For the genes for which an associated SNP is not rare, such as FLT1 (SNP C13S523, MAF = 0.067), the SNPbased method is preferred because it is simpler and more powerful. However, some genes, such as FLT4 and HIF1A, have multiple rare alleles (MAF ≤ 0.012) that contribute to the trait; in this case the genebased methods can have more power than the SNPbased method. Therefore we conclude that the genebased association might be able to pick up some association signals that are neglected in a SNPbased search.
Figure 1. ROC plot for trait D The xaxis is the average observed false positive rate over 200 replicates and 50 nonassociated genes. The yaxis is the average observed true positive rate (or power) over 200 replicates and 36 truly associated genes. SNP, SNPbased method; GL, genebased linear method; GS, genebased similarity method; G2, genebased twostep method.
Table 2. Power to identify each associated gene for trait D
Genomewide search for associated genes
TableÂ 3 lists the results from the genomewide association studies based on SNPs and genes for trait Q1 (9 associated genes out of 3,205). The control of FDR using the BH procedure almost completely fails here. For example, when we use α = 10^{−5}, the observed FDRs are 2/3, 1/3, and 3/4 for SNP, GL and G2 methods respectively. Because it is extremely hard to find associated genes here, we might instead lower our goal and try to get a (fairly large) subset of genes containing associated ones. For example, when α = 0.5, there are 6 (out of 1,272), 7 (out of 1,253), and 8 (out of 977) true associated genes using the SNPbased, GL and G2 methods respectively. It seems that in this data set the G2 method is the best, the GL method is the second best, and the SNPbased method is the worst, because the G2 method reports fewer significant genes, which contain more true associated ones instead.
Table 3. Comparison of number of significant genes using trait Q1
The problems of weak power and a high FPR are much more severe for the other traits when the genetic variation is lower. For example, for the binary trait D, we applied all four methods to data set 1. The results from these methods are similar. For the GS method, in the 100 most significant genes there are 3 true associated ones, and in the 1,000 most significant genes there are 14 true associated ones. Note that if we equally likely select 1,000 from 3,205 genes (37 true associations), the expected number of true associated genes reported is 11.5, which is less but not very much less than 14. This indicates that pvalues here are not very informative. Therefore it is almost impossible to identify true associated genes in this situation.
Conclusions
The detection of an association signal for rare variants is a challenging problem because of insufficient data. The SNPbased association study is more powerful than genebased methods on average. However, when multiple rare variants contribute to a trait, the genebased association study not only gives more reliable control over FDR but also is possible to pick up association signals neglected by a SNPbased study.
Competing interests
The authors declare that there are no competing interests.
Authors’ contributions
LT performed the statistical analysis and drafted the manuscript. JY participated in the group discussion and helped with the statistical analysis and manuscript. BT requested data and helped to draft the manuscript. RSC helped to draft the manuscript. All authors read and approved the final manuscript.
Acknowledgments
We would like to thank the referees for their constructive suggestions. This work was supported in part by grants from the National Institutes of Health (NIH) (R01 GM031575), the National Heart, Lung, and Blood Institute (NHLBI) (R01 HL053353), and the Charles R. Bronfman Institute for Personalized Medicine at Mount Sinai Medical Center (New York).
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/17536561/5?issue=S9.
References

Wang K, Abbott D: A principal components regression approach to multilocus genetic association studies.
Genet Epidemiol 2008, 32:108118. PubMed Abstract  Publisher Full Text

Li B, Leal SM: Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data.
Am J Hum Genet 2008, 83:311321. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tong L, Yang J, Cooper RS: Efficient calculation of pvalue and power for quadratic form statistics in multilocus association testing.
Ann Hum Genet 2010, 74:275285. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sha Q, Chen HS, Zhang S: A new association test using haplotype similarity.
Genet Epidemiol 2007, 31:577593. PubMed Abstract  Publisher Full Text

Knuth DE: The Art of Computer Programming, v. 2, Seminumerical Algorithms. London, AddisonWesley; 1998.

Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing.

Wessel J, Schork NJ: Generalized genomic distancebased regression methodology for multilocus association analysis.
Am J Hum Genet 2006, 79:792806. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Benjamini Y, Yekutieli D: The control of the false discovery rate under dependency.
Ann Stat 2001, 29:11651188. Publisher Full Text

Reiner A, Yekutieli D, Benjamini Y: Identifying differentially expressed genes using false discovery rate controlling procedures.
Bioinformatics 2003, 19:368375. PubMed Abstract  Publisher Full Text

Storey JD: The positive false discovery rate: a Bayesian interpretation and the qvalue.
Ann Stat 2003, 31:20132035. Publisher Full Text

Storey JD, Taylor JE, Siegmund D: Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: a unified approach.
J R Stat Soc Ser B 2004, 66:187205. Publisher Full Text