Abstract
The common genetic variants identified through genomewide association studies explain only a small proportion of the genetic risk for complex diseases. The advancement of nextgeneration sequencing technologies has enabled the detection of rare variants that are expected to contribute significantly to the missing heritability. Some genetic association studies provide multiple correlated traits for analysis. Multiple trait analysis has the potential to improve the power to detect pleiotropic genetic variants that influence multiple traits. We propose a genelevel association test for multiple traits that accounts for correlation among the traits. Gene or regionlevel testing for association involves both common and rare variants. Statistical tests for common variants may have limited power for individual rare variants because of their low frequency and multiple testing issues. To address these concerns, we use the weightedsum pooling method to test the joint association of multiple rare and common variants within a gene. The proposed method is applied to the Genetic Association Workshop 17 (GAW17) simulated miniexome data to analyze multiple traits. Because of the nature of the GAW17 simulation model, increased power was not observed for multipletrait analysis compared to singletrait analysis. However, multipletrait analysis did not result in a substantial loss of power because of the testing of multiple traits. We conclude that this method would be useful for identifying pleiotropic genes.
Background
The common disease/common variant hypothesis states that common variants contribute substantially to common diseases [1,2]. Following this hypothesis, genomewide association studies have successfully detected associations with common variants. However, such common variants explain only a small proportion of the phenotypic variation. Many of the as yet undetected common variants may have small effect sizes; therefore they are not expected to contribute significantly to the missing heritability. An alternative theory, the common disease/rare variant hypothesis, argues that a large number of rare variations with moderate to high penetrances account for genetic susceptibility to common disease [1]. Recently, deepresequencing studies of candidate genes have provided some evidence supporting the common disease/rare variant hypothesis [3]. Although various statistical methods have been developed to detect associations with common variants for common diseases, these methods are inefficient for rare variants because of the small number of observations for each single rare variant. One feasible method for rare variant analysis is to pool multiple rare variants within a gene or region and to test their joint effect. This category of methods has been reviewed by Dering et al. [4].
Some genetic association studies examine a qualitative trait, such as the casecontrol status and some additional correlated quantitative traits. For example, a genetic study of diabetes may examine the diabetic status and other related phenotypes, such as body mass index and other lipid profiles. Similarly, a glaucoma study may explore the related endophenotypes, such as central corneal thickness, intraocular pressure, and maximum vertical cuptodisc ratio. One way to analyze these data is to perform singletrait analyses separately. An alternative way is to perform a multipletrait analysis, which potentially has improved power to identify the pleiotropic variants for these traits [5,6].
Univariate test statistics or pvalues of multiple traits may not be independent because of the environmental or genetic correlations among multiple traits. Hence some classical methods of combining independent pvalues, such as Fisher’s method, are not directly applicable to the analysis of multiple correlated traits. Our purpose in this paper is to develop a test statistic for combining these correlated univariate statistics by considering the correlation structure among multiple traits. Motivated by a recently proposed approach [7], we developed a gene association test to test the joint effect of multiple variants within a gene on multiple correlated traits. The proposed method considers genes as basic units and uses the weighted sum [8] to combine the effects of multiple variants. The test statistic of multiple traits is the linear or quadratic combination of the univariate test statistics. It is likely that some rare variants may contribute to only a subset of available traits. Therefore we also conduct an alternative test on a preselected subset of multiple traits.
Methods
Let Y = (Y_{1}, …, Y_{m})^{T} denote the m available traits. Assume that the gene k has L genotyped singlenucleotide polymorphisms (SNPs), including both common and rare ones. In the first step, the genetic score S_{j} of the gene k for an individual j is calculated using the weighted sum of all SNPs within the gene. Second, a univariate test is performed to establish the association of genetic scores with all the traits separately. Then, a genelevel association test using the linear or quadratic combination of singletrait univariate statistics is constructed for multiple traits. Finally, the optimal subset of traits is selected for multipletrait analysis. The details of the various steps are described in what follows.
Gene score using weighted sum
The weightedsum gene score assigns different weights to each variant based on the estimated allele frequencies [8]. The score for gene k for individual j is given by:
where I_{ij} is the number of minor alleles for SNP i in individual j,
and m_{i} is the total number of minor alleles for SNP i in all n individuals. In the original study [8], the allele frequencies were estimated only for the control subjects. Because multipletrait analysis needs to analyze multiple quantitative traits as well as the disease status, in the present study we estimate the allele frequencies using all individuals.
Association test for multiple traits
Let S = S_{1}, S_{2}, …, S_{n})^{T} denote the scores for gene k for n individuals. The test statistic for testing the association of S with each trait Y_{l} (l = 1, …, m) is denoted Z_{l}, and it is assumed to asymptotically follow N(0, 1) under the null hypothesis. The choice for Z_{l} is discussed in the Results section. The test statistic for the combined multiple traits (CMT) method is a quadratic or linear combination of m univariate test statistics Z_{1}, …, Z_{m}.
The CMT method is motivated by a recently proposed approach [7] to test the joint effect of multiple correlated SNPs. Because the test statistics Z_{1}, …, Z_{m} are assumed to asymptotically follow N(0, 1), the joint distribution of the random vector Z = (Z_{1}, …, Z_{m})^{T} is asymptotically multivariate normal N_{m}(0, Σ), where Σ is the correlation matrix of Z. The quadratic combination statistic is given by:
and it is distributed asymptotically as a chisquare distribution with m degrees of freedom if Σ is full rank [9]. Because Σ is unknown, it is estimated by permutations. The phenotypes are permuted, and the m univariate test statistics are computed under each permutation. The estimation of Σ is given by:
under N such permutations. The pvalue of the quadratic statistic () can be approximated by the chisquare quantile.
An alternative statistic for multiple correlated traits is the linear combination statistic :
which is distributed asymptotically as N(0, 1) [9]. Here also the covariance matrix Σ is replaced by its estimate , obtained from permutations. The pvalue of the linear statistic can be approximated by the standard normal quantile. Note that the linear combination statistic may result in a loss of power if the direction of association is not the same for all of the traits. By contrast, if the effects on these traits are in the same direction, it is possible that the power of may be better than the power of . In general, is more robust than .
Association test for the optimal subset of multiple traits
It is likely that some of the relevant rare variants are associated with only a subset of traits. In this case, combining all the test statistics using the CMT method may result in a loss of power because of the high degree of freedom in the chisquare distribution. Therefore we propose an association test for optimally combined multiple traits (OCMT) using a preselected subset of these traits. To select the optimal subset, we use the CMT method to calculate the pvalues for all possible subsets with at least two traits. The subset with the minimum pvalue is selected as the optimal one, denoted A*. For any subset with at least two traits A, the quadratic combination statistic is given by:
where Z_{A} and Σ_{A} are the subvector and submatrix of Z and Σ, respectively. The pvalue is obtained from the chisquare quantile. The pvalue of A* () is given by:
To control type I error, the pvalue of the OCMT method is obtained using a permutation procedure that is based on the permutation of phenotypes. For each permutation π, the subset with the minimum pvalue is selected as the optimal subset, denoted A*(π). The pvalue of OCMT () is defined as the proportion of permutations with , where is the pvalue of A*(π).
Results
We applied the CMT and OCMT methods to the GAW17 simulated miniexome data sets [10]. The results are reported in two parts. In the first part, the power of the proposed method is compared to the power of the singletrait analysis. Initially, the CMT and OCMT methods were proposed and applied to the GAW17 data set without the knowledge of the simulation model. These original results were presented at the GAW17 meeting. Because the simulation model used to create the GAW17 data was discussed at the workshop, we reran the analysis with the knowledge of the simulation model. Here we present the results of the revised analysis based on the knowledge of the simulation model. In the second part, we present some insights into the falsepositive rate of the CMT method.
For each gene, we tested each trait separately using the ttest statistic of the β coefficient corresponding to the gene score in the logistic regression (for the disease status D) or the linear regression (for quantitative traits Q1, Q2, and Q4) adjusted for three covariates (Age, Sex, and Smoking status). The test statistic Z_{l} (l = 1, …, m) was obtained from the inverse normal distribution transformation of the ttest statistic [7] and was assumed to have a standard normal distribution.
The association tests for multiple traits were performed by using the CMT method with the quadratic combining statistic. The pvalue was approximated using the theoretical quantiles of the chisquare distribution. The correlation matrices among the test statistics Z_{1}, …, Z_{m} were estimated by 1,000 permutations. In addition, the association test using the OCMT method was performed, and its pvalue was obtained from another set of 1,000 permutations. Given a predefined significance level α, the power of any association test was defined as the proportion of 200 replicates that returned a pvalue less than or equal to α.
The samples in the GAW17 data set were collected from six cohorts, but population stratification was not considered in the simulation model. We performed the association tests with and without corrections for stratification. To correct for stratification, we corrected genotypes and phenotypes using the first 10 principal component scores derived using Eigenstrat [11].
Power of single and multipletrait analyses
Table 1 presents the power to detect the Q1 causal genes at a significance level of 0.05. Without adjusting for stratification, the univariate test for Q1 has a power greater than or equal to 0.3 for eight genes. With the adjustment, only four genes (FLT1, KDR, VEGFA, and VEGFC) still have a power greater than or equal to 0.3. For Q2 and Q4, most of the genes have a power less than 0.1. Some of the genes also are associated with the disease status (power ≥ 0.3).
Table 1. Power to detect the causal genes on Q1 at the 0.05 significance level
We used the CMT method with the quadratic statistic to perform the multipletrait analysis for Q1 and D (Q1+D). For all the Q1 causal genes, the power of the Q1+D analysis was less than or comparable to the power of the univariate test for Q1. The results show that all Q1 causal genes may have small or no pleiotropic effects that are insufficient to compensate for the increase of the critical value from to . Checking the GAW17 simulation model revealed that this result is reasonable and consistent with the simulation model.
Among all the Q1 causal genes, under the simulation model, only ELAVL4 was assumed to have pleiotropic effects on Q1 and the latent liability. Moreover, only two rare variants in this gene with MAF = 0.000717 (C1S3181 and C1S3182) contributed to the latent liability. Therefore it could be difficult to detect the association of ELAVL4 with the latent liability. The other eight causal genes on Q1 were assumed to have no pleiotropic effects. Therefore multipletrait analysis did not improve the power for Q1 causal genes. However, multipletrait analysis did elucidate the genetic correlation between the disease status and related quantitative traits and aided in the identification of the pleiotropic genes.
The pvalue of the OCMT method is the minimum of the pvalues of all subsets with at least two traits. Multipletrait analysis was performed for all subsets of Q1, Q2, Q4, and D using the CMT method. The optimally combined multiple traits were selected on the basis of their pvalues. The power of the OCMT method was comparable with that of the analysis for Q1+Q2+Q4+D. The last column of Table 1 summarizes the subset with the highest power among all the possible subsets with at least two traits and their powers. Most of the genes had the highest power when Q1 and D were combined. This finding shows that the OCMT method provides a way to select the best combination of traits, because the disease status is derived on the basis of multiple traits and Q1 has the biggest effect size.
Table 2 summarizes the power to detect the causal genes on Q2, given a significance level of 0.05. In general, the univariate test for Q2 has the largest power, and the combination of Q2 and D has relatively good performance compared with the other subsets of multiple traits. Table 3 summarizes the power to detect the causal genes on the latent liability. After adjusting for population stratification, all the genes had a power less than 0.3 for both single and multipletrait analyses. The power of the analysis for Q1+Q2+Q4+D was less than or comparable to that of the singletrait analysis for D. This result is not surprising, because only ELAVL4 has a small pleiotropic effect on Q1; the other genes have no effects on Q1, Q2, or Q4.
Table 2. Power to detect the causal genes on Q2 at the 0.05 significance level
Table 3. Power to detect the causal genes on the latent liability at the 0.05 significance level
Falsepositive rates of single and multipletrait analyses
On the basis of the results adjusted for population stratification, we calculated the falsepositive rates of single and multipletrait analyses. Genes with a power greater than or equal to 0.3 were considered the associated findings. Luedtke et al. [12] reported that 695 genes were spuriously associated with the disease status. Excluding these 695 genes, we identified 10 causal genes (4 on Q1, 6 on Q2, and 3 on D) and 77 falsepositive genes (62 on Q1, 14 on Q2, 0 on Q4, and 6 on D) in the singletrait analysis for Q1, Q2, Q4, and D. The falsepositive rate of the singletrait analysis was equal to 0.031. The multipletrait analysis Q1+Q2+Q4+D detected six causal genes and 58 falsepositive genes. The falsepositive rate of the Q1+Q2+Q4+D analysis was equal to 0.023. This result shows that, compared with the singletrait analysis, the CMT method combining multiple traits does not increase the falsepositive rate.
Discussion
In some genetic association studies, multiple correlated traits are available that can be used to identify genes responsible for multiple traits. In singletrait analysis, some common associated genes may be found across different traits. These overlapping associations may be caused by pleiotropic genes and/or the correlation structure among traits. Multipletrait analysis has the potential to improve the power to detect pleiotropic genes. The multipletrait analysis proposed here did not suffer a significant loss of power, even though true models, such as the GAW17 simulated model, had small or no pleiotropic effects. The proposed method considers the correlation matrix, thereby ensuring that the falsepositive rate is not inflated by the correlation among multiple traits.
In nextgeneration sequencing data sets, huge numbers of rare variants are genotyped across the whole genome. Because of the small number of observations for each rare variant, statistical tests for common variants are inefficient at identifying the associations. Hence the proposed method performs a genelevel test using the weighted sum to test the joint effect of multiple variants within a gene. The weighted sum is a feasible choice for the genebased score [8], but it is not the only choice; other methods are available for pooling multiple rare variants [4].
From Tables 1 and 3, it can be seen that the correction for population stratification has a large effect on the power to detect some associations with Q1, D, and Q1+Q2+Q4+D. This phenomenon also was observed from the quantilequantile plots in replicate 1 (data not shown). Although we did not consider the population structure in the simulation model, the principal component scores may influence the phenotypes, because the population structures would be similar to those of the true causal genes in this data set [13]. We conclude that the rare variant associations unadjusted for population stratifications should be interpreted with caution.
Conclusions
With the advent of nextgeneration sequencing technologies, the identification of rare variants has become realistic. Statistical methods for common variants are not applicable in rare variant analysis because of the small number of observations and the huge number of rare variants across the whole genome. Thus efficient statistical methods are needed. When multiple related traits are available, it is expected that multipletrait analysis has the improved power to detect pleiotropic genes. We proposed the CMT and OCMT methods to examine the joint effect of multiple variants on multiple traits. The proposed method uses the quadratic or linear combination of univariate test statistics and thus considers the correlation structure among multiple correlated traits. The CMT and OCMT methods were applied to the GAW17 miniexome data. The results show that the method is suitable for multiple trait analysis.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JZ conceived of the study, participated in its design and wrote the paper. AT participated in the design of the study and made the major edits. Both authors read and approved the final manuscript.
Acknowledgments
We thank the GAW17 organizers for providing the exome data set. The Genetic Association Workshop is supported by National Institutes of Health grant R01 GM031575. We also thank G. T. H. Keong for his assistance in the analysis of population stratification.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/17536561/5?issue=S9.
References

Hirschnorm JN, Daly MJ: Genomewide association studies for common disease and complex traits.
Nat Rev Genet 2005, 6:95108. PubMed Abstract  Publisher Full Text

Iyengar SK, Elston RC: The genetic basis of complex traits: rare variants or “common gene, common disease”?
Meth Mol Biol 2007, 376:7184. Publisher Full Text

Cohen JC, Boerwinkle E, Mosley THJ, Hobbs HH: Multiple rare alleles contribute to low plasma levels of HDL cholesterol.
Science 2004, 305:869872. PubMed Abstract  Publisher Full Text

Dering C, Pugh E, Ziegler A: Statistical analysis of rare variants: An overview of collapsing methods.

Jiang C, Zheng ZB: Multiple trait analysis of genetic mapping for quantitative trait loci.
Genetics 1995, 140:11111127. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Liu XG, Liu YJ, Liu J, Pei Y, Xiong DH, Shen H, Deng HY, Papasian CJ, Drees BM, Hamilton JJ, et al.: A bivariate whole genome linkage study identified genomic regions influencing both BMD and bone structure.
J Bone Mineral Res 2008, 23:18061814. Publisher Full Text

Luo L, Peng G, Zhu Y, Dong H, Amons CI, Xiong M: Genomewide gene and pathway analysis.
Eur J Hum Genet 2010, 18:10451053. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Madsen BE, Browning SR: A groupwise association test for rare mutations using a weighted sum statistic.
PLoS Genet 2009, 5:e1000384. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bickel PJ, Doksum KA: Mathematical Statistics: Basic Ideas and Selected Topics. 2nd edition. Upper Saddle River, NJ, Prentice Hall, v. 1; 2001:506508.

Almasy LA, Dyer TD, Peralta JM, Kent JW Jr., Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 miniexome simulation.
BMC Proc 2011, 5(suppl 9):S2. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal component analysis corrects for stratification in genomewide association studies.
Nat Genet 2006, 38:904909. PubMed Abstract  Publisher Full Text

Luedtke A, Powers S, Petersen A, Sitarik A, Bekmetjev A, Tintle NL: Evaluating methods for the analysis of rare variants in sequence data.
BMC Proc 2011, 5(suppl 9):S119. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Qin H, Elston R, Zhu X: Interrogating population structure and its impact on association tests.
BMC Proc 2011, 5(suppl 9):S25. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text