Abstract
Significance of genetic association to a marker has been traditionally evaluated through statistics that are standardized such that their null distributions conform to some known ones. Distributional assumptions are often required in this standardization procedure. Based on the observation that the phenotype remains the same regardless of the marker being investigated, we propose a simple statistic that does not need such standardization. We propose a resampling procedure to assess this statistic’s genomewide significance. This method has been applied to replicate 2 of the Genetic Analysis Workshop 17 simulated data on unrelated individuals in an attempt to map phenotype Q2. However, none of the selected SNPs are in genes that are diseasecausing. This may be due to the weak effect that each genetic factor has on Q2.
Background
The traditional approach to hypothesis testing is to construct a test statistic whose exact or approximate distribution under the null hypothesis is known. Alternative methods for assessing significance of a test statistic include resampling techniques, such as the bootstrap and permutation. These methods dominate modern genetic association studies.
Statistics used in these methods typically make use of a standardization factor, for instance, the standard error of the observed genetic effect size. Different model assumptions result in different standardization factors. For instance, in a casecontrol genetic association study, we are interested in testing whether the difference in the frequency of a reference allele between case subjects and control subjects is 0. The popular allelic test and the CochranArmitage test for trend are based on two different estimates of the variance of this difference. The estimate for the allelic test requires HardyWeinberg equilibrium in the combined casecontrol population, whereas the estimate for the CochranArmitage test does not. The properties of these two tests are rather different [1].
It is a brilliant idea to construct a test statistic through standardization. It makes it possible to assess the significance of the observed effect size with respect to a reference distribution. However, in genomewide association studies there are typically tens of thousands of markers. Presumably most of these markers are under the null hypothesis (i.e., they are not associated). The presence of these null markers provides a natural reference distribution. Standardizing the effect size in this setting appears to be unnecessary and awkward.
We propose a novel resampling approach that aims to directly assess the significance of genetic effect size. The null distribution generated from this method is for a randomly selected singlenucleotide polymorphism (SNP). Hence it naturally addresses the genomewide significance of genetic effect size at a SNP and obviates the issue of multiple testing. To define the genetic effect size, we observe that the phenotype remains the same in a genome scan. Only the testing locations are different. This observation leads to a simple representation of the genetic effect size. This approach is demonstrated by means of a genomewide association study on phenotype Q2 in replicate 2 of the Genetic Analysis Workshop 17 (GAW17) simulated data on unrelated subjects.
Methods
Let n denote the number of individuals in a study. Let y_{i} denote the measure of a quantitative phenotype y on individual i. The score of the genotype at a SNP is denoted by g. Assuming that the two alleles of the SNP are denoted by A and B, we set g = 0 for genotype AA, g = 1 for genotype AB, and g = 2 for genotype BB. The genotype score on individual i is denoted by g_{i}.
To identify SNPs associated with phenotype y, we consider a regression in which the response variable is the genotype score g and the independent variable is the phenotype y. The (partial) regression sum of squares for g equals:
where and are the sample means of g_{i} and y_{i}, respectively. Because the denominator in expression (1) remains constant in a genome scan, a natural measure of association would be:
Note that in a casecontrol study is proportional to the difference in the frequency of allele B between case subjects and control subjects. If there are p covariates x_{j}, j = 1, …, p, then phenotype y_{i} is replaced by the residual from the following linear regression:
where x_{ij} is the value of x_{j} for individual i. Similarly, genotype g is replaced by the residual g* of the regression of g over x_{j}, j = 1, …, p.
We use the statistic:
to measure the strength of association, where n* is the number of subjects whose genotype is nonmissing at the SNP being investigated. The sample mean of is 0, so it is dropped from the definition of S. So is the sample mean of . The purpose of using n* is to make the statistic S on the same scale because n* is expected to vary across the genome.
In comparison, the usual method for detecting association would consider the following regression:
The leastsquares estimate of β_{1}, denoted , is:
The statistic for testing for association is:
where is the standard error of . This procedure is equivalent to testing whether the coefficient of g is 0 in a multiple linear regression that includes x_{i}, …, x_{p} as covariates and y as the response. Because the total sum of squares for is fixed in a genome scan, there is a monotonic relationship between the regression sum of squares for Eq. (5) and the statistic T.
We propose the following resampling procedure to evaluate the genomewide significance of S: (1) Compute residuals and residuals from regression (3). (2) Randomly select a SNP from a set of SNPs that are under the null hypothesis. This null set can be determined by using a histogram of pvalue [2]. (3) Permute residuals (or, equivalently, permute residuals ). (4) Compute the statistic S for the randomly selected marker. (5) Repeat steps 2–4 the desired number of times, say K. (6) Let S^{(}^{k}^{)} denote the value of statistic S in the kth iteration. Suppose that the value of statistic S is s for the observed data. Its pvalue is computed as:
where I(·) is the indicator function satisfying I(True) = 1 and I(False) = 0.
Because the SNP selected in each iteration is random, the significance obtained in this way is genomewide.
Results
We analyze phenotype Q2 in replicate 2. There are 24,487 SNPs and 697 individuals. Residuals are computed from regression (3) with regressors Sex, Age, and Smoke. The residuals are computed in the same way. Genomewide pvalues from the tdistribution with 1 degree of freedom for statistic T are presented in Figure 1. There are 19 SNPs with a pvalue less than 5 × 10^{−4} (Table 1).
Figure 1. Genomewide plot of pvalues for the regular statistic T. The horizontal line corresponds to p = 5 × 10^{−4}.
Table 1. Selected SNPs based on pvalues of statistic T (p < 5 × 10^{−4})
The reference distribution for statistic S is obtained in the way described in the previous section. The value of K is set at 10,000,000. The histogram of pvalues for parameter β_{1} in regression (5) shows no apparent deviation from uniform distribution (data not shown). Thus all SNPs are used to determine the reference distribution for statistic S. Pvalues obtained from the resampling procedure are presented in Figure 2. Eighteen SNPs have a pvalue that is less than 5 × 10^{−4}. None of these SNPs are close to the 19 SNPs that have pvalues less than 5 × 10^{−4} for statistic T, except for C20S2223 on chromosome 20 (basepair position 61666107; p = 2.59 × 10^{−5} for statistic T; resampling p = 6.27 × 10^{−2} for statistic S). C20S2223 is located in gene PRIC285 identified by statistic S.
Figure 2. Genomewide plot of resampling pvalues for the proposed statistic S. The horizontal line corresponds to p = 5 × 10^{−4}.
Discussion and conclusions
Significance of genetic association at a SNP has been traditionally evaluated through a standardized test statistic such that its significance can be evaluated through a known distribution. For instance, in Eq. (7) the standard error of serves as a standardization factor. Its pvalue is assessed through a tdistribution or a standard normal distribution. This strategy works beautifully in general but is awkward in the context of genomewide association studies. The presence of abundant unassociated SNPs provides a natural reference distribution that is pertinent to the data. We have proposed a simple statistic S and a resampling method for generating its reference distribution. Although a distributional assumption is necessary to compute the standard error of , it is not required in our method. More important, the resampling procedure has a builtin mechanism for handling genomewide significance because the reference distribution is the distribution of an arbitrary SNP under the null hypothesis that there is no association. Application to the GAW17 simulated data on unrelated individuals (replicate 2) revealed several susceptible genes for phenotype Q2 that can be used as candidates for further investigation.
The essence of the proposed method is the existence of a large number of similar features—SNPs in the current context. The basic principle is applicable to other situations that involve a large amount of similar features, for instance, gene expression levels in gene expression studies.
We have focused on a continuous trait. As alluded to earlier, for a dichotomous trait, if there are no covariates, then the statistic S can be defined as the difference in frequency of the reference allele between the two phenotype groups. More generally, one can use a proportional odds model with genotype as the response and phenotype as the predictor. A proportional odds model is appealing for dealing with covariates and both continuous and dichotomous traits. We are actively investigating this possibility.
The proposed method is resampling based and requires intensive resampling. However, computation for each sample is fast. The overall computation time for K = 10,000,000 resampling iterations took less than two days on a computer running the Windows Vista operating system. The iteration coverage is about 408 per SNP (≈ K/24,487 SNPs). We have also used K = 100,000; the simulated pvalue remains almost the same. The resampling procedure needs to be done only once, regardless of the number of SNPs.
We are investigating the utility of our approach in other association study settings, such as genegene interaction and population stratification.
We had no knowledge of the GAW17 diseasegenerating model when this analysis was conducted. It turns out that none of the genes in Table 1 and Table 2 are disease causing. This may be due to the weak effect that each genetic factor has on Q2.
Table 2. Selected SNPs based on resampling pvalues of statistic S (p < 5 × 10^{−4})
Competing interests
The authors declare that there are no competing interests.
Authors’ contributions
KW conceived of the study and participated in its design and coordination and helped to draft the manuscript. JH helped to draft the manuscript. All authors read and approved the final manuscript.
Acknowledgments
KW’s research is partly funded by the National Institute for Environmental Health Sciences (National Institutes of Health) (NIEHS/NIH) grant P30 ES05605 through the University of Iowa Environmental Health Sciences Research Center. JH’s work is partly supported by NIH grant R01 CA120988 and National Science Foundation (NSF) grant DMS 0805670. We would like to thank the two anonymous referees for their helpful comments and suggestions.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/17536561/5?issue=S9.
References

Sasieni PD: From genotypes to genes: doubling the sample size.
Biometrics 1997, 53:12531261. PubMed Abstract  Publisher Full Text

Storey JD, Tibshirani R: Statistical significance for genomewide studies.
Proc Natl Acad Sci USA 2003, 100:94409445. PubMed Abstract  Publisher Full Text  PubMed Central Full Text