Department of Biostatistics, University of Colorado Anschutz Medical Campus, Aurora, USA

Department of Epidemiology, University of Colorado Anschutz Medical Campus, Aurora, USA

Department of Biostatistics, Harvard School of Public Health, Boston, USA

Channing Laboratory, Harvard Medical School, Boston, USA

Institute for Genomic Mathematics, University of Bonn, Bonn, Germany

German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany

Abstract

Background

For genetic association studies in designs of unrelated individuals, current statistical methodology typically models the phenotype of interest as a function of the genotype and assumes a known statistical model for the phenotype. In the analysis of complex phenotypes, especially in the presence of ascertainment conditions, the specification of such model assumptions is not straight-forward and is error-prone, potentially causing misleading results.

Results

In this paper, we propose an alternative approach that treats the genotype as the random variable and conditions upon the phenotype. Thereby, the validity of the approach does not depend on the correctness of assumptions about the phenotypic model. Misspecification of the phenotypic model may lead to reduced statistical power. Theoretical derivations and simulation studies demonstrate both the validity and the advantages of the approach over existing methodology. In the COPDGene study (a GWAS for Chronic Obstructive Pulmonary Disease (COPD)), we apply the approach to a secondary, quantitative phenotype, the Fagerstrom nicotine dependence score, that is correlated with COPD affection status. The software package that implements this method is available.

Conclusions

The flexibility of this approach enables the straight-forward application to quantitative phenotypes and binary traits in ascertained and unascertained samples. In addition to its robustness features, our method provides the platform for the construction of complex statistical models for longitudinal data, multivariate data, multi-marker tests, rare-variant analysis, and others.

Background

In genetic association studies, individuals are often recruited based on case-control ascertainment conditions of the primary phenotype

We present a more general approach that does not require any distribution assumptions for the secondary phenotype. We refer to the approach as the non-parametric population-based association test (NPBAT). The approach has a form similar to the Family Based Association Test (FBAT), a non-parametric test statistic that is frequently used in the family based setting

The general concept of the proposed association-testing framework is to condition on the phenotype of interest and treat only the genetic data as random

We illustrate the practical advantages of NPBAT by an application to the COPDGene study. The COPDGene study is a case-control study of the genetics of COPD in current or former smokers with at least 10 pack-years of smoking history

Methods

In a genetic association study, let $X_i$ denote the genotype of individual $i$. The coding of $X_i$ will depend upon the genetic model under consideration. For instance, for an additive model, $X_i = 0, 1, 2$ for 0, 1, 2 disease alleles, respectively. $X_i$ may also be a vector in order to test several alleles simultaneously. Let $T_i$ denote the numerical trait information for individual $i$. For example, $T_i$ could equal one for affected individuals and $T_i$ could equal zero for unaffected individuals. Different coding functions are applied depending on the phenotype of interest. For binary and continuous traits, we will discuss efficient coding schemes below. First, we define a general class of test statistics as

Note that $X_i$ is the dependent variable and that we condition upon the numerical trait information $T_i$. The NPBAT statistic has the following form:

where $\mu_x$ denotes the expectation of the marker score/genotype. $\mu_x$ can be estimated based on the sample mean of the genotypes. The asymptotic distribution of the NPBAT statistic under the null hypothesis depends on the estimation of $\mu_x$ and on the specification of the trait information $T_i$, and is derived in the Appendix.
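As a minimal sketch, assume the statistic takes the FBAT-like standardized form $Z = \sum_i T_i (X_i - \mu_x) / \sqrt{\widehat{\mathrm{Var}}(X) \sum_i T_i^2}$. This form is an assumption made for illustration and is not necessarily the exact expression used by NPBAT. The function name `npbat_like_statistic` is hypothetical, and the sketch does not apply the adjustment for estimating $\mu_x$ that is derived in the Appendix.

```python
import math
from statistics import mean, pvariance


def npbat_like_statistic(genotypes, coded_traits, mu_x=None):
    """Standardized score with the genotype treated as the random variable.

    Assumed form (illustrative only):
        Z = sum_i T_i * (X_i - mu_x) / sqrt(var(X) * sum_i T_i^2)
    The coded traits T_i are treated as fixed constants (conditioned upon).
    """
    if len(genotypes) != len(coded_traits):
        raise ValueError("genotypes and coded_traits must have equal length")
    if mu_x is None:
        # Estimate the genotype expectation by the sample mean.
        mu_x = mean(genotypes)
    # Population-style variance estimate of the genotype coding.
    var_x = pvariance(genotypes)
    score = sum(t * (x - mu_x) for x, t in zip(genotypes, coded_traits))
    denominator = math.sqrt(var_x * sum(t * t for t in coded_traits))
    if denominator == 0:
        raise ValueError("statistic undefined: no genotype or trait variation")
    return score / denominator


# Toy additive coding (0/2 copies) with centered trait values.
z = npbat_like_statistic([0, 2, 0, 2], [-1, 1, -1, 1])
```

In this toy example the trait values rise with the allele count, so the score is positive. Reversing the sign of the coded traits flips the sign of the statistic.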

There are various ways to code the phenotype of interest and define the coding function $T_i$. For the analysis of affection status, one could specify the coding function to be $T_i = 1$ or $T_i = 0$, depending on the disease status of the proband. However, as we show in Appendix A (Offset choice when Y is binary), a more efficient way is to set

If $Y_i$ is a continuous phenotype that is in fact normally distributed, we recommend $T_i = Y_i - \mu_y$, where $\mu_y$ is the phenotypic mean in the general population.

While it is appealing that the NPBAT statistic is comparable to standard methods in these simple scenarios, its real appeal arises when phenotype information is available for only some subjects but genetic information is available for all subjects. For example, in case-control studies, an additional quantitative phenotype may be available for the cases but not the controls. When testing for a genetic association with this additional quantitative phenotype, the NPBAT statistic uses the genotypes of both the cases and the controls with the optimal coded phenotype $T_i = Y_i - \mu_{\text{offset}}$, where $\mu_{\text{offset}}$ is a constant. The choice of this constant is described in detail in the Simulations subsection, and the asymptotic distribution of the NPBAT statistic is derived in the Appendix. Using this optimal offset choice, the NPBAT statistic has a substantial increase in power over other methods such as the NPBAT statistic with other offset choices, the allelic $\chi^2$ test and the genotypic $\chi^2$ test

Adjustments for population admixture

The NPBAT statistic can be adjusted for population admixture by using standard methods such as principal components analysis or genomic control

Extension to multiple phenotypes

The NPBAT statistic can be extended to

Note that $T_i$ is the vector of coded phenotypes for individual $i$ and $X_i$ is just one marker. So $S$ is

where

Due to the estimation of $\mu_x$ based on the sample, this statistic does not have a chi-square distribution, and a permutation test needs to be used to assess significance levels. This can be done using the NPBAT software package (see Appendix D).
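A permutation test of this kind can be sketched as follows. This is an illustration under stated assumptions, not the procedure implemented in the NPBAT software package: the genotypes are shuffled relative to the fixed phenotypes, and the p-value is the share of permuted statistics at least as extreme as the observed one. The names `permutation_p_value` and `sum_of_squared_scores` are hypothetical, and the multi-phenotype statistic used here (a plain sum of squared per-phenotype scores) is a stand-in rather than the form of $S$ given in the text.

```python
import random


def sum_of_squared_scores(genotypes, trait_vectors):
    """Hypothetical multi-phenotype statistic (stand-in for illustration):
    sum over phenotypes k of (sum_i T_ik * (X_i - mean(X)))^2."""
    mu_x = sum(genotypes) / len(genotypes)
    n_traits = len(trait_vectors[0])
    total = 0.0
    for k in range(n_traits):
        s_k = sum(t[k] * (x - mu_x) for x, t in zip(genotypes, trait_vectors))
        total += s_k * s_k
    return total


def permutation_p_value(genotypes, traits, statistic, n_permutations=999, seed=0):
    """Permute genotypes against fixed traits and return a two-sided p-value."""
    rng = random.Random(seed)  # local generator keeps the result reproducible
    observed = abs(statistic(genotypes, traits))
    shuffled = list(genotypes)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(statistic(shuffled, traits)) >= observed:
            exceed += 1
    # Add-one correction so the p-value is never exactly zero.
    return (exceed + 1) / (n_permutations + 1)


# Toy data: 20 individuals, two coded phenotypes that track the genotype.
toy_genotypes = [0] * 10 + [2] * 10
toy_traits = [(-1.0, -1.0)] * 10 + [(1.0, 1.0)] * 10
p_value = permutation_p_value(toy_genotypes, toy_traits, sum_of_squared_scores)
```

Conditioning on the phenotypes and permuting only the genotypes mirrors the general idea of treating the genetic data as random. The number of permutations bounds the smallest attainable p-value at 1/(n_permutations + 1).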

Simulations

In genetic association case-control studies, only the cases may have additional phenotypic information available. For instance, in a case-control study where the cases have asthma (the primary phenotype), only the cases may have FEV measurements (the secondary phenotype). In this scenario, the secondary phenotype FEV will be more severe than it would be in the general population and the analysis of this secondary phenotype can be misleading due to the ascertainment of subjects based on the primary phenotype, asthma. To simulate this scenario, we generated the genotype X for 500 cases and 500 controls and a secondary phenotype Y for only the 500 cases from a truncated normal distribution with standard deviation

We compute the NPBAT statistic with the coded phenotype $T_i = Y_i - \mu_{\text{offset}}$, where $\mu_{\text{offset}}$ is a constant that ranges from -5 to 15 and $\mu_x$ is the sample mean of the genotypes in the cases. We also compute the NPBAT statistic with $\mu_x$ equal to the sample mean of the genotypes in the controls and $\mu_x$ equal to the sample mean of the genotypes in the cases and the controls. We compare the power of these three NPBAT statistics to the Improved Score Test, which is uniformly more powerful than score tests based on the generalized linear model such as the Cochran-Armitage trend test, the allelic $\chi^2$ test and the genotypic $\chi^2$ test
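The interplay between the offset and the three choices of $\mu_x$ can be illustrated with a short sketch. It assumes that the numerator of the statistic is the score $\sum_{i \in \text{cases}} (Y_i - \mu_{\text{offset}})(X_i - \mu_x)$, which is an assumption made for illustration, and the function names are hypothetical. Under this assumption, when $\mu_x$ is the genotype mean of the cases, the deviations $X_i - \mu_x$ sum to zero over the cases, so the score does not change with the offset. When $\mu_x$ comes from the controls, the score depends on the offset.

```python
from statistics import mean


def genotype_mean(case_genotypes, control_genotypes, source):
    """Estimate mu_x from the cases, the controls, or both groups pooled."""
    if source == "cases":
        return mean(case_genotypes)
    if source == "controls":
        return mean(control_genotypes)
    if source == "both":
        return mean(list(case_genotypes) + list(control_genotypes))
    raise ValueError("source must be 'cases', 'controls' or 'both'")


def score_numerator(case_genotypes, case_phenotypes, offset, mu_x):
    """Assumed score: sum over cases of (Y_i - offset) * (X_i - mu_x).

    Only the cases contribute trait information; the controls enter
    solely through the estimate of mu_x.
    """
    return sum((y - offset) * (x - mu_x)
               for x, y in zip(case_genotypes, case_phenotypes))


# Toy data: four cases with a secondary phenotype, four controls without.
case_x = [0, 1, 2, 2]
case_y = [1.0, 2.0, 3.0, 4.0]
control_x = [0, 0, 1, 1]

mu_cases = genotype_mean(case_x, control_x, "cases")
mu_controls = genotype_mean(case_x, control_x, "controls")

for offset in (0.0, 10.0):
    s_cases = score_numerator(case_x, case_y, offset, mu_cases)
    s_controls = score_numerator(case_x, case_y, offset, mu_controls)
    print(offset, s_cases, s_controls)
```

The printed score for the case-based $\mu_x$ is the same for both offsets, while the control-based score changes. This sketch covers only the numerator; the variance of the statistic, which determines power, is treated in the Appendix.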

Under the null hypothesis, the NPBAT method maintains a significance level of approximately 5% or less, whether $\mu_x$ is the sample mean of the cases, the controls, or both (see the figure on power and significance levels). The spike or drop in the plots occurs where $\mu_x$ is based on the genotype of the controls and $\mu_{\text{offset}}$ is significantly different from the phenotypic mean of the cases. When $\mu_x$ is based on the genotype of the cases, the power of the NPBAT approach is similar to that of the improved score test and the regression. Note that the power of the NPBAT approach when $\mu_x$ is based on the genotype of both the cases and the controls is best for high values of heritability.

**Power and Significance levels for NPBAT, the Improved Score Test and the Likelihood Ratio Test (LRT).** This plot compares the power and type-1 error rate of the NPBAT method using $\mu_x$ based on the sample mean of the cases, the controls, and both the cases and controls. The power and significance levels of this method are compared to the improved score test and a standard linear regression. Note that the spike or drop in all the plots occurs where $\mu_x$ is based on the genotype of the controls and $\mu_{\text{offset}}$ is significantly different from the phenotypic mean of the cases. When $\mu_x$ is based on the genotype of the cases, the power of the NPBAT approach is similar to that of the improved score test and the regression. Note that the power of the NPBAT approach when $\mu_x$ is based on the genotype of both the cases and the controls is best for high values of heritability.

Based on these simulations, when analyzing secondary phenotypes correlated with case-control status in case-control studies, we recommend setting $\mu_{\text{offset}}$ to a constant significantly different from the phenotypic mean of the sample and $\mu_x$ equal to the genotypic mean of the controls. In this situation, a robust and efficient choice for $\mu_{\text{offset}}$ is the phenotypic mean in the general population. Note that the results of these simulations are analogous to findings for the FBAT statistic in family studies, where it was found that, when ascertaining cases only from a quantitative distribution, one needed to choose an offset that was outside the range of the cases' phenotypic values

Data analysis

We applied the NPBAT method to the Genetic Epidemiology of COPD (COPDGene) Study which is a multi-center case/control study designed to identify genetic factors associated with COPD and to characterize COPD-related phenotypes

| Method | NPBAT: | NPBAT: | Improved Score Test | Regression |
|---|---|---|---|---|
| rs1051730 | 0.00134 | 0.00138 | 0.00227 | 0.00259 |
| rs8034191 | 0.00386 | 0.00391 | 0.00694 | 0.00744 |

Results and discussion

NPBAT is a new statistical framework for population-based genetic association tests that does not require making specific assumptions about the distribution of the phenotype. By conditioning on the phenotype, NPBAT is robust against violations of phenotypic model assumptions. The practical implications of NPBAT are demonstrated by its application to the COPDGene Study. FNDS, a measure of nicotine dependence, was assessed in current smokers, who represent 31% of study participants in COPDGene. We analyzed SNPs shown to be associated with FNDS

1. when a sample is ascertained based on case/control status and the phenotype of interest is correlated with case status

2. in a cohort study in which prevalent cases are excluded (i.e. the classic epidemiologic cohort study) and the phenotype of interest is correlated with the disease of interest

3. a pharmacogenetics study using a randomized clinical trial when participants are ascertained based on the levels of the target of therapy

The broad application of NPBAT is to scenarios where samples are ascertained based on selection criteria that are correlated with the phenotype of interest.

Conclusions

In conclusion, the key advantage of the proposed approach is its robustness against misspecification of the phenotypic model. This enables extensions to different types of traits and the integration of complex statistical models for the phenotype, while the validity of the approach is not compromised by such generalization. Though the power is sensitive to the offset choice, NPBAT is valid regardless of the offset. As with all population-based association tests, population stratification can be a problem. Adjusting for known population substructure using principal components of ancestry informative markers (AIMs) or using genomic control can reduce the impact of population stratification. The NPBAT software package that implements this method is described in Appendix D.

Appendix

Appendix A: Offset choice when Y is binary

The following considers the offset choice for the coded trait $T$ when $Y$ is binary. Assume the phenotype of interest is binary and the genotype of interest follows an additive model. Let $r_0$, $r_1$, and $r_2$ denote the number of cases with 0, 1, and 2 disease alleles, respectively. Let $n_0$, $n_1$, and $n_2$ denote the number of cases and controls with 0, 1, and 2 disease alleles, respectively. Let

In this scenario, let the coded phenotype $T_i = Y_i - \mu_y$, where $\mu_y$ is the offset. The NPBAT statistic has the following form:

Note that the numerators of both statistics are the same. The ratio of the test statistics can be written as follows:

where $\mu_y$ is

Appendix B: asymptotic distribution when the secondary phenotype is available for both the cases and controls

To derive the asymptotic distribution of the NPBAT statistics for various phenotypic offset choices, let $T_{\text{offset}} = \left((Y_1 - \mu_{\text{offset}}), \ldots, (Y_n - \mu_{\text{offset}})\right)^t$ denote the vector of coded phenotypes. The summands of the statistic are independent and have expectation zero, which ensures asymptotic normality of

Since each summand has a discrete distribution, the Lindeberg condition can only be fulfilled when the integration set

Hence the integral in the Lindeberg condition is always computed over a set that is empty for

Note that the statistic is maximized and has a standard normal distribution when $\mu_{\text{offset}} =$

Appendix C: asymptotic distribution when the secondary phenotype is only available for the cases

Here, we derive the asymptotic distribution of the NPBAT statistic for secondary phenotypes in case/control studies. Consider a case-control study where genetic information is available for both the cases and the controls, but the phenotypic information is only available for the cases. Here $X_1, \ldots, X_n$ is the coded genotype of the cases but

then

It is important to note that the summands of the statistic are independent and have expectation zero, which ensures asymptotic normality of

Since each summand has a discrete distribution, the Lindeberg condition can only be fulfilled when the integration set

Hence the integral in the Lindeberg condition is always computed over a set that is empty for

Then the NPBAT statistic is normally distributed with mean zero and variance given above. Note that the variance is always greater than or equal to one and equals one for a particular value of $\mu_{\text{offset}}$. When $\mu_x$ is based on the controls and the phenotype information is only available for the cases, the power is maximized when $\mu_{\text{offset}} \approx$

Appendix D: NPBAT software

A software package implemented in C++ to compute both single phenotype and multiple phenotypes NPBAT statistics is available for download at the following website:

Abbreviations

COPD: Chronic obstructive pulmonary disease; FBAT: Family Based Association Test; FEV: Forced expiratory volume; FNDS: Fagerstrom nicotine dependence score; GWAS: Genome Wide Association Study; LRT: Likelihood ratio test; NPBAT: Nonparametric Population Based Association Test; PC: Principal component.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

SL derived the asymptotic distribution and performed the simulation studies and the data analysis. SL, WKY, JH, NL, and CL were involved in drafting the manuscript or revising it critically. CL made substantial contributions to the conception of the method and assisted in the simulation studies. All authors read and approved the final manuscript.

Acknowledgements

We would like to acknowledge Carla Wilson at National Jewish Health for her help with the COPDGene dataset. This work was funded by NIH/NHLBI U01 HL089856 (Edwin K. Silverman, PI). COPDGene is supported by NHLBI Grant Nos. U01HL089897 and U01HL089856.