Department of Biostatistics, Harvard School of Public Health, Boston, MA, 02115, USA

Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, 44106, USA

Abstract

Background

Incorporating family data in genetic association studies has become increasingly appreciated, especially for its potential value in testing rare variants. We introduce here a variance-component based association test that can test multiple common or rare variants jointly using both family and unrelated samples.

Results

The proposed approach implemented in our R package aggregates or collapses the information across a region based on genetic similarity instead of genotype scores, which avoids the power loss when the effects are in different directions or have different association strengths. The method is also able to effectively leverage the LD information in a region and it can produce a test statistic with an adaptively estimated number of degrees of freedom. Our method can readily allow for the adjustment of non-genetic contributions to the familial similarity, as well as multiple covariates.

Conclusions

We demonstrate through simulations that the proposed method achieves good performance in terms of Type I error control and statistical power. The method is implemented in the R package “fassoc”, which provides a useful tool for data analysis and exploration.

Background

With the availability of cost-effective next-generation sequencing platforms, the analysis of low-frequency and rare variants has become an active research topic. Such variants are believed to play an important role in the etiology of common complex diseases and may explain a portion of the missing heritability.

In this work, we present an R package that implements a variance-component (VC) based association test that can test multiple common or rare variants jointly using both family and unrelated samples. The VC or linear mixed model (LMM) based approach aggregates or collapses the information across a region based on genetic similarity instead of genotype effects, which avoids the power loss when the effects are in different directions. A comparison study of binary traits

Our method readily allows adjustment for a non-genetic contribution to familial similarity (shared environmental effects), as well as for multiple covariates such as principal components of population structure. We compare the performance of the proposed method with a fixed-effect or sum test based on Feasible Generalized Least Squares (FGLS) across a range of simulation scenarios. We also investigate the factors that influence power for testing rare variants, and we show the connection between our method and kernel machine based methods.

Implementation

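Assume there are $N$ individuals in the sample. For individual $i$, $y_i$ denotes the observed quantitative trait value; $\mathbf{x}_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,m})'$ denotes an $m \times 1$ vector of covariates; $\mathbf{s}_i = (s_{i,1}, s_{i,2}, \ldots, s_{i,n})'$ denotes an $n \times 1$ vector of genotype scores; and $\mathbf{z}_i = (z_{i,1}, z_{i,2}, \ldots, z_{i,n})'$ denotes a standardized genotype vector with the $j$th element $z_{i,j} = (s_{i,j} - 2p_j)/\sqrt{2p_j(1 - p_j)}$, where $p_j$ is the minor allele frequency of the $j$th variant.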

Linear mixed model and score test

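The setup of our model is similar to the linear mixed model recently proposed to estimate the genetic variance explained by genome-wide SNPs:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{W}\boldsymbol{\mu} + \boldsymbol{\epsilon},$$

where $\mathbf{y}$ is the vector of trait values, $\mathbf{W}$ is the matrix of standardized genome-wide SNP genotypes with $i$th row $\mathbf{w}_i'$, $\boldsymbol{\beta}$ is a vector of coefficients (fixed effects) for covariates $\mathbf{X}$, and $\boldsymbol{\mu}$ is a vector of causal variant effects whose entries are independent normal variables with mean zero and a common variance. Collecting the SNP effects into a single term gives the equivalent form $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\delta} + \boldsymbol{\epsilon}$, where $\boldsymbol{\delta} = \mathbf{W}\boldsymbol{\mu}$ is a vector representing the random effects of all SNPs, with covariance matrix proportional to $\mathbf{WW}'$.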

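To estimate and test the variance explained by a gene or a genomic region using both family and unrelated data, one can intuitively extend the above model to

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\delta} + \boldsymbol{\epsilon}, \qquad (1)$$

where $\boldsymbol{\gamma}$ is a vector of the random effects of the SNPs in the studied region, distributed as $N(\mathbf{0}, \mathbf{I}\tau/n)$, and $\boldsymbol{\delta}$ and $\boldsymbol{\epsilon}$ are the polygenic and residual effects. The phenotypic covariance matrix $\mathbf{V}_y$ can then be partitioned into components attributable to the SNPs in the studied region, polygenic and residual variances:

$$\mathbf{V}_y = \tau\mathbf{S} + \sigma_g^2\,\mathbf{G} + \sigma_e^2\,\mathbf{I},$$

where $\mathbf{S} = \mathbf{ZZ}'/n$ is the similarity matrix computed from the $n$ variants in the region and $\mathbf{G}$ is the relationship matrix of the polygenic effect (for example, twice the kinship matrix). A non-genetic contribution to familial similarity can be accommodated by adding a shared environmental random effect $\boldsymbol{\alpha}$, giving $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{g} + \boldsymbol{\delta} + \boldsymbol{\alpha} + \boldsymbol{\epsilon}$ with $\mathbf{g} = \mathbf{Z}\boldsymbol{\gamma}$.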
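The construction of the region-based similarity matrix can be illustrated with a minimal sketch. It assumes the usual standardization $z_{i,j} = (s_{i,j} - 2p_j)/\sqrt{2p_j(1-p_j)}$ and $\mathbf{S} = \mathbf{ZZ}'/n$ with $n$ the number of variants; the function names, the toy genotypes and the allele frequencies are hypothetical and are not taken from the "fassoc" package.

```python
import math

def standardize(genotypes, maf):
    """Return z[i][j] = (s[i][j] - 2 p_j) / sqrt(2 p_j (1 - p_j)).

    Assumption: genotypes are additive scores (0, 1, 2) and maf holds
    one minor allele frequency per variant.
    """
    return [[(s - 2.0 * p) / math.sqrt(2.0 * p * (1.0 - p))
             for s, p in zip(row, maf)]
            for row in genotypes]

def similarity_matrix(z):
    """Return S = Z Z' / n for an N x n matrix Z of standardized genotypes."""
    n_variants = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n_variants for zj in z]
            for zi in z]

# Toy data (hypothetical): 4 individuals, 3 variants.
genotypes = [[0, 1, 0],
             [1, 0, 0],
             [2, 1, 1],
             [0, 0, 0]]
maf = [0.30, 0.20, 0.05]

S = similarity_matrix(standardize(genotypes, maf))
```

Each entry `S[i][j]` is the average product of standardized genotypes of individuals `i` and `j` over the region, so rare-allele sharing contributes more than common-allele sharing.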

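This model can be readily applied to haplotype-based analysis with the design matrix for genotype scores $\mathbf{Z}$ replaced by a haplotype matrix $\mathbf{H}$, where the vector $\mathbf{H}_i$ records the haplotypes carried by individual $i$. The model becomes $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{H}\boldsymbol{\gamma}_h + \boldsymbol{\delta} + \boldsymbol{\epsilon}$, where $\boldsymbol{\gamma}_h$ represents the random effect of haplotypes and $\mathbf{S}_h$, which takes the place of $\mathbf{S}$, is a matrix of pair-wise similarity scores between the haplotype pairs of two individuals. Depending on how haplotype similarity is scored, $\mathbf{S}_h$ can be equivalent to the average allelic sharing across multiple markers between two individuals, in which case phase information is not required.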

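Our primary interest lies in detecting whether there is an effect of a genomic region on the phenotype, which is assessed by testing the null hypothesis $H_0: \tau = 0$. The REML score for $\tau$ is

$$U_\tau = \tfrac{1}{2}\left\{\mathbf{y}^{T}\mathbf{P}\mathbf{S}\mathbf{P}\mathbf{y} - \mathrm{tr}(\mathbf{P}\mathbf{S})\right\},$$

where $\mathbf{P} = \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}(\mathbf{X}^{T}\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{V}^{-1}$ is the projection matrix under the linear mixed model (1) and $\mathbf{V}$ is the covariance matrix of $\mathbf{y}$.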
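As a sanity check on the projection matrix, it may help to work through the simplest special case. This is an illustration only, not a setting analysed in this paper: it assumes an intercept-only model, $\mathbf{X} = \mathbf{1}$ (a column of $N$ ones), and independent observations, $\mathbf{V} = \sigma^{2}\mathbf{I}$.

```latex
\begin{aligned}
\mathbf{P}
  &= \frac{1}{\sigma^{2}}\mathbf{I}
   - \frac{1}{\sigma^{2}}\mathbf{1}
     \left(\mathbf{1}^{T}\,\frac{1}{\sigma^{2}}\,\mathbf{1}\right)^{-1}
     \mathbf{1}^{T}\frac{1}{\sigma^{2}}
   = \frac{1}{\sigma^{2}}\left(\mathbf{I} - \frac{1}{N}\mathbf{1}\mathbf{1}^{T}\right), \\
\mathbf{P}\mathbf{y}
  &= \frac{1}{\sigma^{2}}\left(\mathbf{y} - \bar{y}\,\mathbf{1}\right),
  \qquad
\mathbf{P}\mathbf{X} = \mathbf{0}.
\end{aligned}
```

In this case $\mathbf{P}$ simply centres the trait and rescales it by the residual variance, so any quadratic form in $\mathbf{P}\mathbf{y}$ depends on the data only through the residuals from the fixed-effect fit.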

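It will be convenient to denote the parameter of interest by $\tau$ and to collect the nuisance parameters (the polygenic and residual variances) in a vector $\boldsymbol{\theta}$. The score statistic is evaluated at $\tau = 0$ and $\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}$, where $\hat{\boldsymbol{\theta}}$ is estimated under the null model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\delta} + \boldsymbol{\epsilon}$. These estimates can be obtained using standard statistical software that implements mixed-model functionality, or even more easily with genetic analysis packages that can directly read in a kinship matrix, such as EMMA.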

However, the asymptotic distribution of the above score statistic is not a typical standard normal distribution (nor does the corresponding LRT statistic converge to a mixture of chi-square distributions), because the similarity matrix $\mathbf{S} = \mathbf{ZZ}'/n$ is not block diagonal.

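Under $H_0$, the quadratic form in the score statistic, written as $T_\tau = \mathbf{y}^{T}\mathbf{M}\mathbf{y}$, asymptotically follows a weighted sum of chi-square variables: $T_\tau \sim \sum_i \lambda_i \chi^2_{1,i}$, where the $\chi^2_{1,i}$ are independent chi-square variables with one degree of freedom, the $\lambda_i$ are the eigenvalues of $\mathbf{V}_0^{1/2}\mathbf{M}\mathbf{V}_0^{1/2}$, and $\mathbf{V}_0$ is the covariance matrix of $\mathbf{y}$ under $H_0$. A good approximation may be obtained using only the largest eigenvalues.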

Significance of a test can be evaluated empirically by simulating a large set of sums of chi-squared random variables, where the weights are the eigenvalues of $\mathbf{V}_0^{1/2}\mathbf{M}\mathbf{V}_0^{1/2}$; obtaining these eigenvalues is itself computationally demanding when the sample size is large. Furthermore, to ensure reliable results for a large effect size or a small α level, one needs to run a huge number of simulations. For instance, when α is set at $1 \times 10^{-5}$, at least $10^{7}$ simulations are needed for each test. This becomes computationally infeasible for a genome-wide scan. Here we consider Satterthwaite’s procedure to approximate the null distribution of $T_\tau$ by a scaled chi-square distribution, $\kappa\chi^2_\nu$, with the scale $\kappa$ and the degrees of freedom $\nu$ chosen to match the first two moments of $T_\tau$.
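The empirical evaluation described above can be sketched as follows. This is a minimal illustration, assuming the weights (eigenvalues) are already available; the weights, the observed statistic and the function names are hypothetical, and the add-one correction to the empirical p-value is a common convention rather than something stated in this paper.

```python
import random

def simulate_weighted_chisq(weights, n_sim, seed=1):
    """Draw n_sim values of sum_i w_i * Z_i^2 with independent Z_i ~ N(0, 1)."""
    rng = random.Random(seed)
    return [sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
            for _ in range(n_sim)]

def empirical_p_value(observed, null_draws):
    """Share of null draws at least as large as observed (add-one corrected)."""
    exceed = sum(1 for d in null_draws if d >= observed)
    return (exceed + 1) / (len(null_draws) + 1)

weights = [2.0, 1.0, 0.5, 0.25]          # hypothetical eigenvalues
null_draws = simulate_weighted_chisq(weights, n_sim=20000)
p_value = empirical_p_value(observed=12.0, null_draws=null_draws)
```

The smallest p-value this scheme can return is 1/(n_sim + 1), and with $10^{7}$ draws only about 100 are expected to exceed a threshold at α = $1 \times 10^{-5}$, which is why the simulation cost grows so quickly for stringent significance levels.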

To account for the fact that the nuisance parameters are estimated rather than known, the information used in this approximation may be replaced by the partial information, with $\mathbf{V}_0^{-1}$ replaced by the projection matrix $\mathbf{P}_0 = \mathbf{V}_0^{-1} - \mathbf{V}_0^{-1}\mathbf{X}(\mathbf{X}^{T}\mathbf{V}_0^{-1}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{V}_0^{-1}$.
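The moment matching behind Satterthwaite’s procedure can be sketched in a few lines. The sketch assumes the null distribution is a weighted sum of independent one-degree-of-freedom chi-square variables with known weights, and it matches the mean and variance directly from those weights; the function name and the example weights are hypothetical, and the package may compute the same quantities from information matrices instead.

```python
def satterthwaite(weights):
    """Match the mean and variance of sum_i w_i * chi2_1 with kappa * chi2_nu.

    A kappa * chi2_nu variable has mean kappa * nu and variance
    2 * kappa^2 * nu, which gives the two equations solved below.
    """
    mean = sum(weights)                            # E[sum_i w_i chi2_1]
    variance = 2.0 * sum(w * w for w in weights)   # Var[sum_i w_i chi2_1]
    kappa = variance / (2.0 * mean)                # scale
    nu = 2.0 * mean ** 2 / variance                # degrees of freedom
    return kappa, nu

kappa, nu = satterthwaite([2.0, 1.0, 0.5, 0.25])   # hypothetical eigenvalues
```

The p-value is then the upper tail probability of a chi-square distribution with `nu` degrees of freedom evaluated at the observed statistic divided by `kappa`. When all weights are equal, `nu` equals the number of weights; unequal weights give a smaller, generally non-integer `nu`, which is the sense in which the degrees of freedom adapt to the data.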

Satterthwaite’s procedure is fairly fast but may not perform well in the extreme tails of the distribution. An alternative would be to fit a distribution using the first three estimated moments rather than only the first two. One possibility is to assume that the distribution is a multiple of a non-central chi-square distribution, estimating the multiple and the two parameters of the non-central chi-square distribution from the first three empirical moments. Alternatively, one could fit a distribution that is a multiple of a power of a chi-square distribution, estimating the multiple, the power and the degrees of freedom.
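For a three-moment fit, the moments of a weighted sum of chi-square variables are available in closed form. The expressions below are standard results, written for $Q = \sum_i \lambda_i \chi^2_{1,i}$ with independent one-degree-of-freedom components; the symbols $Q$ and $\lambda_i$ are notation assumed here for illustration.

```latex
\begin{aligned}
\mathrm{E}(Q) &= \sum_i \lambda_i, \\
\mathrm{Var}(Q) &= 2\sum_i \lambda_i^{2}, \\
\mathrm{E}\!\left[\{Q - \mathrm{E}(Q)\}^{3}\right] &= 8\sum_i \lambda_i^{3}.
\end{aligned}
```

Satterthwaite’s procedure uses only the first two lines; a three-parameter family can additionally match the third, which controls the skewness and hence the behaviour in the upper tail.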

The VC score approach described above has the special advantage of being easily extended to, and compatible with, kernel machine regression, which allows for more flexible modeling of genetic effects. Methods like least-squares kernel machines (LSKM) and their variants have been successfully applied in multi-marker association tests with both common and rare variants.

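The LSKM model can be written as

$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + h(\mathbf{s}_i) + \epsilon_i,$$

where $h(\mathbf{s}_i)$ is an unknown function describing the effect of the genotypes $\mathbf{s}_i$ on the trait value. The function space that $h(\cdot)$ belongs to is determined by a kernel function $K(\cdot, \cdot)$; $K(\mathbf{s}_i, \mathbf{s}_j)$ can also be thought of as a similarity measure between the genotypes of individuals $i$ and $j$.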

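Given the close relationship between the LSKM and GLMM frameworks, Liu et al. proposed a score test of $H_0: h(\mathbf{z}) = 0$ based on the related linear mixed model. The corresponding model in our method is

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{h} + \boldsymbol{\delta} + \boldsymbol{\epsilon},$$

where $\mathbf{h}$ is regarded as a random effect with mean zero and covariance $\tau\mathbf{K}$, and $\mathbf{K}$ is a square matrix, with one row and one column per individual, whose $(i, j)$th element is $K(\mathbf{s}_i, \mathbf{s}_j)$. It can be shown that the best linear unbiased predictor (BLUP) of $\mathbf{h}$ and the corresponding estimator of $\boldsymbol{\beta}$ have the same form as those derived via LSKM estimation. The score test then takes the same form as before, with $\mathbf{S}$ replaced by $\mathbf{K}$.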

To accommodate rare variants, a weighted kernel function can be used so that similarity at rare variants is emphasized. Assuming additive genotypic coding, a weighted IBS kernel can be written as

$$K(\mathbf{s}_i, \mathbf{s}_j) = \frac{\sum_{l=1}^{n} w_l\,(2 - |s_{i,l} - s_{j,l}|)}{2\sum_{l=1}^{n} w_l},$$

where $w_l$ is a weight that depends on $p_l$, the minor allele frequency of the $l$th SNP or variant. A more flexible way is to base the weight on the density function of a beta distribution evaluated at the allele frequency, $w_l = \mathrm{Beta}(p_l; a, b)$. For a particular choice of weights, $w_l$ will be equal to $1/[p_l(1 - p_l)]$, in which case the weighted IBS kernel with the original genotype scores will be analogous (but not exactly identical) to using standardized genotypes in model (1). Under this formulation, the VC score test can be viewed as a special case of the LSKM approach.
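The effect of allele-frequency weights on a weighted IBS kernel can be seen in a small sketch. It assumes additive 0/1/2 coding, weights of the form 1/[p(1 − p)], and normalization by twice the sum of the weights so that an individual's similarity with itself is 1; the normalization, the function names and the toy data are assumptions made for illustration.

```python
def maf_weights(maf):
    """Weights 1 / (p (1 - p)), which up-weight rare variants."""
    return [1.0 / (p * (1.0 - p)) for p in maf]

def weighted_ibs_kernel(s_i, s_j, weights):
    """Weighted IBS similarity between two additively coded genotype vectors.

    Each variant contributes w_l * (2 - |s_il - s_jl|), i.e. the weighted
    number of alleles shared identical by state (0, 1 or 2).
    """
    shared = sum(w * (2 - abs(a - b)) for a, b, w in zip(s_i, s_j, weights))
    return shared / (2.0 * sum(weights))

maf = [0.30, 0.05, 0.01]      # hypothetical minor allele frequencies
w = maf_weights(maf)

a = [1, 0, 0]
b = [1, 0, 1]                 # differs from a only at the rarest variant
c = [0, 0, 0]                 # differs from a only at the common variant

k_ab = weighted_ibs_kernel(a, b, w)
k_ac = weighted_ibs_kernel(a, c, w)
```

Here `k_ab` is smaller than `k_ac`: a single-allele difference at the rare variant reduces similarity far more than the same difference at the common variant, which is the intended emphasis on rare variants.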

Simulations

We performed simulation studies to examine the type I error and power of the proposed score approach for detecting genetic variants under a range of scenarios, especially when rare variants are the cause of the phenotypic variation. We began by simulating 10,000 haplotypes of a 500 kb genomic region under the coalescent model using the software “cosi” (^{4}, mutation rate set at 1.5 × 10^{-8} per bp per generation, and the recombination rate varying across the region with a local window size of 100 kb. A total of 2883 variant locations were generated using this setting, of which 73 % had minor allele frequencies < 0.05. We randomly picked a region of 500 variants as our test region. In determining causal variants and risk haplotypes, we used a procedure similar to that described in Feng et al.

We determined the quantitative trait values based on a normal distribution. Specifically, we first calculated the causal genetic score $g$ of an individual from the number of risk haplotypes carried, coded as 0, 1, or 2. Next we generated the overall residual variance as $\mathrm{var}(g)(1/h^{2} - 1)$, in which $h^{2}$ is the proportion of phenotypic variance explained by a genomic region and $\mathrm{var}(g)$ is the theoretical variance of the genetic score. For offspring, the familial component of the residual was derived from the values of the two parents (subscripts $m$ and $f$), together with an individual-specific, normally distributed term.
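The relationship between the explained variance and the residual variance can be sketched as below. The sketch assumes a unit effect per risk haplotype and a binomial count of risk haplotypes, so that the variance of the score is 2p(1 − p); it draws unrelated individuals only and ignores the familial component. The function names and parameter values are hypothetical.

```python
import random

def residual_variance(var_g, h2):
    """Residual variance such that var_g / (var_g + var_e) equals h2."""
    if not 0.0 < h2 <= 1.0:
        raise ValueError("h2 must lie in (0, 1]")
    return var_g * (1.0 / h2 - 1.0)

def simulate_traits(genetic_scores, var_e, seed=1):
    """Add independent N(0, var_e) residuals to the genetic scores."""
    rng = random.Random(seed)
    sd = var_e ** 0.5
    return [g + rng.gauss(0.0, sd) for g in genetic_scores]

p_risk = 0.10                               # hypothetical risk-haplotype frequency
var_g = 2.0 * p_risk * (1.0 - p_risk)       # assumed: unit effect, binomial count
h2 = 0.02
var_e = residual_variance(var_g, h2)

# Number of risk haplotypes (0, 1 or 2) for 1,000 unrelated individuals.
rng = random.Random(7)
scores = [float((rng.random() < p_risk) + (rng.random() < p_risk))
          for _ in range(1000)]
traits = simulate_traits(scores, var_e)
```

With these values the residual variance is 49 times the genetic variance, which illustrates how small a share of the trait variance a region with $h^{2}$ = 0.02 accounts for.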

We designed various simulation scenarios by changing parameters such as $h^{2}$, sample sizes, and the proportion of risk haplotypes. Each set-up consisted of 200 independent replications, with both genotypes and phenotypes regenerated each time. To compare with fixed-effect sum tests, each data set was also analyzed by the feasible generalized least squares regression model (FGLS). FGLS is very similar to generalized least squares except that it uses an estimated variance-covariance matrix (which can be obtained under the null model).

We have evaluated the type I error for the proposed method by generating 400,000 replicates under the H_{0}. Figure

Quantile-quantile plot comparing empirical p-values (based on 400,000 simulations under the no-association model) against those expected under the null.


Results and discussion

In our primary set of simulations for power comparison, 500 nuclear families and 2,000 unrelated individuals were generated, based on the simulation procedures described above, where the proportion of phenotypic variance explained by a region was set at (0, 0.01,…, 0.05). Each data set was analyzed by four different strategies: (1) the proposed VC-score test with all 500 variants; (2) the FGLS test using the genotype sum of 500 variants; (3) the VC-score test with only rare variants (with minor allele frequency (MAF) < 0.02 in the sample) included; (4) the FGLS test using the genotype sum of rare variants. Because we used standardized genotypic scores and true MAF thresholds for rare variants, results from method (4) should represent the best results that a weighted-sum aggregation test could possibly reach. The power was assessed at the 0.05 and $1 \times 10^{-6}$ significance levels using 200 replications. When α was set at 0.05 and $h^{2}$ (heritability) was set at 0, all analysis strategies maintained type I error rates around the nominal level. The power of the VC-score test was close to or higher than that of the FGLS method under all scenarios. The VC-score method was also robust to the number of noise markers: excluding common variants (all non-causal) gave a noticeable power increase for the FGLS method, but had almost no effect on the VC method. We also tried the VC method using the genotype sum of rare variants only. Results are not presented here because they are exactly the same as those from method (4), in view of the equivalence of the two statistics when the genotype sum is used.

The simulation results indicate that, under the current simulation settings and sample sizes, the proposed method has adequate power to detect a genomic region with $h^{2}$ around 0.01 in a candidate gene analysis, or a region with $h^{2}$ around 0.02 in a genome-wide scan. The table below reports the power at significance levels of 0.05, $1 \times 10^{-5}$, and $1 \times 10^{-6}$, respectively. Three different designs were considered. In design I we included an additional 1,000 unrelated individuals, while in design II we added another 250 families (approximately the same genotyping effort as 1,000 unrelated individuals). Both designs gave a clear power increase compared to the previous simulations (around 15% more when $h^{2}$ is below 0.03), but the increase in design I was slightly greater than that in design II. Our preliminary simulations show that the difference can be more pronounced when using a smaller base sample size. As is generally accepted, an association analysis using related individuals is less informative than one using the same number of unrelated individuals, and is thus less powerful. In practice, families are not randomly sampled but are often selected through probands or because of existing linkage evidence. We explored this effect in design III. Rather than going through the complex modeling of the ascertainment process, we created an enriched risk haplotype pool by directly removing 2,000 non-risk haplotypes. Therefore, each risk haplotype has a little more than a 1/8 chance of being assigned to a family founder instead of about 1/10. As shown in the table, power under design III was at least as high as under design II at every $h^{2}$ and significance level considered.

| **Design** | **$h^{2}$** | **α = 0.05** | **α = 1 × 10^{-5}** | **α = 1 × 10^{-6}** |
|---|---|---|---|---|
| I. 500/3000 | 0.01 | 0.840 | 0.355 | 0.270 |
| | 0.02 | 1 | 0.855 | 0.780 |
| | 0.03 | 1 | 0.985 | 0.980 |
| II. 750/2000 | 0.01 | 0.890 | 0.345 | 0.235 |
| | 0.02 | 1 | 0.825 | 0.725 |
| | 0.03 | 1 | 0.975 | 0.970 |
| III. 750^{a}/2000 | 0.01 | 0.94 | 0.435 | 0.335 |
| | 0.02 | 1 | 0.92 | 0.880 |
| | 0.03 | 1 | 1 | 0.990 |

Note: The design column indicates the number of families / number of unrelated individuals. The last three columns give power at the stated significance level (α). Only nuclear families are simulated, with each family having two parents and a mean of two children.

a. 750 families simulated with enriched risk haplotypes.

We also indirectly compared the performance of the VC and FGLS methods by varying parameters that affect the effect sizes. We calculated the power of the two methods when the proportion of risk haplotypes was set at 5%, i.e., only 500 haplotypes were tagged as risk in the 10,000-haplotype pool. Although each individual then has a smaller chance of carrying a risk haplotype, fewer causal variants are simulated, each with a larger effect size (if the variance explained by the region is held fixed). Both methods showed a substantial power increase compared to the first simulation, but the VC method improved more than the FGLS. In a simulation set-up where causal SNPs (rare variants only) were not assigned independently (instead, pairs of SNPs close to each other, and thus correlated, were selected), the VC method had a slight power improvement while the FGLS had a small loss in power. Detailed results are listed in the additional file.

**Additional simulation results and software.**


Many extensions are possible for improved implementation of the proposed model and testing procedure. The method can be easily extended to incorporate nonlinear and interaction effects. As discussed previously, our method can be considered a special case in the framework of the kernel machine method, so interaction and nonlinear effects among markers can be included in the model by specifying a valid kernel function or similarity metric. More flexible weights may also be incorporated into the kernel function according to allele frequencies or other prior information. Although a normally distributed trait was assumed throughout this study, the derived score statistic is also appropriate for non-normal traits.

Conclusions

We propose a multi-marker VC-based association test using both family and unrelated data. A fast score test has been built on the ML and REML framework, in which only the parameters in the null model need to be estimated. Owing to the non-block-diagonal structure of the genotype-based similarity matrix, the score statistic derived has a different form from that based on the typical VC model for linkage analysis. We demonstrate through simulations that the proposed method achieves good performance in terms of Type I error control and statistical power. The method is implemented in the R package “fassoc”. We believe that “fassoc” will be a useful tool to complement existing software for family-based association studies.

Availability and requirements

Project name: fassoc package

Project home page:

Operating system(s): Linux, Mac OS X, Windows

Programming language: R

Other requirements: R (≥2.15.1)

License: GNU GPL

Any restrictions to use by non-academics: none except those posed by the license

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

XW and NJM participated in the design of the study and implementation of the method. XW drafted the manuscript. XZ and RCE participated in the conception and design of the study and in editing the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The work was supported by the National Research Foundation of Korea, funded by the Korean Government, grant number NRF-2011-220-C00004, and by the National Institutes of Health, grant numbers HL086718 from the National Heart, Lung and Blood Institute, HG003054 and U01HG006382 from the National Human Genome Research Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Human Genome Research Institute or the National Institutes of Health.