Department of Epidemiology and Biostatistics, Case Western Reserve University, 10900 Euclid Ave, Cleveland, OH 44106, USA

Key Laboratory for Applied Statistics of MOE and School of Mathematics and Statistics, Northeast Normal University, Changchun 130024, China

Abstract

The Genetic Analysis Workshop 17 data we used comprise 697 unrelated individuals genotyped at 24,487 single-nucleotide polymorphisms (SNPs) from a mini-exome scan, using real sequence data for 3,205 genes annotated by the 1000 Genomes Project and simulated phenotypes. We studied 200 sets of simulated phenotypes of trait Q2. An important feature of this data set is that most SNPs are rare, with 87% of the SNPs having a minor allele frequency less than 0.05. For rare SNP detection, in this study we performed a least absolute shrinkage and selection operator (LASSO) regression and

Background

With the rapid development of technologies, more and more single-nucleotide polymorphisms (SNPs) have become available and, in particular, most of the rare variants can be identified using the next-generation sequencing technique. However, detecting associated rare variants that contribute to phenotypic variation is still a huge challenge. Current approaches for testing rare variants include grouping the rare variants based on a threshold of the minor allele frequency (MAF)

Methods

Data checking

In the Genetic Analysis Workshop 17 (GAW17) simulated data set, there are no missing genotype data. Among all the 24,487 SNPs, 91% have a MAF less than 0.1, 87% have a MAF less than 0.05, and 75% have a MAF less than 0.01. Moreover, 39% of the SNPs have a MAF less than 0.001, which leads to 9,433 SNPs being singletons among 697 unrelated individuals. Owing to the rareness of the variants, we do not examine Hardy-Weinberg disequilibrium as a quality control procedure in this study. Thus we include all SNPs and all individuals for the association analysis.
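The frequency figures above can be checked with a short calculation. The following is a minimal sketch, assuming genotypes are coded as minor-allele counts (0, 1, or 2) per individual; the function name `minor_allele_frequency` and the toy genotype vector are illustrative and are not part of the GAW17 distribution. It shows that a singleton among 697 diploid individuals has a MAF of 1/(2 × 697), which falls below the 0.001 threshold, and that 9,433 singletons correspond to roughly 39% of the 24,487 SNPs.

```python
def minor_allele_frequency(genotypes):
    """MAF for one SNP, given allele counts (0, 1, or 2) per individual."""
    n_alleles = 2 * len(genotypes)          # diploid: two alleles per person
    freq = sum(genotypes) / n_alleles
    return min(freq, 1.0 - freq)            # report the rarer allele


n_individuals = 697

# A singleton: exactly one copy of the variant allele in the whole sample.
singleton = [1] + [0] * (n_individuals - 1)
singleton_maf = minor_allele_frequency(singleton)

# Share of the 24,487 SNPs that are singletons, from the counts in the text.
singleton_fraction = 9433 / 24487

print(f"singleton MAF = {singleton_maf:.5f}")
print(f"singleton share = {singleton_fraction:.3f}")
```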

LASSO regression

To deal with the singular matrix in linear regression caused by the rare variants, we adopt a statistical method that effectively shrinks the coefficients of unassociated SNPs and reduces the variance of the estimated regression coefficients. Here, we apply the LASSO penalty, which constrains the sum of the absolute values of the regression coefficients through a tuning parameter.
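To illustrate how a LASSO penalty shrinks coefficients, the following is a minimal sketch of cyclic coordinate descent with soft-thresholding. It assumes the standard LASSO criterion, 0.5 × RSS + λ Σ|β_j| with an unpenalized intercept; the scaling of the criterion, the function names, and the toy data are assumptions made for illustration and may differ from the formulation and software used in this study.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; return 0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize 0.5 * RSS + lam * sum(|beta_j|); the intercept is unpenalized."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # inner product of SNP j with the partial residual
            z = 0.0    # sum of squares of SNP j
            for i in range(n):
                fit_without_j = intercept + sum(
                    X[i][k] * beta[k] for k in range(p) if k != j
                )
                rho += X[i][j] * (y[i] - fit_without_j)
                z += X[i][j] ** 2
            # A monomorphic SNP (all zeros) carries no information.
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
        intercept = sum(
            y[i] - sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)
        ) / n
    return intercept, beta


# Toy data (hypothetical): six individuals, two SNPs coded as allele counts.
X = [[0, 1], [0, 0], [0, 0], [1, 0], [1, 1], [2, 0]]
y = [1.0 + 2.0 * row[0] for row in X]  # only the first SNP affects the trait

b0_ols, beta_ols = lasso_coordinate_descent(X, y, lam=0.0)      # no penalty
b0_lasso, beta_lasso = lasso_coordinate_descent(X, y, lam=0.5)  # moderate penalty
b0_null, beta_null = lasso_coordinate_descent(X, y, lam=1e6)    # everything shrunk to zero
```

With no penalty the fit reduces to ordinary least squares, a moderate penalty shrinks the coefficient of the associated SNP toward zero, and a very large penalty removes all SNPs from the model, leaving only the intercept.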

Gene-level association tests

The association is tested at the gene level. Within a gene, the dependent variable is Q2 of the GAW17 data set, and the independent variables are the genotypes of all the SNPs in the gene. We use a model, with a LASSO penalty, in which no interactions are involved. This model is indexed as M1. To test for the association between a gene and Q2, let RSS_{M1} and RSS_{M0} be the residual sums of squares of models M1 and M0, respectively. To correct for selection bias, we use the generalized degrees of freedom (GDF)

which asymptotically follows the

GDF and

In classical linear models, the number of covariates is fixed; therefore the number of degrees of freedom is equal to the number of covariates. However, the situation is different in a LASSO regression: The number of nonzero coefficients can no longer accurately measure the model complexity. For a LASSO regression, which involves variable selection, the GDF was introduced

Suppose that the observed value y_{i} is modeled as y_{i} = μ_{i} + ε_{i}, where ε_{i} ~ N(0, σ^{2}). An estimate σ̂^{2} for σ^{2} can be obtained by an ordinary regression. Given a modeling procedure and perturbations δ_{ti} ~ N(0, τ^{2}),

(3) Finally, calculate:

Given GDF(

Thus the tuning parameter
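The perturbation idea behind the GDF can be sketched as follows. This is a minimal illustration assuming the usual Monte Carlo scheme: perturb the observed values with small Gaussian noise, refit the modeling procedure, regress each fitted value on its own perturbation, and sum the slopes. The function name `estimate_gdf`, the perturbation size, the number of perturbations, and the toy data are all assumptions and are not taken from this study.

```python
import random


def estimate_gdf(fit, y, tau, n_perturbations=1000, seed=0):
    """Monte Carlo GDF estimate for a modeling procedure `fit`.

    `fit` maps a list of observed values to a list of fitted values.
    """
    rng = random.Random(seed)
    n = len(y)
    deltas, fitted = [], []
    for _ in range(n_perturbations):
        delta = [rng.gauss(0.0, tau) for _ in range(n)]
        deltas.append(delta)
        fitted.append(fit([yi + di for yi, di in zip(y, delta)]))

    gdf = 0.0
    for i in range(n):
        d = [row[i] for row in deltas]   # perturbations of observation i
        f = [row[i] for row in fitted]   # fitted values of observation i
        d_mean = sum(d) / len(d)
        f_mean = sum(f) / len(f)
        sxy = sum((a - d_mean) * (b - f_mean) for a, b in zip(d, f))
        sxx = sum((a - d_mean) ** 2 for a in d)
        gdf += sxy / sxx                 # sensitivity of fit i to observation i
    return gdf


def mean_model(values):
    """Intercept-only model: every fitted value is the sample mean."""
    m = sum(values) / len(values)
    return [m] * len(values)


# Toy phenotype values (hypothetical).
y = [0.3, -1.2, 0.8, 2.1, -0.5, 1.4, 0.0, -0.9, 1.1, 0.6]
gdf_mean = estimate_gdf(mean_model, y, tau=0.5)
print(f"estimated GDF of the intercept-only model: {gdf_mean:.2f}")
```

For the intercept-only model the estimate should be close to 1, the classical number of degrees of freedom. A procedure that selects variables, such as the LASSO, can be passed as `fit` in the same way, and its estimated GDF then reflects the cost of the selection step.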

Alternative methods: F_{linear} and the combined multivariate and collapsing method for quantitative traits

As a comparison, we also carry out the linear regression test F_{linear}. A second alternative method is the combined multivariate and collapsing (CMC) method
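The collapsing step of a CMC-type analysis can be sketched as follows. This is a minimal illustration assuming the indicator form of collapsing, in which all SNPs with a MAF below the threshold are replaced by a single carrier indicator and the remaining SNPs are kept as separate covariates. The function name `collapse_rare_variants`, the coding, and the toy data are assumptions; the coding actually used for the QCMC tests is not shown in the text above and may differ (for example, a count of rare alleles instead of an indicator).

```python
def collapse_rare_variants(genotypes, mafs, threshold):
    """Collapse SNPs with MAF below `threshold` into one carrier indicator.

    genotypes: one row per individual, one allele count per SNP.
    Returns rows of the form [carrier_indicator, common_snp_1, ...].
    """
    rare = [j for j, maf in enumerate(mafs) if maf < threshold]
    common = [j for j, maf in enumerate(mafs) if maf >= threshold]
    collapsed = []
    for row in genotypes:
        carrier = 1 if any(row[j] > 0 for j in rare) else 0
        collapsed.append([carrier] + [row[j] for j in common])
    return collapsed


# Toy gene (hypothetical): four individuals, three SNPs.
genotypes = [[0, 1, 2], [1, 0, 0], [0, 0, 1], [0, 0, 0]]
mafs = [0.004, 0.008, 0.30]

design = collapse_rare_variants(genotypes, mafs, threshold=0.01)
print(design)
```

The collapsed design matrix can then be used as the set of covariates in an ordinary linear regression of the quantitative trait, so the number of degrees of freedom depends directly on the chosen threshold.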

Results

We evaluated the power and false-positive rates of the F_{LASSO}, F_{linear}, QCMC(0.01), and QCMC(0.05) tests based on the 200 replicates of the GAW17 data set. The significance level of the tests was first set to 1.6 × 10^{–5}, which is the Bonferroni-corrected significance level of 0.05 adjusted by the number of genes, that is, 0.05/3,205. However, because of the small sample size in the GAW17 data set, the power of the association tests was poor at this level and the four tests could not be meaningfully compared. Therefore we also used the less stringent significance level of 0.01 for method comparison.
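The two significance levels can be put side by side with a short calculation. This sketch uses only the numbers stated above (3,205 genes, a family-wise level of 0.05, and a nominal level of 0.01); the variable names are illustrative, and the expected count assumes, purely for illustration, a well-calibrated test applied to genes that are all null.

```python
n_genes = 3205
family_wise_alpha = 0.05

# Bonferroni correction: divide the family-wise level by the number of tests.
per_gene_alpha = family_wise_alpha / n_genes
print(f"Bonferroni per-gene level: {per_gene_alpha:.2e}")

# At the nominal level of 0.01, a calibrated test would be expected to reject
# about this many null genes per replicate if every gene were null.
nominal_alpha = 0.01
expected_null_rejections = nominal_alpha * n_genes
print(f"expected null rejections at 0.01: {expected_null_rejections:.2f}")
```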

We examined the answers to the GAW17 simulation after our association analyses were completed. In the answers, Q2 is influenced by 72 SNPs in 13 genes, where the MAFs and effect sizes (_{i}

As shown in Table

True variance contributions of 13 causal genes given in the GAW17 answers

Number of SNPs                  15      7       24      29      27      24      11      20      29      11      1       8       5
Number of causal SNPs           7       2       10      13      8       9       4       3       8       2       1       2       3
Average MAF of the causal SNPs  0.0206  0.0882  0.0022  0.0010  0.0013  0.0012  0.0029  0.0060  0.0021  0.0029  0.0122  0.0032  0.0007
Variance contribution           0.0239  0.0193  0.0125  0.0115  0.0111  0.0100  0.0098  0.0097  0.0090  0.0048  0.0034  0.0021  0.0002

We evaluated the power of the four methods based on the 13 causal genes using the 200 replicates (Figure

**Power to detect 13 causal genes at the significance levels of 0.01 and 1.6 × 10^{–5} in 200 replicates.**

In general, the power of all the tests increased as a gene’s contribution to the phenotypic variation increased. However, we observed some exceptions, possibly because power depends on many other factors, such as allele frequency and linkage disequilibrium among the SNPs within a gene. First, although their contributions to the phenotype variation were similar, we had more power to detect ^{2} = 0.003). Among the 55 significant tests of the ^{2} = 0.002), it was not as common as C10S3059 and the linkage disequilibrium pattern was not the same as that for

**Linkage disequilibrium plot for genes SIRT1 and VLDLR.** Linkage disequilibrium plots generated from Haploview.

We also investigated the false-positive rates by counting the frequency of significant tests. The false-positive rate of the F_{LASSO} test was slightly larger than that of the other three tests, but not significantly so.

False-positive rates at the significance levels of 0.01 and 1.6 × 10^{–5} (the Bonferroni-corrected significance level of 0.05)

Significance level   F_{LASSO}   F_{linear}   QCMC(0.01)   QCMC(0.05)
0.01                 0.02793     0.02094      0.02195      0.02233
1.60 × 10^{–5}       0.00016     0.00011      0.00011      0.00013
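Empirical power and false-positive rates of this kind are rejection frequencies. The following minimal sketch assumes that power is the fraction of replicates in which a causal gene is significant and that the false-positive rate is the fraction of significant tests pooled over non-causal genes and replicates; the pooling scheme, the function name `rejection_rate`, and all p-values shown are hypothetical and are not results from this study.

```python
def rejection_rate(p_values, alpha):
    """Fraction of tests whose p-value falls below the significance level."""
    return sum(1 for p in p_values if p < alpha) / len(p_values)


# Hypothetical p-values for one causal gene across 200 replicates.
causal_p_values = [0.001] * 50 + [0.2] * 150
power = rejection_rate(causal_p_values, alpha=0.01)

# Hypothetical p-values pooled over non-causal genes and replicates.
null_p_values = [0.005] * 2 + [0.5] * 98
false_positive_rate = rejection_rate(null_p_values, alpha=0.01)

print(f"power = {power}, false-positive rate = {false_positive_rate}")
```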

Discussion and conclusions

In this study, we used LASSO regression and calculated the GDF for the F_{LASSO} test. Our results suggest that the F_{LASSO} test is more powerful than the other methods.

Linear regression is the least powerful approach because of the large number of rare SNPs and because no reduction is made in the large number of degrees of freedom. The collapsing test requires a predefined allele frequency threshold for grouping rare SNPs. It is difficult to determine this threshold optimally when in reality the true disease model is never known. As an extreme example, the QCMC(0.001) test was identical to the linear regression approach and the QCMC(0.1) test had no power at all in these data. Therefore, from this point of view, we recommend the LASSO approach for detecting rare SNPs.

Based on the power comparison of the

Competing interests

The authors declare that there are no competing interests.

Authors’ contributions

WG carried out the data analysis and drafted the manuscript. RCE and XZ participated in the design of the study and coordination and edited the manuscript. All authors read and approved the final manuscript.

Acknowledgments

This work was supported by National Institutes of Health (NIH) grants HL074166, HL086718 from the National Heart, Lung, and Blood Institute, HG003054 from the National Human Genome Research Institute, RR03655 from the National Center for Research Resources, and P30 CAD43703 from the National Cancer Institute. The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. This work was also partially supported by National Natural Science Foundation of China grant 10901031.

This article has been published as part of