Human Genetics, Genome Institute of Singapore, 60 Biopolis Street, Singapore 138672, Singapore

Saw Swee Hock School of Public Health, National University of Singapore, 16 Medical Drive, Singapore 117597, Singapore

Abstract

Background

As several rare genomic variants have been shown to affect common phenotypes, rare variants association analysis has received considerable attention. Several efficient association tests using genotype and phenotype similarity measures have been proposed in the literature. The major advantages of similarity-based tests are their ability to accommodate multiple types of DNA variations within one association test, and to account for the possible interaction within a region. However, not much work has been done to compare the performance of similarity-based tests on rare variants association scenarios, especially when applied with different rare variants pooling strategies.

Results

Based on population genetics simulations and analysis of a publicly available sequencing data set, we compared the performance of four similarity-based tests and two rare variants pooling strategies. We showed that the weighting approach outperforms collapsing in the presence of a strong effect from rare variants and in the presence of a moderate effect from common variants, whereas collapsing of rare variants is preferable when common variants possess a strong effect. We also demonstrated that the difference in statistical power between the two pooling strategies may be substantial. The results also highlighted the consistently high power of two similarity-based approaches when applied with an appropriate pooling strategy.

Conclusions

Population genetics simulations and sequencing data set analysis showed high power of two similarity-based tests and a substantial difference in power between the two pooling strategies.

Background

Although genome-wide association studies (GWAS) have identified many common single nucleotide polymorphisms (SNPs) associated with common diseases (

Measures of genotype similarity have been the basis of many proposed statistical tests. The idea of similarity-based tests is to consider the relationship between genotypic and phenotypic similarities (similarity here roughly refers to a measure of closeness of two genotypes or phenotypes). Similarity-based tests are motivated by the fact that haplotypes carrying the same causal mutation are more related compared with those without causal mutations; so, case haplotypes are expected to share longer stretches of DNA identical by descent

In this article, we compared the performance of the four similarity-based tests (SKAT, KBAT, MDMR and a modified U-test proposed by Schaid et al.

Results

Population genetic simulations

For each test, 1000 permutations were performed to assess the significance of association. To make sure the empirical type-1 error is controlled, we ran the analysis of simulated data under the null model. As can be seen from Additional file

**Empirical type-1 error estimate for population genetics simulations (Table S1), detailed description of population genetics simulations, and considerations for possible reasons for MDMR power loss when applied with weighting pooling strategy.**

**Power as a function of significance level for the four similarity-based tests and two rare variants pooling strategies.** Panel 1: “Risk Rare” Scenario; Panel 2: “Risk Both” Scenario; Panel 3: “Risk Common” Scenario; Panel 4: “Mixed Rare” Scenario.

Finally, we investigated the performance of the tests in the “Mixed Rare” scenario which incorporated both risk and protective variants within a region (Figure

**Power as a function of significance level for the four similarity-based tests with IBS kernels and two rare variants pooling strategies.** Panel 1: “Risk Rare” Scenario; Panel 2: “Risk Both” Scenario; Panel 3: “Risk Common” Scenario; Panel 4: “Mixed Rare” Scenario.

| **Scenario/Test** | **MDMR** | **SKAT** | **KBAT** | **U-Test** |
|---|---|---|---|---|
| **Risk Rare** | 0.466 | 0.472 | 0.157 | 0.511 |
| **Risk Both** | 0.395 | 0.29 | 0.094 | 0.379 |
| **Risk Common** | 0.551 | 0.479 | 0.388 | 0.235 |
| **Mixed Rare** | 0.18 | 0.516 | 0.393 | 0.148 |

We also analyzed the simulated data after excluding all common variants defined as those with MAF > 1% (Figure

**Power as a function of significance level for the four similarity-based tests and two rare variants pooling strategies when common variants are excluded from the analysis.** Panel 1: “Risk Rare” Scenario; Panel 2: “Risk Both” Scenario; Panel 3: “Risk Common” Scenario; Panel 4: “Mixed Rare” Scenario.

GAW17 data set

The GAW17 data set is a large scale exome sequencing data set with genotypes from the 1000 Genomes Project (

We performed an association analysis of causal genes that affect two quantitative traits, $Q_1$ and $Q_2$, and a dichotomous trait. [Regression model (1) is not recoverable from the source; only the indices $j$ and $i$ and a transpose $^{T}$ survive.]

The residuals from the regression models (1) were dichotomized (upper 30% of the observed distribution were declared cases, while the others were controls) and tested for association with adjusted genotype

**Empirical type-1 error rates for dichotomized adjusted quantitative phenotype in GAW17 data set at the theoretical level of 0.05 (ARNT-VEGFC with Q1, and BCHE-VWF with Q2).**

**Empirical type-1 error rates for dichotomized adjusted case–control status in GAW17 data set at the theoretical level of 0.05.**

**Power to identify an association with dichotomized adjusted case–control status in GAW17 data set for some of the causal genes.**

**Power to identify association with dichotomized adjusted quantitative trait in GAW17 data set for causal genes (ARNT-VEGFC with Q1, and BCHE-VWF with Q2).**

| **Scenario/Test** | **MDMR** | **SKAT** | **KBAT** | **U-Test** |
|---|---|---|---|---|
| **Q1** | 0.84 (KDR) | 0.45 (ARNT) | 0.22 (ARNT) | 0.145 (HIF3A) |
| **Q2** | 0.605 (VNN1) | 0.5 (VNN1) | 0.42 (VNN1) | 0.535 (VNN1) |
| **Dichotomous** | 0.77 (FLT1) | 0.42 (PRKCA) | 0.43 (PRKCA) | 0.535 (FLT1) |

The genes at which the maximum difference was achieved are in brackets.

Discussion

In this article, we compared the performance of the four similarity-based tests together with two rare variants pooling strategies using population genetics simulations and the GAW17 real data set. The results suggest that weighting may be a much better strategy than collapsing when rare variants have a strong effect and common variants have a moderate or low effect. Collapsing, in turn, performed much better when common variants had a strong effect. The absolute difference in the power of a statistical test between the collapsing and weighting pooling strategies may be substantial. It follows from our study that if researchers are inclined to believe that rare variants within a region are associated with the phenotype, weighted pooling should be applied with similarity-based tests, whereas collapsing is more appropriate if common variants are suspected to be associated with the phenotype. Additionally, when rare variants had a strong effect in one direction and common variants were excluded from the analysis, collapsing performed as well as or better than weighting. Finally, both SKAT and KBAT had consistently high power compared with the other similarity-based tests considered when applied with the appropriate pooling strategy.

Recently, Basu and Pan

From our results, the MDMR test does not seem to perform well when applied with the weighting pooling strategy. To obtain a more detailed picture, we applied the weighted MDMR test to the “Risk Rare” data sets with modified weights $w_l^{p}$,

**Impact of power value on MDMR test performance in a “Risk Rare” scenario.**

One limitation of the current study is that the minimum significance level in the population genetics simulations was 0.001. For genome-wide significance, the number of permutations needed to reliably estimate the significance is very large. This makes the comparison of the similarity-based tests at the genome-wide level computationally prohibitive. In real GWAS, only a few highly significant genes will require a very large number of permutations to estimate

Conclusions

The performance of similarity-based tests applied with two rare variants pooling strategies was investigated. Population genetics simulations and sequencing data set analysis showed consistently high power of two similarity-based tests and a substantial difference in performance of the two rare variants pooling strategies.

Methods

Similarity-based tests

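Assume that an association study involves $N$ individuals ($N^{A}$ cases and $N^{U}$ controls), and that within a genomic region $L$ SNPs are genotyped, with $g_{nl}$ denoting the genotype of individual $n$ at SNP $l$. The similarity matrix is $S = \{s(\mathbf{g}_n, \mathbf{g}_m)\}_{n,m=1}^{N}$, where $\mathbf{g}_n$ is the multi-site genotype vector $\{g_{n1},\dots,g_{nL}\}$ of individual $n$. Examples of similarity measures are the weighted linear kernel $s(\mathbf{g}_n, \mathbf{g}_m) = \sum_{l=1}^{L} w_l g_{nl} g_{ml}$ for some fixed weights $w_l$, the weighted quadratic kernel $s(\mathbf{g}_n, \mathbf{g}_m) = (1 + \sum_{l=1}^{L} w_l g_{nl} g_{ml})^{2}$, and the weighted IBS kernel $s(\mathbf{g}_n, \mathbf{g}_m) = \sum_{l=1}^{L} w_l (2 - |g_{nl} - g_{ml}|)$. For our analysis, a popular exponential similarity measure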

The choice of similarity was motivated by the need to analyze quantitative genotype obtained as a result of population stratification adjustment (see Results section). As the exponential similarity is a function of the Euclidean distance between two multi-site genotypes, we consider this similarity to be more appropriate compared with, for example, another popular similarity measure, identity-by-state
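The three kernels listed above can be sketched in a few lines of Python. This is an illustration only: the function names, the toy genotypes, and the unit weights are assumptions and do not come from the article, and the exponential similarity used in the actual analysis is not shown because its formula is not available in this text.

```python
def linear_kernel(g_n, g_m, w):
    """Weighted linear kernel: sum over SNPs of w_l * g_nl * g_ml."""
    return sum(wl * a * b for wl, a, b in zip(w, g_n, g_m))


def quadratic_kernel(g_n, g_m, w):
    """Weighted quadratic kernel: (1 + weighted linear kernel) squared."""
    return (1.0 + linear_kernel(g_n, g_m, w)) ** 2


def ibs_kernel(g_n, g_m, w):
    """Weighted IBS kernel: sum over SNPs of w_l * (2 - |g_nl - g_ml|)."""
    return sum(wl * (2 - abs(a - b)) for wl, a, b in zip(w, g_n, g_m))


def similarity_matrix(genotypes, w, kernel):
    """N x N matrix of pairwise similarities between multi-site genotypes."""
    return [[kernel(g_n, g_m, w) for g_m in genotypes] for g_n in genotypes]


# Toy data (hypothetical): 3 individuals, 3 SNPs coded as minor allele counts.
genotypes = [[0, 1, 2], [0, 1, 2], [2, 0, 0]]
w = [1.0, 1.0, 1.0]  # unit weights, purely for illustration
S_ibs = similarity_matrix(genotypes, w, ibs_kernel)
```

With unit weights, two identical genotypes reach the maximum IBS similarity of $2L$, which is why the weights matter: they decide how much each SNP contributes to that total.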

Weighting and collapsing

Here we consider the two major ways of rare variants pooling: weighting and collapsing. The SNP weights will be denoted as $w_l$, with $w_l = \mathrm{Beta}(\mathrm{MAF}_l; 1, 25)^{2}$, where $\mathrm{MAF}_l$ is the MAF of the $l$-th SNP. The weighted similarity $s_w$ for the similarity matrix $S_w$ is as follows:

For the KBAT test statistic, the weights were incorporated differently (for details, see the description below) as the test does not use the multi-site genotype similarity.
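To make the weighting concrete, the sketch below evaluates squared Beta(1, 25) density weights for a few minor allele frequencies. It assumes the weight is the squared Beta density at the MAF, as in the usual SKAT convention; the helper names and the example MAF values are hypothetical.

```python
import math


def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution at x, for 0 < x < 1."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1.0 - x) ** (b - 1)


def snp_weight(maf):
    """Assumed weight: squared Beta(1, 25) density evaluated at the MAF."""
    return beta_pdf(maf, 1.0, 25.0) ** 2


# Hypothetical MAFs, from very rare to common.
mafs = [0.001, 0.01, 0.05, 0.30]
weights = [snp_weight(m) for m in mafs]
```

Because the Beta(1, 25) density falls steeply as the MAF grows, rare variants receive weights several orders of magnitude larger than common ones, so common variants contribute little to a weighted similarity.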

The collapsing of rare variants was performed as described in Thalamuthu et al., with the collapsed rare variant genotype of individual $n$ denoted $g_{n(L+1)}$.

In general, this type of collapsing preserves more information than an indicator of at least one rare variant being present, as suggested by Li and Leal for $g_{n(L+1)}$.

Multivariate distance matrix regression (MDMR)

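Let us denote the identity matrix of size $N$ as $I_N$ and a vector of ones of size $N$ as $1_N$. Following Wessel and Schork, the test statistic is computed as follows:

1. Phenotype projection matrix $H = Y(Y^{T}Y)^{-1}Y^{T}$, where $Y$ is the phenotype vector and the superscript $T$ denotes transposition.

2. Dissimilarity matrix $D = \{d_{ij}\}_{i,j=1}^{N} = 1_N 1_N^{T} - S$.

3. Gower’s centered matrix $G = (I_N - 1_N 1_N^{T}/N)\,A\,(I_N - 1_N 1_N^{T}/N)$, where $A = \{-d_{ij}^{2}/2\}_{i,j=1}^{N}$.

4. The test statistic $F = \mathrm{tr}(HGH)/\mathrm{tr}[(I_N - H)G(I_N - H)]$.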

Large values of the test statistic indicate a deviation from the null hypothesis of no association of a genotype with a phenotype.
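The four steps can be traced in a short pure-Python sketch. It assumes a single phenotype column and similarities scaled to [0, 1], and it uses the standard pseudo-F form of multivariate distance matrix regression; the function name and the toy matrix are illustrative and are not the authors' code.

```python
def mdmr_pseudo_f(S, y):
    """Pseudo-F statistic from a similarity matrix S (entries in [0, 1])
    and a phenotype list y (for example 1 = case, 0 = control)."""
    n = len(y)
    # Steps 2-3: d_ij = 1 - s_ij, then A = {-d_ij^2 / 2}.
    A = [[-0.5 * (1.0 - S[i][j]) ** 2 for j in range(n)] for i in range(n)]
    # Gower centring (I - 11^T/N) A (I - 11^T/N), written element-wise.
    row_mean = [sum(A[i]) / n for i in range(n)]
    col_mean = [sum(A[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    G = [[A[i][j] - row_mean[i] - col_mean[j] + grand_mean for j in range(n)]
         for i in range(n)]
    # Steps 1 and 4: for one phenotype column, H = y y^T / (y^T y), so
    # tr(HGH) = y^T G y / (y^T y) and tr((I-H)G(I-H)) = tr(G) - tr(HGH).
    yty = sum(v * v for v in y)
    explained = sum(y[i] * G[i][j] * y[j]
                    for i in range(n) for j in range(n)) / yty
    residual = sum(G[i][i] for i in range(n)) - explained
    return explained / residual


# Toy similarity matrix (hypothetical): individuals 0-1 and 2-3 form two
# tight genotype clusters.
S_demo = [[1.0, 0.9, 0.1, 0.1],
          [0.9, 1.0, 0.1, 0.1],
          [0.1, 0.1, 1.0, 0.9],
          [0.1, 0.1, 0.9, 1.0]]
f_clustered = mdmr_pseudo_f(S_demo, [1, 1, 0, 0])  # phenotype follows clusters
f_mixed = mdmr_pseudo_f(S_demo, [1, 0, 1, 0])      # phenotype ignores clusters
```

The statistic is much larger when the phenotype labels line up with the genotype clusters than when they cut across them, which is the behaviour the permutation test relies on.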

Sequence kernel association test (SKAT)

For this test, the phenotype vector $Y = \{y_n, n = 1,\dots,N\}$ is coded as 1 for cases and 0 for controls. The mean phenotype vector is defined as
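The text above is cut off before the statistic itself is stated. As a hedged illustration, the sketch below uses the usual SKAT score form without covariates, $Q = (Y - \bar{Y})^{T} K (Y - \bar{Y})$, where $K$ is the similarity (kernel) matrix and $\bar{Y}$ is the sample mean phenotype; whether the article uses exactly this form is an assumption, and the names are hypothetical.

```python
def skat_statistic(K, y):
    """Score-type statistic (y - mean)^T K (y - mean), no covariates assumed."""
    n = len(y)
    y_bar = sum(y) / n
    resid = [v - y_bar for v in y]
    return sum(resid[i] * K[i][j] * resid[j]
               for i in range(n) for j in range(n))


# Toy kernel matrix (hypothetical): two genotype clusters, {0, 1} and {2, 3}.
K_demo = [[1.0, 0.9, 0.1, 0.1],
          [0.9, 1.0, 0.1, 0.1],
          [0.1, 0.1, 1.0, 0.9],
          [0.1, 0.1, 0.9, 1.0]]
q_clustered = skat_statistic(K_demo, [1, 1, 0, 0])
q_mixed = skat_statistic(K_demo, [1, 0, 1, 0])
```

In this sketch the statistic grows when individuals with the same phenotype are also genetically similar, so large values point away from the null hypothesis.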

U-test

The average similarity scores between pairs of cases, $U_1$, and pairs of controls, $U_0$, are defined as follows:

where $s_{nm}$ is the similarity between individuals $n$ and $m$ ($S = \{s_{nm}\}_{n,m=1}^{N}$). The U-test statistic is defined as $U = (U_1 - U_0)^{2}$. Note that Schaid et al.
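A minimal sketch of the U-test statistic and of a permutation p-value follows. It assumes that $U_1$ and $U_0$ are plain means over unordered pairs, since the defining equation is missing here, and it uses the common (hits + 1)/(permutations + 1) estimate, which the article does not specify. All names and the toy data are hypothetical.

```python
import random


def mean_pair_similarity(S, idx):
    """Mean similarity over all unordered pairs of individuals in idx
    (idx must contain at least two individuals)."""
    pairs = [(a, b) for i, a in enumerate(idx) for b in idx[i + 1:]]
    return sum(S[a][b] for a, b in pairs) / len(pairs)


def u_statistic(S, y):
    """U = (U_1 - U_0)^2 with U_1 over case pairs and U_0 over control pairs."""
    cases = [i for i, v in enumerate(y) if v == 1]
    controls = [i for i, v in enumerate(y) if v == 0]
    return (mean_pair_similarity(S, cases)
            - mean_pair_similarity(S, controls)) ** 2


def permutation_p_value(S, y, statistic, n_perm=1000, seed=0):
    """Share of label permutations whose statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = statistic(S, y)
    labels = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # permute case/control labels, keep genotypes
        if statistic(S, labels) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): cases 0-2 share similar genotypes, controls do not.
n_ind = 6
S_toy = [[1.0 if i == j else (0.9 if i < 3 and j < 3 else 0.1)
          for j in range(n_ind)] for i in range(n_ind)]
y_toy = [1, 1, 1, 0, 0, 0]
u_obs = u_statistic(S_toy, y_toy)
p_val = permutation_p_value(S_toy, y_toy, u_statistic)
```

The same `permutation_p_value` helper accepts any statistic with the signature `statistic(S, y)`, which mirrors how one permutation scheme can serve all of the similarity-based tests compared here.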

Kernel-based association test (KBAT)

Let us denote $S_l = \{(s_l)_{nm}\}_{n,m=1}^{N}$ as a single-SNP similarity matrix for the $l$-th SNP. Suppose that $U_{l1}$ and $U_{l0}$ are the average similarity scores for pairs of cases and controls, respectively, calculated from $S_l$, and let $\bar{U}_l = (U_{l1} + U_{l0})/2$. Following Mukhopadhyay et al.

where the two groups are case-case and control-control pairs. The test statistic is $KBAT = \sum_{l=1}^{L} SSB_l / \sum_{l=1}^{L} SSW_l$, where $SSB_l$ and $SSW_l$ denote the between-group and within-group sums of squares for SNP $l$. Since the test does not utilize the multi-site similarity matrix, but only single SNP matrices $S_l$, the weighted test statistic $KBAT_W = \sum_{l=1}^{L} w_l SSB_l / \sum_{l=1}^{L} w_l SSW_l$ is used here. A large value of the

Population genetics simulations

Population genetics simulations were performed based on the code provided by King et al.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

SZ, AS and AT conceived the study. SZ and AT designed the experiments. SZ conducted the experiments and performed the analysis. SZ wrote the manuscript. SZ, AS and AT approved the manuscript.

Acknowledgements

We would like to thank the workshop organizers of GAW17 for their permission to use their data in our research. The preparation of the Genetic Analysis Workshop 17 Simulated Exome Data Set was supported by a GAW grant, R01 GM031575, and in part by NIH R01 MH059490, and used sequencing data from the 1000 Genomes Project (

Funding: This work was supported by the Agency for Science, Technology and Research (A*STAR; Singapore). The first author is a recipient of the Singapore International Graduate Award.