Email updates

Keep up to date with the latest news and content from BMC Genetics and BioMed Central.

Open Access Methodology article

Discovering joint associations between disease and gene pairs with a novel similarity test

Wan-Yu Lin12* and Wen-Chung Lee13

Author Affiliations

1 Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, No. 17, Xuzhou Rd., Taipei 100, Taiwan

2 Department of Biostatistics, University of Alabama at Birmingham, 1665 University Boulevard, Birmingham, Alabama 35294, USA

3 Research Center for Genes, Environment and Human Health, National Taiwan University, No. 17, Xuzhou Rd., Taipei 100, Taiwan

For all author emails, please log on.

BMC Genetics 2010, 11:86  doi:10.1186/1471-2156-11-86


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2156/11/86


Received:14 April 2010
Accepted:4 October 2010
Published:4 October 2010

© 2010 Lin and Lee; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Genes in a functional pathway can have complex interactions. A gene might activate or suppress another gene, so it is of interest to test joint associations of gene pairs. To simultaneously detect the joint association between disease and two genes (or two chromosomal regions), we propose a new test with the use of genomic similarities. Our test is designed to detect epistasis in the absence of main effects, main effects in the absence of epistasis, or the presence of both main effects and epistasis.

Results

The simulation results show that our similarity test with the matching measure is more powerful than the Pearson's χ2 test when the disease mutants were introduced at common haplotypes, but is less powerful when the disease mutants were introduced at rare haplotypes. Our similarity tests with the counting measures are more sensitive to marker informativity and linkage disequilibrium patterns, and thus are often inferior to the similarity test with the matching measure and the Pearson's χ2 test.

Conclusions

In detecting joint associations between disease and gene pairs, our similarity test is a complementary method to the Pearson's χ2 test.

Background

Genes in a functional pathway can have complex interactions. A gene might activate or suppress another gene, so it is of interest to test joint associations of gene pairs. Differing from epistasis (generally defined as the interaction between different genes [1]), joint associations herein include both main effects and interactions. Haplotypes from two receptors can trigger significant interactions affecting disease status [2]. Moreover, detecting associations with the use of haplotypes constructed by several adjacent and highly correlated single-nucleotide polymorphisms (SNPs) is an economical strategy. These all enlighten us regarding ways to develop methods for discovering gene pairs in association with disease by using haplotypes.

There is a growing interest in detecting gene-gene interactions [1,3,4], and some methods have been proposed to detect interactions. A well-known approach to detecting SNP-SNP interactions, the multifactor dimensionality reduction (MDR) method [5-8], however, has not been developed for testing haplotype-haplotype interactions. Another commonly used method is the classification and regression trees (CART) [9-12]. This concept has been extended to analyze haplotype data, known as the HapForest approach [13].

In this paper, we do not focus only on interactions because the definition of independence between two genes is arbitrary, often varying according to the field under discussion, such as biology, statistics or epidemiology [1]. Instead, we focus on detecting joint associations. To simultaneously detect joint association between disease and two genes (or two chromosomal regions), we propose a new test with the use of genomic similarities. Similarity-based methods are less vulnerable to the penalty of testing many markers or haplotypes, and can be more powerful than conventional association methods in some situations [14]. Our proposed test is designed to detect epistasis in the absence of main effects, main effects in the absence of epistasis, or the presence of both main effects and epistasis. We further compare our method with the HapForest approach [13], the Pearson's χ2 test, and the tests for SNP × SNP epistasis via simulation studies.

Methods

Similarity Measures

Let <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M1">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M2">View MathML</a> be the marginal similarities of the ith and jth subjects at genes G1 and G2, respectively. They can be obtained based on unphased multi-marker genotypes or statistically inferred haplotypes, and they can be scaled from 0 to 1. Here we list some commonly used similarity measures, which can be traced back to [15,16].

A. Diplotype perspective

A.1. Similarity measure based on identity-by-state (IBS) allele sharing (referred to as 'IBS'):

<a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M3">View MathML</a>

where L is the number of loci considered in Gk; <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M4">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M5">View MathML</a> are respectively the genotypes of the ith and jth subjects at the lth locus in Gk; <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M6">View MathML</a> is the number of alleles shared in common for the ith and jth subjects at the lth locus in Gk, which has possible values of 0, 1, and 2.

A.2. Similarity measure based on IBS inversely weighted by genotype frequencies (referred to as 'W-IBS'):

<a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M7">View MathML</a>

where <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M8">View MathML</a>, and <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M9">View MathML</a> is the frequency of genotype <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M4">View MathML</a>. The implication of this weight is that subjects sharing rare alleles may have more similar genomes than do subjects sharing common alleles.

Joint Similarity Regarding Two Genes

A similarity measure accounting for the joint association of genes G1 and G2 for the ith and jth subjects is

<a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M10">View MathML</a>

(1)

where <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M11">View MathML</a> ranges from 0 to 1, too. The joint similarity (<a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M11">View MathML</a>) will be high if both of the two marginal similarities (<a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M12">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M13">View MathML</a>) are high. That is, with respect to the two genes, the ith and jth subjects will be regarded as 'similar' if they are similar in both genes.

B. Haplotype perspective

B.1. Similarity based on the counting measure for haplotypes (referred to as 'COUNT'):

Let hi and hj be the ith and jth categories of haplotypes in a gene, <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M14">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M15">View MathML</a> are the alleles at the lth locus on hi and hj, respectively. The similarity based on the counting measure for haplotypes is

<a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M16">View MathML</a>

where <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M17">View MathML</a> is 1 if the alleles at the lth locus match for the ith and jth haplotypes.

B.2. Similarity based on the matching measure for haplotypes (referred to as 'MATCH'):

Let hi and hj be the ith and jth categories of haplotypes in a gene, then the similarity based on the matching measure for haplotypes is

<a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M18">View MathML</a>

where s(hi, hj) is 1 only when all alleles match for the ith and jth haplotypes, otherwise s(hi, hj) is 0.

Joint Similarity Regarding Two Genes

Let <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M19">View MathML</a> be the uth possible diplotype (i.e., the pair of haplotypes a subject possesses) in Gk of the ith subject, where <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M20">View MathML</a>, and where <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M21">View MathML</a> is the number of possible diplotypes in Gk for the ith subject. <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M22">View MathML</a> is the posterior probability that the ith subject has the uth possible diplotype in Gk, given the unphased genotypes (<a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M23">View MathML</a>). <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M22">View MathML</a> can be inferred by the expectation-maximization (EM) algorithm [17]. Then a similarity measure accounting for the joint association of genes G1 and G2 for the ith and jth subjects is

<a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M24">View MathML</a>

(2)

where <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M25">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M26">View MathML</a> can be obtained based on the counting measure or the matching measure. <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M11">View MathML</a> ranges from 0 to 1, too.

Similarity Test

Let the dissimilarity accounting for the joint association of genes G1 and G2 for the ith and jth subjects be <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M27">View MathML</a>. The test statistic to detect the joint association of genes G1 and G2 is

<a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M28">View MathML</a>

(3)

where nCS and nCN are the numbers of cases and controls, respectively; <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M29">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M30">View MathML</a> are the vectors of joint haplotype/genotype frequencies of genes G1 and G2, for the case and control samples, respectively; <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M31">View MathML</a> is the dissimilarity matrix of the joint haplotypes/genotypes of G1 and G2 (see Appendix I). When <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M32">View MathML</a> (cases and controls have a same haplotype/genotype distribution), the test statistic is 0.5. When <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M33">View MathML</a>, the test statistic is larger than 0.5. This statistic tests whether the average dissimilarity between cases and controls (between-group dissimilarity) is significantly large, with the adjustment of the dissimilarity within the case group and that within the control group (within-group dissimilarity). This is to mimic the F test to compare the between-group variability with the within-group variability. However, because of the complex correlation introduced by pair-wise similarities, the distribution of the test statistic is difficult to derive analytically, and permutation is required to obtain P values.

Simulation Study

Simulation studies were conducted to evaluate the performance of our method. We extended the simulation scheme of Li et al. [18] to two chromosomal regions. In each region, 4,000 haplotypes across 300 kb were generated using the coalescent-based program ms [19]. The effective population size was set at 10,000, the recombination rate per base pair (bp) per generation was set at 10-9, and 300 SNPs were simulated in each region. For the human genome, recombination occurs at an average rate of about 10-8 per bp per generation [20]. Our recombination rate, 10-9 per bp per generation, is the low end of the recombination rates in the human genome [18], representing a stronger linkage disequilibrium (LD). We chose this rate because multi-marker approaches are primarily designed for strong-LD regions. In each chromosomal region, 2,000 diplotypes were generated by randomly pairing the 4,000 haplotypes. Then the 2,000 diplotypes of the first region were randomly paired with the 2,000 diplotypes of the second region, to form 2,000 subjects. In this way, we generated 300 datasets.

We then considered nine disease models listed in Additional file 1. Additional file 1 lists the causal allele frequencies, the penetrance values of two-locus genotypes, and the marginal penetrance values of one-locus genotypes, for all disease models. Model 0 was used to evaluate Type-I error rates, while the other eight models were used to evaluate powers. Models 1-6 exhibit interactions in the absence of main effects when genotypes conform to Hardy-Weinberg equilibrium. We used these six disease models because they further challenged the ability of our method to discover the joint associations (or 'interactions' in this situation) of gene pairs. Models 7 and 8 exhibit both interactions and main effects. Model 7 is the jointly dominant-dominant model, which requires at least one copy of the disease allele from both loci to be affected [21,22]. Model 8 has the same penetrance table with Model 3, but has different causal allele frequencies. We deliberately let the causal allele frequency of one locus be smaller than that of another locus.

Additional file 1. Table S1. The penetrance tables and causal allele frequencies of nine disease models.

Format: DOC Size: 72KB Download file

This file can be viewed with: Microsoft Word ViewerOpen Data

For each dataset, we first randomly selected two SNPs (each from among 300 SNPs in a region) with similar MAFs to those of the causal SNPs (the tolerable difference was set to be 0.02), pretending them as the two causal SNPs. We then used the H-clust method [23,24] to choose tag SNPs with a subset formed by 200 subjects randomly drawn from the pool of 2,000 subjects. Tag SNPs were chosen with quality (MAF > 0.1) and correlation (the cut-off value for finding clusters was set to be 0.85). In each repetition, cases and controls were sampled with replacement from the pool of 2,000 subjects, where case/control status was assigned according to the genotypes of the two causal SNPs. After generating the phenotypes, the genotypes of the causal SNPs were removed from our datasets. Each chromosomal region was formed by eight SNPs - four to the left and four to the right of every causal SNP.

We evaluated the performance of our method with the matching measure ('MATCH') and the counting measure ('COUNT') of haplotypes. We also used two genotype similarity measures: 'IBS' and 'W-IBS'. We compared these with the HapForest approach [13]. HapForest is based on a tree structure, and is naturally suitable for analyzing interactions. Following the instructions of HapForest, we first invoked SNPHAP [25] to estimate the haplotype frequencies for each individual. Then HapForest was used to identify haplotypes and haplotype-haplotype interactions in association with the disease. This method suggests potential epistasis among significant haplotypes. For HapForest, a rejection of null hypothesis was defined as the identification of at least one significant haplotype from any of the two chromosomal regions.

The Pearson's χ2 test was also performed for comparison, in which the joint haplotype distributions of the two chromosomal regions were compared between cases and controls. Rather than using the asymptotic χ2 distribution, we randomly assigned the disease status in each permutation and determined the P value of observed χ2 statistics. To calculate haplotype similarities from unphased multi-marker genotypes, we first inferred haplotype phases by the EM algorithm, using the function of 'haplo.em' in the 'haplo.stats' package [17]. The obtained posteriors were then treated as weights, and all possible haplotype pairs were considered with their probabilities (see equation (2)). All the haplotypes with frequencies less than 0.01 are considered to be rare haplotypes. To avoid possible genotyping errors, we follow Sha et al. [26] to merge each rare haplotype with its most similar common haplotype (see the modified EM algorithm proposed by Sha et al. [26]). For example, Haplotype A (1-1-1-2-1-1-1-1) is considered to be a rare haplotype because its frequency is less than 0.01. Haplotypes C (1-1-1-1-1-1-1-1) and F (1-1-1-2-2-1-1-1) are the most similar haplotypes to Haplotype A (both with a similarity of 0.875 by using the counting measure), and their haplotype frequencies are 0.2 and 0.1, respectively. We merge Haplotype A with Haplotype C, the most similar haplotype with the highest frequency. We then update the haplotype data by replacing Haplotype A with Haplotype C.

We also compared our methods with the tests for SNP × SNP epistasis by using case-control data or case-only data (with the --fast-epistasis command implemented by PLINK-1.07) [27]), hereafter referred to as 'CS-CN' and 'CS', respectively. In our simulation, each chromosomal region was formed by eight SNPs, and there were 64 tests for SNP × SNP epistasis. We recorded the minimum P value (Pmin) from among all the 64 P values, and then adjusted this Pmin on the basis of Sidak correction [28], with an effective number of tests, Meff. That is, we adjusted the minimum P value (Pmin) by <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M34">View MathML</a>.

We then evaluated the validity and power of the eight tests with the 300 datasets. For each dataset, we recorded the P values of 50 repetitions (so there were 15,000 P values in total); in each repetition, P values were obtained with 1,000 permutations. Given a significance level, the type I error rate (if under Model 0) or power (if under Models 1-8) was the proportion of the number of P values smaller than the significance level to the total number of P values.

For CS-CN and CS, the P value used was <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M34">View MathML</a>. The effective number of tests (Meff) was estimated by the eigenvalue-based approach [29,30]. For each subject, we had 8+8 genotype coding values (0, 1, or 2), and 64 pair-wise products of genotype coding values, one from a SNP in region 1 and another from a SNP in region 2. Based on n subjects, we obtained a 64 × 64 correlation matrix for these 64 pair-wise products of genotype coding values. Then the eigenvalues of this correlation matrix were calculated to estimate the effective number of tests (see [29,30], or see [31] for a nice review).

Results

Type-I Error Rates

In Additional file 1, Model 0 (disease status independent of the composite genotypes) was used to evaluate the type-I error rates. This model demonstrates our null hypothesis: no main effects and no interactions. In this model, the penetrance of each composite genotype was set to be 0.05. The sample size was set at 200 subjects, of which half were cases and half were controls. Figure 1 presents the type-I error rates under different nominal significance levels (α). For α smaller than 0.2, the type-I error rates of all the tests corresponded to the nominal significance levels (α), suggesting the validity of these tests. (For α larger than 0.2, the type-I error rates of HapForest failed to match with the nominal significance levels. HapForest reported P values as 1.0 when the association signal was not strong. However, this makes no influence on our following discussions because α is usually set at a small value.)

thumbnailFigure 1. Type-I error rates under different nominal significance levels. The x-axis is nominal significance level, and the y-axis is type-I error rate.

Statistical Power

For all models except for Models 2 and 7, the total sample size was set at 1,000 subjects, of which half were cases and half were controls. For Models 2 and 7, the total sample size was set at 150 and 50, respectively. If the sample size was also set at 1,000 for Models 2 and 7, the powers of these tests would be all close to 1. Therefore, we chose two smaller sample sizes for effectively exploring the power difference between these tests. The power performances of these tests vary with the property of disease mutants introduced at rare/common haplotypes.

We first define two scores to distinguish the two situations. Let <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M35">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M36">View MathML</a>, where I(·) is the indicator function, <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M37">View MathML</a> is the frequency of the ith high-risk haplotype at Gg, g = 1, 2. We estimated haplotype frequencies based on all 2,000 subjects in a dataset when calculating the scores of SC1 or SC2. While the score of SC2 is designed for Models 1-4 and 7, SC1 is designed for Models 5, 6, and 8 (because of their relatively low causal allele frequencies). Disease mutants were considered to be introduced at rare/common haplotypes if SC2 ≤ 1/SC2 >1 (for Models 1, 2, 7); SC2 = 0/SC2 = 1 (for Models 3, 4); SC1 = 0/SC1 = 1 (for Models 5, 6); SC1 ≤ 1/SC1 >1 (for Model 8).

Figure 2 presents the powers of the eight tests when α is set to be smaller than 0.1, stratified by the property of disease mutants introduced at rare/common haplotypes. For most models, the two most powerful tests are our similarity method with the matching measure (MATCH) and the Pearson's χ2 test. MATCH is more powerful than the Pearson's χ2 test when the disease mutants were introduced at common haplotypes. Conversely, MATCH is less powerful than the Pearson's χ2 test when the disease mutants were introduced at rare haplotypes. For Model 1, haplotype-perspective methods provide no power, while diplotype-perspective methods (IBS and W-IBS) and the test for SNP × SNP epistasis by using case-only data (CS) have better performances.

thumbnailFigure 2. Powers of the eight tests, stratified by the property of disease mutants introduced at rare/common haplotypes. The x-axis is significance level, and the y-axis is power. The top row is for disease mutants introduced at rare haplotypes; the bottom row, at common haplotypes. The numbers shown in the parentheses are the numbers of repetitions summed from all the datasets with disease mutants introduced at rare/common haplotypes.

HapForest is not as powerful as MATCH and the Pearson's χ2 test. HapForest suggests potential epistasis among significant haplotypes. At each step, it builds a classifier that optimally distinguishes cases from controls based on haplotype data. This divides the whole sample into smaller and smaller subgroups by maximizing the local optimality at each node. However, the combination of local optimalities does not assure us of an overall optimality [32].

The tests for SNP × SNP epistasis by using case-control data or case-only data (CS-CN and CS) are not powerful under most disease models. Although our disease status was influenced by the joint effects of two SNPs (see Additional file 1), the tests for SNP × SNP epistasis suffered from power loss because of the need of corrections for multiple testing.

COUNT and IBS often have similar performances, because similarity measure based on the number of alleles in common between haplotypes (COUNT) is similar to that based on the number of alleles in common between individuals (IBS). Model 1 is an exception, because haplotype-perspective methods would not present any power under this model (see the penetrance values of two-locus genotypes for Model 1). W-IBS is a counting measure inversely weighted by genotype frequencies, and it is more powerful than COUNT and IBS. For most models (Models 2-6 and 8), COUNT, IBS, and W-IBS are inferior to MATCH, because the counting measures are more sensitive to marker informativity and LD patterns (results not shown). For Model 7, it requires at least one copy of the disease allele from both loci to be affected. Because the disease status is influenced by the counts of disease alleles, methods with the counting measures (COUNT, IBS, and W-IBS) are more powerful.

Discussion

Detecting joint associations of candidate genes responsible for common human diseases is a well-recognized issue. A candidate gene can contain many SNPs, and high-dimensionality becomes an important issue. The Pearson's χ2 test and the tests for SNP × SNP epistasis suffer from power loss because of large numbers of degrees of freedom and the need of adjustment for multiple testing, respectively. Compared with these conventional association methods, similarity methods are less vulnerable to the penalty of high-dimensionality.

Some similarity methods have been proposed based on this consideration. Tzeng et al. [15] compared the case-case similarity with the control-control similarity, because haplotypes around a causal locus might be more similar in two cases than in two controls randomly selected from the population. However, as pointed out by Sha et al. [26], this consideration might not be very plausible for complex diseases which were presumed to be affected by many genes and gene-environment interactions. The similarity within controls is not necessarily smaller than that within cases, because controls could be more likely to share protective haplotypes. Therefore, Sha et al. [26] proposed a test statistic that compared the between-group similarity with the within-group similarity. Our test statistics is also based on this consideration. Our test and Sha et al.'s test [26] will have similar performances, given a same similarity measure.

In this paper, we use the product of similarities of two genes/regions as a new similarity measure, which can account for the joint association of the two genes/regions, including main effects and/or interactions. This new measure can be built in the similarity test statistic. Furthermore, our equation (3) can be used to test the main effects (see Appendix II) or the joint associations of gene triplets by using: <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M38">View MathML</a>.

The computational burden of our method is reasonable for real data analyses, although permutation is required to obtain P values. If there are 100 candidate genes (each with eight tag SNPs), there will be a total of 4,950 combinations of gene pairs. With our experiences in simulations, it might take two to three days to test the 4,950 combinations for approximately 1000 subjects, given an Intel Xeon workstation with four 2.0 GHz CPUs and 2.0 GB of memory.

In general, our similarity test with the matching measure (MATCH) and the Pearson's χ2 test have better power performances. However, because both are haplotype-perspective methods, they are not appropriate for Model 1. Under this model, only the four heterozygous genotypes (AA-Bb, Aa-BB, Aa-bb, aa-Bb) lead to the disease. The implication is that besides the within-locus interference, there is some between-locus interference, and the two interferences cancel out [21] (so the double-heterozygosity genotype does not lead to the disease). The four heterozygous genotypes (AA-Bb, Aa-BB, Aa-bb, aa-Bb) generate four combinations of haplotypes: AB (one with allele A and one with allele B), Ab, aB, ab, with a same probability. Therefore, the four combinations of haplotypes are equally distributed in cases and in controls, and the haplotype-perspective methods cannot provide any power to this model.

The concept of testing joint associations can be used in the genomic distance-based regression [16]. Let D be the distance/dissimilarity matrix with elements: <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M39">View MathML</a>, and let X be the matrix containing information of phenotypes, which can be binary or continuous. Then the pseudo-F statistic can be used to test the association of phenotypic similarity with genetic similarity. The genomic distance-based regression [16] has the potential to adjust for covariate effects. With the need of adjusting for covariates, one can consider this approach with the joint similarities among genes.

Conclusions

In detecting joint associations between disease and gene pairs, our similarity test is a complementary method to the Pearson's χ2 test.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

W-Y L conceptualized the study, performed the simulation studies, and drafted the manuscript. W-C L provided advice and revised the manuscript. All authors read and approved the final manuscript.

Appendix

Appendix I: Derivation of equation (3)

<a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M40">View MathML</a>

Similarly, <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M41">View MathML</a>. We also have

<a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M42">View MathML</a>

Note that <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M29">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M30">View MathML</a> are the vectors of joint haplotype/genotype frequencies of genes G1 and G2, for the case and control samples, respectively; <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M43">View MathML</a> is the joint frequency of the kth category of haplotype/genotype at G1 and the lth category of haplotype/genotype at G2; <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M31">View MathML</a> is the dissimilarity matrix of the joint haplotypes/genotypes at G1 and G2, where its element <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M44">View MathML</a> is the dissimilarity between <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M45">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M46','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M46">View MathML</a>.

Appendix II: Test for main effects

When testing for main effects, we use only one gene/region in equation (3), i.e.,

<a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M47','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M47">View MathML</a>

where <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M48','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M48">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M49','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M49">View MathML</a> can be calculated from haplotypes or genotypes; <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M50">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M51">View MathML</a> are the vectors of haplotype/genotype frequencies at gene Gg, for the case and control samples, respectively; <a onClick="popup('http://www.biomedcentral.com/1471-2156/11/86/mathml/M52','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/11/86/mathml/M52">View MathML</a> is the dissimilarity matrix of the haplotypes/genotypes at Gg.

Acknowledgements

We thank the three anonymous reviewers for their constructive comments. This study was partly supported by National Science Councils, Taiwan.

References

  1. Cordell HJ: Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans.

    Hum Mol Genet 2002, 11(20):2463-2468. PubMed Abstract | Publisher Full Text OpenURL

  2. Lin M, Li H, Hou W, Johnson JA, Wu R: Modeling sequence-sequence interactions for drug response.

    Bioinformatics 2007, 23(10):1251-1257. PubMed Abstract | Publisher Full Text OpenURL

  3. Cordell HJ: Estimation and testing of gene-environment interactions in family-based association studies.

    Genomics 2009, 93(1):5-9. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  4. Musani SK, Shriner D, Liu N, Feng R, Coffey CS, Yi N, Tiwari HK, Allison DB: Detection of gene × gene interactions in genome-wide association studies of human population data.

    Hum Hered 2007, 63(2):67-84. PubMed Abstract | Publisher Full Text OpenURL

  5. Hahn LW, Ritchie MD, Moore JH: Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions.

    Bioinformatics 2003, 19(3):376-382. PubMed Abstract | Publisher Full Text OpenURL

  6. Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC: A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility.

    J Theor Biol 2006, 241(2):252-261. PubMed Abstract | Publisher Full Text OpenURL

  7. Ritchie MD, Hahn LW, Moore JH: Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity.

    Genet Epidemiol 2003, 24(2):150-157. PubMed Abstract | Publisher Full Text OpenURL

  8. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer.

    Am J Hum Genet 2001, 69(1):138-147. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  9. Breiman L: Classification and regression trees. Belmont, CA: Wadsworth International Group; 1984.

  10. Bureau A, Dupuis J, Falls K, Lunetta KL, Hayward B, Keith TP, Van Eerdewegh P: Identifying SNPs predictive of phenotype using random forests.

    Genet Epidemiol 2005, 28(2):171-182. PubMed Abstract | Publisher Full Text OpenURL

  11. Clark LA, Pregibon D: Tree-based models. In Statistical Models in S. Edited by Chambers JM, Hastie TJ. Pacific Grove, California: Wadsworth and Brooks/Cole Advanced Books and Software; 1992:377-419. OpenURL

  12. Zhang H, Bonney G: Use of classification trees for association studies.

    Genet Epidemiol 2000, 19(4):323-332. PubMed Abstract | Publisher Full Text OpenURL

  13. Chen X, Liu CT, Zhang M, Zhang H: A forest-based approach to identifying gene and gene gene interactions.

    Proc Natl Acad Sci USA 2007, 104(49):19199-19203. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  14. Lin WY, Schaid DJ: Power comparisons between similarity-based multilocus association methods, logistic regression, and score tests for haplotypes.

    Genet Epidemiol 2009, 33(3):183-197. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  15. Tzeng JY, Devlin B, Wasserman L, Roeder K: On the identification of disease mutations by the analysis of haplotype similarity and goodness of fit.

    Am J Hum Genet 2003, 72(4):891-902. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  16. Wessel J, Schork NJ: Generalized genomic distance-based regression methodology for multilocus association analysis.

    Am J Hum Genet 2006, 79(5):792-806. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  17. Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA: Score tests for association between traits and haplotypes when linkage phase is ambiguous.

    Am J Hum Genet 2002, 70(2):425-434. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  18. Li Y, Sung WK, Liu JJ: Association mapping via regularized regression analysis of single-nucleotide-polymorphism haplotypes in variable-sized sliding windows.

    Am J Hum Genet 2007, 80(4):705-715. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  19. Hudson RR: Generating samples under a Wright-Fisher neutral model of genetic variation.

    Bioinformatics 2002, 18(2):337-338. PubMed Abstract | Publisher Full Text OpenURL

  20. A haplotype map of the human genome

    Nature 2005, 437(7063):1299-1320. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  21. Li W, Reich J: A complete enumeration and classification of two-locus disease models.

    Hum Hered 2000, 50(6):334-349. PubMed Abstract | Publisher Full Text OpenURL

  22. Neuman RJ, Rice JP: Two-locus models of disease.

    Genet Epidemiol 1992, 9(5):347-365. PubMed Abstract | Publisher Full Text OpenURL

  23. Rinaldo A, Bacanu SA, Devlin B, Sonpar V, Wasserman L, Roeder K: Characterization of multilocus linkage disequilibrium.

    Genet Epidemiol 2005, 28(3):193-206. PubMed Abstract | Publisher Full Text OpenURL

  24. Roeder K, Bacanu SA, Sonpar V, Zhang X, Devlin B: Analysis of single-locus tests to detect gene/disease associations.

    Genet Epidemiol 2005, 28(3):207-219. PubMed Abstract | Publisher Full Text OpenURL

  25. Clayton D: SNPHAP - A program for estimating frequencies of large haplotypes of SNPs. Department of Medical Genetics, Cambridge Institute for Medical Research, Cambridge; 2006.

  26. Sha Q, Chen HS, Zhang S: A new association test using haplotype similarity.

    Genet Epidemiol 2007, 31(6):577-593. PubMed Abstract | Publisher Full Text OpenURL

  27. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, et al.: PLINK: a tool set for whole-genome association and population-based linkage analyses.

    Am J Hum Genet 2007, 81(3):559-575. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  28. Sidak Z: Rectangular confidence regions for the means of multivariate normal distributions.

    Journal of the American Statistical Association 1967, 62:626-633. Publisher Full Text OpenURL

  29. Cheverud JM: A simple correction for multiple comparisons in interval mapping genome scans.

    Heredity 2001, 87(Pt 1):52-58. PubMed Abstract | Publisher Full Text OpenURL

  30. Nyholt DR: A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other.

    Am J Hum Genet 2004, 74(4):765-769. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  31. Galwey NW: A new measure of the effective number of tests, a practical tool for comparing families of non-independent significance tests.

    Genet Epidemiol 2009, 33(7):559-568. PubMed Abstract | Publisher Full Text OpenURL

  32. Briollais L, Wang Y, Rajendram I, Onay V, Shi E, Knight J, Ozcelik H: Methodological issues in detecting gene-gene interactions in breast cancer susceptibility: a population-based study in Ontario.

    BMC Med 2007, 5:22. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL