Department of Epidemiology and Health Statistics, School of Public Health, Shandong University, Jinan, 250012, China

Berlin Institute for Medical Systems Biology, Max-Delbrück-Center for Molecular Medicine, 13125, Berlin, Germany

CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200031, China

Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Chinese Academy of Sciences, Shanghai, 200031, China

MRC Epidemiology Unit, Institute of Metabolic Science, Addenbrooke’s Hospital, Cambridge, UK

Abstract

Background

Currently, most methods for detecting gene-gene interaction (GGI) in genome-wide association studies (GWASs) are limited in that they use the single nucleotide polymorphism (SNP) as the unit of association. One way to address this drawback is to consider higher-level units such as genes or regions in the analysis. Earlier we proposed a statistic based on canonical correlations (CCU) as a gene-based method for detecting gene-gene co-association. However, it can only capture linear relationships and not nonlinear correlation between genes. We therefore proposed a counterpart (KCCU) based on kernel canonical correlation analysis (KCCA).

Results

Through simulation, the KCCU statistic was shown to be a valid test and more powerful than the CCU statistic with respect to sample size and interaction odds ratio. Analysis of data from regions involving three genes on rheumatoid arthritis (RA) from Genetic Analysis Workshop 16 (GAW16) indicated that only the KCCU statistic was able to identify interactions reported earlier.

Conclusions

The KCCU statistic is a valid and powerful gene-based method for detecting gene-gene co-association.

Background

Genome-wide association studies (GWASs), which may involve a large number of single nucleotide polymorphisms (SNPs) on many individuals, are widely used to identify genetic variants underlying complex diseases or other types of traits. Although a primary interest in a GWAS is to identify SNPs associated with a trait of interest, it is important to consider the associated genes and their co-association as well. One form of co-association is epistasis, which was introduced approximately 100 years ago and is generally defined as interactions among genes

Methods to detect GGIs on the basis of the statistical definition include but are not limited to logistic regression, multifactor dimensionality reduction

Methods

CCA

CCA is a classical multivariate method concerned with linear dependencies between two sets of variables. Let $x_i$, $y_i$ ($i = 1, \ldots, m$) denote samples of measurements on $m$ objects, collected as the rows of the data matrices $X$ and $Y$. We assume the data to be column centred. Let $L(X) = \{Xw \mid w \in \mathbb{R}^{n}\}$ be referred to as the column-space and $L(X^{T}) = \{X^{T}\alpha \mid \alpha \in \mathbb{R}^{m}\}$ as the row-space of $X$. CCA finds pairs of vectors $v_j \in L(X^{T})$ and $w_j \in L(Y^{T})$ such that $a_j = Xv_j$ and $b_j = Yw_j$ are maximally correlated. $(v_1, w_1)$ is the first pair of canonical vectors, $a_1 = Xv_1$ and $b_1 = Yw_1$ are the corresponding canonical variates, and their correlation is called the maximum canonical coefficient. Pairs of canonical vectors $(v_j, w_j)$ can be defined recursively by maximizing a similar expression while keeping subsequent variates orthogonal to those previously obtained. CCA can be interpreted as constructing pairs of factors from
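As an illustration of the maximum canonical coefficient described above, the following is a minimal sketch and not the implementation used in this study. It assumes NumPy, data matrices with objects in rows and full column rank, and obtains the canonical correlations as the singular values of the product of orthonormal bases of the two centred column spaces. The function name `max_canonical_correlation` is hypothetical.

```python
import numpy as np


def max_canonical_correlation(X, Y):
    """Largest canonical correlation between the columns of X and Y.

    Assumes rows are objects and both matrices have full column rank.
    """
    # Column-centre both data matrices, as assumed in the text.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases of the two column spaces (thin QR decomposition).
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # The singular values of Qx^T Qy are the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    # Guard against rounding slightly above 1.
    return float(min(s[0], 1.0))
```

For two single-column matrices the result reduces to the absolute value of the Pearson correlation, which gives a quick sanity check.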

KCCA

KCCA generalizes CCA as follows. Objects $x_i$ and $y_i$ are first mapped to some Hilbert spaces $H_x$ and $H_y$ through mappings $\Phi_x(\cdot)$ and $\Phi_y(\cdot)$; CCA is then performed on the images $\{\Phi_x(x_i)\}_{i=1}^{m}$ and $\{\Phi_y(y_i)\}_{i=1}^{m}$. Let $K_x$ and $K_y$ denote the corresponding kernel matrices. KCCA can then be expressed in terms of coefficient vectors $\alpha_j, \beta_j \in \mathbb{R}^{m}$ as a constrained optimization problem

Explicit forms for the mappings $\Phi_x(\cdot)$ and $\Phi_y(\cdot)$ are not always required, but the kernels $K_x$ and $K_y$ need to be fixed. Common kernel functions include linear, polynomial, radial basis function (RBF), sigmoid
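To make the kernel step concrete, here is a hedged sketch of a maximum kernel canonical coefficient with an RBF kernel. Several choices are assumptions and are not taken from this paper: the kernel parametrisation $k(x, x') = \exp(-\sigma \lVert x - x' \rVert^{2})$ (other conventions divide by $2\sigma^{2}$), the ridge-type regularisation constant `kappa`, and all function names. With this regularisation, the coefficient equals the largest singular value of $(K_x + \kappa I)^{-1} K_x K_y (K_y + \kappa I)^{-1}$ computed from centred Gram matrices.

```python
import numpy as np


def rbf_gram(X, sigma):
    """RBF Gram matrix, assuming k(x, x') = exp(-sigma * ||x - x'||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    # Clip tiny negative distances caused by rounding.
    return np.exp(-sigma * np.maximum(d2, 0.0))


def centre_gram(K):
    """Centre a Gram matrix in feature space: H K H with H = I - 11^T / m."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return H @ K @ H


def max_kernel_canonical_correlation(X, Y, sigma=0.5, kappa=0.1):
    """Largest regularised kernel canonical correlation (illustrative only)."""
    m = X.shape[0]
    Kx = centre_gram(rbf_gram(X, sigma))
    Ky = centre_gram(rbf_gram(Y, sigma))
    I = np.eye(m)
    # Rx = (Kx + kappa I)^{-1} Kx and Ry = (Ky + kappa I)^{-1} Ky are symmetric.
    Rx = np.linalg.solve(Kx + kappa * I, Kx)
    Ry = np.linalg.solve(Ky + kappa * I, Ky)
    # The largest singular value of Rx Ry is the regularised coefficient.
    s = np.linalg.svd(Rx @ Ry, compute_uv=False)
    return float(s[0])
```

Without regularisation (`kappa = 0`) the coefficient is trivially 1 whenever the Gram matrices have full rank, which is why some form of regularisation is needed in practice; the value of `kappa` here is arbitrary.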

Test statistic

A strategy analogous to that of the CCU statistic was used to construct the KCCU statistic, except that the maximum kernel canonical coefficient of the two genes, rather than the maximum canonical coefficient, was taken as the measure of gene-gene co-association in cases and controls. Let the genotype data of a case–control study be $(x_1^{D}, x_2^{D}, \ldots, x_p^{D})$ and $(y_1^{D}, y_2^{D}, \ldots, y_q^{D})$ for gene A and gene B in cases, and $(x_1^{C}, x_2^{C}, \ldots, x_p^{C})$ and $(y_1^{C}, y_2^{C}, \ldots, y_q^{C})$ in controls. The maximum kernel canonical coefficient $r_D$ between $(x_1^{D}, x_2^{D}, \ldots, x_p^{D})$ and $(y_1^{D}, y_2^{D}, \ldots, y_q^{D})$ obtained through KCCA can be considered a measure of gene-based gene–gene co-association in cases, and $r_C$ between $(x_1^{C}, x_2^{C}, \ldots, x_p^{C})$ and $(y_1^{C}, y_2^{C}, \ldots, y_q^{C})$ a measure of gene–gene co-association in controls. A transformation analogous to Fisher’s transformation of the simple correlation coefficient was applied to $r_D$ and $r_C$, i.e.

$$z_D = \frac{1}{2}\log\frac{1 + r_D}{1 - r_D}, \qquad z_C = \frac{1}{2}\log\frac{1 + r_C}{1 - r_C}.$$

The KCCU statistic for detecting the statistical significance of the difference in gene-based gene-gene co-association between cases and controls can be defined as

$$\mathrm{KCCU} = \frac{z_D - z_C}{\sqrt{\mathrm{var}(z_D) + \mathrm{var}(z_C)}}.$$

To estimate $\mathrm{var}(z_D)$ and $\mathrm{var}(z_C)$, a bootstrap procedure was employed. Seeing that the performance of kernel methods strongly relates to the choice of kernel functions and their parameters, we chose the RBF kernel owing to its flexibility in parameter specification
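The construction described above (a Fisher-type transformation of the co-association coefficient in cases and in controls, with bootstrap variance estimates) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it assumes the statistic is the difference of the transformed coefficients divided by the square root of the sum of their bootstrap variances, that subjects are resampled with replacement within each group, and that `coef` is any user-supplied function returning a coefficient strictly inside (-1, 1). All function names and the default of 200 bootstrap replicates are hypothetical.

```python
import math

import numpy as np


def fisher_z(r):
    """Fisher's transformation of a correlation-type coefficient."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def bootstrap_var_z(A, B, coef, n_boot, rng):
    """Bootstrap variance of the transformed coefficient within one group."""
    m = A.shape[0]
    zs = []
    for _ in range(n_boot):
        # Resample subjects (rows) with replacement, keeping gene pairs aligned.
        idx = rng.integers(0, m, size=m)
        zs.append(fisher_z(coef(A[idx], B[idx])))
    return float(np.var(zs, ddof=1))


def co_association_u(case_A, case_B, ctrl_A, ctrl_B, coef, n_boot=200, seed=0):
    """Standardised difference of transformed coefficients, cases vs controls."""
    rng = np.random.default_rng(seed)
    z_d = fisher_z(coef(case_A, case_B))
    z_c = fisher_z(coef(ctrl_A, ctrl_B))
    v_d = bootstrap_var_z(case_A, case_B, coef, n_boot, rng)
    v_c = bootstrap_var_z(ctrl_A, ctrl_B, coef, n_boot, rng)
    return (z_d - z_c) / math.sqrt(v_d + v_c)
```

Passing a linear canonical coefficient as `coef` gives a CCU-type statistic, while passing a kernel canonical coefficient gives a KCCU-type statistic; the rest of the procedure is unchanged.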

Data simulation

Simulation studies based on the HapMap data were conducted to assess the performance of KCCU relative to CCU under both the null ($H_0$) and alternative ($H_1$) hypotheses, in the following steps:

Step 1. Phased haplotype data (Phases 1 & 2 of CEU) were downloaded from the HapMap web site.

**Pairwise r^{2} among the six SNPs in the first region.** The six SNPs are rs16857402, rs2709, rs10020551, rs4484337, rs12643262, and rs7670601. The values to the right of the six dbSNP IDs (rs# IDs) are the corresponding minor allele frequencies.

**Pairwise r^{2} among the seven SNPs in the second region.** The seven SNPs are rs17201502, rs905619, rs637871, rs1027711, rs956864, rs640081, and rs706795. The values to the right of the seven dbSNP IDs (rs# IDs) are their minor allele frequencies.

Step 2. Based on the data above, large samples of 100,000 cases and 100,000 controls were generated using the software gs2.0. The 2^{nd} SNP of the first region and one SNP of the other region were taken as the causal variants, and they were removed in the simulation to assess gene-gene co-association. The interaction odds ratio was set to 1.0 under $H_0$ and to 1.1, 1.2, 1.3, 1.4 or 1.5 under $H_1$. The SNPs in the regions were coded according to an additive genetic model. To further investigate how performance depends on the minor allele frequency and LD of the causal SNPs, different SNP pairs from the two gene regions were defined as the causal variants.

**Two-locus interaction multiplicative effects model.**

Step 3. From the remaining SNPs, simulated data were sampled, and CCU and KCCU were performed under various sample sizes
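As a rough illustration of how an interaction odds ratio enters a two-locus disease model with additively coded genotypes, the sketch below uses a multiplicative odds model. It is not the gs2.0 procedure and not necessarily the exact model in the additional file; the baseline odds, the marginal odds ratios and all function names are hypothetical values chosen for illustration.

```python
import random


def disease_probability(g_a, g_b, base_odds=0.1, or_a=1.0, or_b=1.0, or_int=1.2):
    """Penetrance for genotypes g_a, g_b in {0, 1, 2} (additive coding).

    The odds of disease are multiplied by or_a per allele at locus A, by or_b
    per allele at locus B, and by or_int per unit of the product g_a * g_b.
    """
    odds = base_odds * (or_a ** g_a) * (or_b ** g_b) * (or_int ** (g_a * g_b))
    return odds / (1.0 + odds)


def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def simulate_status(genotype_pairs, seed=0, **model):
    """Draw case (1) or control (0) status for each (g_a, g_b) pair."""
    rng = random.Random(seed)
    return [
        int(rng.random() < disease_probability(g_a, g_b, **model))
        for g_a, g_b in genotype_pairs
    ]
```

Setting `or_int=1.0` corresponds to the null hypothesis of no interaction, and values such as 1.1 to 1.5 correspond to the alternatives considered in Step 2.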

Applications

The proposed KCCU statistic was applied to rheumatoid arthritis (RA) data from GAW16 Problem 1, consisting of 2,062 Illumina 550k SNP chips from 868 RA patients and 1,194 normal controls collected by the North American Rheumatoid Arthritis Consortium

Results

Simulation

Shown in the table below are the type I error rates and normality test results of the CCU and KCCU statistics under $H_0$. The KCCU statistic is normally distributed according to the one-sample Kolmogorov-Smirnov test, and its type I error rates are close to the nominal value (α = 0.05) for all sample sizes considered. This indicates that the proposed statistic performs well under the null hypothesis.

| Sample size | CCU: Type I error | CCU: Normality test (D) | KCCU: Type I error | KCCU: Normality test (D) |
|---|---|---|---|---|
| 1000 | 0.052 | >0.55 | 0.049 | >0.55 |
| 2000 | 0.051 | >0.55 | 0.054 | >0.55 |
| 3000 | 0.056 | >0.55 | 0.052 | >0.55 |
| 4000 | 0.048 | >0.55 | 0.051 | >0.55 |
| 5000 | 0.053 | >0.55 | 0.050 | >0.55 |

D, Kolmogorov-Smirnov D test.
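Empirical type I error rates and power of this kind are rejection proportions over simulated replicates. A minimal sketch, assuming the statistic is referred to a standard normal distribution with a two-sided test (the function names are hypothetical):

```python
import math


def two_sided_p(u):
    """Two-sided p-value of a statistic u under a standard normal reference."""
    return math.erfc(abs(u) / math.sqrt(2.0))


def rejection_rate(u_values, alpha=0.05):
    """Proportion of replicates rejected at level alpha.

    Under the null hypothesis this estimates the type I error rate;
    under an alternative it estimates the power.
    """
    rejected = sum(1 for u in u_values if two_sided_p(u) < alpha)
    return rejected / len(u_values)
```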

Results on various interaction odds ratios and a sample size of 3,000 are shown in Figure

**Power of CCU and KCCU statistics given different interaction odds ratios and a sample size of 3,000.**

**Power of CCU and KCCU statistics given an interaction odds ratio of 1.4 and different sample sizes.**

**Power of CCU and KCCU statistics when SNP pairs from two regions are defined as causal variants at an interaction odds ratio of 1.3 and a sample size of 3,000.**

Application

The performance of logistic regression test, CCU and KCCU statistics on pair-wise gene-gene co-association of three genes is shown in Table

| Co-association | C5-ITGAV | C5-VEGFA | ITGAV-VEGFA |
|---|---|---|---|
| Logistic regression | 0.1015 | 0.1425 | 0.1840 |
| CCU | 0.5387 | 0.5325 | 0.8317 |
| KCCU, σ=0.05 | <0.001^{*} | <0.001^{*} | <0.001^{*} |
| KCCU, σ=0.5 | <0.001^{*} | <0.001^{*} | <0.001^{*} |
| KCCU, σ=5 | <0.001^{*} | <0.001^{*} | <0.001^{*} |
| KCCU, σ=50 | <0.001^{*} | <0.001^{*} | <0.001^{*} |

^{*}significant at level 0.05.

Discussion

We have extended the CCU statistic to a new statistic, KCCU, which can extract nonlinear correlation between two genes. Simulation studies show that both the CCU and KCCU statistics performed well under the null hypothesis, with KCCU being more powerful than CCU with respect to significance level, sample size and relative risk. As results vary with the user-defined kernel parameter, various values of the bandwidth parameter of the RBF kernel were applied to the RA data in GAW16 Problem 1. The logistic regression test and the CCU statistic failed to detect any interaction, whereas the KCCU statistic identified the pair-wise interactions among the three genes under all parameter values considered. The interaction between

A reviewer also suggested that we reiterate the relationship between gene-gene co-association and GGI, which is readily available. GGI generally refers to the synergistic or antagonistic effect of two genes in addition to the sum of their independent effects on an outcome. To represent the interaction between two genes A and B in a case–control association study, a product term is customarily added to the logistic regression model, $\mathrm{logit}(P) = \beta_0 + \beta_1 A + \beta_2 B + \beta_3 AB$.

Several issues remain to be resolved: the uncertainty in choosing a kernel function with appropriate parameters for each data set, the poor performance of both CCU and KCCU at small interaction odds ratios (e.g. 1.1), and the possible failure of the maximum kernel canonical correlation coefficient to represent gene-gene co-association.

Conclusions

The KCCU statistic is a valid and powerful gene-based method for detecting gene-gene co-association compared with the CCU statistic and the logistic regression test. Further work is needed to make its use in GWAS more practical.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

ZSY, QSG, YGH, XSZ, FYL, JHZ and FZX conceptualized the study, acquired and analyzed the data and prepared the manuscript. All authors approved the final manuscript.

Acknowledgements

This work was supported by a grant from the National Natural Science Foundation of China (30871392) and the Young Talents Innovation Foundation of the School of Public Health, Shandong University. We thank GAW16 and the North American Rheumatoid Arthritis Consortium for the RA data and two anonymous reviewers for suggestions which led to substantial improvement and clarification of the paper.