Background
It is widely acknowledged that complex diseases are most likely caused by a combination of environmental factors and interactions among multiple genes 1. With the fast development of the genotyping technologies, single nucleotide polymorphisms (SNPs) have become one of the most commonly used biomarkers for disease associated gene identification in casecontrol designed genome wide association (GWA) studies 2345. However, there are several practical problems in analyzing the SNP genotype data. First, to identify true disease associated SNPs from a massive set of candidate SNPs, an accurate SNP selection strategy is of critical importance. However, the accurate identification of disease associated SNPs for phenotype classification is hindered by the curseofdimensionality and the curseofsparsity 6. Furthermore, the datasets generally contain high level of data noise, high redundancy, and many missing values, and most seriously, it is evident that epistasis is a ubiquitous phenomenon in complex diseases 78, or in other words, genegene interactions and geneenvironment interactions are likely to be important contributors to the development of many complex diseases (note the phrases genegene interaction and SNPSNP interaction are used interchangeably in this paper). The explanations from the biological perspective are as follows: (1) a SNP in a coding region may cause amino acid substitution, leading to the functional alteration of the protein; (2) a SNP in a promoter region can affect transcriptional regulation, causing the change of the protein expression abundance; and (3) a SNP in an intron region can affect splicing and expression of the gene 9. All these effects contribute quantitatively and qualitatively to the ubiquity of biomolecular interactions in biological systems. Although it is a common characteristic in complex disease development, the identification of those genetic interactions have been proven to be very challenging 10.
Most of the earliest studies focused on identifying a set of SNPs in which individual SNP has a strong association with the phenotype by applying statistical measures such as χ^{2}statistic and logistic regression 1112. However, several problems arise when applying these methods. First, it is unclear how to best adjust the resulting pvalues after testing for a very large number of possibly nonindependent hypotheses. Second, complex diseases are usually caused by the action of multiple genes in a nonadditive fashion. The standard analytical approaches in GWA studies often proceed by identifying only a very small number of SNPs (usually only one or two) that exhibit strong statistical association with the target phenotype. In other words, only SNPs that independently have a strong discriminating ability are selected, while other SNPs that individually have weaker association are not discovered 13. However, it is common that a combination of two or more SNPs, each having weak association with the phenotype, can classify the phenotypes of samples with a higher accuracy. This is natural since complex diseases are most likely caused by multiple genes and their interaction with environmental factors.
To cope with these problems, it is desirable to develop new methods which can consider multiple loci jointly. A number of such methods have been developed recently. Among these methods, nonparametric methods like Polymorphism Interaction Analysis (PIA) 14, Multifactor Dimensionality Reduction (MDR) 15, and Combinatorial Partitioning Method (CPM) 16 are the most popular ones probably due to their good generalization property on different interaction models. Specifically, PIA tries to apply multiple evaluation metrics for ranking and scoring SNP combinations while MDR and CPM attempt to modify the feature dimension to discriminate SNPSNP interactions. However, there is no onesizefitsall method for the detection and characterization of genegene interaction relationships in GWA studies. Several comparison and evaluation studies suggested that applying a combination of multiple complementary algorithms, each having its own strength, could be the most effective strategy to increase the chance of a successful analysis 101718.
A recent study by McKinney et al. 19 proposed to use a combination of a machine learning filter and an information theoretic approach to identify SNPSNP interactions. McKinney et al. found that combining the set of SNPSNP interactions into a graph can yield interesting insights about the underlying biological processes. It is anticipated that similar networkbased analysis approaches can be used as a downstream analysis for any genegene interaction identification algorithms.
In this study, we attempt to address the problem from an alternative perspective by converting the issue into a combinatorial feature selection problem. From the data mining perspective, a sample from a SNP dataset of an association study is described as a SNP feature set of the form s_{i}={g_{1}, g_{2}, ..., g_{n}}, (i = 1, ..., m) where each SNP, g_{i }, is a categorical variable which can take the value of 0, 1, and 2 for genotypes of aa, Aa, or AA at this locus, and m is the number of samples in the dataset. The dataset can, therefore, be described as an m × n matrix D_{mn}={(s_{1}, y_{1}), (s_{2}, y_{2}), ..., (s_{m}, y_{m})}, where y_{i }is the class label of the i^{th }sample. One can evaluate the discrimination ability of a set of SNPs jointly by applying the following two steps:
• Generating a reduced SNP feature set
s
′
i
=
{
g
1
,
g
2
,
…
,
g
d
}
,
(
s
′
i
⊂
s
i
)
in a combinatorial manner which restrains the dataset matrix into
D
m
d
=
{
(
s
′
1
,
y
1
)
,
(
s
′
2
,
y
2
)
,
…
,
(
s
′
m
,
y
m
)
}
. A key observation is that feature selection algorithms which evaluate SNPs individually are not appropriate since they cannot capture the associations among multiple SNPs.
• Creating classification hypothesis h using an inductive algorithm, and evaluating the quality of the trained classification model using criteria such as accuracy, sensitivity, and/or specificity with an independent test set.
Without lose of generality, we use notation s to denote applying a SNP subset to restrain the SNP dataset D_{mn}. If a SNP combination s yields a lower misclassification rate than others, we shall consider that it possibly containing SNPs with main effects or SNPSNP interactions with major implications. We now have two challenging problems for the SNP interaction identification. The first challenge is to generate SNP combinations efficiently since the number of SNP combinations grows exponentially with the number of SNPs and it is infeasible to evaluate all possible combinations exhaustively. The second challenge is to determine which inductive algorithm should be applied for the goodness test of SNP combinations. To tackle the first problem, we shall apply genetic algorithm (GA) since it has been demonstrated to be one of the most successful wrapper algorithms in feature selection from highdimensional data 2021.
Furthermore, its intrinsic ability in capturing nonlinear relationships 22 is valuable for modeling various nonadditive interactions. With regard to the second problem, there is no guiding principle on which inductive algorithms are preferable for identification of multiple loci interaction relationships. However, a promising solution is to employ an ensemble of classifiers and then to integrate/balance the evaluation results from these classifiers 23. The key issue in applying this method is that the base classifiers used for ensemble integration should be able to capture multiple SNP interactions which commonly have nonlinear relationships. This may be achieved by using appropriate nonlinear classifiers.
The rationale of using an ensemble of classifiers can be described as follows: suppose that a given classifier i generates a hypothesis space ℋ_{i }for sample classification. If the number of training samples m is large enough to characterize the real hypothesis f (in this context, f is the set of disease associated SNPs and SNP combinations) and the data are noisefree, the hypothesis space generated by i should be able to converge to f through training. However, since the amount of the training samples is often far too small compared to the size of the hypothesis space which increases exponentially with the size of the features (SNPs), the number of hypotheses a classifier can fit the available data is often very large. One effective way to constrain the hypothesis space is to apply multiple classifiers each with a different hypothesis generating mechanism. If each classifier fulfills the criteria of being accurate and diverse 24, it can be shown that one is able to reduce the hypothesis space to better capture the real hypothesis f by combining them with an appropriate integration strategy 25.
By combining GA and the ensemble of classifiers, we obtain a genetic ensemble (GE) algorithm for genegene interaction identification. The proposed algorithm has the following advantages:
• It is a nonparametric and modelfree approach. Unlike traditional parametric methods (e.g. linear regression etc.), there is no need to specify and assume the number of parameters and the interaction models. As a consequence, the proposed method generalizes well and can capture a range of interaction relationship such as additive and dominant effects.
• It accommodates the detection of both linear and nonlinear relationship of genegene interactions. As aforementioned, the ensemble could be formed by classifiers with nonlinear separation abilities. Both linear and nonlinear genegene interaction relationships are common in complex disease 26, and could be captured by a nonlinear classifier 27.
• Unlike many other methods which often study different sizes of multiloci interactions separately and repeatedly, our algorithm identifies different sizes of interactions in parallel. This feature makes the proposed algorithm particularly attractive in identifying higherorder genegene and geneenvironment interactions.
• The system is flexible. Different inductive algorithms as well as integration methods can be readily added in for further improvements.
One other motivation for developing alternative methods for SNPSNP interaction identification is in the hope that different algorithms may complement each other to increase the overall chance in identifying true interaction relationships. Therefore, to evaluate the degree of complementarity of multiple algorithms for SNPSNP interaction identification is also an important part of this study. Specifically, based on the notion of double fault 28, we designed a formula for calculating the cooccurrence of misidentification which gives an indication of the degree of complementarity between two different algorithms. Accordingly, the joint identification of using multiple algorithms is derived. Therefore, the contribution of this work is twofold: (1) designing a genetic ensemble algorithm for SNPSNP and SNPenvironment interaction identification; and (2) proposing a method for evaluating the degree of complementarity among multiple algorithms.
Methods
Overview of genetic ensemble
In our previous study, a multiobjective GA system is implemented for high dimensional data analysis 29. Here, we implement the GE algorithm in a similar way. The algorithm executes in an iterative manner and results collected through multiple iterations are used to assess the relative importance of SNPs and SNPSNP combinations.
As illustrated in Figure 1, the GE algorithm is applied to SNP selection repeatedly. In each run, randomly generated SNP subsets are fed into an ensemble committee for goodness evaluation. Two classifier integration strategies namely blocking and voting, and a diversity promoting method called double fault statistic are employed to guide the optimization process. When the evaluation of a SNP subset is completed, the feedbacks of this SNP subset are combined through a given set of weights and sent back to GA as the overall fitness of this SNP subset. After the entire population of GA is evaluated, selection, crossover and mutation are applied and the next generation begins. At the last generation of GA, the chromosome with the highest fitness is selected, and the SNP subset it represents is said to be the best SNP subset genrated by GA. The entire GA procedure is repeated (with different seeds for random initialization) n times (n = 30 in our experiments) to generate n best SNP subsets. These SNP subsets are then analyzed to identify frequently occurring SNPpairs, SNPtriplets, and higherorder SNP combinations.
<p>Figure 1</p>A schematic representation of the genetic ensemble algorithm
A schematic representation of the genetic ensemble algorithm. Multiple classifiers are integrated for genegene and geneenvironment interaction identification. Genetic algorithm is employed to select SNP subsets that represent potential genegene and geneenvironment interactions. After a predefined iterations of n, a total of n selected SNP subsets are ranked in a combinatorial manner. Each SNP combination is then assigned an identification frequency score based on the proportion of times this SNP combination is present among the n subsets. A SNP combination is called a functional SNP combination if it is highly ranked or if its frequency score is above a specified threshold.
For SNP interaction identification, a combinatorial ranking is applied to the n selected SNP subsets. Each possible SNP combination is then given an identification frequency score (the number of times it appears divided by the total number of iteration n). For example, if the SNP combination {snp_{1}, snp_{2}} appears in 25 out of 30 iterations, then its identification frequency score is 25/30 = 0.833. Two alterative criteria can be used to decide whether a SNP combination should be called or not. The first criterion is to set a frequency score cutoff, say 0.8, and call all SNP combinations with frequency score higher than this cutoff as functional SNP combinations. The second criterion is to set a cutoff rank, and call all SNP combinations with equal or higher to that rank as functional SNP combinations. As will be demonstrated in subsequent sections, the choice between this two criteria is likely a balance between detection power and false discovery rate.
Genetic algorithm
The number of SNPs considered by the genetic ensemble algorithm for potential interaction detection ranges from the lower bound of 2 to the upper bound of d, where d is the "chromosome" size of GA. The size of the GA chromosome has two implications. Firstly, it controls the number of factors we can identify. For example, if the size of d = 15 is used, we can identify from 2factor up to 15factor interactions in parallel. Secondly, d also influences on the size of the combinatorial space to be explored. It is a tradeoff between the computational time and the combinatorial space to be searched. Therefore, for different SNP sizes (that is, the number of SNPs in the dataset), we shall use different sizes of d accordingly. Similar to the size of GA chromosome, the population size p and the generation of GA g are also specified according to the SNP size in the dataset. In our implementation of the GE algorithm, the parameters d, p, and g can be specified by users. The default values of these parameters are chosen empirically such they work well in a range of datasets.
For GA selection operation, we employ the tournament selection method as it gives the control of convergence speed. Specifically, the tournament selection size, denoted as t, is dependent on the size of the population, varying from 3 to 7. The measure for determining the winner is as follows:
W
i
n
n
e
r
=
arg
max
s
∈
p
f
i
t
n
e
s
s
(
R
i
(
p
)
)
(
i
=
1
,
2
,
...
,
t
)
where R_{i}(.) is the random selection function which randomly selects gene subset s from the GA population p, t is the tournament size, and fitness(.) determines the overall fitness of the randomly selected gene subset. Single point crossover is adopted with the probability of 0.7. In order to allow pair mutations, we implemented a multimutation strategy; that is, when a single mutation occurs (configured with the probability of 0.1) on a chromosome, another single point mutation may occur on the same chromosome with a probability of 0.25 and so on. The chromosome coding scheme is to assign an id to each SNP in the dataset, and to represent the chromosome as a string of SNP ids which specify a selected SNP subset. For each position on a chromosome, it could be a SNP id or a "0" which specifying an empty position.
Therefore, different sizes of SNP combinations are explored in a single GA population in parallel. Table 1 summarizes the parameter settings.
<p>Table 1</p>Multiobjective genetic algorithm parameter settings
Parameter
Value
Genetic algorithm
Multiobjective
Chromosome size
1525
Population size
40340
Termination generation
820
Selector
Tournament selection (37)
Crossover
Single point (0.7)
Mutation
Multiple points (0.1 & 0.25)
The fitness of GA is defined as follows:
f
i
t
n
e
s
s
(
s
)
=
w
1
×
f
i
t
n
e
s
s
B
(
s
)
+
w
2
×
f
i
t
n
e
s
s
V
(
s
)
+
w
3
×
f
i
t
n
e
s
s
D
(
s
)
where s denotes a SNP combination under evaluation. The functions fitness_{B}(s), fitness_{V }(s) and fitness_{D}(s) denote the fitness of a SNP combination s as evaluated by the blocking, voting and double fault diversity measures, respectively. A complexity regularization procedure is implemented in the GE algorithm to favor shorter SNP combination if two SNP combinations have the same fitness value. The computation details of each component of the fitness function are described in the next section.
Integration strategies
Blocking
Blocking is a statistical strategy that creates similar conditions to compare random configurations in order to discriminate the real differences from differences caused by fluctuation and noise 30. Suppose a total of M classification algorithms, each having a different hypothesis denoted as
h
i
s
, (i = 1, ..., M), are used to classify the data using a SNP subset s. The fitness function determined by blocking integration strategy is as follows:
f
i
t
n
e
s
s
B
(
s
)
=
∑
i
=
1
M
B
C
(
p
(
t

h
i
s
,
D
)
,
y
)
where y is the class label vector of the test dataset D, function p(.) predicts/classifies samples in D as t using his, and BC(.) is the balanced classification accuracy devised to deal with the dataset with an imbalanced class distribution. In the binary classification, it is the area under ROC curve (AUC) 31, which can be approximated as follows:
B
C
(
p
(
t

h
i
s
,
D
)
,
y
)
=
S
e
+
S
p
2
S
e
=
N
T
P
N
c
a
s
e
×
100
,
S
p
=
N
T
N
N
c
o
n
t
r
o
l
×
100
where Se is the sensitivity value calculated as the percentage of the number of true positive classification (N_{TP }) divided by the number of cases (N_{case}). Sp is the specificity value calculated as the percentage of the number of true negative classification (N_{TN}) divided by the number of controls (N_{control}). Such a balanced classification accuracy measure can accommodate the situation in which the dataset contains an imbalanced class distribution of cases and controls 32.
The idea of applying this strategy for classifier integration in SNP selection is that by using more classifiers to validate a SNP subset, we are able to constrain the hypothesis space to the overlap region ℋ_{o}, increasing the chance of correctly identifying functional SNPs and SNPSNP interactions.
Voting
The second classifier integration strategy applied in our GE system is majority voting 33. Majority voting is one of the simplest strategies in combining classification results from an ensemble of classifiers. Despite its simplicity, the power of this strategy is comparable to many other more complex methods 34. With a majority voting of M classifiers, consensus is made by k classifiers where:
k
≥
{
M
/
2
+
1
:
if
M
is
even
(
M
+
1
)
/
2
:
if
M
is odd
Again, suppose a total of M classifiers, each having a different hypothesis denoted as his, (i = 1, ..., M), are used to classify the data using SNP subset s, the fitness function derived from majority voting is as follows:
f
i
t
n
e
s
s
V
(
s
)
=
B
C
(
V
k
(
t
′

∑
i
=
1
M
p
(
t

h
i
s
,
D
)
)
,
y
)
where y is the class label vector of the test dataset D, V_{k}(.) is the decision function of majority voting, and t' is the voting prediction. Here the balanced classification accuracy BC(.) is calculated with voting results.
The reason for using the majority voting integration is to improve sample classification accuracy while also promoting the diversity among individual classifiers implicitly 35.
Double fault diversity
The third objective function is an explicit diversity promoting strategy called double fault statistic. This statistic is commonly used to measure the diversity of ensemble classifiers 28.
Let c_{a}, c_{b }∈ {F, S} in which F denotes the sample being misclassified by a classifier while S denotes the sample being correctly classified. We define
N
c
a
c
b
as the number of samples that are (in)correctly classified by two classifiers in which the correctness of the two classifiers is denoted by c_{a }and c_{b }respectively. Using this notation, we can obtain the term:
D
(
p
(
t

h
c
a
s
,
D
)
,
p
(
t

h
c
b
s
,
D
)
)
=
N
F
F
N
which is the estimation statistic of coincident errors of a pair of classification models
h
c
a
s
and
h
c
b
s
(hence the name "double fault") in classification of a total of N samples in test dataset D, using SNP subset s.
The fitness with regard to the diversity measurement of M classifiers over subset s (denoted as fitness_{D}(s)) derived from the double fault statistic are defined as follows:
f
i
t
n
e
s
s
D
(
s
)
=
1
−
2
M
(
M
−
1
)
∑
a
=
1
M
∑
b
=
a
+
1
M
D
(
p
(
t

h
c
a
s
,
D
)
,
p
(
t

h
c
b
s
,
D
)
)
The value of this fitness function varies from 0 to 1. The value equals 0 when all classifiers misclassified every sample. It equals 1 when no sample is misclassified or there is a systematic diversity, leading to no sample been misclassified by any pair of classifiers.
Classifier selection
The motivation of applying nonlinear classifiers is based on the assumption that nonlinear and nonadditive relationships are commonly presented in genegene interaction 26. This is particularly relevant in detecting complex epistatic interaction that involves both additive and dominant effects. Therefore, in ensemble construction, we focus on evaluating nonlinear classifiers. Moreover, we prefer classifiers that are relatively computationally efficient since the identification of genegene interaction is carried out in a wrapper manner. Thus, our attention has been focused on decision tree based classifiers and instance based classifiers, as well as their hybrids because they are fast among many alternatives, while also being able to perform nonlinear classification. However, we note that any combination of linear and nonlinear classifiers can be used in our framework.
With above considerations, an initial set of experiments is conducted to select candidate classifiers for ensemble construction. The classifier selection details are presented in Results Section.
Datasets
Simulation datasets
In this study, we use the simulated datasets generated by genegene interaction models described by Moore et al. 32. In each dataset, a pair of functional SNPs which simulate genegene interaction are embedded along with nonfunctional SNPs, and the task is to identify this functional SNP pair from the nonfunctional ones.
For the datasets with balanced class distribution, the class ratio is 1:1 with 100 case samples and 100 control samples. The datasets are simulated under three different genetic heritability models (heritability of 0.2, 0.1, and 0.05), and two SNP sizes (SNP size of 20 and 100). This gives six sets of datasets and each set contains 100 replicates each generated with a different random seed 36. The property of the imbalanced datasets used for evaluation is the same as the balanced datasets, except that the class ratio is approximately 1:2 with 67 case samples and 133 control samples. For imbalanced data, we restrict the evaluation to SNP size of 20, and therefore, we obtain three sets of datasets with each set containing 100 replicates. Table 2 summarizes the characteristics of the simulated datasets used for evaluation.
<p>Table 2</p>Summary of simulation datasets
Dataset
Sample size
Ratio
Heritability
SNP size
No. replicates
balanced_200_0.2_20
200
1:1
0.2
20
100
balanced_200_0.1_20
200
1:1
0.1
20
100
balanced_200_0.05_20
200
1:1
0.05
20
100
balanced_200_0.2_100
200
1:1
0.2
100
100
balanced_200_0.1_100
200
1:1
0.1
100
100
balanced_200_0.05_100
200
1:1
0.05
100
100
imbalanced_200_0.2_20
200
1:2
0.2
20
100
imbalanced_200_0.1_20
200
1:2
0.1
20
100
imbalanced_200_0.05_20
200
1:2
0.05
20
100
Agerelated macular degeneration dataset
Agerelated macular degeneration (AMD) is the major cause of uncorrectable blindness of the elderly in many countries. As a typical complex disease, AMD is influenced by complex interactions among multiple genes and environmental factors, making it ideal for testing genegene interaction identification methods. In our experiment, the proposed GE algorithm, PIA, and MDR are applied to the dataset generated from a GWA study of AMD 2. This dataset contains a genomewide screening of 96 AMD cases and 50 controls and more than 100,000 SNPs have been genotyped for each individual.
Evaluation statistics
Evaluation statistics for single algorithm
We compare the detection power of the proposed GE algorithm with PIA (version: PIA2.0) and MDR (version: mdr2.0_beta_6). In the previous studies of MDR 37 and PIA 14, the power of an algorithm for identifying genegene interactions is estimated as the percent of times the algorithm "successfully identifies" the true functional SNP pair from 100 replicates of simulated datasets. This is repeated for every heritability model to quantify how well each algorithm performs when dealing with datasets of different difficulties (lower heritability being more difficult). An algorithm is said to have successfully identified a functional SNP pair in a dataset if the true SNPpair is reported as the top rank. For comparison with MDR and PIA, we follow this approach and estimate the power of GE, MDR, and PIA using following statistics:
P
o
w
e
r
=
N
S
N
where N is the number of datasets tested (N = 100 in our case), and N^{S }is the number of successful identification.
For GE in particular, we are also interested in estimating the distribution of false discovery rate (FDR) and true positive rate (TPR) since, in the worst case, if there is no SNPSNP interaction in the dataset, a topranked interaction list only contains false positive identifications. Formally, we estimate FDR as:
F
D
R
(
c
)
=
N
F
P
(
c
)
N
(
c
)
where FDR(c) is the FDR at the cutoff of c, N_{FP }(c) is the number of accepted false positive identifications at the cutoff of c, and N(c) is the number of accepted identifications at the cutoff of c.
Similarly, TPR can is calculated as:
T
P
R
(
c
)
=
N
T
P
(
c
)
N
T
P
(
c
)
+
N
F
N
(
c
)
where TPR(c) is the TPR at the cutoff of c. N_{TP }(c) and N_{FN }(c) are the number of accepted true positives and the number of false negatives at the cutoff of c.
Both the rank and the identification frequency score of each SNP combination can be used as the cutoff to calculate FDR and TPR at different confidence levels. We consider both approaches and using the 100 replicate datasets of each heritability model, we obtain the average FDR and TPR at different cutoffs for each heritability model.
Evaluation statistics for combining algorithms
One major motivation for developing a genetic ensemble algorithm for genegene interaction identification is to harness the complementary strength of different classifiers such that a more robust and predictive SNP subset can be obtained. To extend this idea further, we propose to combine the inferred SNPSNP interaction from different algorithms (such as MDR and PIA), in the hope that more robust results can be obtained. However, such benefits may come only when the results yielded by different SNPSNP interaction identification algorithms are complementary to each other, which is analogous to the idea of the ensemble diversity.
By modifying the equation of double fault, we design the following terms to quantify the degree of complementarity (CD) of a pair of algorithms in SNPSNP interaction identification:
S
F
(
X
,
Y
)
=
N
F
S
+
N
S
F
,
D
F
(
X
,
Y
)
=
N
F
F
C
D
(
X
,
Y
)
=
S
F
(
X
,
Y
)
D
F
(
X
,
Y
)
+
S
F
(
X
,
Y
)
where N^{XY }is the number of datasets with certain identification status using algorithms X and Y , and X, Y ∈ {F, S} in which F denotes an algorithm fails to identify the functional SNP pair while S denotes it succeeds to identify the functional SNP pair. SF (X, Y ) (single fault) is the number of times algorithms X and Y give inconsistent identification result, which is the situation that one algorithm succeeds while the other one fails. DF (X, Y ) (double fault) is the number of times both X and Y fail. The pairwise degree of complementarity of the algorithms X and Y is determined by CD(X, Y ).
Excluding the case in which both X and Y achieve 100% successful identification (which gives
0
0
), the value of CD(X, Y ) varies between 0 and 1. When the results produced by X and Y are completely complementary to each other, the value of DF (X, Y ) decreases to 0, and the value of CD(X, Y ) reaches 1.
On the contrary, the value of CD(X, Y ) decreases with the decrease of the degree of complementarity between algorithms X and Y , and reaches 0 when no degree of complementarity is found.
Our premise is that combining algorithms with higher degree of complementarity will yield higher identification power. In this study, we estimate the joint power of two or three algorithms as:
P
o
w
e
r
J
(
X
,
Y
)
=
N
−
D
F
(
X
,
Y
)
p
o
w
e
r
J
(
X
,
Y
)
=
N
−
T
F
(
X
,
Y
,
Z
)
;
T
F
(
X
,
Y
,
Z
)
=
N
F
F
F
where TF (X, Y, Z) is the "triple fault" which gives the coincident errors of three identification algorithms, and Power_{J }(X, Y ) and Power_{J }(X, Y, Z) are the joint power of combining two and three identification algorithms respectively.
Results
Classifier selection and ensemble construction
One of the most important steps in forming an ensemble of classifiers is base classifier selection. As described above, characteristics such as nonlinear separation capability, computational efficiency, high accuracy and diversity should be taken into account. With those considerations, a classifier selection and ensemble construction experiment was carried out. Specifically, we tested the merits of each candidate classifier using datasets with model number of 10, 11, 12, 13 and 14 from Moore et al. 36, all of which have minor allele frequency of 0.2, heritability of 0.1, and sample size of 200 (100 case and 100 control). These are considered as "difficult" datasets since they are simulated to have low minor allele frequency, low heritability, and small sample size 14. Twenty replicates from each model were used for evaluation and the power of each classifier in identifying the functional SNP pair was calculated. Figure 2(a) shows the 12 candidate classifiers we evaluated in this study. They are REPTree (REPT), Random Tree (RT), Alternating Decision Tree (ADT) 38, Random Forest (RT) 39, 1Nearest Neighbor (1NN), 3Nearest Neighbor (3NN), 5Nearest Neighbor (5NN), Decision Tree (J48), 1Nearest Neighbor with Cover Tree (CT1NN), 3Nearest Neighbor with Cover Tree (CT3NN) 40, entropy based nearest neighbor (KStar) 41 and 5Nearest Neighbor with Cover Tree (CT5NN).
<p>Figure 2</p>Selection of base classifiers and ensemble configuration
Selection of base classifiers and ensemble configuration. (a) Classifier selection. The value on the top of each bar denotes the estimated power in functional SNP pair identification using each classifier. (b) Ensemble configuration. The value on the top of each bar denotes the power in functional SNP pair identification using ensemble of classifiers with different values of GA chromosome mutation rate and diversity integration weight, respectively (denoted as a duplex in the Xaxis).
The identification power of each classifier was estimated using the simulated datasets. Among the twelve classifiers, six of them successfully identified the functional SNP pair more than 50% of the time. Five of them were selected to form the ensemble (colored in red in Figure 2(a)). They are J48, KStar, and three Decision Tree and kNearest Neighbor hybrid  CT1NN, CT3NN, and CT5NN.
The configuration of parameters such as GA chromosome mutation rate and integration weights of diversity measure, blocking, and voting were tested using the same sets of data as above. Specifically, the mutation rates tested were 0.05, 0.1 and 0.15. The integration weights of diversity tested were also 0.05, 0.1 and 0.15 while the integration weights for blocking and voting were kept equal, and the three weights add up to 1. This gives 9 possible configurations for the ensemble of classifiers. The identification powers of the ensemble of classifiers using these 9 configurations are shown in Figure 2(b). It is observed that all the ensembles achieved better results than the best single classifier which has an identification power of 53.8%. Among them, the best parameter setting is (0.1, 0.15) which specifies the use of a mutation rate of 0.1 and an integration weights of 0.15, 0.425, and 0.425 for diversity, blocking, and voting, respectively. This configuration gives an identification power of 60.8% which is a significant improvement from 53.8%. This setting was then fixed in our GE in the follow up experiments.
Simulation results
Genegene interaction identification
In the simulation experiment, we applied GE, PIA, and MDR for detecting the functional SNP pairs from 20 candidate SNPs and 100 candidate SNPs, respectively. Table 3 shows the evaluation results. By fixing the candidate SNP size to 20 and testing datasets generated with three heritability values (0.2, 0.1, and 0.05), we observed a decrease of the average identification power of the three algorithms (taking the average of the three identification algorithms) from 98.33 ± 0.94 to 78.67 ± 2.62 and to 43.67 ± 0.94. By fixing the candidate SNP size to 100 and testing datasets generated with three heritability value (0.2, 0.1 and 0.05), the average identification power drops to 93.67 ± 0.94, 48.33 ± 2.49, and 19.00 ± 1.63, respectively. It is clear that both heritability and SNP size are important factors to SNPSNP interaction identification. By comparing the power of each algorithm, we found no significant differences. The standard deviation is generally small ranging from 0.94 to 2.62, indicating that the three algorithms have similar performance.
<p>Table 3</p>Functional SNP pair identification in balanced datasets using GE, PIA, and MDR
Dataset
GE
PIA
MDR
Power (%)
Power (%)
Power (%)
balanced_200_0.2_20
99
97
99
balanced_200_0.1_20
80
75
81
balanced_200_0.05_20
45
43
43
balanced_200_0.2_100
95
93
93
balanced_200_0.1_100
45
49
51
balanced_200_0.05_100
17
19
21
To investigate whether an imbalanced class distribution affects identification power, we applied GE, PIA, and MDR to imbalanced datasets with a casecontrol ratio of 1:2 and a candidate SNP size of 20. From Table 4, we found that the power of the three identification algorithms decreased in comparison to those of the balanced datasets (Table 3). Such a decline of power is especially significant when the heritability of the dataset is small. This finding is essentially consistent with 32 in that datasets of larger heritability values are more robust to imbalanced class distribution. Since the sample size and other dataset characteristics between the balanced and the imbalanced datasets are the same, the observed decline of power could be attributed to the imbalanced class distribution. It is also noticed that the identification power of PIA is relatively lower compared to GE and MDR. This indicates that PIA may be more sensitive to the presence of the imbalanced class distribution than GE and MDR.
<p>Table 4</p>Functional SNP pair identification in imbalanced datasets using GE, PIA, and MDR
Dataset
GE
PIA
MDR
Power (%)
Power (%)
Power (%)
imbalanced_200_0.2_20
92
90
95
imbalanced_200_0.1_20
59
45
62
imbalanced_200_0.05_20
32
24
27
For the GE algorithm, two approaches were used to study the distribution of the TPR and FDR. For the first approach, we calculated the TPR and FDR by varying the rank cutoff of the reported SNP pairs. Figure 3 shows the distribution by using a rank cutoff of 1 to 10 (the lower the number, the higher the rank). Note that the rank cutoff of 1 gives the results equal to the power defined in Equation (10). For the second approach, we calculated the TPR and FDR by varying the identification frequency cutoff of the reported SNP pairs. Figure 4 shows the distribution by decreasing the frequency cutoff from 1 to 0. By comparing the results, we found that the decrease of the heritability (from 0.2, to 0.1 and to 0.05) has the greatest impact on TPR of GE. Sample size appears to be the second factor (from 20 SNPs to 100 SNPs), and the imbalanced class distribution is the third factor (from a balanced ratio of 1:1 to an imbalanced ratio of 1:2).
<p>Figure 3</p>True positive rate and false discovery rate estimation of GE at different rank cutoff
True positive rate and false discovery rate estimation of GE at different rank cutoff. Simulated datasets with different heritability models, number of SNPs, and class distribution, are used to evaluate the true positive rate and false discovery rate of GE at different identification cutoffs using different rankvalues (110).
<p>Figure 4</p>True positive rate and false discovery rate estimation of GE at different frequency score cutoff
True positive rate and false discovery rate estimation of GE at different frequency score cutoff. Simulated datasets with different heritability models, number of SNPs, and class distribution, are used to evaluate the true positive rate and false discovery rate of GE at different identification cutoffs using different frequency scores (10).
Generally, by decreasing the cutoff stringency (either rank cutoff or identification frequency cutoff), the TPR increases, and therefore, more functional SNP pairs can be successfully identified. However, this is achieved by accepting increasingly more false identifications (higher FDR). The simulation results indicate that FDR calculated by using the identification frequency cutoff is very steady regardless the change of heritability, SNP size, or class ratio. In most cases, an FDR close to 0 is achieved with a cutoff greater than 0.78.
The degree of complementarity among GE, MDR, and PIA
As illustrated in Table 3 and Table 4, large candidate SNP size, low heritability value, and the presence of imbalanced class distribution together give the worst scenario for detecting SNPSNP interaction. Is there a way to increase the chance of success identification in such a situation? One solution is to combine different identification results produced by different algorithms, which extends the idea of ensemble method further. However, similar to the notion of diversity in ensemble classifier, the improvement can only come if the combined results are complementary to each other. Hence, the evaluation of the degree of complementarity among each pair of algorithms becomes indispensable.
We carried out a pairwise evaluation using Equations (13) and (14). Tables 5 and 6 give the results for balanced and imbalanced situations, respectively. We observed that higher degree of complementarity is generally associated with higher identification power. For the balanced datasets, the degree of complementarity of PIA and MDR is relatively low compared to those generated by GE and PIA, or GE and MDR. The results indicate that the GE algorithm, which tackles the problem from a different perspective, is useful in complementing methods like PIA and MDR in genegene interaction identification. As for the imbalanced datasets, the difference of the complementarity degree between each pair of algorithms is reduced. This suggests that more methods need to be combined for imbalanced datasets in order to improve identification power.
<p>Table 5</p>Functional SNP pair identification in balanced datasets by combining multiple algorithms
Dataset
(GE + PIA)
(GE + MDR)
(PIA + MDR)
(GE + PIA + MDR)
CD
Power_{J }(%)
CD
Power_{J }(%)
CD
Power_{J }(%)
Power_{J }(%)
balanced_200_0.2_20
1000
100
1.000
100
0.667
99
100
balanced _200_0.1_20
0.448
84
0.556
88
0.240
81
88
balanced_200_0.05_20
0.303
54
0.303
54
0.068
45
55
balanced_200_0.2_100
1.000
100
0.923
99
0.444
95
100
balanced_200_0.1_100
0.441
62
0.400
61
0.148
54
63
balanced_200_0.05_100
0.093
22
0.116
24
0.025
21
24
<p>Table 6</p>Functional SNP pair identification in imbalanced datasets by combining multiple algorithms
Dataset
(GE + PIA)
(GE + MDR)
(PIA + MDR)
(GE + PIA + MDR)
CD
Power_{J }(%)
CD
Power_{J }(%)
CD
Power_{J }(%)
Power_{J }(%)
imbalanced_200_0.2_20
0.714
96
0.818
98
0.750
97
99
imbalanced_200_0.1_20
0.567
71
0.481
73
0.475
68
76
imbalanced_200_0.05_20
0.286
40
0.301
42
0.287
38
47
The last columns of Tables 5 and 6 show the joint identification power of the three algorithms in analyzing balanced and imbalanced data. These results indicate a significant recovery of detection ability in functional SNP pair identification by applying three algorithms collaboratively. This is especially true when analyzing imbalanced datasets and the heritability of the underlying genetic model is low. For example, the average identification power of three algorithms for imbalanced datasets with heritability of 0.1 and 0.05 are 55.33% and 27.67%, respectively (Table 4). By combining the results of the three algorithms, we are able to increase the power to 76% and 47%, respectively, improving by around 20% (Figure 5).
<p>Figure 5</p>A comparison of identification power of GE, PIA, MDR, and combination of the three algorithms
A comparison of identification power of GE, PIA, MDR, and combination of the three algorithms. The name of each dataset denotes sample size, heritability, and the number of SNP (SNP size). (a) Identification power of each algorithm and their joint power using datasets with balanced class distribution. (b) Identification power of each algorithm and their joint power using datasets with imbalanced class distribution.
Realworld data application
As an example of realworld data application, we applied the GE algorithm, PIA and MDR, to analyze the complex disease of AMD. To reduce the combinatorial search space, we followed the twostep analysis approach 42 and used a SNP filtering procedure that is similar to the method described in 23 which can be summarized as follows:
• Excluding SNPs which have more than 20% missing genotype values of total samples;
• Calculating allelic χ^{2}statistics of each remaining SNP and keeping SNPs which have a pvalue smaller than 0.05 while discarding others. A number of 3583 SNPs passed filtering.
• Utilizing RTREE program 43 to select top splitting SNPs in AMD classification. Two SNPs with id of rs380390 and rs10272438 are selected.
• Utilizing Haploview program 44 to construct the Linkage Disequilibrium (LD) blocks around above two SNPs.
After the above processing steps, we obtained 17 SNPs from the two LD blocks. They are rs2019727, rs10489456, rs3753396, rs380390, rs2284664, and rs1329428 from the first block, and rs4723261, rs764127, rs10486519, rs964707, rs10254116, rs10486521, rs10272438, rs10486523, 10486524, rs10486525, and rs1420150 from the second block. Based on the previous investigation of AMD 454647, we added another six SNPs to avoid analysis bias. They are rs800292, rs1061170, rs1065489, rs1049024, rs2736911, and rs10490924. Moreover, environment factors of Smoking status and Sex are also included for potential environment interaction detection. Together, we formed a dataset with 25 factors for AMD association screening and genegene interaction identification.
Tables 7 and 8 illustrate the top 5 most frequently identified 2factor interactions and 3factor interactions, respectively. At the first glance, we see that the identification results given by different methods are quite different from one another. Considering the results of 2factor and 3factor interaction together, however, we find that two genegene interactions and a geneenvironment interaction are identified by all three methods. Specifically, the first genegene interaction is characterized by the SNPSNP interaction pair of rs10272438×rs380390. The first SNP is a A/G variant located in intron 5 of BBS9 located in 7p14, while the second SNP is a C/G variant located in intron 15 of CFH located in 1q32. The second frequently identified genegene interaction is characterized by the SNPSNP interaction pair of rs10490924×rs10272438. The first SNP in this interaction pair is a nonsynonymous coding SNP of Ser69Ala alteration located in exon 1 of ARMS2 located in 10q26. And the second SNP is again the A/G variant located in intron 5 of BBS9 located in 7p14. As to the geneenvironment interaction pair, it is characterized by rs10272438×Sex. This pair indicates that SNP factor of the A/G variant located in intron 5 of BBS9 in location of 7p14 is likely to associate with the disease differently between male and female.
<p>Table 7</p>Twofactor interaction candidates of AMD dataset
GE
CV Acc %
PIA
CV Acc %
MDR
CV Acc %
rs10272438×rs4723261
68.5
rs10272438×rs380390
64.2
rs10490924×rs1420150
65.5
rs10272438×rs2736911
66.9
rs10490924×rs10272438
68.2
rs10272438×rs1065489
68.4
rs10272438×rs964707
68.5
Y402H×rs10272438
65.5
rs10272438×rs2284664
66.7
rs10272438×Sex
67.5
rs10254116×Smoking
67.1
rs10272438×Sex
67.5
rs10272438×rs2284664
66.7
rs10490924×rs10254116
67.7
rs10254116×rs2736911
67.7
<p>Table 8</p>Threefactor interaction candidates of AMD dataset
GE
CV Acc %
PIA
CV Acc %
MDR
CV Acc %
rs10272438×rs4723261×rs964707
68.5
rs10272438×rs380390×rs10486524
59.8
rs10272438×rs380390×rs10486524
59.8
rs10272438×rs4723261×rs2736911
67.1
rs10272438×rs380390×Sex
61.2
rs10272438×rs380390×rs964707
63.4
rs10272438×rs380390×rs964707
63.4
rs10272438×Sex×rs1065489
68.1
Y402H×rs10272438×rs964707
60.7
rs10490924×rs10272438×rs4723261
65.0
rs10272438×rs380390×rs10254116
66.6
rs10490924×rs10272438×Sex
63.4
rs10272438×Sex×rs4723261
63.5
rs10272438×rs380390×rs1420150
59.4
rs10272438×Sex×rs2736911
65.7
We also test the association of Age factor with AMD by using Gaussian discretization to partition the age value of each sample into three categories as follows:
a
g
e
(
x
)
=
{
“
young
”
x
≤
μ
−
σ
/
2
“
medium
”
“
elderly
”
μ
−
σ
/
2
<
x
<
μ
+
σ
/
2
x
≥
μ
+
σ
/
2
where μ is the average age value and σ is the standard deviation of age values.
After including the Age factor to the dataset, all three algorithms identified the geneenvironment interaction of rs1420150×Age as the interaction with major implication, indicating Age factor is, expectedly, strongly associated with the development of AMD. The SNP that interacted with the Age factor is a C/G variant located in intron 9 of BBS9 located in 7p14.
Table 9 summarizes the factors involved in potential interactions identified by all of the three different algorithms. Overall, the experimental results suggest that genes of BBS9 (BardetBiedl syndrome 9), CFH (complement factor H), and ARMS2 (agerelated maculopathy susceptibility 2) with the external factors of Age and Sex, and the interactions among them are strongly associated with the development of AMD. This is essentially consistent with current knowledge of AMD development in the literature 2454647.
<p>Table 9</p>SNPs and environmental factors strongly associated with AMD
Factor
Chrom.
Gene
Location
Effect
Main effect pvalue
rs10272438
7p14
BBS9
intron 5
A/G
1.4 × 10^{6}
rs1420150
7p14
BBS9
intron 9
C/G
2.1 × 10^{2}
rs380390
1q32
CFH
intron 15
C/G
4.1 × 10^{8}
rs10490924
10q26
ARMS2
exon 1
Ser69Ala
1.8 × 10^{3}
Sex




1.4 × 10^{2}
Age




1.1 × 10^{3}
Discussion and conclusion
How multiple genes contribute to the development of complex diseases is an essential question for complex disease study. This is because a single gene often does not have the power to discriminate the status of the complex disease, and it is likely that multiple genes each with a weak or moderate effect together contribute to the development of complex disease. Although great effort has been devoted to characterizing such genegene interactions in complex disease analysis, the results remain unsatisfactory.
The advance of highthroughput genotyping technologies provides the opportunity to elucidate the mechanism of genegene and geneenvironment interaction via SNP markers. However, current algorithms have limited power in terms of identifying true SNPSNP interactions. Moreover, the simulation results indicate that the factors such as heritability, candidate SNP size, and the presence of imbalanced class distribution all have profound impact on a given algorithm's power in identifying functional SNP interactions. One practical way to improve the chance of identifying SNPSNP interactions is to combine different methods where each addresses the same problem from a different perspective. The rationale is that the consensus may increase the confidence of identifications and complementary results may improve the power of identification.
Due to these considerations, we proposed a hybrid algorithm using genetic ensemble approach. Using this approach, the problem of SNPSNP interaction is converted to a combinatorial feature selection problem. Our simulation study indicates that the proposed GE algorithm is comparable to PIA and MDR in terms of identifying genegene interaction for complex disease analysis. Furthermore, the experimental results demonstrate that the proposed algorithm has a high degree of complementarity to PIA and MDR, suggesting the combination of GE with PIA and MDR will likely lead to higher identification power.
For the practical application of the GE algorithm, the experimental results from the simulation datasets suggest that taking the topranked result generally gives a higher sensitivity of identifying SNPSNP interactions than using a frequency score cutoff. However, if the delectability of the SNPSNP interaction is low or no such interaction is present in the dataset, the topranked result is likely to be a false positive identification. A more conservative approach is to use an identification frequency cutoff of 0.750.8 which in our simulation study gives identification results with an FDR close to 0. For any identified SNP pair with an identification frequency higher than 0.8, the confidence is very high.
As a downstream analysis, we can fit the identified SNP pairs using logistic model with interaction terms and calculate the pvalue of its coefficient in order to quantify the strength of the interaction. In particular, to test additive and dominant effects, we can fit the reported SNP combinations using the model described by Cordell 12 and analyze the coefficients associated with additive and dominant effects of each SNP.
Current GWA studies commonly produce several hundreds of thousands of SNPs, yet the genegene interaction identification algorithms like MDR, PIA and the proposed GE algorithm can only cope with a relatively small number of SNPs in a combinatorial manner. Therefore, a filtering procedure is required to reduce the number of SNPs to a "workable" amount before those combinatorial methods can be applied to datasets generated by GWA studies 4849. More efforts are required to seamlessly connect these two components to maximize the chance of detecting complex interactions among multiple genes and environmental factors 42.
In conclusion, we proposed a GE algorithm for genegene and geneenvironment interaction identification. It is comparable to two other stateoftheart algorithms (PIA and MDR) in terms of SNPSNP interaction identification. The experimental results also demonstrated the effectiveness and the necessity of applying multiple methods each with different strengths to the genegene and geneenvironment interaction identification for complex disease analysis.