Abstract
Background
Purely epistatic multilocus interactions cannot generally be detected via singlelocus analysis in casecontrol studies of complex diseases. Recently, many twolocus and multilocus analysis techniques have been shown to be promising for the epistasis detection. However, exhaustive multilocus analysis requires prohibitively large computational efforts when problems involve largescale or genomewide data. Furthermore, there is no explicit proof that a combination of multiple twolocus analyses can lead to the correct identification of multilocus interactions.
Results
The proposed 2LOmb algorithm performs an omnibus permutation test on ensembles of twolocus analyses. The algorithm consists of four main steps: twolocus analysis, a permutation test, global pvalue determination and a progressive search for the best ensemble. 2LOmb is benchmarked against an exhaustive twolocus analysis technique, a set association approach, a correlationbased feature selection (CFS) technique and a tuned ReliefF (TuRF) technique. The simulation results indicate that 2LOmb produces a low falsepositive error. Moreover, 2LOmb has the best performance in terms of an ability to identify all causative single nucleotide polymorphisms (SNPs) and a low number of output SNPs in purely epistatic two, three and fourlocus interaction problems. The interaction models constructed from the 2LOmb outputs via a multifactor dimensionality reduction (MDR) method are also included for the confirmation of epistasis detection. 2LOmb is subsequently applied to a type 2 diabetes mellitus (T2D) data set, which is obtained as a part of the UK genomewide genetic epidemiology study by the Wellcome Trust Case Control Consortium (WTCCC). After primarily screening for SNPs that locate within or near 372 candidate genes and exhibit no marginal singlelocus effects, the T2D data set is reduced to 7,065 SNPs from 370 genes. The 2LOmb search in the reduced T2D data reveals that four intronic SNPs in PGM1 (phosphoglucomutase 1), two intronic SNPs in LMX1A (LIM homeobox transcription factor 1, alpha), two intronic SNPs in PARK2 (Parkinson disease (autosomal recessive, juvenile) 2, parkin) and three intronic SNPs in GYS2 (glycogen synthase 2 (liver)) are associated with the disease. The 2LOmb result suggests that there is no interaction between each pair of the identified genes that can be described by purely epistatic twolocus interaction models. Moreover, there are no interactions between these four genes that can be described by purely epistatic multilocus interaction models with marginal twolocus effects. The findings provide an alternative explanation for the aetiology of T2D in a UK population.
Conclusion
An omnibus permutation test on ensembles of twolocus analyses can detect purely epistatic multilocus interactions with marginal twolocus effects. The study also reveals that SNPs from largescale or genomewide casecontrol data which are discarded after singlelocus analysis detects no association can still be useful for genetic epidemiology studies.
Background
Complex diseases cannot generally be explained by Mendelian inheritance [1] because they are influenced by genegene and geneenvironment interactions. Many common diseases such as asthma, cancer, diabetes, hypertension and obesity are widely accepted and acknowledged to be results of complex interactions between multiple genetic factors [2]. Attempts to identify factors that could be the causes of complex diseases have led to many genomewide association studies [3,4]. Raw results from these attempts produce a large amount of single nucleotide polymorphism (SNP) data from every individual participating in the trials.
For genetic epidemiologists, data sets from genomewide association studies present many challenges, particularly the correct identification of SNPs that associate with the disease of interest from all available SNPs [5]. This challenge can be treated as a pattern recognition problem which aims to identify an attribute or SNP set that can lead to the correct classification of recruited samples. Heidema et al. [5] and Motsinger et al. [6] have reviewed and identified many machine learning techniques that are suitable to the task. Among many strategies and techniques, the protocol that appears to be most promising for genomewide association studies involves two main steps: SNP set reduction and classification model construction [7]. From a machine learning viewpoint, attribute selection techniques can be divided into three main categories: filter, wrapper and embedded approaches [8]. In a filter approach, a measure or an index is used to determine the correlation between attributes and classes, e.g. affected and unaffected status in a casecontrol study. Attributes that are deemed to be important for the classification according to the measure are then selected. The filter approach includes χ^{2 }and odds ratio tests [9,10], omnibus permutation tests [1113], a correlationbased feature selection technique [14], a ReliefF technique [15] and a tuned ReliefF technique [16]. In a wrapper approach, the significance of an attribute subset is evaluated from the classification performance by a classifier. The capability of the wrapper approach to identify significant attributes thus depends on the chosen classifier and the search algorithm for the identification of the best attribute subset. Combinatorial [17] and restricted partitioning methods [18], a multifactor dimensionality reduction method [1925] and a polymorphism interaction analysis technique [26] are examples of the wrapper approach. An embedded approach concentrates on informative attributes during the construction of a classification model. Examples of the embedded approach include a genetic algorithm with Boolean algebra [27], genetic programming based decision trees [28,29], random forests [3032] and evolutionary neural networks [33,34]. Based on this categorisation, classification models are not direct outputs from filterbased techniques. On the other hand, classification models are readily prepared as outputs from the wrapper and embedded approaches. In other words, the last two approaches can also be regarded as classification model construction techniques.
The success of the twostep pattern recognition approach relies heavily on the attribute selection step [14]. In casecontrol studies, epistatic effects play a vital role in establishing the difficulty level of SNP screening problems [35,36]. Epistasis in the simplest form can be represented by disease models that require genotype inputs from two interacting SNPs [37,38]. Many attempts have been made to produce consistent definitions and categorisation of different types of epistasis models [2,35,3941]. According to Musani et al. [2], a pure epistasis model [42] is difficult because each SNP exhibits no marginal singlelocus effect in the model. As a result, it is impossible to detect the pure epistasis by univariate statistical tests. Examples of complex diseases that casecontrol studies have uncovered putatively pure epistasis include type 2 diabetes mellitus (T2D) [4346] and metabolic syndrome [47]. Due to the difficulty of screening for each SNP independently, it is suggested that attention should be focused on the analysis of differences between twolocus genotype distribution within case and control groups [40] and multilocus Bayesian statistical analysis [48,49].
A number of SNP screening and association detection techniques have adopted the twolocus genotype monitoring strategy as their core engines [40,5052]. The search for interactions can be carried out via either exhaustive analysis [52] or the analysis that can be divided into two stages, incorporating singlelocus analysis for the prescreening purpose [40,50,51]. In the twostage mode, at least one SNP that involves in the construction of twolocus genotype unit must be a strong candidate for the association explanation, usually verified through univariate statistical tests. Each mode of the twolocus analysis possesses different strengths and weaknesses. The exhaustive analysis has a full capability of detecting pure epistasis but requires larger computational efforts [52]. In contrast, the twostage analysis is more practical for largescale data but with some risk of missing possible pure epistasis [50]. More practical usage of both twolocus analysis modes in real casecontrol studies is required before the feasibility issue can be fully addressed.
Many genetic association studies reveal that various complex diseases are results of putative multilocus interactions [11,46,53]. With the constraints on a computational capability, exhaustive multilocus analysis in largescale or genomewide association studies would be infeasible [52]. On the other hand, singlelocus analysis would be unsuitable for the detection of pure epistasis. One possible approach that provides a tradeoff between a computational limitation and an epistasis detection capability is to capture a multilocus interaction by combining multiple results from twolocus analysis. To achieve this, it is necessary to prove that once a multilocus interaction model is broken down into a combination of twolocus models, all or some of these models remain detectable through twolocus analysis. Although it is hinted in an early work on twolocus analysis [52] that the proposed approach is plausible, explicit experimentation and testing has never been conducted.
In this article, the feasibility of employing an ensemble of twolocus analyses for the multilocus interaction determination is demonstrated. Specifically, the significance of the twolocus analysis ensemble is assessed by an omnibus permutation test [54]. The proposed method is inspired by a set association approach [11], in which a limited number of sets that contain different numbers of SNPs are explored for possible association. These SNP sets are crucial in the global pvalue calculation of the selected set via a permutation test and thus the decision to accept or reject the null hypothesis of no association. In other words, SNP set exploration and selection is required to assess the significance of the identified association. This means that the set association approach is equally interested in both SNP set selection and testing for significant association. The primary function of the proposed method is to detect possible association and assess its significance through the exploration of different ensembles of twolocus analyses. Hence, the proposed method is also equally interested in both ensemble selection and testing for significant association.
The proposed method is benchmarked against a simple exhaustive twolocus analysis technique, the set association approach [11], the correlationbased feature selection technique [14] and the tuned ReliefF technique [16]. These filterbased attribute selection techniques are suitable for the benchmark trial since they are capable of detecting association. The casecontrol classification models constructed from screened SNPs via a multifactor dimensionality reduction method [19] are also provided.
Results and discussion
Algorithm
The proposed algorithm performs an omnibus permutation test on ensembles of twolocus analyses and is referred to as a 2LOmb technique. The algorithm consists of four steps as illustrated in Figure 1 and can be described as follows.
Figure 1. Outline of 2LOmb. In this example, the algorithm takes a balanced casecontrol data set that consists of 400 samples and 1,000 SNPs. Each genotype is represented by an integer: 0 denotes a homozygous wildtype genotype, 1 denotes a heterozygous genotype and 2 denotes a homozygous variant or homozygous mutant genotype. A χ^{2 }contingency table is then constructed for each pair of SNPs in twolocus analysis. This results in the total of = 499,500 twolocus analyses. Thus, the Bonferronicorrected χ^{2}'s pvalue for each twolocus analysis is the lower value between 499,500 × its uncorrected pvalue and one. In one ensemble, Bonferronicorrected χ^{2}'s pvalues from multiple twolocus analyses are combined together via a Fisher's combining function, which in turn provides a Fisher's test statistic result. The raw pvalue for the ensemble is obtained through a permutation test, which is composed of 10,000 randomised permutation replicates. Since multiple ensembles may be tried during the identification of the best association explanation, a global pvalue is calculated to account for multiple hypothesis testing. The global pvalue is estimated through the same permutation test that gives the raw pvalue for each ensemble. The progressive search for the best association explanation is carried out by incrementally adding a twoSNP unit to the current best ensemble. The condition for search termination is based on both the raw pvalue for the explored ensemble and the global pvalue. In this example, the search is terminated after the fourth ensemble is explored due to an increase in the raw pvalue. Subsequently, the best SNP set for association explanation contains SNP1, SNP2 and SNP3 where the global pvalue that accounts for testing of four hypotheses is p < 0.0001.
Twolocus analysis
Consider a casecontrol genetic association study with n_{m }SNPs, for each pair of SNPs, a 2 × 9 contingency table with rows for disease status and columns for genotype configurations is created. A χ^{2 }test statistic and the corresponding pvalue can subsequently be computed. With the total of n_{m }SNPs, there are = n_{m}!/((n_{m } 2)!2!) possible SNP pairs. As a result, the pvalue from each twolocus analysis must be adjusted by a Bonferroni correction. The Bonferronicorrected pvalue from each analysis is the lower value between × the uncorrected pvalue and one.
Permutation test
The pvalue for the null hypothesis that ensemble ean ensemble of twolocus analyses of interestis not associated with the disease can be evaluated by a permutation test. To achieve this, a scalar statistic is first computed from a function that combines the Bonferronicorrected χ^{2}'s pvalues of individual twolocus tests. A suitable combining function must (a) be nonincreasing in each pvalue, (b) attain its maximum value when any pvalue equals to zero and (c) have a finite critical value that is less than its maximum for any significant level greater than zero [54]. In this study, a Fisher's combining function (2∑_{i }log(p_{i})) is selected [55]. The pvalue for the ensemble of twolocus analyses is assessed via a permutation simulation. In each permutation replicate, samples are constructed such that the case/control status of each sample is randomly permuted while the total numbers of case and control samples remain unchanged. A χ^{2 }contingency table with new entries and a Bonferronicorrected pvalue for the twolocus analysis within each permutation replicate are then obtained. This, in turn, leads to a new Fisher's test statistic. Let denote the value of Fisher's test statistic obtained for the ith permutation replicate, is the fraction of permutation replicates with a test statistic greater than or equal to the test statistic obtained from the original casecontrol data (). In other words,
where t is the number of permutation replicates which is set to 10,000 in this study and · denotes the size of a set.
Global pvalue determination
There are many candidate ensembles of twolocus analyses that can be explored. Let be the global null hypothesis that none of E explored ensembles of twolocus analyses is associated with the disease, the test of the global null hypothesis leads to the global pvalue and provides the genetic association explanation. In step 2, the pvalue for a fixed hypothesis is a raw or unadjusted pvalue. To account for the correlation among multiple hypotheses that have been tested during the exploration through many candidate ensembles, the testing result of the global null hypothesis depends on . In other words, the global null hypothesis is rejected if the minimum of the raw pvalues is sufficiently small. The distribution of can again be determined by a permutation simulation. However, a nested simulation is unnecessary since the same set of permutation replicates for the determination can be reused in the estimation of the empirical distribution of [56]. This strategy has been successfully implemented in a number of genetic association detection techniques, including a set association approach [11] and a haplotype interaction approach embedded in FAMHAP [57,58]. The unadjusted pvalue for the permutation replicate i of each hypothesis e is thus given by
Let be the minimum of unadjusted pvalues over all explored ensembles of twolocus analyses in the ith permutation replicate, the pvalue for the global null hypothesis H_{0 }is defined by
Search for the best ensemble of twolocus analyses
A simple progressive search is used to identify the best ensemble of twolocus analyses. The search begins by locating the best twoSNP unit with the smallest Bonferronicorrected χ^{2}'s pvalue from step 1. A permutation test is then performed for this twolocus analysis, yielding both raw and global pvalues since only one hypothesis has been explored. Next, the search attempts to combine the existing best twoSNP unit with the twoSNP unit that possesses the next smallest Bonferronicorrected χ^{2}'s pvalue from step 1 and does not have a higher permutation pvalue than the first twoSNP unit. If this new ensemble yields either a higher raw pvalue or the same raw pvalue but a higher global pvalue from a permutation test, the search is terminated and the association is explained by the previously identified twolocus analysis. Otherwise, the best ensemble of twolocus analyses is updated and the process of appending more twoSNP units to the ensemble continues. The progressive search terminates when deterioration in the raw or global pvalue is detected, or all possible twolocus analyses have been included in the ensemble. It is recalled from step 3 that for the best ensemble containing E  1 < twolocus analyses, its global pvalue is obtained from the evaluation of E hypotheses.
Validity of the algorithm
A permutation replicate in 2LOmb is constructed by randomly assigning the case or control status to each sample while maintaining the original proportion of case and control samples. Once the construction of a permutation replicate is finished, the assigned case and control labels remain fixed to the samples. The pattern of case and control labels in each permutation replicate is thus constant and unique. Therefore, the Bonferronicorrected χ^{2}'s pvalues from any twoSNP units within a permutation replicate are calculated from the same casecontrol data set. Hence, the combining of these Bonferronicorrected χ^{2}'s pvalues via a Fisher's combining function is attainable. The calculation of Fisher's test statistics from all permutation replicates and the original data set leads to the raw or unadjusted pvalue for the null hypothesis of the ensemble e as given in equation 1. Since the same set of permutation replicates is always used during the evaluation of each ensemble, the raw pvalues for the null hypotheses from all ensembles can be directly compared against one another. Furthermore, the global pvalue calculation is based on this set of permutation replicates. This is possible because the unadjusted pvalue for the permutation replicate i of ensemble e or can be calculated in a similar manner to the raw pvalue as defined in equation 2. The unadjusted pvalues for the same permutation replicate but different ensembles can also be directly compared and the subsequent calculation of is attainable. With and , the pvalue for the global null hypothesis that incorporates all E explored hypotheses can be determined by equation 3. In summary, only one set of permutation replicates is required for the calculation of both the raw pvalue for the null hypothesis of every ensemble and the global pvalue. The pvalues can be compared in each step of 2LOmb. Consequently, the selection of the best ensemble for association explanation can be carried out via a pvalue comparison.
Testing with simulated data
2LOmb is benchmarked against a simple exhaustive twolocus analysis technique, a set association approach (SAA) [11], a correlationbased feature selection (CFS) technique [14] and a tuned ReliefF (TuRF) technique [16] in a simulation trial. The exhaustive twolocus analysis is simply the twolocus analysis procedure from the first step of the 2LOmb algorithm. An interaction is declared if at least one twoSNP unit with a Bonferronicorrected χ^{2}'s pvalue below 0.05 is detected. The exhaustive twolocus analysis reports all SNPs that meet this detection condition. The simulation covers two main data categories: null data of no significant genetic association and data with causative SNPs which signify pure epistasis. The algorithm performance on the null data provides an indication for the falsepositive error. On the other hand, the algorithm performance on the data with causative SNPs indicates the detection capability. An efficient algorithm should produce an output with a low number of SNPs and a high number of correctlyidentified causative SNPs when epistasis is present. Similarly, it should also report that there are no causative SNPs in the null data. These two measures on the number of SNPs in the results are used as the performance indicators.
Each simulated data set contains 1,000, 2,000 or 4,000 SNPs in which either there are no causative SNPs or there is pure epistasis, governed by two, three or four causative SNPs. The allele frequencies of all causative SNPs are 0.5 while the minor allele frequencies of the remaining SNPs are between 0.05 and 0.5. The data set consists of balanced casecontrol samples of sizes 400, 800 or 1,600. All SNPs in control samples are in HardyWeinberg equilibrium (HWE) [59]. The genotype distribution of causative interacting SNPs follows the pure epistasis model by Culverhouse et al. [42], leading to three interesting values of heritability: 0.01, 0.025 and 0.05. Every SNP in each data set exhibits no marginal singlelocus effect (Bonferronicorrected χ^{2}'s pvalue > 0.05). Twentyfive independent data sets for each simulation setting are generated via a genomeSIM package [60]. A paired ttest is suitable to assess the significance of results since the same simulated data sets are used during the algorithm benchmarking.
The results from the null data problem are summarised in Figure 2 while the results from the two, threeand fourlocus interaction problems are shown in Figures 34, 56 and 78, respectively. Clearly, 2LOmb significantly outperforms other techniques in terms of the low number of output SNPs, the high number of correctlyidentified causative SNPs or both in every interaction problem (a paired ttest on 675 benchmark results yields a pvalue < 0.05). On the other hand, both 2LOmb and SAA have the lowest falsepositive error when compared to other techniques in the null data problem (a paired ttest on 225 benchmark results yields a pvalue < 0.05). The statistical power analysis also reveals that the benchmark trial with 25 independent data sets for each simulation setting is sufficient for an accurate evaluation of the overall algorithm performance (power > 0.95 for a Type I error rate of 0.05). These results can be further interpreted as follows.
Figure 2. Performance of the exhaustive twolocus analysis, SAA, CFS, TuRF and 2LOmb in the null data problem. The results are averaged over 25 independent simulations. False detection is declared for the exhaustive twolocus analysis, SAA and 2LOmb if the pvalues used as detection indicators in their results are less than 0.05. The results from the exhaustive twolocus analysis (E2LA), SAA, CFS, TuRF and 2LOmb are displayed using magenta, blue, green, red and black markers, respectively. In each chart, the horizontal axis represents the detection algorithm while the vertical axis represents the number of output SNPs reported by the algorithm. The top nine charts are displayed using a finer scale than the bottom nine charts.
Figure 3. Performance of the exhaustive twolocus analysis and 2LOmb in the twolocus interaction problem. The results are averaged over 25 independent simulations. Detection is declared for the exhaustive twolocus analysis and 2LOmb if the pvalues used as detection indicators in their results are less than 0.05. The results from the exhaustive twolocus analysis (E2LA) and 2LOmb are displayed using magenta and black markers, respectively. In each chart, the horizontal axis represents the detection algorithm while the vertical axis represents the number of output SNPs reported by the algorithm. All causative SNPs are present in outputs from both the exhaustive twolocus analysis and 2LOmb in all simulations.
Figure 4. Performance of SAA, CFS, TuRF and 2LOmb in the twolocus interaction problem. The results are averaged over 25 independent simulations. Detection is declared for SAA and 2LOmb if the pvalues used as detection indicators in their results are less than 0.05. The results from SAA, CFS, TuRF and 2LOmb are displayed using blue, green, red and black markers, respectively. In each chart, the horizontal axis represents the number of correctlyidentified causative SNPs while the vertical axis represents the number of output SNPs reported by the algorithm. The charts on which the red markers are invisible denote the situations in which the performance of TuRF and 2LOmb is similar. The charts in this figure are displayed using a coarser scale than the charts in Figure 3.
Figure 5. Performance of the exhaustive twolocus analysis and 2LOmb in the threelocus interaction problem. The explanation for how the results are obtained and displayed is the same as that given in Figure 3.
Figure 6. Performance of SAA, CFS, TuRF and 2LOmb in the threelocus interaction problem. The explanation for how the results are obtained and displayed is the same as that given in Figure 4.
Figure 7. Performance of the exhaustive twolocus analysis and 2LOmb in the fourlocus interaction problem. The explanation for how the results are obtained and displayed is the same as that given in Figure 3.
Figure 8. Performance of SAA, CFS, TuRF and 2LOmb in the fourlocus interaction problem. The explanation for how the results are obtained and displayed is the same as that given in Figure 4.
The performance of many existing attribute selection techniques for pattern recognition depends on the level of attribute interactions. A number of techniques, including CFS, appear to function well under a moderate level of interactions. However, the performance of CFS appears to be significantly reduced when the interaction level becomes too high [14,61] because CFS favours an attribute that is strongly correlated with the classification outcomedisease status in this studywhile at the same time is not correlated with other attributes. Since the main driving force behind epistasis is the interaction between SNPs, which are themselves attributes, CFS would not intuitively select all causative SNPs. Consequently, the SNP set produced by CFS appears to contain only uncorrelated SNPs. Obviously, a SNP that is a part of the interaction model would occasionally be picked up by CFS but CFS never successfully identifies all causative SNPs in any interaction problems. In addition, CFS reports more erroneous SNPs than other techniques in the null data problem and all three interaction problems due to many SNPs being uncorrelated.
The benchmarking of attribute selection techniques by Hall and Holmes [14] also reveals that ReliefF [15] is better than CFS in problems with a high level of interactions. Since ReliefF is essentially the core engine of TuRF, the results from this study are in agreement with the early benchmark trial. This finding strengthens the observation that the interaction level of SNPs in pure epistasis models is too high for CFS to handle. Similar to its predecessor, the performance of TuRF still depends on both the number of attributes and sample size. TuRF performs well in the majority of simulation scenarios with 1,0002,000 SNPs and 8001,600 samples. These scenarios are relatively easy since the number of SNPs is small while the sample size is large. However, the size of output SNP set, reported by TuRF from the null data problem and all three interaction problems, increases significantly when the difficulty level rises by either reducing the sample size or increasing the number of SNPs. This implies that when the problem contains a large number of candidate SNPs, the only way to ensure that TuRF reports a proper SNP set is to use a relatively large sample size, making it impractical in real genetic association studies due to many factors including disease prevalence, population size and genotyping cost.
The global pvalues in most of the SAA results from the null data problem and all three interaction problems exceed 0.05, showing that SAA reports a low falsepositive result in the null data problem. Nonetheless, SAA remains unsuitable for detecting pure epistasis because of its high falsenegative error. This poor performance can be traced back to the manner in which SAA exploits an omnibus permutation test. As stated earlier, singlelocus analysis does not detect any association between a SNP and the disease in this study. Hoh et al. [11] have demonstrated that genetic association can be more significantly observed when the singlelocus test statistics are combined together. Nonetheless, there is an additional requirement that each causative SNP must exhibit a marginal singlelocus effect. In the current study, the association signal from each causative SNP is lower than the required threshold, leading to similar test statistics and global pvalues for both combinations of multiple SNPs which include causative SNPs and those which exclude causative SNPs.
Both 2LOmb and exhaustive twolocus analysis technique are capable of identifying all causative SNPs. However, the size of output SNP set from 2LOmb is significantly smaller than that from the exhaustive twolocus analysis. Appended SNPs to the causative SNPs in the output from 2LOmb and those from the exhaustive twolocus analysis are erroneous SNPs. These erroneous SNPs are parts of false twoSNP units with Bonferronicorrected χ^{2}'s pvalues less than 0.05. A similar trend of results regarding the size of output SNP set is also observed in the benchmark trial involving the application of 2LOmb and exhaustive twolocus analysis to the null data. This signifies that the permutation test and the progressive search embedded in 2LOmb can help reducing the number of erroneous SNPs in the output.
As mentioned earlier, 2LOmb produces the best results among five techniques in the benchmark trial. 2LOmb has a low falsepositive error in the null data problem and is capable of detecting all causative SNPs in every simulated data set in all three interaction problems. This performance is further strengthened by highly significant global pvalues in 2LOmb results from all three interaction problems (p < 0.0001) and the presence of a SNP in common among some or all pairs of twoSNP units in the three and fourlocus interaction problems. Nonetheless, some of the 2LOmb outputs contain a few erroneous SNPs which are irrelevant to the correct association explanation. Since all three interaction problems involving different numbers of causative SNPs are investigated by varying the total number of SNPs, the sample size and the level of heritability, these parameters may influence the number of erroneous SNPs in the 2LOmb results. Similarly, the total number of SNPs and the sample size may affect the number of erroneous SNPs in the 2LOmb results from the null data problem. ANOVA reveals that the only source of variation that significantly affects the number of erroneous SNPs in the null data, twolocus interaction and threelocus interaction problems is the sample size (p < 0.000001). In addition, the sample size must be greater than 800 for an increase in the number of erroneous SNPs to be significant. In contrast, ANOVA reveals that two sources of variation that affect the number of erroneous SNPs in the fourlocus interaction problem are the sample size (p < 0.000001) and the total number of SNPs (p < 0.00005). Similar to the null data, twolocus interaction and threelocus interaction problems, the sample size in the fourlocus interaction problem must be greater than 800 to create a significant increase in the number of erroneous SNPs. On the other hand, the number of erroneous SNPs appears to decrease when the total number of SNPs increases. These two sources of variation also interact with each other (p < 0.005). However, the interaction is most evident only when the sample size is large, i.e. when the sample size is 1,600.
ANOVA shows that the number of erroneous SNPs in the 2LOmb results is influenced by the sample size and the total number of SNPs but not by the heritability. It is observed that the number of erroneous SNPs increases when the sample size is large. This counterintuitive phenomenon can be explained as follows. As 2LOmb combines pvalues that are determined from χ^{2 }tests, the number of entries for the contingency table construction is large when the sample size is large. This subsequently leads to a significantly large χ^{2 }statistic and hence an extremely small pvalue if the SNPs under consideration are causative SNPs. At the same time, the possibility that a reasonably large χ^{2 }statistic and a small pvalue can be obtained by chance from a twoSNP unit which is irrelevant to the correct association explanation also inevitably increases. With the increase in the possibility of erroneous SNP inclusion, the size of output SNP set gets bigger when the sample size is large. Another observation that appears to be counterintuitive is the reduction in the number of erroneous SNPs when the total number of SNPs increases. This phenomenon is the result of the Bonferroni correction usage. When the total number of SNPs is doubled, the Bonferroni correction factor in 2LOmb is quadrupled. A higher correction factor leads to a more stringent criterion for SNP selection. This subsequently leads to the reduction in the number of erroneous SNPs when the total number of SNPs is large.
In contrast to the first two parameters, different levels of heritability appear to have no effect on the 2LOmb results because all simulated data sets have balanced casecontrol samples and the embedded interaction models have the same architecture. For instance, a twolocus interaction model leads to zero penetrances for genotypes AABB, AABb, AaBB, Aabb, aaBb and aabb. Hence, the penetrances for these six genotypes are always equal to zero regardless of the heritability. On the other hand, genotypes AAbb, AaBb and aaBB have nonzero penetrances (see Methods for details). Therefore, different heritability levels certainly lead to different penetrances for genotypes AAbb, AaBb and aaBB. However, the ratios between the penetrances of these three genotypes are fixed and independent of the heritability. This model description can be generalised to cover the other multilocus interaction models. In addition, the maximum penetrance in any twolocus or multilocus interaction models always stays below 0.1 even though the heritability is at the highest level (see Methods for details). This means that case samples are always oversampled from affected individuals to achieve a balanced casecontrol data set. Since all explored heritability levels lead to the same case oversampling pattern, the simulated data sets of which the only primary difference being the heritability levels are indistinguishable from one another. This leads to the result similarities in interaction problems with the same number of SNPs in the data set, sample size and number of causative SNPs but different levels of heritability as shown in Figures 3, 4, 5, 6, 7 and 8. The result trend is also independent of the number of simulated data sets used in the benchmark trial.
In a permutation test, the ability to differentiate between two pvalues is influenced by the number of permutation replicates. With t permutation replicates, the test declares an actual pvalue that is less than 1/t to be zero. During the progressive search for the best ensemble, the inclusion of a new twoSNP unit is accepted if this inclusion does not worsen the current result. If the number of permutation replicates is too low, the search may include erroneous twoSNP units that are irrelevant to the correct association explanation. The analysis is confirmed as the number of output SNPs from 2LOmb is equal to the number of causative SNPs in most of simulation results. This phenomenon suggests that the number of permutation replicates employed in this study (t = 10,000) is high enough to screen off most of the erroneous twoSNP units. In other words, the inclusion of these erroneous twoSNP units leads to an increase in the pvalue by at least 1/t. Nonetheless, the fact that 2LOmb results are not entirely free from erroneous SNPs suggests that there are erroneous twoSNP units with extremely small pvalues. It is advisable to perform a genotype relative risk calculation for the elimination of erroneous SNPs. If the presence of an erroneous twoSNP unit is suspected, its result on twolocus genotype relative risk would not be as significant as that from the other twoSNP units in the ensemble. Alternatively, an additional means for further SNP screening by other techniques such as MDR is also recommended. The chance of erroneous SNP discovery would be further minimised by employing two consecutive attribute selection techniques. The same concept has been adopted for the implementation of MDR software, in which many additional filters including a χ^{2 }test, an odds ratio test, ReliefF and TuRF are available for SNP screening prior to the MDR analysis.
The two, three and fourlocus interaction data sets which have been screened for causative SNPs by 2LOmb are subsequently subjected to MDR analysis. MDR has successfully identified all erroneous SNPs and the correct interaction models have been constructed from all data sets. The prediction accuracy from the MDR analysis is illustrated in Figure 9. It is noted that the prediction accuracy from all data sets is quite high due to the manner in which the pure epistasis model is defined [42]. Using the penetrance table for a twolocus interaction model with the heritability = 0.01 (see Methods for details), the twolocus genotype distribution of causative SNPs in a balanced casecontrol sample set from simulated data with 800 samples can be estimated and shown in Figure 10.
Figure 9. Prediction accuracy from the MDR analysis. A 10fold crossvalidation strategy is applied during the accuracy evaluation. The best MDR model is located by exploring all possible SNP combinations. All erroneous SNPs, which are left over after the screening by 2LOmb, have been successfully identified. All MDR models contain the correct number of causative SNPs. In addition, the MDR crossvalidation consistency is 10/10.
Figure 10. Genotype distribution of two causative SNPs in a balanced casecontrol data set with the sample size of 800. The left (black) bar in each cell represents the number of case samples while the right (white) bar represents the number of control samples. The cells with genotypes AABB, AABb, AaBB, Aabb, aaBb and aabb are labelled as protective genotypes while the cells with genotypes AAbb, AaBb and aaBB are labelled as diseasepredisposing genotypes.
Six genotypes in Figure 10 namely AABB, AABb, AaBB, Aabb, aaBb and aabb are protective genotypes. In other words, a sample with one of these six genotypes is a control sample because the penetrances for these genotypes are zero. It is also noted that the control samples with all nine genotypes precisely follow the distribution as jointly described by independent singlelocus genotype distribution from loci A and B. In contrast, three remaining genotypes in Figure 10 namely AAbb, AaBb and aaBB are labelled as diseasepredisposing genotypes because the majority of samples with these three genotypes are case samples. Samples with these genotypes may be either case or control samples because the penetrances for these genotypes are between zero and one. In fact, the probabilities that persons with these genotypes to have the disease are quite low since the penetrances for these genotypes are small. However, case samples must be oversampled from affected individuals to ensure a balanced casecontrol data set because the disease prevalence for this twolocus interaction model is only 0.004975. In addition, each case sample must contain one of these three genotypes because the penetrances for the other genotypes are zero. As a result, the case samples with these genotypes do not follow the same twolocus genotype distribution as in the control samples. With six genotypes being exclusively specific to control samples and the majority of three remaining genotypes being found in case samples, the MDR prediction accuracy for the twolocus interaction model is high. This explanation can also be generalised to cover the MDR results from the other multilocus interaction data sets.
Another advantage of using 2LOmb for SNP screening prior to the MDR analysis is the reduction in computational time for interaction detection. The computational time for 2LOmb to finish screening the SNPs is provided to demonstrate this strength of 2LOmb. Moreover, the computational time required to identify causative SNPs by the MDR analysis and that by the combined approach which involves SNP screening by 2LOmb and follows by the MDR analysis is given. The previouslydescribed simulated data sets with causative SNPs are used to produce the computational time results from the SNP screening by 2LOmb and the combined approach. All possible interaction models that can be constructed from the 2LOmb outputs are explored by MDR in the combined approach. On the other hand, the data sets for the direct MDR analysis are prepared by restricting the number of SNPs in each data set to 100. Only SNPs that are irrelevant to the correct association explanation are removed from the original simulated data sets. Furthermore, MDR only explores the interaction models that do not cover more than four SNPs in the data for this latter simulation setting. The summary of computational time required for the SNP screening by 2LOmb and that for both direct MDR and combined approaches to correctly identify all causative SNPs is given in Table 1. The maximum time required by 2LOmb to screen SNPs in the largest data set is 419 seconds or approximately seven minutes. Moreover, the combined 2LOmb and MDR approach discovers the correct causative SNPs much faster than MDR. This time reduction is achieved even though the problems have been simplified for the direct MDR analysis. A direct application of MDR to the original simulated data sets is certainly impractical.
Table 1. Computational time required by 2LOmb, a combined 2LOmb and MDR approach, and direct MDR analysis to detect interactions in simulated data sets with different sizes and different numbers of causative SNPs.
The simulated multilocus interaction problems in this article are based on the pure epistasis model by Culverhouse et al. [42]. It is possible to capture a number of multilocus interactions with marginal twolocus effects via a combination of twolocus analyses. However, there are many multilocus interaction scenarios without marginal twolocus effects. In such cases, 2LOmb and the exhaustive twolocus analysis technique are unable to detect interactions. Among the explored techniques, TuRF and MDR have a better chance of detection. Nonetheless, TuRF functions well only when the total number of SNPs in data is small and the sample size is large enough while the total number of SNPs in data affects the practicality of direct MDR analysis.
Every attribute selection technique has a limitation in terms of the maximum numbers of samples and attributes that it can handle. Singlelocus analysis techniques always have a higher limit than multilocus analysis techniques. Because attribute subset evaluation is usually integrated into multilocus analysis techniques, consequently the number of possible attribute subsets that can be explored is extremely large when the candidate attribute set is large. Together with a potentially large sample size, a higher computational requirement for multilocus analysis techniques is inevitable. As a result, the direct application of multilocus analysis techniques to a much larger data set than those presented in this article, which is usually considered in genomewide association studies, would be impractical. However, it is reasonable to expect that both marginal singlelocus and epistatic effects are present in any genomewide data sets. A multistage strategy that incorporates multiple techniques, designed for different detection modes, would be more suitable to handle large data. For instance, the marginal singlelocus effects should be the first priority and, as such, be detected by singlelocus analysis. Then, a special case of pure epistasis [2] or semipurely epistatic events, in which a SNP displaying a marginal singlelocus effect interacts with a SNP that exhibits no marginal singlelocus effect, should be considered. Many twolocus analysis techniques have been proven to be well suited to this type of epistasis [40,50,51]. Finally, the detection of pure epistasis is carried out in the last stage. With the reduction of SNPs from the first stage, the chance that some multilocus analysis techniques are applicable to the remaining SNPs increases. In addition to the multistage approach, a prior knowledge regarding the previously reported association can be exploited to select candidate genes based upon ontology and pathways. This practice is due to the necessity for the derivation of plausible interpretation. The screening for SNPs within or near candidate genes before the association detection also increases the chance that multilocus analysis techniques can be applied to the remaining data.
Testing with real data
2LOmb has been applied to study a type 2 diabetes mellitus (T2D) data set, collected and investigated by the Wellcome Trust Case Control Consortium (WTCCC) [3]. The data set consists of 1,999 case samples from affected individuals in the UK and 3,004 control samples, which are the results of a merging between 1,500 samples from the UK blood services and 1,504 samples from the 1958 British birth cohort. The original genomewide data set contains 500,568 SNPs that are obtained through the Affymetrix GeneChip 500 K Mapping Array Set. The SNP set is primarily reduced by screening for SNPs within and near 372 candidate genes collected by the Human Genome Epidemiology Network (HuGENet) [62]. These candidate genes cover genes from both positive and negative genetic association reports, in which studies are conducted in various ethnic groups and populations. The SNP set is further reduced by removing SNPs that exhibit strong evidence of genetic association via singlelocus analysis. The final SNP set contains 7,065 SNPs from 370 candidate genes. All SNPs in the reduced data set exhibit no marginal singlelocus effects (Bonferronicorrected χ^{2}'s pvalue > 0.05). Detailed description of the final SNP set is given in the supplement (see Additional file 1).
Additional file 1. List of SNPs for the association study of T2D. This Excel spreadsheet file contains the information about 7,065 SNPs which are explored during the genetic association study of T2D. Bonferronicorrected and uncorrected χ^{2}'s pvalues from singlelocus analyses are also provided.
Format: XLS Size: 847KB Download file
This file can be viewed with: Microsoft Excel Viewer
The 2LOmb search in the reduced T2D data set takes 3,456 seconds (57.6 minutes) of computational time on the Beowulf cluster. The possible genetic association is detected from 11 intronic SNPs in four genes (global pvalue < 0.0001). Details of these SNPs, the twoSNP units that exhibit marginal twolocus effects and the identified genes are given in Table 2. A twoSNP unit is located in LMX1A. A twoSNP unit is also detected in PARK2. In addition, there is one SNP in common among SNPs in both GYS2 twoSNP units. Similarly, there is one common SNP among three twoSNP units located in PGM1. Nonetheless, a twoSNP unit in which each SNP is located in a different gene is absent, indicating that there is no evidence of genegene interactions which can be observed from the 2LOmb result. Linkage disequilibrium (LD) analysis is subsequently performed using a JLIN package [63] and the resulting LD patterns are illustrated in Figure 11. It is noted that there is strong LD among SNPs within each gene due to high values of D' [64] and r^{2 }[65]. The genotype and haplotype relative risks are then calculated and the results are presented in Tables 3, 4, 5, 6, 7, 8, 9 and 10. Haplotype inference is carried out using an expectationmaximisation method [66]. The analysis reveals that a more prominent indication of a relative risk is observed when twoSNP units are considered. It is also noted that the genotype relative risk is directly influenced by the haplotype relative risk once a genotype is phased into all possible haplotype pairs. The detection of these twoSNP units is thus believed to be the consequence of haplotype effects. An early T2D association study also reveals similar haplotype effects in FUSION data [67]. Next, an interaction dendrogram [68,69] constructed from the 11 SNPs by MDR software is given in Figure 12. A strong synergistic effect between the two SNPs in PARK2 is clearly observed. In contrast, the interactions between PGM1, LMX1A, PARK2 and GYS2 are clearly absent.
Figure 11. Linkage disequilibrium (LD) patterns of SNPs in PGM1, LMX1A, PARK2 and GYS2. LD is explained via D' displayed in the upper triangle and r^{2 }displayed in the lower triangle. Dark colours indicate high values while pale colours indicate low values. Distances between SNPs are given in terms of the number of base pairs. SNP1 = rs2269241, SNP2 = rs2269239, SNP3 = rs3790857, SNP4 = rs2269238, SNP5 = rs2348250, SNP6 = rs6702087, SNP7 = rs1893551, SNP8 = rs6924502, SNP9 = rs6487236, SNP10 = rs1871142 and SNP11 = rs10770836.
Figure 12. Interaction dendrogram produced from 11 SNPs that are chosen by 2LOmb. The colours in the dendrogram comprise a spectrum of colours representing a transition from synergy to redundancy. Synergy denotes the situation in which the entropybased interaction between two SNPs provides more information than the entropybased correlation between the pair. Redundancy refers to the situation in which the entropybased interaction between two SNPs provides less information than the entropybased correlation between the pair [7].
Table 2. 2LOmb identifies 11 intronic SNPs, which are located in four genes, from the reduced T2D data.
Table 3. Genotype relative risk evaluated from genotype distribution of SNPs in PGM1.
Table 4. Haplotype relative risk evaluated from genotype distribution of SNPs in PGM1.
Table 5. Genotype relative risk evaluated from genotype distribution of SNPs in LMX1A.
Table 6. Haplotype relative risk evaluated from genotype distribution of SNPs in LMX1A.
Table 7. Genotype relative risk evaluated from genotype distribution of SNPs in PARK2.
Table 8. Haplotype relative risk evaluated from genotype distribution of SNPs in PARK2.
Table 9. Genotype relative risk evaluated from genotype distribution of SNPs in GYS2.
Table 10. Haplotype relative risk evaluated from genotype distribution of SNPs in GYS2.
Since many early genetic association studies of T2D and metabolic syndrome employ MDR analysis [4345,47], additional MDR analysis would be useful for the comparison. The screened T2D casecontrol data set which contains 11 SNPs identified by 2LOmb is further subjected to MDR analysis. The prediction accuracy of the best MDR model is summarised in Table 11. The model covers six SNPs in three genes: PGM1, PARK2 and GYS2. These SNPs are also present in three twoSNP units identified by 2LOmb. It is noted that the prediction accuracy in this real data set is much less than that from the simulated data sets. Nevertheless, the attainment of low prediction accuracy does not necessarily suggest that there is no genetic association. Early works involving genetic association studies of T2D and metabolic syndrome in various populations via MDR analysis produce similar values of prediction accuracy as summarised in Table 12. The prediction accuracy by MDR from most studies is in the range of 0.50.6. The only genetic association study of T2D that the prediction accuracy is distinctively high is conducted in a Korean population [43]. Differences in genetic background, candidate genes and selected SNPs are the main causes of variation in the genetic association results. Although MDR does not select five SNPs from the 2LOmb output, these SNPs should not be regarded as erroneous SNPs because there is strong linkage disequilibrium among SNPs in each gene. Moreover, early genotype and haplotype relative risk analysis clearly indicates that each gene, identified by 2LOmb, plays a role in the T2D association explanation. Overall, the analysis with the methods above only confirms the positive association for PGM1, LMX1A, PARK2 and GYS2 while genegene interactions are clearly absent. This signifies that, for the current study, there is no interaction between each pair of the identified genes that can be described by purely epistatic twolocus interaction models. In addition, there are no interactions between these four genes that can be described by purely epistatic multilocus interaction models with marginal twolocus effects.
Table 11. Prediction accuracy of the best MDR model constructed from the 2LOmb output.
Table 12. Summary of prediction accuracy by MDR from early genetic association studies of T2D in a Korean population, a Han Chinese population from Taiwan, a female population from the US, and that from an early genetic association study of metabolic syndrome in an Italian population from the Centre East Coast Italy.
The four genes selected by 2LOmb regulate many pathways that involve in the disease development [7072]. The genetic association studies involving these genes have been previously conducted in different populations. For instance, LMX1A has been chosen as a positional and biological candidate gene for a casecontrol study of T2D in Pima Indians [73]. This gene is chosen as a candidate because a linkage of T2D to chromosome 1q21q23 has been previously reported [74]. In addition, LMX1A is one of LIM homeobox genes that are expressed in pancreas and has been shown to activate insulin gene transcription. Although SNPs have been carefully selected from the entire gene, no association between these SNPs in LMX1A and T2D has been found in this ethnic group.
PARK2 is another candidate gene that is also selected for casecontrol studies, based on evidence from genomewide linkage analysis [75]. A linkage of T2D in an African American population to chromosome 6q24q27 has been previously identified [76]. Although PARK2 mainly involves in the development of Parkinson's disease, singlelocus analysis reveals strong evidence of association between SNPs, which are in the vicinity of SNPs identified by 2LOmb, and T2D in African Americans.
In contrast to LMX1A and PARK2, which are candidate genes in typical T2D casecontrol studies, GYS2 is considered in a study to identify genes responsible for troglitazoneassociated hepatotoxicity in Japaneses with T2D [77]. In other words, both case and control samples in the study are drawn from troglitazonetreated T2D patients, in which case patients exhibit an abnormal increase in alanine transaminase (ALT) and aspartate transaminase (AST) levels. GYS2 regulates starch and sucrose metabolism and an insulin signalling pathway. The selected SNPs in GYS2 are not found to associate with troglitazoneinduced hepatotoxicity.
Similar to the study of GYS2, the association study involving PGM1 is not carried out as a typical T2D casecontrol study. In fact, an attempt to identify association between PGM1 polymorphisms and obesity has been conducted among T2D affected individuals in Italy [78]. PGM1 regulates glycolysis and gluconeogenesis, starch and sucrose metabolism, galactose metabolism, a pentose phosphate pathway, and streptomycin biosynthesis. Isozyme polymorphisms [79,80], which are defined through structural differences in PGM1 protein, are used instead of SNPs in the study where positive association is identified.
In summary, positive association has been reported from previous studies involving PARK2 in African Americans and PGM1 in Italians. In contrast, negative association has been reported from previous studies about LMX1A in Pima Indians and GYS2 in Japaneses. Both GYS2 and PGM1 regulate starch and sucrose metabolism while LMX1A and PARK2 govern insulin gene transcription and Parkinson's disease development, respectively. The above discussion strengthens the importance of conducting largescale association studies due to two main reasons. Firstly, a gene that does not contribute to the aetiology of a complex disease in one population may be important for association explanation in another population. Secondly, the absence of interacting candidate genes from a study may lead to negative association due to a lack of necessary genetic information. A twolocus interaction can occur between SNPs from genes that regulate one specific pathway [44] or between SNPs from genes that regulate different pathways [45]. Furthermore, a multilocus interaction may involve both SNPs from genes that regulate the same pathway and SNPs from genes that govern different pathways. Hence, candidate genes should be selected by considering all pathways that directly and indirectly contribute to the disease development.
This study produces evidence of association between 11 intronic SNPs in PGM1, LMX1A, PARK2 and GYS2, and T2D in a UK population. Although there are other independent genomewide T2D data sets, the association detection within these data using a similar methodology to the presented method has never been attempted because the methodology employed in the majority of genomewide association studies is based on singlelocus analysis [3,81]. It is recalled that each SNP explored in the reduced T2D data set exhibits no marginal singlelocus effect. Hence, the most logical approach to confirm the possibility of replicating association results from the current study is to perform the same detection method on these independent data sets. This is certainly important to gain further understanding of the genetic role in T2D susceptibility.
Implementation
2LOmb is implemented in a C programming language. All functions within the program are written by the first author except the χ^{2 }distribution function, which is taken from the Numerical Recipes in C [82]. The program can be compiled by Microsoft Visual Studio and GNU C compilers. The program has been successfully tested for the execution under Windows and Linux operating systems. The time required by 2LOmb to complete a problem containing n attributes is T (n) = = n!/((n  2)!2!) = n(n  1)/2. 2LOmb thus has the order of n^{2 }time complexity (T (n) ∈ O(n^{2})). Consequently, 2LOmb can tackle problems in quadratic time. 2LOmb in its present form occupies one processor during the program execution. A parallel version of 2LOmb for genomewide data is under development. All results included in the study are collected from the execution of computer programs in a Beowulf cluster. The computational platform consists of 12 nodes. Each node is equipped with dual Xeon 2.8 GHz processors and 4GB of main memory. The Rocks Cluster Distribution is installed on all nodes.
Conclusion
In this article, a method for detecting epistatic multilocus interactions in casecontrol data is presented. The study focuses on pure epistasis [2], which cannot be detected via singlelocus analysis [42]. To overcome this difficulty, the proposed method performs an omnibus permutation test [54] on ensembles of twolocus analyses and is thus referred to as 2LOmb. The detection performance of 2LOmb is evaluated using both simulated and real data. From the simulation, 2LOmb produces a low falsepositive error when the tests on null data of no association are performed. Furthermore, 2LOmb can identify all causative SNPs and outperforms a simple exhaustive twolocus analysis technique, a set association approach (SAA) [11], a correlationbased feature selection (CFS) technique [14] and a tuned ReliefF (TuRF) technique [16] in various interaction scenarios with marginal twolocus effects. These scenarios are set up by varying the number of causative SNPs, the number of SNPs in data, the sample size and the heritability. ANOVA reveals that the number of SNPs in data and the sample size influence the number of erroneous SNPs appended to the correctlyidentified causative SNPs in the 2LOmb output. In contrast, the results from 2LOmb appear to be insensitive to the variation in heritability. After subjecting the data sets containing only SNPs that are screened by 2LOmb to multifactor dimensionality reduction (MDR) analysis [19], all erroneous SNPs are successfully removed. In addition, an insight into the MDR models is provided. 2LOmb is subsequently applied to a real casecontrol type 2 diabetes mellitus (T2D) data set, which is collected from a UK population by the Wellcome Trust Case Control Consortium (WTCCC) [3]. The original genomewide data set is first reduced by selecting only SNPs that locate within or near 372 candidate genes reported by the Human Genome Epidemiology Network (HuGENet) [62]. In addition, the selected SNPs must exhibit no marginal singlelocus effects. The final data set, which consists of 1,999 case samples and 3,004 control samples, contains 7,065 SNPs from 370 candidate genes. 2LOmb identifies 11 intronic SNPs that are associated with the disease. These SNPs are located in PGM1, LMX1A, PARK2 and GYS2. The 2LOmb result suggests that there is no interaction between each pair of the identified genes that can be described by purely epistatic twolocus interaction models. Moreover, there are no interactions between these four genes that can be described by purely epistatic multilocus interaction models with marginal twolocus effects. This evidence of genetic association for these four genes leads to an alternative explanation for the aetiology of T2D in the UK population. It also implies that SNPs from genomewide data which are usually discarded after singlelocus analysis confirms the null hypothesis of no association can still be useful for genetic association studies of complex diseases.
Methods
Pure epistasis model
The pure epistasis model of interest is proposed by Culverhouse et al. [42]. The model describes a restriction or constraint for penetrance of each genotype constituting the interaction model. Consider a twolocus model that captures an interaction between loci A and B, let A and a be the major (common) and minor (rare) alleles at locus A. Similarly, let B and b be the major and minor alleles at locus B. At each locus, the genotype is represented by characters 0, 1 or 2 where 0 denotes a homozygous wildtype genotype (AA and BB), 1 denotes a heterozygous genotype (Aa and Bb) and 2 denotes a homozygous variant or homozygous mutant genotype (aa and bb). f_{ij }∈ [0, 1] is defined as the disease penetrance of the twolocus genotype ij that consists of genotype i at locus A and genotype j at locus B. The marginal penetrances M_{Ai }for genotype i at locus A and M_{Bj }for genotype j at locus B are given by
and
where p_{A }and p_{B }are the major allele frequencies. Equations 4 and 5 are usually represented by a penetrance table as illustrated in Table 13. The twolocus interaction model is a pure epistasis model if
Table 13. Penetrances for a twolocus interaction model.
where K is the disease prevalence. Obviously, many combinations of penetrance f_{ij }satisfy the condition given in equation 6. Culverhouse et al. [42] suggest that a pure epistasis model with the maximum heritability is particularly useful in association studies. The heritability (h^{2}) of the twolocus interaction model is defined by
where V_{T }= K(1  K) is the total variance of the dichotomous phenotypes in the population and V_{I }is the epistatic variation attributable to the genotypes. V_{I }is defined by
The search for feasible penetrance f_{ij }that also maximises the heritability or other variancebased objectives can be treated as a constraint optimisation problem. Many algorithms including a double description method [42] and a genetic algorithm [83] have been proven to be suitable for the task.
Culverhouse et al. [42] have identified the maximum heritability of purely epistatic twolocus and multilocus interaction models for various values of disease prevalence. For instance, the maximum heritability of a twolocus interaction model for p_{A }= p_{B }= 0.5 with the penetrances in Table 14 is
Table 14. Twolocus penetrances that lead to the maximum heritability (K) = 2K/(1  K) for K ∈ (0, 1/4].
When a twolocus interaction model is expanded into a multilocus interaction model, the marginal penetrance equality constraint must be extended to cover all loci. Furthermore, the expression for V_{I }must also be expanded to cover additional genotypes while the expression for V_{T }remains unchanged. With the necessary model expansion, the maximum heritability of a threelocus interaction model for p_{A }= p_{B }= p_{C }= 0.5 with the penetrances in Table 15 is given by
Table 15. Threelocus penetrances that lead to the maximum heritability (K) = 9K/(1  K) for K ∈ (0, 1/16].
Similarly, the maximum heritability of a fourlocus interaction model for p_{A }= p_{B }= p_{C }= p_{D }= 0.5 with the penetrances in Table 16 is
Table 16. Fourlocus penetrances that lead to the maximum heritability (K) = 35K/(1  K) for K ∈ (0, 1/64].
Additional details about the maximum heritability and the corresponding twolocus and multilocus penetrance tables for other values of disease prevalence can be found in Culverhouse et al. [42]. In this article, the simulated data sets are generated to achieve the maximum heritability of 0.01, 0.025 and 0.05. The values of disease prevalence that lead to the target heritability for two, three and fourlocus interaction models are given in Table 17.
Table 17. Disease prevalence that gives the target maximum heritability of 0.01, 0.025 and 0.05 for two, three and fourlocus interaction models.
genomeSIM
genomeSIM is a simulation package for generating casecontrol samples in largescale and genomewide association studies [60]. The package is capable of producing many realistic scenarios, which can be observed in a population and genetic samples, including linkage disequilibrium, phenocopy and genotyping errors. The case/control status of each sample is determined from the penetrancebased genetic models or interaction models. As a result, the package can accommodate many epistasis models including the one proposed by Culverhouse et al. [42]. A data set can be produced via two modes: a populationbased simulation and a probabilitybased simulation. In the populationbased simulation, an initial population is generated according to the predefined allele frequency of each SNP. Then further generations are created by crossing the genotype strings within successive generations until the specified number of generations is reached. The resulting data set contains a populationdependent case and control samples that follow a forwardtime simulation strategy. In contrast, genotype strings are incrementally generated without any string crossing for only one generation in the probabilitybased simulation. The creation of new strings is terminated only when the desired numbers of case and control samples are obtained. In this study, the probabilitybased simulation is used to produce all case and control samples where the simulation parameter setting is given in the supplement (see Additional file 2). genomeSIM is available upon request to Scott M. Dudek at the Vanderbilt University dudek@chgr.mc.vanderbilt.edu.
Additional file 2. genomeSIM parameters. This text file contains an example of parameter setting in the genomeSIM simulation package.
Format: TXT Size: 2KB Download file
Set association approach
A set association approach (SAA) is an association detection technique based on an omnibus permutation test on sets of candidate SNPs [11]. The test captures information about genotyping errors, deviation from HardyWeinberg equilibrium (HWE) and allelic association. In the first step, the genotype distribution for each SNP in the control samples is checked for HWE. Then, the number of SNPs that is to be excluded from the study (n_{d}) is set to the number of SNPs in the control samples that deviate from HWE. Two test statistics are subsequently calculated for each SNP: an allelic association statistic and a statistic for the deviation from HWE of each SNP in the case samples. The allelic association statistic is a χ^{2 }statistic which is calculated from the contingency table of alleles or genotypes with disease status. On the other hand, a χ^{2 }statistic for the deviation from HWE of each SNP in the case samples indicates the level of association. A large deviation from the equilibrium usually signifies strong association between a SNP and the disease. However, an excessively large deviation may be the result of genotyping errors. n_{d }SNPs with largest test statistics for the deviation from HWE are hence excluded from the consideration.
The test statistics for the allelic association and deviation from HWE are multiplied together to form a single S statistic for each remaining SNP. SNPs are then ranked according to their S statistics. A preset number of SNPs with highest ranks are considered for association. The first candidate SNP set contains only the SNP with the highest rank (the highest S statistic). The pvalue for this first set is determined from a permutation simulation where the case and control labels are randomly permuted while the numbers of case and control samples remain unchanged. In each permutation replicate, a new genotype contingency table is constructed and a new S statistic is subsequently obtained. The pvalue is given by the fraction of permutation replicates with an S statistic greater than or equal to the S statistic from the original data. The second candidate SNP set consists of the first two SNPs in the rank list. The test statistic for this SNP set is the sum of S statistics from both SNPs. The pvalue for the second candidate SNP set is also obtained through the permutation simulation. By progressively adding the remaining SNP with the highest rank to the previously considered candidate set and performing the permutation simulation, pvalues for all candidate SNP sets are estimated. The sizes of candidate SNP sets have the range of one to the preset number. Among all candidate sets, the SNP set that best describes genetic association has the lowest pvalue.
Since multiple hypotheses are postulated during the construction of candidate SNP sets, the global pvalue for the selected candidate set must be evaluated. This is achieved through a permutation simulation in which the current raw pvalue for the chosen candidate set is now used as the test statistic. The existing permutation replicates, created for the early estimation of the raw pvalue, can be reused and a nested permutation simulation is hence avoided. In this study, the maximum allowable size of the candidate SNP set is the total number of available SNPs while the number of permutation replicates for pvalue evaluation is set to 10,000. The allelic association statistic employed in the study is the χ^{2 }statistic that is obtained through the contingency table of genotypes with disease status. A PASCAL program for the set association approach can be obtained from the website for S statistic in gene mapping [84].
Correlationbased feature selection technique
A correlationbased feature selection (CFS) technique [14] is an attribute (SNP) subset evaluation heuristic that considers both the usefulness of individual features (SNPs) in the (casecontrol) classification task and the level of intercorrelation among features. Each attribute subset is assigned a score given by
where Merit_{F }is the heuristic merit of an n_{c}attribute subset F, _{cf }is the average featureclass correlation and _{ff }is the average featurefeature intercorrelation. An attribute subset receives a high merit score if it contains features that are highly correlated with the class and at the same time have low intercorrelation among one another. An application of a bestfirst search for the best subset identification is carried out to avoid searching through all possible attribute subsets. CFS has been integrated into a Weka package [85,86].
Tuned ReliefF
A tuned ReliefF (TuRF) algorithm is a ranking algorithm for identifying genetic markers which are important in casecontrol classification [16]. TuRF is built on a ReliefF engine [15]. ReliefF randomly picks a sample from the (casecontrol) data and identifies its n_{k }nearest neighbours from the same class and another n_{k }nearest neighbours from the opposite class. The attribute valuesthe genotypes in this applicationof the neighbour samples are compared to that of the randomly picked sample and are subsequently used to update the relevance score for each attribute (genetic marker). This process is repeated for a specified number of samples, which is limited by the total sample size. The rationale of ReliefF is that an attribute which is important for the classification should have different values for samples from different classes and have the same value for samples from the same class. The relevance score of an attribute have a range from 1 (not relevant) to +1 (highly relevant). TuRF exploits the capability of ReliefF by repeatedly executing ReliefF and removing a portion of worst attributes at the end of each execution. This leads to the reevaluation of remaining attributes and, hence, reduces the effects of attribute noise on the attribute screening. In this study, the number of repetitions for random sample picking in the ReliefF part is equal to the total number of casecontrol samples while the neighbourhood size (n_{k}) for the relevance score calculation is set to ten. Furthermore, the worst 1% of SNPs is removed at the end of each ReliefF iteration (TuRF 1%). TuRF has been integrated into the current distribution of multifactor dimensionality reduction (MDR) software.
Multifactor dimensionality reduction
A multifactor dimensionality reduction (MDR) method is a wrapperbased technique that is capable of identifying the best genetic marker combination among possible markers for the separation between case and control samples [19]. Similar to other wrapperbased methods, an n_{f }fold crossvalidation technique provides a means to determine the prediction accuracy of the candidate marker model. Basically, the combined case and control samples are randomly divided into n_{f }folds where n_{f } 1 folds of samples are used to construct a decision table while the remaining fold of samples is used to identify the prediction capability of the constructed decision table. The decision table construction and testing procedure is repeated n_{f }times. Hence, the samples in each fold are always used both to construct and to test the decision table. The number of cells in a decision table is given by where n_{c }is the number of candidate markers selected from possible markers and G is the number of possible genotypes according to the marker. For a SNP, which is a biallelic marker, G is equal to three. During the decision table construction, each cell in the table is filled with case and control samples that have their genotype corresponds to the cell label. The ratio between numbers of case and control samples provides the decision for each cell whether the corresponding genotype is a protective or diseasepredisposing genotype. An example of decision table construction is illustrated in Figure 13.
Figure 13. An MDR decision table that is constructed using a balanced casecontrol data set with the sample size of 800. The genotype of each sample is determined from two SNPs. The table consists of nine cells where each cell represents a unique genotype. The left (black) bar in each cell represents the number of case samples while the right (white) bar represents the number of control samples. The cells with genotypes AABB, AABb, AAbb, AaBB and aaBB are labelled as protective genotypes while the cells with genotypes AaBb, Aabb, aaBb and aabb are labelled as diseasepredisposing genotypes.
The prediction accuracy of the decision table is subsequently evaluated by counting the numbers of case and control samples in the testing fold that their disease status can correctly be identified using the constructed decision rules. The process of decision table construction and evaluation must be cycled through all or some of possible  1 combinations where n_{m }is the total number of available markers in the study. The best genetic marker combination is determined from two criteria: prediction accuracy and crossvalidation consistency. Each time that a testing fold is used for the prediction accuracy determination, the accuracy of the interesting marker combination model is compared with that from other models that also contain the same number of markers. The model that consistently ranks the first in comparison to other choices with the same number of markers has high crossvalidation consistency. Prediction accuracy is the main criterion for decision making while crossvalidation consistency is only used as an auxiliary measure. Crossvalidation consistency generally confirms that the high rank model can consistently be identified regardless of how the samples are divided for crossvalidation. In a situation where two or more models with different number of markers are equally good in terms of prediction accuracy and crossvalidation consistency, the most parsimonious modelthe combination with the least number of markersis chosen as the best model.
After the best model has been selected, a permutation test is used to assess the probability of obtaining prediction accuracy that is at least as large as or larger than that observed in the original data from randomised data. This represents the probability that the null hypothesis of no association is true. Each permutation replicate is constructed by randomly assigning the case/control status to each sample with the numbers of case and control samples remaining fixed. MDR analysis is subsequently carried out to obtain the prediction accuracy of each permutation replicate. The empirical pvalue is denoted by the fraction of permutation replicates with the prediction accuracy greater than or equal to the prediction accuracy obtained from the original data. MDR software, which incorporates many additional features including interaction visualisation via dendrograms and genetic marker screening via a χ^{2 }test, an odds ratio test, ReliefF and TuRF, is available from its homepage [87].
JLIN
JLIN or a Java LINkage disequilibrium plotter is a computer program for visualisation of linkage disequilibrium analysis [63]. The program is capable of displaying many statistical measures including D' [64] and r^{2 }[65]. The program is publicly available from the Centre for Genetic Epidemiology and Biostatistics, University of Western Australia [88].
Interaction dendrogram
An interaction dendrogram is a graphical tool for the visualisation of relationships among attributes (SNPs) [68,69]. The interaction dendrogram is constructed via hierarchical clustering analysis and is embedded into MDR software [87]. The dendrogram illustrates the entropybased interaction between attributes by displaying interacting or related attributes closely together as adjacent leaves in a tree. At the same time, independent attributes are placed far apart from one another. In addition, the conclusion regarding whether the interaction between attributes is synergistic or redundancy is present can be deduced.
Availability and requirements
The 2LOmb program for Windows platforms and examples of simulated data are available at http://code.google.com/p/nachol/w/list webcite.
List of abbreviations
2LOmb: omnibus permutation test on ensembles of twolocus analyses; ALT: alanine transaminase; ANOVA: analysis of variance; AST: aspartate transaminase; CFS: correlationbased feature selection; CI: confidence interval; CVC: crossvalidation consistency; DIO2: deiodinase, iodothyronine, type II; E2LA: exhaustive twolocus analysis; EGFR: epidermal growth factor receptor (erythroblastic leukemia viral (verbb) oncogene homolog, avian); FAMHAP: software for singlemarker analysis and joint analysis of unphased genotype data from tightly linked markers (haplotype analysis); FUSION: FinlandUnited States Investigation of NIDDM Genetics; genomeSIM: simulation package for generating casecontrol samples in largescale and genomewide association studies; GYS2: glycogen synthase 2 (liver); HNF4A: hepatocyte nuclear factor 4, alpha; HuGENet: Human Genome Epidemiology Network; HWE: HardyWeinberg equilibrium; JLIN: Java LINkage disequilibrium plotter; KCNJ11: potassium inwardlyrectifying channel, subfamily J, member 11; LD: linkage disequilibrium; LIM domains: protein structural domains that are named after their initial discovery in the proteins Lin11, Isl1 and Mec3; LMX1A: LIM homeobox transcription factor 1, alpha; MDR: multifactor dimensionality reduction; NIDDM: noninsulindependent diabetes mellitus; PARK2: Parkinson disease (autosomal recessive, juvenile) 2, parkin; PGM1: phosphoglucomutase 1; PPARG: peroxisome proliferatoractivated receptor gamma; RXRG: retinoid X receptor, gamma; SAA: set association approach; SNP: single nucleotide polymorphism; T2D: type 2 diabetes mellitus; TuRF: tuned ReliefF; UCP2: uncoupling protein 2 (mitochondrial, proton carrier); Weka: Waikato environment for knowledge analysis; WTCCC: Wellcome Trust Case Control Consortium.
Authors' contributions
WW conducted the literature survey, formulated the research question, implemented the proposed algorithm, designed the experiment, and collected and interpreted the computational results. AA conducted the literature survey, formulated the research question, designed the experiment and secured the access to the genomeSIM package. TP performed the statistical analysis and interpreted the statistical results. SS monitored and oversaw the execution of computer programs on the Beowulf cluster. CL provided additional comments about the genetic association study of T2D. NC conducted the literature survey, formulated the research question, designed the proposed algorithm, designed the experiment, secured the access to the T2D data from WTCCC, selected the candidate genes for the T2D association study, discussed all results, drew the conclusions and wrote the manuscript. All authors read and approved the final manuscript.
Authors' information
WW is a Ph.D. student at the Department of Electrical Engineering, Faculty of Engineering, King Mongkut's University of Technology North Bangkok. He also received his B.Eng. and M.Eng. degrees in electrical engineering from King Mongkut's University of Technology North Bangkok. His current research interests include machine learning, evolutionary computation and bioinformatics.
AA is a Ph.D. student at the Department of Immunology, Faculty of Medicine Siriraj Hospital, Mahidol University. He also received his B.Sc. degree in pharmacy from Mahidol University. His current research interests include human genetics, genetic epidemiology, population genetics and bioinformatics.
TP is a Ph.D. student at the Department of Electrical Engineering, Faculty of Engineering, King Mongkut's University of Technology North Bangkok. He also received his B.Eng. and M.Eng. degrees in production engineering from King Mongkut's University of Technology North Bangkok. His current research interests include evolutionary multiobjective optimisation and machine learning.
SS is a parttime research assistant at the Department of Electrical Engineering, Faculty of Engineering, King Mongkut's University of Technology North Bangkok. He received his B.Eng. and M.Eng. degrees in electrical engineering from Thammasat University and King Mongkut's University of Technology North Bangkok, respectively. His current research interests include machine learning and genetic epidemiology.
CL is the Head of Division of Molecular Genetics at the Department of Research and Development, Faculty of Medicine Siriraj Hospital, Mahidol University. He also received his M.D. degree from Mahidol University. His current research interests include human genetics and genetic diseases.
NC is an associate professor of electrical engineering at King Mongkut's University of Technology North Bangkok and an adjunct professor of genetic epidemiology at Mahidol University. He received his B.Eng. and Ph.D. degrees from the Department of Automatic Control and Systems Engineering, University of Sheffield. His current research interests include evolutionary computation, machine learning and genetic epidemiology.
Acknowledgements
The authors are extremely grateful to four anonymous reviewers and Dr. Danielle Talbot for their valuable comments and suggestions, which have contributed a lot towards improving the content and presentation of this article. The authors are also extremely grateful to Dr. Kuntinee Maneeratana, Dr. Pensiri Tongpadungrod, Dr. Graeme James Sheppard and Mr. Edward L. Karas for their efforts on proofreading the manuscript. The authors wish to thank Dr. Marong Phadoongsidhi and Dr. Vara Varavithya and for their assistance on processing the T2D data and providing an access to the Beowulf cluster, respectively. The authors also wish to thank Dr. Sissades Tongsima for his assistance on testing of multiple hypotheses, providing many suggestions for the manuscript revision and allowing the usage of computer facilities at the Genome Institute, National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA) during the algorithm development. The authors acknowledge Scott M. Dudek at the Vanderbilt University for providing an access to the genomeSIM package. Furthermore, the authors acknowledge Dr. Audrey Duncanson and the Wellcome Trust Case Control Consortium for granting an access to the genomewide casecontrol data. WW, AA and TP were supported by the Thailand Research Fund (TRF) through the Royal Golden Jubilee Ph.D. Programme (Grant No. PHD/1.E.KN.49/A.1, PHD/4.I.MU.45/C.1 and PHD/1.E.KN.50/A.1, respectively). CL was supported by the Mahidol Research Grant. NC was supported by the Thailand Research Fund and the National Science and Technology Development Agency through the NSTDA Chair Professor Grant.
References

Risch N, Merikangas K: The future of genetic studies of complex human diseases.
Science 1996, 273:15161517. PubMed Abstract  Publisher Full Text

Musani SK, Shriner D, Liu N, Feng R, Coffey CS, Yi N, Tiwari HK, Allison DB: Detection of gene × gene interactions in genomewide association studies of human population data.
Hum Hered 2007, 63:6784. PubMed Abstract  Publisher Full Text

The Wellcome Trust Case Control Consortium: Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls.
Nature 2007, 447:661678. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

The GAIN Collaborative Research Group: New models of collaboration in genomewide association studies: the Genetic Association Information Network.
Nat Genet 2007, 39:10451051. PubMed Abstract  Publisher Full Text

Heidema AG, Boer JMA, Nagelkerke N, Mariman ECM, van der A DL, Feskens EJM: The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases.
BMC Genet 2006, 7:23. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Motsinger AA, Ritchie MD, Reif DM: Novel methods for detecting epistasis in pharmacogenomics studies.
Pharmacogenomics 2007, 8:12291241. PubMed Abstract  Publisher Full Text

Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC: A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility.
J Theor Biol 2006, 241:252261. PubMed Abstract  Publisher Full Text

Saeys Y, Inza I, Larrañaga P: A review of feature selection techniques in bioinformatics.
Bioinformatics 2007, 23:25072517. PubMed Abstract  Publisher Full Text

Lewis CM: Genetic association studies: design, analysis and interpretation.
Brief Bioinform 2002, 3:146153. PubMed Abstract  Publisher Full Text

Montana G: Statistical methods in genetics.
Brief Bioinform 2006, 7:297308. PubMed Abstract  Publisher Full Text

Hoh J, Wille A, Ott J: Trimming, weighting, and grouping SNPs in human casecontrol association studies.
Genome Res 2001, 11:21152119. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Potter DM: Omnibus permutation tests of the association of an ensemble of genetic markers with disease in casecontrol studies.
Genet Epidemiol 2006, 30:438446. PubMed Abstract  Publisher Full Text

Chapman J, Clayton D: Detecting association using epistatic information.
Genet Epidemiol 2007, 31:894909. PubMed Abstract  Publisher Full Text

Hall MA, Holmes G: Benchmarking attribute selection techniques for discrete class data mining.
IEEE Trans Knowl Data Eng 2003, 15:14371447. Publisher Full Text

RobnikŠikonja M, Kononenko I: Theoretical and empirical analysis of ReliefF and RReliefF.
Mach Learn 2003, 53:2369. Publisher Full Text

Moore JH, White BC: Tuning ReliefF for genomewide genetic analysis. In Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics. Edited by Marchiori E, Moore JH, Rajapakse JC. Berlin, Heidelberg: Springer; 2007:166175.
[Goos G, Hartmanis J, van Leeuwen J (Founding and Former Series Editors): Lecture Notes in Computer Science, vol 4447].

Nelson MR, Kardia SLR, Ferrell RE, Sing CF: A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation.
Genome Res 2001, 11:458470. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Culverhouse R, Klein T, Shannon W: Detecting epistatic interactions contributing to quantitative traits.
Genet Epidemiol 2004, 27:141152. PubMed Abstract  Publisher Full Text

Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactordimensionality reduction reveals highorder interactions among estrogenmetabolism genes in sporadic breast cancer.
Am J Hum Genet 2001, 69:138147. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hahn LW, Ritchie MD, Moore JH: Multifactor dimensionality reduction software for detecting genegene and geneenvironment interactions.
Bioinformatics 2003, 19:376382. PubMed Abstract  Publisher Full Text

Bush WS, Dudek SM, Ritchie MD: Parallel multifactor dimensionality reduction: a tool for the largescale analysis of genegene interactions.
Bioinformatics 2006, 22:21732174. PubMed Abstract  Publisher Full Text

Chung Y, Lee SY, Elston RC, Park T: Odds ratio based multifactordimensionality reduction method for detecting genegene interactions.
Bioinformatics 2007, 23:7176. PubMed Abstract  Publisher Full Text

Bush WS, Edwards TL, Dudek SM, McKinney BA, Ritchie MD: Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction.
BMC Bioinformatics 2008, 9:238. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Lou XY, Chen GB, Yan L, Ma JZ, Mangold JE, Zhu J, Elston RC, Li MD: A combinatorial approach to detecting genegene and geneenvironment interactions in family studies.
Am J Hum Genet 2008, 83:457467. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Edwards TL, Lewis K, Velez DR, Dudek SM, Ritchie MD: Exploring the performance of multifactor dimensionality reduction in large scale SNP studies and in the presence of genetic heterogeneity among epistatic disease models.
Hum Hered 2009, 67:183192. PubMed Abstract  Publisher Full Text

Mechanic LE, Luke BT, Goodman JE, Chanock SJ, Harris CC: Polymorphism Interaction Analysis (PIA): a method for investigating complex genegene interactions.
BMC Bioinformatics 2008, 9:146. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Liang KH, Hwang Y, Shao WC, Chen EY: An algorithm for model construction and its applications to pharmacogenomic studies.
J Hum Genet 2006, 51:751759. PubMed Abstract  Publisher Full Text

EstradaGil JK, FernándezLópez JC, HernándezLemus E, SilvaZolezzi I, HidalgoMiranda A, JiménezSánchez G, VallejoClemente EE: GPDTI: a Genetic Programming Decision Tree Induction method to find epistatic effects in common complex diseases.
Bioinformatics 2007, 23:i167i174. PubMed Abstract  Publisher Full Text

Nunkesser R, Bernholt T, Schwender H, Ickstadt K, Wegener I: Detecting highorder interactions of single nucleotide polymorphisms using genetic programming.
Bioinformatics 2007, 23:32803288. PubMed Abstract  Publisher Full Text

Lunetta KL, Hayward LB, Segal J, van Eerdewegh P: Screening largescale association study data: exploiting interactions using random forests.
BMC Genet 2004, 5:32. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Bureau A, Dupuis J, Falls K, Lunetta KL, Hayward B, Keith TP, van Eerdewegh P: Identifying SNPs predictive of phenotype using random forests.
Genet Epidemiol 2005, 28:171182. PubMed Abstract  Publisher Full Text

Chen X, Liu CT, Zhang M, Zhang H: A forestbased approach to identifying gene and genegene interactions.
Proc Natl Acad Sci USA 2007, 104:1919919203. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ritchie MD, White BC, Parker JS, Hahn LW, Moore JH: Optimization of neural network architecture using genetic programming improves detection and modeling of genegene interactions in studies of human diseases.
BMC Bioinformatics 2003, 4:28. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

MotsingerReif AA, Dudek SM, Hahn LW, Ritchie MD: Comparison of approaches for machinelearning optimization of neural networks for detecting genegene interactions in genetic epidemiology.
Genet Epidemiol 2008, 32:325340. PubMed Abstract  Publisher Full Text

Cordell HJ: Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans.
Hum Mol Genet 2002, 11:24632468. PubMed Abstract  Publisher Full Text

Wilson SR: Epistasis. In Nature Encyclopedia of the Human Genome. Volume 2. Edited by Cooper DN. London: Nature Publishing Group; 2004::317320.

Neuman RJ, Rice JP: Twolocus models of disease.
Genet Epidemiol 1992, 9:347365. PubMed Abstract  Publisher Full Text

Schork NJ, Boehnke M, Terwilliger JD, Ott J: Twotraitlocus linkage analysis: a powerful strategy for mapping complex genetic traits.
Am J Hum Genet 1993, 53:11271136. PubMed Abstract  PubMed Central Full Text

Li W, Reich J: A complete enumeration and classification of twolocus disease models.
Hum Hered 2000, 50:334349. PubMed Abstract  Publisher Full Text

Marchini J, Donnelly P, Cardon LR: Genomewide strategies for detecting multiple loci that influence complex diseases.
Nat Genet 2005, 37:413417. PubMed Abstract  Publisher Full Text

Hallgrímsdóttir IB, Yuster DS: A complete classification of epistatic twolocus models.
BMC Genet 2008, 9:17. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Culverhouse R, Suarez BK, Lin J, Reich T: A perspective on epistasis: limits of models displaying no main effect.
Am J Hum Genet 2002, 70:461471. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cho YM, Ritchie MD, Moore JH, Park JY, Lee KU, Shin HD, Lee HK, Park KS: Multifactordimensionality reduction shows a twolocus interaction associated with type 2 diabetes mellitus.
Diabetologia 2004, 47:549554. PubMed Abstract  Publisher Full Text

Hsieh CH, Liang KH, Hung YJ, Huang LC, Pei D, Liao YT, Kuo SW, Bey MSJ, Chen JL, Chen EY: Analysis of epistasis for diabetic nephropathy among type 2 diabetic patients.
Hum Mol Genet 2006, 15:27012708. PubMed Abstract  Publisher Full Text

Qi L, van Dam RM, Asselbergs FW, Hu FB: Genegene interactions between HNF4A and KCNJ11 in predicting type 2 diabetes in women.
Diabet Med 2007, 24:11871191. PubMed Abstract  Publisher Full Text

Zhang Z, Zhang S, Wong MY, Wareham NJ, Sha Q: An ensemble learning approach jointly modeling main and interaction effects in genetic association studies.
Genet Epidemiol 2008, 32:285300. PubMed Abstract  Publisher Full Text

Fiorito M, Torrente I, De Cosmo S, Guida V, Colosimo A, Prudente S, Flex E, Menghini R, Miccoli R, Penno G, Pellegrini F, Tassi V, Federici M, Trischitta V, Dallapiccola B: Interaction of DIO2 T92A and PPARγ2 P12A polymorphisms in the modulation of metabolic syndrome.
Obesity 2007, 15:28892895. PubMed Abstract  Publisher Full Text

Albrechtsen A, Castella S, Andersen G, Hansen T, Pedersen O, Nielsen R: A Bayesian multilocus association method: allowing for higherorder interaction in association studies.
Genetics 2007, 176:11971208. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhang Y, Liu JS: Bayesian inference of epistatic interactions in casecontrol studies.
Nat Genet 2007, 39:11671173. PubMed Abstract  Publisher Full Text

Evans DM, Marchini J, Morris AP, Cardon LR: Twostage twolocus models in genomewide association.
PLoS Genet 2006, 2:e157. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ionita I, Man M: Optimal twostage strategy for detecting interacting genes in complex diseases.
BMC Genet 2006, 7:39. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gayán J, GonzálezPérez A, Bermudo F, Sáez ME, Royo JL, Quintas A, Galan JJ, Morón FJ, RamirezLorca R, Real LM, Ruiz A: A method for detecting epistasis in genomewide studies using casecontrol multilocus association analysis.
BMC Genomics 2008, 9:360. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Heidema AG, Feskens EJM, Doevendans PAFM, Ruven HJT, van Houwelingen HC, Mariman ECM, Boer JMA: Analysis of multiple SNPs in genetic association studies: comparison of three multilocus methods to prioritize and select SNPs.
Genet Epidemiol 2007, 31:910921. PubMed Abstract  Publisher Full Text

Pesarin F: Multivariate Permutation Tests with Applications to Biostatistics. Chichester: Wiley; 2001.

Fisher RA: Statistical Methods for Research Workers. 4th edition. London: Oliver and Boyd; 1932.

Westfall PH, Young SS: ResamplingBased Multiple Testing: Examples and Methods for pvalue Adjustment. New York: John Wiley and Sons; 1993.

Becker T, Schumacher J, Cichon S, Baur MP, Knapp M: Haplotype interaction analysis of unlinked regions.
Genet Epidemiol 2005, 29:313322. PubMed Abstract  Publisher Full Text

Herold C, Becker T: Genetic association analysis with FAMHAP: a major program update.
Bioinformatics 2009, 25:134136. PubMed Abstract  Publisher Full Text

Hardy GH: Mendelian proportions in a mixed population.
Science 1908, 28:4950. PubMed Abstract  Publisher Full Text

Dudek SM, Motsinger AA, Velez DR, Williams SM, Ritchie MD: Data simulation software for wholegenome association and other studies in human genetics. In Proceedings of the Pacific Symposium on Biocomputing 2006: 37 January 2006; Maui. Edited by Altman RB, Dunker AK, Hunter L, Murray T, Klein TE. Singapore: World Scientific; 2006:499510. PubMed Abstract  Publisher Full Text

Guyon I, Elisseeff A: An introduction to variable and feature selection.
J Mach Learn Res 2003, 3:11571182. Publisher Full Text

Yu W, Gwinn M, Clyne M, Yesupriya A, Khoury MJ: A navigator for human genome epidemiology.
Nat Genet 2008, 40:124125. PubMed Abstract  Publisher Full Text

Carter KW, McCaskie PA, Palmer LJ: JLIN: a java based linkage disequilibrium plotter.
BMC Bioinformatics 2006, 7:60. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Lewontin RC: The interaction of selection and linkage. I. general considerations; heterotic models.
Genetics 1964, 49:4967. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hill WG, Robertson A: Linkage disequilibrium in finite populations.
Theor Appl Genet 1968, 38:226231. Publisher Full Text

Excoffier L, Slatkin M: Maximumlikelihood estimation of molecular haplotype frequencies in a diploid population.
Mol Biol Evol 1995, 12:921927. PubMed Abstract  Publisher Full Text

Epstein MP, Satten GA: Inference on haplotype effects in casecontrol studies using unphased genotype data.
Am J Hum Genet 2003, 73:13161329. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Jakulin A, Bratko I, Smrke D, Demšar J, Zupan B: Attribute interactions in medical data analysis. In Artificial Intelligence in Medicine. Edited by Dojat M, Keravnou E, Barahona P. Berlin, Heidelberg: Springer; 2003:229238.
[Carbonell JG, Siekmann J (Series Editors): Lecture Notes in Artificial Intelligence, vol 2780].
PubMed Abstract  Publisher Full Text 
Jakulin A, Bratko I: Analyzing attribute dependencies. In Knowledge Discovery in Databases: PKDD 2003. Edited by Lavrač N, Gamberger D, Todorovski L, Blockeel H. Berlin, Heidelberg: Springer; 2003:229240.
[Carbonell JG, Siekmann J (Series Editors): Lecture Notes in Artificial Intelligence, vol 2838].

Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes.
Nucleic Acids Res 2000, 28:2730. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kanehisa M, Goto S, Hattori M, AokiKinoshita K, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG.
Nucleic Acids Res 2006, 34:D354D357. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, Yamanishi Y: KEGG for linking genomes to life and the environment.
Nucleic Acids Res 2008, 36:D480D484. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Thameem F, Wolford JK, Wang J, German MS, Bogardus C, Prochazka M: Cloning, expression and genomic structure of human LMX1A, and variant screening in Pima Indians.
Gene 2002, 290:217225. PubMed Abstract  Publisher Full Text

Hanson RL, Ehm MG, Pettitt DJ, Prochazka M, Thompson DB, Timberlake D, Foroud T, Kobes S, Baier L, Burns DK, Almasy L, Blangero J, Garvey WT, Bennett PH, Knowler WC: An autosomal genomic scan for loci linked to type II diabetes mellitus and bodymass index in Pima Indians.
Am J Hum Genet 1998, 63:11301138. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Leak TS, Mychaleckyj JC, Smith SG, Keene KL, Gordon CJ, Hicks PJ, Freedman BI, Bowden DW, Sale MM: Evaluation of a SNP map of 6q2427 confirms diabetic nephropathy loci and identifies novel associations in type 2 diabetes patients with nephropathy from an AfricanAmerican population.
Hum Genet 2008, 124:6371. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sale MM, Freedman BI, Langefeld CD, Williams AH, Hicks PJ, Colicigno CJ, Beck SR, Brown WM, Rich SS, Bowden DW: A genomewide scan for type 2 diabetes in AfricanAmerican families reveals evidence for a locus on chromosome 6q.
Diabetes 2004, 53:830837. PubMed Abstract  Publisher Full Text

Watanabe I, Tomita A, Shimizu M, Sugawara M, Yasumo H, Koishi R, Takahashi T, Miyoshi K, Nakamura K, Izumi T, Matsushita Y, Furukawa H, Haruyama H, Koga T: A study to survey susceptible genetic factors responsible for troglitazoneassociated hepatotoxicity in Japanese patients with type 2 diabetes mellitus.
Clin Pharmacol Ther 2003, 73:435455. PubMed Abstract  Publisher Full Text

GloriaBottini F, Magrini A, Antonacci E, La Torre M, Di Renzo L, De Lorenzo A, Bergamaschi A, Bottini E: Phosphoglucomutase genetic polymorphism and body mass.
Am J Med Sci 2007, 334:421425. PubMed Abstract  Publisher Full Text

Spencer N, Hopkinson DA, Harris H: Phosphoglucomutase polymorphism in man.
Nature 1964, 204:742745. PubMed Abstract  Publisher Full Text

March RE, Putt W, Hollyoake M, Ives JH, Lovegrove JU, Hopkinson DA, Edwards YH, Whitehouse DB: The classical human phosphoglucomutase (PGM1) isozyme polymorphism is generated by intragenic recombination.
Proc Natl Acad Sci USA 1993, 90:1073010733. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zeggini E, Scott LJ, Saxena R, Voight BF: Metaanalysis of genomewide association data and largescale replication identifies additional susceptibility loci for type 2 diabetes.
Nat Genet 2008, 40:638645. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Press WH, Teukolsky SA, Vetterling WT, Flannery BP: Numerical Recipes in C: The Art of Scientific Computing. 2nd edition. Cambridge: Cambridge University Press; 1992.

Moore JH, Hahn LW, Ritchie MD, Thornton TA, White BC: Routine discovery of complex genetic models using genetic algorithms.
Appl Soft Comput 2004, 4:7986. Publisher Full Text

S Statistic in Gene Mapping [http://www.genemapping.cn/sumstat.html] webcite

Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques. 2nd edition. San Francisco: Morgan Kaufmann; 2005.

Weka 3: Data Mining Software in Java [http://www.cs.waikato.ac.nz/ml/weka/] webcite

Multifactor Dimensionality Reduction [http://www.multifactordimensionalityreduction.org/] webcite

JLIN: A Java Based Linkage Disequilibrium Plotter [http://www.genepi.org.au/jlin.html] webcite