Epistatic interactions of multiple single nucleotide polymorphisms (SNPs) are now believed to affect individual susceptibility to common diseases. The detection of such interactions, however, is a challenging task in large scale association studies. Ant colony optimization (ACO) algorithms have been shown to be useful in detecting epistatic interactions.
AntEpiSeeker, a new two-stage ant colony optimization algorithm, has been developed for detecting epistasis in a case-control design. Based on some practical epistatic models, AntEpiSeeker has performed very well.
AntEpiSeeker is a powerful and efficient tool for large-scale association studies and can be downloaded from http://nce.ads.uga.edu/~romdhane/AntEpiSeeker/index.html webcite.
Genetic association studies, which aim at detecting association between one or more genetic polymorphisms and a trait of interest such as a quantitative characteristic, discrete attribute or disease, have gained a lot of popularity in the past decade . Although great progress in mapping genes responsible for Mendelian traits has been made, the genetic basis underlying many complex diseases remain unknown. It is widely accepted that these diseases may be caused by the joint effects of multiple genetic variations, which may show little effect individually but strong interactions. Such interactive effects of multiple genetic variations are often referred to as epistasis or epistatic interactions . Recently, increasing numbers of studies have suggested the presence of epistatic interactions in complex diseases, e.g. breast cancer , type-2 diabetes  and atrial fibrillation .
A number of multi-locus approaches have been proposed to detect epistatic interactions, such as the combinatorial partitioning method (CPM) , restricted partitioning method (RPM) , the multifactor-dimensionality reduction (MDR) , the focused interaction testing framework (FITF)  and the backward genotype-trait association (BGTA) . Although these methods were tested and showed promising performance on small data sets, the computational burden prohibits their application on large scale datasets.
Typically, a large scale dataset for association studies may have several tens to hundreds of thousands of markers. For example, the genome-wide case-control data set for Age-related Macular Degeneration (AMD) contains more than 100 thousand SNPs genotyped on 96 cases and 50 controls . An exhaustive search of two-locus interactions needs to evaluate at least 5.00 × 109 locus combinations, and this number increases to 1.67 × 1014 when three-locus interactions are considered. Although this process is computationally hard it could be enhanced by two recent approaches: the Bayesian epistasis association mapping (BEAM)  and SNPharvester , which were shown to be able to handle large scale datasets. However, more efficient and accurate methods are still desired.
The solution to this difficult search problem could be achieved using an optimization technique called ant colony optimization (ACO) algorithm. Ant colony algorithms, proposed first by Dorigio and Gambardella , are tools to solve difficult optimization problems such as the traveling salesman problem. ACO simulates how real ant colonies find the shortest route to a food source. Real ant colonies communicate through chemicals called pheromones, which are deposited along the path an ant travels. Ants that choose a shorter path will transverse the distance at a faster rate, resulting in more pheromones deposited along that path. Subsequent ants will then choose the path with more pheromones, thus creating a positive feedback. In ACO, artificial ants work as parallel units that communicate through a probability distribution function (PDF), which is updated by weights or pheromones. The change in pheromones is determined by some type of expert knowledge. As the PDF is updated, "paths" that perform better will be sampled at higher rates by subsequent artificial ants, and in turn deposit more pheromones. Thus, a positive feedback similar to real ant colonies is simulated.
Two recent studies showed the possibility of applying ant colony optimization to association studies [14,15]. However, the use of MDR for detecting epistatic interactions in these studies dramatically increased the computational burden. Besides, these studies did not test performance using the more practical epistatic models such as the ones proposed by Marchini et al. .
In this study, a new tool named AntEpiSeeker has been developed to search for epistatic interactions in large-scale association studies. The use of χ2 values as score function to measure the association between an SNP set and the phenotype is computationally efficient. The two-stage design of ant colony optimization and the idea of searching bigger SNP sets harboring epistatic interactions enhance the power of ACO algorithms. AntEpiSeeker showed improved performance based on some practical epistatic models and large scale datasets.
The generic ant colony optimization
The ACO has been proven to be a successful technique for numerous non-deterministic polynomial-time hard (NP-hard) combinatorial optimization problems such as the traveling salesman problem, the graph coloring problem, the frequency assignment problem, the quadratic assignment problem, feature selection for microarray classification and the vehicle routing problem [17-22]. ACO has the advantages of a positive feedback, and it lends itself to parallel computing, among other advantages.
As defined by Dorigio and Gambardella , ACO is comprised of parallel artificial ants that communicate through a probability density function (PDF) that is updated by weights or 'pheromone levels'. In this case, the ACO is an iterative procedure which stops at a pre-defined number of iterations and the weights are determined by the significance of the epistatic interaction of the selected set of SNPs. The probability of selecting locus k at iteration i is defined as:
where τk(i) is the amount of pheromones for locus k at iteration i; is some form of prior information, which is set to 1 in this study as we treat each locus equally; α is the parameter determining the weight given to the pheromones deposited by ants. The ACO is initialized with all loci having an equal level of pheromone τ0. Using the PDF defined in equation (1), each artificial ant, m, will select an SNP set Sm of n loci from the whole set of genomic SNPs. The epistatic interaction for this SNP set is evaluated by the χ2 test. The pheromone level of each locus k in Sm is then updated, based on the performance of Sm, as:
where ρ is a constant between 0 and 1 that represents the pheromone evaporation rate; Δτk(i) is the change in pheromone level for locus k at iteration i, which equals 0.1 χ2 of Sm in this study, and is set to zero if locus k ∉ Sm. This process is repeated for all artificial ants.
In an effort to increase the detection power of the generic ACO as outlined in the previous section, an advanced ACO algorithm called AntEpiSeeker, which employs a two-stage design of ACO, is proposed. The first stage of AntEpiSeeker searches SNP sets of sufficient size (larger than the number of SNPs in a given epistatic interaction) using the above ACO, which results in a pre-defined number of highly suspected SNP sets determined by χ2 scores, and another SNP set of a pre-defined size, determined by pheromone levels. The second stage of AntEpiSeeker conducts exhaustive search of epistatic interactions within the highly suspected SNP sets, and within the reduced set of SNPs with top ranking pheromone levels. The use of highly suspected SNP sets (much smaller than the available SNPs in the data) enhances the power of detecting pure epistasis based on greatly reduced computational cost and the SNP set with top ranking pheromone levels is used to detect epistasis among the SNPs with big marginal effects. Additionally, we suggest two rounds of search: 1) using a relatively large size SNP set, which is sensitive to strong signals, and 2) using a relatively small size SNP set, which is sensitive to weak signals. The pseudocode for AntEpiSeeker is shown in Figure 1.
Figure 1. Pseudocode for AntEpiSeeker.
Minimizing false positives
AntEpiSeeker may report all detected epistatic interactions at a p-value threshold. In addition, AntEpiSeeker incorporates a procedure for minimizing false positives, which can be described as:
1) The set of all detected epistatic interactions is denoted by EIall and another null set, holding the epistatic interactions with minimized false positives, is denoted by EIm.
2) Each reported epistatic interaction Ii in EIall is attempted to be added into EIm sequentially. If Ii does not have any locus overlapping with those of each epistatic interaction in EIm, Ii is added to EIm. Otherwise, assuming that the epistatic interaction Jj in EIm has at least one locus overlapping with those of Ii, determine if the p-value of Ii is smaller than that of Jj. If so, Jj in EIm is replaced by Ii. If not, Ii is not reported in EIm.
The AntEpiseeker package was written in C++. Before compiling, the GNU Scientific Library (GSL) needs to be installed on the user's computer. A separate parameter file named "parameters.txt" specifies the parameters needed to run the program. The SNP data file should be comma-delimited, with the first row specifying the SNP names. All subsequent rows should contain SNP data for each sample. The SNP data should be coded by 0, 1 and 2. The last column indicates the sample status (0 indicates control and 1 indicates case). There are three output files. "AntEpiSeeker.log" records some intermediate results, "results_maximized.txt" reports all detected epistatic interactions, and the user-specified output file shows the epistatic interactions with minimized false positives. The user specified output file includes the locus name, χ2 value and p-value. The software and its source code are available for download at http://nce.ads.uga.edu/~romdhane/AntEpiSeeker/index.html webcite.
The parameters needed to run AntEpiseeker include iAntCount, iItCountLarge, iItCountSmall, α, iTopModel, iTopLoci, ρ, τ0, largesetsize, smallsetsize, iEpiModel, pvalue, INPFILE, OUTFILE. The parameter "iEpiModel" specifies the number of SNPs in an epistatic interaction. The parameters "largesetsize", "smallsetsize" must be greater than "iEpiModel". For a two-locus interaction model, we suggest largesetsize = 6, smallsetsize = 3, iEpiMode = 2; For a three-locus interaction model, we suggest largesetsize = 6, smallsetsize = 4, iEpiModel = 3. The parameters "iItCountLarge", "iItCountSmall" should be chosen according to the number of SNPs genotyped in the data (Denoted by L). Typically, we suggest iItCountSmall ≥ 0.1 × L and iItCountLarge = 0.5 × iItCountSmall. iAntCount may vary from 500 to 5,000, where larger iAntCount should correspond to larger L. ρ should range from 0.01 to 0.1 for better performance, where smaller L should use larger ρ. The default parameters in the AntEpiSeeker package, used in our most simulation studies, were an optimal setting balanced between ρ and iAntCount, which should work well on medium size datasets (2 × 103 ≤ L ≤ 2 × 104).
Power and computational time evaluation on a simulated data set
The performance of AntEpiSeeker was evaluated by comparison with two recent methods, BEAM  and SNPHarverster , as well as the generic ACO algorithm, using simulated data generated by the simulation program provided in the BEAM package. Note that the generic ACO algorithm does not select SNP sets of bigger size, and thus the parameters "largesetsize, smallsetsize, iItCountLarge, iItCountSmall" were not needed for the generic ACO algorithm. The simulation study was conducted following the procedure and parameter settings of many previous studies [11,12,16,23,24]. For each combination of parameter settings, 50 datasets containing 4,000 samples (2,000 cases and 2,000 controls) and 2000 SNPs were simulated. The detection power was calculated as the ratio of the number of successful identifications to the number of datasets at the significance level 0.01 after Bonferroni correction. Data was simulated following three genetic models: 1) additive model, 2) epistatic interactions with multiplicative effects and 3) epistatic interactions with threshold effects, as defined by Marchini et al. . Other parameters for data simulation were the effective size λ (a measure of marginal effects as defined by Marchini et al. ), linkage disequilibrium between SNPs measured by r2 and minor allele frequencies (MAFs). λ was set to 0.3 for Model 1 and 0.2 for Models 2 and 3. For r2, two values (0.7 and 1.0) were used for each model. For MAF, three values (0.1, 0.2, and 0.5) were considered. The parameters for BEAM were set as default. The parameter settings for SNPHarvester were: 1 ≤ k ≤ 5 and paths = 50, as suggested by its original simulation study. The parameter settings for AntEpiSeeker were: largesetsize = 6, smallsetsize = 3, iItCountLarge = 150, iItCountSmall = 300, iEpiModel = 2, iAntCount = 1000, α = 1, ρ = 0.05 and τ0 = 100 (also available in the software package of AntEpiSeeker). The parameters of the generic ACO algorithm were set as iAntCount = 1000, α = 1, ρ = 0.05, τ0 = 100, iItCount (number of iterations) = 900, iEpiModel = 2. The comparison of detection power for AntEpiSeeker, BEAM, SNPHarvester and the generic ACO is presented in Figure 2. The results show that AntEpiSeeker outperforms BEAM in all parameter settings and is superior to SNPHarvester and the generic ACO in most parameter settings.
Figure 2. Power comparison between AntEpiSeeker, SNPHarvester, BEAM and Generic ACO. The absence of bars indicates zero power.
In addition, AntEpiSeeker is computationally efficient. In the above simulation study, the average running time of AntEpiSeeker, SNPHarvester and BEAM were 27, 54 and 133 minutes respectively, using a Linux system based on Dual Core AMD Opteron(tm) Processor 275.
False positive rate evaluation on a null simulation
To approximate the false positive rates of AntEpiSeeker, a dataset without any genetic effects was simulated. The dataset contained 2,000 SNPs and 4,000 samples (2000 cases and 2000 controls), whose SNPs were generated independently with MAF uniformly distributed in [0.1, 0.5]. The parameters for running programs were the same as used in the first experiment. At different p-values, the false positive rates of the exhaustive search, BEAM, SNPHarvester and AntEpiSeeker are shown in Table 1. BEAM did not report any false positive, while the false positive rate of SNPHarvester is much higher than exhaustive search. AntEpiSeeker has a false positive rate that is comparable to but no larger than the desired significance level.
Table 1. False positive rate of different methods on a null simulation.
Evaluation of AntEpiSeeker on a simulated large scale dataset
To further test the performance of AntEpiseeker, a real data based simulation study was carried out. The dataset consisted of the SNP genotypes on human chromosome 1 from 912 individuals (11 populations) of the International HapMap project (Phase 3) . Loci with missing genotypes or MAF<0.1 were removed, resulting in 73,355 SNP markers for analysis. Because it has no case/control status attached, the data was randomly and equally divided between cases and controls (456 individuals in each group). Additionally, 132 epistatic interactions following the above mentioned 3 models were added to the data with randomly selected causative loci. The p-value threshold was set at 0.0001. The parameters for running programs were the same as used in the first experiment, for a fair comparison. The available software BEAM was not able to handle this dataset. The performances of SNPHarvester, generic ACO and AntEpiSeeker were compared in terms of true positive rates and false discovery rates, as summarized in Table 2. The results suggest that AntEpiSeeker significantly outperforms other methods on this large scale dataset.
Table 2. Performance comparison of different methods on a simulated large-scale dataset.
Results on WTCCC RA data
AntEpiSeeker was used to perform a large-scale association study on the rheumatoid arthritis(RA) data from the Wellcome Trust Case Control Consortium (WTCCC) , which consisted of 332,831 SNP markers and 3,503 individuals (1,504 controls and 1,999 cases). Chromosomes were scanned separately first and then jointly. The parameters for AntEpiSeeker were adjusted according to the general rule presented in the section of parameter setting. Different methods were also compared based on this real dataset. The available software BEAM was not able to handle such a big dataset. Compared with SNPHarvester, AntEpiSeeker identified more SNP markers, which were previously identified as having remarkable marginal effects [26-28], as being involved in epistatic interactions. We summarized these epistatic interactions in Table 3. Other epistatic interactions suggested by AntEpiSeeker were posted on our project web site at http://nce.ads.uga.edu/~romdhane/AntEpiSeeker/index.html webcite. It took about 5 days for AntEpiSeeker to handle the WTCCC RA data, based on Dual Core AMD Opteron(tm) Processor 275, while SNPHarvester took about 2 weeks to handle it, a similar time to the one reported in .
Table 3. Some epistatic interactions identified by AntEpiSeeker on WTCCC RA data.
In this paper, we proposed a novel tool (AntEpiSeeker) for the discovery of epistatic interactions in large scale case-control studies. AntEpiSeeker was assessed through comparison with two recent approaches on both simulated and real datasets. AntEpiSeeker, which adopts a two-stage optimization procedure, is a modified algorithm derived from the generic ACO. To demonstrate the advantages of the two-stage optimization, we also compared the performance of AntEpiSeeker with that of the generic ACO. AntEpiSeeker is a continuous research project and may be upgraded in the future.
Availability and requirements
Project name: AntEpiSeeker
Project home page: http://nce.ads.uga.edu/~romdhane/AntEpiSeeker/index.html webcite
Operating system(s): Windows, Linux
Programming language: C++
Other requirements: GNU Scientific Library (GSL) is needed for recompile
License: None for usage
Any restrictions to use by non-academics: None
The authors declare that they have no competing interests.
YW and KR conceived the project. YW developed the software. YW and XL analyzed the data. YW and RR wrote the manuscript. All authors read and approved the final manuscript.
This study was supported in part by resources and technical expertise from the University of Georgia Research Computing Center, a partnership between the Office of the Vice President for Research and the Office of the Chief Information Officer.
Future Generation Computer System 2001, 17:441-449. Publisher Full Text
Future Generation Computer Systems 2000, 16:927-935. Publisher Full Text
Math Med Bio 2007, 24:413-26. Publisher Full Text
Plenge RM, Cotsapas C, Davies L, Price AL, de Bakker PI, Maller J, Pe'er I, Burtt NP, Blumenstiel B, DeFelice M, Parkin M, Barry R, Winslow W, Healy C, Graham RR, Neale BM, Izmailova E, Roubenoff R, Parker AN, Glass R, Karlson EW, Maher N, Hafler DA, Lee DM, Seldin MF, Remmers EF, Lee AT, Padyukov L, Alfredsson L, Coblyn J, Weinblatt ME, Gabriel SB, Purcell S, Klareskog L, Gregersen PK, Shadick NA, Daly MJ, Altshuler D: Two independent alleles at 6q23 associated with risk of rheumatoid arthritis.
Raychaudhuri S, Remmers EF, Lee AT, Hackett R, Guiducci C, Burtt NP, Gianniny L, Korman BD, Padyukov L, Kurreeman FA, Chang M, Catanese JJ, Ding B, Wong S, Helm-van Mil AH, Neale BM, Coblyn J, Cui J, Tak PP, Wolbink GJ, Crusius JB, Horst-Bruinsma IE, Criswell LA, Amos CI, Seldin MF, Kastner DL, Ardlie KG, Alfredsson L, Costenbader KH, Altshuler D, Huizinga TW, Shadick NA, Weinblatt ME, de Vries N, Worthington J, Seielstad M, Toes RE, Karlson EW, Begovich AB, Klareskog L, Gregersen PK, Daly MJ, Plenge RM: Common variants at CD40 and other loci confer risk of rheumatoid arthritis.