In this project, we aim to identify pairs of single nucleotide polymorphisms (SNPs) that have a statistically significant effect on the phenotypic variation in flowering time of Arabidopsis thaliana.
Material and methods
For a large-scale dataset of over 200,000 SNPs from about 200 individuals, together with several phenotypes, published by Atwell et al., we develop efficient methods to find pairs of SNPs that are strongly associated with the phenotype. An exhaustive search over all combinations of interacting SNPs is often infeasible, even when only pairs are considered: 200,000 SNPs already yield roughly 2 × 10^10 pairs. The challenge is therefore to find methods that avoid an exhaustive search but are still guaranteed to find the causal pair. We propose two distinct approaches to efficiently determine the t top-scoring pairs of SNPs.
Results and conclusions
In the first approach, we employ a branch-and-bound strategy to reduce the search space by pruning insignificant pairs of SNPs. Based on this strategy, we develop two methods, fastHSIC and COAT, which use the Hilbert-Schmidt Independence Criterion (HSIC) and Pearson's correlation coefficient, respectively, as association measures. The key idea is that, for both methods, the association score of a pair of SNPs can be bounded using only the association score of one of the two SNPs in the pair.
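The pruning idea can be illustrated with a minimal sketch. It does not reproduce fastHSIC or COAT: the pair score and the bound below are simplified stand-ins chosen for this illustration, and all function names are hypothetical. The sketch assumes binary SNPs (0/1), scores a pair by the absolute sum of the mean-centered phenotype over individuals that carry both alleles, and bounds that score using one SNP alone, since the carriers of both alleles are a subset of the carriers of either one.

```python
import heapq
from itertools import combinations


def center(values):
    """Subtract the mean so that deviations above and below it can cancel."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def pair_score(x_i, x_j, y):
    """Toy score (an assumption of this sketch, not the HSIC or Pearson score):
    |sum of centered phenotype over individuals carrying both alleles|."""
    return abs(sum(yk for a, b, yk in zip(x_i, x_j, y) if a and b))


def single_snp_bound(x_i, y):
    """Upper bound on pair_score(x_i, x_j, y) for any binary partner x_j.

    The sum over joint carriers can neither exceed the positive part nor fall
    below the negative part of the phenotype summed over carriers of SNP i.
    """
    pos = sum(yk for a, yk in zip(x_i, y) if a and yk > 0)
    neg = -sum(yk for a, yk in zip(x_i, y) if a and yk < 0)
    return max(pos, neg)


def top_pairs_pruned(snps, phenotype, t):
    """Return (t best (score, i, j) tuples, number of pairs actually scored)."""
    y = center(phenotype)
    bounds = [single_snp_bound(x, y) for x in snps]
    # Visit SNPs in order of decreasing bound so that pruning can stop early.
    order = sorted(range(len(snps)), key=lambda k: bounds[k], reverse=True)
    heap = []    # min-heap holding the best t pairs seen so far
    scored = 0
    for p in range(len(order)):
        i = order[p]
        for q in range(p + 1, len(order)):
            j = order[q]
            # bounds[j] <= bounds[i], so bounds[j] also bounds the pair (i, j).
            if len(heap) == t and bounds[j] <= heap[0][0]:
                break  # every later partner has an even smaller bound
            score = pair_score(snps[i], snps[j], y)
            scored += 1
            item = (score, min(i, j), max(i, j))
            if len(heap) < t:
                heapq.heappush(heap, item)
            elif score > heap[0][0]:
                heapq.heappushpop(heap, item)
    return sorted(heap, reverse=True), scored


def top_pairs_exhaustive(snps, phenotype, t):
    """Reference implementation that scores every pair."""
    y = center(phenotype)
    pairs = combinations(range(len(snps)), 2)
    scores = [(pair_score(snps[i], snps[j], y), i, j) for i, j in pairs]
    return sorted(scores, reverse=True)[:t]
```

Because a pair is skipped only when its bound cannot beat the current t-th best score, the pruned search returns the same top scores as the exhaustive one, although pairs tied with the t-th score may differ. How many pairs are actually skipped depends on how tight the bound is for the data at hand.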
In our second approach, we use prior biological knowledge to select a much smaller subset of candidate genes which, according to other findings, affect the flowering time of Arabidopsis thaliana. These candidate genes and the interactions between them form a network of 1,452 nodes (genes) and 938 edges (gene-gene interactions), which allows us to select the subset of SNPs that lie within or in close proximity to the genes of the network.
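The SNP selection step can be sketched as a simple interval query. The data layout, the function name `snps_near_genes`, and the `window` parameter that defines "close proximity" are assumptions made for this illustration; the source does not specify them.

```python
from bisect import bisect_left, bisect_right


def snps_near_genes(snp_positions, genes, window=0):
    """Select SNPs lying within a gene or at most `window` base pairs from it.

    snp_positions: dict mapping chromosome -> sorted list of SNP positions
    genes: iterable of (chromosome, start, end) for the network genes
    Returns a set of (chromosome, position) tuples.
    """
    selected = set()
    for chrom, start, end in genes:
        positions = snp_positions.get(chrom, [])
        # Binary search for the slice of positions in [start - window, end + window].
        lo = bisect_left(positions, start - window)
        hi = bisect_right(positions, end + window)
        selected.update((chrom, pos) for pos in positions[lo:hi])
    return selected
```

Pairs of SNPs would then be searched only within the selected subset, or only between genes joined by an edge of the network, depending on how strongly the prior knowledge is meant to restrict the search.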
An empirical evaluation of our own methods and of traditional methods on the original and the reduced datasets shows that both of our approaches can greatly reduce the runtime.