Abstract
Background
The etiology of complex diseases is due to the combination of genetic and environmental factors, usually many of them, each with a small effect. The identification of these small-effect contributing factors is still a demanding task. Clearly, there is a need for more powerful tests of genetic association, and especially for the identification of rare effects.
Results
We introduce a new genetic association test based on symbolic dynamics and symbolic entropy. Using freely available software, we have applied this entropy test, and a conventional test, to simulated and real datasets, to illustrate the method and estimate type I error and power. We have also compared this new entropy test to the Fisher exact test for the assessment of association with low-frequency SNPs. The entropy test is generally more powerful than the conventional test, and can be significantly more powerful when the genotypic test is applied to low allele-frequency markers. We have also shown that both the Fisher and entropy methods are optimal for testing association with low-frequency SNPs (MAF around 1-5%), and that both are conservative for very rare SNPs (MAF < 1%).
Conclusions
We have developed a new, simple, consistent and powerful test to detect genetic association of biallelic/SNP markers in case-control data, using symbolic dynamics and symbolic entropy as a measure of gene dependence. We also provide a standard asymptotic distribution for this test statistic. Given that the test is based on entropy measures, it avoids smoothed nonparametric estimation. The entropy test is generally as good as or more powerful than the conventional and Fisher tests. Furthermore, the entropy test is more computationally efficient than Fisher's exact test, especially for large numbers of markers. Therefore, this entropy-based test has the advantage of being optimal for most SNPs, regardless of their allele frequency (minor allele frequency (MAF) between 1-50%). This property is quite beneficial, since many researchers tend to discard low allele-frequency SNPs from their analysis. Now they can apply the same statistical test of association to all SNPs in a single analysis, which can be especially helpful to detect rare effects.
Background
The etiology of complex diseases is due to the combination of genetic and environmental factors, usually many of them, each with a small effect. The identification of these small-effect contributing factors is still a demanding task, often requiring a large budget, thousands of individuals, and half a million or more genetic markers. Even so, success is not guaranteed. In the last decade, genetic association tests have become widely used, since they can detect small genetic effects. The current availability of genome-wide genotyping tools, combined with large collections of affected and unaffected individuals, has allowed for association analysis of the entire genome with the intention of detecting even those small genetic effects (i.e., odds-ratios (OR) around 1.2) that influence common complex diseases.
We have recently seen a proliferation of genome-wide association (GWA) analyses, some of which are identifying genes with only small or modest effect sizes (see [1] for a review). Nonetheless, the genetic factors found so far do not explain the total heritability of these diseases. Perhaps the genetic architecture of these diseases is more complex than previously thought, involving many more genes, each with a small effect, interacting among themselves and with environmental factors in complex ways. There is also the possibility of a large background of rare mutations, each possibly having a relatively large effect, but at a very low frequency [2]. Clearly, there is a need for more powerful tests of genetic association, and especially for the identification of rare effects. This need will probably be exacerbated when low-cost whole-genome sequencing becomes available, uncovering a large number of rare variants in humans [3].
Although Information Theory was originally applied in the context of communication and engineering problems [4], entropy-based approaches have also been successfully applied to gene mapping. Specifically, there are information-theory-based tests implemented for population-based association studies using genotypic tests in case-control analysis and QTL analysis [5,6], gene-centric multimarker [7] and haplotype-based association studies [8], and epistasis analysis [9-12]. Moreover, an entropy-based Transmission Disequilibrium Test (TDT) has also been described to conduct genome-wide studies in family trios [13].
In spite of these achievements, there are few simple and user-friendly computer programs to analyze and prioritize genome-wide signals using entropy-based algorithms. Furthermore, a general entropy-based allelic test has not been described, studied and implemented in software to date. We have created a new genetic association test based on entropy that provides a general tool to conduct whole-genome association studies. It is a new, simple, consistent and powerful test to detect genetic association of biallelic/SNP markers in case-control data, using symbolic dynamics and symbolic entropy as a measure of gene dependence. Furthermore, we have implemented these algorithms in software freely available to the scientific community. Using this computer program, named Gentropia, we have applied this entropy test, and a conventional test, to simulated and real datasets, to illustrate the method and estimate the type I error and power of the test.
Results
To illustrate the method we used data from the SNP Resource at the NINDS Human Genetics Resource Center DNA and Cell Line Repository (http://ccr.coriell.org/ninds/). The original genotyping was performed in the laboratory of Drs. Singleton and Hardy (NIA, LNG), Bethesda, MD, USA [14]. We have used data on 270 patients with Parkinson's disease and 271 normal control individuals who were genotyped for 396,591 SNPs in all 22 autosomal chromosomes using the Illumina Infinium I and Infinium II assays. Cases were all unrelated white individuals with idiopathic Parkinson's disease and age of onset between 55 and 84 years (except for 3 young-onset individuals). The control sample was composed of neurologically normal, unrelated, white individuals.

To explore the properties of the entropy test, and compare it to an equivalent conventional chi-square test, we simulated and analyzed datasets with specific properties. To simulate a specific effect size for a genetic variant, we wrote an algorithm that fixes the odds-ratio (OR) attributed to the SNP, and either fixes or sets randomly the minor allele frequency (MAF) in controls. Subsequently, it estimates the MAF in cases necessary to generate the desired OR. Then, specific genotypes are generated for cases and controls according to the estimated allele frequencies in each group, assuming Hardy-Weinberg equilibrium. Most datasets include 500 cases and 500 controls, and SNP marker genotypes were simulated under different genetic models (OR equal to 1 (no effect), 1.25, 1.5 and 2) and different marker allele frequencies (MAF equal to 0.05, 0.2, and 0.4). Type I error of the statistical tests was evaluated in a dataset in which 10,000 SNPs were simulated under the null hypothesis, with allele frequencies chosen randomly between 0 and 0.5, but with each SNP having a similar MAF in the case and control groups. For the power analysis, each dataset contains 100 SNPs with a specific OR and MAF.
Finally, to evaluate low allele-frequency markers in more detail, we simulated datasets of 5000 cases and 5000 controls, and 1000 SNPs with a variety of effect sizes (OR equal to 1, 1.5 and 1.8) and allele frequencies (MAF = 0.01, 0.03 and 0.06).
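The simulation procedure above can be sketched as follows (a minimal illustration, not the authors' actual Gentropia code; the helper names and the use of an allelic odds-ratio to map the control MAF to the case MAF are our assumptions):

```python
import random

def case_maf(odds_ratio, maf_controls):
    """Hypothetical helper: minor allele frequency in cases that yields the
    target allelic odds-ratio, given the MAF in controls."""
    odds = odds_ratio * maf_controls / (1.0 - maf_controls)
    return odds / (1.0 + odds)

def simulate_genotypes(n, maf):
    """Genotypes coded as 0/1/2 copies of the minor allele, drawing the two
    alleles independently (Hardy-Weinberg equilibrium)."""
    return [(random.random() < maf) + (random.random() < maf) for _ in range(n)]

# One SNP with OR = 1.5 and control MAF = 0.2, for 500 cases and 500 controls.
random.seed(1)
controls = simulate_genotypes(500, 0.2)
cases = simulate_genotypes(500, case_maf(1.5, 0.2))
```

Under the null (OR = 1) the case MAF equals the control MAF, so the same routine also generates the null datasets.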
The entropy allelic and genotypic tests can be compared to association tests commonly used in the field of human genetics. For a biallelic SNP marker, a test of association between the SNP and a disease can be computed by comparing the allelic or the genotypic frequencies in cases and controls. The conventional allelic test is a chi-square test statistic with 1 degree of freedom, while the conventional genotypic test is a chi-square test with 2 degrees of freedom.
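The two conventional statistics can be computed directly from the case-control contingency tables; a minimal stdlib sketch (the tables below are made-up illustrative counts, not data from the paper):

```python
def chi_square(table):
    """Pearson chi-square statistic for an r x c contingency table given as a
    list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum((table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
               / (row_totals[i] * col_totals[j] / n)
               for i in range(len(table)) for j in range(len(table[0])))

# Allelic test: 2 x 2 table of allele counts in cases and controls (1 df).
allelic = chi_square([[240, 760],    # cases:    minor, major
                      [200, 800]])   # controls: minor, major

# Genotypic test: 2 x 3 table of genotype counts (2 df).
genotypic = chi_square([[60, 120, 320],   # cases:    three genotype classes
                        [40, 120, 340]])  # controls
```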
Null simulations
A simulated dataset, consisting of 500 cases and 500 controls, was analyzed with both the conventional chi-square and the entropy-based association tests. A total of 10,000 SNPs, with different allele frequencies (0 < MAF < 0.5), were simulated under the null hypothesis, that is, to have no effect on the trait. This analysis reflects whether the new entropy-based association tests conform to the theoretical distribution. We counted the number of test statistics with values above the critical values of the expected distribution, to estimate the type I error for each test.
The conventional chi-square and the entropy-based tests, in both their genotypic and allelic versions, yield approximately the expected number of false positives (see Table 1), suggesting they all conform to the expected theoretical distributions (χ² with 1 or 2 degrees of freedom). The entropy-based test statistic was always equal to or larger than the conventional chi-square statistic. On average, the entropy-based genotypic method increased the test statistic by 0.047 chi-square units (1.7 percent), while the entropy allelic test exhibited an average increase of 0.003 chi-square units (0.1 percent).
Table 1. Type I Error for Conventional and Entropy tests.
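The type I error estimate behind Table 1 is simply the fraction of null statistics that exceed the theoretical critical value. A self-contained sketch (using the standard fact that −2 ln U is χ²-distributed with 2 df for uniform U; 5.991 is the 5% critical value):

```python
import random
from math import log

def type_i_error(statistics, critical_value):
    """Fraction of null-simulation test statistics that exceed the critical
    value; for a well-calibrated test this is close to the nominal alpha."""
    return sum(s > critical_value for s in statistics) / len(statistics)

# 10,000 draws from the null distribution of a 2-df chi-square statistic.
random.seed(0)
null_stats = [-2.0 * log(random.random()) for _ in range(10000)]
error_rate = type_i_error(null_stats, 5.991)  # nominal alpha = 0.05
```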
Power analysis
To estimate the power of both the conventional and entropy-based tests, we carried out an analysis of simulated datasets of 500 cases and 500 controls. Sets of 100 SNPs were simulated under different alternative hypotheses, with different effect sizes (odds-ratios of 1.25, 1.5, and 2) and minor allele frequencies (0.05, 0.2 and 0.4). The entropy-based test statistic was always equal to or larger than the conventional chi-square statistic, and therefore its power was also always equal or larger (see Tables 2 and 3). This increase in power is small, and it is more pronounced for the genotypic than for the allelic test. The gain in power with the genotypic entropy test tends to become apparent for larger chi-square values, and especially in markers with low allele frequency. For the allelic test, the entropy test is also noticeably more powerful for the OR = 2 and MAF = 0.05 simulation.
Table 2. Power (%) for conventional (CA) and entropy (AL) allelic tests for different Minor Allele Frequencies (MAF) and Odds-ratios (OR).
Table 3. Power (%) for conventional (CG) and entropy (GE) genotypic tests for different Minor Allele Frequencies (MAF) and Odds-ratios (OR).
Because the gain in power is correlated with the size of the chi-square statistic, we computed a "proportional power gain", that is, the difference between the entropy and the conventional chi-square statistics, divided by the conventional chi-square. This proportional gain allows us to compare the gain across the different simulated scenarios. As can be seen in Table 4, the average increase in power is small, ranging between 0.1 and 9.7%, with the largest gains for the genotypic test on low allele-frequency SNPs, for which the power gain ranges between 5.2 and 9.7%. In general, the gain in power increases when the OR increases and when the MAF decreases.
Table 4. Power gain (%) of the allelic and genotypic entropy tests for different Minor Allele Frequencies (MAF) and Odds-ratios (OR).
These results show that the entropy test is similar to or more powerful than the conventional chi-square test. The gain in power is small, and in some cases not different from the false-positive increase under the null hypothesis. Nonetheless, the entropy tests are an improvement over the conventional genotypic tests, for the reasons given in the Discussion, and may become useful when power is limited and, especially, for the analysis of low allele-frequency SNPs.
Low allele-frequency markers
To study in more detail the performance of the conventional and entropy genotypic tests on low allele-frequency markers, we simulated datasets of 5000 cases and 5000 controls, so there would be enough power to detect these rare effects. Each dataset included 1000 SNPs simulated under a specific effect size (OR equal to 1, 1.5 and 1.8) and allele frequency (MAF equal to 0.01, 0.03 and 0.06).
The analysis of the null-effect markers (OR = 1) reveals that both tests conform approximately well to the hypothetical null distribution for allele frequencies of 0.06 and 0.03. However, both tests are too conservative for very rare alleles, with minor allele frequencies around 1% (Table 5).
Table 5. Type I error for genotypic test with low minor allele frequencies (MAF).
Table 6 confirms that both tests behave similarly when the study has enough statistical power, that is, for allele frequencies above 5%, and even for markers with lower frequency and large effect (MAF = 3% and OR = 1.8). In contrast, it is evident that the genotypic entropy test is more powerful than the genotypic conventional test for markers with rare (MAF = 0.01) or low (MAF = 0.03) allele frequencies.
Table 6. Genotypic-test power (%) for SNPs with different low minor allele frequencies (MAF) and Odds-ratios (OR).
It is important to note here that the Fisher exact test is often used as a test of association for rare SNPs. For this reason, we have also compared the Fisher and entropy tests, in their allelic and genotypic versions. For low-frequency SNPs (MAF = 0.03), the results suggest that all four tests conform to their theoretical distributions (Tables 7 and 8). We find that the Fisher and entropy statistics are quite similar for the allelic test, with the entropy test being slightly more powerful (but also slightly more liberal) than Fisher. We conclude that both tests are essentially equivalent for the allelic test. However, both allelic tests are more powerful than any of the genotypic tests (Table 7).
Table 7. Fisher versus entropy allelic tests for different Odds-ratios (OR).
Table 8. Fisher versus entropy genotypic tests for different Odds-ratios (OR), MAF = 0.03.
Tables 8 and 9 describe the genotypic tests, for MAFs of 0.03 and 0.01 respectively. When comparing the Fisher versus entropy genotypic tests with low-frequency SNPs (MAF = 0.03), power is also very similar, slightly better for Fisher than for entropy (Table 8).
Table 9. Fisher versus entropy genotypic tests for different Odds-ratios (OR), MAF = 0.01.
Nonetheless, for very rare alleles (MAF = 0.01), both tests are extremely conservative, more so the entropy test, which consequently shows lower power for association than the Fisher test. In summary, it seems that both tests, Fisher and entropy, are optimal for testing association with low-frequency SNPs (MAF around 1-5%), and both are conservative for very rare SNPs (MAF < 1%).
Altogether, these results suggest that symbolic-entropy-based tests are valid for testing for association, and do not create a significant bias under the null hypothesis. Moreover, the entropy tests are more stable than the conventional and Fisher exact tests regardless of the allele frequency. In addition, entropy tests are computationally less expensive than the Fisher exact test.
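To make the computational comparison concrete: a two-sided Fisher exact test on a 2 × 2 allelic table must sum hypergeometric probabilities over all tables with the observed margins, whereas the entropy statistic is a closed-form expression in the cell counts. A stdlib sketch of the Fisher computation (our own illustration, not the implementation compared in the paper):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]: sum the
    probabilities of all tables with the same margins that are no more likely
    than the observed one."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    denom = comb(n, row1)

    def prob(x):  # P(top-left cell equals x) under the hypergeometric null
        return comb(col1, x) * comb(n - col1, row1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Low-frequency example: minor-allele counts out of 2N = 200 case alleles
# versus 200 control alleles.
p_value = fisher_exact_2x2(12, 188, 3, 197)
```

The loop over all admissible tables (and the large binomial coefficients involved) is what makes the exact test heavier than the entropy statistic when scanning hundreds of thousands of SNPs.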
Parkinson disease
To illustrate the analysis method on a real dataset, we have analyzed a sample of 270 Parkinson's disease patients and 271 controls, genotyped for 396,591 SNPs across the genome. This dataset includes SNPs with a wide variety of characteristics, such as different allelic and genotypic frequencies.
As we saw in the simulated datasets, the entropy tests are generally more powerful than the conventional tests. For these real data, we find some SNPs for which the entropy chi-square is lower than the conventional chi-square. However, these markers have low call rates in cases (lower than 45%), suggesting the presence of genotyping errors, and therefore would generally be excluded from association analysis.

For the genotypic test, chi-square values (2 df) range between 0 and 41.95 for the conventional test, and between 0 and 44.95 for the entropy test. On average, the entropy chi-square is 1.9% larger than the conventional one. If we consider the top-100 chi-square values for each test, there is 92% concordance in the SNPs that appear in the two rankings (irrespective of order within the ranking). The 8 SNPs chosen only by the conventional test still appear in the top 112 SNPs for the entropy test, revealing that the entropy test agrees well with the conventional test. Nonetheless, the 8 SNPs chosen only by the entropy test appear in ranks 105-359 in the conventional test. These SNPs far down the ranking of the conventional test have a common characteristic: a low frequency of the rare genotype (0-2 individuals only). As we saw in the null simulations, for low allele/genotype frequencies the genotypic entropy test statistic is larger than the conventional chi-square, suggesting that the entropy test can help detect genetic effects in low allele/genotype-frequency SNPs.

For the allelic test, chi-square values (1 df) range between 0 and 30.35 for the conventional test, and between 0 and 32.29 for the entropy test. Both tests agree well in chi-square size, with only a 0.1% difference on average. Both tests also agree on 96% of the SNPs in their top-100 rankings, and the 4 SNPs in disagreement are ranked no lower than 106th in the other ranking. These tests are nearly identical, with the entropy test slightly more powerful.
Discussion
Several entropy-based tests have recently been developed for population-based and family-based genetic association studies to perform gene mapping of complex diseases. However, to the best of our knowledge, a simple and computationally feasible allelic entropy-based test useful for GWA studies is not yet available. Allelic and genotypic methods represent the gold-standard statistical tests with which to start the prioritisation of markers during GWAS. The development of new and more powerful association tests can aid in the identification of small or rare effects, which may be widespread in the etiology of complex diseases, as shown in recent GWAS [1]. To cover this need, we have developed a new likelihood ratio test of genetic association for biallelic markers, such as SNPs, based on symbolic analysis and the related concept of entropy. Other authors, [8,5] and [6] among others, have used the concept of entropy for case-control association studies. In [8], the authors develop a statistic, namely T_{PE}, that asymptotically follows a χ² distribution. In order to obtain the asymptotic distribution of T_{PE}, they require the entropy to be continuously differentiable with respect to the frequencies of the haplotypes, which presents a problem when the frequency of a haplotype is zero either in cases or in controls. In such a case the haplotype needs to be grouped with other haplotypes, which yields a decrease in statistical power. Moreover, the T_{PE} statistic requires the estimation of an inverse matrix. Since this is not always possible, this inverse matrix has to be approximated by its generalized inverse, possibly introducing a bias in the statistic. Also, the computation of T_{PE} is more expensive in computational running time than our entropy-based test GE_{i}.

[6] provides a measure of linkage disequilibrium (LD) between a marker and the trait locus that is based on the comparison of the entropy and the conditional entropy of a marker in extreme samples of the population. Nevertheless, the authors do not give the distribution of the constructed measure, and hence it is not possible to assign a statistical significance to the procedure. Finally, [5], in the context of clusters of genetic markers, uses multidimensional scaling in conjunction with the mutual information (MI) between two discrete random variables. The authors use the fact that, under the null of no association, MI can be approximated by means of a second-order Taylor series expansion to a Gamma distribution. These entropy-based methods provide tools to test for allelic and genotypic association between a marker and a qualitative phenotype. However, in these papers the empirical size and power of the tests has not been computed, nor compared with the power of conventional tests.
The entropy test has several advantages over conventional tests: (1) it has been proved that the test is consistent. This is a valuable property, since the test will asymptotically reject any systematic deviation between the distributions of cases and controls. (2) Importantly, the test does not require prior knowledge of parameters, and therefore cannot be biased by decisions of the user. These properties, together with the fact that the test is simple, intuitive and computationally fast, make it a theoretically appealing and powerful technique for the detection of genetic association.
We have shown, in both simulated and real data, that the entropy and conventional tests, in both their genotypic and allelic versions, fit their expected null distributions well, and are even conservative for the detection of rare alleles (MAF ≤ 0.01). Moreover, the entropy genotypic test is more powerful than the conventional test, especially for those low-frequency SNPs. This is an important property, because there is a current need for tools to detect rare genetic effects.
The Fisher exact test is often used as a test of association for rare SNPs, although it is hard to program because of the complexity of its formula, and it is also computationally intensive. To make sure that the entropy test is efficient also for rare SNPs, we have compared the Fisher and entropy tests in their allelic and genotypic versions. We have shown that the entropy test is as powerful as the Fisher exact test for the analysis of low-frequency SNPs (MAF between 1-5%). Therefore, this entropy-based test has the advantage of being optimal for most SNPs, only losing power with respect to the Fisher test for very rare alleles (MAF < 1%). This property is quite beneficial, since many researchers tend to discard low allele-frequency SNPs from their analysis. Now they can apply the same statistical test of association to all SNPs in a single analysis.
These entropy tests are easy to compute with the formulas provided in this paper, which can be incorporated into any genetic analysis tool. We are making freely available simple software (Gentropia) to carry out these entropy-based genetic analyses. A Linux version of the software can be downloaded from the following website: http://www.neocodex.com/en/Gentropia.zip. The analysis is quite fast. For example, an association analysis of 1,000 SNPs on 10,000 individuals takes only 4 seconds on a 2.4 GHz CPU; a genome-wide association analysis of 400,000 SNPs on 550 individuals takes 84 seconds, which is quite satisfactory.
Conclusions
In summary, this is an application of symbolic analysis and entropy to carry out genome-wide association analysis. We have implemented this simple and fast method in freely available software (http://www.neocodex.com/en/Gentropia.zip). This entropy-based method to detect genetic association is more powerful than conventional tests, and can be especially useful in the detection of rare effects due to low-frequency genotypes. The method can be improved to include other tests of association (dominant, recessive, etc.), and covariates. Moreover, the method can be extended to the detection of epistasis.
Methods
Entropy Model
First we give some definitions and introduce the basic notation.
Let P be the population to be studied. Denote by C the set of cases with a particular disease in P and by C^{c} its complement, that is, the set of controls. Let N_{ca} and N_{co} be the cardinalities of the sets C and C^{c} respectively, and let N = N_{ca} + N_{co} be the total number of individuals in the population. Each SNP_{i} in each individual e ∈ P can take only one of the three possible values AA_{i}, Aa_{i} or aa_{i}. Let S_{i} = {AA_{i}, Aa_{i}, aa_{i}}. Moreover, each individual e ∈ P belongs to either C or C^{c}; therefore we can say that a SNP_{i} takes the value (X_{i}, ca) if e ∈ C or (X_{i}, co) if e ∈ C^{c}, for X_{i} ∈ S_{i}. We will call an element of S_{i} × {ca, co} a symbol. Therefore we can define the map

f_{i}: P → S_{i} × {ca, co}

defined by f_{i}(e) = (X_{i}, t) for X_{i} ∈ S_{i} and t ∈ {ca, co}; that is, the map f_{i} associates to each individual e ∈ P the value of its SNP_{i} and whether e is a control or a case. We will call f_{i} a symbolization map. In this case we will say that individual e is of (X_{i}, t) type. In other words, each individual is labelled with its genotype, differentiating whether the individual is a control or a case.
Denote by

n_{ca}(X_{i}) = #{e ∈ C : f_{i}(e) = (X_{i}, ca)}

and

n_{co}(X_{i}) = #{e ∈ C^{c} : f_{i}(e) = (X_{i}, co)},

that is, the cardinality of the subsets of P formed by all the individuals of (X_{i}, ca) type and of (X_{i}, co) type respectively. Therefore n(X_{i}) = n_{ca}(X_{i}) + n_{co}(X_{i}) is the number of individuals of X_{i}-type.
Also, under the conditions above, one can easily compute the relative frequency of a symbol (X_{i}, t) ∈ S_{i} × {ca, co} by:

p_{ca}(X_{i}) = n_{ca}(X_{i})/N

and

p_{co}(X_{i}) = n_{co}(X_{i})/N.

Hence the total frequency of a symbol X_{i} is p(X_{i}) = p_{ca}(X_{i}) + p_{co}(X_{i}).
Now under this setting we can define the symbolic entropy of a SNP_{i}. This entropy is defined as the Shannon entropy of the 3 distinct symbols as follows:

h(S_{i}) = − Σ_{X_{i} ∈ S_{i}} p(X_{i}) ln p(X_{i}).

Symbolic entropy, h(S_{i}), is the information contained in comparing the 3 symbols (i.e., the 3 possible values of the genotype) in S_{i} among all the individuals in P.
Similarly, we have the symbolic entropy for cases, the symbolic entropy for controls, and the case-control entropy:

h(S_{i}, ca) = − Σ_{X_{i} ∈ S_{i}} p_{ca}(X_{i}) ln p_{ca}(X_{i}),

h(S_{i}, co) = − Σ_{X_{i} ∈ S_{i}} p_{co}(X_{i}) ln p_{co}(X_{i})

and

h(C, C^{c}) = − (N_{ca}/N) ln(N_{ca}/N) − (N_{co}/N) ln(N_{co}/N),

respectively.
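The entropies defined above can be computed directly from the genotype counts. A minimal Python sketch (the counts are made-up illustrative data; note that, following the definitions above, the case and control frequencies are taken over the total N):

```python
from math import log

def entropy(freqs):
    """Shannon entropy (natural log) of a list of frequencies; zero-frequency
    symbols contribute nothing."""
    return -sum(p * log(p) for p in freqs if p > 0)

# Genotype counts for one SNP: [AA, Aa, aa] in cases and in controls.
cases, controls = [60, 120, 320], [40, 120, 340]
n_ca, n_co = sum(cases), sum(controls)
n = n_ca + n_co

h_cases = entropy([c / n for c in cases])                          # h(S_i, ca)
h_controls = entropy([c / n for c in controls])                    # h(S_i, co)
h_total = entropy([(a + b) / n for a, b in zip(cases, controls)])  # h(S_i)
h_cc = entropy([n_ca / n, n_co / n])                               # h(C, C^c)
```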
Construction of the entropy test
In this section we construct a test to detect gene effects in the set C of cases with all the machinery defined in the previous section. In order to construct the test, which is the aim of this paper, we consider the following null hypothesis:

H_{0}: the SNP_{i} is equally distributed in cases and in controls,

that is,

H_{0}: p_{ca}(X_{i}) = p(X_{i}) N_{ca}/N and p_{co}(X_{i}) = p(X_{i}) N_{co}/N for all X_{i} ∈ S_{i},

against any other alternative.
Now for a symbol (X_{i}, t) ∈ S_{i} × {ca, co} and an individual e ∈ P we define the random variable

Z_{e}(X_{i}, t) = 1 if f_{i}(e) = (X_{i}, t), and Z_{e}(X_{i}, t) = 0 otherwise;

that is, Z_{e}(X_{i}, t) = 1 if and only if e is of (X_{i}, t) type. Therefore, given that an individual e is a case, t = ca (respectively, e is a control, t = co), the variable indicates whether individual e has genotype X_{i} (taking value 1) or not (taking value 0).

Then Z_{e}(X_{i}, t) is a Bernoulli variable with probability of "success" p_{ca}(X_{i}) if t = ca or p_{co}(X_{i}) if t = co, where "success" means that e is of (X_{i}, t) type. We are then interested in knowing how many e's are of (X_{i}, t) type for each symbol (X_{i}, t) ∈ S_{i} × {ca, co}. In order to answer this question we construct the variable

Y(X_{i}, t) = Σ_{e ∈ P} Z_{e}(X_{i}, t).

The variable Y(X_{i}, t) can take the values {0, 1, 2, ..., N}. Therefore, it follows that Y(X_{i}, t) is the binomial random variable

Y(X_{i}, t) ∼ B(N, p_{t}(X_{i})),

where p_{t}(X_{i}) stands for p_{ca}(X_{i}) if t = ca and for p_{co}(X_{i}) if t = co.
Then the joint probability density function of the 6 variables

Y(AA_{i}, ca), Y(Aa_{i}, ca), Y(aa_{i}, ca), Y(AA_{i}, co), Y(Aa_{i}, co), Y(aa_{i}, co)

is:

P(Y(X_{i}, t) = a_{j}, j = 1, ..., 6) = [N!/(a_{1}! a_{2}! a_{3}! a_{4}! a_{5}! a_{6}!)] ∏_{j=1}^{6} p_{j}^{a_{j}},

where a_{1} + a_{2} + a_{3} + a_{4} + a_{5} + a_{6} = N and p_{1}, ..., p_{6} are the corresponding symbol frequencies p_{t}(X_{i}). Consequently, the joint distribution of the 6 variables is a multinomial distribution.
The likelihood function of this multinomial distribution is:

L(p_{1}, ..., p_{6}) = [N!/(a_{1}! a_{2}! a_{3}! a_{4}! a_{5}! a_{6}!)] ∏_{j=1}^{6} p_{j}^{a_{j}},

where p_{1}, ..., p_{6} denote the symbol probabilities p_{t}(X_{i}) and a_{1}, ..., a_{6} the observed symbol counts n_{t}(X_{i}), with Σ_{j} p_{j} = 1. Also, since the multinomial coefficient does not depend on the parameters, it follows that the logarithm of this likelihood function remains, up to an additive constant, as

ln L(p_{1}, ..., p_{6}) = Σ_{j=1}^{6} a_{j} ln p_{j}.

In order to obtain the maximum likelihood estimators p̂_{ca}(X_{i}) and p̂_{co}(X_{i}) of p_{ca}(X_{i}) and p_{co}(X_{i}) respectively, for all X_{i} ∈ S_{i}, we maximize ln L subject to the constraint Σ_{j} p_{j} = 1, solving the equations

∂/∂p_{j} [ln L(p_{1}, ..., p_{6}) − μ(p_{1} + ··· + p_{6} − 1)] = 0, j = 1, ..., 6,

to get that

p̂_{ca}(X_{i}) = n_{ca}(X_{i})/N and p̂_{co}(X_{i}) = n_{co}(X_{i})/N.
Then, under the null H_{0}, we have that p_{t}(X_{i}) = p(X_{i}) N_{t}/N, and thus the restricted maximum likelihood estimators are

p̂_{t}(X_{i}) = (n(X_{i})/N)(N_{t}/N) for t ∈ {ca, co}.
Therefore the likelihood ratio statistic is (see for example [15]):

λ_{i}(Y) = sup_{p} L(p) / sup_{H_{0}} L(p),

and thus, under the null H_{0}, λ_{i}(Y) remains as:

λ_{i}(Y) = ∏_{X_{i} ∈ S_{i}} ∏_{t ∈ {ca, co}} [ (n_{t}(X_{i})/N) / ((n(X_{i})/N)(N_{t}/N)) ]^{n_{t}(X_{i})}.

On the other hand, GE_{i} = 2 ln(λ_{i}(Y)) asymptotically follows a chi-squared distribution with 2 degrees of freedom (see for instance [15]). Hence, taking logarithms and rearranging terms, we obtain that the estimator of GE_{i} is:

GE_{i} = 2N [h(C, C^{c}) + h(S_{i}) − h(S_{i}, ca) − h(S_{i}, co)].
Therefore we have proved the following theorem.
Theorem 1. Let SNP_{i} be a single nucleotide polymorphism. For a particular disease denote by N the number of individuals in the population, by N_{ca} the number of cases and by N_{co} the number of controls. Denote by h(C, C^{c}) the case-control entropy and by h(S_{i}), h(S_{i}, ca) and h(S_{i}, co) the symbolic entropy in the population, in cases and in controls respectively, as defined above. If the SNP_{i} is equally distributed in cases and in controls, then

GE_{i} = 2N [h(C, C^{c}) + h(S_{i}) − h(S_{i}, ca) − h(S_{i}, co)]

is asymptotically χ² distributed with 2 degrees of freedom.
Let α be a real number with 0 ≤ α ≤ 1, and let χ²_{2,α} be such that

P(χ²_{2} > χ²_{2,α}) = α,

where χ²_{2} denotes a chi-squared random variable with 2 degrees of freedom. Then the decision rule in the application of the GE_{i} test at a 100(1 − α)% confidence level is:

reject H_{0} if GE_{i} > χ²_{2,α}, and do not reject H_{0} otherwise.
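Putting the pieces together, the GE_{i} statistic and its decision rule can be sketched as follows (a Python illustration with made-up counts; numerically, GE_{i} coincides with the log-likelihood-ratio (G) statistic of the 2 × 3 case-control genotype table, and 5.991 is the 5% critical value of χ² with 2 df):

```python
from math import log

def entropy(freqs):
    """Shannon entropy (natural log); zero frequencies contribute nothing."""
    return -sum(p * log(p) for p in freqs if p > 0)

def ge_statistic(cases, controls):
    """GE_i = 2N [h(C,C^c) + h(S_i) - h(S_i,ca) - h(S_i,co)], computed from
    genotype counts; all frequencies are taken over the total N."""
    n_ca, n_co = sum(cases), sum(controls)
    n = n_ca + n_co
    h_ca = entropy([c / n for c in cases])
    h_co = entropy([c / n for c in controls])
    h_s = entropy([(a + b) / n for a, b in zip(cases, controls)])
    h_cc = entropy([n_ca / n, n_co / n])
    return 2 * n * (h_cc + h_s - h_ca - h_co)

# Decision rule at alpha = 0.05: reject H0 if GE_i exceeds the chi-square
# critical value with 2 degrees of freedom.
ge = ge_statistic([60, 120, 320], [40, 120, 340])
significant = ge > 5.991
```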
Furthermore, an entropy allelic test can be developed in a similar manner. More concretely, let us now define the set A_{i} = {A_{i}, a_{i}} formed by the two possible alleles of the SNP_{i}.

Let

n_{ca}(A_{i}) = 2 n_{ca}(AA_{i}) + n_{ca}(Aa_{i}) and n_{co}(A_{i}) = 2 n_{co}(AA_{i}) + n_{co}(Aa_{i})

be the allele counts in cases and in controls (and analogously for the allele a_{i}), and, since each individual carries two alleles, define the allele frequencies

p_{ca}(A_{i}) = n_{ca}(A_{i})/2N and p_{co}(A_{i}) = n_{co}(A_{i})/2N.

Denote by p(A_{i}) = p_{ca}(A_{i}) + p_{co}(A_{i}) the total allele frequency. Then we can easily define the allele entropies of a SNP_{i} by

h(A_{i}) = − Σ_{X ∈ A_{i}} p(X) ln p(X),

h(A_{i}, ca) = − Σ_{X ∈ A_{i}} p_{ca}(X) ln p_{ca}(X)

and

h(A_{i}, co) = − Σ_{X ∈ A_{i}} p_{co}(X) ln p_{co}(X).
Now, with this notation and following all the steps of the proof of Theorem 1, we get the following result.
Theorem 2. Let A_{i} = {A_{i}, a_{i}} be the alleles forming a single nucleotide polymorphism SNP_{i}. For a particular disease denote by N the number of individuals in the population, by N_{ca} the number of cases and by N_{co} the number of controls. Denote by h(C, C^{c}) the case-control entropy and by h(A_{i}), h(A_{i}, ca) and h(A_{i}, co) the allele entropy in the population, in cases and in controls respectively. If the allele A_{i} is equally distributed in cases and in controls, then

AL_{i} = 4N [h(C, C^{c}) + h(A_{i}) − h(A_{i}, ca) − h(A_{i}, co)]

is asymptotically χ² distributed with 1 degree of freedom (the factor 4N = 2 · 2N arises because the number of observed alleles is 2N).
Consistency of the entropy test
Next we prove that the GE_{i} test is consistent against a wide variety of alternatives to the null. This is a valuable property, since the test will asymptotically reject the hypothesis that the SNP_{i} is equally distributed between cases and controls whenever this assumption is not true. The proof of the following theorem can be found in the Appendix. Since the proof is similar for both statistics, we only prove it for GE_{i}.
Theorem 3. Let SNP_{i} be a single nucleotide polymorphism. If the SNP_{i} is not equally distributed in cases and in controls, then

lim_{N→∞} P(GE_{i} > C) = 1

for every real number 0 < C < ∞.
Since Theorem 3 implies that GE_{i} → +∞ with probability approaching 1 whenever the SNP_{i} is not equally distributed in cases and in controls, upper-tailed critical values are appropriate.
Authors' contributions
MRM, MMG, and JAG conceived and designed the novel statistical test. MRM, AR, AGP and JG developed the analysis tool. JLSG and ARA implemented the software. MRM, AGP, AR and JG acquired and generated the datasets, analyzed the data and interpreted the results. MRM, AGP and JG wrote the paper. All authors read and approved the final manuscript.
Appendix: Proof of consistency
Proof of Theorem 3. First notice that the estimators ĥ(S_{i}, ca), ĥ(S_{i}, co) and ĥ(S_{i}) of h(S_{i}, ca), h(S_{i}, co) and h(S_{i}) respectively are consistent, because the observed symbol frequencies converge to the corresponding probabilities as N grows. Denote H_{i} = h(C, C^{c}) + h(S_{i}) − h(S_{i}, ca) − h(S_{i}, co) and notice that H_{i} can be written as

H_{i} = Σ_{X_{i} ∈ S_{i}} Σ_{t ∈ {ca, co}} p_{t}(X_{i}) ln [ p_{t}(X_{i}) / (p(X_{i}) N_{t}/N) ].

Hence, since ln(x) > 1 − 1/x for all x ≠ 1, we get that

H_{i} > 0

whenever

p_{t}(X_{i}) ≠ p(X_{i}) N_{t}/N

for some X_{i} ∈ S_{i} and t ∈ {ca, co}.

On the other hand, H_{0} is equivalent to

p_{t}(X_{i}) = p(X_{i}) N_{t}/N for all X_{i} ∈ S_{i} and t ∈ {ca, co}.

Therefore, under the alternative H_{1} the condition above is always satisfied, and hence H_{i} > 0.

Let 0 < C < ∞ be a real number. Since the estimator Ĥ_{i} of H_{i} is consistent, H_{i} > 0, and GE_{i} = 2N Ĥ_{i}, we have that P(Ĥ_{i} > H_{i}/2) → 1 as N → ∞. Moreover, for N large enough, N H_{i} > C, and hence

P(GE_{i} > C) ≥ P(2N Ĥ_{i} > N H_{i}) = P(Ĥ_{i} > H_{i}/2) → 1.

Therefore

lim_{N→∞} P(GE_{i} > C) = 1,

as desired.
Acknowledgements
This study used data from the SNP Resource at the NINDS Human Genetics Resource Center DNA and Cell Line Repository http://ccr.coriell.org/ninds/. We thank the participants and the submitters for depositing samples at the repository.
Funding: This work was supported in part by Agencia IDEA, Consejería de Innovación, Ciencia y Empresa (830882); Corporación Tecnológica de Andalucía (07/124); Ministerio de Educación y Ciencia (PCTA415027902007 and PCT010000200718); Ministerio de Ciencia e Innovación and FEDER (Fondo Europeo de Desarrollo Regional), grants MTM200803679 and MTM200907373; Programa de Ayudas Torres Quevedo del Ministerio de Ciencia e Innovación (PTQ20020206, PTQ20030549, PTQ20030546, PTQ20030782, PTQ20030783, PTQ20040838, PTQ0410006, PTQ0430718, PTQ0610002); and Farmaindustria.
References
McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, Hirschhorn JN: Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 2008, 9(5):356-69.

Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH: Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science 2004, 305(5685):869-72.

Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song XZ, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM: The complete genome of an individual by massively parallel DNA sequencing. Nature 2008, 452(7189):872-6.

Dawy Z, Goebel B, Hagenauer J, Andreoli C, Meitinger T, Mueller JC: Gene mapping and marker clustering using Shannon's mutual information. IEEE/ACM Trans Comput Biol Bioinform 2006, 3(1):47-56.

Li YM, Xiang Y, Sun ZQ: An entropy-based measure for QTL mapping using extreme samples of population. Hum Hered 2008, 65(3):121-8.

Cui Y, Kang G, Sun K, Qian M, Romero R, Fu W: Gene-centric genome-wide association study via entropy. Genetics 2008, 179:637-650.

Zhao J, Boerwinkle E, Xiong M: An entropy-based statistic for genome-wide association studies. Am J Hum Genet 2005, 77:27-40.

Dong C, Chu X, Wang Y, Jin L, Shi T, Huang W, Li Y: Exploration of gene-gene interaction effects using entropy-based methods. Eur J Hum Genet 2008, 16:229-235.

Kang G, Yue W, Zhang J, Cui Y, Zuo Y, Zhang D: An entropy-based approach for testing genetic epistasis underlying complex diseases. J Theor Biol 2008, 250(2):362-74.

Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC: A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol 2006, 241(2):252-61.

Moore JH: Bases, bits and disease: a mathematical theory of human genetics. Eur J Hum Genet 2008, 16(2):143-4.

Zhao J, Boerwinkle E, Xiong M: An entropy-based genome-wide transmission/disequilibrium test. Hum Genet 2007, 121(3-4):357-367.

Fung HC, Scholz S, Matarin M, Simón-Sánchez J, Hernández D, Britton A, Gibbs JR, Langefeld C, Stiegert ML, Schymick J, Okun MS, Mandel RJ, Fernández HH, Foote KD, Rodríguez RL, Peckham E, De Vrieze FW, Gwinn-Hardy K, Hardy JA, Singleton A: Genome-wide genotyping in Parkinson's disease and neurologically normal controls: first stage analysis and public release of data. Lancet Neurol 2006, 5(11):911-916.

Lehmann EL: Multivariate linear hypotheses. In Testing Statistical Hypotheses. 2nd edition. John Wiley & Sons, New York; 1986.