Abstract
Background
One of the challenges of the analysis of poolingbased genome wide association studies is to identify authentic associations among potentially thousands of false positive associations.
Results
We present a hierarchical and modular approach to the analysis of genome wide genotype data that incorporates quality control, linkage disequilibrium, physical distance and gene ontology to identify authentic associations among those found by statistical association tests. The method is developed for the allelic association analysis of pooled DNA samples, but it can be easily generalized to the analysis of individually genotyped samples. We evaluate the approach using data sets from diverse genome wide association studies including fetal hemoglobin levels in sickle cell anemia and a sample of centenarians and show that the approach is highly reproducible and allows for discovery at different levels of synthesis.
Conclusion
Results from the integration of Bayesian tests and other machine learning techniques with linkage disequilibrium data suggest that we do not need to use too stringent thresholds to reduce the number of false positive associations. This method yields increased power even with relatively small samples. In fact, our evaluation shows that the method can reach almost 70% sensitivity with samples of only 100 subjects.
Background
The availability of genotyping assays for hundreds of thousands of single nucleotide polymorphisms (SNP)s is making genome wide association (GWA) studies more accessible to a broad range of genotypephenotype investigations. The promise of this technology is that it will accelerate gene discovery for polygenic diseases and complex phenotypes of Mendelian disorders because data for all genes can be obtained simultaneously [1,2]. At the same time, the large number of significance tests performed is expected to result in a large number of false positive association signals. In fact, the number of signals observed by chance may well be greater than those that are authentic [3]. Thus, the development of analytic methods and strategies to distinguish authentic signals from those due to chance will contribute significantly to diseasegene association studies.
Here we describe a modular procedure to analyze data from poolingbased GWA studies that use the Illumina SNP microarray technology [4]. Rather than genotyping individual samples, the poolingbased technology types a carefully constructed pool of DNA samples that can be used to infer allele frequencies and is an affordable alternative to GWA studies that are still a financial burden for many investigators. Several studies have shown the usefulness of poolingbased GWA studies to discover SNPs associated with disease [59] using well calibrated methods [7,1012], and a variety of methods to estimate allele frequencies from pooledbased studies that use the Affymetrix microarray technology have been proposed [13,14]. Our objective is twofolds: (i) we wish to assess reproducibility and accuracy of the algorithm proposed by Illumina to detect chromosomal aberrations when used to estimate allele frequencies from pooled DNA samples [15]; and (ii) we propose a modular approach to the analysis of poolingbased GWA studies that limits the loss of power due to both the use of pools of DNA samples and the issue of multiple comparisons.
Several studies apply stringent thresholds on the significance level that is required to determine significant SNPphenotype associations [1618]. Contrary to this approach, our method integrates Bayesian tests for general associations [19] with decision rules based on the structure of linkage disequilibrium (LD) discovered through the International HapMap project [20], and other machine learning techniques to reduce the number of false positive associations. We also describe a hierarchical procedure to summarize the findings in terms of genes that can be further synthesized into gene sets using Gene Ontology annotations [21], pathways [22,23], or chromosomal bands. We evaluate this method using data from the sixty unrelated CEPH parents used for the International HapMap project [20] and two independent datasets. The first is a study of fetal hemoglobin (HbF) levels in African American subjects with sickle cell anemia and the objective is to discover genetic modulators of HbF. The second dataset is a study of exceptional longevity in a cohort of centenarians. In both datasets, using our novel analytic approach, we identified association signals in genes previously known to affect these phenotypes. The method is implemented in the R package and can be integrated with other R packages for genetic analysis, or GWA studies [24,25]. We develop the method for the analysis of pooled DNA samples [26,27], but the approach can be easily extended to the analysis of samples that are individually genotyped.
Results
We ran three sets of experiments to assess the reproducibility and accuracy of the estimates of the allele frequencies derived from pooled DNA samples, as well as the sensitivity and specificity of our modular procedure.
Experiment 1: accuracy and reproducibility
We obtained DNA samples from the 60 unrelated parents used in the HapMap CEU panel and created 2 duplicated pools of 30, 45 and 60 samples each (Table 1 provides a summary). The pooled DNA samples were analyzed in duplicates with the Illumina Sentrix HumanHap300 Genotyping BeadChip (v.1) and ballele frequencies were estimated using the Illumina LOH and Copy Number analysis tool. The reproducibility was assessed by the agreement between allele frequency estimates in the two replicate samples for each pool (Table 1). Shown in Figure 1 is the scatter plot of two independent replicates of allele frequency estimates for the 22842 SNPs tagging chromosome 1 (top), and the 5452 SNPs tagging chromosome 22 (bottom) obtained with pools of 60 samples. The plots show a high degree of agreement that is confirmed for different sample sizes as shown by the results summarized in Table 1. Plots for other chromosomes are in the supplementary material [28].
Table 1. Summary of the results of Experiment 1.
Figure 1. The reproducibility of the allele frequency estimates is shown by the scatter plot of repeated estimates of allele frequency inferred from pooled DNA samples (left). The labels "run 1" and "run 2" in the x and yaxis specify each replication. The accuracy of the allele frequency estimates is shown by the scatter plot of the estimates of allele frequency inferred from pooled DNA samples (yaxis in the right plots) and those computed from individually genotyped samples (xaxis). The analysis of the other chromosomes shows similar results.
We assessed the accuracy of the allele frequency estimates from pooled DNA samples by comparing the average estimates over the replicated pools with the allele frequencies computed using individually genotyped DNA samples that are available from the web site of the HapMap project [29]. A scatter plot of part of the results is displayed in Figure 1 for pools of 60 samples. The error analysis summarized in Table 1 suggests that, on average, the allele frequency based on the analysis of replicated pooled DNA samples differ from those based on individually genotyped data by approximately ± 0.04 but the error can be as large as ~0.12 = 0.04+2×0.06/√2 thus making differences in allele frequencies smaller than 0.24 difficult to detect because of technical errors. However, our analysis shows that less than 5% of the estimates based on pools of DNA samples differ from those based on individually genotyped samples by more than 0.12, and less than 10% differ by more than 0.08. This suggests reducing the minimum detectable allele frequency difference to 0.15 with a 10% chance of error. Furthermore, we have observed that amplifying DNA does not appear to affect either the reproducibility or the accuracy of the analysis.
To infer the effective sample size to be used in the analysis, we also looked at the distribution of the ratio between the two types of allele frequency estimates: say p(Si) = ni/n and q(Si) where ni is the frequency of the minor allele of the SNP Si computed from the samples that were typed individually, n is the overall sample size, and q(Si) is the frequency of the same minor allele computed from the analysis of the pooled DNA samples, in the different sets. The analysis demonstrated that log(q(Si)/p(Si)) has approximately a normal distribution with 0 mean and standard deviation 0.35. From this data, we deduced that about 95% of allele frequency estimates derived from the pooled DNA samples can be assumed to be within the interval p(Si) exp(± 1.95 × 0.35) from which we derive the empirical relation between p(Si) and q(Si): 0.51 p(Si) < q(Si) < 1.98 p(Si) with a range of uncertainty of 1.47 ni/n. The inequality suggests that when we infer allele frequency from pooled DNA samples, we have a loss of precision approximately equivalent to using 2/3 (= 1/1.47) of the original DNA sample size. We call this the "effective sample size" used in the calculation of the Bayesian test of association.
Experiment 2: specificity
To estimate the false positive rate (FPR) we used real data from pools of DNA samples to create artificial sets of pools. The original pools are described in Table 2 and were generated in duplicates to discover genetic variants associated with exceptional longevity [30], and fetal hemoglobin expression in subjects with sickle cell anemia [31]. The Illumina Sentrix HumanHap300 Genotyping BeadChip was used for all the experiments. We created the artificial sets of pools by mixing replicates of different pool sets. For example, we generated a set of two pools by taking one replicate of the pooled DNA samples from the female centenarians and one replicate of the pooled DNA samples from the younger female controls, and we constructed a second set of two pools by taking the remaining replicates from the two sets (See Figure 2 for an example). Because the two artificial sets of pools are homogeneous relative to the phenotype, the differences in allele frequencies between the two sets can be attributed to chance, and the SNPs with significant differences in allele distribution are false positives. We repeated this analysis by mixing different types of pools of DNA samples and using a BF>3, together with the LD and regional filters, we observed a false positive rate ranging between 0.001 and 4×104 with a mean of 0.001, and an average of 300 SNPs selected by chance. Note that this number is substantially smaller than the number of false positive associations that we would expect by chance using a BF>3. This threshold is equivalent to accepting an association when the posterior probability of the association is greater than 0.75, so that we expect 1 in 4 associations to be false. Also the specificity of the selected and significant genes was very high: The number of genes that by chance were selected in two unrelated analyses was 9 and this number was further reduced to 7 when we limited attention to significant genes. These numbers should provide a reference when we examine the reproducibility of findings in different studies, because we expect that, by chance alone, we would have an agreement in about 0.1% of findings. We note that long genes that are tagged by a larger number of SNPs are more likely to be selected by chance in different studies. Figure 3 displays the log10 BF in the 1,114 false positive associations generated in approximately the 106 association tests. The plot shows an exponential decay of the BF so that the chance to observe a very large BF has an exponential decay, and the probability of observing a BF greater than 10 by chance is 6 × 104, whereas the probability of observing a BF greater than 100 is 3 × 104, and greater than 1000 is 2 × 104. This analysis however shows that trying to reduce the false positive rate by imposing a stringent threshold on the BF would likely reduce the power of relatively small association studies and require unrealistically large sample sizes.
Table 2. Summary of the pools of DNA samples that were used for the validation of the analytical method. Each pool was done in duplicates.
Figure 2. Example of the artificial pool sets that we created to assess the specificity of the procedure. As an example, the top four pools were generated to compare the genome of centenarians (pools 1 and 2) with that of younger controls (pools 3 and 4). The two artificial pool sets are obtained by mixing pools of centenarians DNA with those of controls.
Figure 3. Distribution of the log_{10 }Bayes factor in 1,114 false positive associations generated in approximately 10^{6 }association tests with an estimated false positive rate 5 10^{6}. The analysis shows that the chance to observe a very large Bayes factor has an exponential decay, and the probability of observing a Bayes factor greater than 10 by chance is 6 10^{4}, the probability of observing a Bayes factor greater than 100 is 3 10^{4}, greater than 1000 is 2 10^{4}.
We also run experiments to assess the effect of LD and regional filters on the specificity. Using the same simulated sets, we run the analysis by using only a BF>3 to select the significant SNPs, and also examined the effect of adding either the LD filter or the regional filter or both on the false positive rate. Our results suggest that the LD filter reduces the false positive rate by 43%, while the regional filter alone increases the false positive rate by approximately 25%, and both filters decrease the false positive rate by 20%. These results are consistent with the intuition that the regional filter increases the power by finding clusters of SNPs that individually have small effects and would be disregarded by a oneSNPatatime analysis. However, the effect is to slightly increase the false positive rate. This conjecture is confirmed in the next experiments that we conducted to assess the sensitivity.
Experiment 3: sensitivity
In related work we are analyzing pools of DNA samples as a screening tool to discover genetic variants associated with exceptional longevity [30], and fetal hemoglobin (HbF) expression in subjects with sickle cell anemia [31]. As an indication of the sensitivity of technology and analytic method, we searched for SNPs in the Illumina Sentrix HumanHap300 Genotyping BeadChip that have been reported associated with either trait in independent studies, and verified whether an association was found based on the pooled DNA samples.
HbF experiment
We created two pools using DNA samples from 55 patients in the top and 54 patients in the bottom quartile of HbF concentrations. These patients were part of a clinical trial described in [32]. The pools were run in duplicates, and the data analyzed using the method proposed here. We searched the literature and found 36 SNPs with rs numbers that were reported associated with different levels of HbF [31,3335]. Thirteen of these SNPs are in the Illumina array, and only 3 of these were found associated in our analysis with a BF greater than 1, and 2 with a BF greater than 3. The moderate effect of the other 10 SNPs (odds ratios between 0.55 and 1.76) is consistent with the weak associations reported by other investigators and would not be detectable with our sample size of about 60 subjects per group. In fact, a sample size of 60 subjects would give at most 30% power to detect an odds ratio of 1.75 when the MAF in one group is 0.5. We also found 23 SNPs in the Illumina array that are within 150 kb of the other 23 reported SNPs, and are associated with HbF levels with a BF greater than 1, and 13 of these had a BF greater than 3. Fifteen of these SNPs were typed as part of the HapMap project and ten of these are in strong LD (Bayes D' > 0.8)[36]. Thus the analysis based on pooled DNA samples discovered association of 26 SNPs, and for 13 of them the association was strong. This analysis suggests a sensitivity of 72% and if we limit attention to associations supported by a BF of at least 3 the sensitivity is 36%. The details of the associations are in Table 3.
Table 3. List of SNPs that are known to be associated with different levels of HbF and results of the analysis based on pooled DNA samples.
Longevity experiment
We created pools of DNA samples from unrelated centenarians and younger controls. Because there is evidence of gender effect [37]  85% of centenarians are female we created distinct pools for males and females as summarized in Table 2. We searched the literature and found 36 SNPs with rs numbers reported as associated with longevity [3840]. Seven of these 36 SNPs are in the Illumina array, and five of these seven were found associated in our analysis. For 21 of the remaining 29, we found SNPs within 100 kb that were associated with longevity in either the males and female comparisons, or both. These SNPs are reported in Table 4. The analysis suggests approximately 67% sensitivity, and this is consistent with the sensitivity estimated with the HbF experiment. We also noted that the regional filter helped identify some of the associations that would be lost with a tight threshold on the BF. As an example, the SNP rs2227956 on HSPA1A was found associated with longevity in males only when the regional filter is used, and the two SNPs in WRN a well known longevity gene in mice [41] – were selected by the regional filter. A similar form of sensitivity analysis is to see whether the GSEA analysis can lead to discover sets of functionally similar genes that are known to be associated with longevity. GSEA analysis of the centenarian cohort revealed several enriched GO biological categories (Table 4 and manuscript in preparation). Among the significantly enriched categories were genes associated with immune response (e.g., CSF3) and DNA repair (e.g., XRCC4), see Table 4. Intriguingly, CSF3 (also known as GCSF) is reported to influence migration of stem cells between the bone marrow and blood [42,43] and appears to promote regeneration of myocardial tissue [4446], which has clear relevance to longevity. The gene XRCC4 has a well established role in DNA repair [47], and unrepaired DNA has been reported to accelerate ageing, possibly through dysregulating the IGF/growth axis [48]. Therefore, a comprehensive analysis of these and other genes present with the enriched gene sets that we detected will be essential to fully appreciate pathways engaged that contribute to the longevity phenotype.
Table 4. List of SNPs that were found associated with exceptional longevity in different studies, and results based on the analysis of pooled DNA samples.
Discussion and Conclusion
We have developed a hierarchical and modular approach to the analysis of genome wide genotype data based on pooled DNA samples. The method incorporates quality control data, information about linkage disequilibrium, Bayesian association tests, physical distance and gene ontology to identify associations warranting further investigation. Our evaluation using real data has shown the accuracy, reproducibility, sensitivity and specificity of the method.
Compared to other approaches, the integration of Bayesian tests with information about linkage disequilibrium and other machine learning techniques implies that we do not need to use too stringent thresholds to reduce the number of false positive associations. The implication of this fact is an increased power even with relatively small samples. In fact, our estimate of the sensitivity shows that the method can reach almost 70% sensitivity with samples of only 100 subjects.
Although we developed the approach to analyze pooled DNA samples, the method can also be used for the analysis of individually genotyped samples.
Methods
Genotyping
For the HbF study, DNA was obtained from the 60 subjects with HbF levels below the first quartile of the distribution, and the 60 subjects with HbF levels above the third quartile who were enrolled in the Multicenter Study of Hydroxyurea (MSH) study in Sickle Cell Anemia [49]. DNA samples from 260 centenarians and a control group of 230 subjects were obtained from the New England Centenarian Study: a crosssectional study of individuals aged 97 and older conducted at the Boston Medical Center. CEPH DNA samples of the sixty unrelated parents used for the International HapMap project [20] were obtained from the Coriell Institute, Camden, NY and used to compare the accuracy and reproducibility of the estimates of allele frequency in pooled DNA samples compared to individually genotyped samples. For DNA pool construction and to ensure that each individual contributed equally to the pool, we first measured DNA stock solutions using a fluorimetric method (RNAseP) against a standard curve constructed from known concentrations of human genomic DNA. We then diluted the stock solutions to 10 ng/ul and measured the concentrations of these working solutions by means of PicoGreen. In the case of samples for which the CV of the three measurements was greater than 10%, quantification was repeated in triplicate until the CV was smaller than 10. Measurements were highly reproducible, with a correlation coefficient of 0.97 between the third measurement and the average of the first two. Based on these concentrations, 50 ng of DNA were added to the pool for each individual. The pools of DNA were analyzed on the Sentrix HumanHap300 bead chip (Illumina) according to the manufacture's protocol. The data used in the HbF and longevity studies will be released with companion publications. We make available the data derived from pools of CEPH DNA samples from the supplementary web site [28]. The HbF and longevity studies were approved by the Institutional Review Boards of Boston University.
Association test
The overall analytic strategy is shown in Figure 4. The first module is a statistical procedure to test the allelic association between each individual SNP and the phenotype. The input data are the allele relative frequencies estimated with the "ballele frequency" value provided by the Illumina Beadstudio genotype module. This value represents the relative proportion of each allele in the DNA sample and is used by the Illumina loss of heterozygosity (LOH) and Copy Number analysis tool [15] to detect chromosomal aberrations and copy numbers by comparing the normalized intensity of the test sample (the pooled DNA samples) to a reference sample. We use the estimate of the allele frequencies θ_{ij }in test and control pools to reconstruct the expected allele frequencies as where is the effective sample size in pool i, the index i = 1 for cases and i = 2 for controls, and the index j = A,B denotes the A or B allele. We then use a Bayesian test of association to compare the distributions of allele frequency in the two different pools. The test is described in [19,50] and assumes that prior probabilities are available for the model of allelic association and the model of no association – say p(M_{a}), p(M_{i}) – and then uses the data to update these prior probabilities p(M_{a}), p(M_{i}) into the posterior probabilities by using Bayes' theorem. The decision rule is then to select the model of association if its posterior probability is at least 3 times larger than the posterior probability of the model of no association (as suggested in reference [51]). Formally, the ratio of the posterior probabilities is
and the ratio of the marginal likelihood functions p(n_{ij } M_{a})/p(n_{ij } M_{i}) is known as the Bayes factor (BF). When the prior probabilities of the two models are equal, the BF is equivalent to the posterior odds. Assuming the conjugate Beta distribution for the allele frequencies, the BF can be calculated in closed form and the formula for this calculation is reported in the appendix. To take into account the issue of multiple testing, we can use prior information about the number of SNPs that we expected to be associated to make the selection stronger. For example, if we expect 1,500 SNPs associated with the phenotype, then the prior odds for the alternative hypothesis of association are 0.005/0.995 when we test 300,000 SNPs, and the decision rule becomes to accept that a SNP is associated with the phenotype if the posterior odds for the association are at least 3 × 0.995/0.005 = 597. Initial experiments described in the Evaluation section suggest that a robust choice for an effective sample size is 2/3 of the original pool, and this is consistent with a larger sample size needed with the analysis of pooled DNA samples [52]. We note here that one advantage of this modular procedure is that the Bayesian test can be replaced by a standard χ^{2 }test for allelic association.
Figure 4. Schematic summary of the modular approach to the analysis of GWA data.
Filtering out false positives
Although we can take into account the issue of multiple comparisons by choosing appropriate prior odds for an association, the consequence of this approach is to reduce power and to require large sample sizes to detect associations with a small effect. This consequence can be problematic in studies where cases are relatively rare such as the study of exceptional longevity in which cases are subjects who lives 100 years and older. To fully exploit the power of small scale studies we developed a series of data filters that remove unreliable or suspicious associations (see Figure 4). The first filter is specific for allele association analysis using pooled DNA samples and accounts for the lack of precision of the technology. The other two filters take into account redundancy as well as reciprocal information of SNPs based on the LD structure of the human genome. So, rather than using "SNP pruning" as in PLINK to remove SNPs that are in LD [53], we leverage on dependencies determined by LD to improve the detection of false positive while reducing the false negative rate.
Quality control (QC) filter
The function of this filter is based on an extensive evaluation of the accuracy and reproducibility of the allele frequency estimates that are computed with the Illumina software. Allele frequencies obtained from genotyping of pooled DNA samples were compared with those derived from genotyping of individual samples (detailed in the below Evaluation section). The results suggest that alleles with minor allele frequency MAF <0.15 as well as differences in allele frequency of less than 0.15 are not reliable. We therefore filter out all SNPs with these characteristics, as well as those SNPs for which repeated estimates of allele frequencies in replications of the same pool differ by more than 0.15.
Linkage disequilibrium (LD) filter
SNPs in LD with each other would be expected to show similar patterns of association if the signal is authentic whereas a single SNP in a LD block showing association is more likely to represent a spurious association. Therefore, our procedure automatically checks this condition and disregards the associations for SNPs that are not supported by positive associations with other SNPs in the same LD block. To this end, we used genotype data collected within the HapMap project to compute pairwise measures of LD for all consecutive pairs of SNPs in the HumanHap300 platform. The estimation of LD was based on a novel Bayesian version of D' that we introduced in [36]. As the traditional D', our Bayesian estimator is defined in the interval [0;1] regardless of the allele frequency so that it is easier to interpret than other measures of correlation like r2 but it is much less biased toward disequilibrium. We use a Bayesian D' > 0.7 between pairs of consecutive SNPs as suggestive of strong LD and we filter out all the associations of the SNPs whose adjacent SNPs that are in strong LD are not associated with the phenotype. The value 0.7 was chosen based on experiments reported in [36] showing that the Bayesian D' rarely exceeds 0.7 under no LD. The Bayesian D' values for each pair of consecutive SNPs were built for Caucasians using the DNA samples from unrelated parents of thirty trios of the CEPH (Utah residents with ancestry from northern and western Europe, also known as CEU) and similarly for Africans, using Yoruba in Ibadan Nigeria. These data are available from the supplementary material web site [28].
Regional association filter
The rationale of this filter is that a region or gene showing authentic association would be expected to show a greater number of SNPs associated than would be expected by chance. In this filter, we analyze the data using a sliding window of 20 SNPs, and summarize the global measure of association within the window as the product of the posterior probabilities of associations of the 20 SNPs. Here, we assume that the association tests are independent, so that the product of the posterior probability of association of the individual SNPs becomes a measure of the global association of the region tagged by the 20 SNPs. Windows with a global measure of association exceeding 0.5 [or product of BFs >1] are then selected for further inspection. Although LD between SNPs in a window may introduce dependencies, the global measure of association does not seem to be affected by this approximation. Figure 5 shows some examples.
Figure 5. Relation between the pattern of LD (xaxis) and the global measure of association (yaxis) in the regional filter. The pattern of LD is measured by the average of the Bayes D' between consecutive SNPs in the region, and the global measure of association is the joint probability of association in the region. The two figures in the top half show the relation using data from the study of fetal hemoglobin in the sickle cell anemia subjects. The two figure in the bottom half show the relation using data from the longevity study. The different extent of LD reflect the fact that sickle cell anemia subjects are all African American while centenarians in the longevity study are all Caucasians The correlations in the four sets are 0.03, 0.18, 0.018, 0.10.
Hierarchical summary
The list of SNPs that are selected by the association test and the filters are labeled as "significant SNPs". The list is annotated by the SNP physical position, the position relative to known genes, the allele frequencies estimated in different populations, and the cytogenetic band. This information is collected through SNPPer [54] that integrates information from the UCSC human genome browser and dbSNP [55]. As a further level of summary we use those SNPs that are linked to genes to create a set of selected genes and a set of significant genes. The first set consists of genes that are tagged by at least one significant SNP. The set of significant genes is a subset of the selected genes and consists of genes in which the global measure of association given by the product of the posterior probability of association of the gene tagging SNPs is greater than 0.5 [Or equivalently, the product of BFs exceeds 1].
Ranking
We rank the significant genes by the global measure of association. To rank the selected genes we score them by two further measures that weigh the likelihood of selecting a gene by chance. In fact, there are genes that are tagged by a large number of SNPs: for example CSMD1 in chromosome 8 is tagged by 614 SNPs in the HumanHap300 array, and assuming a 5% false positive rate, we would expect about 30 SNPs to be selected from this gene by chance in any analysis. To take this issue into account, each selected gene is assigned 3 scores: the global measure of association, the ratio of selected SNPs relative to the number of tagging SNPs, and the probability of selecting this number by chance using the hypergeometric distribution. Each score determines a ranking and then the sum of the ranks is defined as a final ranking of selected genes.
Gene set enrichment analysis
To evaluate selected and/or significant genes for enrichment of biological categories associated with a variable phenotype, we implement a standalone version of the EASE statistical software [56]. This program computes a modified Fisher's exact probability score for observing the frequency of a biological category associated with a phenotype (e.g., dementia, sickle phenotype, infection), compared with the likelihood of identifying that category by chance given the total number of genes in the data set. An adjusted score is then reported representing the upper bound of the distribution of Jackknife Fisher exact probabilities for observing an enriched biological category. Enriched categories are then inspected for biological trends and overlapping or related categories, based on significance scores or categories with a pvalue << 0.05. For more detail, see Hosack et al. [56].
Authors' contributions
PS developed and implemented the analytic method, conceived the evaluation and drafted the manuscript. ZZ contributed to the implementation and evaluation. MAG, AR, SH, AS, developed the support material for the various filters and hierarchical summary. AD helped to design and conduct the experiments with the CEPH samples. MM contributed to the development of the hierarchical method and the interpretation of the analysis results. EM and CB carried out the poolingbased GW genotyping. DT, TTP, MS contributed to the development and evaluation of the method using the longevity study and the HbF study. All authors read and approved the final manuscript.
Appendix
Derivation of the Bayes Factor
This Bayesian association test assumes that the allele frequencies follow a binomial distribution with probabilities θ_{ij }the index i = 1 for cases and i = 2 for controls, and the index j = A,B denotes the A or B allele [See Table 5 for an example]. Under the hypothesis of general association, the parameters θ_{ij }describing the allele distributions in cases and controls follow different probability distributions, while the parameters θ_{ij }follow the same probability distribution under the hypothesis of no association. Therefore, the likelihood function under the hypothesis of general association M_{a }is
Table 5. parameters and allele frequencies from pooled DNA samples
While the likelihood function under the hypothesis of no association M_{i }is:
We assume independent Beta distributions to model the prior distributions of the parameters that are defined as
Where the hyperparameters are chosen as α_{1A }= α_{2A }= α_{A}/2 = α/4 and
β_{1B }= β_{2B }= β_{B}/2 = α/4. The parameter α is the overall prior precision and can be set based on prior information. The likelihood function and the prior distribution of the parameters are used to compute the marginal likelihood as the expected likelihood function, where the expectation is taken over the parameter distribution. Formally
p(M  n_{ij}) = ∫p(θ_{ij } n_{ij})p(θ_{ij})dθ_{ij}
Compared to the maximum likelihood that returns the likelihood function evaluated in the estimate of the parameters, the marginal likelihood incorporates the uncertainty about the parameters by averaging the likelihood functions for different parameter values. This conceptual difference is fundamental to understand the different approach to model selection: in the classical framework, model selection is based on the maximized likelihood and its sampling distribution to take into account sampling variability, for fixed parameter values. In the Bayesian framework, model selection is based on the marginal likelihood which takes into account the parameter variability, for fixed sample values. Therefore, no significance testing is performed when using this approach to model selection. Our experience with the Bayesian procedure to model selection is that it is usually more robust to false associations. The calculation of the marginal likelihood can be done in closed form and it is easy to show that
Where G is the gamma function and the ratio produces the Bayes factor.
Acknowledgements
Supported by NHLBI grants R21 HL080463 (PS); R01 HL68970 (MHS); K24, AG025727 (TP); K23 AG026754 (D.T.). We thank the anonymous reviewers and editors for their helpful suggestions.
References

Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls
Nature 2007, 447(7145):661678. PubMed Abstract  Publisher Full Text

Christensen K, Murray JC: What genomewide association studies can do for medicine.
N Engl J Med 2007, 356(11):10941097. PubMed Abstract  Publisher Full Text

de Bakker PI, Yelensky R, Pe'er I, Gabriel SB, Daly MJ, Altshuler D: Efficiency and power in genetic association studies.
Nat Genet 2005, 37(11):12171223. PubMed Abstract  Publisher Full Text

Fan JB, Chee MS, Gunderson KL: Highly parallel genomic assays.
Nat Rev Genet 2006, 7(8):632644. PubMed Abstract  Publisher Full Text

Craig DW, Huentelman MJ, HuLince D, Zismann VL, Kruer MC, Lee AM, Puffenberger EG, Pearson JM, Stephan DA: Identification of disease causing loci using an arraybased genotyping approach on pooled DNA.
BMC Genomics 2005, 6:138. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Melquist S, Craig DW, Huentelman MJ, Crook R, Pearson JV, Baker M, Zismann VL, Gass J, Adamson J, Szelinger S, Corneveaux J, Cannon A, Coon KD, Lincoln S, Adler C, Tuite P, Calne DB, Bigio EH, Uitti RJ, Wszolek ZK, Golbe LI, Caselli RJ, GraffRadford N, Litvan I, Farrer MJ, Dickson DW, Hutton M, Stephan DA: Identification of a novel risk locus for progressive supranuclear palsy by a pooled genomewide scan of 500,288 singlenucleotide polymorphisms.
Am J Hum Genet 2007, 80(4):769778. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hanson RL, Craig DW, Millis MP, Yeatts KA, Kobes S, Pearson JV, Lee AM, Knowler WC, Nelson RG, Wolford JK: Identification of PVT1 as a candidate gene for endstage renal disease in type 2 diabetes using a poolingbased genomewide single nucleotide polymorphism association study.
Diabetes 2007, 56(4):975983. PubMed Abstract  Publisher Full Text

Steer S, Abkevich V, Gutin A, Cordell HJ, Gendall KL, Merriman ME, Rodger RA, Rowley KA, Chapman P, Gow P, Harrison AA, Highton J, Jones PB, O'Donnell J, Stamp L, Fitzgerald L, Iliev D, Kouzmine A, Tran T, Skolnick MH, Timms KM, Lanchbury JS, Merriman TR: Genomic DNA pooling for wholegenome association scans in complex disease: empirical demonstration of efficacy in rheumatoid arthritis.
Genes Immun 2007, 8(1):5768. PubMed Abstract  Publisher Full Text

Meaburn EL, Harlaar N, Craig IW, Schalkwyk LC, Plomin R: Quantitative trait locus association scan of early reading disability and ability using pooled DNA and 100K SNP microarrays in a sample of 5760 children.
Mol Psychiatry 2007. PubMed Abstract  Publisher Full Text

Lavebratt C, Sengul S: Single nucleotide polymorphism (SNP) allele frequency estimation in DNA pools using Pyrosequencing.
Nat Protoc 2006, 1(6):25732582. PubMed Abstract  Publisher Full Text

Wilkening S, Chen B, Wirtenberger M, Burwinkel B, Forsti A, Hemminki K, Canzian F: Allelotyping of pooled DNA with 250 K SNP microarrays.
BMC Genomics 2007, 8:77. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Barratt BJ, Payne F, Rance HE, Nutland S, Todd JA, Clayton DG: Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design.
Ann Hum Genet 2002, 66(Pt 56):393405. PubMed Abstract  Publisher Full Text

Docherty SJ, Butcher LM, Schalkwyk LC, Plomin R: Applicability of DNA pools on 500 K SNP microarrays for costeffective initial screens in genomewide association studies.
BMC Genomics 2007, 8:214. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Meaburn E, Butcher LM, Schalkwyk LC, Plomin R: Genotyping pooled DNA using 100K SNP microarrays: a step towards genomewide association scans.
Nucleic Acids Res 2006, 34(4):e27. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lips EH, Dierssen JW, van Eijk R, Oosting J, Eilers PH, Tollenaar RA, de Graaf EJ, van't Slot R, Wijmenga C, Morreau H, van Wezel T: Reliable highthroughput genotyping and lossofheterozygosity detection in formalinfixed, paraffinembedded tumors using single nucleotide polymorphism arrays.
Cancer Res 2005, 65(22):1018810191. PubMed Abstract  Publisher Full Text

Benjamini Y, Hochberg Y: Controlling the false discovery rate  a practical and powerful approach to multiple testing.

Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response.
Proc Natl Acad Sci U S A 2001, 98(9):51165121. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Roeder K, Bacanu SA, Wasserman L, Devlin B: Using linkage genome scans to improve power of association in genome scans.
Am J Hum Genet 2006, 78(2):243252. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Balding DJ: A tutorial on statistical methods for population association studies.
Nat Rev Genet 2006, 7(10):781791. PubMed Abstract  Publisher Full Text

International HapMap Consortium: A haplotype map of the human genome.
Nature 2005, 437(7063):12991320. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

The Gene Ontology (GO) project in 2006
Nucleic Acids Res 2006, 34(Database issue):D3226. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes.
Nucleic Acids Res 2000, 28(1):2730. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kanehisa M, Goto S, Hattori M, AokiKinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG.
Nucleic Acids Res 2006, 34(Database issue):D3547. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Aulchenko YS, Ripke S, Isaacs A, van Duijn CM: GenABEL: an R library for genomewide association analysis.
Bioinformatics 2007, 23(10):12941296. PubMed Abstract  Publisher Full Text

Clayton D, Leung HT: An R package for analysis of wholegenome association studies.
Hum Hered 2007, 64(1):4551. PubMed Abstract  Publisher Full Text

Elston RC, Lin DY, Zheng G: Multistage Sampling for Genetic Studies.
Annu Rev Genomics Hum Genet 2007. PubMed Abstract  Publisher Full Text

Sham P, Bader JS, Craig I, O'Donovan M, Owen M: DNA Pooling: a tool for largescale association studies.
Nat Rev Genet 2002, 3(11):862871. PubMed Abstract  Publisher Full Text

Supplementary material [http:/ / www.bu.edu/ sicklecell/ downloads/ Projects/ web_supplement_pooling/ index.html] webcite

Gibbs RA, Belmont JW, Hardenbol P, Willis TD, Yu F, Yang H, Ch'ang LY, Huang W, Liu B, Shen Y, Tam PK, Tsui LC, Waye MM, Wong JT, Zeng C, Zhang Q, Chee MS, Galver LM, Kruglyak S, Murray SS, Oliphant AR, Montpetit A, Hudson TJ, Chagnon F, Ferretti V, Leboeuf M, Phillips MS, Verner A, Kwok PY, Duan S, Lind DL, Miller RD, Rice JP, Saccone NL, TaillonMiller P, Xiao M, Nakamura Y, Sekine A, Sorimachi K, Tanaka T, Tanaka Y, Tsunoda T, Yoshino E, Bentley DR, Deloukas P, Hunt S, Powell D, Altshuler D, Gabriel SB, Zhang H, Matsuda I, Fukushima Y, Macer DR, Suda E, Rotimi CN, Adebamowo CA, Aniagwu T, Marshall PA, Matthew O, Nkwodimmah C, Royal CD, Leppert MF, Dixon M, Stein LD, Cunningham F, Kanani A, Thorisson GA, Chakravarti A, Chen PE, Cutler DJ, Kashuk CS, Donnelly P, Marchini J, McVean GA, Myers SR, Cardon LR, Abecasis GR, Morris A, Weir BS, Mullikin JC, Sherry ST, Feolo M, Daly MJ, Schaffner SF, Qiu R, Kent A, Dunston GM, Kato K, Niikawa N, Knoppers BM, Foster MW, Clayton EW, Wang VO, Watkin J, Sodergren E, Weinstock GM, Wilson RK, Fulton LL, Rogers J, Birren BW, Han H, Wang H, Godbout M, Wallenburg JC, L'Archeveque P, Bellemare G, Todani K, Fujita T, Tanaka S, Holden AL, Lai EH, Collins FS, Brooks LD, McEwen JE, Guyer MS, Jordan E, Peterson JL, Spiegel J, Sung LM, Zacharia LF, Kennedy K, Dunn MG, Seabrook R, Shillito M, Skene B, Stewart JG, Valle DL, Jorde LB, Cho MK, Duster T, Jasperse M, Licinio J, Long JC, Ossorio PN, Spallone P, Terry SF, Lander ES, Nickerson DA, Boehnke M, Douglas JA, Hudson RR, Kruglyak L, Nussbaum RL: The International HapMap Project.
Nature 2003, 426(6968):789796. PubMed Abstract  Publisher Full Text

Perls T, Terry D: Understanding the determinants of exceptional longevity.
Ann Intern Med 2003, 139(5 Pt 2):445449. PubMed Abstract  Publisher Full Text

Steinberg MH: Predicting clinical severity in sickle cell anaemia.
Br J Haematol 2005, 129(4):465481. PubMed Abstract  Publisher Full Text

Steinberg MH, Barton F, Castro O, Pegelow CH, Ballas SK, Kutlar A, Orringer E, Bellevue R, Olivieri N, Eckman J, Varma M, Ramirez G, Adler B, Smith W, Carlos T, Ataga K, DeCastro L, Bigelow C, Saunthararajah Y, Telfer M, Vichinsky E, Claster S, Shurin S, Bridges K, Waclawiw M, Bonds D, Terrin M: Effect of hydroxyurea on mortality and morbidity in adult sickle cell anemia: risks and benefits up to 9 years of treatment.
Jama 2003, 289(13):16451651. PubMed Abstract  Publisher Full Text

Ma Q, Baldwin CT, Safaya S, Kutlar A, Farrer LA, Steinberg MH: Fetal hemoglobin in sickle cell anemia: association with single nucleotide polymorphisms in TOX (8q12).

Wyszynski DF, Baldwin CT, Cleves MA, Amirault Y, Nolan VG, Farrell JJ, Bisbee A, Kutlar A, Farrer LA, Steinberg MH: Polymorphisms near a chromosome 6q QTL area are associated with modulation of fetal hemoglobin levels in sickle cell anemia.
Cell Mol Biol (Noisylegrand) 2004, 50(1):2333. PubMed Abstract

Sebastiani P, Nolan VG, Baldwin CT, AbadGrau MM, Wang L, Adewoye AH, McMahon LC, Farrer LA, Taylor JG, Kato GJ, Gladwin MT, Steinberg MH: A network model to predict the risk of death in sickle cell disease.
Blood 2007, 110(7):27272735. PubMed Abstract  Publisher Full Text

Sebastiani P, AbadGrau MM: Bayesian estimates of linkage disequilibrium.
BMC Genet 2007, 8(1):36. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Arking R, Butler B, Chiko B, Fossel M, Gavrilov LA, Morley JE, Olshansky SJ, Perls T, Walker RF: Antiaging teleconference: what is antiaging medicine?
J Anti Aging Med 2003, 6(2):91106. PubMed Abstract  Publisher Full Text

Barzilai N, Atzmon G, Schechter C, Schaefer EJ, Cupples AL, Lipton R, Cheng S, Shuldiner AR: Unique lipoprotein phenotype and genotype associated with exceptional longevity.
Jama 2003, 290(15):20302040. PubMed Abstract  Publisher Full Text

Christensen K, Johnson TE, Vaupel JW: The quest for genetic determinants of human longevity: challenges and insights.
Nat Rev Genet 2006, 7(6):436448. PubMed Abstract  Publisher Full Text

Human aging genomic resources [http://genomics.senescence.info/genes/longevity.html] webcite

Lombard DB, Beard C, Johnson B, Marciniak RA, Dausman J, Bronson R, Buhlmann JE, Lipman R, Curry R, Sharpe A, Jaenisch R, Guarente L: Mutations in the WRN gene in mice accelerate mortality in a p53null background.
Mol Cell Biol 2000, 20(9):32863291. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Petit I, SzyperKravitz M, Nagler A, Lahav M, Peled A, Habler L, Ponomaryov T, Taichman RS, ArenzanaSeisdedos F, Fujii N, Sandbank J, Zipori D, Lapidot T: GCSF induces stem cell mobilization by decreasing bone marrow SDF1 and upregulating CXCR4.
Nat Immunol 2002, 3(7):687694. PubMed Abstract  Publisher Full Text

Levesque JP, Hendy J, Takamatsu Y, Simmons PJ, Bendall LJ: Disruption of the CXCR4/CXCL12 chemotactic interaction during hematopoietic stem cell mobilization induced by GCSF or cyclophosphamide.
J Clin Invest 2003, 111(2):187196. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Harada M, Qin Y, Takano H, Minamino T, Zou Y, Toko H, Ohtsuka M, Matsuura K, Sano M, Nishi J, Iwanaga K, Akazawa H, Kunieda T, Zhu W, Hasegawa H, Kunisada K, Nagai T, Nakaya H, YamauchiTakihara K, Komuro I: GCSF prevents cardiac remodeling after myocardial infarction by activating the JakStat pathway in cardiomyocytes.
Nat Med 2005, 11(3):305311. PubMed Abstract  Publisher Full Text

Leone AM, Rutella S, Bonanno G, Contemi AM, de Ritis DG, Giannico MB, Rebuzzi AG, Leone G, Crea F: Endogenous GCSF and CD34+ cell mobilization after acute myocardial infarction.
Int J Cardiol 2006, 111(2):202208. PubMed Abstract  Publisher Full Text

Capoccia BJ, Shepherd RM, Link DC: GCSF and AMD3100 mobilize monocytes into the blood that stimulate angiogenesis in vivo through a paracrine mechanism.
Blood 2006, 108(7):24382445. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gao Y, Ferguson DO, Xie W, Manis JP, Sekiguchi J, Frank KM, Chaudhuri J, Horner J, DePinho RA, Alt FW: Interplay of p53 and DNArepair protein XRCC4 in tumorigenesis, genomic stability and development.
Nature 2000, 404(6780):897900. PubMed Abstract  Publisher Full Text

Niedernhofer LJ, Garinis GA, Raams A, Lalai AS, Robinson AR, Appeldoorn E, Odijk H, Oostendorp R, Ahmad A, van Leeuwen W, Theil AF, Vermeulen W, van der Horst GT, Meinecke P, Kleijer WJ, Vijg J, Jaspers NG, Hoeijmakers JH: A new progeroid syndrome reveals that genotoxic stress suppresses the somatotroph axis.
Nature 2006, 444(7122):10381043. PubMed Abstract  Publisher Full Text

Steinberg MH, Lu ZH, Barton FB, Terrin ML, Charache S, Dover GJ: Fetal hemoglobin in sickle cell anemia: determinants of response to hydroxyurea. Multicenter Study of Hydroxyurea.
Blood 1997, 89(3):10781088. PubMed Abstract  Publisher Full Text

Sebastiani P, Ramoni MF, Nolan V, Baldwin CT, Steinberg MH: Genetic dissection and prognostic modeling of overt stroke in sickle cell anemia.
Nat Genet 2005, 37(4):435440. PubMed Abstract  Publisher Full Text

Kass RE, Raftery AE: Bayes factor.
J Am Statist Assoc 1995, 90:773795. Publisher Full Text

Zou G, Zhao H: The impacts of errors in individual genotyping and DNA pooling on association studies.
Genet Epidemiol 2004, 26(1):110. PubMed Abstract  Publisher Full Text

Purcell S, Neale B, ToddBrown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: A Tool Set for WholeGenome Association and PopulationBased Linkage Analyses.
Am J Hum Genet 2007, 81(3):559575. PubMed Abstract  Publisher Full Text

SNPPer [http://snpper.chip.org/] webcite

Riva A, Kohane IS: A SNPcentric database for the investigation of the human genome.
BMC Bioinformatics 2004, 5:33. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Hosack DA, Dennis G Jr., Sherman BT, Lane HC, Lempicki RA: Identifying biological themes within lists of genes with EASE.
Genome Biol 2003, 4(10):R70. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text