Abstract
Background
Genotyping technologies enable us to genotype multiple Single Nucleotide Polymorphisms (SNPs) within selected genes/regions, providing data for haplotype association analysis. While haplotypebased association analysis is powerful for detecting untyped causal alleles in linkagedisequilibrium (LD) with neighboring SNPs/haplotypes, the inclusion of extraneous SNPs could reduce its power by increasing the number of haplotypes with each additional SNP.
Methods
Here, we propose a haplotypebased stepwise procedure (HBSP) to eliminate extraneous SNPs. To evaluate its properties, we applied HBSP to both simulated and real data, generated from a study of genetic associations of the bactericidal/permeabilityincreasing (BPI) gene with pulmonary function in a cohort of patients following bone marrow transplantation.
Results
Under the null hypothesis, use of the HBSP gave results that retained the desired false positive error rates when multiple comparisons were considered. Under various alternative hypotheses, HBSP had adequate power to detect modest genetic associations in casecontrol studies with 500, 1,000 or 2,000 subjects. In the current application, HBSP led to the identification of two specific SNPs with a positive validation.
Conclusion
These results demonstrate that HBSP retains the essence of haplotypebased association analysis while improving analytic power by excluding extraneous SNPs. Minimizing the number of SNPs also enables simpler interpretation and more costeffective applications.
Background
Genotyping technology now enables population researchers to genotype dozens to thousands of SNPs within any selected candidate gene or within any genomic region. Such SNP data are increasingly collected in disease association studies, using a casecontrol study design [1,2], with the analytic objective of assessing association between SNP genotypes and a disease phenotype of interest. While traditional analyses have involved correlating phenotypes with individual SNP genotypes [3], a complementary approach involves inferring haplotypes of SNPs or their distributions, then assessment of haplotypic associations with the disease phenotype [47].
Haplotypebased association analysis has several advantages over association analysis with single SNPs. First, haplotypes of multiple SNPs can reduce the number of comparisons to be made during the analysis. Typically, SNPs within a narrow region are in high LD, i.e., adjacent SNP alleles on the same chromosome are highly correlated. Consequently, with K such SNPs, the total number of haplotypes formed by these SNPs is generally much smaller than the number of all possible haplotypes (= 2^{K1}). For a typical gene, the number of common haplotypes, even with variable numbers of SNPs, is on the order of 10–15 [8], with a few notable exceptions such as the major histocompatibility (MHC) genes [9]. Secondly, a haplotype is naturally interpreted as genetic polymorphisms of SNP alleles on the same chromosome. After filling in the nonSNP nucleotides between SNPs, one has a fully phased DNA sequence. This sequence, if it lies in the coding region of a gene, can be converted into an amino acid sequence, and thus haplotypic variations may result in protein variations, an important biological context to consider. Third, haplotypes themselves tend to be conserved and shaped by evolutionary processes. Recent population genetic studies of the human genome have suggested that recombination processes, together with other population genetic forces, have created longrange haplotype blocks [10,11]. These block structures are also useful for reducing the number of statistical comparisons, as well as for interpretation of disease associations with common extended haplotypes. Additionally, haplotypebased associations are useful for mapping unknown disease mutations. As opposed to assuming a direct relationship between a phenotype and an individual SNP, one or more diseasecausing mutations may be in high LD with adjacent SNPs; hence, extended haplotypes formed by these known SNPs may serve as markers for diseasecausing mutations yet to be discovered [12]. Indeed, a haplotype of multiple SNPs may be thought of as an allele at a multiallelic marker locus, and increasing polymorphism with multiple haplotypes improves the power to detect disease associations.
There are also disadvantages when using haplotypebased association analyses: the main disadvantage is that haplotypebased association analyses may have reduced power in detecting SNPlevel associations. If in truth, the disease association is with a single functional SNP, the haplotypebased association can be less efficient due to the fact that including irrelevant SNPs effectively divides the samples into multiple haplotype groups, hence reducing sample sizes and consequently decreasing the power of the study. Moreover, dividing a single SNP association into multiple haplotype associations will also incur the penalty associated with multiple comparisons. This loss of statistical efficiency is exacerbated if the haplotype analysis includes many SNPs, resulting in an excessively large number of haplotypes.
To retain the advantages of haplotypebased analysis and overcome its disadvantages, we propose a HBSP to systematically search for a subset of SNPs whose haplotypes associate with the disease phenotype. Following the principle of stepwise regression methodology, for example, forward or backward selection [13], we have developed forward and backward haplotypebased stepwise procedures. For example, in the backward procedure, one gradually deselects one SNP at a time, provided that the exclusion of individual SNPs does not alter the observed haplotypic association. In this report, we introduce the methodology of the HBSP, report our results from simulation studies with finite sample sizes, and illustrate its clinical applicability by using the HBSP to select functional SNPs within the BPI gene, which has been independently shown to be significantly associated with pulmonary function in posttransplant patients.
Methods
BPI and Pulmonary Function among Transplant Patients
Given the importance of innate immunity in protection from diseases of the airway, we conducted a genetic association study using a candidate gene approach to determine if polymorphisms in genes of the innate immune pathway are associated with the development of hematopoietic stem cell transplant (HCT) related airflow obstruction (AFO), the details of which have been published elsewhere [14]. This twotiered (including discovery and validation phases) casecontrol genetic association study selected tagging SNPs from 15 genes from the innate immunity pathway. Cases were defined as patients who experienced an annual decrease of the forced expiratory volume in the first second (FEV_{1}) > 5%, with their lowest posttransplant ratio of FEV_{1 }to forced vital capacity < 0.8. This study discovered and validated the association of multiple tagging SNP haplotypes on the BPI gene with the AFO phenotype. In the analyses below, eight tagging SNPs from the BPI gene are used for illustrative purposes.
Notation, Model and Estimation Procedures
Notation
Consider a casecontrol study with n subjects (i = 1,2,..., n), with cases denoted by d_{i}= 1 and controls denoted as d_{i}= 0. Let x_{i }= (x_{i1}, ...x_{ic})' denote a vector of c covariates, such as clinical, demographical, and lifestyle variables. Also obtained from the ith subject is a biological sample, which is genotyped for multiple SNPs. Let g_{i }= (g_{i1}, g_{i2}, ...,g_{iq}) denote genotypes of linearly ordered SNPs within a welldefined genomic region, such as a candidate gene region. Let g_{ij }= a_{ij}: denote a pair of alleles at the jth locus in the ith individual, where biallelic SNP alleles a_{ij }and take a value of 0 and 1, corresponding to the major and minor allele, respectively. Due to the nature of the genotyping technology, the parental origin (or phase) of individual alleles is unknown. Let Ω_{i }= (Ω_{i1}, Ω_{i2}, ..., Ω_{iq}) denote a vector of phase indicators: Ω_{ij }= 0 implies that the first allele at the jth locus for the ith subject a_{ij }is inherited from the father, with the other allele, , from the mother. In contrast, Ω_{ij }= 1 implies that a_{ij }is inherited from the mother and from the father. When phases are known, (g_{i}, Ω_{i}) defines two haplotypes called a diplotype, denoted as H_{i }: . Each haplotype consists of q SNPs and may be written as H_{i }= a_{i1}a_{i2}a_{i3}...a_{iq1}a_{iq}. For q SNPs, there are r possible haplotypes, denoted as (h_{1}, h_{2}, ..., h_{r}).
The penetrance of haplotypes and covariates to the disease phenotype is quantified through a logistic regression model. The logistic penetrance function can be formally written as
which takes values between 0 and 1, quantifying the probability of being diseased. The K(·) is a vector of (r1) indicator functions, i.e K(H_{i}) = (I(H_{i }= h_{2}), I(H_{i }= h_{3}), ..., I(H_{i }= h_{r}))' where the haplotype h_{1 }is treated as a reference. The unknown regression coefficient vector = (β1, β2,..., βr1)' quantifies haplotypespecific penetrance to the phenotype and is estimated from the data, along with other unknown regression parameters, the coefficient of intercept α and the coefficients of the other covariates . The regression coefficient β_{j }is also referred as the logarithm of odds ratio (OR). The estimation algorithm for parameters and their covariates were developed elsewhere [4].
A Wald Test Statistic
In our proposed HBSP described below, we either add (in forward selection) or remove (in backward selection) one SNP at each time. Suppose that there are q SNP loci with r different haplotypes denoted as (h_{1}, h_{2}, ..., h_{r}) considered in the above logistic regression model. The estimations of and its covariance are denoted by and .
To examine the contribution of a particular SNP to the overall haplotypic association, we remove one SNP at a time from the haplotypes. For example, if the qth SNP is removed, some of the haplotypes may be merged if their haplotypic differences were due to allelic difference at the qth SNP. To assess the contribution of the qth SNP to the disease association, it is equivalent to test if the coefficients of merged haplotypes are equal. Suppose s unique haplotypes are observed after removing the qth SNP. So (rs) haplotypes are merged with one of s unique haplotypes, thus number of equalities to be tested is (rs).
Under the null hypothesis that the qth SNP has no contribution to the haplotypebased association, we thus compute haplotypebased parameters for s haplotypes with (q1). Further, under the null hypothesis, one could use estimated s haplotypebased parameters to assign haplotypebased parameters for all r haplotypes, as if q SNPs been included. Let denote such haplotypebased parameters obtained under the null hypothesis with the qth SNP removed.
To test whether the qth SNP contributes significantly to the haplotypebased association, we construct the following Wald statistic:
where is the estimated covariance matrix of coefficients under the null hypothesis. By the Central Limit theorem, one can show that the above Wald statistic has an asymptotic chisquare distribution with rs degrees of freedom, under the null hypothesis. With finite sample sizes, our simulation results support the approximations by the stated chisquare distribution (not shown). Based upon this distribution, one can estimate the probability that quantifies the statistical significance in removing the qth SNP.
Two exceptional cases are worth mentioning. The first case is that if the qth SNP is in perfect LD with the remaining SNPs, removing that SNP would not alter the distribution of haplotypes. Consequently, estimated log ORs from the reduced haplotype analysis will be exactly the same as those in the full haplotype analysis, i.e., . Naturally, the above Waldstatistic equals zero, with zero degrees of freedom. In such a case, the SNP is removed without requiring a test. The second special case is that when only one SNP is in the model, the Wald statistic [2] degenerates to .
A Forward Selection Procedure
Consider a haplotype analysis with Q SNPs from a gene or region. A forward selection procedure can be used to evaluate haplotypic associations with a single SNP, two SNPs, and progressively increasing numbers of SNPs. This procedure will be terminated when the minimum pvalue exceeds a preset threshold (Figure 1A). Within this procedure, the threshold value c_{f }is chosen to control false positive error rates. Here, we control the overall false positive error rate less than a prefixed rate, say 5% (further explored below).
Figure 1. Outline of the three computational algorithms for the stepwise selection of SNPs. (A) Forward selection procedure. (B) Backward selection procedure. (C) Hybrid Selection Procedure.
A Backward Selection Procedure
While computationally efficient, the forward selection procedure may miss significant haplotypic associations due to variable haplotypic domains, i.e., the parameter domains are not necessarily hierarchical when a SNP is progressively reduced [6]. The desire to overcome this limitation motivates the backward selection procedure as an alternative to the forward selection procedure. The basic idea is to start with the haplotypes of all Q SNPs and then to eliminate irrelevant SNPs one SNP at a time (Figure 1B). The topic of choosing the threshold value c_{b }for the backward selection is discussed further below.
A Hybrid Selection Procedure
While being generally preferred, the backward selection procedure can be timeconsuming and prohibitive when the number of SNPs to be analyzed is large and each haplotypebased association analysis requires a substantial computation. To overcome this challenge, one may consider a hybrid selection procedure, i.e., combining both forward and backward selection procedures, patterning after the usual stepwise regression approach. One possible strategy is to initially use the forward selection procedure to add SNPs into haplotypes, and then to deselect those "selected SNPs" from established haplotypes with the backward selection procedure. The hybrid procedure, involving forward and backward selection threshold values c_{f }and c_{b}, stems directly from those described above (Figure 1C).
PermutationBased Assessment of False Positive Error Rates
As noted above, the threshold values c_{f }and c_{b }for forward and backward selection procedures, respectively, are closely connected with false positive error rates, and stringent threshold values correspond to low false positive error rates. Choosing threshold values can be a challenge due to multiple comparisons with a series of highly correlated chisquare tests. The correlatedness among these tests can not be easily quantified due to varying levels of LD within a gene or within a specific region. However, a simple Bonferroni correction ignoring the correlation could lead to excessively conservative results.
We propose to use a permutationbased assessment to evaluate the false positive error rate (FPER), based on which the corresponding threshold value is chosen. Without requiring any distributional assumptions, the basic idea is to permute disease phenotypes across all subjects to create samples that could arise from the null hypothesis, in which SNPs have no associations with the disease phenotype. Thus, analysis of the permuted data, utilizing either the forward or backward procedure or a hybrid of both, would yield relevant statistics, in particular, pvalues. Following analysis of the permuted data, the number of false positive errors is counted based on the prechosen threshold values (c_{f }or c_{b}). Repeating the permutation, say, 1000 times results in a sample of falsepositive error counts. The average value over all permutations is taken as an estimate of the FPER. Thus, the threshold value (c_{f }or c_{b}) is chosen in such a way that the ultimate FPER is controlled at the prefixed rate.
Simulation Studies
Simulation studies were conducted under the null hypothesis and also under alternative hypotheses. Assuming a typical coalescent process, we simulated 15 SNPs in varying degrees of LD, which were then used for the procedure (Figure 2). Under the null hypothesis, the simulated phenotype had no association with any simulated SNPs. For alternative hypotheses, we considered two different scenarios: 1) the phenotype was associated with a single SNP, and 2) the phenotype was associated with a haplotype of two SNPs.
Figure 2. LD patterns among 15 simulated SNPs and their haplotype block boundaries, indicated by solid lines: (a) Standard D'/LOD pattern is shown, and (b) the LD pattern with r^{2 }using confidence intervals.
Using a coalescent model, we generated genotype data under a typical evolutionary process, based upon Hudson's coalescent simulation program [15]. For the simulation, we specified a scaled recombination rate of 10, i.e., 4× (number of generations) × (recombination rate) to generate 2,000 haplotypes and with 150 segregation sites. From the 150 segregation sites, we selected 15 SNPs that had a minor allele frequency greater than 5%. Based upon 2,000 simulated haplotypes, we computed the haplotype frequencies of the 15 SNPs to represent the "true" haplotype frequencies in the study population. In this particular population, we observed 19 unique haplotypes with frequencies exceeding 0.1% (Table 1).
Table 1. Distribution of 19 simulated haplotypes
The following procedure was used to simulate the study population of one million people: we randomly drew a pair of haplotypes from the above haplotypic distribution to form individual diplotypes, but stripped away the phase information during this process. Using the penetrance function [1], with parameters specified corresponding to the null hypothesis and various alternative hypotheses, we computed the probability of an individual developing the disease. Based upon computed probabilities, we then simulated a binary disease status by the Bernoulli process. This simulation process was repeated one million times, resulting in the targeted study population. We also simulated two demographic variables: gender and age. For gender, we assumed that men and women were equally represented in the population. We assumed age to be uniformly distributed from 20 to 80 years. Under these assumptions, we randomly assigned gender and age to all individuals in the simulated study population
From the simulated study population, we randomly drew equal numbers of cases and controls to form casecontrol data sets. We considered different sample sizes in the simulation to test sample size effects. For each configuration of parameters, we repeated the simulations 1,000 times. To ensure the validity of the simulated results irrespective of SNP choices, we randomly selected one or two adjacent SNPs as functional SNPs in each replication. For each simulated data set, we used the HBSP to identify the functional SNP or haplotype and verified the finding. If the finding matched the functional element, it was a true positive finding. The percentages of true findings among the 1,000 replicates were recorded to quantify the true discovery rate (TDR), i.e., the percentage of true SNP associations identified. Since some SNPs were in LD with each other, we treated the discovery of SNP association, if it was at high LD with the true functional SNP (D^{' }≥ 0.8), as an acceptable discovery. To quantify this imperfect discovery, we introduced a statistic, the imperfect true discovery rate (iTDR). If the discovered SNP was neither the causal SNP nor at high LD with the causal SNP, it was counted as a false positive finding. Ideally, the rate of such false positive errors would be largely comparable to the FPER, which was controlled as described above.
Results
Simulated Data
Null Hypothesis
Under the null hypothesis, log ORs related to the haplotypes in the penetrance function [1] are set to 0. For gender and age, the coefficients were set at 0 and 0.01, respectively, with an intercept of 1. Sample sizes varied from 500 to 2000, with an equal number of cases and controls. Results of the simulation are reported in the first row of Table 2. The estimated FPER did not significantly deviate from 0.05, which was the chosen threshold, across the three sample sizes.
Table 2. False positive error rates are estimated under the null hypothesis.
Alternative Hypothesis with a Single Functional SNP
Under the alternative hypothesis with a single but randomly chosen SNP, we specified related ORs as 1.10, 1.20, 1.30, 1.40, 1.50, 2.00, 3.00 and 5.00 in the penetrance function [1], to correspond to associations ranging from weak to strong. Table 2 lists the estimated TDR, imperfect true discovery rate iTDR, and FPER percentages for each value. As expected, the TDR increased with increasing sample size, 57.6% to 77.8% and then to 83.0% to detect OR = 1.5, as the sample size increased from 500 to 1000 and to 2000, respectively. Similarly, the TDR increased with the increasing OR values, 13.6% to 77.8% and then to 96.3% to detect ORs = 1.1, 1.5 and 5.0, respectively, with the sample size fixed at 1000. As expected, estimated FPER values were largely around 5%, with a few exceptions. Some minor elevations in FPER were probably due to weak LD with the functional SNP.
Alternative Hypothesis with Two Functional SNPs
Under the alternative hypothesis with two associated SNPs, we randomly chose two adjacent SNPs; they form four possible haplotypes (00, 01, 10, 11) with 00 as the reference haplotype. We created five different scenarios assuming different ORs for different haplotypes. Under the first scenario, the OR corresponding to haplotype 10 took values ranging from 1.3 to 5.0, which is similar to the single SNP situation. As expected, the TDR, iTDR and FPER estimates were comparable to those under the single SNP alternative hypotheses. Specifically, a casecontrol study of 500 subjects would have a 57.5% chance of detecting the true functional SNP with an OR of 2.0. At an increased sample size of 2000, the study would be able to detect an OR of 2.0 with nearly 80% TDR. The greater iTDR was mostly due to more than two SNPs being at high LD with these two functional SNPs. FPER values were generally less than 5%, with a few exceptions, partly because some SNPs had weak LD with the functional SNPs.
Under the second scenario, the OR corresponding to haplotype 11 varied from 1.3 to 5.0. While FPER estimates were around 5%, the TDR of detecting such associations was quite low, ranging from 10% to 36% for detecting OR of 2.0 with sample size varying from 500 to 2000, respectively. The primary reason for the reduced power was that the haplotype frequency for haplotype 11 was much lower than the others. Again, iTDR values were much greater than comparable TDR values, for the same reason stated above. When we increased the OR associated with haplotype 10 to 1.3 (scenario 3) and 1.5 (scenario 4), the TDR appreciably increased.
Under the fifth scenario, three OR values for three haplotypes deviated from 1.0, and both the TDR and the iTDR for detecting such genetic associations by the HBSP became very powerful. Even with a sample size of 500 subjects, the study had a 50~54% TDR and an 82~90% iTDR for detection of the true SNPhaplotype associations with the stated OR values.
BPI Clinical Data
In our earlier analysis of a discovery cohort (N = 393), we identified BPI as an important candidate gene for the development of HCTrelated AFO [14]. BPI was tagged by eight SNPs (C2738G, G7258A, G9083C, A10214G, G17016G, C23356T, A33065G, G36045A). Three haplotypes, tagged by these SNPs, were found to be significantly associated with the phenotype. Repeat analysis of these tagging SNPs in an independent validation cohort (N = 209) again revealed that multiple tagging SNP haplotypes were significantly associated with the AFO phenotype [14]. In the interest of reducing the number of SNP markers necessary to identify atrisk patients in clinical practice, we applied the proposed algorithm to find the most informative tagging SNPs. We applied the forward, backward and hybrid procedures to the discovery cohort, while adjusting for clinical covariates that were previously identified as important clinical risk factors for HCTrelated AFO (age at transplant, pretransplant onesecond forced expiratory volume, extent of graft versus host disease, and duration of followup). All three models identified the same two SNPs (C23356T and A33065G) as the most significantly associated with the disease phenotype (Table 3). With TG as the reference haplotype, the CA, CG, and TA haplotypes had ORs ranging from 1.52, 1.75 and 3.36, respectively (pvalues 0.027, 0.014 and < 0.001, respectively). In the validation cohort, analysis of these SNPs again confirmed the statistical significance, with the exception of the TA haplotype. All ORs were comparable to those in the discovery cohort. These results confirmed that our approach can be applied to clinical data to identify the most informative SNP markers across a genetic region of significance.
Table 3. Estimated haplotype frequencies of two SNPs (C23356T and A33065G), estimated log odds ratios and their standard errors for all common haplotypes formed by identified SNPs
Discussion
A complementary approach to the HBSP is the direct application of the stepwise regression approach to assess disease associations with SNP alleles or genotypes at multiple SNP loci, as described by Clayton and his colleagues [3,16]. Basically, this approach treats individual SNP alleles or genotypes as covariates and then assesses their associations with the disease phenotype via the logistic regression model [1]. To identify functional SNPs, they propose using the usual stepwise regression technique to systematically analyze all SNP genotypes and their cross products. Those crossproduct terms, if significant, are surrogates for possible haplotypic associations. While its key advantage includes the simplicity and familiarity to most population researchers, the interpretation of crossproduct terms as possible haplotypic associations is not straightforward. Moreover, such an approach does not take full advantage of extended common haplotypes across many SNPs, because one has to numerate all cross products to detect a highorder interaction; e.g., eight SNPs will create 256 (= 2^{8}) allelic cross products or 6,561 (= 3^{8}) genotypic cross products.
To compare both stepwise approaches, we utilized simulation studies to assess the TDR, iTDR and FPER for the stepwise regression approach. We conducted the simulation studies under both null and alternative hypothesis and used the same 15 simulated SNPs from the simulation study population generated for our proposed approach. The simulation results are reported in Table 4. Under scenario 1, the OR corresponding to haplotype 10 takes values ranging from 1.3 to 5.0, and the FPER and iTDR estimates from the usual approach were comparable to those from the HBSP. However, the TDRs for detecting associations with the usual stepwise regression technique were lower than those obtained with HBSP (Table 2). Under scenario 2, the OR corresponding to haplotype 11 varies from 1.3 to 5.0. While the FPER estimates were around 5%, the TDR for detecting associations was quite low, even though the sample size reached 2000. Thus, compared to HBSP, the usual stepwise regression technique has less power for detecting true genetic associations and compatible power in discovering imperfect true genetic associations.
Table 4. False positive error rate under the null hypothesis, and true discovery rate (false positive error rate) under alternative hypotheses 1 when the Clayton's stepwise approach is used to select a subset of SNPs.
For illustrative purposes, we have also applied stepwise logistic regression models to the BPI discovery cohort. The forward stepwise selection procedures did not detect any SNPs at the significance level of α = 0.05. The backward stepwise elimination procedure detected the joint effect of the two SNPs (C23365T and A33065G), the results from which were consistent to those obtained by HBSP.
Recently, Cheng and colleagues have described another method of mapping functional sites with SNPhaplotypes [17], which shares a similar scientific objective to the HBSP. The key idea underlying their approach is to compute an overall statistic that summarizes all associations within a sliding window of multiple SNPs. Systematically covering genes/regions with sliding and overlapping windows, their method can pinpoint one or more sites that are functionally associated with the disease phenotype. The end result of this approach is to identify functional sites, which is complementary to that of the HBSP. In fact, one can apply the HBSP to produce the overall statistic for each sliding window, as a way of detecting haplotypic associations in an efficient manner.
While the HBSP performed well for the studies reported in this paper, its function can also be extended to other types of analysis. The HBSP can be adapted for other phenotypes, such as continuous, censored or categorical, via their corresponding link functions. As presented above, the key statistic needed to construct the Wald statistic [2] is the covariance matrix of all estimated parameters, which is typically obtainable from the maximum likelihood or estimating equations. In addition, the current method is structured to detect major genetic associations via the assumed penetrance model [1], and is not designed to detect geneenvironmental or genegene interactions. To achieve those objectives, we can extend the penetrance model by including those interactions, which have been elaborated elsewhere [4].
Conclusion
The HBSP described above is effective in selecting a subset of SNPs whose haplotypes are significantly associated with a disease phenotype by eliminating SNPs with random polymorphisms. The HBSP retains the advantages of haplotypebased analysis while minimizing the deficiencies associated with typical haplotypebased analysis that includes extraneous SNPs. Simulation studies indicated that our permutation scheme effectively controls the false positive error rate while HBSP has adequate power to identify those functional SNPs/haplotypes. The illustrative example with SNP data from the BPI gene in a transplant cohort demonstrated the success of the HBSP in identifying two consecutive SNPs out of eight SNPs from the discovery cohort and in validating their associations with the disease phenotype from clinical data.
Authors' contributions
YY carried out numerical calculations and simulations, SL implemented the methodology and simulation strategy, JC contributed BPI data and helped to interpret results, JA participated in the manuscript preparation, LZ conceived the idea, led the development and simulation, and participated in the manuscript preparation. All authors read and approved the final manuscript.
Acknowledgements
This work was supported in part by grants from the NCI (5R01 CA10632005, 5 P01 CA5399624, 1R01 CA11251201), NHLBI (K23HL69860), an American Lung Association of Washington Research Grant, and the National Marrow Donor Program (Amy Strelzer Manasevit Research Award).
References

Cheng R, Ma JZ, Elston RC, Li MD: Fine mapping functional sites or regions from casecontrol data using haplotypes of multiple linked SNPs.
Ann Hum Genet 2005, 69(Pt 1):10212. PubMed Abstract  Publisher Full Text

Neale BM, Sham PC: The future of association studies: genebased analysis and replication.
Am J Hum Genet 2004, 75(3):35362. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cordell HJ, Clayton DG: A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes.
Am J Hum Genet 2002, 70(1):12441. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhao LP, Li SS, Khalid N: A method for the assessment of disease associations with single nucleotide polymorphism haplotypes and environmental variables in case control studies.
Am J Hum Genet 2003, 72(5):123150. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhao H, Pfeiffer R, Gail MH: Haplotype analysis in population genetics and association studies.
Pharmacogenomics 2003, 4(2):1718. PubMed Abstract  Publisher Full Text

Niu T: Algorithms for inferring haplotypes.
Genet Epidemiol 2004, 27(4):33447. PubMed Abstract  Publisher Full Text

Clark AG: The role of haplotypes in candidate gene studies.
Genet Epidemiol 2004, 27(4):32133. PubMed Abstract  Publisher Full Text

Stephens JC, Schneider JA, Tanguay DA, Choi J, Acharya T, Stanley SE, Jiang R, Messer CJ, Chew A, Han JH, Duan J, Carr JL, Lee MS, Koshy B, Kumar AM, Zhang G, Newell WR, Windemuth A, Xu C, Kalbfleisch TS, Shaner SL, Arnold K, Schulz V, Drysdale CM, Nandabalan K, Judson RS, Ruano G, Vovis GF: Haplotype variation and linkage disequilibrium in 313 human genes.
Science 2001, 293(5529):48993. PubMed Abstract  Publisher Full Text

Cullen M, Noble J, Erlich H, Thorpe K, Beck S, Klitz W, Trowsdale J, Carrington M: Characterization of recombination in the HLA class II region.
Am J Hum Genet 1997, 60(2):397407. PubMed Abstract  PubMed Central Full Text

Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES: Highresolution haplotype structure in the human genome.
Nat Genet 2001, 29(2):22932. PubMed Abstract  Publisher Full Text

Wall JD, Pritchard JK: Haplotype blocks and linkage disequilibrium in the human genome.
Nat Rev Genet 2003, 4(8):58797. PubMed Abstract  Publisher Full Text

Lin S, Chakravarti A, Cutler DJ: Exhaustive allelic transmission disequilibrium tests as a new approach to genomewide association studies.
Nat Genet 2004, 36(11):11818. PubMed Abstract  Publisher Full Text

Snedecor GWCWG: Statistical methods. Iowa: The Iowa State University Press; 1989.

Chien JW, Zhao LP, Hansen JA, Fan WH, Parimon T, Clark JG: Genetic variation in bactericidal/permeabilityincreasing protein influences the risk of developing rapid airflow decline after hematopoietic cell transplantation.
Blood 2006, 107(5):22007. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hudson RR: Generating samples under a WrightFisher neutral model of genetic variation.
Bioinformatics 2002, 18(2):3378. PubMed Abstract  Publisher Full Text

Clayton D, Chapman J, Cooper J: Use of unphased multilocus genotype data in indirect association studies.
Genet Epidemiol 2004, 27(4):41528. PubMed Abstract  Publisher Full Text

Cheng C, Kimmel R, Neiman P, Zhao LP: Array rank order regression analysis for the detection of gene copynumber changes in human cancer.
Genomics 2003, 82(2):1229. PubMed Abstract  Publisher Full Text