There has been growing interest in developing strategies for identifying single-nucleotide polymorphisms (SNPs) that explain a linkage signal by joint modeling of linkage and association. We compare several existing methods and propose a new method called the homozygote sharing transmission-disequilibrium test (HSTDT) to detect linkage and association or to identify SNPs explaining the linkage signal on chromosome 6 for rheumatoid arthritis using 100 replicates of the Genetic Analysis Workshop (GAW) 15 simulated affected sib-pair data. The existing methods considered were the family-based tests of association implemented in FBAT, a transmission-disequilibrium test, a conditional logistic regression approach, a likelihood-based approach implemented in LAMP, and the homozygote sharing test (HST). We compared the type I error rates and power for tests classified into three categories according to their null hypotheses: 1) no association in the presence of linkage (i.e., a SNP explains none of the linkage evidence), 2) no linkage adjusting for the association (i.e., a SNP explains all linkage evidence), and 3) no linkage and no association. For testing association in the presence of linkage, we found similar power among all tests except the homozygote sharing test, which had lower power. When testing linkage adjusting for association, LAMP and HST had similar power, whereas the conditional logistic regression method had lower power. When testing linkage or association, the conditional logistic regression method was more powerful than FBAT.
The availability of high-throughput single-nucleotide polymorphism (SNP) genotyping technologies at more affordable costs has generated increasing enthusiasm for genome-wide association studies (GWAS) of a wide range of disorders. How best to analyze dense SNP data is of great interest to the scientific community. Family-based association tests model both linkage and association and thus can localize the disease locus better than linkage analyses alone while avoiding spurious association results due to population admixture. Recently, there has been growing interest in developing methods for identifying SNPs that account for all the observed linkage evidence, a goal that can be achieved by joint modeling of linkage and association.
There are three major types of family-based association tests categorized by their null hypotheses: 1) H0: no association, in the presence of linkage or the tested SNP is in linkage equilibrium (LE) with all disease loci (denoted T1); 2) H0: no linkage adjusting for the association, or the tested SNP is in complete linkage disequilibrium (LD) (r2 = 1) with all disease loci (denoted T2); 3) H0: no linkage and no association (denoted T3). It is vital to understand the differences among these hypotheses and the relative efficiencies of valid tests for each hypothesis. In what follows, we classify the tests we considered into the three categories and compare the power and type I error among them.
We used all 100 replicates of the Genetic Analysis Workshop 15 (GAW15) simulated nuclear families, each containing both parents and two affected children (Problem 3). Rheumatoid arthritis is the phenotype in all of our analyses. An initial nonparametric multipoint genome-wide linkage scan using SNP markers was performed with Merlin. A linkage peak with a mean logarithm of odds (LOD) score of 79 (across the 100 replicates) at about 50 cM on chromosome 6 was identified. Because this linkage region is broad, we selected 2102 dense SNPs between 40 and 60 cM under the linkage peak for assessing the power of all methods. In addition, we selected 947 dense SNPs between 130 and 140 cM on chromosome 6 for evaluating type I error rates under the null hypothesis T1. Although the LOD scores in this region range from 3 to 6, these 947 SNPs are far away from the disease loci (the DR, C, and D loci) and should be in LE with them. For T3, type I error rates were evaluated using all 16 chromosomes that do not harbor any disease loci. Available data from controls were used to determine LD, as measured by r2, between each of the 2102 dense SNPs and the disease loci. For the tri-allelic DR locus, a generalized R2 was calculated for each SNP and tested using the LOGISTIC procedure in SAS version 8. For the C and D loci, r2 was estimated using Haploview and tested using the method proposed by Sabatti and Risch. For each SNP, the mean r2 across the 100 replicates was used to estimate LD with the disease loci more accurately. We briefly describe and categorize each method by its hypotheses. The analyses were performed with knowledge of the "answers".
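As an illustration of the LD measure used above, the following minimal sketch computes pairwise r2 for two bi-allelic loci from haplotype and allele frequencies, then averages it over replicates. It assumes the haplotype frequency is already known; Haploview estimates it from genotype data, which this sketch does not attempt. The function names and the example frequencies are hypothetical.

```python
def ld_r2(p_ab: float, p_a: float, p_b: float) -> float:
    """Pairwise r^2 between two bi-allelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a, p_b: frequencies of alleles A and B
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("allele frequencies must be strictly between 0 and 1")
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    return d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))


def mean_r2(r2_per_replicate):
    """Average r^2 across replicates (hypothetical helper)."""
    values = list(r2_per_replicate)
    if not values:
        raise ValueError("need at least one replicate")
    return sum(values) / len(values)


# Hypothetical example: three replicates of the same SNP/disease-locus pair.
replicate_r2 = [ld_r2(0.30, 0.5, 0.5), ld_r2(0.35, 0.5, 0.5), ld_r2(0.25, 0.5, 0.5)]
average = mean_r2(replicate_r2)
```

A SNP in LE with the disease locus has p_ab = p_a * p_b and therefore r2 = 0, which is the situation assumed for the 947 SNPs used to assess type I error for T1.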
Family-based association test (FBAT)
Rabinowitz and Laird proposed the family-based association test (FBAT), which is applicable to multiple siblings, quantitative traits, and incomplete parental genotypes. FBAT is a valid test of T3. Lake et al. extended FBAT to a valid test of T1 by incorporating an empirical variance estimate (FBAT-e).
Conditional logistic regression method
Millstein et al. recently proposed a pseudo-control approach for joint modeling of linkage and association. Let g1, g2, gm, gf denote the genotypes at a studied locus for the two affected offspring, the mother, and the father, respectively. D1 and D2 are the disease states of the two sibs. Conditional on the parental genotypes and the sibs' disease states, the likelihood for the children is
which can be modelled as
where g* represents the four possible offspring genotypes; e12 = E[ibd12|g1, g2, gm, gf] is the expected identical-by-descent (IBD) sharing between g1 and g2 given the observed marker genotypes, and e1* is the expected IBD sharing between g1 and g*. A test of β = 0 is a test of T1 (denoted Millstein-b), a test of γ = 0 is a test of T2 (denoted Millstein-c), and the two-degree-of-freedom likelihood ratio test (LRT) of β = 0 and γ = 0 is a test of T3 (denoted Millstein-a).
Likelihood-based approach – LAMP
Li et al. proposed a method to identify SNPs in LD with the disease locus through estimation of the degree of LD between the tested SNP and the putative disease locus. The method is implemented in the software called LAMP. They use a likelihood function that 1) assumes a single di-allelic disease locus, 2) assumes no recombination between the tested SNP and the disease locus, 3) uses disease-SNP haplotype frequencies and disease penetrances as parameters, and 4) can incorporate information from flanking markers in LE with the tested SNP. Two LRTs are proposed. The first (denoted LAMP-LE) assesses whether the tested SNP is in LE with the disease locus, while the second (denoted LAMP-LD) assesses whether the tested SNP is in complete LD with the disease locus. Therefore, LAMP-LE is a test of T1 and LAMP-LD is a test of T2. The statistical significance of these two tests is assessed empirically by comparing the observed statistic with simulated null distributions. We exclude flanking short tandem repeat (STR) markers in LD with tested SNPs and use the remaining STRs in our application of LAMP.
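The empirical assessment of significance mentioned above can be sketched generically as follows. This is not LAMP's own code: it assumes that larger statistics are more extreme and uses the common add-one estimator, and the function name and example values are hypothetical.

```python
def empirical_p_value(observed: float, null_stats) -> float:
    """Empirical p-value of an observed statistic against simulated null statistics.

    Assumes larger values are more extreme. The +1 in numerator and
    denominator counts the observed statistic as one draw from the null,
    so the estimate is never exactly zero.
    """
    null_stats = list(null_stats)
    if not null_stats:
        raise ValueError("need at least one simulated null statistic")
    n_exceed = sum(1 for s in null_stats if s >= observed)
    return (n_exceed + 1) / (len(null_stats) + 1)


# Hypothetical example: one of four simulated statistics reaches the observed one.
p = empirical_p_value(5.0, [1.0, 2.0, 3.0, 6.0])  # (1 + 1) / (4 + 1)
```

With this estimator the smallest attainable p-value is 1 / (N + 1) for N simulations, so testing at the 0.1% level requires at least 999 simulated statistics.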
Homozygote Sharing Test (HST)
The HST statistic [11,12] is constructed using a likelihood function conditional on parental genotypes. It compares the observed IBD sharing from homozygous and heterozygous parents to determine whether a SNP partially explains the evidence for linkage. HST capitalizes on the fact that parents who are homozygous at all disease loci in a linkage region should not transmit any alleles preferentially to the affected siblings, and hence no excess IBD sharing should be observed from homozygous parents. Additionally, the IBD sharing from homozygous and heterozygous parents should be equal for SNPs in LE with all disease loci. For the intermediate case in which the tested SNP is in partial LD with the disease loci, some increased sharing may be observed from homozygous parents in a linkage region. The HST statistic to identify SNPs explaining some of the linkage evidence is derived from the likelihood ratio of the following hypotheses: H0: 1/2 < αhomo = αhet vs. H1: 1/2 ≤ αhomo < αhet, where αhomo and αhet are the probabilities that an affected sib-pair shares one allele IBD with respect to homozygous and heterozygous parents, respectively. The HST is defined as
where and denote the number of sib pairs sharing j alleles IBD from homozygous and heterozygous parents, respectively (j = 0, 1). This HST statistic (denoted HST-LE) is a test of T1. Once subsets of SNPs explaining some of the linkage evidence have been identified, one can then test H0: 1/2 = αhomo < αhet vs. H1: 1/2 < αhomo < αhet with the following HST statistic:
Rejection of the null hypothesis indicates that the tested SNP does not fully explain the linkage evidence. This HST statistic (denoted HST-LD) is a test of T2. Both HST-LE and HST-LD are LRTs under independent parental transmissions, which is equivalent to assuming a multiplicative model of transmission. Under the null hypothesis of LE between the tested SNP and disease loci, both HST statistics asymptotically follow a chi-square mixture distribution of , assuming independent parental transmissions.
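To make the HST-LE hypotheses concrete, the following is a minimal sketch of a likelihood ratio statistic for H0: 1/2 < αhomo = αhet vs. H1: 1/2 ≤ αhomo < αhet. It assumes that each parental transmission pair contributes an independent binomial observation (shares one allele IBD or not) and that IBD status is known; the statistic used in the paper may differ in detail. The function and argument names are hypothetical, and only the statistic is returned, not a p-value from the chi-square mixture.

```python
import math


def _binom_loglik(n_share: int, n_no_share: int, alpha: float) -> float:
    """Binomial log-likelihood, skipping terms with zero counts."""
    ll = 0.0
    if n_share > 0:
        ll += n_share * math.log(alpha)
    if n_no_share > 0:
        ll += n_no_share * math.log(1.0 - alpha)
    return ll


def hst_le_sketch(n1_homo: int, n0_homo: int, n1_het: int, n0_het: int) -> float:
    """LRT statistic comparing IBD sharing from homozygous vs heterozygous parents.

    n1_*: sib pairs sharing one allele IBD; n0_*: sib pairs sharing none.
    """
    n_homo, n_het = n1_homo + n0_homo, n1_het + n0_het
    if n_homo == 0 or n_het == 0:
        raise ValueError("need observations from both parent types")
    p_homo, p_het = n1_homo / n_homo, n1_het / n_het
    pooled = (n1_homo + n1_het) / (n_homo + n_het)

    # Null: common sharing probability, bounded below by 1/2.
    a0 = max(pooled, 0.5)
    # Alternative: 1/2 <= a_homo <= a_het (order-restricted estimates).
    if p_homo <= p_het:
        a_homo, a_het = max(p_homo, 0.5), max(p_het, 0.5)
    else:
        a_homo = a_het = a0  # order violated: estimates are pooled

    ll0 = _binom_loglik(n1_homo, n0_homo, a0) + _binom_loglik(n1_het, n0_het, a0)
    ll1 = _binom_loglik(n1_homo, n0_homo, a_homo) + _binom_loglik(n1_het, n0_het, a_het)
    return max(0.0, 2.0 * (ll1 - ll0))


# Hypothetical counts: no excess sharing from homozygous parents (50/100),
# strong excess sharing from heterozygous parents (80/100).
stat = hst_le_sketch(50, 50, 80, 20)
```

When sharing from homozygous parents is at least as high as from heterozygous parents, the restricted estimates coincide with the null estimate and the statistic is zero, which is why its null distribution is a mixture rather than a single chi-square.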
HSTDT – Combination of HST-LE and transmission-disequilibrium test (TDT)
The original TDT, proposed by Spielman et al., tests linkage and association between a marker and a disease locus using ascertained affected individuals and their parental marker information. The TDT can also be used to test association in a linked region (Spielman and Ewens) using data that consist of nuclear families with a single affected child. The TDT examines the allelic transmission to the affected child from his/her heterozygous parents. For families with multiple affected siblings, the transmissions are correlated among siblings if there is linkage, and the TDT is no longer a valid test of association in the presence of linkage. To solve this problem, Martin et al. focused on the set of transmissions from a heterozygous parent shared by all his/her affected children. For affected sib-pair data and a marker with two alleles M1 and M2, they showed that for a marker in LE with the disease loci, the probability that both affected siblings receive M1 (denoted as ) and the probability that both affected siblings receive M2 (denoted as ) from their heterozygous parent are equal. Thus, there should be no over-transmission of M1 or M2 to affected offspring. In what follows we use TDT to refer to Martin et al.'s strategy, which is a test of T1. Note that the TDT does not use information from homozygous parents, while HST-LE compares the observed allele sharing from homozygous and heterozygous parents without considering which allele is over-transmitted from heterozygous parents. To fully use all available information to identify whether a SNP explains some of the linkage evidence, we propose HSTDT, which combines HST-LE and TDT by decomposing the allele sharing from heterozygous parents (αhet) into two allele-specific IBD sharing probabilities ( and ), to test H0: vs. H1: . The HSTDT statistic is defined as
Similar to HST-LE, HSTDT is an LRT under the assumption of independent parental transmission. Under the null hypothesis of LE between the tested SNP and disease loci (T1), the HSTDT asymptotically follows a chi-square mixture distribution of , assuming independent parental transmissions.
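The TDT component described above, restricted to transmissions shared by both affected siblings, can be illustrated with a McNemar-type statistic. This is a sketch in the spirit of Martin et al.'s strategy and not necessarily the exact statistic used in our analyses; it assumes the allele each sibling received from a heterozygous M1/M2 parent is known, and the function names and example counts are hypothetical.

```python
import math


def count_shared_transmissions(transmissions):
    """Tally heterozygous parents who transmitted the same allele to both sibs.

    transmissions: iterable of (allele_to_sib1, allele_to_sib2) pairs, one per
    heterozygous M1/M2 parent, with alleles coded as "M1" or "M2".
    Parents who transmitted different alleles to the two sibs are ignored.
    """
    pairs = list(transmissions)
    n_both_m1 = sum(1 for a, b in pairs if a == b == "M1")
    n_both_m2 = sum(1 for a, b in pairs if a == b == "M2")
    return n_both_m1, n_both_m2


def shared_tdt(n_both_m1: int, n_both_m2: int):
    """McNemar-type statistic and its 1-df chi-square p-value."""
    informative = n_both_m1 + n_both_m2
    if informative == 0:
        raise ValueError("no parents with shared transmissions")
    stat = (n_both_m1 - n_both_m2) ** 2 / informative
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: 60 parents transmitted M1 to both sibs, 40 transmitted M2.
stat, p_value = shared_tdt(60, 40)
```

Under LE with the disease loci the two counts have equal expectation, so the statistic measures over-transmission of one allele without using homozygous parents, which is the information HST-LE contributes to the combined test.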
The major distinction between the homozygote sharing tests (HST and HSTDT) and other tests of T1 or T2 is that the former can be used to test whether a SNP explains the linkage peak (by using IBD information at the linkage peak). When no recombination is assumed between the tested SNP and the presumed disease locus at the linkage peak, testing association in the presence of linkage is equivalent to testing whether a SNP partially explains the linkage peak, while testing no linkage adjusting for association is equivalent to testing whether the tested SNP fully explains the linkage peak. However, when this assumption is violated, tests of T1 and T2 other than HST and HSTDT may not support the claim that the tested SNP explains the peak linkage evidence, but only that it explains the linkage evidence at the location of the tested SNP. When a linkage signal is identified in a linked region, the LOD score at the linkage peak should be of greatest interest and is the usual quantity reported. In this report, HST and HSTDT are applied to identify SNPs explaining the peak linkage evidence.
SNPs were classified into five groups according to their LD with the disease loci. In Table 1, the first group (labeled r2 = 0) included the 947 dense SNPs between 130 and 140 cM on chromosome 6 that were used to assess type I error rates for T1. In Table 2, the first group (labeled r2 = 0, θ = 0.5) included 6597 SNPs from all 16 chromosomes that did not harbor any disease loci; these were used to assess type I error rates for T3. The remaining SNPs were grouped by the maximum of their three mean r2 values with the three disease loci (C, D, and DR). For example, the group 0.1 < r2 ≤ 0.3 included SNPs whose largest mean r2 with any of the three disease loci fell between 0.1 and 0.3. Power and type I error rates were assessed at significance levels of 5%, 1%, and 0.1%.
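The rates reported in the tables are empirical rejection rates. A minimal sketch of that calculation follows, assuming one p-value per SNP per replicate has been pooled within an LD group; the function name and the example p-values are hypothetical.

```python
def rejection_rates(p_values, levels=(0.05, 0.01, 0.001)):
    """Percentage of tests rejected at each significance level.

    For SNPs in LE with the disease loci this estimates the type I error
    rate; for SNPs in LD with a disease locus it estimates power.
    """
    p_values = list(p_values)
    if not p_values:
        raise ValueError("need at least one p-value")
    return {
        level: 100.0 * sum(1 for p in p_values if p <= level) / len(p_values)
        for level in levels
    }


# Hypothetical p-values pooled over the SNPs and replicates of one LD group.
rates = rejection_rates([0.2, 0.03, 0.004, 0.0005])
```

A valid test should give rates close to 5, 1, and 0.1 in the r2 = 0 groups, and the same function applied to the other groups gives the power entries.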
Table 1. Type I error rates and power (in percentage) for testing association in the presence of linkage (T1)
Table 2. Type I error rates and power (in percentage) for testing linkage or association (T3)
Testing association in the presence of linkage (T1)
Table 1 presents results for tests of T1. All methods have appropriate type I error rates (r2 = 0), except that HSTDT has slightly inflated type I error rates. TDT is the most powerful of the six methods, followed closely by HSTDT, FBAT-e, Millstein-b, and then LAMP-LE. HST-LE appears to be less powerful than the other five tests.
Testing linkage adjusted for association (T2)
Table 3 presents results for tests of T2. These tests are used to examine whether a SNP explains all the observed linkage evidence. LAMP-LD and HST-LD test whether a SNP is in complete LD with the disease loci. Equivalently, Millstein-c tests whether there is residual linkage when conditioning on the genotype covariate of the tested SNP and is essentially a binary-trait version of the test proposed by Almasy and Blangero. Because none of the SNPs is in complete LD with all disease loci, type I error rates cannot be evaluated. HST-LD and LAMP-LD have similar power and are more powerful than Millstein-c.
Table 3. Power (in percentage) for testing linkage adjusted for association (T2)
Testing linkage or association (T3)
Table 2 presents results for tests of T3. The type I error rates (r2 = 0, θ = 0.5) of FBAT and Millstein-a are appropriate. FBAT is less powerful than Millstein-a for SNPs in low LD with the disease loci. Note that Millstein-a directly uses IBD between sib pairs in the model, while FBAT uses allelic transmission from parents to children, possibly explaining the difference in their ability to pick up the linkage signal.
With a dense SNP map, it is natural to ask whether all disease loci under a linkage peak have been identified, or whether their contributions to the phenotypic variation are fully explained by a small subset of SNPs in association with these disease loci. Recently, there has been much interest in testing whether a SNP can partially or fully account for the observed linkage evidence.
We examined several methods for joint linkage and association analysis and for identifying SNPs that explain the linkage evidence. For testing association in the presence of linkage and for testing linkage or association, all methods have appropriate type I error rates, apart from a slight inflation for HSTDT. For testing association in the presence of linkage, TDT is the most powerful, but only slightly more so than HSTDT, FBAT-e, Millstein-b, and LAMP-LE. For testing whether a SNP explains all the linkage evidence, HST-LD and LAMP-LD have similar power and are more powerful than Millstein-c; the difference in power may be explained by a difference in type I error, which we were unable to assess in the GAW15 data set because no single disease locus explained all of the linkage evidence. For testing linkage or association, we found that Millstein-a was more powerful than FBAT. These conclusions may not extend to study designs other than nuclear families with two affected children and parental genotypes available, a requirement for HST, HSTDT, and the method of Millstein et al. Furthermore, the excessively high LOD score observed in this study may explain the slightly inflated type I error rate observed for HSTDT.
In this study, there are three disease loci in the linked region and they do not contribute equally to the linkage signal, with the DR locus having a major effect on affection status. A different scenario may lead to different results for methods that use linkage peak information to identify SNPs explaining the linkage evidence. However, for a complex disease, there may be multiple disease loci acting interactively, so methods that do not assume a single causal variant would be most helpful in identifying SNPs associated with disease loci. There is a great need for developing methods suitable for multiple disease loci.
The author(s) declare that they have no competing interests.
This work is supported in part by National Heart, Lung and Blood Institute's Framingham Heart Study Contract NO1-HC-25195 (QY), and the Brigham Rheumatoid Arthritis Sequential Study funded by Millennium Pharmaceuticals (JC). The computing was conducted on the Linux Cluster for Genetic Analysis funded by the NIH (National Center for Research Resources) Shared Instrumentation grant (1S10 RR163736-01A1).
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.