We analyzed a case-control data set for chromosome 18q from the Genetic Analysis Workshop 15 to detect susceptibility loci for rheumatoid arthritis (RA). A total number of 460 cases and 460 unaffected controls were genotyped on 2300 single-nucleotide polymorphisms (SNPs) by the North American Rheumatoid Arthritis Consortium. Using a multimarker approach for association mapping under the framework of the Malecot model and composite likelihood, we identified a region showing significant association with RA (p < 0.002) and the predicted disease locus was at a genomic location of 53,306 kb with a 95% confidence interval (CI) of 53,295–53,331 kb. A common haplotype in this region was protective against RA (p = 0.002). In another region showing nominal significant association (51,585 kb, 95% CI: 51,541–51,628 kb, p = 0.037), a haplotype was also protective (p = 0.002). We further demonstrated that reducing SNP density decreased power and accuracy of association mapping. SNP selection based on equal linkage disequilibrium (LD) distance generally produced higher accuracy than that based on equal kilobase distance or tagging.
Rheumatoid arthritis (RA) is a common chronic disease, with a moderately strong genetic component. Chromosome 18q has shown evidence for linkage in the U.S. and French linkage scans . The North American Rheumatoid Arthritis Consortium (NARAC) performed fine mapping on a 10-Mb region on 18q with a dense single-nucleotide polymorphism (SNP) map and the data were collected by the Genetic Association Workshop (GAW) 15 for Problem 2. Here we applied a novel association mapping approach based on the Malecot model and composite likelihood to identify disease associated regions and predict the locations of possible disease loci . Haplotype analysis on the candidate regions was performed. We also studied the effect of region length and SNP density on the accuracy of association mapping in comparison with our analysis of the simulated data in Problem 3 of GAW15 .
A total of 2300 SNPs in a 9,519.224 kb region of 18q were genotyped by NARAC in 460 cases of RA and 460 controls. Controls were recruited from a New York City population. Seven SNPs showing significant departure from Hardy-Weinberg equilibrium (HWE) in the control samples using a likelihood ratio chi-square test (χ2 ≥ 10) were removed, resulting in a total of 2293 SNPs . Further removal of 81 SNPs with a minor allele frequency (MAF) of less than 5% resulted in a total of 2212 SNPs for our main data analysis.
Physical locations of these SNPs were determined from build 35 (UCSC May 2004) of the human genome sequence. An LD map expressed in linkage disequilibrium (LD) units (LDUs) was created using the control samples with the LDMAP-cluster program, a parallel version of LDMAP program that rapidly constructed the map http://www.som.soton.ac.uk/research/geneticsdiv/epidemiology/ldmap/ webcite. LDU is determined by the product of the ε and distances in kilobases for an interval of two adjacent SNPs and is additive, where ε represents the exponential decline of LD with distance for that interval. The LD map length was 151.115 LDUs, which is essentially the same as the 2293 SNPs containing rare SNPs.
We also used the LD map built from the CEU samples of the HapMap Phase II data http://www.som.soton.ac.uk/research/geneticsdiv/epidemiology/LDMAP/map2.htm webcite. The same region on the CEU LD map contains 8086 SNPs with a length of 202 LDUs. Despite its higher SNP density, 185 SNPs were missing and therefore their LDU locations were linearly interpolated. However, alternative LD maps did not seem to exert a significant effect on the results (data not shown), and we hereby only report the results using the LD map constructed from the GAW15 data.
The 10-Mb segment of 18q was divided into 14 non-overlapping consecutive regions. Each region had a minimum of 10 LDUs and 30 SNPs by default without breaking LD blocks and was analyzed individually. We also used a 5-LDU region length, resulting in a total of 26 regions for association analysis. In the Malecot model, association is a function of several parameters, the most important of which is S, the predicted location of the disease variant . Composite likelihood combines information of all pairwise marker-disease associations in each region. S and its 95% confidence interval (CI) are estimated by fitting the model to the data and maximizing the composite likelihood. Significance tests are carried out by contrasting two hierarchical models. Model A assumes no association and no parameters are estimated. Model D assumes an association and S and two other parameters are estimated with ε specified. The difference in the -2 natural log composite likelihood (denoted as Λ) between the two models (denoted as ΛA - ΛD) is a statistic monotonic to a chi-square with 3 degrees of freedom (). A permutation test was performed for each region with hundreds of replicates under the null hypothesis of no association by shuffling case-control status to obtain an empirical p-value . The specified value for ε of 1.0543 was obtained by fitting the LD map to the genotype data for the control samples. However, similar values for ε did not seem to have an appreciable effect on the results (data not shown).
This approach has been implemented in the CHROMSCAN program and a parallel version, CHROMSCAN-cluster, based on cluster computing was used for permutations with 1000 replicates http://www.som.soton.ac.uk/research/geneticsdiv/epidemiology/chromscan/ webcite. Pearson's χ2s were obtained for allelic associations between single SNPs and RA.
Haplotype analysis for candidate regions
Haplotype analysis using the PHASE program version 2  was performed for candidate regions showing nominal significant association. The five most common haplotypes and their frequencies were compared between cases and controls. A chi-square test was applied to identify significant associations by testing each haplotype in turn against all others, including rare ones.
SNP density and region length
To generate different SNP density, we used Tagger in Haploview software to select tagging SNPs based on pairwise LD (r2) using the control samples with rare SNPs included . For comparison, we selected the same number of SNPs as Tagger but by equidistance in LDU or kilobases. To do this, SNPs with the same LDU were reassigned LDU locations by linear interpolation (tilting) so that every SNP would have a unique location. By centering at the "disease locus", we also studied the effect of LDU region length on the results.
Results and discussion
Association mapping of disease locus
We found a nominally significant association between region 5 and RA. The estimated location of the disease locus S1 was at 53,306 kb near a SNP of global maximal chi-square (rs3745064, χ2 = 12.25, p = 0.00047) using the LD map with a 10-LDU region length (p = 0.002). After Bonferroni correction for 14 regions the results were still statistically significant (pc = 0.02). Removing this SNP did not change the results. However, inclusion of SNPs with MAF < 5% resulted in a wider 95% CI (53,274–53,342 kb, point location at 53,307 kb). Figure 1 shows that the estimated location for the disease variant was in a 10-kb LD block where a cluster of SNPs showed modest association with RA. At such a significance level, none of the SNPs were statistically significant after correction for multiple comparisons. This is in contrast to multimarker approaches, in which one cluster of SNPs is considered at a time in light of LD among nearby SNPs, markedly reducing the number of tests.
Figure 1. An LD map in relation to the putative disease loci S1and S2. The details of the SNPs and LD patterns for the regions around S1 and S2 are enlarged in the upper diagrams. The vertical black solid line indicates the location of the point estimate within the 95% CI. The black dots on the map represent SNPs showing nominal significant association with RA (p < 0.05) and the gray dots represent SNPs showing no association.
At 5-LDU region length with a total of 26 regions, we found a locus (S2) at 51,585 kb showing nominal significant association (p = 0.04, Table 1 and Figure 1). Similar results were obtained using the data with rare SNPs (MAF < 5%) included. In this region there was also a cluster of SNPs associated with RA. However, consideration of multiple testing this region would not be statistically significant.
Table 1. Association mapping of RA susceptibility loci
Haplotype analysis for candidate regions
Haplotype analysis was performed on sub-regions containing S1 and S2 with the majority of nominally associated SNPs included. S1 sub-region (53,297–53,312 kb, 0.043 LDUs) contained 16 SNPs (Table 2) and S2 sub-region (51,556–51,616 kb, 0.368 LDUs) contained 21 SNPs (Table 3). Haplotypes H1 and H3 at S1 and H5 at S2 appeared to be significantly associated with RA, with the first two haplotypes being almost complementary. Both H1 at S1 and H5 at S2 showed protective effects against RA.
Table 2. Common haplotype analysis of the S1 candidate region
Table 3. Common haplotype analysis of the S2 candidate region
Further analyses of S1 categorized all individuals into three groups, H1/H1, H1/H- and H-/H-, where H- is a haplotype other than H1. There was a significant association between haplotype pairs and disease status ( = 10.3, p = 0.006). An individual carrying H1/H1 had a lower risk of RA than those carrying H1/H- or H-/H-. The odds ratios (ORs) were 0.58 (95% CI: 0.37–0.92) and 0.85 (0.54–1.34) for an individual carrying H1/H1 or H1/H- compared to H-/H- haplotypes, respectively. We performed analyses conditional on whether an individual carried H1/H1. Interestingly, H5 appeared to be significant only in H1/H- or H-/H- carriers (χ2 = 7.647, p < 0.05), but not in H1/H1 carriers (χ2 = 1.509), indicating a possible interaction between the two haplotypes.
Genes and mRNA at the candidate regions
The UCSC genome browser (May 2004) was used to find genes and mRNAs within the 95% CI of loci S1 and S2. No known genes have been found nearby S1, but the area of the 95% CI for locus S1 contains four human mRNA (CR590917, AK021717, AK124558, and BC013134), two of which span the point estimate of S1. Therefore, this region might contain genes not yet identified. In addition, this area is highly conserved across species, implying functional importance of the genomic sequence. A known gene (AK127787) is within the 95% CI of S2, but is 10 kb away from its point estimate.
Region length and SNP density
Point estimates of S were identical for all region lengths centred at S1. The 95% CI was also relatively stable, with a slow increase with region length (Table 4). Enlarged region length compromised the significance levels, perhaps due to noise from distant SNPs, given that the informative SNPs were clustered in a rather small region. Computing time prolonged with increasing number of SNPs. Small region lengths, however, resulted in a heavy penalty for multiple testing. Four LDUs provided the most significant result for S1(Pc = 30 × 0.0008 = 0.02, Table 4).
Table 4. The impact of region length on association mapping
Our analysis of simulated data indicated that reduced SNP density decreased mapping accuracy, and SNP selection based on equal LD distance produced smaller location errors than that based on equal kilobase distance or tagging . Interestingly, among the three selection approaches for S1 region, Tagger selected the most number of SNPs while equal kilobase distance, the least number. SNPs selected by equal LDU distance generally provided the highest location accuracy (Table 5). Power was reduced with decreasing density, as indicated by the values of ΛA - ΛD (Table 5). Using the kilobase map resulted in higher location errors in most cases and lower ΛA - ΛD values, indicating reduced power in all circumstances compared with using the LD map (data not shown).
Table 5. SNP density and accuracy – selection by tagging or equidistance
We reported a significant association between a region of 18q and RA. The estimated genomic location of the disease variant was at 53,306 kb. The Malecot model and composite likelihood approach has narrowed the possible disease locus to a 36-kb candidate region. A haplotype significantly associated with reduced risk of RA was identified in this region. DNA sequences between 53,295–53,331 kb of this region are highly conserved in vertebrates. A haplotype around 51,585 kb was also identified as reducing the risk of RA. Further sequencing or functional studies may be helpful to identify the disease variants. Reducing SNP density decreases power and location accuracy. We also conclude that SNP selection based on equal LD distance can maximally retain the prediction accuracy of the disease loci than that based on equal physical distance or SNP tagging.
The author(s) declare that they have no competing interests.
T-YK, WL, and CH were supported by the Ph.D. studentships funded by the Taiwan Ministry of Education, University of Southampton, and Shanghai Jiaotong University, respectively. WZ was supported by the Institute of Cancer Research, Sutton, Surrey, UK.
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.
Ann Hum Genet 1999, 3:535-538. Publisher Full Text