Abstract
Although single chi-square analysis of the North American Rheumatoid Arthritis Consortium (NARAC) data identifies many single-nucleotide polymorphisms (SNPs) with p-values less than 0.05, none remain significant after Bonferroni correction. In contrast, CHROMSCAN evades heavy Bonferroni correction and autocorrelation between SNPs by using composite likelihood to model association across all markers in a region and permutation to assess significance. Analysis by CHROMSCAN identifies a 36-kb interval that includes the most significant SNP (msSNP) observed in a 10-Mb target suggested by linkage. Unexpectedly, stratification by gender and age of onset shows that association evidence comes almost entirely from females with age of onset less than 40. Combining evidence from a meta-analysis of linkage studies and three subsets of the NARAC data provides significant evidence for a determinant of rheumatoid arthritis in a 36-kb interval and illustrates the principle that estimates of location and its information are more powerful than estimates of p-values alone.
Background
Initially, linkage mapping dealt with rare and highly penetrant genes. Without cytogenetic assignment, the preferred strategy was segregation analysis to determine all relevant parameters except recombination, followed by linkage analysis to determine recombination frequency [1]. Complex inheritance with uncertain segregation parameters proved much more difficult, giving rise to many unconfirmed claims based on microsatellites and leading to meta-analysis without point locations [2]. The HapMap project provides dense SNPs that can be used to localize causal loci with or without pedigrees. This procedure, called association mapping, revolutionized identification of disease genes. Recent developments of linkage disequilibrium units (LDU), composite likelihood, control of autocorrelation, and meta-analysis are incorporated into the CHROMSCAN program [3,4] to increase its precision for association mapping. Here we use these methods to establish the location and weight of evidence for a gene predisposing to rheumatoid arthritis.
Methods
Data preparation
The data, provided by NARAC (North American Rheumatoid Arthritis Consortium), consist of 2300 single-nucleotide polymorphisms (SNPs) in a 10-Mb region of 18q21 with linkage evidence in U.S. and French scans [5]. Illumina genotyped these markers in 460 cases and 460 controls, matched for age and gender, from New York. The genotypic data for controls were screened and 7 SNPs with $\chi^2$ > 10 for the Hardy-Weinberg test [6] were removed, leaving 2293 to be analyzed. CHROMSCAN requires SNPs to be located on both physical and LDU scales. Physical locations were taken from build 35 of the human genome sequence. Unlike physical maps, LDU maps are study-specific, and various LDU maps are available, corresponding to the four HapMap samples separately and combined (CEU, CHB, JPT, YRI, and cosmopolitan). The LDU map with the highest SNP density and population attributes closest to the experimental data should be optimal. We therefore used LDU locations relative to the CEU HapMap data, which have a density of 1 SNP per 863 bp compared to 1 SNP per 4139 bp in the NARAC data. We also used the kilobase map to determine the robustness and power of LDU maps compared with physical maps.
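The Hardy-Weinberg screen of controls can be illustrated with a short sketch. It assumes a one-degree-of-freedom Pearson chi-square on genotype counts; the function name `hwe_chisq`, the SNP labels, and the genotype counts are hypothetical, and the procedure in [6] may differ in detail.

```python
def hwe_chisq(n_aa, n_ab, n_bb):
    """Pearson chi-square (1 df) for departure from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip genotype classes with zero expectation to avoid division by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Hypothetical genotype counts (AA, AB, BB) for two SNPs in 460 controls.
snps = {"snp_ok": (115, 230, 115), "snp_bad": (200, 60, 200)}

# Keep only SNPs whose chi-square does not exceed the threshold of 10.
kept = [name for name, counts in snps.items() if hwe_chisq(*counts) <= 10]
```

Here `snp_bad` shows a large heterozygote deficit and would be removed, while `snp_ok` matches Hardy-Weinberg proportions exactly.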
LDU map construction
The theory for constructing LDU maps has been described [7]. Briefly, the LDU distance for the $i$th SNP interval is given by $\varepsilon_i d_i$, where $\varepsilon_i$ describes the exponential decline of association with physical distance $d_i$ in kb. Values of $\varepsilon_i$ are estimated by composite likelihood that fits the Malecot model [8] to multiple pairwise diplotype data. The Malecot equation, given by $\rho = (1 - L)Me^{-\varepsilon d} + L$, uses additional parameters to describe association at the last major bottleneck ($M$) and residual association at large distance ($L$) to predict rho ($\rho$), the probability of association.
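As an illustration of how LDU locations accumulate, the following sketch assumes the commonly cited form of the Malecot model, $\rho = (1 - L)Me^{-\varepsilon d} + L$, and takes the per-interval $\varepsilon_i$ as already estimated. The function names and numerical values are hypothetical, and no composite likelihood fitting is attempted.

```python
import math


def malecot_rho(d_kb, epsilon, M, L):
    """Expected association rho at physical distance d_kb (assumed Malecot form)."""
    return (1.0 - L) * M * math.exp(-epsilon * d_kb) + L


def ldu_locations(epsilons, distances_kb):
    """Cumulative LDU map: interval i contributes epsilon_i * d_i."""
    locations = [0.0]
    for eps, d in zip(epsilons, distances_kb):
        locations.append(locations[-1] + eps * d)
    return locations


# Hypothetical example: three SNPs separated by 100 kb and 50 kb.
ldu_map = ldu_locations([0.01, 0.02], [100.0, 50.0])
```

In this example the two intervals each contribute about one LDU even though their physical lengths differ, which is the sense in which the LDU scale reflects the decline of association and not distance in kb.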
Association mapping
The CHROMSCAN program [3] uses a model similar to LDU maps except that the exponential term is replaced by $\varepsilon\Delta(S_i - S)$ to estimate the location ($S$) of a disease gene, where $S_i$ is the location of the $i$th marker in kilobases or LDU. The Kronecker $\Delta$ is used for map direction and assures a correct sign, with $\Delta = 1$ if $S_i \geq S$ or $-1$ if $S_i < S$. To calculate the expected association with distance, $z_i$, the model becomes $z_i = (1 - L)Me^{-\varepsilon\Delta(S_i - S)} + L$, where $M$ is diminished by complex inheritance and $L$ is the association at large distance. The observed association $\hat{z}_i$ is determined by a 2 × 2 table between affection status and the two alleles of each SNP, with cell counts $a$, $b$, $c$, and $d$, where $ad - bc \geq 0$ and $b \leq c$ is ensured by rearrangement of columns and rows [9]. Given the observed associations $\hat{z}_i$, the Malecot parameters are estimated iteratively using composite likelihood, which evades a heavy Bonferroni correction by combining information over all loci within a region as $\Lambda = \sum_i K_i(\hat{z}_i - z_i)^2$, where $\hat{z}_i$ and $z_i$ are the observed and expected association values, respectively, at the $i$th SNP. Their squared difference is weighted by information ($K_i$), which is estimated as $K_i = \chi^2_1/\hat{z}_i^2$, where $\chi^2_1$ is the Pearson $\chi^2$ from the 2 × 2 table.
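The composite likelihood can be sketched as an information-weighted sum of squared deviations between observed and expected association. The sketch below assumes the expected association takes the form $z_i = (1 - L)Me^{-\varepsilon\Delta(S_i - S)} + L$ and treats the observed values and information weights as given inputs. The grid search is a crude stand-in for the iterative estimation used by CHROMSCAN, and all function names are hypothetical.

```python
import math


def expected_z(s_i, S, epsilon, M, L):
    """Expected association at marker location s_i for a disease locus at S."""
    delta = 1.0 if s_i >= S else -1.0  # keeps the exponent non-positive
    return (1.0 - L) * M * math.exp(-epsilon * delta * (s_i - S)) + L


def composite_lambda(locations, z_obs, info, S, epsilon, M, L):
    """Sum over SNPs of K_i * (observed - expected)^2."""
    return sum(
        k * (z - expected_z(s, S, epsilon, M, L)) ** 2
        for s, z, k in zip(locations, z_obs, info)
    )


def grid_search_S(locations, z_obs, info, epsilon, M, L, steps=200):
    """Scan candidate locations for the S that minimizes Lambda (other parameters fixed)."""
    lo, hi = min(locations), max(locations)
    candidates = [lo + (hi - lo) * j / steps for j in range(steps + 1)]
    return min(
        candidates,
        key=lambda S: composite_lambda(locations, z_obs, info, S, epsilon, M, L),
    )
```

A full analysis would estimate $M$, $S$, and $L$ jointly; fixing $M$ and $L$ here only keeps the example short.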
Subhypotheses of the Malecot model are used to test for a causal polymorphism. Model A, which estimates none of the parameters and uses $M = 0$ with predicted $L$ [10], is taken as the null hypothesis $H_0$ in which there is no association between affection status and SNPs. Model D estimates $M$, $S$, and $L$. Therefore the $\Lambda_A - \Lambda_D$ comparison tests for a disease determinant at location $S$. For both models, $\varepsilon$ is fixed to 1 for the LDU map and to a value of $\varepsilon$ determined from pairwise marker-by-marker association data for the kilobase map. In order to account for autocorrelation between SNPs as a result of LD, the significance of evidence is determined by a rank-based permutation test [3].
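The idea of assessing significance by permutation can be illustrated with a generic label-shuffling test. This is not the rank-based procedure of [3]; it is a plain permutation of affection status, chosen only to show why autocorrelation is handled: each individual's genotypes stay together, so LD between SNPs is preserved in every replicate. The statistic `mean_difference` and the data are hypothetical.

```python
import random


def mean_difference(status, allele_counts):
    """Absolute difference in mean allele count between cases (1) and controls (0)."""
    cases = [g for s, g in zip(status, allele_counts) if s == 1]
    controls = [g for s, g in zip(status, allele_counts) if s == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def permutation_p_value(statistic, status, data, replicates=1000, seed=0):
    """Empirical p-value from shuffling affection status across individuals."""
    rng = random.Random(seed)
    observed = statistic(status, data)
    shuffled = list(status)
    exceed = 0
    for _ in range(replicates):
        rng.shuffle(shuffled)
        if statistic(shuffled, data) >= observed:
            exceed += 1
    # Adding 1 to numerator and denominator avoids reporting p = 0.
    return (exceed + 1) / (replicates + 1)


# Hypothetical data: 10 cases and 10 controls with perfectly separated genotypes.
status = [1] * 10 + [0] * 10
allele_counts = [2] * 10 + [0] * 10
p_value = permutation_p_value(mean_difference, status, allele_counts, replicates=200)
```

With R replicates the smallest attainable p-value in this sketch is 1/(R + 1), which is one reason small p-values require many replicates.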
Three separate analyses of the data were performed by CHROMSCAN. The first is a preliminary screen of the entire 10-Mb bin, which is divided into 18 non-overlapping regions, each with at least 30 SNPs and covering at least 10 LDUs. To determine accurate levels of significance, the number of permutation replicates must approach the reciprocal of the actual level of significance so that interpolation of the variance under $H_1$ is reliable. To minimize computation time, the initial analysis was restricted to 100 replicates. Significant regions identified by the initial screen were reanalyzed separately using 1000 and 5000 replicates in order to verify convergence. To demonstrate the power of LDU maps, this analysis was repeated using the kilobase map and two estimates of the exponential decline $\varepsilon$ derived from the significant region and the 10-Mb region [11]. The risk for rheumatoid arthritis is elevated in females, especially with late onset (≥35–≤60) [12]. Our third analysis therefore stratified cases into three groups corresponding to males, females with onset ≤39, and females with onset ≥40. The partition of females around an onset age of 40 was chosen to give approximately equal numbers of 'early' and 'late' onset cases. Unaffected controls for the three groups were all males (with similar age and total number of individuals as affected males), and females divided by current age to give similar total numbers of individuals as cases, respectively. This analysis was restricted to significant regions from the initial screen and used 5000 replicates.
Results
Association mapping
Single chi-square analysis of the 10-Mb region identifies 125 SNPs with p < 0.05, none of which reach significance after Bonferroni correction (0.05/2293). The initial screen by CHROMSCAN divides the 18q21 bin into 18 non-overlapping regions. Although the most significant SNP (msSNP, rs3745064) occurs in region 6, the next msSNP in region 11 is deceptively close in terms of significance, and several other regions contain suggestive SNPs (Table 1). In contrast, the composite likelihood approach, which models association across all markers in a region, identifies region 6 as the only significant region (p = 0.01259). The intensive screen of region 6 identified a large increase in significance between 100 and 1000 replicates, which is attributed to the relationship between number of replicates and significance, while the small decrease in significance between 1000 and 5000 replicates suggests that convergence has been achieved (Table 2). These analyses estimate a causal locus ($S$) at 53308 kb.
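The arithmetic behind the single-SNP result is worth making explicit. The sketch below uses only the counts stated in the text (2293 SNPs, nominal level 0.05); the variable and function names are hypothetical.

```python
n_snps = 2293
alpha = 0.05

# Per-SNP threshold required for experiment-wide significance at alpha.
bonferroni_threshold = alpha / n_snps

# Number of SNPs expected to reach p < alpha if no SNP were associated.
expected_by_chance = alpha * n_snps


def passes_bonferroni(p_value, n_tests=n_snps, level=alpha):
    """True if a single-test p-value survives Bonferroni correction."""
    return p_value < level / n_tests
```

The threshold is roughly 2.2e-5, and about 115 SNPs would be expected to reach p < 0.05 by chance alone, so the 125 nominally significant SNPs are only slightly more than the null expectation.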
The CHROMSCAN analysis of region 6 was repeated using the kilobase map so that its performance can be compared with the LDU map. Using a kilobase map requires specification of the exponential decline $\varepsilon$ [11]. Two values of $\varepsilon$, corresponding to the 10-Mb interval (0.021) or region 6 alone (0.031), were investigated. Despite the large difference between $\varepsilon$ values for the kilobase map, the significance level and location were almost identical. However, the ratios of $\chi^2$ indicate that the kilobase maps have a relative efficiency of 75% compared with an LDU map at 1000 replicates (Table 2).
Because King et al. [12] demonstrated that the risk for rheumatoid arthritis is elevated in females, especially with late onset, we stratified cases into three groups according to sex and age of onset. The effect of this stratification is highly suggestive despite its crudeness (Table 3) and small sample sizes. Females with onset ≤39 account for most of the association. The other two classes give such small chi-square values that they would undoubtedly be assigned to other regions if the partition test had not been restricted to region 6 on the pooled evidence. However, when considering region 6 alone, there is remarkable agreement between point estimates for 'early' and 'late' onset females and those from males. At this time it is impossible to say whether this consistency is caused by imperfectly divided onset groups or a small effect at late age.
Table 3. Stratification by gender and age of onset (5000 replicates)
Linkage
Choi et al. [13] reported a meta-analysis of four linkage studies with microsatellites in a 10-Mb bin of chromosome 18. The results from this study were reported as p-values without estimates of location or standard errors. Without this information, the power for meta-analysis is reduced because the sum of two $\chi^2$ values must be converted back to $\chi^2_1$ and $\mathrm{LOD}_1$ instead of weighting estimates of location by their information. Perhaps because of this inefficiency, the combined $\mathrm{LOD}_1$ from this meta-analysis is 1.542, well below the conventional value of 3 for asserting significance. The corresponding p-value in large-sample theory is 0.007714, providing strong but inconclusive evidence for localization in the 18q21 region. Despite its limitations, linkage contributes evidence that should not be ignored.
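The correspondence between the LOD and the large-sample p-value quoted above can be checked numerically. The sketch assumes that $\mathrm{LOD}_1$ denotes a one-degree-of-freedom chi-square divided by $2\ln 10$, an assumption that is consistent with the values in the text; the function names are hypothetical.

```python
import math


def lod_to_chisq(lod):
    """Convert a LOD to a chi-square, assuming chi-square = 2 ln(10) * LOD."""
    return 2.0 * math.log(10.0) * lod


def chisq1_p_value(chisq):
    """Upper-tail probability of a chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(chisq / 2.0))


# LOD of 1.542 from the linkage meta-analysis.
p_linkage = chisq1_p_value(lod_to_chisq(1.542))
```

Under this assumption a LOD of 1.542 corresponds to a chi-square of about 7.1 and a p-value of about 0.0077, in agreement with the reported 0.007714 up to rounding.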
Joint significance of linkage and association
The simplest meta-analysis is based on $n$ independent samples, the $i$th of which contributes a $P_i$ value that on the null hypothesis is uniformly distributed. Then $-2\ln P_i$ would be distributed as $\chi^2_2$, with $-2\sum_{i=1}^{n}\ln P_i$ distributed as $\chi^2_{2n}$. This is the only test applicable to data that do not provide an estimate of location $S_i$ and information $K_i$, but it has three disadvantages: first, equal weight is given to samples with different standard errors; second, there is no test of homogeneity; and third, there is no point estimate to become more precise as $n$ increases. As a consequence, much information is lost. Accepting these limitations and assuming accuracy of the $P$ estimates, Table 4 shows that combining pooled association with linkage provides suggestive evidence to assign a gene for rheumatoid arthritis to the 18q21.31 interval. The $\mathrm{LOD}_1$ with no Bonferroni correction is 2.676 for linkage and pooled association. When location and information weight are available, the evidence for association is combined by determination of the difference between $\chi^2_n$ with $n$ degrees of freedom and $\chi^2_{n-1}$, which tests for heterogeneity with $n - 1$ degrees of freedom. When the stratified association samples are combined in this manner, the heterogeneity test is negligible. As expected, power is increased when pooled with linkage ($\mathrm{LOD}_1$ = 3.401, p = 0.000076). Even with conservative adjustment of the p-value to account for the 18 regions tested by association (18 × 0.000569), and despite strong, although not formally significant, evidence from linkage for at least one causal gene in the 18 regions, the meta-analysis is supportive ($\mathrm{LOD}_1$ = 2.327, p = 0.001062). We conclude that evidence for region 6 is probative, with linkage and association both providing critical evidence despite lack of a point estimate and information weight for linkage.
Table 4. Meta-analysis of association (5000 replicates) and linkage
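The simplest combination of independent p-values described above is Fisher's method, which can be sketched with the standard library alone. The closed-form tail used here holds only for even degrees of freedom. The function names are hypothetical, the inputs are illustrative, and the sketch is not claimed to reproduce the entries of Table 4, whose exact inputs are not given in the text.

```python
import math


def fisher_combined(p_values):
    """Fisher's method: return (chi-square, degrees of freedom, combined p)."""
    n = len(p_values)
    chisq = -2.0 * sum(math.log(p) for p in p_values)
    half = chisq / 2.0
    # Upper tail of a chi-square with 2n df: exp(-x/2) * sum_{k<n} (x/2)^k / k!
    tail = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(n))
    return chisq, 2 * n, tail


def bonferroni_adjust(p_value, n_tests):
    """Conservative adjustment of a p-value for n_tests comparisons."""
    return min(1.0, p_value * n_tests)


# Adjustment quoted in the text for the 18 regions tested by association.
adjusted_association_p = bonferroni_adjust(0.000569, 18)

# Illustrative combination of the linkage p-value with the adjusted association p-value.
chisq, df, combined_p = fisher_combined([0.007714, adjusted_association_p])
```

Because every sample enters with equal weight, this calculation shows the first disadvantage listed above: it cannot favor a sample with a smaller standard error.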
Discussion
This application demonstrates that CHROMSCAN is a powerful approach for gene mapping in complex inheritance, which is applicable to meta-analysis. Obvious extensions include identification of a causal locus and more precise definition of the phenotype associated with it. The 95% confidence interval, given by S ± 1.96 (SE), covers 36 kb between 53296 and 53332 kb and includes the msSNP rs3745064. Although no described genes are within this region, it does include four human mRNAs from GenBank: CR590917, AK021217, AK124558, and BC01314, all to the left of the point estimate (S). Of these, CR590917 appears to be the most interesting because it is expressed within T cells and could therefore conceivably affect risk for rheumatoid arthritis. Finally, geneid [14] and Genscan [15] predict a similar gene, which is the closest annotated sequence to the point estimate (S). However, nothing is known about the function of this gene and its reliability is questionable. The fascinating directions revealed by these findings have yet to be explored. Ultimately, interaction with other contributing loci and environmental factors will be recognized and, more importantly, locus-specific treatment will be found.
Recent papers testify to growing interest in meta-analysis, looking backward to linkage rather than forward to association mapping. Rank permutation provides a valid significance test, but the genome search meta-analysis (GSMA) that uses regional assignment with arbitrary weights cannot give a reliable estimate of effect and therefore has low power for estimating point location and detecting heterogeneity [16,17]. Most of the few papers on association mapping assume family data, which are rarely feasible for diseases of late onset, and are restricted to single markers without composite likelihood to estimate both location S and its information K. One manuscript presented in GAW15 that used meta-analysis without those estimates failed to detect the strong signal on chromosome 18q demonstrated by composite likelihood [18].
Competing interests
The author(s) declare that they have no competing interests.
Acknowledgements
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/17536561/1?issue=S1.
References

1. Morton NE: Sequential tests for the detection of linkage. Am J Hum Genet 1955, 7:277-318.
2. Levinson DF, Levinson MD, Segurado R, Lewis CM: Genome scan meta-analysis of schizophrenia and bipolar disorder. Part I: Methods and power analysis. Am J Hum Genet 2003, 73:17-33.
3. Morton NE, Maniatis N, Zhang W, Ennis S, Collins A: Genome scanning by composite likelihood. Am J Hum Genet 2007, 80:19-28.
4. CHROMSCAN [http://www.som.soton.ac.uk/research/geneticsdiv/epidemiology/chromscan/]
5. Amos CI, Chen WV, Lee A, Li W, Kern M, Lundsten R, Batliwalla F, Wener M, Remmers E, Kastner DA, Chrisiwell LA, Seldin MF, Gregersen PK: High-density SNP analysis of 642 Caucasian families with rheumatoid arthritis identifies two new linkage regions in 11p12 and 2q33. Genes Immun 2006, 7:277-286.
6. Gomes I, Collins A, Lonjou C, Thomas NS, Wilkinson J, Watson M, Morton N: Hardy-Weinberg quality control. Ann Hum Genet 1999, 63:535-538.
7. Maniatis N, Collins A, Xu CF, McCarthy LC, Hewett DR, Tapper W, Ennis S, Ke X, Morton NE: The first linkage disequilibrium (LD) maps: delineation of hot and cold blocks by diplotype analysis. Proc Natl Acad Sci USA 2002, 99:2228-2233.
8. Collins A, Morton NE: Mapping a disease locus by allelic association. Proc Natl Acad Sci USA 1998, 95:1741-1745.
9. Maniatis N, Morton NE, Gibson J, Xu CF, Hosking LK, Collins A: The optimal measure of linkage disequilibrium reduces error in association mapping of affection status. Hum Mol Genet 2005, 14:145-153.
10. Morton NE, Zhang W, Taillon-Miller P, Ennis S, Kwok PY, Collins A: The optimal measure of allelic association. Proc Natl Acad Sci USA 2001, 98:5217-5221.
11. Lau W, Kuo TY, Tapper W, Cox S, Collins A: Exploiting large scale computing to construct high resolution linkage disequilibrium maps of the human genome. Bioinformatics 2007, 23:517-519.
12. King RA, Rotter JI, Motulsky AG: The Genetic Basis of Common Disease. New York: Oxford University Press; 1992:598-599.
13. Choi SJ, Rho YH, Ji JD, Song GG, Lie YH: Genome scan meta-analysis of rheumatoid arthritis. Rheumatology 2006, 45:166-170.
14. Blanco E, Parra G, Guigó R: Using geneid to identify genes. In Current Protocols in Bioinformatics. Edited by Baxevanis AD, Davison DB. New York: John Wiley & Sons Inc; 2002:1-26.
15. Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol 1997, 268:78-94.
16. Zintzaras E, Kitsios G: Identification of chromosomal regions linked to premature myocardial infarction: a meta-analysis of whole-genome searches. J Hum Genet 2006, 51:1015-1021.
17. Lewis CM, Levinson DF: Testing for genetic heterogeneity in the genome search meta-analysis method. Genet Epidemiol 2006, 30:348-355.
18. Segurado R, Hamshere ML, Glaser B, Nikolov I, Moskvina V, Holmans P: Combining linkage datasets for meta-analysis and mega-analysis: the GAW15 rheumatoid arthritis data set.