Representation of genetic association via attributable familial relative risks in order to identify polymorphisms functionally relevant to rheumatoid arthritis

Bermejo, Justo Lorenzo; Fischer, Christine; Schulz, Anke; Cremer, Nadine; Hein, Rebecca; Beckmann, Lars; Chang-Claude, Jenny; Hemminki, Kari

doi:10.1186/1753-6561-3-S7-S10

Volume 3 Supplement 7

Genetic Analysis Workshop 16

Proceedings
Open access
Published: 15 December 2009

Representation of genetic association via attributable familial relative risks in order to identify polymorphisms functionally relevant to rheumatoid arthritis

Justo Lorenzo Bermejo^1,2,
Christine Fischer³,
Anke Schulz⁴,
Nadine Cremer⁴,
Rebecca Hein⁴,
Lars Beckmann⁴,
Jenny Chang-Claude⁴ &
…
Kari Hemminki^2,5

BMC Proceedings volume 3, Article number: S10 (2009) Cite this article

1338 Accesses
2 Citations
Metrics details

Abstract

The results from association studies are usually summarized by a measure of evidence of association (frequentist or Bayesian probability values) that does not directly reflect the impact of the detected signals on familial aggregation. This article investigates the possible advantage of a two-dimensional representation of genetic association in order to identify polymorphisms relevant to disease: a measure of evidence of association (the Bayes factor, BF) combined with the estimated contribution to familiality (the attributable sibling relative risk, λ_s). Simulation and data from the North American Rheumatoid Consortium (NARAC) were used to assess the possible benefit under several scenarios. Simulation indicated that the allele frequencies to reach the maximum BF and the maximum attributable λ_s diverged as the size of the genetic effect increased. The representation of BF versus attributable λ_s for selected regions of NARAC data revealed that SNPs involved in replicated associations clearly departed from the bulk of SNPs in these regions. In the 12 investigated regions, and particularly in the low-recombination major histocompatibility region, the ranking of SNPs according to BF differed from the ranking of SNPs according to attributable λ_s. The present results should be generalized using more extensive simulations and additional real data, but they suggest that a characterization of genetic association by both BF and attributable λ_s may result in an improved ranking of variants for further biological analyses.

Introduction

Susceptibility to rheumatoid arthritis (RA) is determined by both genetic and environmental factors, with an estimated sibling relative risk of 5-10. The HLA-DRB1 and PTPN22 genes explain approximately 50% of the familial aggregation. A recent genome-wide association (GWA) study carried out by the Wellcome Trust Case Control Consortium (WTCCC) found nine single-nucleotide polymorphisms (SNPs) with associated probability values of less than 10^-5 that did not map to loci previously related to RA [1]. In agreement with the common disease-common variant hypothesis, the novel variants were common (minor allele frequencies from 0.07 to 0.41) and they showed genotype relative risks (GRRs) of less than 2.3. The authors of the WTCCC article recognized a disparity between the large population-attributable fraction (PAF) and the modest sibling relative risk (λ_s) explained by the newly identified variants [1]. This disparity is related to the technical characteristics of the platforms used in GWA studies, designed to scrutinize polymorphisms (common variants).

Let us assume that a polymorphism is a marker of a rarer causal variant, and the marker-specific PAF equals the PAF of the causal variant. Under this assumption, we have demonstrated that the causal variant has a higher attributable λ_s than the marker [2]. In other words: for a fixed PAF, causal variants contribute more to familial aggregation (λ_s) than the polymorphisms they are represented by.

There is much debate concerning the representation of statistical evidence in GWA studies. The frequentist p-value depends on the power of the study and it does not provide a consistent measure of evidence because identical p-values do not imply identical evidence of genetic association [3]. In contrast, the Bayes factor (BF) statistic provides a single measure of the strength of evidence of association.

We hypothesize here that the representation of genetic association by BF together with the attributable λ_s may help to identify polymorphisms relevant to RA, and use simulation and data from the North American Rheumatoid Consortium (NARAC) to evaluate this assumption.

Methods

Derivation of attributable risks and Bayes factors

The sibling relative risk (λ_s) attributable to a gene reflects the extent to which a particular genetic locus contributes to familial aggregation. The calculation of attributable λ_s values was first described in an early paper of James [4]. If the frequency of the high-risk allele is denoted by RAF, the relative risk of RA for high-risk allele homozygotes compared to low-risk allele homozygotes by GRR_hom, the relative risk of heterozygotes compared to low-risk allele homozygotes by GRR_het and the disease prevalence among low-risk allele homozygotes by κ₀, the attributable λ_s is given by: λ_s = 1+(1/2V_a+1/4V_d)/K², where V_a (additive genetic variance divided by κ₀²) equals 2RAF(1-RAF)[(1-RAF)(1-GRR_het)-RAF(GRR_hom-GRR_het)]², V_d (dominance genetic variance divided by κ₀²) equals RAF²(1-RAF)²[1+GRR_hom-2GRR_het]² and K (population prevalence divided by κ₀) equals RAF²GRR_Hom+2RAF(1-RAF)GRR_het+ (1-RAF)². Note that if the overall disease prevalence is 50%, λ_s cannot exceed two by definition. However, an important property of the attributable λ_s is that it only depends on RAF, GRR_hom, and GRR_het. That is, the attributable λ_s is independent of the disease prevalence among low-risk homozygotes κ₀.

The BF statistic is the ratio of the probability of the observed data under the assumption that there is a true association to its probability under the null hypothesis (absence of association). A small BF provides evidence in favor of a true association. To investigate the relationship between attributable λ_s and the BF, we calculated the expected distribution of genotypes in 1000 cases and 1000 controls when GRR_hom = 1.5, 2, 3, 4, 5 and 10. Calculations were carried out for the dominant model, which assumes that GRR_het = GRR_hom, the recessive model (GRR_het = 1), and for the additive model (GRR_het = (1+GRR_hom)/2). Under Hardy-Weinberg equilibrium, the expected distribution of genotypes in controls (D = 0) is: Pr(G = aa|D = 0) = (1-RAF)²; Pr(G = Aa|D = 0) = 2RAF(1-RAF); Pr(G = AA|D = 0) = RAF² The expected distribution of genotypes in cases (D = 1) satisfies:

The expected distributions of genotypes were investigated by Bayesian logistic regression. We considered a general three-genotype model of association. Calculation of BFs requires assumptions about effect sizes. We assumed N(0,1) priors on the mean and the two genetic effects. The function MCMClogit (R package MCMCpack [5]) was used to generate a 10,000 sample dataset from the posterior distribution using a random walk Metropolis algorithm. The BayesFactor function in the same package was used to compute log marginal likelihoods for a model 'with' compared to a model 'without' genotypic information using a Laplace approximation.

Investigated regions

This study investigated regions around the 12 signals detected in the WTCCC study with associated probability values of less than 10^-5 (Table 1). The left side of Table 1 represents the 12 selected regions from the WTCCC study. For example, the most associated marker on 1p13 was rs6679677. We retrieved NARAC data from 1000-kb regions centred on the positions of the 12 signals from the WTCCC study. For example, our first region comprised the chromosomal interval (10,901,850::11,901,850) around rs6679677. The WTCCC study reported two SNPs in 6p21 that resulted in two overlapping intervals which were independently investigated.

Table 1 Description and results from the 12 investigated regions

Full size table

Calculation of attributable risks and Bayes factors for NARAC data

SNPs in the 12 selected intervals were extracted from NARAC data, and the association between RA and the retrieved SNPs was represented by 1) the base 10 logarithm of the BF (log₁₀BF) and 2) the attributable λ_s. RA status (affected or unaffected) was modelled by logistic regression, taking into account the covariates gender, number of HLA-DRB1 shared-epitope alleles (NN = 0, SN = 1 or SS = 2) and genotype (three levels). Individuals with missing information were excluded from the calculations. Only SNPs with the three genotypes represented in both cases and controls were considered.

As already mentioned, calculation of BFs requires assumptions on effect sizes. We calculated frequentist estimates of logistic regression coefficients over the corresponding entire chromosomes using NARAC data. Density plots and tests of goodness of fit of frequentist estimates of log₁₀GRR_Hom and log₁₀GRR_Het for the seven investigated chromosomes indicated that GRR variation was better represented by the median absolute deviation (MAD) than by the standard deviation (data not shown). Therefore, we assumed N(0, MAD²) priors on the logarithms of genetic effects. MCMClogit and BayesFactor [5] were used to calculate BFs for each SNP in the selected regions based on NARAC data. The number of individuals with complete data and the observed allele frequencies in controls were used to draw 10,000 samples from a binomial distribution, which were combined with the posterior estimates of genetic effects to simulate the distribution of attributable λ_s for each SNP.

A bagplot is a bivariate generalization of the non-parametric univariate boxplot [6]. In a bagplot, the region which contains 50% of the observations with greatest bivariate depth is denominated the bag. Observations outside the bag expanded three times are considered outliers, they are too far away from the data's central bulk. These outliers are indicated by closed dots in univariate boxplots. Bagplots have been shown to be equivariant for linear transformations and not limited to elliptical distributions. Due to these properties and the characteristics of our data (unknown shape of the underlying distributions), we used non-parametric bagplots to identify deviating SNPs. Bagplots were determined using the function compute.bagplot in the R package aplpack [7].

Results

Theoretical relationship between Bayes factor and attributable risk

Table 1 shows the RAFs at which the maximum log₁₀BF and λ_s were reached assuming several GRR_hom values under the dominant, recessive and additive modes of inheritance. For example, under GRR_hom = 2 and dominance, the maximum log₁₀BF (25.3) was reached when RAF = 0.22 and the maximum λ_s (1.06) for RAF = 0.17. Alternative sample-sizes/priors would result in different log₁₀BFs, but the RAF to reach the maximum log₁₀BF would not change. The representation of log₁₀BF versus attributable λ_s (data not shown) indicated that the two parameters were correlated, but they did not correspond one-to-one. In the three models, the RAF corresponding to the highest attributable λ_s and the RAF corresponding to the highest log₁₀BF decreased with increasing GRR_hom values. The RAF at maximum log₁₀BF was always higher than the RAF at maximum λ_s, i.e., balanced designs resulted in stronger association evidences. The difference between RAF at maximum log₁₀BF and RAF at maximum attributable λ_s increased with increasing GRR_hom, for example (0.24-0.21) = 0.03 if GRR_hom = 1.5 versus (0.14-0.05) = 0.09 when GRR_hom = 10 in the dominant model.

Table 2 Allele frequencies at maximum λ_s and maximum log₁₀(BF)

Full size table

Bayes factors and attributable sibling relative risks in the investigated regions

The analysis of complete chromosomes by frequentist logistic regression resulted in 2(GRR_hom, GRR_het) × 8(chromosomes 1, 4, 6, 7, 10, 13, 21, 22) sets of genetic effects. Table 1 shows the MADs of log₁₀GRR_hom and log₁₀GRR_het used as scale parameters in the normal prior distributions. As expected, the a priori variance of genetic effects was highest for chromosome 6 (MHC region).

Figure 1 represents the bivariate distribution of log₁₀BF by attributable λ_s based on NARAC data for individual SNPs in the 12 explored regions. Gray points show SNPs within the bulk of data, outlying observations (observations outside the bag expanded three times) are represented by black points. The right part of Table 1 shows SNP numbers and close genes for the most extreme outliers. For example, rs2476601 in the 1p13 region: based on NARAC data, the log₁₀BF was 3.91, the high-risk allele frequency in controls is 0.084, and the estimated attributable λ_s with 95% confidence interval was 1.048 (1.017 to 1.098) for this SNP. The other two regions on chromosome 1 did not reveal strong associations with RA. The investigation of the two overlapping intervals in chromosome 6p21 showed practically identical results; the highest attributable λ_s values were found for SNPs rs7765379 (1.175) and rs2395175 (1.164). Many SNPs in this region showed a strong association with RA. The investigation of the 6q23, 7q32, and 10p15 regions resulted in log₁₀BF values below three. The log₁₀BF and the attributable λ_s for SNP rs1407961 in the 13q12 region clearly departed from the majority of data. Two extreme outlier SNPs were identified in the 21q22 region and three outliers in the 22q13 region.

Discussion

This study explored the advantage of combining BF, a measure of statistical evidence of association, and the attributable λ_s, a measure of contribution to familial aggregation, in order to characterize the effect of a particular polymorphism on disease risk. Both simulation and NARAC-based results showed that BF usually correlates with the attributable λ_s. Most markers identified in GWA studies confer a slightly increased risk of disease. For this kind of modest genetic effects (GRR ≤ 2), simulation indicated that the maximum log₁₀BF and the maximum attributable λ_s were expected to be found for variants with similar frequencies: around 20% (dominant), 40% (additive) and 70% (recessive penetrances). The larger the size of the genetic effect, the lower the RAF to reach the maxima, and the larger the difference between RAFs. Causal variants generally show stronger effects than markers from GWA studies. For example, a recent GWA study identified five novel breast cancer susceptibility loci. The highest GRR was 1.6, which can be compared with GRR = 6 to 30 for carriers of BRCA1 mutations. Simulation suggested that, for this kind of strong genetic effects, attributable λ_s and BF are two different dimensions.

The representation of λ_s versus log₁₀BF based on NARAC data confirmed that SNPs which have shown replicated associations clearly departed from the majority of polymorphisms in the selected regions. For example, the functionally relevant SNP rs2476601 was unequivocally separated from the rest of SNPs in the 1p13 region. Other genes located in the proximity of the most extreme SNPs were ZMYM2, MX1, MX2, IL2RB, and CSF2RB. This list includes SNPs which show strong evidence of association (log₁₀BF higher than three) and outlying λ_s values. SNPs with high attributable λ_s values but relatively low BFs may also be of interest. Simulation showed that one BF value corresponds to two different λ_s values. In the 12 regions investigated using NARAC data, the ranking of SNPs according to BF differed from the ranking according to attributable λ_s. The difference was particularly clear in the low-recombination MHC region, where rs7765379 showed the maximum λ_s and rs2395175 showed the maximum log₁₀BF. This difference in rankings suggests that, in addition to a measure of association evidence, the extensive fine-mapping needed to characterize etiological variants may benefit from the consideration of the attributable λ_s. Future studies based on more extensive simulations and additional real data should investigate this possibility.

The present study made use of Bayesian and robust statistics. The Bayesian approach has been overlooked in the analysis of GWA studies, where the adjustment for multiple testing, the relationship between power and statistical significance, and the selection of disease models are important issues [1, 3]. We only considered SNPs with the three genotypes represented in both cases and controls. This strategy would initially hamper the identification of rare causal variants but, on the other hand, logistic regression has not been robust to particular types of outliers, and a three-genotype model with well represented categories would result in more robust estimates. Robust methods aim to describe the structure best fitting to the bulk of the data, and to identify deviating observations. Robust statistics are used a lot less than they ought to be in genetic epidemiology. The use of MADs and bagplots in the present article illustrates their advantage over classical methods in practical situations.

Conclusion

The association results from GWA studies are usually summarized by a measure of evidence of association (frequentist or Bayesian probability values), which do not reflect the contribution of the identified signals to familial aggregation. We propose here a two-dimensional characterization of genetic association consisting of the attributable λ_s and the BF.

Abbreviations

BF:: Bayes factor
GRR:: Genotype relative risk
GWA:: Genome-wide association
λ _s :: sibling relative risk
MAD:: Median absolute deviation
MHC:: Major histocompatibility complex
PAF:: Population-attributable fraction
RA:: Rheumatoid arthritis
SNP:: Single-nucleotide polymorphism
WTCCC:: Wellcome trust case control consortium

References

Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447: 661-678. 10.1038/nature05911.
Article Google Scholar
Hemminki K, Försti A, Bermejo JL: The 'common disease-common variant' hypothesis and familial risks. PLoS ONE. 2008, 3: e2504-10.1371/journal.pone.0002504.
Article PubMed Central PubMed Google Scholar
Wakefield J: Bayes factors for genome-wide association studies: comparison with P-values. Genet Epidemiol. 2009, 33: 79-86. 10.1002/gepi.20359.
Article PubMed Google Scholar
James JW: Frequency in relatives for an all-or-none trait. Ann Hum Genet. 1971, 35 (1): 47-49. 10.1111/j.1469-1809.1956.tb01377.x.
Article CAS PubMed Google Scholar
Martin AD, Quinn KM: MCMCpack: (MCMC) Package. [http://mcmcpack.wustl.edu/wiki/index.php/Main_Page]
Rousseeuw PJ, Ruts I, Tukey JW: The Bagplot: a Bivariate boxplot. Am Stat. 1999, 53: 382-387. 10.2307/2686061.
Google Scholar
Wolf P: aplpack: Another Plot PACKage. [http://www.wiwi.unibielefeld.de/~wolf/software/aplpack/]

Download references

Acknowledgements

The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. JLB was supported by Deutsche Krebshilfe and the Swedish Cancer Society. RH was supported by NGFN 2 (SMP-GEM, PGE-S30T09), funded by the BMBF.

This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.

Author information

Authors and Affiliations

Institute of Medical Biometry and Informatics, University Hospital Heidelberg, INF 305, 69120, Heidelberg, Germany
Justo Lorenzo Bermejo
Division of Molecular Genetic Epidemiology, German Cancer Research Center, INF 520, 69120, Heidelberg, Germany
Justo Lorenzo Bermejo & Kari Hemminki
Institute of Human Genetics, University of Heidelberg, INF 366, 69120, Heidelberg, Germany
Christine Fischer
Division of Cancer Epidemiology, German Cancer Research Centre, INF 280, 69120, Heidelberg, Germany
Anke Schulz, Nadine Cremer, Rebecca Hein, Lars Beckmann & Jenny Chang-Claude
Center for Family and Community Medicine, Karolinska Institute, 14183, Huddinge, Sweden
Kari Hemminki

Authors

Justo Lorenzo Bermejo
View author publications
You can also search for this author in PubMed Google Scholar
Christine Fischer
View author publications
You can also search for this author in PubMed Google Scholar
Anke Schulz
View author publications
You can also search for this author in PubMed Google Scholar
Nadine Cremer
View author publications
You can also search for this author in PubMed Google Scholar
Rebecca Hein
View author publications
You can also search for this author in PubMed Google Scholar
Lars Beckmann
View author publications
You can also search for this author in PubMed Google Scholar
Jenny Chang-Claude
View author publications
You can also search for this author in PubMed Google Scholar
Kari Hemminki
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Justo Lorenzo Bermejo.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JLB designed the study, performed the statistical analyses and drafted the manuscript. CF and KH contributed to the study design and revised critically the manuscript. AS, NC, RH, LB, and JCC made substantial contributions to interpretation of results. All authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Bermejo, J.L., Fischer, C., Schulz, A. et al. Representation of genetic association via attributable familial relative risks in order to identify polymorphisms functionally relevant to rheumatoid arthritis. BMC Proc 3 (Suppl 7), S10 (2009). https://doi.org/10.1186/1753-6561-3-S7-S10

Download citation

Published: 15 December 2009
DOI: https://doi.org/10.1186/1753-6561-3-S7-S10

Genetic Analysis Workshop 16

Representation of genetic association via attributable familial relative risks in order to identify polymorphisms functionally relevant to rheumatoid arthritis

Abstract

Introduction

Methods

Derivation of attributable risks and Bayes factors

Investigated regions

Calculation of attributable risks and Bayes factors for NARAC data

Results

Theoretical relationship between Bayes factor and attributable risk

Bayes factors and attributable sibling relative risks in the investigated regions

Discussion

Conclusion

Abbreviations

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors' contributions

Rights and permissions

About this article

Cite this article

Keywords

BMC Proceedings

Contact us

Genetic Analysis Workshop 16

Representation of genetic association via attributable familial relative risks in order to identify polymorphisms functionally relevant to rheumatoid arthritis

Abstract

Introduction

Methods

Derivation of attributable risks and Bayes factors

Investigated regions

Calculation of attributable risks and Bayes factors for NARAC data

Results

Theoretical relationship between Bayes factor and attributable risk

Bayes factors and attributable sibling relative risks in the investigated regions

Discussion

Conclusion

Abbreviations

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors' contributions

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Proceedings

Contact us