Center for Research in Environmental Epidemiology (CREAL), Barcelona, Spain

Institut Municipal d’Investigació Mèdica (IMIM), Barcelona, Spain

Joint Research Unit on Genomics and Health, Centre for Public Health Research (CSISP) and Cavanilles Institute for Biodiversity and Evolutionary Biology, University of Valencia, Valencia, Spain

CIBER Epidemiología y Salud Pública (CIBERESP), Spain

Abstract

Background

An important question in genetic studies is to determine which genetic variants, in particular CNVs, are specific to different groups of individuals. This could help in elucidating differences in disease predisposition and response to pharmaceutical treatments. We propose a Bayesian model designed to analyze thousands of copy number variants (CNVs) where only a few of them are expected to be associated with a specific phenotype.

Results

The model is illustrated by analyzing three major human groups belonging to HapMap data. We also show how the model can be used to determine specific CNVs related to response to treatment in patients diagnosed with ovarian cancer. The model is also extended to address the problem of how to adjust for confounding covariates (e.g., population stratification). Through a simulation study, we show that the proposed model outperforms other approaches that are typically used to analyze these data when analyzing common copy-number polymorphisms (CNPs) or complex CNVs. We have developed an **R** package, **bayesGen**, that implements the proposed model.

Conclusions

Our proposed model is useful to discover specific genetic variants when different subgroups of individuals are analyzed. The model can address studies with or without a control group. By integrating all data in a single model we can obtain a list of genes that are associated with a given phenotype as well as a different list of genes that are shared among the different subtypes of cases.

Background

The aim of genome-wide association studies (GWAS) is to assess the association between single nucleotide polymorphisms (SNPs) and common diseases. Recent GWAS have been successful in discovering SNPs significantly associated with complex diseases

Several techniques and platforms have been developed for GWAS involving CNVs, such as array-based comparative genomic hybridization (aCGH). For targeted studies, other techniques such as real time PCR, or Multiplex Ligation-dependent Probe Amplification (MLPA) assays have been used to compare the copy number status of particular loci in cases and controls. In both cases, a signal intensity is measured for each CNV as a continuous variable, from which the copy number status is inferred. In many cases, the distribution of the observed CNV probe measurements is continuous and multimodal, representing the unobserved copy number status as a latent variable

Despite the existence of these methods, CNV association studies often analyze CNVs with very low uncertainty that are not likely genotyping artefacts. For example, in the GWAS performed in the Myocardial Infarction Genetics Consortium ^{2}, Fisher or Mann-Whitney tests

In this article, we present a Bayesian shared component model for CNV-based association studies. We illustrate the model with a case study to determine those CNVs that are specific to a given population when comparing individuals belonging to the HapMap project. In this example, differences are expected in a large proportion of CNVs due to ethnic background. An example including patients with ovarian cancer is analyzed in order to illustrate how our model identifies phenotype-associated CNVs when only a very small number of CNVs are expected to differ across groups. Our approach adapts and extends the model suggested by

Methods

Data sets

The first motivating data were collected from a genetic study conducted at the Center for Genomic Regulation (CRG) in Barcelona, Spain. The study aimed to determine those CNVs that are specific to major human ethnic groups included in the HapMap project (e.g., African, Asian or European)

The second motivating data set comes from a study on ovarian cancer. The data are obtained from The Cancer Genome Atlas (TCGA) data portal

As previously mentioned, a very simple approach to determine the CNVs that are specific to each subgroup of individuals is to compare the observed CNV frequencies between individuals from different groups. However, $\chi^{2}$, Fisher or Mann-Whitney tests can be underpowered. In addition, most of the analyzed CNVs have similar frequencies across ethnic groups, and only a few, if any, show differences between them. Therefore, the use of a shared component model can be very useful in the context of CNVs.

The Bayesian Model

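Let $y_{ijp} \in \{0,1,2,3,4,\ldots\}$ denote the copy number status observed for individual $i$ at CNV locus $j$ in population $p$. The modelling approach is motivated by the fact that we are looking for associations between CNVs and populations. If a given CNV is linked to a specific population, it is expected that most of the individuals in that population have similar values for that CNV.

Now, let $N_{jd}$ denote the number of individuals in population $p$ with $d$ copies of CNV $j$, so that

$$\bar{y}_{jp} = \frac{\sum_{d} d\,N_{jd}}{\sum_{d} N_{jd}},$$

where $\bar{y}_{jp}$ is the mean number of copies for CNV $j$ in population $p$.

We introduce the following shared component formulation with Gaussian likelihood to decompose the variability of $\bar{y}_{jp}$:

$$\mathrm{E}\left(\bar{y}_{jp}\right) = \alpha_{p} + \delta_{p}\,\beta_{j} + \lambda_{jp},$$

where $\alpha_{p}$ is a population-specific intercept, $\beta_{j}$ is the component shared by all populations, $\delta_{p}$ denotes the loading of the common component into population $p$, and $\lambda_{jp}$ encodes the population-specific components. In order to make the model as flexible as possible we have considered that $\bar{y}_{jp}$ has the same variance for each population group, $\sigma^{2}$. The likelihood of our proposed model is

$$\bar{y}_{jp} \mid \alpha_{p}, \beta_{j}, \delta_{p}, \lambda_{jp}, \sigma^{2} \sim N\left(\alpha_{p} + \delta_{p}\,\beta_{j} + \lambda_{jp},\ \sigma^{2}\right).$$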

Figure

**Schematic representations of the shared component model using a symmetric formulation (i.e., no reference group).** The index

In the Bayesian framework, all parameters must be assigned prior distributions. These may in turn depend on further parameters, referred to as hyperparameters, which must also be assigned prior distributions (hyperpriors). Our basic principle in specifying these distributions is to let the data likelihood dominate over the prior information. To achieve this, it is common to consider prior distributions with large variances, which allow for a wide range of potential values for the parameters and are thus non-informative a priori. Following this principle we chose flat prior distributions, as has been done in previous similar studies. We assumed the following priors

and non-informative hyperpriors for the standard deviations of the random effects

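The population-specific components, $\lambda_{jp}$, were considered as zero-mean random effects. For the sake of identifiability we fixed $\delta_{1} = 1$, where $\delta_{1}$ is the loading of the shared component and corresponds to the reference population.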
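The decomposition into intercept, shared component and specific component can be illustrated with a small numerical sketch. The Python code below is not the bayesGen implementation; it assumes one common way of writing a shared component model, with hypothetical names `alpha` (population intercepts), `beta` (shared components), `delta` (loadings, the first one set to 1 for the reference population), `lam` (specific components) and `sigma` (common standard deviation), and evaluates the Gaussian log-likelihood of aggregated mean copy numbers.

```python
import math

def shared_component_mean(alpha, beta, delta, lam):
    """Expected mean copy number for each CNV j (rows) and population p (columns)."""
    return [[alpha[p] + delta[p] * beta[j] + lam[j][p]
             for p in range(len(alpha))]
            for j in range(len(beta))]

def gaussian_loglik(ybar, alpha, beta, delta, lam, sigma):
    """Sum of Gaussian log-densities over all CNV/population cells."""
    mu = shared_component_mean(alpha, beta, delta, lam)
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const - 0.5 * ((ybar[j][p] - mu[j][p]) / sigma) ** 2
               for j in range(len(beta)) for p in range(len(alpha)))

# Toy values (assumptions for illustration): 2 CNVs, 3 populations
alpha = [2.0, 2.0, 2.0]                   # intercepts close to two copies
beta = [0.1, -0.2]                        # shared component of each CNV
delta = [1.0, 0.8, 1.2]                   # loading of the reference population is 1
lam = [[0.0, 0.0, 0.0], [0.0, 0.5, 0.0]]  # CNV 2 is specific to population 2
ybar = shared_component_mean(alpha, beta, delta, lam)
best = gaussian_loglik(ybar, alpha, beta, delta, lam, sigma=0.1)
```

In this toy setting, dropping the non-zero specific component lowers the log-likelihood, which is the signal the model uses to flag a CNV as specific to one group.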

Inclusion of covariates

In almost all situations the disease is affected not only by genetic factors but also by environmental determinants. In these situations the association between the disease and CNVs has to be adjusted for covariates that indicate whether or not an individual is exposed to those environmental variables. Our model can accommodate this information in the case of categorical covariates (e.g., exposed vs non-exposed, males vs females, smokers vs non-smokers, ...) by aggregating the data in more categories. For instance, suppose we have a categorical covariate

Prior distributions should also be assigned to the additional covariate parameters, indexed by $k$ and by $jk$. These could be analogous to the priors for the population-level parameters indexed by $p$ and for the specific components $\lambda_{jp}$.

Notice that if we are interested in adjusting for continuous covariates, we should create categories before including them in the model. One possibility is to use tertiles or quartiles (e.g., when measuring the consumption of a nutrient) or a priori cut-points (e.g., age can be categorized according to risk groups). A special case in which adjustment for continuous covariates is required appears in genetic studies when the population structure has to be considered. In these cases, principal component analysis (PCA) is used to determine the subpopulation structure
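As an illustration of the categorization step, the following Python sketch splits a continuous covariate into tertiles and then aggregates the copy numbers of one CNV within each population-by-category stratum. The function names, the toy ages and the population labels are assumptions made for this example only.

```python
import statistics
from collections import defaultdict

def tertile_labels(values):
    """Assign each value to tertile 0, 1 or 2 using empirical cut-points."""
    low, high = statistics.quantiles(values, n=3)  # the two tertile cut-points
    return [0 if v <= low else 1 if v <= high else 2 for v in values]

def mean_copies_by_stratum(copies, population, category):
    """Mean copy number of one CNV within each (population, category) stratum."""
    groups = defaultdict(list)
    for c, p, k in zip(copies, population, category):
        groups[(p, k)].append(c)
    return {key: statistics.mean(vals) for key, vals in groups.items()}

# Toy data: nine individuals from two hypothetical populations
age = [23, 35, 41, 52, 58, 64, 70, 29, 47]
copies = [2, 2, 1, 2, 3, 2, 2, 2, 1]
population = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]
age_group = tertile_labels(age)
strata = mean_copies_by_stratum(copies, population, age_group)
```

Each key of `strata` plays the role of one aggregated cell, so the number of cells grows with the number of covariate categories.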

Estimation of model parameters

The JAGS software (available at

MCMC is computationally intensive, even more so when analyzing genetic data, where thousands of genes are typically analyzed. To overcome this difficulty we also used the Integrated Nested Laplace Approximation (INLA) approach to make statistical inference on our model. INLA provides a fast (it gives answers in minutes where MCMC requires hours or days) deterministic alternative to MCMC

Results

Genomic differences between human populations

Armengol et al. $\chi^{2}$ or Fisher tests).

The final data set we use for illustration purposes consists of 120 CNV loci (we removed 32 CNV loci that were not variable among populations) and 261 individuals (56 CEU, 58 YRI and 147 CHB/JPT) belonging to the MLPA experiment. Therefore, our data consist of a 261 × 120 matrix with values corresponding to the observed copy number status $y_{ijp} \in \{0,1,2,3,4\}$. After aggregating the counts of each number of copies over the individuals in each population for each CNV locus, we fit model 2 to the aggregated data $\bar{y}_{jp}$.
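The aggregation step described above can be sketched in a few lines of Python. The helper below is a hypothetical illustration, not code from the original analysis: for one CNV locus it tabulates how many individuals of each population carry each number of copies and derives the mean number of copies per population.

```python
from collections import Counter

def aggregate_cnv(copies, populations):
    """Per population: counts of each copy number and mean number of copies."""
    counts = {}
    for c, p in zip(copies, populations):
        counts.setdefault(p, Counter())[c] += 1
    means = {p: sum(d * n for d, n in cnt.items()) / sum(cnt.values())
             for p, cnt in counts.items()}
    return counts, means

# Toy data for a single locus (values are made up for illustration)
copies = [2, 2, 3, 1, 2, 2, 0, 2]
populations = ["CEU", "CEU", "CEU", "YRI", "YRI", "YRI", "CHB/JPT", "CHB/JPT"]
counts, means = aggregate_cnv(copies, populations)
```

Applying the same function to every column of the individuals-by-loci matrix yields one row of aggregated values per CNV locus.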

The table below shows the estimates of the population-specific intercepts, $\alpha_{p}$, for the shared component model assuming a symmetric formulation. The specific intercept for all three populations, $\alpha_{p}$, is around 2, as expected. Regarding the variability parameters, $\sigma_{p}$, we observe that $\sigma_{CEU} = 0.0756$, while $\sigma_{CHB/JPT} = 0.0306$ and $\sigma_{YRI} = 0.0362$. This indicates that there is more variability among European individuals, which decreases the power to find any specific CNV locus for the European population. Trace plots and the Gelman-Rubin scale reduction factor indicate good convergence of the MCMC parameter estimates (see the Additional file, **Supplementary tables and figures**).
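For readers unfamiliar with the Gelman-Rubin diagnostic, the following Python sketch computes the basic potential scale reduction factor for several chains of a single parameter. It uses the simple form of the statistic (without degrees-of-freedom corrections) and simulated chains; it is an illustration under these assumptions and not the routine used to produce the convergence results reported here.

```python
import random
import statistics

def gelman_rubin(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])  # W
    between_over_n = statistics.variance(chain_means)                    # B / n
    pooled = (n - 1) / n * within + between_over_n
    return (pooled / within) ** 0.5

rng = random.Random(42)
# Three chains sampling the same distribution: factor close to 1
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(3)]
# Three chains stuck around different values: factor well above 1
stuck = [[rng.gauss(centre, 1.0) for _ in range(1000)]
         for centre in (0.0, 2.0, 4.0)]
rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
```

Values close to 1 suggest that the chains have reached a common distribution, whereas clearly larger values point to a lack of convergence.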

| **Group** | **Parameter** | **median (95%CI)** |
|---|---|---|
| CEU | $\alpha_{1}$ | 1.95 (1.90, 2.02) |
| YRI | $\alpha_{2}$ | 1.99 (1.94, 2.04) |
| CHB/JPT | $\alpha_{3}$ | 1.97 (1.93, 2.03) |

**Estimates of specific components, $\lambda_{jp}$, for each CNV and each human population in the HapMap data example.** Each point represents the posterior median, while segments show its 99.98% credibility interval. CNVs that are significantly specific to each population are coloured in red (gains) and blue (losses).

Armengol et al. $\chi^{2}$ or Fisher tests. In order to compare the performance of both approaches we tested the existence of population stratification (i.e., genetic differences among individuals) using a principal component analysis (PCA) as suggested in

Specific CNV loci associated with response to treatment in ovarian cancer

This data set contains 8587 CNV loci and 456 individuals. The number of observed copies ranged from 0 to 6. This example was analyzed using INLA. The table below shows the estimates of the group-specific intercepts, $\alpha_{p}$, for the shared component model under a symmetric formulation (i.e., no control group). Again, as expected, these intercepts are around 2. Regarding the specific components, we observe that only 57 CNV loci are statistically significant. As previously mentioned, we were expecting a small number of CNV loci to be specific to each group, since the analyzed individuals belong to the same ethnicity. In contrast, the HapMap data showed about 20% of CNV loci to be specific to each subgroup (33 out of 152). The figure below shows the $\lambda_{jp}$ estimates and illustrates those CNVs that are specific to each response after treatment.

| **Group** | **Parameter** | **median (95%CI)** |
|---|---|---|
| Complete response | $\alpha_{1}$ | 2.00 (1.98, 2.03) |
| Partial response | $\alpha_{2}$ | 1.99 (1.97, 2.01) |
| Null response | $\alpha_{3}$ | 1.99 (1.97, 2.01) |

**Estimates of specific components, $\lambda_{jp}$, for each CNV and each group of individuals, defined by response to treatment, in the ovarian cancer example.** Each point represents the posterior median, while segments show its 99.9994% credibility interval. CNVs that are significantly specific to each group are coloured in green (gains) and red (losses).

Simulation Studies

In real data sets we can only illustrate the methods, because the truth about which CNV loci are really associated with each group is unknown. In order to evaluate our proposed method we carried out a small-scale simulation study that mimics the real data analysis presented in the previous section. We considered three different groups and either 500 or 2,000 CNV loci. Only two of the CNVs had a different proportion in one population (i.e., these two CNV loci were specific to that group of individuals). We simulated three different scenarios for the truly associated CNV loci. The first one considers that the two CNV loci are highly associated with one of the populations (OR=2.0), the second one considers a moderate increase in risk (OR=1.5), while the third one is designed to study the performance of our proposed method in a low risk scenario (OR=1.2). The simulation emulates a likely association pattern between thousands of genes and disease: in genetic studies only a few of the analyzed genes are truly associated with the phenotype of interest. For instance, the WTCCC analyzed 3,432 CNV loci among different diseases and only found 3 loci associated with disease
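To make the role of the odds ratio concrete, the Python sketch below shows one possible way of generating a group-specific locus. The generating mechanism is an assumption made for illustration (it is not described in detail in the text): a baseline frequency is drawn from U(0.01, 0.1), shifted so that its odds are multiplied by the chosen OR, and each individual's copy number status in {0, 1, 2} is obtained from two independent draws. The group size of 1,000 is arbitrary.

```python
import random

def shift_frequency(f0, odds_ratio):
    """Frequency whose odds equal odds_ratio times the odds of f0."""
    odds = odds_ratio * f0 / (1.0 - f0)
    return odds / (1.0 + odds)

def simulate_locus(n, freq, rng):
    """Copy number status in {0, 1, 2} for n individuals (two draws each)."""
    return [(rng.random() < freq) + (rng.random() < freq) for _ in range(n)]

rng = random.Random(2012)
f0 = rng.uniform(0.01, 0.1)      # baseline frequency
f1 = shift_frequency(f0, 2.0)    # high risk scenario, OR = 2.0
group_ref = simulate_locus(1000, f0, rng)    # group without association
group_spec = simulate_locus(1000, f1, rng)   # group for which the locus is specific
```

Replacing 2.0 by 1.5 or 1.2 gives the moderate and low risk scenarios under the same assumed mechanism.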

The copy number status for the loci was simulated considering two types of CNV data. The first one assumes that CNVs are common, meaning that they can be tagged by SNPs (i.e., analysis of CNPs). In this scenario the copy number status can only be {0,1,2}. This kind of data has been obtained by several authors when analyzing CNVs. The second one corresponds to polymorphic (complex) CNVs. We compared the proposed model with a $\chi^{2}$ test, a non-parametric Kruskal-Wallis test and a multinomial logistic regression comparing the null model versus the model including the CNV using the likelihood ratio test. Bonferroni correction was used in order to deal with multiple comparisons. We also computed corrected credible intervals for the specific components. Given that the Bonferroni-like correction requires estimation of extreme percentiles of the posterior distribution, which are difficult to obtain from MCMC samples, we also computed a credible interval based on the normal approximation. Finally, we considered the posterior probability as an alternative criterion to detect significant CNV loci. We compared the different approaches by computing the true positive and negative rates (TPR and TNR, respectively) in 500 simulations.
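The corrected credible interval and the posterior probability criterion can be sketched as follows. This Python example is an illustration under stated assumptions rather than the bayesGen code: the correction divides a nominal 5% level by the number of loci, the interval uses the normal approximation to the posterior of a specific component, and the posterior draws are simulated stand-ins for MCMC output.

```python
from statistics import NormalDist, fmean, stdev

def corrected_level(n_tests, alpha=0.05):
    """Credibility level after a Bonferroni-like correction."""
    return 1.0 - alpha / n_tests

def normal_interval(samples, level):
    """Credible interval from the normal approximation to the posterior."""
    centre, spread = fmean(samples), stdev(samples)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return centre - z * spread, centre + z * spread

def posterior_prob_positive(samples):
    """Share of posterior draws above zero."""
    return sum(s > 0 for s in samples) / len(samples)

level = corrected_level(8587)   # number of loci in the ovarian cancer example
draws = NormalDist(0.3, 0.05).samples(2000, seed=1)   # simulated posterior draws
low, high = normal_interval(draws, level)
significant = low > 0 or high < 0   # interval excludes zero
```

With 8,587 loci the corrected level is about 99.9994%, which agrees with the level quoted for the ovarian cancer figure. Estimating such an extreme percentile directly from 2,000 draws would not be feasible, which is why the normal approximation is convenient.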

Table

Results for the simulation described in the Simulation Studies section for the case of common CNVs with minor allele frequency simulated from U(0.01, 0.1). The different scenarios are described in that section. We compare four different approaches: the $\chi^{2}$ test, Kruskal-Wallis (K-W), multinomial regression using the likelihood ratio test, and our proposed Bayesian shared model (posterior distribution, normal approximation and posterior probability). The comparison was based on computing the True Positive and Negative Rates, TPR and TNR respectively. Results are expressed in %.

| **Scenario** | **Rate** | **# SNPs** | **$\chi^{2}$** | **K-W** | **Multinomial regression** | **Bayesian Shared Model: Posterior Distribution** | **Bayesian Shared Model: Normal Approximation** | **Bayesian Shared Model: Posterior Probability** |
|---|---|---|---|---|---|---|---|---|
| high risk scenario (OR=2.0) | TPR | 2000 | 100.00 | 0 | 100.00 | 100.00 | 100.00 | 100.00 |
| | TNR | 2000 | 100.00 | 100.00 | 100.00 | 99.98 | 99.99 | 99.96 |
| | TPR | 500 | 100.00 | 0 | 100.00 | 100.00 | 100.00 | 100.00 |
| | TNR | 500 | 99.73 | 100.00 | 99.73 | 99.99 | 99.95 | 99.80 |
| moderate risk scenario (OR=1.5) | TPR | 2000 | 60.25 | 0 | 56.75 | 75.25 | 75.50 | 75.00 |
| | TNR | 2000 | 99.95 | 100.00 | 99.95 | 99.98 | 99.99 | 99.95 |
| | TPR | 500 | 69.25 | 0 | 67.50 | 96.25 | 96.25 | 95.75 |
| | TNR | 500 | 99.81 | 100.00 | 99.81 | 99.96 | 99.99 | 99.98 |
| low risk scenario (OR=1.2) | TPR | 2000 | 0.75 | 0 | 0.75 | 10.50 | 10.25 | 10.25 |
| | TNR | 2000 | 99.99 | 100 | 99.9 | 100.00 | 100.00 | 99.98 |
| | TPR | 500 | 1.50 | 0 | 3.25 | 25.25 | 26.50 | 25.50 |
| | TNR | 500 | 99.99 | 100 | 99.99 | 99.99 | 99.99 | 99.98 |
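The rates reported in these tables follow the usual definitions, which the short Python sketch below makes explicit. The function name and the toy calls are hypothetical; the example mimics a single simulated data set with 500 loci of which the first two are truly associated.

```python
def tpr_tnr(truth, called):
    """True positive and true negative rates (in %) for boolean calls."""
    tp = sum(t and c for t, c in zip(truth, called))
    tn = sum((not t) and (not c) for t, c in zip(truth, called))
    positives = sum(truth)
    negatives = len(truth) - positives
    return 100.0 * tp / positives, 100.0 * tn / negatives

# One of the two true loci is detected and one null locus is falsely called
truth = [True, True] + [False] * 498
called = [True, False, True] + [False] * 497
tpr, tnr = tpr_tnr(truth, called)
```

Averaging these two quantities over the simulated data sets gives summary figures of the kind shown in the tables.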

Results for the simulation described in the Simulation Studies section for the case of polymorphic CNVs with minor allele frequency simulated from U(0.01, 0.1). The different scenarios are described in that section. We compare four different approaches: the $\chi^{2}$ test, Kruskal-Wallis (K-W), multinomial regression using the likelihood ratio test, and our proposed Bayesian shared model (posterior distribution, normal approximation and posterior probability). The comparison was based on computing the True Positive and Negative Rates, TPR and TNR respectively. Results are expressed in %.

| **Scenario** | **Rate** | **# SNPs** | **$\chi^{2}$** | **K-W** | **Multinomial regression** | **Bayesian Shared Model: Posterior Distribution** | **Bayesian Shared Model: Normal Approximation** | **Bayesian Shared Model: Posterior Probability** |
|---|---|---|---|---|---|---|---|---|
| high risk scenario (OR=2.0) | TPR | 2000 | 48.50 | 0 | 52.25 | 75.25 | 74.25 | 75.50 |
| | TNR | 2000 | 100.00 | 100 | 100.00 | 100.00 | 100.00 | 100.00 |
| | TPR | 500 | 46.25 | 0 | 42.50 | 64.50 | 64.75 | 64.25 |
| | TNR | 500 | 100.00 | 100 | 100.00 | 100.00 | 100.00 | 100.00 |
| moderate risk scenario (OR=1.5) | TPR | 2000 | 30.25 | 0 | 35.45 | 58.50 | 58.50 | 57.75 |
| | TNR | 2000 | 100.00 | 100 | 100.00 | 99.98 | 99.99 | 99.97 |
| | TPR | 500 | 20.50 | 0 | 23.25 | 44.25 | 44.25 | 44.50 |
| | TNR | 500 | 99.99 | 100 | 99.99 | 99.96 | 99.96 | 99.94 |
| low risk scenario (OR=1.2) | TPR | 2000 | 0.70 | 0 | 0.70 | 20.25 | 20.25 | 20.75 |
| | TNR | 2000 | 99.98 | 100 | 99.99 | 99.97 | 99.99 | 99.98 |
| | TPR | 500 | 0.50 | 0 | 0.50 | 16.25 | 16.25 | 15.75 |
| | TNR | 500 | 99.99 | 100 | 99.99 | 99.99 | 99.99 | 99.98 |

Regarding computation time, we compared the time required to analyze 2,000 CNV loci and 3,000 individuals (1,000 for each of the 3 populations). The $\chi^{2}$ approach took 7 s, Kruskal-Wallis 28 s, multinomial logistic regression 7 min 40 s, the Bayesian model using INLA 1 min 39 s and the Bayesian model using MCMC 1 h 10 min. All computations were done on a workstation with dual Intel Xeon X5482 quad-core processors (3.2 GHz, 2x6 MB cache) and 32 GB of RAM.

Conclusions

Here we considered the problem of determining copy number variants that are specific to different subgroups of individuals or different subphenotypes when thousands of markers are analyzed and only a few of them are truly associated with a given group. We have demonstrated the utility of our model by analyzing two real data sets. One focuses on describing how to find specific CNV loci for the three major ethnic groups, while the second example illustrates how to detect specific CNV loci related to the response to treatment in patients diagnosed with ovarian cancer. We have implemented a Bayesian shared component model to decompose the observed variability in the number of copies of each CNV locus into two components: shared and specific. Simulation results showed a better performance than other existing methods.

We established the CNV loci that are specific to each group by computing credible intervals of the posterior mean of the specific components and their posterior probabilities. In order to avoid false positive results, we adopted a Bonferroni-like correction. Therefore, credible intervals require estimation of extreme percentiles. This may lead to some difficulties when using MCMC samples. Thus, we also calculated credible intervals based on normal approximation. Simulation studies showed that this method slightly outperforms the method based on percentiles.

The model has been formulated using a hierarchical structure. Therefore, it is straightforward to add further levels of hierarchy if needed. For instance, CNVs can be in the same pathway or may have the same function. This information can be incorporated into the model in order to better estimate the effect of each CNV locus, as described in

by

and then assign hyperpriors to the pathway-level parameters, indexed by pathway $g$ and population $p$, that would pick up the variation at the pathway level. With this formulation, large values of these parameters would indicate an association between pathway $g$ and population $p$.

Our model considers that the number of copies for each CNV locus is measured without uncertainty, as considered by some authors

We conclude that our proposed model is useful to discover specific genetic variants for different subgroups of individuals. This could help in determining differences in disease predisposition or response to pharmaceutical treatments. Estimating model parameters can be very time consuming; however, we have developed an R package (bayesGen) that implements the model.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JRG designed and coordinated the study. JA developed the statistical model. CA implemented the estimating algorithms. JRG wrote the bayesGen R package and carried out data analysis and simulations. All authors contributed to the interpretation and discussion of the results. JA and JRG drafted the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We thank Prof. Havard Rue for help with the implementation of the shared component model using INLA. We also thank Xavier Estivill and Lluís Armengol for providing access to the HapMap data. The authors also thankfully acknowledge the TCGA research network for providing the data corresponding to the ovarian cancer example.

This work has been supported by the Spanish Ministry of Science and Innovation (MTM2008-02457 and Statistical Genetics Network - GENOMET, MTM2010-09526-E to JRG) and Grants GVPRE/2008/010 and AP-055/09 from Generalitat Valenciana (JA).