CIBER Epidemiology and Public Health (CIBERESP), Barcelona, Spain

Cardiovascular Epidemiology & Genetics group, Inflammatory and Cardiovascular Disease Programme, Institut Municipal d'Investigació Mèdica (IMIM), Barcelona, Spain

Statistics Department, University of Barcelona (UB), Barcelona, Spain

Structural Biology and Biocomputing Programme, Spanish National Cancer Centre (CNIO), Madrid, Spain

Center for Research in Environmental Epidemiology (CREAL), Barcelona, Spain

Abstract

Background

Copy number variants (CNV) are a potentially important component of the genetic contribution to risk of common complex diseases. Analysis of the association between CNVs and disease requires that uncertainty in CNV copy-number calls, which can be substantial, be taken into account; failure to consider this uncertainty can lead to biased results. Therefore, there is a need to develop and use appropriate statistical tools. To address this issue, we have developed

Results

Here we present a new R package, CNVassoc, that can deal with different types of CNV arising from different platforms such as MLPA or aCGH. Through a real data example we illustrate that our method is able to incorporate uncertainty in the association process. We also show that our package is useful when analyzing imputed data, such as imputed SNPs. Through a simulation study we show that CNVassoc outperforms CNVtools in terms of computing time as well as in convergence failure rate.

Conclusions

We provide a package that outperforms the existing ones in terms of modelling flexibility, power, convergence rate, ease of covariate adjustment, and requirements for sample size and signal quality. Therefore, we offer CNVassoc as a method for routine use in CNV association studies.

Background

The proportion of variation in risk of complex diseases explained by the single nucleotide polymorphisms (SNPs) that have been discovered in recent years using the genome-wide association approach appears to be limited. This has led to the suggestion that other, possibly more complex, genetic variants could partly explain the remaining disease susceptibility. Technological advances now allow a class of genetic variants known as copy number variants (CNV) to be genotyped with increasing levels of accuracy, and several studies have recently explored the relationship between these variants and risk of complex disease.

**User's manual**.


Implementation

We developed a set of functions to analyse copy number variants and integrated them as an R package called

R is a general-purpose, open-source program commonly used for all types of statistical analysis. Incorporating the functions into an R package allows users to take advantage of R's flexibility in manipulating the input and the results when analysing CNVs with

Software main features

To date, only one other R package,

Inferring copy number status

By separating the CNV calling and association testing steps,

Considering batch effect

In

Formally, the intensity signal distribution,

where, _{cb }_{cb }_{c }_{c}_{cb }_{cb }

In

where _{b }

Improved association test

To incorporate CNV copy number uncertainty in the association test,

Adjustment for covariates

Response phenotypes

Inheritance models

Analysis of multiple CNVs

To perform association testing of multiple CNVs with greater computational efficiency, a function called multi

Computational Efficiency

Using the same sample sizes and probe signal intensity distributions as used in

Performing association tests

First, an object of class cnv must be created by

Functions to simulate CNV data

In

Association analysis on imputed SNPs

It is also possible to analyse the association between imputed SNPs and the response, taking the genotype probabilities obtained from any software capable of imputing SNPs, such as IMPUTE

Results and Discussion

In this section we show the results obtained in inferring copy number status and association analysis on a real data set including 360 cases and 291 controls (data described in

A more detailed description of all these analyses and others (imputed SNPs, aCGH data, and other phenotype distributions: Poisson, Weibull and normal) can be found in Additional file

Inferring copy number status

Before association analysis, copy number status must be inferred. To do so, the function cnv is used. In this subsection, gene 2 from the MLPA data example is used. This data set can be loaded from the

The peak intensities of gene 2 are assumed to follow a mixture of normal distributions, and the method used to estimate this distribution is specified by the

Plot of a cnv object generated from CNV signal intensity data
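The mixture-of-normals assumption used for copy number calling can be illustrated with a minimal sketch. It is written in Python rather than R and is not part of the package; the function names and the mixture parameters (weights, means, standard deviations) are hypothetical values chosen for illustration, not estimates from the MLPA data. Given fitted mixture parameters, Bayes' rule gives the probability of each copy number status for an observed intensity:

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def copy_number_posteriors(x, weights, means, sds):
    """Posterior probability of each mixture component given intensity x."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical mixture parameters (assumptions for illustration only)
weights = [0.25, 0.50, 0.25]   # population proportion of each copy number status
means = [0.0, 1.0, 2.0]        # mean signal intensity per status
sds = [0.4, 0.4, 0.4]          # signal standard deviation per status

for intensity in (0.1, 0.5, 1.0):
    probs = copy_number_posteriors(intensity, weights, means, sds)
    print(intensity, [round(p, 3) for p in probs])
```

Intensities that fall between two component means receive non-negligible probability for both statuses, which is the uncertainty that the association model is meant to carry forward instead of a single hard call.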


A measure that quantifies the amount of uncertainty in the CNV calling estimation can be computed using the function getQualityScore. Various measures are available; the following is an example of how to obtain the quality score (uncertainty measure) described in the

In some cases, it may be preferable to infer copy number status using another algorithm that is not implemented in

Performing association models

To carry out association analysis between CNV and disease, the function

Here, we continue with the same MLPA data, using the CNV object for gene 2 created in the previous section. To fit a logistic regression model with case-control status as the response and CNV copy number as the predictor, assuming an additive genetic effect, we type

By applying the summary function to the result, we obtain odds ratios, confidence intervals, and p-values for every copy number status with respect to the reference copy number category.
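As a rough illustration of how such summary quantities relate to a fitted log-odds coefficient, the following Python sketch (not part of the package) converts a coefficient and its standard error into an odds ratio, a 95% confidence interval and a p-value. The coefficient and standard error are hypothetical, and the use of Wald-type intervals and tests is an assumption; the package's summary method may compute these quantities differently.

```python
import math

Z_975 = 1.959963984540054  # 97.5% quantile of the standard normal

def odds_ratio_with_ci(beta, se, z=Z_975):
    """Odds ratio and Wald confidence limits from a log-odds coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def wald_p_value(beta, se):
    """Two-sided p-value of the Wald test of beta = 0."""
    return math.erfc(abs(beta / se) / math.sqrt(2.0))

# Hypothetical coefficient and standard error (illustration only)
beta, se = 0.405, 0.15
odds_ratio, lower, upper = odds_ratio_with_ci(beta, se)
print(round(odds_ratio, 2), round(lower, 2), round(upper, 2), wald_p_value(beta, se))
```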

To compute the global CNV significance p-value, the CNVtest function can be used as follows:

In this example, a Likelihood Ratio Test (LRT) is computed, comparing a model containing CNV to a model lacking CNV (i.e. a model without predictors or the null model).
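The arithmetic behind a likelihood ratio test can be sketched as follows, in Python and independently of the package. The function name and the log-likelihood values are hypothetical. Under an additive effect the CNV term adds one parameter, so the statistic is compared with a chi-square distribution with one degree of freedom; the two-degrees-of-freedom case is included as an assumption about a model with a separate effect for each of two non-reference copy number statuses.

```python
import math

def lrt_p_value(loglik_null, loglik_full, df=1):
    """Likelihood ratio statistic and p-value for nested models (df = 1 or 2)."""
    stat = max(2.0 * (loglik_full - loglik_null), 0.0)
    if df == 1:
        # Chi-square(1) survival function via the complementary error function
        p = math.erfc(math.sqrt(stat / 2.0))
    elif df == 2:
        # Chi-square(2) survival function has a closed form
        p = math.exp(-stat / 2.0)
    else:
        raise ValueError("this sketch only handles df = 1 or df = 2")
    return stat, p

# Hypothetical maximised log-likelihoods (illustration only)
stat, p = lrt_p_value(loglik_null=-450.0, loglik_full=-446.5, df=1)
print(stat, p)
```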

Using the

Response phenotypes: Weibull

In this section, we illustrate how to analyse a time-to-event response variable (Weibull distributed) using simulated data generated with the function simCNVdataWeibull. In the following example, a CNV has been generated with 0, 1 and 2 possible copies with probabilities of 25%, 50% and 25% respectively, with an intensity signal standard deviation of 0.4 for each copy status, and means of 0, 1 and 2 respectively. The response variable has been simulated under a Weibull distribution with shape parameter equal to 1 and disease incidence equal to 0.05 (per person-year) among the population with zero copies (reference). The proportion of observed (non-censored) events was set to 10%. Finally, these data have been generated assuming an additive CNV effect with a hazard ratio of 1.5 per copy.
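A simulation with these settings can be sketched in Python as below. This is not the simCNVdataWeibull function: the function name, the proportional-hazards parametrisation S(t) = exp(-rate * t^shape), and in particular the censoring mechanism (a single administrative cut-off chosen so that the requested fraction of subjects have an observed event) are assumptions made for illustration.

```python
import math
import random

def simulate_weibull_cnv(n, event_fraction=0.10, seed=1):
    """Simulate copy number, signal intensity and censored Weibull times."""
    rng = random.Random(seed)
    shape = 1.0          # Weibull shape parameter
    base_rate = 0.05     # incidence per person-year with zero copies
    hr_per_copy = 1.5    # additive effect: hazard ratio per extra copy
    rows = []
    for _ in range(n):
        copies = rng.choices([0, 1, 2], weights=[0.25, 0.50, 0.25])[0]
        intensity = rng.gauss(float(copies), 0.4)
        rate = base_rate * hr_per_copy ** copies
        # Inverse-transform sampling from S(t) = exp(-rate * t**shape)
        u = 1.0 - rng.random()  # uniform on (0, 1]
        time = (-math.log(u) / rate) ** (1.0 / shape)
        rows.append((copies, intensity, time))
    # Assumed censoring: cut follow-up so that a fixed fraction has an event
    n_events = max(1, round(event_fraction * n))
    cutoff = sorted(r[2] for r in rows)[n_events - 1]
    return [
        {"copies": c, "intensity": x, "time": min(t, cutoff), "event": int(t <= cutoff)}
        for c, x, t in rows
    ]

data = simulate_weibull_cnv(2000)
print(sum(d["event"] for d in data), "events out of", len(data))
```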

Once the CNV data and phenotype have been generated, inferring copy number status and fitting the association model are performed in the following two steps:

(1) Inferring copy number status, as for case-control studies:

Note that 3 copy number statuses have been estimated using the BIC criterion. By default, 1, 2 and 3 copies are assigned. The number of copies for each status can be changed to 0, 1 and 2 respectively by modifying the num.copies attribute.

(2) Testing for association between CNV and time-to-event, specifying the family argument as "weibull":

Note that hazard ratios (HR) are displayed instead of odds ratios. In this case, an additive CNV effect has been assumed when fitting the association model.

Computational Efficiency

In this section, we compare the performance of

Number of failed convergence simulations out of 500 using CNVassoc and CNVtools according to inferring copy number uncertainty

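| Q | CNVassoc | CNVtools | CNVassoc | CNVtools |
|---|---|---|---|---|
| 6.0 | 0 | 0 | 0 | 15 |
| 5.5 | 0 | 0 | 0 | 20 |
| 5.0 | 0 | 0 | 0 | 65 |
| 4.5 | 0 | 0 | 0 | 92 |
| 4.2 | 0 | 0 | 0 | 187 |
| 4.0 | 0 | 0 | 0 | 246 |
| 3.7 | 0 | 0 | 0 | 294 |
| 3.5 | 0 | 1 | 0 | 299 |
| 3.2 | 0 | 13 | 212 | 389 |
| 3.0 | 0 | 65 | 331 | 400 |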

We have also observed a marked difference in the speed of each procedure: when analyzing 10,000 CNVs in 2,000 cases and 2,000 controls, and with a

Conclusions

We present a new package for performing analysis of association between copy number variants and disease, appropriately taking uncertainty in CNV copy number calls into account. The numerical procedure for fitting the model is simple and computationally efficient, handling thousands of CNVs in reasonable time. In addition, it is possible to adjust for covariates which may be necessary to control for population stratification. A central feature of

In conclusion, considering the advantages in terms of modelling flexibility, power, convergence rate, ease of covariate adjustment, and requirements for sample size and signal quality, we offer

Availability and requirements

1. Project name:

2. Project home page:

3. Operating system(s): Platform independent

4. Programming language: R

5. R Dependencies: mixdist, mclust, survival

6. R Suggested: CGHcall, CGHregions, snow,

7. License: GPL or newer

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JRG conceived the idea of incorporating probabilities to address uncertainty in CNV association studies. IS and JRG created the R functions and the package. IS implemented some R functions to simulate CNV data. GL drafted the manuscript. IS, GL, RD-U and JRG designed, performed and interpreted the simulation studies to compare

Acknowledgements

The authors would like to express their gratitude to Dave MacFarlane and Alejandro Caceres for their helpful comments and for reviewing the manuscript. This work has been supported by the Spanish Ministry of Science and Innovation (MTM2008-02457 to JRG, BIO2009-12458 to RD-U and statistical genetics network MTM2010-09526-E (subprograma MTM) to JRG, IS, GL and RD-U). GL is supported by the Juan de la Cierva Program of the Spanish Ministry of Science and Innovation.

Pre-publication history

The pre-publication history for this paper can be accessed here: