Center for Theoretical Biology, Peking University, Beijing, 100871, People's Republic of China

LMAM, School of Mathematical Sciences, Peking University, Beijing, 100871, People's Republic of China

Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA, USA

National Center for Mathematics and Interdisciplinary Sciences, and the Key Laboratory of Systems and Control, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, People's Republic of China

Center for Statistical Science, Peking University, Beijing, 100871, People's Republic of China

Abstract

Background

Copy number variation (CNV) is essential to understanding the pathology of many complex diseases at the DNA level. Affymetrix SNP arrays, which are widely used for CNV studies, depend heavily on accurate copy number (CN) estimation. Nevertheless, CN estimation may be biased by several factors, including cross-hybridization, training sample batch, and genomic waves of intensities induced by sequence-dependent hybridization rate and amplification efficiency. Since many available algorithms address only one or two of these three factors, CNV identification often suffers from a high false discovery rate (FDR). We have therefore developed a new CNV detection pipeline based on hybridization and amplification rate correction (CNVhac).

Methods

CNVhac first estimates the allelic concentrations (ACs) of target sequences using sample-independent parameters trained under a physicochemical hybridization law. The raw CN at each site is then estimated as the ratio of the AC to the corresponding average AC from a reference sample set. Finally, a hidden Markov model (HMM) segmentation process is applied to detect CNV regions.

Results

Based on public HapMap data, the results show that CNVhac effectively smooths the genomic waves and yields more accurate raw CN estimates than other methods. Moreover, CNVhac alleviates, to a certain extent, the sample dependence of inference and calls CNVs with appreciably low FDRs.

Conclusion

CNVhac is an effective approach to address the common difficulties in SNP array analysis, and the working principles of CNVhac can be easily extended to other platforms.

Background

Copy number variations (CNVs) play an essential role in human disease susceptibility

High-quality CNV calling requires accurate estimation of raw copy numbers and well-optimized statistical models

Cross-hybridization between probes and off-target sequences is a longstanding problem in microarray analysis

In addition to cross-hybridization, Maris et al. have stated that “whole-genome microarrays with large-insert clones designed to determine DNA copy number often show variation in hybridization intensity that is related to the genomic position of the clones.”

Finally, it has long been known that different sample batches can lead to inconsistent results, even if data are collected by the same lab

To the best of our knowledge, existing methods address only one or two of the three factors discussed above. In this study, we developed a novel CNV detection pipeline based on hybridization and amplification rate correction (CNVhac^{a}) to accurately detect CNVs on Affymetrix SNP arrays. In contrast to previous methods, CNVhac takes all three factors into account by properly modeling cross-hybridization, smoothing genomic waves and alleviating the sample batch dependence of parameter estimation, thus significantly improving the accuracy of CNV detection. Starting from dozens of basic constants concerning binding affinity, which can be well trained from a single array and are quite stable between arrays, CNVhac can derive the binding affinity between all probes and sequences without suffering from sample batch dependence. Then CNVhac applies the PICR method

Methods

Dataset

Dataset I. ‘The International HapMap project’

Dataset II. Conrad et al. recently used the ultra-high-resolution NimbleGen tiling arrays (42 M probes) to identify CNVs for HapMap samples

Estimation of raw CNs

The problems usually encountered in estimating raw CNs are discussed in the Background section. Array intensities depend not only on the ACs of target sequences but also on probe binding affinities. Based on

Modeling hybridization and cross-hybridization

Considering one probe in a certain SNP probeset, we have the basic model

where $I$, $I_s$ and $I_{bg}$ stand respectively for the probe intensity, the specific hybridization intensity caused by target sequences and the background nonspecific binding intensity. $I_s$ has been further modeled by a Langmuir-like adsorption principle, and Equation (1) can be rewritten as:

where

where each weight factor depends on the position of consecutive bases along the oligonucleotide, and each energy term depends on the adjacent bases at positions $i$ and $i + 1$. The weight factors and energy terms are known as basic constants, which hardly change between arrays
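The displayed equations did not survive in this copy. As a hedged reconstruction only, assuming the Langmuir-type form commonly used in position-dependent nearest-neighbor (PDNN) models, the single-target model might read as follows. The symbols $N$ (target concentration), $E$ (binding free energy), $\omega_i$ (position weight), $\varepsilon(b_i, b_{i+1})$ (energy term for adjacent bases) and $L$ (probe length) are assumed names, and the sign convention of $E$ is likewise an assumption.

```latex
% Hedged sketch; symbol names and the exact functional form are assumptions.
\begin{align}
I &= I_s + I_{bg}, \\
I &= \frac{N}{1 + e^{E}} + I_{bg}, \\
E &= \sum_{i=1}^{L-1} \omega_i \, \varepsilon(b_i, b_{i+1}).
\end{align}
```

Under this reading, the first line is the additive model of specific and background intensity, the second replaces $I_s$ by a Langmuir-like binding term, and the third expresses the free energy through the basic constants.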

However, this model ignores cross-hybridization. For a given single polymorphic locus there are two alleles (allele A and allele B) in the genome. Because the two target sequences are highly similar, each allele is likely to bind the probe designed to interrogate the other allele. This cross-hybridization may bias the estimated AC of target sequences. We therefore model $I_s$ as the sum of $I_{sA}$ and $I_{sB}$, the contributions of allele A and allele B target sequences, respectively, to probe intensity. Both $I_{sA}$ and $I_{sB}$ can be modeled by Equation (2); thus our proposed model is

where $N_A$ and $N_B$ are the ACs for alleles A and B, respectively, and $E_A$ and $E_B$ denote the corresponding binding free energies. With quite a few probes in one probeset, the ordinary least squares (OLS) method yields unbiased estimates of $N_A$ and $N_B$. The sum of $N_A$ and $N_B$ gives the total concentration
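The OLS step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a Langmuir-like affinity of the form 1 / (1 + exp(E)), assumes the background has already been subtracted from the probe intensities, and uses hypothetical function names and made-up energies.

```python
import math

def langmuir_affinity(energy):
    """Assumed Langmuir-like binding fraction for a given free energy."""
    return 1.0 / (1.0 + math.exp(energy))

def estimate_allelic_concentrations(signal, energy_a, energy_b):
    """OLS fit of signal = n_a * f(E_A) + n_b * f(E_B) over one probeset.

    signal   : background-subtracted probe intensities (assumption)
    energy_a : per-probe binding free energies to the allele A target
    energy_b : per-probe binding free energies to the allele B target
    """
    fa = [langmuir_affinity(e) for e in energy_a]
    fb = [langmuir_affinity(e) for e in energy_b]
    # Entries of the 2x2 normal equations.
    saa = sum(x * x for x in fa)
    sbb = sum(x * x for x in fb)
    sab = sum(x * y for x, y in zip(fa, fb))
    say = sum(x * y for x, y in zip(fa, signal))
    sby = sum(x * y for x, y in zip(fb, signal))
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("affinities of the two alleles are collinear")
    # Cramer's rule for the two unknown concentrations.
    n_a = (say * sbb - sby * sab) / det
    n_b = (sby * saa - say * sab) / det
    return n_a, n_b

# Synthetic, noise-free probeset with hypothetical energies.
energy_a = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
energy_b = [0.5, 1.5, 0.2, 2.0, 0.8, 1.0]
signal = [800.0 * langmuir_affinity(ea) + 400.0 * langmuir_affinity(eb)
          for ea, eb in zip(energy_a, energy_b)]

n_a, n_b = estimate_allelic_concentrations(signal, energy_a, energy_b)
total_concentration = n_a + n_b
```

Because the model is linear in the two concentrations once the energies are fixed, a closed-form least-squares solution suffices; cross-hybridization enters only through the second column of affinities.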

Normalization between arrays

To eliminate the systematic bias between arrays, which may arise from different library preparation conditions during the experiment, we use the following transformation:

where $N_{mk}$ is the total concentration for array

Calibration for amplification efficiency

We have found that

To estimate the adjustment factor, a pool of reference samples is needed. In a case–control design, the control arrays are treated as the reference pool. In this article, the HapMap samples from dataset I are used to estimate
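The calibration idea, in which the raw CN at a site is the ratio of a sample's AC to the average AC of a reference pool, can be sketched as below. This is an illustrative sketch under stated assumptions: the function name is hypothetical, the numbers are invented, and the factor of 2 assumes that the reference pool is diploid at every site.

```python
from statistics import mean

def raw_copy_numbers(sample_ac, reference_acs, reference_ploidy=2.0):
    """Scale per-site total concentrations by the reference-pool average.

    sample_ac        : total concentration per site for one array
    reference_acs    : list of per-site concentration lists, one per reference array
    reference_ploidy : assumed copy number of the reference pool (assumption)
    """
    raw = []
    for site, value in enumerate(sample_ac):
        # Site-specific average over the reference pool.
        ref_mean = mean(ref[site] for ref in reference_acs)
        raw.append(reference_ploidy * value / ref_mean)
    return raw

# Three hypothetical reference arrays, three sites each.
reference_acs = [
    [100, 200, 50],
    [120, 180, 50],
    [80, 220, 50],
]
sample_ac = [100, 100, 75]

raw_cn = raw_copy_numbers(sample_ac, reference_acs)  # [2.0, 1.0, 3.0]
```

Dividing by a site-specific reference average cancels any multiplicative, site-dependent efficiency shared by the sample and the reference pool, which is why the second site reads as a one-copy loss even though its absolute concentration equals that of the first.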

CNV calling

CNVhac implements an HMM-based algorithm to call CNVs. HMM methods have been applied successfully in other studies
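For readers unfamiliar with HMM segmentation, the following is a generic Viterbi sketch on raw CNs, not the CNVhac implementation. The three states, the Gaussian emission model, the noise level and the transition probability are all illustrative assumptions.

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def viterbi_copy_number(raw_cn, states=(1, 2, 3), sigma=0.3, stay_prob=0.99):
    """Most likely copy-number state per site under a simple HMM (assumed parameters)."""
    n_states = len(states)
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (n_states - 1))
    # Uniform initial distribution plus the first emission.
    score = [-math.log(n_states) + gaussian_logpdf(raw_cn[0], s, sigma) for s in states]
    back = []
    for x in raw_cn[1:]:
        new_score, pointers = [], []
        for j, s in enumerate(states):
            # Best predecessor state for landing in state j.
            best_i = max(range(n_states),
                         key=lambda i: score[i] + (log_stay if i == j else log_move))
            best = score[best_i] + (log_stay if best_i == j else log_move)
            new_score.append(best + gaussian_logpdf(x, s, sigma))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [states[i] for i in path]

def cnv_segments(path, normal_state=2):
    """Collapse a state path into (start_index, end_index, copy_number) runs."""
    segments, start = [], None
    for idx, cn in enumerate(path + [normal_state]):
        if start is not None and cn != path[start]:
            segments.append((start, idx - 1, path[start]))
            start = None
        if start is None and cn != normal_state:
            start = idx
    return segments

raw = [2.1, 1.9, 2.0, 1.1, 0.9, 1.0, 1.2, 2.0, 2.1, 1.9]
path = viterbi_copy_number(raw)
segments = cnv_segments(path)

# A single low probe is not enough to leave the normal state with these settings.
outlier_path = viterbi_copy_number([2.0, 2.0, 1.0, 2.0, 2.0])
```

The high self-transition probability is what separates a supported multi-probe deletion from an isolated noisy probe: leaving and re-entering the normal state must be paid for by the emission likelihood of several consecutive sites.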

Results

The CNVhac pipeline consists of two major steps. The preprocessing step first estimates the raw CNs

Raw CN estimation on HapMap CEU samples

We assess the performance of raw CN estimation from two aspects: the accuracy in classifying the sex of HapMap individuals and the amplitude of genomic waviness. Females have two copies of the X chromosome, while males have only one; therefore, the CN of the X chromosome can naturally be used as a benchmark to evaluate the power of the raw CN estimates to differentiate between one and two copies. We collected the same 59 CEU parents in Dataset I to do this classification task as

**ROC curves of the sex classification for CNVhac, CRMA_v2 and cn.FARMS on 59 HapMap CEU founders.** Left: Full ROC curves. Right: Top-left corner of ROC curves. CNVhac performs better than CRMA_v2 and cn.FARMS.
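The area under an ROC curve of this kind can be computed directly from two groups of scores, as sketched below. This is a generic illustration with invented values. Whether the scores are per-locus raw CNs or per-individual summaries is not stated in this excerpt, so the example makes no claim about the authors' exact procedure.

```python
def auc_from_scores(positive_scores, negative_scores):
    """AUC as the probability that a positive score exceeds a negative score.

    Ties count as one half (the Mann-Whitney formulation).
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical chromosome X raw CNs: females (two copies) are the positive class.
female_cn = [2.0, 1.9, 1.4]
male_cn = [1.0, 1.1, 1.5]

auc = auc_from_scores(female_cn, male_cn)  # 8 of 9 pairs are ordered correctly
```

An AUC of 1 would mean that every female value lies above every male value, so tighter raw CN distributions translate directly into curves closer to the top-left corner.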

The better result of sex classification by CNVhac may be attributed to better control of genomic waviness. To assess the waviness, we investigated the estimated raw CNs of chromosome X used above. The three sets of raw CNs were separately scaled to the same median. For females, the median is set as 2 and for males 1. Figure

**Genomic wave patterns on a segment of Chromosome X of one CEU female founder, NA06985, for (a) cn.FARMS, (b) CRMA_v2 and (c) CNVhac.** CNVhac has the smallest amplitude of estimated raw CNs.

**Density of raw CNs estimated by different methods for (a) male CEU founders and (b) female CEU founders on chromosome X.** Raw CNs are scaled to the same median (for males 1 and females 2). CNVhac shows significantly smaller variance than CRMA_v2 and cn.FARMS (F test, all

CNV calling on HapMap samples

The cross-platform verified regions in dataset II are defined as true CNVs to assess the power of CNV detection for CNVhac and Birdsuite on the 269 samples from dataset I (NA19012 is missing in the result of

**1-precision versus recall curves for CNV detection on 269 HapMap samples.** A curve that is located more toward the upper-left corner indicates better performance. Note: FDR is 1-precision. Compared to Birdsuite, CNVhac shows an appreciably lower FDR when calling CNVs.
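The quantities on the two axes can be computed as in the sketch below. It is a minimal illustration with invented regions. The matching rule used here, where any overlap on the same chromosome counts as a hit, is an assumption and may differ from the criterion used in the study.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) regions share at least one position."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def precision_recall(called, truth):
    """Precision and recall of called regions against a truth set."""
    true_calls = sum(any(overlaps(c, t) for t in truth) for c in called)
    found_truth = sum(any(overlaps(t, c) for c in called) for t in truth)
    precision = true_calls / len(called) if called else 0.0
    recall = found_truth / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical truth set and calls.
truth = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
called = [("chr1", 150, 250), ("chr2", 60, 70), ("chr3", 10, 20), ("chr1", 900, 950)]

precision, recall = precision_recall(called, truth)
fdr = 1.0 - precision  # FDR as defined in the caption above
```

Sweeping a calling threshold and recomputing these two numbers at each setting traces out one curve per method.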

Sample batch dependence of CNV calling

As described in the Background section, parameters trained from different sample batches may lead to inconsistent inference. To evaluate the sample batch dependence of CNV calling by CNVhac, we compare it with Birdsuite. In CNVhac, the adjustment factor is estimated from a reference group, so three sets of CNV regions can be detected through the adjustment factors estimated from three different groups (G1, G2 and G3). For Birdsuite, each tested individual was put into the other two groups, which do not contain it. Hence, one can also obtain three sets of identified CNVs. We chose 6 individuals (2 CEU, 2 YRI, 1 JPT and 1 CHB) to call CNVs based on the different groups. Table

| Individual | Birdsuite G1^{§} | Birdsuite G2 | Birdsuite G3 | Birdsuite I^{¶} | Birdsuite U^{†} | Birdsuite Ratio^{‡} | CNVhac G1 | CNVhac G2 | CNVhac G3 | CNVhac I | CNVhac U | CNVhac Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NA12156 | 17 | 19 | 21 | 14 | 22 | 0.64 | 15 | 17 | 18 | 15 | 17 | 0.88 |
| NA12878 | 22 | 21 | 19 | 15 | 28 | 0.54 | 29 | 26 | 24 | 20 | 33 | 0.61 |
| NA18507 | 19 | 15 | 20 | 10 | 23 | 0.43 | 16 | 20 | 20 | 15 | 21 | 0.71 |
| NA18517 | 20 | 21 | 21 | 14 | 25 | 0.56 | 21 | 21 | 18 | 16 | 23 | 0.7 |
| NA18555 | 16 | 16 | 15 | 11 | 20 | 0.55 | 16 | 14 | 17 | 11 | 18 | 0.61 |
| NA18956 | 13 | 12 | 16 | 9 | 16 | 0.6 | 20 | 21 | 24 | 16 | 24 | 0.67 |

^{§}The number of predicted CNVs using group 1 for parameter training.

^{¶}The number of CNVs in the intersection set of “G1”, “G2” and “G3”.

^{†}The number of CNVs in the union set of “G1”, “G2” and “G3”.

^{‡}The ratio of intersection to union.
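The intersection-to-union ratio reported in the table can be computed as sketched below. The sketch treats each CNV call as a hashable region identifier, which is an assumption: how calls from different training groups were matched to one another is not described in this excerpt, and the identifiers are invented.

```python
def consistency_ratio(*call_sets):
    """Return (intersection size, union size, ratio) across several call sets."""
    sets = [set(calls) for calls in call_sets]
    inter = set.intersection(*sets)
    union = set.union(*sets)
    ratio = len(inter) / len(union) if union else 0.0
    return len(inter), len(union), ratio

# Hypothetical region identifiers called with three training groups.
g1 = {"r1", "r2", "r3"}
g2 = {"r2", "r3", "r4"}
g3 = {"r2", "r3", "r5"}

n_inter, n_union, ratio = consistency_ratio(g1, g2, g3)  # 2, 5, 0.4

# The table's first row follows the same arithmetic: I / U.
birdsuite_na12156 = round(14 / 22, 2)  # 0.64
cnvhac_na12156 = round(15 / 17, 2)     # 0.88
```

A ratio close to 1 means that the same regions are called regardless of which group was used for training, which is the sense in which the table measures sample batch dependence.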

Discussion

For years, array-based technologies have been widely used to explore CNV events. However, the inherent noise of microarray data may lead to a high FDR when making inferences. In array experiments, hybridization is highly correlated with the sequence constitutions

To address the cross-hybridization of probes, genomic waves of intensities and the sample dependence of parameter estimation, we propose in this article a single-array preprocessing method, termed CNVhac, to estimate raw CNs more accurately. Based on the previous PICR method

CNVs have attracted much attention in recent years because they are assumed to play a significant role in causing human disease

Since CNVhac is a single-array strategy, the running time for a large set of samples can be reduced by executing CNVhac on multiple processors in parallel. Also, since parameters are consistent between arrays, there is no need to reprocess earlier data when new samples are hybridized.

Conclusion

Cross-hybridization and differing amplification efficiencies of probes are common difficulties in microarray analysis. Most studies attempt to solve the problem by training numerous model parameters from a large dataset, but this can yield inconsistent results. Moreover, the statistical power of this methodology may be significantly reduced when the training dataset is not big enough. In this article, we first addressed the cross-hybridization problem through a physicochemical law and then proposed a simple adjustment for the varying amplification rates. Our method, CNVhac, avoids complicated statistical models that need many samples for training. By comparing CNVhac with other methods, we have established that our simple process is effective and suitable for all Affymetrix SNP array types with similar design standards. Finally, the working principle of CNVhac can be easily extended to other platforms, such as Illumina and Agilent arrays.

Endnotes

CNVhac^{a}: The algorithm is implemented in R and C++ and is available at

Abbreviations

CN, Copy number; CNV, Copy number variation; FDR, False discovery rate; AC, Allelic concentration; HMM, Hidden Markov Model; GWAS, Genome-wide association studies; PICR, Probe intensity composite representation; PDNN, Position-dependent nearest-neighbor; OLS, Ordinary least squares; CRMA, Copy-number estimation using Robust Multichip Analysis; cn.FARMS, Factor analysis for robust microarray summarization; ROC, Receiver operating characteristic; AUC, Area under ROC curve.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

MPQ and MHD conceived the project. MPQ, LW and MHD proposed the main idea. QW and PCP developed the program. QW implemented the methods, analyzed the data, and wrote the manuscript. MPQ, LW and MHD finalized the manuscript. All authors read and approved the final manuscript.

Funding

This work was supported by the National Natural Science Foundation of China [No.31171262, No.11021463] and the National Key Basic Research Project of China [No.2009CB918503].

Acknowledgements

We thank Linbo Wang and Yongjian Kang for helpful discussions.
