Department of Mechanical Engineering, University of Melbourne, Parkville, VIC 3010, Australia

Bioinformatics Core Facility, Peter MacCallum Cancer Centre, VIC 3002, Australia

Abstract

Background

Copy number variation (CNV) is one of the main types of genetic variation in cancer. Whole exome sequencing (WES) is a popular alternative to whole genome sequencing (WGS) for studying disease-specific genomic variations. However, finding CNV in cancer samples using WES data has not been fully explored.

Results

We present a new method, called CoNVEX, to estimate copy number variation in whole exome sequencing data. It uses the ratio of tumour and matched normal average read depths at each exonic region to predict copy gain or loss. The useful signal in WES data is hindered by the intrinsic noise present in the data itself, which limits its use as a highly reliable source for CNV detection. Here, we propose a method that uses the discrete wavelet transform (DWT) to reduce noise. The copy number gain or loss of each targeted region is then identified by a Hidden Markov Model (HMM).

Conclusion

HMMs are frequently used to identify CNV in data produced by various technologies, including Array Comparative Genomic Hybridization (aCGH) and WGS. Here, we propose an HMM to detect CNV in cancer exome data. We used modified data from the 1000 Genomes project to evaluate the performance of the proposed method. Using these data, we have shown that CoNVEX significantly outperforms the existing methods in terms of precision. Overall, CoNVEX achieved a sensitivity of more than 92% and a precision of more than 50%.

Background

Commercial products of Next Generation Sequencing (NGS) Technologies such as Roche/454 FLX, Illumina Genome Analyzer/HiSeq, Applied Biosystems SOLiD™ System and Helicos Heliscope™ have enabled the sequencing of DNA much faster and cheaper than before.

Cancer arises due to the acquisition of many somatic variations by the DNA of cancer cells.

In this work, we present CoNVEX, a method that evaluates exon level depth of coverage ratios to assess variation in copy number of whole exome capture data produced from cancer samples. We propose to use Discrete Wavelet Transformation denoising to reduce the variability of coverage ratios and then use HMM to detect copy number variations. Our method reduces the number of false positives by efficient pre-processing of the data, which results in a mean precision of more than 50%.

Methods

Data pre-processing

Depth of coverage ratios at each targeted region

Number of reads covering each base at a targeted region is calculated using BEDTools

Where

DWT smoothing of the data

The actual copy number of the exon regions can be masked by the noise present in the data itself. This would lead to a large number of false positives. The raw signal of exon level ratios can be represented as below,

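$$r_i = s_i + \epsilon_i$$

Here, $\epsilon_i \sim \mathcal{N}(0, \sigma_i^{2})$ is the additive noise, where $r_i$ is the observed ratio of the $i^{th}$ exon and $s_i$ is its true value.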
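As an illustration of wavelet-based smoothing, the following is a minimal sketch that uses a Haar wavelet with hard thresholding of the detail coefficients. The choice of wavelet, the number of levels, the threshold value and the function names (`haar_forward`, `haar_inverse`, `dwt_smooth`) are assumptions made for this example; the settings actually used by CoNVEX are not stated in this section.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_forward(x):
    """One Haar level: scaled pairwise sums (approximation) and differences (detail)."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Invert one Haar level."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out

def dwt_smooth(ratios, levels=3, threshold=0.5):
    """Smooth exon-level ratios by zeroing small Haar detail coefficients.

    levels and threshold are illustrative values, not those of the paper.
    """
    n = len(ratios)
    block = 2 ** levels
    # Pad by repeating the last value so the length is a multiple of 2**levels.
    padded = list(ratios) + [ratios[-1]] * (-n % block)
    approx, kept_details = padded, []
    for _ in range(levels):
        approx, detail = haar_forward(approx)
        # Hard thresholding: coefficients that do not exceed the threshold become zero.
        kept_details.append([c if abs(c) > threshold else 0.0 for c in detail])
    for detail in reversed(kept_details):
        approx = haar_inverse(approx, detail)
    return approx[:n]

noisy = [1.1, 0.9, 1.1, 0.9, 1.1, 0.9, 1.1, 0.9]
smoothed = dwt_smooth(noisy)
```

In this toy input the small exon-to-exon fluctuations fall below the threshold and are removed, whereas a sustained step between two levels produces a large coarse-scale coefficient and is preserved.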

CNV prediction using a Hidden Markov Model

The copy number state for each targeted region is assigned using a Hidden Markov Model. The copy numbers are represented by the hidden states, and by default we use states 0 to 5. These six states can be interpreted in a biological context as homozygous deletion (copy 0), hemizygous deletion (copy 1), no CNV or copy neutral (copy 2), one-copy gain (copy 3), and two- and three-copy amplification (copies 4 and 5). DWT smoothed ratios,

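1. The total number of hidden states in the model is given by $K$, and the states are denoted $S = \{S_1, S_2, \ldots, S_K\}$. The state of the $l^{th}$ targeted region is $q_l = S_k$.

2. The initial state distribution $\pi = \{\pi_k\}$, where $\pi_k = P(q_1 = S_k)$.

3. The state transition probability distribution $A = \{a_{mp}\}$, where $a_{mp} = P(q_{l+1} = S_p \mid q_l = S_m)$.

4. The emission probability distribution is given by $B = \{b_k(\mathbf{O})\}$, where $b_k(\mathbf{O})$ is the probability of the observations $\mathbf{O}$ given state $S_k$.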

Here, _{k}

The above HMM can be represented compactly as $\lambda = (A, B, \pi)$.

The optimal
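To make the state assignment concrete, here is a minimal sketch of decoding the most likely copy number path with the Viterbi algorithm. It assumes Gaussian emissions centred at (copy number)/2, a common standard deviation, a fixed probability of staying in the same state and an initial distribution that favours the copy neutral state. All of these parameter values, and the name `viterbi_copy_number`, are assumptions for illustration and are not taken from the paper.

```python
import math

def viterbi_copy_number(ratios, means, sd=0.15, stay=0.99, neutral=2):
    """Return the most likely hidden state (copy number) for each ratio."""
    n_states = len(means)

    def log_emit(k, r):
        # Log density of a Gaussian emission for state k (assumed form).
        z = (r - means[k]) / sd
        return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))

    def log_trans(m, p):
        # Assumed transition model: high self-transition, uniform otherwise.
        if m == p:
            return math.log(stay)
        return math.log((1.0 - stay) / (n_states - 1))

    # Assumed initial distribution favouring the copy neutral state.
    log_init = [math.log(0.9) if k == neutral else math.log(0.1 / (n_states - 1))
                for k in range(n_states)]

    score = [log_init[k] + log_emit(k, ratios[0]) for k in range(n_states)]
    backpointers = []
    for r in ratios[1:]:
        new_score, pointers = [], []
        for p in range(n_states):
            best = max(range(n_states), key=lambda m: score[m] + log_trans(m, p))
            pointers.append(best)
            new_score.append(score[best] + log_trans(best, p) + log_emit(p, r))
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Expected ratio for copy numbers 0..5 relative to a diploid normal (assumption).
STATE_MEANS = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ratios = [1.0] * 5 + [0.5] * 5 + [1.0] * 5
path = viterbi_copy_number(ratios, STATE_MEANS)
```

With these illustrative parameters, the five exons with a ratio of 0.5 are assigned to state 1 (hemizygous deletion) and the flanking exons to state 2 (copy neutral).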

Relationship between DOC ratio and copy number

Without any imperfections, the normalized ratios between regional DOC of tumour and control samples

where

where _{T }

Data from 1000 Genome Project

We randomly selected six samples, NA18536, NA18543, NA18544, NA18548, NA18557, NA18558, from 1000 Genome project, which share some common attributes, to evaluate the performance of the proposed method. These selected individuals have been studied by the HapMap project

Simulated data with known copy number variations

We used depth of coverage data at each exon of the 1000 Genome samples to simulate CNV. This ensures that we retain as much as possible of the intrinsic noise present in regions without copy number variation. The simulation procedure is as follows,

1. First, we retained only the copy number neutral regions in each sample. The CNV information was downloaded from the HapMap project genotype file.

2. We selected one sample (NA18536) as the Control and others as the tumour with known CNV.

3. To simulate gains and losses, we randomly selected a region on chromosome 1 and reduced (e.g. multiplied by 0.05) or amplified (e.g. multiplied by 2) the number of reads in that region. For each variation type, we performed 100 simulations.

4. When we evaluated the performance using only one sample (NA18543), we used 100 simulations for each variation type. When we used 5 samples in simulations, 20 variations were simulated in each individual sample.

5. To incorporate contamination in the simulation, we mix the control sample and simulated sample as per the relationship (
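The read-scaling step of the simulation above can be sketched as follows. This is a simplified illustration that works on a list of per-exon depths and measures region length in exons rather than base pairs; the function names `random_region` and `simulate_cnv` are hypothetical, and the contamination mixing of step 5 is not included.

```python
import random

def random_region(n_exons, length, rng):
    """Pick a random start index so that the region fits inside the chromosome."""
    return rng.randrange(0, n_exons - length + 1)

def simulate_cnv(depths, start, length, factor):
    """Scale the depth of `length` consecutive exons starting at `start`.

    factor < 1 simulates a loss (e.g. 0.05), factor > 1 a gain (e.g. 2).
    Returns the simulated depths and a per-exon truth mask.
    """
    end = min(start + length, len(depths))
    simulated = [d * factor if start <= i < end else d
                 for i, d in enumerate(depths)]
    truth = [start <= i < end for i in range(len(depths))]
    return simulated, truth

rng = random.Random(0)          # fixed seed so the example is reproducible
depths = [40.0] * 100           # toy copy number neutral depths
start = random_region(len(depths), 5, rng)
deleted, truth = simulate_cnv(depths, start, 5, 0.05)
```

Keeping the truth mask alongside the simulated depths makes it straightforward to score the calls of any method against the known variation.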

Results and discussion

Exon level depth of coverage ratios to detect CNV in whole exome data

We used normalized depth of coverage ratios of the exons between the tumour/normal pair to identify the underlying copy number losses and gains. As a quality control procedure, all regions with an average coverage of less than 10 in the matched normal sample are eliminated from both the tumour and normal data sets. However, the useful signal for CNV detection is degraded by the noise present in the data itself. This can be attributed to the GC content bias, mappability, bait capture bias
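The quality control filter and ratio calculation described above can be sketched as follows. The inputs are assumed to be average depths per exon, already computed for the tumour and the matched normal. The normalisation used here (dividing by the median ratio so that a typical exon has a ratio of 1) is an assumption for illustration, as is the function name `exon_depth_ratios`.

```python
import statistics

def exon_depth_ratios(tumour_depth, normal_depth, min_normal_depth=10.0):
    """Return indices of retained exons and their normalised tumour/normal ratios.

    Exons whose average depth in the normal sample is below
    `min_normal_depth` are dropped from both samples.
    """
    kept = [i for i, n in enumerate(normal_depth) if n >= min_normal_depth]
    raw = [tumour_depth[i] / normal_depth[i] for i in kept]
    # Assumed normalisation: centre the ratios on their median.
    centre = statistics.median(raw)
    return kept, [r / centre for r in raw]

tumour = [20.0, 40.0, 5.0, 40.0, 42.0]
normal = [40.0, 40.0, 5.0, 40.0, 42.0]
kept, ratios = exon_depth_ratios(tumour, normal)
```

In this toy example the third exon is removed by the coverage filter, and the first exon has a ratio of 0.5, consistent with the loss of one copy.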

**Exon level coverage ratios before and after smoothing**. Exon level coverage ratios among tumour and matched normal samples (A, C) before DWT smoothing and (B, D) after DWT smoothing. (A, B) show the ratios against the mean log coverage among the two samples. (C, D) show ratios of chromosome 1 exons against their start locations.

Different methods have been proposed to reduce the experimental biases present in TR data. These include GC content bias reduction using regression methods

In this work, we propose to combine the strengths of both DWT and HMM to robustly predict copy number variations in cancer samples. The main novelty of our approach is the use of DWT smoothing to reduce experimental biases present in whole exome sequencing data prior to applying a Hidden Markov Model. These experimental biases are modelled here as additive noise to the true signal. The wavelet coefficients, which are the differences between two nearby data blocks, can be used to reduce noise. This is achieved by setting to zero the coefficients that do not exceed a certain threshold. When the inverse transform is then performed on the thresholded wavelet coefficients, it produces a smoother version of the input signal. Exon level ratios, before and after DWT smoothing, for data downloaded from 1000 Genome project (

After smoothing, we applied the HMM described in the Methods section to detect copy gains and losses. Hidden Markov Models have previously been used to detect CNV in exome data (an R package called ExomeCopy)

• ExomeCopy uses HMM to identify CNVs in male patients with X-linked Intellectual Disabilities (XLID)

• It uses the depth of coverage of exons as the observations, or emissions, of the hidden states

• The robustness in copy identification is achieved by pooling coverage data from all patients

Therefore, it fails to identify relative copy number in cancer samples against a matched normal.

Comparison of the performance of CoNVEX against other methods

Comparison against ExomeCNV using simulated data

We carried out a comparison between the proposed method and the existing method, ExomeCNV

A true positive (TP) is an exon whose gain or loss is correctly identified by the algorithm; a false positive (FP) is, correspondingly, an exon reported as a gain or loss where no such variation is present. When using ExomeCNV, we used their primary CNV detection method (hereafter referred to as ExomeCNV1) and the extension which combines DNACopy

We used simulated data as described in Methods section to carry out the comparison. For this, we simulated deletions and duplications in different size ranges. The results of this evaluation are given in Table

Performance of proposed method for 100 simulations.

| **Type** | **Sensitivity** | **Specificity** | **Precision** | **Accuracy** |
|---|---|---|---|---|
| Deletions (1 k - 1 M bp) | 97.82 ± 12.37% | 99.94 ± 0.081% | 79.25 ± 23.23% | 99.94 ± 0.081% |
| Duplications (1 k - 1 M bp) | 95.25 ± 19.64% | 99.93 ± 0.082% | 77.04 ± 26.43% | 99.93 ± 0.085% |

Performance of CoNVEX (the proposed method) in terms of sensitivity, specificity, precision and accuracy. Each entry lists the mean and standard deviation of the performance measure.

Performance of ExomeCNV1 for 100 simulations.

| **Type** | **Sensitivity** | **Specificity** | **Precision** | **Accuracy** |
|---|---|---|---|---|
| Deletions (1 k - 1 M bp) | 97.91 ± 2.81% | 86.20 ± 1.57% | 8.76 ± 6.54% | 86.24 ± 1.56% |
| Duplications (1 k - 1 M bp) | 90.68 ± 9.02% | 86.26 ± 1.55% | 8.96 ± 8.57% | 86.28 ± 1.54% |

Performance of ExomeCNV in terms of sensitivity, specificity, precision and accuracy. These results are obtained from running the primary method of ExomeCNV (ExomeCNV1). Each entry lists the mean and standard deviation of the measure.

Performance of ExomeCNV2 for 100 simulations.

| **Type** | **Sensitivity** | **Specificity** | **Precision** | **Accuracy** |
|---|---|---|---|---|
| Deletions (1 k - 1 M bp) | 99.26 ± 2.11% | 96.00 ± 1.67% | 8.69 ± 6.50% | 96.01 ± 1.66% |
| Duplications (1 k - 1 M bp) | 99.98 ± 0.16% | 96.06 ± 1.65% | 9.62 ± 9.25% | 96.08 ± 1.64% |

Performance of ExomeCNV in terms of sensitivity, specificity, precision and accuracy. These results are obtained from running the extension of ExomeCNV which includes the DNACopy package (ExomeCNV2). Each entry lists the mean and standard deviation of the measure.

When compared with ExomeCNV2, our method showed superior performance in terms of specificity, precision and accuracy. CoNVEX showed a slight decrease in sensitivity, mainly when detecting short variations involving one or two exons. This can be attributed to the smoothing step we performed using DWT. We therefore separately tested the performance of CoNVEX on shorter variation sizes, as described below. Both versions of ExomeCNV showed very poor precision, as they try to detect as many variations as possible to maintain a high sensitivity.
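The following sketch shows how the four performance measures reported in the tables can be computed from exon-level truth and calls, and why a method can combine high sensitivity with low precision when variant exons are rare. The counts in the example are hypothetical and chosen only for illustration; they are not data from this study, and the function name `confusion_metrics` is an assumption.

```python
def confusion_metrics(truth, calls):
    """Compute exon-level performance measures.

    truth, calls: equal-length sequences of booleans, True meaning
    'this exon carries (or is called as) a gain or loss'.
    Assumes at least one positive and one negative in truth, and at
    least one positive call; otherwise a division by zero occurs.
    """
    pairs = list(zip(truth, calls))
    tp = sum(1 for t, c in pairs if t and c)
    fp = sum(1 for t, c in pairs if not t and c)
    fn = sum(1 for t, c in pairs if t and not c)
    tn = sum(1 for t, c in pairs if not t and not c)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / len(pairs),
    }

# Hypothetical example: 1000 exons, 10 of them truly variant.
truth = [True] * 10 + [False] * 990
# A caller that finds all 10 variants but also flags 140 neutral exons.
calls = [True] * 10 + [True] * 140 + [False] * 850
metrics = confusion_metrics(truth, calls)
```

In this hypothetical case the sensitivity is 100% and the specificity is about 86%, yet only 10 of the 150 calls are correct, so the precision is below 7%. Because neutral exons vastly outnumber variant exons, even a moderate false positive rate dominates the set of calls.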

Performance assessment of other methods against CoNVEX

To evaluate the performance of CoNVEX against VarScan2

Performance of CoNVEX against other methods.

| **Method** | **True positives** | **False positives** |
|---|---|---|
| CoNVEX | 9/10 | 10/15850 |
| VarScan2 | 6/7 | 4983/15283 |
| ExomeCopy | 0/10 | 9/15850 |
| CONTRA | 0/10 | 0/15847 |

The table shows the number of exons (numerator) that have been identified as true positives and false positives by each method. The denominator shows the total number of exons with a variation (2^{nd} column) and without a variation (3^{rd} column).

ExomeCopy and CONTRA did not identify any of the variations present in the test sample. This can be attributed to the fact that these methods are specifically designed to use a background sample

Performance of proposed method at different duplication and deletion sizes

We observed that small deletions or duplications span only one or at most two exons, due to the sparseness of the exome data. To evaluate the performance of CoNVEX on short variations, we carried out a performance assessment using simulated small deletions and duplications on chromosome 1 of individuals NA18536 and NA18543. The results are given in Figure

**Performance at different variation sizes**. Performance of CoNVEX when detecting short variations. Performance is measured by sensitivity and precision. Both graphs show the median (solid lines), 0.1 quantile (dashed lines) and 0.9 quantile (dashed lines) of the results from 100 simulations of (A) duplications and (B) deletions. The solid blue line shows the median sensitivity and the solid red line shows the median precision. The sizes considered are 200, 400, 600, 800, 1 k, 1.2 k, 1.4 k, 2 k, 5 k and 10 k bases.

The median sensitivity of CoNVEX for small variation detection is 100%. Every deletion of more than 200 bp was detected by our method, giving a mean sensitivity of 100% for detecting deletions. The mean sensitivity for each duplication size was more than 85%. As seen in the graph, almost every variation of more than 800 bases can be detected by the proposed method. In addition, a median precision of more than 30% can be achieved.
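The summary statistics used above (median, 0.1 quantile and 0.9 quantile over repeated simulations) can be computed as in the following sketch. It uses the Python standard library; the interpolation method (`"inclusive"`) and the function name `summarise` are assumptions, since the quantile definition used for the figure is not stated.

```python
import statistics

def summarise(values):
    """Return the 0.1 quantile, median and 0.9 quantile of a list of results."""
    # quantiles(..., n=10) returns the nine cut points between deciles.
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return {
        "q10": deciles[0],
        "median": statistics.median(values),
        "q90": deciles[-1],
    }

# Toy input standing in for per-simulation results at one variation size.
summary = summarise(list(range(101)))
```

Applying such a summary separately at each variation size gives the three curves per measure described in the figure.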

Performance assessment at different levels of contamination

Normal cell admixture in a cancer sample must be taken into account when predicting copy number losses and gains. The presence of admixture shrinks the DOC ratios towards 1 (also discussed in Methods). Our method assumes that the user provides the contamination percentage as an input. However, this information might not be available for every experiment. Hence, we evaluated our method on simulated data from NA18543 under two scenarios: in the first, the admixture rate is available; in the second, the programme is run without any indication of contamination. The performance of CoNVEX, for admixture rates ranging from 10% to 70%, in terms of sensitivity, under the first scenario is given in Figure

**Performance of CoNVEX at different admixture rates**. The plots show the mean sensitivity of CoNVEX at different admixture rates for (A) duplications and (B) deletions. The dashed red line shows the sensitivity when the user provides the admixture rate as an input. The dashed blue line shows the sensitivity of the model when it expects zero admixture. The size range of duplications and deletions considered here is 1 k - 10 k bp.
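The shrinkage of DOC ratios towards 1 under normal cell admixture can be illustrated with a simple linear mixing model. This is a sketch under stated assumptions: the normal cells are diploid, the sample is a mixture of a fraction `admixture` of normal cells and `1 - admixture` of tumour cells, and depth is proportional to copy number. The function names are hypothetical and the formula is a standard mixing relation, not necessarily the exact expression used in the Methods.

```python
def expected_ratio(copy_number, admixture=0.0, normal_copies=2):
    """Expected tumour/normal DOC ratio for a region with the given tumour copy number."""
    mixed_copies = (1.0 - admixture) * copy_number + admixture * normal_copies
    return mixed_copies / normal_copies

def corrected_ratio(observed_ratio, admixture):
    """Undo the shrinkage when the admixture rate is known (must be < 1)."""
    return (observed_ratio - admixture) / (1.0 - admixture)

pure_loss = expected_ratio(1)                    # 0.5 without contamination
mixed_loss = expected_ratio(1, admixture=0.5)    # 0.75 with 50% normal cells
recovered = corrected_ratio(mixed_loss, 0.5)     # back to 0.5
```

Under this model, a hemizygous deletion at 50% admixture yields a ratio of 0.75 rather than 0.5, which illustrates why a model that expects zero admixture can lose sensitivity as contamination increases.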

Conclusions

Exome sequencing data can be used to detect copy number variations as an initial screening procedure. It is a cheap and time-efficient method. We have successfully applied the proposed method to exome data to identify CNVs spanning one to thousands of exons. However, the actual breakpoint of a CNV does not necessarily lie in a coding region. This limits the use of WES in identifying the actual breakpoints of CNVs.

As discussed in the Results and Discussion section, we achieved a higher precision than existing methods in detecting variations, owing to the data smoothing step. However, some small variations may be missed because of this smoothing step, as they can be treated as noise. Further analysis is needed to better detect these variations at higher noise levels.

Although we have used a matched normal sample to detect CNVs, CNV identification can also be done based on a pooled normal sample as described in

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

KCA designed the method, evaluated the performance and drafted the manuscript. JL and SKH contributed to improve the method. All authors read and approved the final manuscript.

Acknowledgements

We thank Dr. Isaam Saeed and Dr. Suhinthan Maheswararajah for initial discussions on HMM. We used resources from both University of Melbourne and Peter MacCallum Cancer Centre for data processing and analysis. This work is partially funded by Australian Research Council (grant DP1096296).

This article has been published as part of

Declarations

The funding for open access charges was provided by The University of Melbourne.