Biomedical Informatics, Dept of Informatics, University of Oslo, Oslo, Norway

Centre for Cancer Biomedicine, University of Oslo, Oslo, Norway

Cancer Genome Project, Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK

Dept of Human Genetics, VIB and University of Leuven, Leuven, Belgium

Dept of Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, Oslo, Norway

Institute of Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway

Dept of Oncology, Division of Cancer, Surgery and Transplantation, Oslo University Hospital Radiumhospitalet, Oslo, Norway

Dept of Immunology, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, Oslo, Norway

Breast Cancer Functional Genomics, Cancer Research UK Cambridge Research Institute and Dept of Oncology, University of Cambridge, Li Ka-Shing Centre, Cambridge, UK

Cambridge Breast Unit, Addenbrookes Hospital and Cambridge National Institute for Health Research Biomedical Research Centre, Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK

Abstract

Background

Cancer progression is associated with genomic instability and an accumulation of gains and losses of DNA. The growing variety of tools for measuring genomic copy numbers, including various types of array-CGH, SNP arrays and high-throughput sequencing, calls for a coherent framework offering unified and consistent handling of single- and multi-track segmentation problems. In addition, there is a demand for highly computationally efficient segmentation algorithms, due to the emergence of very high density scans of copy number.

Results

A comprehensive Bioconductor package for copy number analysis is presented. The package offers a unified framework for single sample, multi-sample and multi-track segmentation and is based on statistically sound penalized least squares principles. Conditional on the number of breakpoints, the estimates are optimal in the least squares sense. A novel and computationally highly efficient algorithm is proposed that utilizes vector-based operations in R. Three case studies are presented.

Conclusions

The R package

Background

In cancer, the path from normal to malignant cell involves multiple genomic alterations including losses and gains of genomic DNA. A long series of studies have demonstrated the biological and clinical relevance of studying such genomic alterations (see, e.g.,

To achieve high processing efficiency, dynamic programming is used (see

• Independent as well as joint segmentation of multiple samples

• Segmentation of allele-specific SNP array data

• Preprocessing tools for outlier detection and handling, and missing value imputation.

• Visualization tools

Implementation

Systems overview

The input to the package is $\log_2$-transformed copy number measurements from one or more aCGH, SNP-array or HTS experiments. Allele frequencies may also be specified for the segmentation of SNP-array data. It is strongly recommended to detect and appropriately modify extreme observations (outliers) prior to segmentation, as these can have a substantial negative effect on the analysis. For this purpose, a specially designed Winsorization method is included in the software package. A missing-value imputation method appropriate for copy number data is also available.

**An overview of the copynumber package.** Depending on the aim of the analysis, the input will be copy number data and possibly allele frequencies from one or more experiments. Preprocessing tools are available for outlier handling and missing data imputation, and three different methods handle single sample, multi-sample and allele-specific segmentation. Several options are also available for the graphical visualization of data and segmentation results.

Segmentation methods for three different scenarios (single sample, multi-sample and allele-specific segmentation) are implemented in the package. All these methods are referred to as Piecewise Constant Fitting (PCF) algorithms and seek to minimize a penalized least squares criterion. In single sample PCF, individual segmentation curves are fitted to each sample. In multi-sample PCF, segmentation curves with common segment borders are simultaneously fitted to all samples. In allele-specific PCF, the segmentation curves are fitted to bivariate SNP-array data, providing identical segment borders for both data tracks. A set of graphical tools is also available in the package to visualize data and segmentation results, and to plot aberration frequencies and heatmaps. Also included are diagnostics to explore different trade-offs between goodness-of-fit and parsimony in terms of the number of segments. In the remaining part of this section, a formal description of the algorithms is given. However, note that these details are not a prerequisite for reading later sections or for using the copynumber package.

Preprocessing: Outlier handling

A challenging factor in copy number analysis is the frequent occurrence of outliers: single probe values that differ markedly from their neighbors. Such extreme observations can be due to the presence of very short segments of DNA with deviant copy numbers, to technical aberrations, or to a combination of the two. When identification of CNVs is a purpose of the study, the multi-sample method described below may be applied for such detection. However, when the focus is on detection of broader aberrations, the potentially harmful effect of extreme observations on aberration detection methods induces a need for outlier handling procedures. Given observations $y_1,\dots,y_p$, the corresponding Winsorized observations are defined as

$$\psi(y_j)=\begin{cases}-\tau, & y_j<-\tau,\\ \tau, & y_j>\tau,\\ y_j, & \text{otherwise.}\end{cases}$$

Here, $\tau>0$ is a threshold, typically chosen proportional to a robust estimate of the SD of $y_1,\dots,y_p$. For normally distributed observations, $s_M=1.4826\cdot\mathrm{MAD}$ corresponds to SD.

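Winsorization of copy number data may be achieved by first estimating the trend in the data and then Winsorizing the residuals. Let the observations representing copy numbers be $\mathbf{y}=(y_1,\dots,y_p)$, ordered according to genomic position. A simple estimator of the trend is the median filter: the trend estimate at locus $j$ is the median of $y_{j-k},\dots,y_{j+k}$ for some window half-width $k$. The SD of the residuals may then be estimated with the MAD-based estimator above, and Winsorized observations are obtained by adding the Winsorized residuals back to the trend estimate.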
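The trend-then-residual procedure can be illustrated with a small Python sketch. This is not the package's R implementation: the function names, the truncated windows at the ends of the arm, and the default values `k=2` and `tau=2.5` are assumptions made for the example.

```python
from statistics import median


def mad_sd(values):
    """Robust SD estimate: 1.4826 times the median absolute deviation."""
    center = median(values)
    return 1.4826 * median([abs(v - center) for v in values])


def winsorize(y, k=2, tau=2.5):
    """Winsorize residuals around a running-median trend.

    The median window covers positions j-k..j+k and is truncated at the
    ends of the sequence (an assumption of this sketch).
    """
    p = len(y)
    trend = [median(y[max(0, j - k):min(p, j + k + 1)]) for j in range(p)]
    resid = [y[j] - trend[j] for j in range(p)]
    limit = tau * mad_sd(resid)  # clipping threshold for the residuals
    # Clip each residual to [-limit, limit] and add the trend back.
    return [t + max(-limit, min(limit, r)) for t, r in zip(trend, resid)]


# Hypothetical logR values with one extreme probe at index 4.
logr = [0.1, -0.1, 0.2, 0.0, 8.0, -0.2, 0.1, 0.0, -0.1, 0.2]
cleaned = winsorize(logr)
```

In this toy input only the extreme probe is pulled towards the local trend; the remaining values stay within the clipping limits and are left essentially unchanged. If the MAD of the residuals is zero, every value collapses onto the trend, so real data would need a guard for that case.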

**This pdf-file contains a formal description of the iterative PCF-based Winsorization algorithm.**


Single sample segmentation

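Consider first the basic problem of obtaining individual segmentations for each of a number of samples. Suppose attention is restricted to one chromosome arm on one sample. For each of the $p$ probes on the arm, the measurement $y_j$ is modeled as

$$y_j=f_j+\varepsilon_j,\qquad j=1,\dots,p,\tag{1}$$

where $f_j$ is an unknown parameter reflecting the actual amount of sample DNA at the $j$'th locus and $\varepsilon_j$ represents measurement noise. A breakpoint is said to occur between probes $j$ and $j+1$ if $f_j\neq f_{j+1}$. The sequence $f_1,\dots,f_p$ thus implies a segmentation $S=\{I_1,\dots,I_M\}$ of the chromosome arm, where $I_1$ consists of the probes before the first breakpoint, $I_2$ consists of the subsequent probes until the second breakpoint, and so on. To fit model (1), we minimize the penalized least squares criterion

$$\sum_{j=1}^{p}(y_j-f_j)^2+\gamma\cdot|S|$$

with respect to the sequence $f_1,\dots,f_p$. Here, $|S|$ is the number of segments in $S$ and $\gamma>0$ is the penalty. For a fixed segmentation, the criterion is minimized by setting $f_j$ equal to the average $\bar{y}_I$ of the observations in the segment $I$ containing probe $j$, so the problem reduces to minimizing

$$\sum_{I\in S}\sum_{j\in I}(y_j-\bar{y}_I)^2+\gamma\cdot|S|=\sum_{j=1}^{p}y_j^2-\sum_{I\in S}\frac{1}{n_I}\Bigl(\sum_{j\in I}y_j\Bigr)^2+\gamma\cdot|S|$$

where $n_I$ denotes the number of probes in segment $I$.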

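Naive optimization of the cost function (5) with respect to the segmentation $S$ is computationally infeasible, but a dynamic programming algorithm requiring $O(p^2)$ operations is available. Dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems, and specifically for problems where global decisions can be decomposed into a series of nested smaller decision problems. The crucial observation that allows the use of dynamic programming to solve the present segmentation problem is that the optimal segmentations on each side of a breakpoint are mutually independent. This can be used to iteratively build up a solution to the global segmentation problem. Suppose the optimal segmentations from the first probe up to the $r$'th probe, with total errors $e_r$, are known for all probes $r<k$.

Then the total error for the optimal solution up to index $k$, when the last segment starts at probe $j$, is the sum of the fit cost $d_j^k$ of that segment, the error of the optimal solution up to the preceding probe ($e_{j-1}$) and the penalty for the break point ($\gamma$). Hence

$$e_k=\min_{j\in\{1,\dots,k\}}\bigl(d_j^k+e_{j-1}+\gamma\bigr),\qquad d_j^k=-\frac{1}{k-j+1}\Bigl(\sum_{i=j}^{k}y_i\Bigr)^2,$$

where $e_0=0$ and $d_j^k$ is the least squares cost of a segment covering probes $j,\dots,k$, up to the constant sum of squared observations. The main work load of the above computation is to determine $d_j^k$ for $j=1,\dots,k$.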

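Algorithm 1: Single sample PCF

Input: observations $y_1,\dots,y_p$; penalty $\gamma>0$. Output: segment start indices $s_1,\dots,s_M$ and segment averages.

1. Calculate scores by letting $\mathbf{a}_0=[\,]$ and $\mathbf{e}_0=0$, and iterate for $k=1,\dots,p$:

• $\mathbf{a}_k=[\mathbf{a}_{k-1}\ 0]+y_k$

• $\mathbf{d}_k=-\mathbf{a}_k*\mathbf{a}_k/(k,k-1,\dots,1)$

• $\mathbf{e}_k=[\mathbf{e}_{k-1}\ \min(\mathbf{d}_k+\mathbf{e}_{k-1}+\gamma)]$

storing also the index $t_k\in\{1,2,\dots,k\}$ at which the minimum is attained.

2. Find segment start indices (right to left) by following the stored indices backwards from $t_p$.

3. Find segment averages as the means of the observations within each segment.

Here $*$ and $/$ denote elementwise multiplication and division, and adding the scalar $y_k$ to a vector adds it to every element.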
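The steps of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative translation, not the package's R code: the name `pcf`, the 0-based indexing and the list-based stand-ins for the vector operations are assumptions of the sketch.

```python
def pcf(y, gamma):
    """Penalized least squares segmentation by dynamic programming.

    Returns the 0-based start index of each segment and the segment means.
    """
    p = len(y)
    e = [0.0] * (p + 1)        # e[k]: optimal penalized cost for y[0:k]
    last_start = [0] * (p + 1)  # start of the last segment in that optimum
    a = []                      # a[j]: running sum of y[j:k]
    for k in range(1, p + 1):
        # Extend all partial sums by the new observation.
        a = [s + y[k - 1] for s in a] + [y[k - 1]]
        # Cost of ending at k with the last segment starting at j.
        costs = [-a[j] * a[j] / (k - j) + e[j] + gamma for j in range(k)]
        best_j = min(range(k), key=costs.__getitem__)
        e[k], last_start[k] = costs[best_j], best_j
    # Backtrack from the right end to recover the segment starts.
    starts = []
    k = p
    while k > 0:
        starts.append(last_start[k])
        k = last_start[k]
    starts.reverse()
    bounds = starts + [p]
    means = [sum(y[s:t]) / (t - s) for s, t in zip(bounds, bounds[1:])]
    return starts, means


# Toy data with one level shift after the fourth probe.
starts, means = pcf([0, 0, 0, 0, 5, 5, 5, 5], gamma=1.0)
```

With a small penalty the level shift is recovered as a breakpoint; with a very large penalty the same data are fitted by a single segment. The inner list comprehension plays the role of the vector update in step 1 and is what makes the procedure quadratic in the number of probes.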

Throughout the paper we will tacitly assume that the penalty for the

Multi-sample segmentation

Detection of very short or very low amplitude segments requires a small penalty

where **y**,

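Algorithm 2: Multi-sample PCF

Input: observations $\mathbf{y}_1,\dots,\mathbf{y}_p\in\mathbb{R}^n$; penalty $\gamma>0$. Output: common segment start indices $s_1,\dots,s_M$ and segment averages.

1. Calculate scores by letting $\mathbf{A}_0=[\,]$ and $\mathbf{e}_0=0$, and iterate for $k=1,\dots,p$:

• $\mathbf{A}_k=[\mathbf{A}_{k-1}\ \mathbf{0}]+\mathbf{y}_k$

• $\mathbf{d}_k=-\mathbf{1}^{T}(\mathbf{A}_k*\mathbf{A}_k)/(k,k-1,\dots,1)$

• $\mathbf{e}_k=[\mathbf{e}_{k-1}\ \min(\mathbf{d}_k+\mathbf{e}_{k-1}+\gamma)]$

storing also the index $t_k\in\{1,2,\dots,k\}$ at which the minimum is attained.

2. Find segment start indices (right to left) by following the stored indices backwards from $t_p$.

3. Find segment averages as the per-sample means of the observations within each segment.

Here $\mathbf{y}_k$ is added to every column of the $n\times k$ matrix, $\mathbf{1}^{T}$ sums over the $n$ samples, and $*$ and $/$ act elementwise.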

The multi-sample PCF algorithm (see Algorithm 2) is in principle quite similar to single sample PCF. However, when updating the solution from **y**
^{
i
} by _{
i }are weights and
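A minimal Python sketch of segmentation with common breakpoints is given below. It assumes equal weights for all samples, a single penalty `gamma` per segment, and prefix sums in place of the matrix update; the name `multi_pcf` and these choices are illustrative and are not taken from the package.

```python
def multi_pcf(samples, gamma):
    """Joint segmentation of several equally long tracks.

    samples: list of n tracks, each a list of p values.
    Returns common 0-based segment starts and, per track, the segment means.
    """
    p = len(samples[0])
    # Prefix sums per track, so any segment sum is a difference of two entries.
    prefix = []
    for track in samples:
        acc = [0.0]
        for value in track:
            acc.append(acc[-1] + value)
        prefix.append(acc)
    e = [0.0] * (p + 1)
    last_start = [0] * (p + 1)
    for k in range(1, p + 1):
        costs = []
        for j in range(k):
            # Fit cost of segment j..k-1, summed over all tracks.
            fit = -sum((acc[k] - acc[j]) ** 2 for acc in prefix) / (k - j)
            costs.append(fit + e[j] + gamma)
        best_j = min(range(k), key=costs.__getitem__)
        e[k], last_start[k] = costs[best_j], best_j
    starts = []
    k = p
    while k > 0:
        starts.append(last_start[k])
        k = last_start[k]
    starts.reverse()
    bounds = starts + [p]
    means = [[(acc[t] - acc[s]) / (t - s) for s, t in zip(bounds, bounds[1:])]
             for acc in prefix]
    return starts, means


# Two toy tracks: only the first has a level shift, but both share the border.
joint_starts, joint_means = multi_pcf([[0, 0, 0, 3, 3, 3], [1, 1, 1, 1, 1, 1]], gamma=1.0)
```

The flat second track receives the same segment border as the first but identical means on both sides, which is the behaviour expected when breakpoints are forced to be common.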

Allele-specific segmentation

The PCF algorithm is easily adapted to variants of the basic segmentation problem discussed above. Here, we consider an adaptation to handle SNP genotype data. We then have for each SNP locus a measurement of (total) copy number (logR) as well as the B allele frequency (BAF). We may also have measurements of copy number only for a number of additional loci. The B allele frequency is a number between 0 and 1 indicating the allelic imbalance of a SNP. For a homozygous locus we have BAF close to 0 or 1, while for a heterozygous locus with an equal number of the two alleles A and B, BAF will be close to 0.5. An imbalance between the number of A’s and B’s results in a BAF value deviating from 0.5. A change in the total number of copies of a segment will alter the logR value, and hence result in a level shift in the logR track. Unless the copy number change is balanced with respect to the two alleles, the BAF value will also change. In cases involving multiple copy number events at the same locus, the change may manifest itself only in one of the two tracks. For example, a loss of one copy of A followed by a gain of one copy of B would lead to unchanged logR and changed BAF. The purpose of the allele-specific PCF algorithm is to detect breakpoints for all such events. It fits piecewise constant curves simultaneously to the logR and the BAF data, forcing breakpoints to occur at the same positions in both. We emphasize that the purpose of the allele-specific PCF algorithm is segmentation only and not to make allele-specific copy number calls. However, such calls can be made on the basis of the segmentation described below, and this is done e.g. in the ASCAT algorithm (Allele-Specific Copy number Analysis of Tumors), which estimates allele-specific copy numbers as well as the percentage of cells with aberrant DNA and the tumor ploidy. The input data are pairs $(r_j,b_j)$ for $j=1,\dots,p$, where $r_j$ denotes the logR value and $b_j$ the BAF value at the $j$'th locus. For loci where only the copy number is measured, $r_j$ is given and $b_j$ will be missing (henceforth coded as NA). For germline homozygous probes, the BAF values are noninformative and should be omitted from the analysis. If the germline genotype is known (e.g. from a matching blood sample), the user should replace the corresponding BAF values by NA. If the genotype is not known, the algorithm will apply a proxy to handle this issue (see below). Prior to segmentation, the allele-specific PCF algorithm performs the following steps:

The BAF data are mirrored around 0.5 by replacing $b_j$ with $1-b_j$ if $b_j>0.5$.

BAF values $b_j<$

Let
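The mirroring step listed above is simple enough to show directly. The following Python sketch covers only that step; the name `mirror_baf` and the use of `None` to stand for NA are assumptions of the example, and the remaining preprocessing steps are not reproduced.

```python
def mirror_baf(baf):
    """Mirror B allele frequencies around 0.5; None marks a missing value."""
    mirrored = []
    for b in baf:
        if b is None:
            mirrored.append(None)      # missing BAF stays missing
        elif b > 0.5:
            mirrored.append(1.0 - b)   # fold the upper half onto the lower half
        else:
            mirrored.append(b)
    return mirrored


folded = mirror_baf([0.25, 0.5, 0.75, None, 1.0])
```

After mirroring, the two symmetric BAF bands of an imbalanced segment collapse into a single band below 0.5, which is what allows a piecewise constant curve to be fitted to the BAF track.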

The remaining part of the allele-specific PCF algorithm is then essentially an adaptation of the multi-sample PCF algorithm applied to two samples. It finds a common segmentation

where

Fast implementations of PCF

The PCF algorithms may be generalized to allow breakpoints only at certain prespecified positions. Combined with simple heuristics, this may be used to further enhance the computational speed of PCF. For brevity we describe only the single sample segmentation case here; however the ^{2}), where

Having found the solution for the

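Algorithm 3: Fast PCF

Input: observations $y_1,\dots,y_p$; penalty $\gamma>0$. Output: segment start indices $s_1,\dots,s_M$ and segment averages.

1. Apply heuristics to find potential breakpoints $r_0,r_1,\dots,r_q$, where $r_0=1$ and $r_q=p+1$.

2. Form aggregates by letting $u_k=\sum_{j=r_{k-1}}^{r_k-1}y_j$ for $k=1,\dots,q$.

3. Calculate scores by letting $\mathbf{a}_0=[\,]$, $\mathbf{c}_0=[\,]$, $\mathbf{e}_0=0$, and iterate for $k=1,\dots,q$:

• $\mathbf{a}_k=[\mathbf{a}_{k-1}\ 0]+u_k$

• $\mathbf{c}_k=[\mathbf{c}_{k-1}\ 0]+r_k-r_{k-1}$

• $\mathbf{d}_k=-\mathbf{a}_k*\mathbf{a}_k/\mathbf{c}_k$

• $\mathbf{e}_k=[\mathbf{e}_{k-1}\ \min(\mathbf{d}_k+\mathbf{e}_{k-1}+\gamma)]$

storing also the index $t_k\in\{1,2,\dots,k\}$ at which the minimum is attained.

4. Find segment start indices (right to left) by following the stored indices backwards from $t_q$.

5. Find segment averages as the means of the observations within each segment.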
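Steps 2 to 5 of Algorithm 3 can be sketched in Python as below. The heuristic of step 1 is not reproduced: the candidate breakpoints are simply passed in as an argument, which is an assumption of this sketch, as are the name `fast_pcf` and the 0-based indexing.

```python
def fast_pcf(y, gamma, candidates):
    """PCF restricted to prespecified breakpoint positions.

    candidates: 0-based indices, strictly between 0 and len(y), at which a
    new segment is allowed to start. Index 0 is always included.
    """
    p = len(y)
    r = sorted({0, *candidates})
    r.append(p)                      # r[q] marks the end of the sequence
    q = len(r) - 1
    # Aggregate the observations between consecutive candidate positions.
    block_sum = [sum(y[r[k]:r[k + 1]]) for k in range(q)]
    block_len = [r[k + 1] - r[k] for k in range(q)]
    e = [0.0] * (q + 1)
    last_start = [0] * (q + 1)
    a, c = [], []                    # running sums and probe counts
    for k in range(1, q + 1):
        a = [s + block_sum[k - 1] for s in a] + [block_sum[k - 1]]
        c = [m + block_len[k - 1] for m in c] + [block_len[k - 1]]
        costs = [-a[j] * a[j] / c[j] + e[j] + gamma for j in range(k)]
        best_j = min(range(k), key=costs.__getitem__)
        e[k], last_start[k] = costs[best_j], best_j
    # Backtrack over blocks, then map block indices to probe indices.
    blocks = []
    k = q
    while k > 0:
        blocks.append(last_start[k])
        k = last_start[k]
    blocks.reverse()
    starts = [r[b] for b in blocks]
    bounds = starts + [p]
    means = [sum(y[s:t]) / (t - s) for s, t in zip(bounds, bounds[1:])]
    return starts, means


signal = [0, 0, 0, 0, 5, 5, 5, 5]
fast_starts, fast_means = fast_pcf(signal, 1.0, [2, 4, 6])
```

The dynamic program now runs over the q aggregated blocks rather than the p probes, so its cost depends on the number of candidates. The price is that a true breakpoint missing from the candidate list can only be approximated by the nearest allowed position.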

Results and discussion

Selection of penalty

The selection of parameters determining the trade-off between high sensitivity (i.e. few missed true aberrations) and high specificity (i.e. few false aberrations) is important in all segmentation procedures. In PCF, this is controlled by the single penalty parameter

**The effect of changing the penalty in PCF.** The plot in the upper left corner shows the copy number data for a selected chromosome (in this case, chromosome 17), while the lower right plot shows the number of segments found by PCF as a function of the penalty.

Aberration calling

Aberration calling is used for detection of recurring alterations and in many other analyses. Aberrations are called by thresholding the segment estimates: a segment is called a gain if its estimate exceeds a threshold and a loss if it falls below the negative of that threshold; alternatively, separate thresholds may be used for gains and losses. To examine how well PCF aberration calling manages to distinguish between normal and aberrant regions, performance was compared with a very accurate measurement method. Specifically, aberration calls obtained with PCF on the basis of 1.8M SNP array data on 40 samples were compared with calls obtained with MLPA (Multiplex Ligation-dependent Probe Amplification; see Additional file
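Threshold-based calling can be sketched in a few lines of Python. The function name and the default thresholds of ±0.05 are assumptions chosen for illustration; suitable values depend on the platform and the noise level.

```python
def call_aberrations(segment_means, gain_threshold=0.05, loss_threshold=-0.05):
    """Label each segment estimate as 'gain', 'loss' or 'normal'."""
    calls = []
    for estimate in segment_means:
        if estimate > gain_threshold:
            calls.append("gain")
        elif estimate < loss_threshold:
            calls.append("loss")
        else:
            calls.append("normal")
    return calls


calls = call_aberrations([0.30, -0.20, 0.01])
```

Sweeping the thresholds over a range of values and comparing the resulting calls with a reference classification yields the sensitivity and specificity pairs from which a ROC curve is drawn.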

**This pdf-file contains a description of the three data sets used in this paper.**


**Aberration calling accuracy.** The ROC-curves show the sensitivity and specificity for a sequence of thresholds as calculated by comparing aberration calls to the classifications made in a MLPA-analysis on the same data material. In panel **(a)**, classifications were made based on PCF segmentations found for a wide range of penalty values. Panel **(b)** shows that aberration calls based on multi-sample PCF segmentations are about as accurate as those based on single sample PCF. In panel **(c)**, ROC-curves are shown for calls made on the basis of the segmentations found by PCF and CBS, a running median with window size 50 and raw data. In terms of aberration calling accuracy, PCF and CBS give nearly the same results, while using the running median gives slightly less accurate classifications. Using only raw data leads to much poorer accuracy. Note the range on the ordinate axis.

Single- versus multi-sample segmentation

Whether the initial segmentation of a dataset is most appropriately done using single- or multi-sample methods depends both on the purpose and the data. Using methods with common breakpoints for samples will increase the power for detecting concordant but quantitatively weak segments, while it will reduce the ability of detecting (or correctly positioning) discrepant breakpoints. A well known example of aberrations with common boundaries is germline copy number variants (CNVs), thus some proposed algorithms for CNV detection utilize segmentation with joint segment borders (e.g.


**Comparison of results from single sample and multi-sample PCF.** In single sample PCF, **b**) should be well suited as variables in statistical analyses. On a more detailed level there are differences, e.g., longer segments in the single sample analysis (panel **a**) are divided into subsegments with slightly different estimates in the multi-sample analysis. The plot was created with the function

Comparing tracks: Analysis of disseminated tumor cells

Disseminated tumor cells (DTCs) are detected in the bone marrow of some patients with breast carcinomas. The presence of DTCs in the bone marrow identifies patients with less favorable outcome (see, e.g.,


**Analysis of disseminated tumor cells (DTCs) with multi-sample PCF.** The top panel shows the primary tumor and the three panels below show single cells morphologically classified as DTCs (all for chromosome 2). High noise levels make separate analyses of each DTC difficult; co-analyzing multiple DTCs, possibly together with a primary tumor, thus facilitates an evaluation of the degree of correspondence between the aberration patterns. In the present case, two DTCs seem to have aberration patterns similar to the primary tumor, while the last cell has an essentially flat (balanced) pattern and is probably a hematopoietic cell misclassified as a DTC. The plot was created with the function

Defining variables: Genetic evolution in follicular lymphoma

Follicular lymphoma is normally a slowly progressing malignancy, but relapses are common and the disease is usually fatal. In a recent study, 100 biopsies from 44 patients diagnosed with follicular lymphoma were evaluated using a custom-made aCGH platform consisting of 3k BAC/PAC probes


**Whole-genome view of aberrations in the follicular lymphoma data.** The plot is based on all 100 biopsies, and aberrations were defined as copy number estimates above 0.05 (for gains) or below -0.05 (for losses). Aberration frequencies are shown in red for gains and green for losses. Correlations between the copy number activity at different genomic locations are shown as arcs (blue for positive correlations and yellow for negative correlations), using a correlation threshold of ±0.68 to determine which correlations to display. Aberration frequencies are based on the segmentation found with single sample PCF (with

Although the delineation of segments varied between biopsies, several areas with a high frequency of aberrations could be detected. To try to identify aberrations with prognostic potential, we therefore found a common segmentation for the initial biopsies taken from each of the 44 patients using the multi-sample PCF algorithm. Removing very low variance segments, 93 segments remained. The corresponding copy number estimates were used as covariates in a multivariate Cox proportional hazards regression. This revealed 11 segments for which gains were significantly associated with a survival disadvantage. A particularly strong association was detected for gains on chromosome X in male patients. To study the relation between successive biopsies taken from the same patient, multi-sample PCF was applied to each patient individually (see Additional file

Allele-specific copy number analysis in breast cancer

Copy number alterations have been extensively studied in breast cancer. To what degree gains and losses are associated only with certain alleles has been less studied. In a recent study, genotyping of 112 breast carcinoma samples was performed using Illumina 109K SNP arrays, and the ASCAT method was used to infer the allele-specific copy numbers at each locus


**Allele-specific PCF analysis of SNP array data.** Results are shown for a breast carcinoma sample in the MicMa cohort for chromosome 1 (panel **a**) and chromosome 17 (panel **b**). The points in the upper two panels show observed total copy numbers (logR) while the points in the lower two panels show observed B allele frequencies (BAF). The red curves show the result of applying the allele-specific PCF segmentation method to the data. The plot was created with the function

If for a certain SNP locus one allele is substantially more frequently gained than the other allele, one may hypothesize that the former allele is subject to a larger selective pressure to change copy number. This, in turn, may be an indication of different roles being played by the two alleles with respect to cancer progression and evolution, suggesting that loci subject to allelic skewness can be potential unique markers for breast cancer development. Even from a relatively small number of samples, probes with highly significant allelic skewness have been identified in a genome-wide statistical evaluation

Outliers and Winsorization

While least squares methods are often favored due to their optimality properties, they are also known to be sensitive to extreme observations. Thus, except if the purpose is to search for short aberrations of biological origins (CNVs), we advise the use of an outlier handling procedure. To evaluate the proposed Winsorization scheme, we first established a suitable way of simulating extreme observations. A classical way is to use “contaminated normals”, where the error distribution is a mixture of two normal distributions: with probability $1-\alpha$ the error is drawn from $N(0,\sigma^2)$, and with probability $\alpha$ from $N(0,k^2\sigma^2)$, typically with
Shown is the effect of Winsorization on simulated data with outliers and artificial (low-amplitude) aberrations. Two types of aberrations are considered: (A) aberrations of height 1.5 and length 10 probes and (B) aberrations of height 1.0 and length 30. The contamination consists of normals with SD=3 and the MAD estimate of SD equals 1.0. Sensitivity is the percentage of amplified probes that are detected as amplified, while specificity is the percentage of non-amplified probes classified as such. The false aberration column gives the percentage of aberrations not covering the central part of the real amplifications.

| **Type** | **Distribution** | **Sensitivity (%)** | **Specificity (%)** | **False aberrations (%)** |
|---|---|---|---|---|
| A | Normal | 79.5 | 96.5 | 0.15 |
| A | Normal w/5% contam. | 78.8 | 93.7 | 1.04 |
| A | Normal w/5% contam., Winsor. | 78.1 | 96.0 | 0.13 |
| B | Normal | 78.9 | 93.6 | 0.20 |
| B | Normal w/5% contam. | 77.8 | 90.6 | 1.06 |
| B | Normal w/5% contam., Winsor. | 77.5 | 93.3 | 0.15 |

Another way to avoid that a few extreme observations result in a segment is to impose a lower limit on the length (number of probes) of a segment. With a lower length limit of five probes, we found about twice as many false spikes as with Winsorization when adjusting

Computational performance

In R, using the vector based PCF implementation described in Algorithm 1 implies a substantial efficiency gain over loop based implementation, roughly a 10-20 times reduction in time requirements. The fast implementation of PCF (Algorithm 3) gives a further marked reduction in computing time. On the MicMa 244k dataset (longest arm ≈10000 probes), the implemented fast version is about 15 times faster than the exact one, and uses around 3.5 minutes to process the 49 samples (4 seconds per sample, see Table

The average computation time (in seconds) per sample is shown for

| **Method** | **R package** | **Agilent 244K, raw data** | **Agilent 244K, outliers removed** | **Illumina 1.1M, raw data** | **Illumina 1.1M, outliers removed** |
|---|---|---|---|---|---|
| PCF | copynumber | 4 (0.2) | 4 (0.2) | 23 (0.7) | 22 (0.4) |
| Fused Lasso | cghFLasso | 5 (0.2) | 5 (0.2) | 97 (0.7) | 99 (3.3) |
| CBS | DNAcopy | 15 (4.7) | 35 (4.1) | 71 (12.9) | 219 (12.8) |

The deviations between the solutions found by the exact PCF and fast PCF on the MicMa set were small; in terms of reduction in variance (difference between sample variance and residual variance after fitting PCF curves) below 0.01%. The differences observed for the curves were typically small shifts in the border of aberrations. Thus, we conclude that the results from the fast procedure for practical purposes may be regarded as global solutions to (3), and the fast version is therefore used by default in

Segmentation accuracy

We further compare the accuracy of the segmentation solutions found by PCF and CBS. Figure

**This pdf-file describes a comparison of segmentations performed by CBS and PCF on a MicMa sample.**


Conclusions

Copy number segmentation based on least squares principles and combined with a suitable penalization scheme is appealing, since the solution will be optimal in a least squares sense for a given number of breakpoints. We have proposed a suite of platform-independent algorithms based on this principle for independent as well as joint segmentation of copy number data. The algorithms perform similarly to other leading segmentation methods in terms of sensitivity and specificity. Furthermore, the proposed algorithms are easy to generalize and are computationally very efficient, also on high-resolution data. The Bioconductor package

Several extensions and modifications of the proposed least-squares framework are possible. In principle, the L2-based distance measure used in the current implementation of PCF is easily extended to general Lp-distances. However the current implementation is highly optimized for L2, and other distance measures would require substantial heuristics to obtain comparable computational performance. Another extension is to introduce locus specific penalties for breakpoints, thus essentially introducing a prior on the location of breakpoints. Work in progress includes specialized routines to handle high throughput sequencing data more efficiently and joint analysis of multiple samples in allele-specific PCF.

Availability and requirements

**Project name:** Copynumber

**Project home page:**

**Operating system(s):** All systems supporting the R environment

**Programming language:** R

**Other requirements:** None

**License:** Artistic License 2.0.

Abbreviations

aCGH: Array Comparative Genomic Hybridization; AIC: Akaike’s Information Criterion; ASCAT: Allele-Specific Copy number Analysis of Tumors; BAC: Bacterial Artificial Chromosome; BAF: B-Allele Frequency; BIC: Schwarz’s Bayesian Information Criterion; CBS: Circular Binary Segmentation; CNV: Copy Number Variation; DTC: Disseminated Tumor Cells; FL: Fused Lasso; HTS: High-Throughput Sequencing; IQR: Interquartile Range; MAD: Median Absolute Deviation; MLPA: Multiplex Ligation-dependent Probe Amplification; PCF: Piecewise Constant Fitting (the method used for segmentation in this paper); ROC: Receiver Operating Characteristic curve; SNP: Single-nucleotide Polymorphism.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

The study was initiated by KL, ALBD and OCL. GN, KL and OCL drafted the manuscript. The software was written by GN with contributions from KL based on algorithms developed by GN, KL and OCL. PVL, HKMV, MBE and LOB contributed with examples and in discussions of the manuscript and software. OMR, SFC, RR and CC provided and analysed the MLPA data. All authors have read, commented on and accepted the final manuscript.

Acknowledgements

GN, KL and OCL received funding from the Centre of Cancer Biomedicine (CCB) at the University of Oslo for equipment and travelling. PVL is a postdoctoral researcher of the Research Foundation - Flanders (FWO).