Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

Open Access Methodology article

Simple binary segmentation frameworks for identifying variation in DNA copy number

Tae Young Yang

Author Affiliations

Department of Mathematics, Myongji University, Yongin, Kyonggi, 449-728, Korea

BMC Bioinformatics 2012, 13:277  doi:10.1186/1471-2105-13-277

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/13/277


Received:30 April 2012
Accepted:22 October 2012
Published:30 October 2012

© 2012 Yang; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Variation in DNA copy number, due to gains and losses of chromosome segments, is common. A first step for analyzing DNA copy number data is to identify amplified or deleted regions in individuals. To locate such regions, we propose a circular binary segmentation procedure, which is based on a sequence of nested hypothesis tests, each using the Bayesian information criterion.

Results

Our procedure is convenient for analyzing DNA copy number in two general situations: (1) when using data from multiple sources and (2) when using cohort analysis of multiple patients suffering from the same type of cancer. In the first case, data from multiple sources such as different platforms, labs, or preprocessing methods are used to study variation in copy number in the same individual. Combining these sources provides a higher resolution, which leads to a more detailed genome-wide survey of the individual. In this case, we provide a simple statistical framework to derive a consensus molecular signature. In the framework, the multiple sequences from various sources are integrated into a single sequence, and then the proposed segmentation procedure is applied to this sequence to detect aberrant regions. In the second case, cohort analysis of multiple patients is carried out to derive overall molecular signatures for the cohort. For this case, we provide another simple statistical framework in which data across multiple profiles is standardized before segmentation. The proposed segmentation procedure is then applied to the standardized profiles one at a time to detect aberrant regions. Any such regions that are common across two or more profiles are probably real and may play important roles in the cancer pathogenesis process.

Conclusions

The main advantages of the proposed procedure are flexibility and simplicity.

Keywords:
Bayesian information criterion; Circular binary segmentation; Consensus molecular signature; Overall molecular signature; Variation in DNA copy number

Background

Copy number variations (CNVs) in DNA, due to gains and losses of chromosome segments, is common among healthy individuals and an important feature of tumor genomes. In healthy individuals, CNVs (most of which are inherited) are usually short and spaced far apart, whereas in tumor subjects, they can be quite long, sometimes spanning entire chromosomes. Because genomic instability can trigger the overexpression or activation of oncogenes and the silencing of tumor suppressors, mapping regions of common genomic aberrations have been used to discover cancer-related genes. Understanding genome aberrations is important for a basic understanding of cancer, as well as for diagnosis and clinical practice [1,2]. CNVs from cancer tissues, referred to as copy number aberrations (CNAs), are acquired somatic aberrations most often observed only in cancer tissues. There is significant interest in locating CNVs in normal individuals and CNAs in tumor subjects [3].

Various microarray technologies, including array comparative genomic hybridization (aCGH), Affymetrix single-nucleotide polymorphism (SNP) genotyping arrays, Illumina Infinium arrays, and other SNP arrays, are used to investigate the roles of CNVs/CNAs. Here we describe aCGH in detail [4,5]. In this technique, DNA from a test sample and a normal reference sample are labeled differentially, using different fluorophores, and hybridized to several thousand spots on microarray chips. The spots are derived from most of the known genes and non-coding regions of the genome, printed on a glass slide. The recorded value for each probe in a given sample is usually the log2 ratio of the copy number measurement at the probe to its reference value, often computed from a set of normal population controls. The log2ratio of the normal state, in which the copy number of the target agrees with that of the control, should have a mean equal to zero. A contiguous stretch of measurements that are on average higher or lower than zero suggests a gain or loss in copy number.

The analysis of DNA copy number data consists of identifying amplified or deleted regions in each individual. There can be multiple CNVs/CNAs in a chromosome from a single sample. The binary segmentation procedure proposed by Vostrikova [6] has been widely used for locating multiple change-points. In each stage of this procedure, a single-change-point model is compared to a constant model with no change-points. Thus, the procedure is easily implemented and circumvents the computational complexity normally faced in problems with a variable number of change-points. A potential problem with the binary segmentation procedure is that it cannot detect a small segment buried in the middle of a large segment. Olshen et al. [7] modified the binary segmentation procedure to compare a model with a pair of change-points to a constant model with no change-points in each stage. This modified procedure is called circular binary segmentation, which is particularly useful for detecting short regions of a chromosome [7]. This approach recursively splits chromosomes into segments based on a statistic similar to the Student statistic, whose p-value is estimated by a time-consuming permutation process. To locate multiple CNVs/CNAs, we propose using circular binary segmentation based on a sequence of nested hypothesis tests, each using the Bayesian information criterion (BIC) [8]. Note that our version is based on the existing circular binary segmentation strategy, but the proposed BIC is computationally simple, and is different from previous methods. Various authors [9-11] have suggested a BIC criterion for determining the number of change-points.

In Methods Section, we describe the derivation of the proposed procedure and present a numerical example and simulation study. The proposed procedure can be flexibly adapted to analyze multiple DNA copy number data sets to discover both consensus and overall molecular signatures. In Results Section, these two general situations are separately discussed in “Integration of multiple platforms” and “Cohort analysis of multiple individuals”.

Methods

Let xi denote the log2ratio of the copy number measurement at the i-th probe of an individual. The vector X = (x1,…,xm) is then a DNA copy number data set for one chromosome of the individual, arranged according to genomic order along the chromosome.

For a given threshold τ+ > 0, we construct a Bernoulli data set A = (a1,…,am) for gain events such that

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M1">View MathML</a>

(1)

In a hypothetical situation for aCGH, Pollack et al. [12] specified log20.8 ≤ log2 ratio <log21.2 (-0.32 to 0.26) for the normal state, log21.2 ≤ log2 ratio < log22.0 (0.26 to 1) for low amplification, log22.0 ≤ log2 ratio < log23.0 (1 to 1.58) for medium amplification, and log2ratio > log23.0 (=1.58) for high amplification. To locate low, medium, and high amplification, we would use τ+=0.32, 1, and 1.58, respectively. If there are gain events in the target chromosome of the individual, we expect to see many consecutive 1s in A.

For a given threshold τ< 0, we create D = (d1,…,dm) such that

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M2">View MathML</a>

(2)

Pollack et al. [12] also specified log2ratio < log20.8 (=-0.32) for loss. We would use τ= −0.32. If there are loss events in the target chromosome, we expect to see many consecutive 1s in D.

The search for gain events is performed separately from that for loss events. To detect gain (loss) regions for an individual, we apply the following procedure to A(D).

Circular binary segmentation procedure

We assume that the success rate for a Bernoulli data set A at probe location i changes according to

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M3">View MathML</a>

where δ(E) is the indicator function for event E and 0 = c0 < c1 < ⋯ < cK < cK + 1 = m are the unknown integer-valued change-points with associated success rates p1,…,pK + 1. The goal of the change-point problem is to identify the number of change-points K, the change-points c1,…,cK, and the associated success rates p1,…,pK + 1.

We let M0 denote the constant model with no change-points (i.e. θ0 = p1 = ⋯ = pm). In M0, the likelihood is

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M4">View MathML</a>

(3)

Using the circular binary segmentation procedure, we reduce the complexity of the problem by assuming that the segment forms a circle. We test the hypothesis that the arc from c1 + 1 to c2 and its complement have different success rates. Let M1 denote the change-point model given by a pair of c1and c2. This implies that θ1 = p1 = ⋯ = <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M5">View MathML</a>, where 1 ≤ c1 < c2 m. In M1, the likelihood is

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M6">View MathML</a>

(4)

Let us consider the constant model M0. The likelihood function (3) is maximized by <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M7">View MathML</a>, giving <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M8">View MathML</a>. For M1, the likelihood (4) is maximized along 1 ≤ c1 < c2 m via

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M9">View MathML</a>

The fully maximized likelihood in the segmentation model <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M10">View MathML</a> is then obtained by maximizing <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M11">View MathML</a> over the finite set 1 ≤ c1 < c2 m.

We choose between M0 and M1 in accordance with the BIC. We define

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M12">View MathML</a>

(5)

where the last term in (5) is a penalty function that adjusts for the difference in dimensionality between the two models. In this application, q1 = 4 and q0 = 1. If BIC10 is negative, the decision is to accept M0. If BIC10 is positive, we reject the constant model and estimate the first segment given by the pair of <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M13">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M14">View MathML</a>.

To test M0 versus M1, the procedure begins by setting c1 = 1 and c2 = m. Let <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M15">View MathML</a> be the observed BIC10, and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M16">View MathML</a> be the corresponding interval. If <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M17">View MathML</a>, we choose M0, estimate the constant success rate to be <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M18">View MathML</a> for i ∈[1,m] with <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M19">View MathML</a>, and stop. If <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M20">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M21">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M22">View MathML</a>, and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M23">View MathML</a> are recursively scanned using the same procedure. The recursion stops when none of the subregions contains its corresponding <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M24">View MathML</a>.

Application to aCGH data

Snijders et al. [5] used aCGH to detect low-level DNA copy number gains and losses, as well as high-level amplifications for breast cancer specimen S1514. Their array contained 2276 probes for the mapped bacterial artificial chromosomes (BACs), which are large segments of DNA, typically 100 to 200 kilo-bases. Figure 1(a) shows a plot of the normalized log2 ratios of S1514. Low-level gains and losses, as well as high-level amplifications were found in S1514.

thumbnailFigure 1. The breast cancer S1514. (a) The points are normalized log2ratios. The BACs are ordered by position in the genome, beginning at 1p and ending at Xq. The inserts are chromosome numbers. The borders between chromosomes are represented by vertical bars. (b) Our analysis of S1514. The normal state, where the copy number in the target agrees with that in the control, should have mean 0. A contiguous range of measurements whose average is higher or lower than 0 suggests a respective gain or loss in copy number. To identify gains and losses, we used τ+ = 0.3 and τ= −0.3, respectively. We found single-copy duplication from the center to the end of chromosome 3. We identified double-copy duplication at the beginning of chromosome 5 and single-copy duplication in the remaining region of chromosome 5. We also identified very high-level amplification from the center to the end of chromosome 20. We found low-level losses on chromosome 13. The red lines indicate the mean values among the probes in segments detected by our method.

In Figure 1(b), we respectively use τ+ = 0.3 in Equation (1) and τ= −0.3 in Equation (2) to identify gains and losses. Our procedure was executed to detect aberrated regions for each of the 23 chromosomes. The red lines indicate the mean values among clones in segments obtained by our procedure. We found gains on chromosomes 3 and 5, loss on chromosome 13, and high-level amplification on chromosome 20.

As we increase τ + , higher-level gains are readily identifiable, as shown in Figure 2. As we decrease τ, lower-level losses are readily identifiable, as shown in Figure 3. From Figure 2 and Figure 3, amplified and deleted regions of an individual are clearly separated, because these regions would trigger the activation of oncogenes and the silencing of tumor suppressors, respectively.

thumbnailFigure 2. Our analysis of the breast cancer S1514. The borders between chromosomes are represented by vertical bars. The red lines indicate the mean values among the probes in segments obtained by the proposed procedure. As we increase the value of τ+, higher-level gains are readily identifiable. (a) For τ+ = 0.2, we identified single-copy duplication at the end of chromosomes 3 and 11. We identified double-copy duplication at the beginning of chromosome 5 and single-copy duplication in the remaining region of chromosome 5. We also identified very high-level amplification from the center to the end of chromosome 20. (b) For τ+ = 0.3, we found single-copy duplication from the center to the end of chromosome 3. We identified double-copy duplication at the beginning of chromosome 5 and single-copy duplication in the remaining region of chromosome 5. We also identified very high-level amplification from the center to the end of chromosome 20. We did not identify alternations on chromosome 11, due to low-level amplified signals. (c) For τ+ = 0.5, we identified double-copy duplication in the center of chromosome 20. At the end of chromosome 20, we identified very high-level gain (more than triple-copy). We identified double-copy duplication at the beginning of chromosome 5 and single-copy duplication in the remaining region of chromosome 5. We did not identify alternations on chromosomes 3 and 11, due to low-level amplified signals.

thumbnailFigure 3. Our analysis of the breast cancer S1514. The borders between chromosomes are represented by vertical bars. The red lines indicate the mean values among the probes in segments obtained by the proposed procedure. (a) For τ= −0.3, we identified single-copy loss at chromosome 13. (b) For τ= −0.4, we identified single-copy loss at chromosome 13.

Simulation study

We evaluated the performance of our algorithm. The data to be segmented were generated from the model xiN(μi,1),1 ≤ i m, where m is the number of probes and μ denotes the mean. Let μi = c when l < i l + k, and μi = 0 otherwise. The mean parameter c was set equal to 1, 2, or 3. The value c = 1 represents low-level amplification. The values c = 2 and c = 3 represent moderate and high-level amplification, respectively. We simulated 1000 data sets from 500 probes using this simulation setup.

We randomly selected k from (3,…,30), and l from (1,2,…,mk). The values of l and k control the location of the change and the width of the changed segment, respectively. Note that the width of the changed segment is at least 3 probes. Each data set had one elevated region ranging from 3-30 probes, and the elevation varied according to c.

The power is the proportion of data sets in which the estimated change-points equal the true change-points. Table 1 lists the power for various c and τ + . The power was lower for c = 1 because c = 1 represents low-level amplification. However, it increased as c increased.

Table 1. Power for various τ+ and c

When τ+ ≤ 2.0, we identified low- and higher-level amplification, and thus the power was high. In contrast, when τ+ ≥ 2.5, we only observed higher-level amplification as τ+ increased, and consequently the power was lower.

Results

Integration of multiple platforms

Several sources (platforms, analytical methods, and labs) were used to study the variation in copy number of the same individual. Their profiles may have different mean levels of copy number aberrations and different noise levels [13,14]. They may also have different numbers of loci and variable coverage in different parts of the genome. If data sets from several sources are analyzed individually, it is difficult to reach a consensus when they disagree on the identity of a CNV/CNA. Combining data sets may increase resolution, facilitating the discovery of genes and probes that are important in the individual. To derive a consensus molecular profile, we combine multiple sources into a single sequence, and then apply our procedure to this sequence.

The observed data constitute a two-dimensional array xij for i = 1,…,mj and j = 1,…,n, where xij is the data point at the i-th probe and the j-th source, and n is the total number of sources. For the j-th source, mj probes are ordered by chromosome location <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M25">View MathML</a>, which may have variable loci and coverage in parts of the chromosome.

For a given threshold <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M26">View MathML</a>, an indicator variable aij is defined to classify the DNA copy number level as increased or not; i.e.,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M27">View MathML</a>

(6)

We then construct a Bernoulli data set <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M28">View MathML</a> for each source j. Because different sources exhibit different degrees of attenuation of the true DNA copy number, we use a threshold <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M29">View MathML</a> for each source, rather than applying a common threshold to all sources. Note that we do not require pre-standardization of different sources. We keep these sequences ordered according to chromosome position, and integrate <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M30">View MathML</a> into a single sequence, which is the union of the chromosomic locations of probes from all profiles. Then A1,…,An are integrated into A along the single sequence. A provides a consensus molecular profile and higher resolution for detecting CNAs. If there are amplification events in the target chromosome, we expect to see many consecutive 1s in A. To identify amplification regions, we apply the proposed procedure to A, as discussed in Methods Section.

The search for loss events is performed separately from that for gain events. For a given threshold <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M31">View MathML</a> and for each source j, dij is defined to classify the DNA copy number level as decreased or not:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M32">View MathML</a>

(7)

We then construct a Bernoulli data set <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M33">View MathML</a> for j = 1,…,n, and D1,…,Dn are integrated into Dalong the integrated single sequence. To identify deletion regions for the individual, we apply the proposed procedure to D.

Application to The Cancer Genome Atlas data

The Cancer Genome Atlas (TCGA) project ( http://tcga-data.nci.nih.gov/tcga webcite) is a collaborative initiative for a better understanding of cancer, using existing large-scale complete-genome technologies [15]. One of the tumor types studied is glioblastoma multiforma (GBM), which is a brain tumor. The TCGA-02-0104 (vials 01A) sample is known to have a large number of copy number aberrations on chromosome 3 at different mean levels [13]. To provide an application to somatic CNAs, we analyze TCGA-02-0104 samples from two TCGA centers: the Memorial Sloan-Kettering Cancer Center and Harvard Medical School. Both centers adopted Agilent CGH 244 K arrays, which have 236000 loci, 12.7 kb average between loci, and 60-mer probes. The different TCGA centers have identified aberrant regions independently of one another. It has been suggested that more accurate, precise, and higher-resolution results could be obtained if copy number estimates from the different sites were combined.

The proposed procedures were separately applied to detect amplification or deletion in the 33-42-mb (start-end) region on chromosome 3. Figure 4(a) and 4(b) show the individual results of the Memorial Sloan-Kettering Cancer Center and Harvard Medical School, respectively. Here, we used <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M34">View MathML</a> in Equation (6) and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M35">View MathML</a> in Equation (7) because the two centers used the same Agilent platform. Figure 4(c) shows a consensus estimate along the integrated sequence. We found two short fluctuations, located in the 38.4-mb region and the 40.2-mb region, as indicated by the arrows in the figure. Note that these two segments were not identified by the single-source analyses presented in Figure 4(a) and 4(b).

thumbnailFigure 4. Consensus estimate. The points are normalized log2ratios in the 33-42-mb section on chromosome 3 of the TCGA-02-0104 sample from (a) the Memorial Sloan-Kettering Cancer Center (MSKCC) and (b) Harvard Medical School. The red lines indicate the mean values among the probes in segments obtained by the proposed circular binary segmentation procedure. Panel (c) shows a multi-platform consensus based on the combined data sets of Memorial Sloan-Kettering Cancer Center and Harvard Medical School. We found two more small segments, located in the 38.4-mb and 40.2-mb regions, which are indicated by the arrows. These were not identified in (a) and (b).

In Figure 5, our results are compared with popular CNV segmentation algorithms including circular binary segmentation [7], CGH-seg [16], and GLAD [17]. Their segment results are obtained by a web-based tool, CGHweb [18]. All methods show that gain and loss regions are respectively 35-38 mb (3p22.2-3p22.3) and 38-40 mb (3p22.1-3p22.2). However, our method and circular binary segmentation [7] are sensitive to the detection of short segments in this example.

thumbnailFigure 5. Comparison of the proposed method, circular binary segmentation, CSGseq, and GLAD. The points are based on the combined log2ratios from Memorial Sloan-Kettering Cancer Center and Harvard Medical School. The top panel shows our segments. The last three panels show segments from circular binary segmentation, CSGseq, and GLAD. The red lines indicate the mean values among the probes in segments.

Circular binary segmentation [7] based on permutation took 95 seconds to detect the segmentation results of a total of 1358 probes, as shown in Figure 5. In contrast, the proposed procedure based on BIC took less than 15 seconds, where the computation was done on a 2.66 GHz Intel i5 core processor.

Cohort analysis of multiple individuals

We turn next to the cohort problem of discovering overall molecular signatures. Each profile is obtained from a different individual with the same type of cancer, and is assayed on the same platform type. The observed data are a two-dimensional array xij for i = 1,…,m,j = 1,…,n, where xij is the data point at the i-th probe according to its genomic order along the chromosome, and the j-th individual profile. Note that m is the number of probes and n is the number of individuals. To derive overall molecular signatures, we provide a simple statistical framework, which standardizes data across multiple profiles before segmentation. Then, we analyze the standardized profiles one at a time to detect aberrant regions.

We standardize xij. For each probe i, we let <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M36">View MathML</a> for j = 1,…,n. Hence the zij have a common mean equal to 0 and a common variance equal to 1. An indicator variable aij is defined to classify the DNA copy number level for the i-th probe and j-th individual as increased or not; i.e.,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M37">View MathML</a>

(8)

For the following numerical example, we used γ+ = 3. A segment with probes deviating by three standard deviations from the mean value of all samples is likely to indicate true gain. For large γ+, higher-level gains are readily identifiable. If there are gain events in the target chromosome of the j-th individual (j = 1,…,n), we expect to see many consecutive 1s in Aj = (a1j,…,amj). To identify the amplification regions for the j-th individual, we apply the proposed procedure to Aj, as discussed in Methods Section. When common amplified regions occur for more than one individual, the aberrations are probably real and important for cancer pathogenesis processes.

The search for loss events is performed separately from that for gain events. dij is defined to classify the DNA copy number level for the i-th probe and the j-th individual as decreased or not; i.e.,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M38">View MathML</a>

(9)

For our numerical example, we used γ= −3. If there are deletion events in the target chromosome of the j-th individual, we expect to see many consecutive 1s in Dj = (d1j,…,dmj). To identify the deletion regions of the j-th individual, we apply the proposed procedure to Dj.

Standardization across multiple samples provides a multi-sample summary for the overall molecular signatures. However, one drawback to this type of standardization is that it restricts inferences about increased and decreased DNA copy numbers relative to the mean of the samples under study. When most or all samples are either two-fold over-expressed or under-expressed relative to normal tissue (i.e., a majority of the samples have identical increases or decreases), it is impossible to properly identify these aberrations using the proposed standardization. These situations are very rare, and most aberrant intervals appear only in some significant subset of the samples. When pooling data across multiple individuals, not all samples are expected to carry the same aberrant regions.

Application to fibroblast cell lines

We applied our framework to the aCGH data presented by Snijders et al. [5]. The data were obtained from single-array experiments on 15 fibroblast cell lines. The data are available in Tables E to H at http://www.nature.com/ng/journal/v29/n3/suppinfo/ng754_S1.html webcite. Each array contains 2276 mapped BACs spotted in triplicate. Because spectral karyotyping has shown that aberrations occur within a particular chromosome for each of GM01524, GM01535, GM01750, GM03134, GM03563, GM05296, GM07081, GM13031, and GM13330, we limited our analysis to these nine cell lines. The data from a typical cell line experiment, specifically from cell line GM13300, can be seen in Figure 6. The proposed procedure was employed to detect aberrated regions for each of the 23 chromosomes. We used γ+ = 3 in Equation (8) and γ= −3 in Equation (9), respectively. GM13300, shown in Figure 6, has known aberrations only on chromosomes 1 and 4. The results shown in Figure 7 are consistent with those of Snijders et al. [5], in that our framework correctly identified aberrations only on chromosomes 1 and 4. Our procedure also correctly identified aberrations on chromosomes 3 and 9 of GM03563 (Figure 8).

thumbnailFigure 6. The fibroblast cell line GM13300 has known alterations only on chromosomes 1 and 4. The points are normalized log2ratios. The borders between chromosomes are represented by vertical bars.

thumbnailFigure 7. Our analysis of GM13300. The fibroblast cell line GM13300 has known alterations only on chromosomes 1 and 4. The points are normalized log2ratios, and the lines indicate the mean values among the points in segments obtained by our method. (a) CNVs of GM13300 on BAC clones from chromosome 1. (b) CNVs of GM13300 on BAC clones from chromosome 4.

thumbnailFigure 8. Our analysis of GM03563. The fibroblast cell line GM03563 has known alterations only on chromosomes 3 and 9. The points are normalized log2ratios, and the lines indicate the mean values among the points in segments obtained by our method. (a) CNVs of GM03563 on BAC clones from chromosome 3. (b) CNVs of GM03563 on BAC clones from chromosome 9. The first two clones with log2ratio≈ -1 indicate a single-copy deletion.

Of the 15 aberrated regions listed in Table 2, which were found by spectral karyotyping, 13 were identified by our framework. The two unidentified regions were on chromosome 12 (GM01535) and chromosome 15 (GM07081). The aberrated region on GM01535 is represented by only a single probe, and single aberrated probes cannot be found. For GM07081, our result is consistent with that of Snijders et al. [5], in that no evidence of an aberration appears in the aCGH data. Therefore, our procedure found everything it should have found. For a particular cell line and chromosome, we define a false positive to be an aberration that is identified by our framework but is not detected by spectral karyotyping. Our procedure produced only one false positive, at chromosome 4 on GM01524, although we do not know that this is a real aberration that is undetectable by spectral karyotyping. Hence, our procedure was able to identify the aberrations with only a single false positive, whereas the circular binary segmentation method of Olshen et al. [7] produced at least nine false positives. Furthermore, the aberrations identified by our procedure perfectly matched the CNVs found via spectral karyotyping.

Table 2. Summarized results of applying the proposed framework to nine cell lines

Discussion

Our procedure is versatile in the sense that only higher- or lower-level gains/losses are readily identifiable. In particular, there are two interesting types of aberrated regions. The first of these is a spike, which is often a small region with extremely large or small log2 ratios. Only spikes are readily identifiable when large positive values of τ+ and large negative values of τare used in Equations (1) and (2), respectively. The second type is a consistent gain or loss region, whose log2ratios may not deviate very much from 0, but tend to remain positive or negative over the greater region. Only lower-level gains are readily identifiable when we define a new Bernoulli data set A = (a1,…,am) for a small positive value of τ+ and a positive value of ε, such that

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M39">View MathML</a>

Similarly, only lower-level losses are readily identifiable for a small negative value of τ when we define a new Bernoulli data set D = (d1,…,dm) such that

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/277/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/277/mathml/M40">View MathML</a>

We pointed out that our procedure lacks the ability to detect CNAs when a whole chromosome is duplicated or deleted. For example, in Figure 1, the elevated X chromosome ratios of S1514 reflect the male-female difference in the X chromosome copy numbers shown. These elevations are known to be constant for single-copy gains on a complete X chromosome. Because there were no fluctuations on the elevated, complete X chromosome, our procedure could not detect the aberrations when based on a chromosome-wide search. To detect aberrations spanning complete chromosomes, our procedure should be based on a genome-wide search, which uses all 23 chromosomes together. Figure 9 shows the genome-wide search, which properly identified single-copy duplication in the entire X chromosome.

thumbnailFigure 9. Genome-wide analysis of the breast cancer S1514. (a) The points are normalized log2 ratios. The BACs are ordered by position in the genome, beginning at 1p and ending at Xq. The inserts are chromosome numbers. The borders between chromosomes are represented by vertical bars. (b) Genome-wide search of S1514. To identify gains and losses, we used τ+ = 0.3 and τ= −0.3, respectively. We found single-copy gain from the center of chromosome 20 to the end of 23 (chromosome X), and single-copy loss from the beginning of chromosome 13 to the end of chromosome 14. The red lines indicate the mean values among the probes in segments detected by our method.

Conclusions

To locate the aberrated regions in an individual, we propose a circular binary segmentation procedure based on BIC, which is nonparametric in the sense that it does not rely on any assumptions regarding independence or underlying distributions. The procedure does not require data to be transformed with missing values imputed or with extreme outliers truncated. At each stage of the procedure, we need only to compare a model with a pair of change-points to a constant model with no change-points. Thus the procedure is easy to implement, and circumvents the computational complexity we would normally face in problems with a variable number of change-points. The procedure can be flexibly adapted to analyze multiple DNA copy number data sets, to discover consensus molecular signatures or overall molecular signatures. Moreover, we provide two simple statistical frameworks appropriate for detecting these signatures.

Competing interests

The author declares that he have no competing interests.

Acknowledgements

This work was supported by the National Research Foundation of Korea(NRF) grant funded by the Korea government(MEST) (2012-0005352). The author thanks the two reviewers for their constructive comments and suggestions.

References

  1. Lipson D, Aumann Y, Ben-Dor A, Linial N, Yakhini Z: Efficient calculation of interval scores for DNA copy number data analysis.

    J Comput Biol 2006, 13:215-228. PubMed Abstract | Publisher Full Text OpenURL

  2. Sun W, Wright FA, Tang Z, Nordgard SH, Loo PV, Yu T, Kristensen VN, Perou CM: Integrated study of copy number states and genotype calls using high-density SNP arrays.

    Nucleic Acids Res 2009, 37:5365-5377. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  3. Shen J, Zhang N: Change-point model on nonhomogeneous Poisson processes with application in copy number profiling by next-generation DNA sequencing.

    Ann Appl Stat 2012, 6:476-496. Publisher Full Text OpenURL

  4. Pollack JR, Perou CM, Alizadeh AA, Eisen MB, Pergamenschikov A, Williams CF, Jeffrey SS, Botstein D, Brown PO: Genome-Wide Analysis of DNA Copy-Number Changes Using cDNA Microarrays.

    Nat Genet 1999, 23:41-46. PubMed Abstract | Publisher Full Text OpenURL

  5. Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG: Assembly of Microarrays for Genome-Wide Measurement of DNA Copy Number.

    Nat Genet 2001, 29:263-264. PubMed Abstract | Publisher Full Text OpenURL

  6. Vostrikova L: Detecting disorder in multidimensional random process.

    Soviet Math Dokl 1981, 24:55-59. OpenURL

  7. Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based dna copy number data.

    Biostatistics 2004, 5:557-572. PubMed Abstract | Publisher Full Text OpenURL

  8. Schwarz G: Estimating the dimension of a model.

    Ann Statist 1978, 6:461-464. Publisher Full Text OpenURL

  9. Zhang NR, Siegmund D: A modified bayes information criterion with applications to the analysis of comparative genomic hybridization data.

    Biometrics 2007, 63:22-32. PubMed Abstract | Publisher Full Text OpenURL

  10. Yang TY, Kuo L: Bayesian binary segmentation procedure for a Poisson process with multiple changepoints.

    J Comput Graphical Statist 2001, 10:772-785. Publisher Full Text OpenURL

  11. Yang TY: Bayesian binary segmentation procedure for detecting streakiness in sports.

    J R Stat Soc Ser A 2004, 167:627-637. Publisher Full Text OpenURL

  12. Pollack JR, Srlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Brresen-Dale AL, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors.

    Proc Natl Acad Sci USA 2002, 99:12963-12968. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  13. Bengtsson H, Ray A, Spellman P, Speed T: A single-sample method for normalizing and combining full-resolution copy numbers from multiple platforms, labs and analysis methods.

    Bioinformatics 2009, 25:861-867. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  14. Zhang NR, Senbabaoglu Y, Li JZ: Joint estimation of DNA copy number from multiple platforms.

    Bioinformatics 2010, 26:153-160. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  15. The TCGA Research Network: Comprehensive genomic characterization defines human glioblastoma genes and core pathways.

    Nature 2008, 455:1061-1068. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  16. Picard F, Robin S, Lavielle M, Vaisse C, Daudin JJ: A statistical approach for array CGH data analysis.

    BMC Bioinformatics 2005, 6:27. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  17. Hupe P, Stransky N, Thiery JP, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions.

    Bioinformatics 2004, 20(18):3413-3422. PubMed Abstract | Publisher Full Text OpenURL

  18. Lai W, Choudhary V, Park PJ: CGHweb: a tool for comparing DNA copy number segmentations from multiple algorithms.

    Bioinformatics 2008, 24(7):1014-1015. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL