Department of Statistics, University of California, Los Angeles, CA, USA

Department of Human Genetics, Biomathematics and Statistics, University of California, Los Angeles, CA, USA

Department of Health Research and Policy and Statistics, Stanford University, Stanford, CA, USA

Abstract

Background

Variations in DNA copy number carry information on the modalities of genome evolution and mis-regulation of DNA replication in cancer cells. Their study can help localize tumor suppressor genes, distinguish different populations of cancerous cells, and identify genomic variations responsible for disease phenotypes. A number of different high throughput technologies can be used to identify copy number variable sites, and the literature documents multiple effective algorithms. We focus here on the specific problem of detecting regions where variation in copy number is relatively common in the sample at hand. This problem encompasses the cases of copy number polymorphisms, related samples, technical replicates, and cancerous sub-populations from the same individual.

Results

We present a segmentation method named generalized fused lasso (GFL) to reconstruct copy number variant regions. GFL is based on penalized estimation and is capable of processing multiple signals jointly. Our approach is computationally very attractive and leads to sensitivity and specificity levels comparable to those of state-of-the-art specialized methodologies. We illustrate its applicability with simulated and real data sets.

Conclusions

The flexibility of our framework makes it applicable to data obtained with a wide range of technologies. Its versatility and speed make GFL particularly useful in the initial screening stages of large data sets.

Background

Genomic duplications and deletions are common in cancer cells and known to play a role in tumor progression

The HMM approach takes advantage of the implicitly discrete nature of the copy number process (both when a finite number of states is assumed and when, as in some implementations, less parametric approaches are adopted); furthermore, by careful modeling of the emission probabilities, one can fully utilize the information derived from the experimental results. In the case of genotyping arrays, for example, quantification of total DNA amount, relative allelic abundance, and prior information such as minor allele frequencies can be considered.

No a priori knowledge of the number of copy number states is required in the segmentation approach, an advantage in the study of cancer, where polyploidy and contamination with normal tissues result in a wide range of fractional copy numbers. Possibly for the reasons outlined, HMMs are the methods of choice in the analysis of normal samples

While a number of successful approaches have been derived along the lines described above, there is still a paucity of methodology for the joint analysis of multiple sequences. It is clear that if multiple subjects share the same variation in copy number, there exists the potential to increase power by joint analysis. Wang et al.

In the present work we consider a setting similar to

In concluding this introduction, we would like to make an important qualification. The focus of our contribution is on segmentation methods, knowing that this is only one of the steps necessary for an effective recovery of CNVs. In particular, normalization and transformation of the signal from experimental sources are crucial and can have a very substantial impact on final results, as documented in

Before describing in detail the proposed methods for joint segmentation of multiple sequences, we start by illustrating various contexts where joint analysis appears to be useful.

Genotyping arrays and CNV detection

Genotyping arrays have been used on hundreds of thousands of subjects. The data collected through them provides an extraordinary resource for CNV detection and the study of their frequencies in multiple populations. Typically, the raw intensity data (representing hybridization strength) is processed to obtain two signals: quantification of total DNA amount (from now on log R Ratio, LRR, following Illumina terminology) and relative abundance of the two queried alleles (from now on B allele frequency, BAF). Both these signals contain information on CNVs, and one of the strengths of HMMs has been that they can easily process them jointly. Segmentation models like CBS have traditionally relied only on LRR. While this is a reasonable choice, it can lead to substantial loss of information, particularly in tumor cells, where polyploidy and contamination make information in LRR hard to decipher. To exploit BAF in the context of a segmentation method, a signal transformation has been suggested

Multiple platforms

LRR and BAF are just one example of the multiple signals available in some samples. Often, as research progresses, the samples are assessed with a variety of technologies. For example, a number of subjects who have been genotyped at high resolution are now being resequenced. Whenever the technology adopted generates a signal that contains some information on copy number, there is an incentive to analyze the available signals jointly.

Tumor samples from the same patient obtained at different sites or different progression stages

In an effort to identify mutations that are driving a specific tumor, as well as study its response to treatment, researchers might want to study CNVs in cells obtained at different tumor sites or at different time points

Related subjects

Family data is crucial in genetic investigations, and hence it is common to analyze related subjects. When studying individuals from the same pedigree, it is reasonable to assume that some CNVs might be segregating in multiple people and that joint analysis would reduce Mendelian errors and increase the power of detection.

The rest of the paper is organized as follows: In the Methods section, we first present the penalized estimation framework, and then describe how the model can be used for data analysis by: (a) outlining an efficient estimation algorithm, (b) generalizing it to the case of uncoordinated data, and (c) describing the choice of the penalization parameters. In the results section, we discuss our findings on two simulated data sets (descriptive of normal and tumor samples) and two real data sets. In one case multiple platforms are used to analyze the same sample, and in the other case samples from related individuals benefit from joint analysis.

Methods

A model for joint analysis of multiple signals

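Assume we have observed $M$ sequences, each measured at $N$ ordered locations, with $y_{ij}$ being the observed value of sequence $i$ at location $j$. We model the data as

$$y_{ij} = \beta_{ij} + \epsilon_{ij}, \qquad (1)$$

where $\epsilon_{ij}$ represent noise, and the mean values $\beta_{ij}$ are piece-wise constant. Thus, for each sequence $i$ there exists a linearly ordered partition of the locations $\{1, \ldots, N\}$ into segments within which $\beta_{ij}$ takes a single value. In other words, most of the increments $|\beta_{ij} - \beta_{i,j-1}|$ are assumed to be zero. When two sequences $k$ and $l$ share a CNV, the increments $|\beta_{kj} - \beta_{k,j-1}|$ and $|\beta_{lj} - \beta_{l,j-1}|$ will be different from zero at the change point $j$. We assume that the data are normalized, so that $\beta_{ij} = 0$ can be interpreted as corresponding to the appropriate normal copy number equal to 2. We propose to reconstruct the mean values $\boldsymbol{\beta}$ by minimizing the following function, called hereafter the generalized fused lasso (GFL):

$$f(\boldsymbol{\beta}) = \frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{N}(y_{ij}-\beta_{ij})^2 + \lambda_1\sum_{i=1}^{M}\sum_{j=1}^{N}|\beta_{ij}| + \lambda_2\sum_{i=1}^{M}\sum_{j=2}^{N}|\beta_{ij}-\beta_{i,j-1}| + \lambda_3\sum_{j=2}^{N}\Big[\sum_{i=1}^{M}(\beta_{ij}-\beta_{i,j-1})^2\Big]^{1/2}, \qquad (2)$$

which includes a goodness-of-fit term and three penalties, whose roles we will explain one at a time. The $\ell_1$ penalty $\sum_{i,j}|\beta_{ij}|$ enforces sparsity within $\boldsymbol{\beta}$, in favor of values $\beta_{ij} = 0$, which correspond to the normal copy number. The total variation (fused lasso) penalty $\sum_{i}\sum_{j}|\beta_{ij}-\beta_{i,j-1}|$ limits the number of jumps within each sequence. The Euclidean (group fused lasso) penalty on the vector of jumps at each location favors change points that are shared across sequences.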

The incorporation of the latter two penalties can also be naturally interpreted in view of image denoising. To restore an image disturbed by random noise while preserving sharp edges of items in the image, a 2-D total variation penalty can be added to a least-squares criterion, where $\beta_{ij}$ is the true underlying intensity of pixel $(i, j)$.

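Using matrix notation, and allowing the tuning parameters $\lambda_1$, $\lambda_2$ and $\lambda_3$ to be sequence specific, we can reformulate the objective function as follows. Let $\mathbf{Y} = (y_{ij})_{M \times N}$ and $\boldsymbol{\beta} = (\beta_{ij})_{M \times N}$. Let $\boldsymbol{\beta}_i$ denote the $i$th row and $\boldsymbol{\beta}_{(j)}$ the $j$th column of $\boldsymbol{\beta}$, and let $\boldsymbol{\lambda}_3 = (\lambda_{3,1}, \ldots, \lambda_{3,M})$. Then

$$f(\boldsymbol{\beta}) = \frac{1}{2}\|\mathbf{Y}-\boldsymbol{\beta}\|_F^2 + \sum_{i=1}^{M}\lambda_{1,i}\|\boldsymbol{\beta}_i\|_{\ell_1} + \sum_{i=1}^{M}\lambda_{2,i}\sum_{j=2}^{N}|\beta_{ij}-\beta_{i,j-1}| + \sum_{j=2}^{N}\|\boldsymbol{\lambda}_3 \circ (\boldsymbol{\beta}_{(j)}-\boldsymbol{\beta}_{(j-1)})\|_{\ell_2}, \qquad (3)$$

where $\|\cdot\|_F$ is the Frobenius matrix norm, $\|\cdot\|_{\ell_1}$ and $\|\cdot\|_{\ell_2}$ are the $\ell_1$ and $\ell_2$ vector norms, and $\circ$ denotes the entrywise product.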
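To make the roles of the goodness-of-fit term and the three penalties concrete, the following minimal sketch evaluates an objective of this type with common (not sequence-specific) tuning parameters. It is an illustration under stated assumptions and not the Piet implementation: the function name `gfl_objective`, the list-of-rows data layout, and the factor of one half on the squared error are assumptions made here.

```python
import math

def gfl_objective(Y, B, lam1, lam2, lam3):
    """Evaluate a GFL-type objective for M sequences of length N.

    Y and B are lists of M rows: observed values and candidate means.
    The 1/2 scaling of the squared error is an assumption of this sketch.
    """
    M, N = len(Y), len(Y[0])
    # Goodness of fit: half the squared Frobenius distance between Y and B.
    fit = 0.5 * sum((Y[i][j] - B[i][j]) ** 2 for i in range(M) for j in range(N))
    # Lasso penalty: favors means equal to zero (normal copy number).
    lasso = sum(abs(B[i][j]) for i in range(M) for j in range(N))
    fused = 0.0  # total variation within each sequence
    group = 0.0  # Euclidean norm of the vector of jumps at each location
    for j in range(1, N):
        jumps = [B[i][j] - B[i][j - 1] for i in range(M)]
        fused += sum(abs(d) for d in jumps)
        group += math.sqrt(sum(d * d for d in jumps))
    return fit + lam1 * lasso + lam2 * fused + lam3 * group
```

With two sequences that jump at the same location, the group term charges the shared change point once, as the Euclidean norm of the jump vector, rather than once per sequence, which is what encourages coordinated jumps.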

An MM algorithm

While the solution to the optimization problem (3) might have interesting properties, this approach is useful only if an effective algorithm is available. The last few years have witnessed substantial advances in computational methods for $\ell_1$-regularization problems, including the use of coordinate descent

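With specific regard to the fused-lasso application to CNV detection, we were successful in developing an efficient algorithm based on the majorization-minimization (MM) principle. In the objective function, both the $\ell_1$ norm and the $\ell_2$ norm are replaced by majorizing substitutes; the resulting surrogate function is specified in the Supplementary Text.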

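Adopting an MM framework, we replace the objective function with a surrogate that majorizes it at the current estimate of $\boldsymbol{\beta}$. At each iteration, the MM principle chooses the next estimate by minimizing the surrogate, which guarantees that the objective function does not increase. The surrogate separates as a sum of similar functions in the row vectors of $\boldsymbol{\beta}$, so that each sequence can be updated separately. Details are given in the Supplementary Text.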
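The MM idea can be illustrated on a single sequence with the lasso and total variation penalties only. The sketch below is a generic construction and not necessarily the surrogate of the Supplementary Text: it assumes that each absolute value is smoothed as $\sqrt{u^2 + \epsilon}$ and majorized by a quadratic at the current iterate, so that every MM step reduces to a tridiagonal linear system solved in linear time. All function names and the smoothing constant `eps` are hypothetical.

```python
import math

def solve_tridiagonal(diag, off, rhs):
    """Solve a symmetric tridiagonal system with the Thomas algorithm."""
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    g = [0.0] * n  # modified right-hand side
    c[0] = off[0] / diag[0] if n > 1 else 0.0
    g[0] = rhs[0] / diag[0]
    for j in range(1, n):
        denom = diag[j] - off[j - 1] * c[j - 1]
        if j < n - 1:
            c[j] = off[j] / denom
        g[j] = (rhs[j] - off[j - 1] * g[j - 1]) / denom
    x = [0.0] * n
    x[-1] = g[-1]
    for j in range(n - 2, -1, -1):
        x[j] = g[j] - c[j] * x[j + 1]
    return x

def smoothed_objective(y, b, lam1, lam2, eps):
    """Least squares plus smoothed lasso and total variation penalties."""
    fit = 0.5 * sum((yj - bj) ** 2 for yj, bj in zip(y, b))
    lasso = sum(math.sqrt(bj * bj + eps) for bj in b)
    fused = sum(math.sqrt((b[j + 1] - b[j]) ** 2 + eps) for j in range(len(b) - 1))
    return fit + lam1 * lasso + lam2 * fused

def mm_fused_lasso(y, lam1, lam2, eps=1e-8, n_iter=100):
    """Run MM iterations; return the fit and the objective after each step."""
    n = len(y)
    b = list(y)
    history = [smoothed_objective(y, b, lam1, lam2, eps)]
    for _ in range(n_iter):
        # Quadratic majorizer weights evaluated at the current iterate.
        w = [1.0 / math.sqrt(bj * bj + eps) for bj in b]
        v = [1.0 / math.sqrt((b[j + 1] - b[j]) ** 2 + eps) for j in range(n - 1)]
        diag = [1.0 + lam1 * w[j] for j in range(n)]
        for k in range(n - 1):
            diag[k] += lam2 * v[k]
            diag[k + 1] += lam2 * v[k]
        off = [-lam2 * vk for vk in v]
        # Minimizing the quadratic surrogate is one tridiagonal solve.
        b = solve_tridiagonal(diag, off, y)
        history.append(smoothed_objective(y, b, lam1, lam2, eps))
    return b, history
```

Because each surrogate lies above the smoothed objective and touches it at the current iterate, the values stored in `history` cannot increase, which gives a convenient check when experimenting with tuning parameters.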

**Supplementary Text.** Specification of surrogate function, justification of choice of tuning parameters, details of calling procedure.


Stacking observations at different genomic locations

While copy number is continuously defined across the genome, experimental procedures record data at discrete positions, for which we have used the indexes $j = 1, \ldots, N$.

Let $S_i$ be the subset of locations with measurements in sequence $i$. We still estimate $\beta_{ij}$ for all $j$, but for $j \notin S_i$, $\beta_{ij}$ will be determined simply on the basis of the neighboring data points, relying on the regularizations introduced in (3). The goodness-of-fit portion of the objective function is therefore redefined as

$$\frac{1}{2}\sum_{i=1}^{M}\sum_{j \in S_i}(y_{ij}-\beta_{ij})^2.$$
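The stacking of uncoordinated sequences can be sketched as follows. This is a minimal illustration and not the Piet implementation: the dictionary input format, the use of `None` for unobserved entries, and the function names are assumptions made here.

```python
def stack_sequences(seqs):
    """Align sequences on the union of their positions.

    seqs is a list of dicts mapping genomic position -> observed value.
    Returns the sorted positions and one row per sequence, with None
    marking positions where that sequence has no measurement.
    """
    positions = sorted(set().union(*[s.keys() for s in seqs]))
    Y = [[s.get(p) for p in positions] for s in seqs]
    return positions, Y

def masked_fit(Y, B):
    """Half the squared misfit, summed only over observed entries."""
    total = 0.0
    for y_row, b_row in zip(Y, B):
        for y, b in zip(y_row, b_row):
            if y is not None:
                total += (y - b) ** 2
    return 0.5 * total
```

Unobserved entries contribute nothing to the fit, so the corresponding means are driven entirely by the penalties linking them to their neighbors.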

The MM strategy previously described applies with slight modifications of the matrix

The attentive reader will have noted that the $y_{ij}$ values at locations where sequence $i$ has no measurement can be considered missing data, and evaluation of the missingness pattern is appropriate. In general, these $y_{ij}$ cannot be considered missing at random. The most important example is the case of mBAF, where homozygous markers result in missing values. Homozygosity is clearly more common when copy number is equal to 1 than when copy number is equal to 2. Therefore, there is potentially more information on $\beta_{ij}$ to be extracted from the signals than what we capture with the proposed method. Although most of the information on deletions is obtained through LRR, BAF does convey additional information on duplications, where the changes in LRR are limited by saturation effects. On the other hand, it does appear that our method does not increase the rate of false positives. Hence, it can be considered an operational improvement over segmentation based on LRR only, even if, in theory, it does not completely use the information on BAF.

Choice of tuning constants and segmentation

One of the limitations of penalization procedures is that values for the tuning parameters need to be set, and clear guidelines are not always available. Path methods that obtain a solution of the optimization problem (3) for every value of a tuning parameter can be attractive, but recent algorithmic advances

We have found the following guidelines to be useful in choosing penalty parameter values:

Here the penalties for sequence $\mathbf{y}_i$ are scaled by an estimate of its standard deviation, and three positive multipliers, one for each penalty, are adjusted to account for different signal-to-noise ratios and CNV sizes. We discuss the function

While a more rigorous justification is provided in the Additional file

● The sequence-specific penalizing parameters are proportional to an estimate of the standard deviation of the sequence signal. In other words, after initial normalization, the same penalties would be used across all signals.

● The tuning parameters for the total variation (fused lasso) and the Euclidean (group fused lasso) penalties on the jumps are treated differently from $\lambda_1$, as the lasso penalty can be understood as providing a soft thresholding of the solution of (3) when $\lambda_1 = 0$. Given the penalization due to $\lambda_2$ and $\lambda_3$, the solution of (3) when $\lambda_1 = 0$ will have much smaller dimension than

● The group penalty depends on

● The relative weight of the fused-lasso and group-fused-lasso penalties is regulated by

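The standard deviation of each sequence is estimated from its first-order differences. Let $\Delta_{ij} = y_{i,j+1} - y_{i,j}$, for $j = 1, \ldots, N - 1$, be the differences between adjacent values $y_{ij}$ for sequence $i$. The estimate is then computed from the vector of differences $\boldsymbol{\Delta}_i$.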
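A difference-based estimate of this kind can be sketched as follows. The specific estimator is an assumption of this sketch and may differ from the one used in our analyses: it takes the median absolute deviation (MAD) of the differences, rescales it by 1.4826 to estimate their standard deviation under normality, and divides by $\sqrt{2}$ because the difference of two independent observations has twice the variance of one. The function name `diff_based_sd` is hypothetical.

```python
import math
import statistics

def diff_based_sd(y):
    """Noise SD of one sequence from first-order differences (assumed MAD form)."""
    diffs = [y[j + 1] - y[j] for j in range(len(y) - 1)]
    med = statistics.median(diffs)
    mad = statistics.median([abs(d - med) for d in diffs])
    # 1.4826 * MAD estimates the SD of the differences under normality;
    # dividing by sqrt(2) converts it to the SD of a single observation.
    return 1.4826 * mad / math.sqrt(2.0)
```

Working with differences makes the estimate insensitive to the piece-wise constant mean: a change point affects a single difference, whereas it would inflate the ordinary standard deviation of the sequence.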

As mentioned before, the exact values of the penalty parameters should be adjusted depending on the expectations of signal strengths. Following the approach in

Following again the approach in

1. Sequences are jointly segmented by minimizing (3) for a relatively lax choice of the penalty parameters.

2. Jumps are further thresholded on the basis of a data-driven cut-off.

Step 2 allows us to be adaptive to the signal strength and can be carried on with multiple methods. For example, one can adopt the modified Bayesian Information Criteria (mBIC)

In data analysis, we often apply an even simpler procedure where the threshold for jumps is defined as a fraction of the maximal jump size observed for every sequence. Specifically, for sequence $i$, we use the maximal observed jump size as a “ruler” reflecting the scale of a possible real jump size, taking a fixed fraction of it as the cut-off in removal of most small jumps. In all analyses for this paper, this fraction is held fixed.
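The thresholding step can be sketched as follows for one fitted sequence. The value of `fraction` is left to the user, the function name is hypothetical, and re-estimating each merged segment by the average of its fitted values is an assumption of this sketch rather than a description of our implementation.

```python
def threshold_jumps(beta, fraction):
    """Remove jumps smaller than fraction * (largest absolute jump).

    beta is the fitted piece-wise constant mean of one sequence.
    Segments separated by a removed jump are merged and replaced by
    the average of their fitted values (an assumption of this sketch).
    """
    if not beta:
        return []
    jumps = [beta[j + 1] - beta[j] for j in range(len(beta) - 1)]
    ruler = max((abs(d) for d in jumps), default=0.0)
    cutoff = fraction * ruler
    # A new segment starts after every jump that survives the cut-off.
    starts = [0] + [j + 1 for j, d in enumerate(jumps)
                    if abs(d) >= cutoff and abs(d) > 0]
    out = []
    for s, e in zip(starts, starts[1:] + [len(beta)]):
        seg = beta[s:e]
        out.extend([sum(seg) / len(seg)] * len(seg))
    return out
```

For example, with a cut-off fraction of 0.2, a fitted jump of 0.05 next to a jump of 1 is removed and the two neighboring segments are merged.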

Calling procedure

Even if this is not the focus of our proposal, in order to compare the performance of our segmentation algorithm with HMM approaches, it becomes necessary to distinguish gains from losses of copy number. While the same segmentation algorithm can be applied to a wide range of data sets, calling procedures depend more closely on the specific technology used to carry out the experiments. Since our data analysis relies on Illumina genotyping arrays, we limit ourselves to this platform and briefly describe the calling procedure adopted in the results section.

Analyzing one subject at a time, each segment with constant mean is assigned to one of five possible copy number states. For a segment $R$, the assignment is based on the data $(\mathbf{x}_R, \mathbf{y}_R) = \{(x_j, y_j), j \in R\}$ through a rule explicitly defined in the Additional file, which involves a pre-specified cut-off.

As noted in

Results and discussion

We report the results of the analysis of two simulated and two real data sets, which overall exemplify the variety of situations where joint segmentation of multiple sequences is attractive, as described in the motivation section. In all cases, we compare the performance of the proposed procedure with a set of relevant, often specialized, algorithms. The penalized estimation method we suggest in this paper shows competitive performance in all cases and often a substantial computational advantage. Its versatility and speed make it a very convenient tool for initial exploration. To calibrate the run times reported in the sequel, we state for the record that all of our analyses were run on a Mac OS X (10.6.7) machine with 2.93 GHz Intel Core 2 Duo and 4 GB 1067 MHz DDR3 memory.

Simulated CNV in normal samples

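We consider a previously described simulated dataset. The tuning constants for the fused lasso include a lasso constant of 0.1 and a group penalty constant of 0, and those for the group fused lasso include a lasso constant of 0.1 and a total variation constant of 0. The two cut-offs of the calling procedure are set to 10 and 1 (1.5) for duplication (deletion). Performance is evaluated by the same indexes we used in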

| CNV size | CNV type | PennCNV TPR | PennCNV FDR | CBS TPR | CBS FDR | Fused Lasso TPR | Fused Lasso FDR | Group Fused Lasso TPR | Group Fused Lasso FDR |
|---|---|---|---|---|---|---|---|---|---|
| 5 | Deletion | 83.80 | 4.92 | 78.20 | 0.68 | 63.93 | 1.74 | 64.27 | 1.83 |
| 5 | Duplication | 58.53 | 4.67 | 11.67 | 10.26 | 20.00 | 37.76 | 39.87 | 14.33 |
| 10 | Deletion | 95.03 | 1.45 | 88.37 | 0.56 | 88.50 | 0.60 | 88.87 | 0.56 |
| 10 | Duplication | 93.43 | 0.78 | 56.50 | 4.40 | 83.90 | 12.60 | 91.60 | 3.85 |
| 20 | Deletion | 94.63 | 0.58 | 90.50 | 0.39 | 90.80 | 0.47 | 90.83 | 0.47 |
| 20 | Duplication | 96.13 | 0.92 | 86.22 | 3.58 | 92.77 | 4.95 | 94.98 | 2.13 |
| 30 | Deletion | 94.57 | 0.28 | 93.30 | 0.29 | 89.38 | 0.52 | 89.77 | 0.53 |
| 30 | Duplication | 96.09 | 0.05 | 90.77 | 1.61 | 94.32 | 1.78 | 94.98 | 1.29 |
| 40 | Deletion | 97.83 | 0.59 | 97.58 | 0.09 | 97.28 | 0.19 | 97.28 | 0.19 |
| 40 | Duplication | 94.61 | 0.46 | 92.77 | 0.98 | 93.94 | 1.15 | 94.63 | 0.75 |
| 50 | Deletion | 94.33 | 0.07 | 92.76 | 0.04 | 90.47 | 0.11 | 90.48 | 0.11 |
| 50 | Duplication | 94.50 | 0.09 | 93.81 | 0.74 | 93.11 | 0.79 | 93.64 | 0.49 |
| Overall | Deletion | 95.02 | 0.55 | 93.06 | 0.19 | 91.08 | 0.33 | 91.19 | 0.34 |
| Overall | Duplication | 93.82 | 0.44 | 86.92 | 1.55 | 90.56 | 2.85 | 92.46 | 1.38 |
| Overall | All | 94.42 | 0.49 | 89.99 | 0.85 | 90.82 | 1.60 | 91.83 | 0.87 |

| | PennCNV | CBS | Fused Lasso | Group Fused Lasso |
|---|---|---|---|---|
| Time (sec.) | 0.48 (0.01) | 0.78 (0.69) | 0.22 (0.13) | 0.28 (0.05) |

TPR and FDR are measured as the percentage of related SNPs. Overall accuracy is calculated by pooling all sequences with a given type of CNV. Also reported are the average and standard deviation (in parentheses) of the number of seconds required for the analysis of one sequence.
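SNP-level accuracy measures of this kind can be computed as in the sketch below. The interpretation is an assumption made here: TPR is read as the percentage of SNPs inside true CNVs that are called, and FDR as the percentage of called SNPs that lie outside true CNVs. The function name and the 0/1 indicator input format are hypothetical.

```python
def tpr_fdr(truth, called):
    """SNP-level TPR and FDR, both as percentages.

    truth and called are equal-length sequences of 0/1 (or bool)
    indicators marking SNPs inside true and called CNVs, respectively.
    """
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    n_true = sum(1 for t in truth if t)
    n_called = sum(1 for c in called if c)
    tpr = 100.0 * tp / n_true if n_true else float("nan")
    fdr = 100.0 * (n_called - tp) / n_called if n_called else 0.0
    return tpr, fdr
```

Pooling indicator vectors across all sequences with a given CNV type before calling the function mirrors the way overall accuracy is described in the table legend.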

Not surprisingly, all algorithms perform similarly well for larger deletions/duplications, and it is mainly for variants that involve 10 or fewer SNPs that differences are visible. Algorithms that rely only on LRR (for example, CBS and fused lasso) underperform in the detection of small duplications. Comparison is particularly easy for duplications involving 10 SNPs, where the selected parameter values lead to similar FDRs in the three segmentation methods. The group fused lasso can almost entirely recover the performance of PennCNV and outperforms CBS in this context.

Out of curiosity, we analyzed all sequences simultaneously. While this represents an unrealistic amount of prior information, it allows us to evaluate the possible advantages of joint analysis. FDR practically became 0 (<0.02%) for all CNV sizes, but power increases only for CNVs including fewer than 10 SNPs.

Finally, it is useful to compare running times. Summary statistics of the per sample time are reported in Table

A simulated tumor data set

To explore the challenges presented by tumor data, we rely on a data set created by

**Table S1.** Regions of allelic imbalance imputed to the HapMap sample NA06991.


For ease of comparison, we evaluate the accuracy of calling procedures as in the original reference

Following other analyses, we do not pre-process the data prior to CNV detection. BAFsegmentation and PSCN were run using recommended parameter values. For each of the diluted data sets, we applied the GFL model on each chromosome, simultaneously using both LRR and mBAF, whose standard deviations are normalized to 1. Tuning constants are set to _{1} = 0,

Figure


**Sensitivity as a function of percentage contamination by normal cells in the 10 different simulated CNV regions.** Sensitivity is not defined at 100% contamination.


**Specificity as a function of percentage contamination by normal cells.** Note that

PSCN, like GFL, is implemented in R with some computationally intensive subroutines coded in C. BAFsegmentation relies on the R package DNAcopy, whose core algorithms are implemented in C and Fortran. BAFsegmentation wraps these in Perl. A comparison of run times indicates that GFL and BAFsegmentation are comparable, while PSCN is fifty times slower than GFL (see Additional file

**Table S2.** Speed comparison of three methods: GFL, BAFsegmentation and PSCN.


One sample assayed with multiple replicates and multiple platforms

We use the data from a study

**Table S3.** Sample information and reference CNV regions summarized for each sample by their types and sizes.


The test experiments are based on 1,020,596 and 2,390,395 autosomal SNPs, which after quality control reduce to a total of 2,657,077 unique loci. Since our focus here is to investigate how to best analyze multiple signals on the same subject, rather than on the specific properties of any CNV calling method, we carry out all the analyses using different settings of GFL in segmentation while keeping the same CNV calling and summarizing procedures. All segmentation is done on LRR only, while the calling procedure uses both LRR and BAF (with the two cut-offs set to 10 and 1). Here we compare three segmentation settings to analyze these 6 experiments per subject (see Additional file

1. The signals from the three technical replicates with one platform are averaged and then segmented and subjected to calling procedure separately. The final CNV list is the union of CNV calls from the two platforms.

2. The signals from the three technical replicates with one platform are each segmented and separately subjected to calling. A majority vote of at least two out of three is used to summarize each CNV result for each platform. The final CNV list is the union of the two platforms’ lists.

3. The signals from the three technical replicates of both platforms (6 LRR sequences) are segmented jointly. Calling is still done on each replicate separately, and the same majority vote is used to summarize the CNV result for each platform. Again, the final CNV list is the union of the two platforms’ results.
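The summarizing rules shared by these settings (a majority vote across technical replicates within a platform, followed by the union across platforms) can be sketched as follows. For simplicity, the sketch works on per-locus 0/1 call indicators and assumes that the calls of both platforms have been mapped to a common set of loci; the function names are hypothetical.

```python
def majority_vote(replicate_calls, min_votes=2):
    """Per-locus consensus across the technical replicates of one platform."""
    return [sum(votes) >= min_votes for votes in zip(*replicate_calls)]

def union_calls(*platform_calls):
    """Per-locus union of the consensus calls from several platforms."""
    return [any(calls) for calls in zip(*platform_calls)]

# Usage: three replicates per platform, four loci.
platform_a = majority_vote([[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0]])
platform_b = majority_vote([[0, 0, 0, 1], [0, 0, 1, 1], [0, 0, 0, 1]])
final_calls = union_calls(platform_a, platform_b)
```

The vote removes calls supported by a single replicate, while the union keeps variants that only one platform is able to detect.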

**Table S4.** Summary of results for four real samples under different CNV analyses.


To benchmark the result of joint analysis, we use MPCBS

Table

| Analysis | NA15510 #Det. | NA15510 #Ovlp. | NA18517 #Det. | NA18517 #Ovlp. | NA18576 #Det. | NA18576 #Ovlp. | NA18980 #Det. | NA18980 #Ovlp. | Time (min.) |
|---|---|---|---|---|---|---|---|---|---|
| Analysis 1 | 170 | 38 | 144 | 34 | 160 | 25 | 145 | 22 | 1.2 |
| Analysis 2 | 102 | 36 | 109 | 33 | 93 | 25 | 91 | 20 | 3.7 |
| Analysis 3 | 80 | 38 | 82 | 32 | 69 | 25 | 56 | 15 | 8.5 |
| MPCBS | 98 | 34 | 88 | 28 | 59 | 18 | 68 | 21 | 313.9 |

The number of CNVs detected (Det.) and overlapping (Ovlp.) and the average computation time (in minutes) for each sample under the different analyses.

Multiple related samples assayed with the same platform

In the context of a study of the genetic basis of bipolar disorder, the Illumina Omni2.5-Quad chip was used to genotype 455 individuals from 11 Colombian and 13 Costa Rican pedigrees. We use this data set to explore the advantages of a joint segmentation of related individuals. In the absence of a reference evaluation of CNV status in these samples, we rely on two indirect methods to assess the quality of the predicted CNVs. We used the collection of CNVs observed in HapMap Phase III

Another indirect measure of the quality of CNV calls derives from the number of Mendelian errors encountered in the pedigrees when we consider the CNV as a segregating site. De novo CNVs are certainly a possibility, and in their case Mendelian errors are to be expected. However, when the CNV in question is a common one (already identified in HapMap), it is reasonable to expect that it segregates in the pedigrees as any regular polymorphism. We selected a very common deletion on Chromosome 8 (HapMap reports overall frequency >0.4 in 11 populations) and compared different CNV calling procedures on the basis of how many Mendelian errors they generate.
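A Mendelian consistency check for a deletion polymorphism can be sketched as follows. The sketch assumes that each chromosome carries either 0 or 1 copy, so that total copy numbers are restricted to 0, 1, or 2. This simplification is made here for illustration and does not cover duplications. The function names and the trio input format are hypothetical.

```python
def transmissible(cn):
    """Copies a parent can transmit, assuming 0 or 1 copy per chromosome."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[cn]

def mendelian_consistent(father_cn, mother_cn, child_cn):
    """True if the child's copy number can arise from the two parents."""
    return any(a + b == child_cn
               for a in transmissible(father_cn)
               for b in transmissible(mother_cn))

def count_errors(trios):
    """Number of (father, mother, child) trios with a Mendelian error."""
    return sum(1 for f, m, c in trios if not mendelian_consistent(f, m, c))
```

For example, two parents with copy number 2 cannot have a child with copy number 1 under this model, whereas two heterozygous deletion carriers can have a child with any copy number from 0 to 2.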

As mentioned before, PennCNV represents a state-of-the-art HMM method for the analysis of normal samples and, therefore, we included it in our comparisons. However, the parameters of the underlying HMM algorithm had not been tuned on the Omni2.5-Quad at this time, resulting in sub-standard performance. Segmentation methods are less dependent on parameter optimization; hence, GFL analysis of LRR and BAF one subject at a time can provide a better indication of the potential of single-sample methods. We considered two multiple-sample algorithms: GFL and MSSCAN

Prior to analysis, the data were normalized using the GC-content correction implemented in PennCNV. In the GFL analyses, the lasso tuning constant was set to 0.1. The same calling procedure, with the two cut-offs set to 10 and 1, was applied to both the GFL and MSSCAN results.

Table

| Method | #Detected CNVR | #Overlap | %Overlap | Time (min.) |
|---|---|---|---|---|
| PennCNV | 189 | 63 | 33.33% | 3.44 |
| GFL-Individual (LRR+BAF) | 95 | 50 | 52.63% | 3.90 |
| GFL-Pedigree (LRR) | 106 | 62 | 58.49% | 1.57 |

The number and overlap of CNP regions with frequency ≥0.1 detected in our sample by different methods. These CNP regions were compiled from HapMap. Computation time is given in minutes per sample.

Table

| Method | #CN=0 | #CN=1 | #CN=3 | #Families with Mendelian errors | Time (min.) |
|---|---|---|---|---|---|
| PennCNV | 125 | 39 | 102 | 35 | 0.19 |
| GFL-Individual | 123 | 97 | 0 | 20 | 0.21 |
| GFL-Pedigree | 123 | 137 | 0 | 15 | 0.09 |
| MSSCAN-Pedigree | 123 | 154 | 0 | 15 | 0.11 |

Across the various algorithms, subjects are assigned to one of 4 copy numbers. For each algorithm, we report the total numbers of CN≠2 identified, the total number of nuclear families with Mendelian errors, and the average computation time (in minutes) per sample for the analysis of Chromosome 8.


**CNV detection and Mendelian errors for a Central American pedigree.** Displayed are four families derived from an extended pedigree. Circles and squares correspond to females and males. The dashed line is used to indicate identical individuals. Beneath each individual, from top to bottom, are CNV genotypes by PennCNV and by GFL. The subjects for whom PennCNV and GFL infer different CNV genotypes are highlighted in red and blue. Red is used when PennCNV genotypes result in Mendelian error, while GFL genotypes do not. Blue is used when both genotypes are compatible with Mendelian transmissions. Orange singles out a member for whom both PennCNV and GFL genotypes result in Mendelian error.

Conclusions

We have presented a segmentation method based on penalized estimation that is capable of processing multiple signals jointly. We have shown how this leads to improvements in the analysis of normal samples (where segmentation can be applied to both total intensity and allelic proportions), tumor samples (where we are able to deal with contamination effectively), measurements from multiple platforms, and related individuals. Given that copy number detection is such an active area of research, it is impossible to compare one method to all other available methods. However, for each of the situations we analyzed, we tried to select state of the art alternative approaches. In comparison to these, the algorithm we present performs well. Its accuracy is always comparable to that of the most effective competitor and its computation times are better contained. Given its versatility and speed, GFL is, in our opinion, particularly useful for initial screening.

There are of course many aspects of CNV detection, ranging from normalization and signal transformation to FDR control of detected CNV, that we have not analyzed in this paper. There are also a number of improvements to our approach that appear promising, but at this stage are left for further work. For example, it is easy to modify algorithms so that the penalization parameters are location dependent and incorporate prior information on known copy number polymorphisms. It will be more challenging to develop theory and methods to select the values of these regularization parameters in a data-adaptive fashion.

Finally, while our scientific motivation has been the study of copy number variations, the joint segmentation algorithm we present is not restricted to specific characteristics of these data types, and we expect it will be applied in other contexts.

Implementation and availability

We have implemented the segmentation routine, which is our core contribution, in an R package (Piet) available at R-forge

**Figure S1.** Visualization of pedigree-wise CNV analysis results of Chromosome 8 data in the bipolar disorder study. In the main body of the plot, CNVs estimated for each individual are marked by small segments with color code: CN= 0 in blue, CN=1 in light blue, CN=3 in red and CN=4 in brown. Each subject is a row, each SNP a column. Subjects belonging to the same pedigree are stacked together. The pedigree names are indicated on the left-hand side with the number of pedigree members included in parentheses. On the right-hand side, the barplot represents the number of CNVs detected per subject. Two shades of green are switched alternately to indicate the pedigree to which the subject belongs. At the bottom, the gray histogram shows the GC content along the chromosome. Coordinated with the representation of CNVs in the main body, the green histogram counts the frequency of CNVs among the subjects represented. Vertical dotted line marks the centromere.


Abbreviations

BAF: B allele frequency; CN: Copy number; CNV: Copy number variant; CNP: Copy number polymorphism; CN-LOH: Copy neutral loss of heterozygosity; GFL: Generalized fused lasso; HMM: Hidden Markov model; LRR: Log R ratio; MM: Majorization-minimization.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

ZZ, KL, and CS conceived this study and participated in model and algorithm development. ZZ performed the statistical analysis and wrote the R Piet implementation. All authors participated in writing the final manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors thank Nelson Freimer and all of the collaborators of the Bipolar Endophenotype Mapping project for authorizing use of their genotype data. We also thank Susan Service and Joseph DeYoung for assistance in data management and interpretation and Pierre Neuvial and Henrik Bengtsson for helpful discussion. CS gratefully acknowledges support from NIH/NIGMS GM053275, MH075007 and P30 1MH083268 and KL from NIH/NIGMS GM053275.