Genome Analysis Platform, CIC bioGUNE & CIBERehd, Technologic Park of Bizkaia, Building 502, 48160 Derio, Spain

Abstract

Background

SNP arrays output two signals that reflect the total genomic copy number (LRR) and the allelic ratio (BAF), which in combination allow the characterisation of allele-specific copy numbers (ASCNs). While methods based on hidden Markov models (HMMs) have been extended from array comparative genomic hybridisation (aCGH) to jointly handle the two signals, only one method based on change-point detection, ASCAT, performs bivariate segmentation.

Results

In the present work, we introduce a generic framework for bivariate segmentation of SNP array data for ASCN analysis. To that end, we discuss the characteristics of the typically applied BAF transformation and how they affect segmentation, introduce the concepts of multivariate time series analysis that are relevant to this field and discuss the appropriate formulation of the problem. The framework is implemented in a method named CnaStruct, the bivariate form of the structural change model (SCM), which has been successfully applied to transcriptome mapping and aCGH.

Conclusions

On a comprehensive synthetic dataset, we show that CnaStruct outperforms the segmentation of existing ASCN analysis methods. Furthermore, CnaStruct can be integrated into the workflows of several ASCN analysis tools in order to improve their performance, especially on tumour samples highly contaminated by normal cells.

Background

Two chief genetic instabilities associated with tumour cells are genomic copy number alterations (CNAs) and somatic loss of heterozygosity (LOH) events, which represent deviations from the normal allele-specific copy numbers (ASCNs). Both imbalances have been reported to affect the expression of oncogenes and tumour-suppressor genes.

Single nucleotide polymorphism (SNP) arrays of Illumina

In the study of ASCNs in tumour samples with SNP arrays, three additional issues need to be considered. First, there is an LRR baseline shift that depends on the ploidy of the sample. Second, tumour biopsies can be contaminated with normal cells, whose genotypes are mainly diploid, which makes the LRR and BAF signals shrink and converge towards those of a diploid state proportionally to the degree of contamination

Two approaches are used for the detection of ASCNs in tumour samples on SNP arrays, both of which are inherited from methodologies applied to aCGH. The most common approach combines a hidden Markov model (HMM) with an expectation-maximisation (EM) algorithm. OncoSNP

Methods based on change-point detection algorithms typically consist of a segmentation step followed by a calling step

In the boundary-based differential approach, change-points are seen as inflection points, that is, places where the first derivative has local extrema. Only local information around each point is used to compute the derivative, which often results in spurious and merged change-points. Multiresolution analysis can be performed by computing the derivative at various window sizes, but region-based approaches are better suited to gathering more information for segmentation decisions, although they sacrifice accuracy in change-point location. Region-based approaches can be broken down into region-growing, split-and-merge and global optimisation. Region-growing starts with a number of random single-point regions. Neighbouring points are added to a region if they are similar enough, according to a certain homogeneity criterion; otherwise, a new segment is started. A representative example of split-and-merge is binary segmentation, which selects as a change-point the position that divides the data into the two segments with the most different means. The process is applied recursively to each segment until it cannot be divided into two subsegments whose mean difference is significant enough. Then, similar regions are merged back together following some pruning criterion. Circular binary segmentation (CBS)
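To make the split step of binary segmentation concrete, the following minimal Python sketch selects the position that yields the two segments with the most different means and recurses on each side. It rests on simplifying assumptions that are not part of any of the cited methods: a fixed threshold `min_gap` stands in for a proper significance test, the merging (pruning) step is omitted, and the function names are hypothetical.

```python
from statistics import mean


def best_split(values):
    """Return (index, gap) for the split that maximises the absolute
    difference between the means of the two resulting segments."""
    best_index, best_gap = None, 0.0
    for i in range(1, len(values)):
        gap = abs(mean(values[:i]) - mean(values[i:]))
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index, best_gap


def binary_segmentation(values, min_gap, offset=0):
    """Recursively split `values`; return the sorted change-point indices.

    `min_gap` is an assumed fixed threshold on the mean difference,
    used here instead of a statistical significance test.
    """
    index, gap = best_split(values)
    if index is None or gap < min_gap:
        return []
    left = binary_segmentation(values[:index], min_gap, offset)
    right = binary_segmentation(values[index:], min_gap, offset + index)
    return left + [offset + index] + right


signal = [0.0] * 4 + [1.0] * 4 + [3.0] * 4
change_points = binary_segmentation(signal, min_gap=0.5)
```

In this toy signal the first split falls between the second and third plateaus, and the recursion then recovers the remaining change-point in the left half.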

Current change-point detection methods

The application of univariate segmentation methods to the bivariate data from SNP arrays requires: (i) knowing how the transformation typically applied to the BAF signal influences the applicability of certain segmentation methods and their extension to the bivariate case, and (ii) a mathematical model that generalises the extension from the univariate to the bivariate case. We provide such formalisations, illustrate that the approach taken by ASCAT is a specific case of the bivariate generalisation and discuss why there are more suitable formulations of the bivariate segmentation for ASCN analysis. Then, we show how the bivariate framework is applied to the SCM in order to obtain CnaStruct, a method that outperforms the segmentation of existing approaches.

Methods

BAF transformation and characterisation

Methods for the detection of changes in mean on univariate data can be extended to the bivariate case in order to be applied jointly to LRR and BAF, called “variables” from here on. However, a transformation of the BAF variable that leaves a mostly single-banded signal along the genomic axis is preferred for subsequent segmentation. To that end, BAF is first mirrored along the 0.5 axis to obtain mirrored BAF (mBAF). Then, non-informative SNPs, defined as those in homozygous bands of heterozygous regions, are removed, leaving a transformation that has already been described

**Transformation from BAF to mBAF and then into imBAF.** A sample toy BAF signal is transformed to mBAF and then into imBAF. X-axis: probe index. Y-axis: B allele frequency (first panel) and B allele frequency mirrored along the 0.5 axis (second and third panels). Grey points represent homozygous SNPs within heterozygous regions.
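The two steps of the transformation can be sketched in Python as follows. This is only an illustration under stated assumptions: missing values are encoded as `None`, the function names are hypothetical, and non-informative SNPs are masked with a fixed cut-off (`homozygous_threshold`, an assumed value) rather than by first identifying heterozygous regions, so the sketch would also mask probes inside genuine LOH regions of a pure sample.

```python
def to_mbaf(baf):
    """Mirror BAF values along the 0.5 axis; missing values stay missing."""
    return [None if b is None else abs(b - 0.5) + 0.5 for b in baf]


def to_imbaf(mbaf, homozygous_threshold=0.9):
    """Mask probes lying in the homozygous band.

    Assumption: any mBAF above `homozygous_threshold` is treated as a
    non-informative homozygous SNP and replaced by a missing value.
    """
    return [
        None if m is None or m > homozygous_threshold else m
        for m in mbaf
    ]


baf = [0.0, 0.25, 0.5, 1.0, None]
mbaf = to_mbaf(baf)     # values now lie in [0.5, 1]
imbaf = to_imbaf(mbaf)  # homozygous probes become missing
```

Note how the masking step adds missing values on top of those already present in BAF.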

The resulting imBAF is not homoscedastic for two reasons: (i) the homozygous band resembles a mixture of a point mass function and a truncated normal distribution with lower variance than its heterozygous counterparts; (ii) the distribution of the heterozygous band, when near the 0.5 axis, is truncated due to the mirroring and thus has lower variance. Nevertheless, homoscedasticity violations seem to be sufficiently small so as to not impact segmentation performance of the approaches we assessed (CBS and SCM).

Non-polymorphic probes yield missing values on the BAF variable. Additionally, the transformation of BAF into imBAF generates more missing values, all of which can be easily removed for the application of univariate segmentation approaches. However, the removal of missing values in bivariate approaches typically implies the exclusion of the corresponding LRR observations and, thus, loss of information. Therefore, missing values should be either handled by the segmentation method or imputed, which can be easily done through interpolation. In general, we observed that constant interpolation is more adequate for change-point detection than linear interpolation, because the latter inserts values that lie between imBAF bands, distorting the profile.
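The difference between the two imputation strategies can be seen in a small Python sketch. It assumes that missing values are encoded as `None` and that constant interpolation carries the previous observation forward (leading gaps take the first observed value); both choices, like the function names, are illustrative assumptions.

```python
def constant_interpolation(values):
    """Fill each missing value with the previous observation.
    Leading gaps take the first observed value."""
    last = next((v for v in values if v is not None), None)
    filled = []
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def linear_interpolation(values):
    """Fill interior gaps on the straight line between the flanking
    observations; gaps at the edges take the nearest observation."""
    observed = [i for i, v in enumerate(values) if v is not None]
    filled = constant_interpolation(values)
    for left, right in zip(observed, observed[1:]):
        for i in range(left + 1, right):
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


imbaf = [None, 0.5, None, 1.0, None]
constant = constant_interpolation(imbaf)  # stays on the observed bands
linear = linear_interpolation(imbaf)      # inserts 0.75 between the bands
```

With constant interpolation the imputed value stays on one of the two bands, whereas linear interpolation places it halfway between them, which is the kind of distortion described above.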

Bivariate segmentation

The methodology of univariate change in mean segmentation can be generically formalised in the following way. Consider the energy value

The generalisation can be extended to the multivariate case, where the objective of segmentation branches into finding recurrent changes in mean or changes present on a subset of variables. Approaches that detect points where the variables change together are based on the change in the covariance structure. However, we also seek to detect points where the variables LRR and imBAF change in opposite directions and points where just one of them undergoes a relevant mean change. The reason is that the copy number may remain constant along two segments with different allelic ratios, and vice versa. This takes us to the adequate model for our problem: a bivariate change in mean. Here, the bivariate decision function

**Minkowski distances.** (**A**) Example with Minkowski distances for the case where _{LRR} and _{imBAF}), 2 and infinite. Green shapes: lines that delineate mean differences with the same Minkowski distances of orders 1 (rotated rectangle), 2 (oval), and infinite (rectangle), with respect to the first segment mean. (**B**) Shapes in the bidimensional space of Minkowski distances of different orders. The points that make up each shape are, from the shape’s centre, at an equal
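As a numerical companion to the shapes described in the caption, the sketch below computes the Minkowski distance of a pair of mean differences for orders 1, 2 and infinity. The function and argument names are hypothetical, and the two differences are assumed to be expressed on comparable scales.

```python
import math


def minkowski_distance(d_lrr, d_imbaf, order):
    """Minkowski distance of the given order for a pair of mean
    differences (one per variable)."""
    a, b = abs(d_lrr), abs(d_imbaf)
    if math.isinf(order):
        return max(a, b)  # limit of the general formula as order grows
    return (a ** order + b ** order) ** (1.0 / order)


# Same pair of mean differences measured with three different orders.
order_1 = minkowski_distance(3.0, 4.0, 1)           # sum of both changes
order_2 = minkowski_distance(3.0, 4.0, 2)           # Euclidean distance
order_inf = minkowski_distance(3.0, 4.0, math.inf)  # largest change only
```

With order 1 both changes add up in full, while with infinite order only the larger of the two changes counts; intermediate orders lie between these two behaviours.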

CnaStruct

The model

The SCM segmentation

where $t_1, \ldots, t_{S+1}$ parameterise the borders of the $S$ segments, $\mu_s$ is the mean value of the $s$-th segment and $\varepsilon_k$ are the residuals.

CnaStruct is based upon SCM and extends it to a bivariate form that is suitable for ASCN analysis on SNP-array data. For the description of the bivariate form, consider first the residual sums of squares (RSSs) of a segment bounded by $t_s$ and $t_{s+1}$, in the LRR and imBAF variables respectively:

where $r_k$ and $b_k$ are the LRR and imBAF observations at the indexed SNP probe

SNP-array data may contain missing values in the LRR and BAF variables and, in addition, the transformation from BAF results in a high percentage of missing values in imBAF. Such cases do not contribute to the corresponding RSS and thus the RSS of each segment should be normalised with respect to the number of actually observed values in each variable:

where $n_{r,s}$ and $n_{b,s}$ are the numbers of non-missing observations of a segment
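The effect of normalising by the number of observed values can be illustrated with the following Python sketch. It is one plausible reading, not the formulation of the method: each RSS is rescaled by the ratio between the segment length and the number of non-missing observations, so that segments with many missing values are not favoured, and the two variables are then combined by a plain sum (Minkowski order 1). Missing values are encoded as `None` and all names are hypothetical.

```python
def rescaled_rss(values):
    """RSS around the mean of the observed values, rescaled to the full
    segment length (assumed normalisation; missing values are None)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return 0.0  # a fully missing segment contributes nothing
    centre = sum(observed) / len(observed)
    rss = sum((v - centre) ** 2 for v in observed)
    return rss * len(values) / len(observed)


def segment_cost(lrr, imbaf, start, end):
    """Assumed bivariate cost of the segment [start, end): the sum of
    the rescaled RSSs of the two variables."""
    return rescaled_rss(lrr[start:end]) + rescaled_rss(imbaf[start:end])


lrr = [0.0, 2.0, 5.0]
imbaf = [0.5, None, 1.0]
cost = segment_cost(lrr, imbaf, 0, 2)
```

Without the rescaling, a segment in which most imBAF values are missing would have an artificially small RSS in that variable simply because fewer terms enter the sum.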

Under the bivariate SCM, the model in Equation 4 is fitted by minimising the following cost function, which is the sum of all segment

A dynamic programming algorithm (see _{1}…_{S}. The decision function _{s} if the segment

where ^{2}) to O(nl)
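The dynamic programming search for the change-points that minimise a sum of segment costs can be sketched as follows. For brevity, this Python sketch is univariate and uses the plain RSS as segment cost; the function name, the fixed number of segments `n_segments` and the cap `max_length` on the segment length are illustrative assumptions. Capping the segment length bounds the inner loop to at most `max_length` candidate start positions per end position.

```python
def optimal_segmentation(values, n_segments, max_length):
    """Return (cost, change_points) of the least-squares fit of `values`
    by `n_segments` constant segments of at most `max_length` points."""
    n = len(values)
    # Prefix sums give the RSS of any segment [i, j) in constant time.
    s1, s2 = [0.0], [0.0]
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def rss(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[s][j]: minimal cost of fitting values[:j] with s segments.
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    prev = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for s in range(1, n_segments + 1):
        for j in range(s, n + 1):
            for i in range(max(s - 1, j - max_length), j):
                cost = best[s - 1][i] + rss(i, j)
                if cost < best[s][j]:
                    best[s][j], prev[s][j] = cost, i
    if best[n_segments][n] == inf:
        raise ValueError("no segmentation satisfies the length cap")
    # Backtrack the start positions of segments 2..S (the change-points).
    change_points, j = [], n
    for s in range(n_segments, 1, -1):
        j = prev[s][j]
        change_points.append(j)
    return best[n_segments][n], change_points[::-1]


signal = [0.0] * 3 + [2.0] * 3 + [5.0] * 3
cost, change_points = optimal_segmentation(signal, n_segments=3, max_length=5)
```

Unlike binary segmentation, this search is global: every admissible set of change-points is implicitly compared, and the table `best` also holds the optimal fits for every smaller number of segments, which is convenient for model selection.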

Because this is a fitting problem, Minkowski distances of order

Model selection

Data can always be fitted better by increasing the number of change-points

Assuming that the residual errors $\varepsilon_k$ in Equation 4 are independent, the log-likelihood of a model using the Bayesian information criterion (BIC) is:

where
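To illustrate how a BIC-type penalty counteracts the fact that more change-points always fit the data better, the sketch below uses the textbook Gaussian log-likelihood of a least-squares fit and subtracts a BIC penalty. The exact expression and parameter count used by the method are not reproduced here; counting two parameters per segment (S means, S - 1 change-points and one variance) is an assumption, as are the function names and the toy RSS values.

```python
import math


def penalised_log_likelihood(rss, n, n_segments):
    """Gaussian log-likelihood of a piecewise-constant fit with residual
    sum of squares `rss` (> 0) on `n` points, minus a BIC penalty."""
    log_lik = -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)
    # Assumed parameter count: S means, S - 1 change-points, 1 variance.
    n_parameters = 2 * n_segments
    return log_lik - 0.5 * n_parameters * math.log(n)


def select_n_segments(rss_by_segments, n):
    """Pick the number of segments with the highest penalised
    log-likelihood. `rss_by_segments` maps S to the RSS of the best
    S-segment fit."""
    return max(
        rss_by_segments,
        key=lambda s: penalised_log_likelihood(rss_by_segments[s], n, s),
    )


# Toy values: the RSS keeps decreasing, but only the second segment
# improves the fit enough to pay for its penalty.
rss_by_segments = {1: 200.0, 2: 50.0, 3: 49.0, 4: 48.5}
chosen = select_n_segments(rss_by_segments, n=100)
```

In the toy values the RSS decreases with every added segment, yet the penalised criterion stops at two segments.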

Software

We built a CnaStruct R package that is freely available at

Results and discussion

We evaluated the performance of CnaStruct against the two latest HMM-based methods (GPHMM

All the assessed methods can handle Illumina data, so we evaluated them on the benchmarking dataset from Mosén-Ansorena et al.

A true change-point was considered recalled if at least one predicted change-point fell within a window of 3 probes from it, a threshold that is wide enough to recover most of the correct predictions in the benchmark dataset. Furthermore, beyond this window size, between-method differences do not vary significantly. Given that GAP outputs the result of merging three segmentations, the calculation of the specificity does not penalise repeated calls of the same change-point, so as not to deflate GAP’s specificity.
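The matching rule can be sketched in Python as follows. The sketch assumes that "within a window of 3 probes" means an absolute index difference of at most 3, and it uses the fraction of predictions that match a true change-point as a stand-in for the specificity-like measure, whose exact definition is not restated here; the function names are hypothetical.

```python
def change_point_recall(true_points, predicted_points, window=3):
    """Fraction of true change-points with at least one prediction
    within `window` probes."""
    if not true_points:
        return 0.0
    hits = sum(
        any(abs(t - p) <= window for p in predicted_points)
        for t in true_points
    )
    return hits / len(true_points)


def prediction_hit_rate(true_points, predicted_points, window=3):
    """Fraction of predictions within `window` probes of a true
    change-point. Repeated calls near the same true change-point all
    count as correct, so duplicates are not penalised."""
    if not predicted_points:
        return 0.0
    hits = sum(
        any(abs(t - p) <= window for t in true_points)
        for p in predicted_points
    )
    return hits / len(predicted_points)


truth = [100, 200, 300]
calls = [101, 102, 204, 500]
recall = change_point_recall(truth, calls)
hit_rate = prediction_hit_rate(truth, calls)
```

Here the calls at 101 and 102 both match the true change-point at 100 and both count as correct, while the call at 204 falls just outside the window.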

Receiver operating characteristic (ROC) curves allow visual assessment of method performance and the influence of sensitivity parameterisation (Figure

**ROC curves for the different methods over a subset of the synthetic data.** ROC curves that arise from running methods with different sensitivity parameterisations over the complex-patterned samples with 50% normal cell contamination. The combinations of pattern and contamination level were chosen for being representative of the overall performance. Sensitivity is shown in the vertical axis and specificity in the horizontal axis. Colour code: purple (ASCAT), red (CnaStruct), orange (GAP), black (GPHMM), blue (OncoSNP). Bigger dots correspond to the results obtained with default parameterisations (two for GAP due to the different parameterisations of CBS in the two versions of GAP). If applicable, squares correspond to the best non-default parameterisations. Grey lines are F-measure isocurves (the F-measure integrates, with the same weight, sensitivity and specificity in a single value).

The default parameterisations in OncoSNP and ASCAT are aimed at the detection of longer regions than the ones included in the analysed synthetic samples. Therefore, in order to account for parameterisation differences and keep further comparisons fair, we replaced the default sensitivity-related values with those that achieved the best combination of specificity and sensitivity in the corresponding ROC curves. This combination is quantified by the F-measure, the harmonic mean of specificity and sensitivity. However, notice that the traditional F-measure gives the same importance to both measures, which may not be adequate, as it has been noted that sensitivity is preferable over specificity
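For reference, the harmonic mean described above, together with the standard weighted generalisation that lets one measure count more than the other, can be written as a short Python function. The weighted form is included only to illustrate the remark about unequal importance and is not claimed to have been used in the evaluation; the parameter `beta` and the function name are assumptions.

```python
def f_measure(sensitivity, specificity, beta=1.0):
    """Weighted harmonic mean of sensitivity and specificity.

    beta = 1 gives the traditional F-measure; beta > 1 gives more
    weight to sensitivity (standard F-beta form, with specificity in
    the role usually played by precision).
    """
    if sensitivity == 0 and specificity == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * sensitivity * specificity / (b2 * specificity + sensitivity)


balanced = f_measure(0.5, 1.0)                   # plain harmonic mean
sensitive_method = f_measure(0.9, 0.5, beta=2)   # rewarded under beta = 2
specific_method = f_measure(0.5, 0.9, beta=2)    # scores lower under beta = 2
```

With `beta=1` the two toy methods would tie, whereas `beta=2` ranks the more sensitive one higher.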

**Recall rates by normal cell contamination and alteration pattern, and alteration length for different parameterisations.** Recall rates (y-axis) by normal cell contamination level, sample pattern and alteration length (x-axis) for two different parameterisations of ASCAT (violet: default; brown: segmentation penalisation scaled by a factor of 0.35). Recall rates converge as region length increases, suggesting that both parameterisations achieve similar recall rates at long lengths, but the one that focuses on sensitivity is able to recall more short regions.

We ran the five methods with their optimal parameterisations based on their ROC curves and F-measures, with the exception of GPHMM, which does not allow parameterisation tuning. GAP was run with its default segmentation parameterisation in its original and updated version, which achieved similar F-measures. CnaStruct consistently achieves the best change-point sensitivities and F-measures out of the compared methods along the five alteration patterns and four normal cell contamination levels (Figure

**Change-point sensitivity (y-axis) and specificity (x-axis) by sample pattern.** Dots connected by a line correspond to the sensitivity and specificity achieved by the corresponding method at the following normal cell contamination levels: 0%, 25%, 50% or 75%. Colour code: purple (ASCAT), red (CnaStruct), orange (GAP with CBS parameterisations from: original (left); updated (right)), black (GPHMM), blue (OncoSNP). Grey lines are F-measure isocurves (the F-measure integrates, with the same weight, sensitivity and specificity in a single value).

To test whether downstream characterisation of allele-specific copy numbers improves with CnaStruct segmentation, we replaced the segmentation algorithms in GAP and ASCAT with CnaStruct (see Additional file

**Recall rates by normal cell contamination and alteration pattern.** Recall rates (y-axis) of each of the assessed methods, calculated by normal cell contamination (x-axis), over each of the five sample patterns. Colour code: purple (ASCAT), orange (GAP), black (GPHMM), blue (OncoSNP). Thicker lines correspond to the workflows in which CnaStruct was integrated.

**Description of the procedures to couple CnaStruct with GAP, ASCAT and TAPS.**

**Recall rates by normal cell contamination and alteration pattern, and alteration length for assessed methods.** Recall rates (y-axis) of each of the assessed methods, calculated by normal cell contamination and alteration length (x-axis) over each of the five sample patterns. Colour code: purple (ASCAT), orange (GAP), black (GPHMM), blue (OncoSNP). Thicker lines correspond to the workflows in which CnaStruct was integrated.

Although we only assessed CnaStruct on Illumina-like data, we ran it in combination with GAP, ASCAT and TAPS on samples from either the Illumina or Affymetrix platform (Additional file

**Results of the analyses of real data with a combination of CnaStruct and other methods.** The analysed samples are: (i) Two samples from the Affymetrix platform, which are bundled with the TAPS software package (example02 and example16). These samples were analysed with CnaStruct-TAPS. Provided TAPS results format: columns “Start” and “End” specify probe genomic positions within chromosome. (ii) Two samples from the Illumina platform, which come from a cell-line dilution series

**Plots for the analysis of real data with a combination of CnaStruct and other methods.** The LRR profiles of several samples, as analysed with different combinations of CnaStruct and other methods, are displayed. Colour code: blue, segment is called as being CN4 or higher; green, CN3; grey, CN2; red, CN1 or CN0. Only segments with more than 10 SNPs are superimposed. Even though ASCAT fails at the calling step on the 53% contamination sample, both ASCAT and GAP detect a loss on chromosome 13 not present in the pure tumour sample.

Conclusions

We have first identified the issues that arise in segmentation due to the characteristics of imBAF, namely a high proportion of missing values and heteroscedasticity. Although this transformation had already been described, no literature existed on how the peculiarities of imBAF affect segmentation, and more specifically bivariate segmentation.

Then, we have introduced and formalised the bivariate segmentation of SNP-array data for the characterisation of ASCNs in tumour samples. The formalisation generalises the problem and describes the extension from the univariate to the bivariate case, so further univariate methods can eventually be extended to the bivariate SNP-array case through this mathematical framework. With an appropriately selected Minkowski order, the generalisation considers the interaction between variables and their common features, but it is still capable of retrieving changes in a single variable. Thus, the proposed segmentation approach offers a middle ground between univariate approaches (e.g. CBS in GAP), which do not include the information available from both variables in the same model and are prone to skipping changes common to the two variables, and bivariate approaches with

CnaStruct exemplifies the benefits of bivariate segmentation with an adequately selected Minkowski order and outperforms existing methods at change-point detection on synthetic data. In addition, when coupled with the pattern recognition processes of GAP or ASCAT, the new workflows improve the downstream ASCN analysis in comparison with their original counterparts and the other compared methods. Notably, given its performance under the low-contrast situations produced by high normal cell contamination levels and intra-tumour heterogeneity, CnaStruct should greatly improve allele-specific copy number characterisation in samples extracted from tumour biopsies, which are typically highly contaminated with normal cells, and in samples from advanced tumours, which are expected to present greater intra-tumour cellular heterogeneity.

Competing interests

The authors declare no competing interests.

Authors’ contributions

DMA conceived the study, supervised by AMA. Both authors participated in the writing of the manuscript. DMA devised the statistical model and performed the analyses. Both authors read and approved the final manuscript.

Acknowledgements

DMA is supported by the Government of Navarra, Spain through the grant “Ayuda predoctoral para realizar una tesis doctoral y obtener el título de doctor (Plan de Formación y de I + D 2010/2011)”. AMA and the research expenses are supported by the Department of Industry, Tourism and Trade of the Government of the Autonomous Community of the Basque Country (Etortek Research Programs 2010/2012) and by the Innovation Technology Department of the Bizkaia County.