School of Computer Science and Technology, Xidian University, Xi'an, P. R. China

Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, USA

Center for Sleep Sciences and Medicine, Stanford University School of Medicine, Palo Alto, CA, 94304, USA

Departments of Gynecology/Obstetrics and Oncology, Johns Hopkins University School of Medicine, Baltimore, MD, 21231, USA

Lombardi Comprehensive Cancer Center and Department of Oncology, Georgetown University, Washington, DC, 20057, USA

Research Center for Genetic Medicine, Children's National Medical Center, Washington, DC, 20010, USA

The International Baccalaureate Magnet Diploma Program, Richard Montgomery High School, Rockville, MD, 20852, USA

Department of Pathology, Johns Hopkins Medical Institutions, Baltimore, MD, 21231, USA

Abstract

Background

Somatic Copy Number Alterations (CNAs) in human genomes are present in almost all human cancers. Systematic efforts to characterize such structural variants must effectively distinguish significant consensus events from random background aberrations. Here we introduce Significant Aberration in Cancer (SAIC), a new method for characterizing and assessing the statistical significance of recurrent CNA units. Three main features of SAIC include: (1) exploiting the intrinsic correlation among consecutive probes to assign a score to each CNA unit instead of single probes; (2) performing permutations on CNA units that preserve correlations inherent in the copy number data; and (3) iteratively detecting Significant Copy Number Aberrations (SCAs) and estimating an unbiased null distribution by applying an SCA-exclusive permutation scheme.

Results

We test and compare the performance of SAIC against four peer methods (GISTIC, STAC, KC-SMART, CMDS) on a large number of simulation datasets. Experimental results show that SAIC outperforms peer methods in terms of larger area under the

Conclusions

Supported by a well-grounded theoretical framework, SAIC has been developed and used to identify SCAs in various cancer copy number data sets, providing useful information for studying the landscape of cancer genomes. The open-source, platform-independent SAIC software is implemented in C++, together with R scripts for data formatting and Perl scripts for user interfacing; it is easy to install and efficient to use. The source code and documentation are freely available at

Background

Somatic copy number alterations (CNAs) are common genetic events in the development and progression of various human cancers, and significantly contribute to tumorigenesis

By studying a sufficiently large collection of cancer samples, Significant Copy Number Aberrations (SCAs), defined as significantly recurrent CNAs that affect the same region in multiple tumors, are widely considered as informative surrogates of “driver” mutations that may help pinpoint novel cancer-causing genes

Significance testing for aberrant copy number (STAC) starts by converting the normalized log-ratios into a binary matrix, with zeros indicating no change and ones indicating losses and gains

Existing methods have several limitations. When working with unprocessed raw intensity ratios

We now report Significant Aberration in Cancer (SAIC), a carefully motivated method for accurately identifying SCAs using CNA data from multiple samples. To distinguish between the different biological roles of CNA types, and between noise and sporadic CNAs, we use discretized CNA data and analyze copy number amplifications and deletions separately. By exploiting the intrinsic correlation among consecutive probes, we calculate and assign a score (test statistic) to each CNA unit instead of each single probe, based on both the amplitude and frequency of CNAs within the unit. To accurately estimate the null distribution governing sporadic CNAs, we perform random positional permutations on CNA units that preserve correlations inherent to the copy number data. More importantly, to minimize the unwanted participation of true SCAs in determining the null distribution

We tested SAIC on extensive simulation data sets, observing significantly improved performance with larger areas under the

Methods

Data format and definitions

Preprocessed log-ratio data are stored in a numeric _{nm} represents DNA copy number (in log2-ratio) for sample _{n} corresponds to copy number for _{amplification} +_{deletion}, where

with _{amplification} and _{deletion} being the pre-specified thresholds. For brevity, we focus all subsequent discussion on _{amplification} and make comments on _{deletion} when necessary.
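As a minimal sketch of this discretization, the Python snippet below splits a small samples-by-probes log2-ratio matrix into an amplification part and a deletion part. The function name `split_cna_matrix`, the strict inequalities, and the choice to keep the original log-ratio (rather than a 0/1 indicator) for retained entries are assumptions made for illustration; the thresholds 0.1 and −0.1 are the GISTIC/CBS default values quoted in the simulation studies.

```python
def split_cna_matrix(log_ratios, amp_threshold, del_threshold):
    """Split a samples-by-probes log2-ratio matrix into two matrices.

    Entries above amp_threshold are kept in the amplification matrix,
    entries below del_threshold are kept in the deletion matrix, and
    everything else is set to zero (assumed convention, see lead-in).
    """
    amp = [[x if x > amp_threshold else 0.0 for x in row] for row in log_ratios]
    dele = [[x if x < del_threshold else 0.0 for x in row] for row in log_ratios]
    return amp, dele


# Toy data: 2 samples (rows) by 4 probes (columns).
log_ratios = [
    [0.05, 0.40, 0.35, -0.02],
    [-0.50, 0.02, 0.30, -0.45],
]
amp, dele = split_cna_matrix(log_ratios, amp_threshold=0.1, del_threshold=-0.1)
```

Keeping the two matrices separate mirrors the text's decision to analyze amplifications and deletions independently.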

Definition 1

Any copy number probe

To exploit correlations inherent in copy number data, we first merge consecutive CNA probes into CNA regions, leaving the gaps consisting of only non CNA probes, see Figure _{ij} between CNA probes

where _{i} and _{j} are the estimated means and standard deviations of copy numbers at probes _{ij} is less than a pre-specified threshold

**An illustration on how CNA units are defined.** Left: Consecutive CNA probes are merged into two intervals, with the first interval containing probes 1–10 and the second interval containing probes 14–16. Right: Each of the two intervals is split into CNA units according to the correlation coefficients between CNA probes defined by Eq. (2),

Definition 2

A sequence of consecutive CNA probes with no breakpoints is defined as a CNA unit, denoted by

Intuitively, a CNA unit consists of a sequence of highly correlated consecutive CNA probes. Figure
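The construction of CNA units can be sketched in Python as follows. This is an illustration under stated assumptions, not the SAIC implementation: a probe is treated as a CNA probe when at least one sample is non-zero in the thresholded matrix, the correlation between neighbouring probes is taken to be the ordinary Pearson coefficient across samples, and a breakpoint is placed wherever that coefficient falls below the threshold. The names `pearson` and `define_cna_units` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)


def define_cna_units(matrix, rho_threshold):
    """Return CNA units as (first_probe, last_probe) index pairs, inclusive."""
    n_probes = len(matrix[0])
    columns = [[row[m] for row in matrix] for m in range(n_probes)]
    is_cna = [any(v != 0 for v in col) for col in columns]  # assumed definition
    units, current = [], []
    for m in range(n_probes):
        if not is_cna[m]:
            # A non-CNA probe closes the current region.
            if current:
                units.append(current)
                current = []
            continue
        # Weak correlation with the previous CNA probe is a breakpoint.
        if current and pearson(columns[m - 1], columns[m]) < rho_threshold:
            units.append(current)
            current = []
        current.append(m)
    if current:
        units.append(current)
    return [(u[0], u[-1]) for u in units]


# Toy thresholded amplification matrix: 4 samples by 6 probes.
matrix = [
    [0.5, 0.5, 0.0, 0.0, 0.3, 0.3],
    [0.6, 0.6, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.0, 0.3, 0.3],
    [0.0, 0.0, 0.5, 0.0, 0.0, 0.0],
]
units = define_cna_units(matrix, rho_threshold=0.9)
```

In this toy example probes 0–2 form one merged region, which is then split between probes 1 and 2 because their copy numbers are negatively correlated across samples.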

Summary statistics and significance assessment

Units that exhibit high or low average copy number are of interest, so it is natural to examine summary statistics for each unit. SAIC identifies significant aberration units through two steps. First, the method calculates a statistic (

Second, the method assesses the statistical significance of each CNA unit by comparing the observed statistic to the

Sporadic CNA units often occur throughout the genome, so a null distribution for _{k, L} under the hypothesis that no SCAs are present, can be estimated by randomly permuting the overall pattern of presumed all-sporadic CNA units across the genome

Let ^{(t)} be the random positional permutation of

Algorithm 1

Assessing the statistical significance of _{k, L}

(1) Perform ^{(1)}, ^{(2)}, …, ^{(T)} of the data matrix

(2) Compute the value of summary statistic

(3) Calculate and assign a P-value to each observed CNA unit

where

The empirical P-values on _{deletion} are calculated from the extreme left-hand tail probabilities by reversing the inequality in Eq. (4). Both definitions produce P-values that are easy to interpret, and the “max” operation automatically adjusts the P-values for multiple comparisons across CNA units and thus controls the family-wise error rate
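The role of the “max” operation can be illustrated with a short Python sketch. It assumes that the P-value of a unit is the fraction of permutations whose largest unit statistic is at least as large as the observed statistic (a single-step max-statistic adjustment); the exact form of Eq. (4) is not reproduced here, and `max_stat_pvalues` is a hypothetical helper name.

```python
def max_stat_pvalues(observed, permuted):
    """Family-wise adjusted empirical P-values for amplification statistics.

    observed : list of observed unit statistics
    permuted : list of lists, one list of unit statistics per permutation
    """
    maxima = [max(stats) for stats in permuted]  # one maximum per permutation
    n_perm = len(maxima)
    return [sum(1 for m in maxima if m >= u) / n_perm for u in observed]


# Toy example: 2 observed units, 4 permutations.
observed = [5.0, 1.0]
permuted = [[1.0, 2.0], [3.0, 0.0], [6.0, 1.0], [2.0, 2.0]]
pvals = max_stat_pvalues(observed, permuted)
```

For deletions the same idea would apply with the minimum statistic and the inequality reversed, as described above.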

In Algorithm 1, it is important to note that when we generate a randomly permuted data set from the observed data, we do not re-define the CNA units but re-use the already-defined ones. Specifically, in each permutation, we randomly place the already-defined CNA units over the whole genome or over each chromosome within each sample, and calculate the summary
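One simple way to realize such a unit-preserving positional permutation is sketched below in Python. It is an assumption-laden illustration rather than the SAIC code: within one sample, each pre-defined CNA unit is moved as an intact block, each remaining probe is treated as its own block, and the blocks are shuffled. The helper name `permute_units` is hypothetical.

```python
import random


def permute_units(row, units, rng):
    """Randomly reposition pre-defined CNA units within one sample.

    row   : list of copy number values for one sample
    units : sorted, non-overlapping (start, end) index pairs, inclusive
    rng   : random.Random instance, so that results are reproducible
    """
    segments, pos = [], 0
    for start, end in units:
        segments.extend([v] for v in row[pos:start])  # probes outside units
        segments.append(row[start:end + 1])           # a unit stays intact
        pos = end + 1
    segments.extend([v] for v in row[pos:])
    rng.shuffle(segments)
    return [v for seg in segments for v in seg]


row = [0.0, 0.0, 0.5, 0.6, 0.7, 0.0, 0.0, 0.3, 0.4, 0.0]
units = [(2, 4), (7, 8)]
permuted_row = permute_units(row, units, random.Random(0))
```

Because units are never re-defined or broken up, the within-unit correlation of the observed data carries over to every permuted data set.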

Iterative estimation of unbiased null distribution

One important issue concerning Algorithm 1 is the presence of true SCAs (departing from the null distribution) in cancer genomes, which presumably contribute high copy number deviations to the estimation of the overall null distribution (governing only sporadic CNAs), potentially reducing the power to detect less-extreme SCAs due to theoretical conservativeness _{-SCAs} in which already-detected SCAs become null.

Algorithm 2

Assessing iteratively the statistical significance of _{k, L}

(1) Perform Algorithm 1;

(2) Check whether ‘new’ SCAs are detected. If ‘yes’, continue; if ‘no’, stop and re-calculate the P-values for all SCAs using the converged null distribution;

(3) Mask the CNA units associated with newly detected SCAs as zeros and let

It has been shown experimentally that additional power to detect SCAs can be gained by removing the effect of newly detected SCAs after each iteration
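The control flow of Algorithm 2 can be sketched in Python as a loop around any P-value routine that accepts a set of masked units. The driver `iterative_detection` and the toy null in `toy_pvalues` are hypothetical stand-ins chosen so that the example is self-contained; in particular, the toy null simply pools the scores of unmasked units and is not the permutation null of Algorithm 1.

```python
def iterative_detection(n_units, pvalue_fn, alpha=0.05, max_iter=100):
    """Repeat testing, masking newly detected units, until nothing new is found.

    pvalue_fn(masked) must return one P-value per unit, computed from a null
    distribution that excludes the units listed in `masked`.
    """
    detected = set()
    for _ in range(max_iter):
        pvals = pvalue_fn(detected)
        new = {k for k in range(n_units)
               if k not in detected and pvals[k] < alpha}
        if not new:
            break
        detected |= new
    # Final P-values use the null obtained after the last masking step.
    return detected, pvalue_fn(detected)


# Toy scores: two strong units among 38 background units.
scores = [10.0, 6.0] + [1.0] * 38


def toy_pvalues(masked):
    """Toy null (assumption): pool of scores from units that are not masked."""
    pool = [s for k, s in enumerate(scores) if k not in masked]
    return [sum(1 for v in pool if v >= s) / len(pool) for s in scores]


detected, final_pvals = iterative_detection(len(scores), toy_pvalues)
```

In this toy setting the second unit is not significant in the first pass, because the strongest unit inflates the null, and becomes significant only after the strongest unit has been masked.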

Theorem 1

SAIC algorithm and data preprocessing

Figure

**Schematic flowchart of combined SAIC algorithms 1 and 2.**

Results

In the absence of definitive ground truth about the recurrent CNAs in the cancer genomes, the validation of a new method for detecting SCAs is always problematic

Simulation studies

Multiple simulation data sets with definitive ground truth and various design or parameter settings were generated based on the modified benchmark models proposed in _{k, L} (

| Null simulation model | Empirical FWER at α = 0.05 level |
| --- | --- |
| Copy number data | 0.0488 |
| Clumped copy number data (25%) | 0.0500 |
| Clumped copy number data (50%) | 0.0493 |
| Clumped copy number data (75%) | 0.0505 |
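For reference, an empirical family-wise error rate of this kind can be computed as the fraction of null data sets in which at least one unit is called significant. The Python sketch below shows that calculation on made-up P-values; the function name `empirical_fwer` and the toy numbers are illustrative assumptions and are unrelated to the values reported above.

```python
def empirical_fwer(pvalues_per_dataset, alpha=0.05):
    """Fraction of null data sets with at least one P-value below alpha."""
    n_false = sum(1 for pvals in pvalues_per_dataset if min(pvals) < alpha)
    return n_false / len(pvalues_per_dataset)


# Toy example: 4 simulated null data sets, 2 CNA units each.
null_runs = [[0.20, 0.60], [0.01, 0.90], [0.50, 0.50], [0.30, 0.04]]
fwer = empirical_fwer(null_runs)
```

With many replications of a null model, a well-calibrated procedure should give a value close to the nominal level.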

We then assessed the detection power of SAIC as compared to GISTIC. Based on the simulation model proposed in _{λ} and _{λ} being the mean and standard deviation of normal cell fraction in the sample. Each sample contains two sporadic CNA regions, one deletion and one amplification randomly drawn from integer sets {0, 1} and {3, 4,…,8}, respectively. Each data set contains two recurrent CNA regions that are contributed from a fraction of samples according to a specified frequency
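The effect of normal cell contamination on the simulated signal can be illustrated with a standard two-component mixing model, sketched in Python below. The formula (observed copies equal to the normal fraction times two plus the tumor fraction times the tumor copy number) and the clipping of the sampled fraction to [0, 1] are assumptions for illustration; the exact simulation model is the one defined in the cited reference. The names `mixed_log_ratio` and `draw_normal_fraction` are hypothetical.

```python
import math
import random


def mixed_log_ratio(tumor_copies, normal_fraction):
    """log2-ratio observed when tumor cells are mixed with diploid normal cells."""
    observed = 2 * normal_fraction + tumor_copies * (1 - normal_fraction)
    if observed <= 0:
        raise ValueError("observed copy number must be positive")
    return math.log2(observed / 2)


def draw_normal_fraction(mean, sd, rng):
    """Sample a normal cell fraction from a Gaussian, clipped to [0, 1]."""
    return min(1.0, max(0.0, rng.gauss(mean, sd)))


rng = random.Random(1)
lam = draw_normal_fraction(mean=0.6, sd=0.25, rng=rng)
# A sporadic amplification (4 copies) and deletion (1 copy) seen through the mixture.
amp_signal = mixed_log_ratio(4, lam)
del_signal = mixed_log_ratio(1, lam)
```

A larger normal fraction pulls both signals toward zero, which is why detection power is reported as a function of the mean and standard deviation of that fraction.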

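| … = 60, … = 0.2, _{λ} = 0.6, _{λ} = | 0.15 | 0.2 | 0.25 | 0.3 | 0.35 |
| --- | --- | --- | --- | --- | --- |
| GISTIC | 89% | 86% | 79% | 74% | 72% |
| SAIC | 96% | 94% | 86% | 86% | 82% |

| _{λ} = 0.25, _{λ} = | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 |
| --- | --- | --- | --- | --- | --- |
| GISTIC | 83% | 81% | 82% | 72% | 79% |
| SAIC | 93% | 91% | 87% | 79% | 74% |

| _{λ} = 0.25, _{λ} = 0.6, | 40 | 50 | 60 | 70 | 80 |
| --- | --- | --- | --- | --- | --- |
| GISTIC | 58% | 73% | 79% | 86% | 89% |
| SAIC | 65% | 83% | 87% | 93% | 94% |

| _{λ} = 0.25, _{λ} = 0.6, | 0.1 | 0.15 | 0.2 | 0.25 |
| --- | --- | --- | --- | --- |
| GISTIC | 30% | 58% | 80% | 92% |
| SAIC | 37% | 72% | 87% | 97% |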

We further assessed the overall performance of SAIC, measured by both sensitivity and specificity via ROC curves, as compared with the four peer methods (GISTIC, STAC, KC-SMART, CMDS). Based on the modified benchmark model proposed in _{L} and _{ω} to modify the length and frequency of these SCAs. Other parameter settings include _{ρ} = 0.75, _{amplification} = 0.1 and _{deletion} = −0.1 (default setting by GISTIC and CBS) for defining CNAs probes and units. Based on the estimated true positive rate (TPR) and corresponding FPR at different significance levels, Figure _{z} under the ROC curves or increased sensitivity at low FPR. More simulation studies are given in Additional file
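The ROC-based comparison rests on computing a true positive rate and a false positive rate at each significance level. The Python sketch below shows one way to do this and to integrate the resulting curve with the trapezoidal rule; the function names, the per-probe labelling of ground truth, and the toy P-values are illustrative assumptions rather than the evaluation code used in the study.

```python
def roc_points(pvalues, is_true_sca, thresholds):
    """(FPR, TPR) pairs obtained by calling P-values below each threshold."""
    n_pos = sum(1 for y in is_true_sca if y)
    n_neg = len(is_true_sca) - n_pos
    points = []
    for t in thresholds:
        tp = sum(1 for p, y in zip(pvalues, is_true_sca) if y and p < t)
        fp = sum(1 for p, y in zip(pvalues, is_true_sca) if not y and p < t)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under the curve through (0, 0) and the given points, sorted by FPR."""
    pts = [(0.0, 0.0)] + sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Toy example: 3 true SCA probes followed by 3 background probes.
pvalues = [0.001, 0.02, 0.2, 0.03, 0.6, 0.9]
truth = [True, True, True, False, False, False]
points = roc_points(pvalues, truth, thresholds=[0.01, 0.05, 0.5, 1.0])
area = trapezoid_area(points)
```

Restricting the thresholds to small values gives the low-FPR portion of the curve, which is the region emphasized by a partial ROC comparison.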

**Table S1.** Comparative detection rates of ground truth SCA boundaries by STAC, GISTIC, KC-SMART, CMDS, and SAIC for simulation data sets under various model parameter settings. The results are calculated based on 100 replications for each of the parameter settings and using p-value (or q-value) cutoff threshold <0.05.

**Comparative performance of SAIC and four peer methods (STAC, GISTIC, KC-SMART, CMDS) on realistic simulation data sets, quantified by the partial ROC curves (north-west) (TPR: true positive rate; FPR: false positive rate).** The results are the averages calculated based on 100 replications under each of various parameter settings.

Application to four real cancer copy number data sets

We applied SAIC to four real cancer copy number data sets and identified many SCAs that encompass established or potentially novel cancer ‘driver’ genes. The data sets are from ovarian cancer

Results on the ovarian cancer data set

Our in-house ovarian cancer data set consists of _{ρ} = 0.95, _{amplification} = 0.263 (2.4 copies) and _{deletion} = −0.322 (1.6 copies) _{10}
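The correspondence between the log2-ratio thresholds and the copy numbers quoted in parentheses follows from the usual conversion for a diploid reference, log2(copies / 2). The short Python check below reproduces the two values; the helper names are hypothetical, and a reference ploidy of two is assumed.

```python
import math


def copies_to_log2_ratio(copies, ploidy=2):
    """log2-ratio of a given copy number relative to the reference ploidy."""
    return math.log2(copies / ploidy)


def log2_ratio_to_copies(log2_ratio, ploidy=2):
    """Inverse conversion: copy number implied by a log2-ratio."""
    return ploidy * 2 ** log2_ratio


amp_threshold = round(copies_to_log2_ratio(2.4), 3)   # 0.263
del_threshold = round(copies_to_log2_ratio(1.6), 3)   # -0.322
```

The inverse function can be used in the same way to express any other log2-ratio threshold as an absolute copy number.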

**Table S2 and Table S3.** Details about the implicated SCAs and full list of genes covered by these SCAs, derived from the ovarian cancer data set.

**Genome-wide landscapes of recurrent or sporadic CNAs derived from 63 ovarian cancer samples.** Amplifications and deletions are displayed on the left and right sides, respectively, where dashed lines correspond to the significance level α = 0.05 for calling SCAs.

Results on the metastatic prostate cancer dataset

Our in-house prostate cancer data set consists of _{ρ} = 0.95, _{amplification} = 0.263 and _{deletion} = −0.322, the same as used in analyzing ovarian cancer data. The genome-wide landscape of recurrent or sporadic CNAs observed in metastatic prostate cancer data is given in Figure

**Table S4 and Table S5.** Details about the implicated SCAs and full list of genes covered by these SCAs, derived from the prostate cancer data set.

**Genome-wide landscapes of recurrent or sporadic CNAs derived from 13 metastatic prostate cancer samples.** Amplifications and deletions are displayed on the left and right sides, respectively, where dashed lines correspond to the significance level

Results on the lung adenocarcinoma and glioblastoma datasets

The lung adenocarcinoma data set consists of _{amplification} = 0.848 and _{deletion} = −1.15, in addition to _{ρ} = 0.9. The genome-wide landscape of recurrent or sporadic CNAs observed in lung adenocarcinoma data is given in Figure

**Table S6 and Table S7.** Details about the implicated SCAs and full list of genes covered by these SCAs, derived from the lung adenocarcinoma data set.

**Genome-wide landscapes of recurrent or sporadic CNAs derived from 371 lung adenocarcinoma samples.** Amplifications and deletions are displayed on the left and right sides, respectively, where dashed lines correspond to the significance level α = 0.05 for calling SCAs.

**Venn diagram on the numbers of common and distinct focal SCAs detected by SAIC and GISTIC in the lung adenocarcinoma samples.**

The glioblastoma data set consists of

**Table S8 and Table S9.** Details about the implicated SCAs and full list of genes covered by these SCAs, derived from the glioblastoma data set.

**Genome-wide landscapes of recurrent or sporadic CNAs derived from 141 glioblastoma samples.** Amplifications and deletions are displayed on the left and right sides, respectively, where dashed lines correspond to the significance level α = 0.05 for calling SCAs.

**Venn diagram on the numbers of common and distinct focal SCAs detected by SAIC and GISTIC in the glioblastoma samples.**

The common SCA regions (e.g., 7p11.2, 12p12.1, 9p21.3) are highly consistent with previous reports, and largely encompass well-known oncogenes or tumor suppressor genes. For example, EGFR (epidermal growth factor receptor) is an oncogene within 7p11.2 whose mutations or amplifications have been shown to contribute to uncontrolled cell division (a predisposition for cancer)

Discussion

SAIC is similar to many peer methods in that it assesses statistical significance of SCAs using a permutation-based null distribution

As for the _{amplification} and _{deletion} parameters in the SAIC algorithm, there is no general guideline about how to select their values

A similar situation arises in the selection of _{ρ} for defining CNA units. Lower values of _{ρ} often produce longer CNA units, while higher values of _{ρ} often produce shorter CNA units. It has been reported that the average successive probe correlation of the segmented data can be as high as 0.985

It is important to note that the general conclusion on the relative performance of our SAIC and peer methods, at least based on the extensive simulation studies, remains largely true. We have used the same parameter values in all methods so that a fair comparison on their relative performances can be assured. Based on our analysis of real datasets using current parameter settings, it appears that SAIC performs well when compared to peer methods. In addition, the results of extensive simulation studies, performed under a variety of probe correlation schemes, show that SAIC preserves well the expected type 1 error, even when the probes follow non-stationary correlation structures similar to those found in real data

SAIC can currently perform either genome-wide (excluding the X/Y chromosomes because of their distinct biological roles) or chromosome-based CNA unit permutations. In the application of SAIC to real cancer data sets, we performed genome-wide, autosome-based, and X/Y-chromosome-based permutations. The combined results from the different permutation schemes contain more SCAs that may involve novel cancer driver genes. Experimental results show that, by exploiting the novel concepts of CNA probe, CNA unit, and multiscale permutation, SAIC can accurately detect the boundaries of SCAs with different lengths, see Additional file

We have also performed simulation studies (data not shown) that indicate that detection power of SAIC can be further improved by correcting for normal tissue contamination using a recently developed BACOM method

Conclusions

We have presented a novel approach for accurately detecting significantly recurrent CNAs in cancer genomes that is statistically principled and, as illustrated by real examples, can be very effective at revealing SCAs within data. The concepts of CNA unit and iterative permutation are relatively simple to interpret, yet still convey considerable novel mathematical insights into data structure and bias correction.

It is worth noting that there are three novel features associated with SAIC. First, we define the CNA unit to capture the intrinsic correlation structure in copy number data. Second, we perform iterative SCA-exclusive permutation to produce an unbiased null distribution. Third, we apply SAIC to real cancer copy number data sets and detect most previously reported SCAs covering well-known cancer genes.

Two important pending issues with the present algorithm are the expected significant impact of intratumor heterogeneity and normal cell contamination

Appendix A

for iterations

and

Additional files

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

XY, GY and YW participated in the design of concepts and methods. XY and GY developed the permutation strategy and CNA simulation algorithm. XY implemented the C++ code. RRW implemented the R code of GISTIC. GY, XY and XH analyzed and evaluated the algorithm. XH and YW constructed and proved Theorem 1. YW, XY and GY drafted the manuscript. IMS and EPH interpreted the results on real cancer data. JZ, RC and EPH helped edit the manuscript. YW, RC and ZZ conceived of the study, participated in its design and coordination, and helped edit the paper. All authors read and approved the final manuscript.

Acknowledgements

This work was supported in part by the US National Institutes of Health under Grants CA160036, CA149147, NS029525, and GM085665; the Natural Science Basic Research Plan in Shaanxi Province of China (Program No. 2012JQ8027); the Fundamental Research Funds for the Central Universities (No. K50511030002); and the Natural Science Foundation of China under Grants 61070137, 91130006, and 60933009.