Department of Computer Science, Brown University, Providence, RI, USA

Department of Urology, University of California at San Francisco, San Francisco, CA, USA

Department of Pathology, Baylor College of Medicine, Houston, TX, USA

Vancouver Prostate Centre, Vancouver, BC, Canada

Center for Computational Molecular Biology, Brown University, Providence, RI, USA

Abstract

Background

Copy number variants (CNVs), including deletions, amplifications, and other rearrangements, are common in human and cancer genomes. Copy number data from array comparative genome hybridization (aCGH) and next-generation DNA sequencing is widely used to measure copy number variants. Comparison of copy number data from multiple individuals reveals recurrent variants. Typically, the interior of a recurrent CNV is examined for genes or other loci associated with a phenotype. However, in some cases, such as gene truncations and fusion genes, the target of the variant lies at the boundary of the variant.

Results

We introduce Neighborhood Breakpoint Conservation (NBC), an algorithm for identifying rearrangement breakpoints that are highly conserved at the same locus in multiple individuals. NBC detects recurrent breakpoints at varying levels of resolution, including breakpoints whose location is exactly conserved and breakpoints whose location varies within a gene. NBC also identifies pairs of recurrent breakpoints such as those that result from fusion genes. We apply NBC to aCGH data from 36 primary prostate tumors and identify 12 novel rearrangements, one of which is the well-known TMPRSS2-ERG fusion gene. We also apply NBC to 227 glioblastoma tumors and predict 93 novel rearrangements, which we further classify as gene truncations, germline structural variants, and fusion genes. A number of these variants involve the protein phosphatase PTPN12, suggesting that deregulation of PTPN12, via a variety of rearrangements, is common in glioblastoma.

Conclusions

We demonstrate that NBC is useful for detection of recurrent breakpoints resulting from copy number variants or other structural variants, and in particular identifies recurrent breakpoints that result in gene truncations or fusion genes. Software is available at

Background

Copy number variants (CNVs) are genomic rearrangements that result in a different number of copies of a segment of the genome, and include deletions, amplifications, and unbalanced translocations. CNVs are common in the human genome, and CNVs have been associated with several diseases

Array comparative genome hybridization (aCGH)

Some recurrent rearrangements do not target a gene within the aberrant interval, but rather target a gene or locus at the boundary of the interval. A striking example is the TMPRSS2-ERG fusion gene in prostate cancer

We introduce a novel algorithm called Neighborhood Breakpoint Conservation (NBC) to identify recurrent breakpoints in copy number data. NBC computes the probability that a breakpoint occurs between each pair of adjacent probes over

Methods

The Neighborhood Breakpoint Conservation (NBC) algorithm takes, as input, aCGH data from many individuals and identifies recurrent breakpoints and pairs of recurrent breakpoints in a subset of the individuals (Figure

**The Neighborhood Breakpoint Conservation (NBC) algorithm**. NBC consists of two steps: computing breakpoint probabilities and recurrent breakpoint detection. Copy number ratios (CNRs) derived from aCGH data from multiple individuals are segmented using a Bayesian change-point algorithm that computes the probability of a breakpoint between adjacent probes (in red). The breakpoint probabilities are then combined to detect recurrent breakpoints (black rectangles). We identify recurrent breakpoints that occur between adjacent probes as well as recurrent breakpoints that occur within a set of probes defined by a genomic interval. To detect CNVs, we identify pairs of recurrent breakpoints.

While many existing methods produce a single segmentation for aCGH data **X **given a segmentation

The second step of NBC is to combine breakpoint probabilities in each individual to determine breakpoints that appear in multiple individuals. Similar to

A Probability Model for Segmentation and Breakpoint Analysis

A probabilistic formulation of the segmentation problem assigns a probability to each possible segmentation of **X**. The probabilities of other events, such as a breakpoint occurring at a particular locus, are readily computed from this model. Probabilistic segmentation approaches have been previously applied to CNV detection

**The Appendix includes full derivations of the segmentation model, comparisons to other segmentation algorithms, and data acquisition and implementation details**.

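Our algorithm is based on a previously described Bayesian change-point model. We represent the aCGH data for an individual as **X** = (X_{1}, ..., X_{n}), where X_{i} is the log_{2} ratio of test:reference DNA at the i-th probe. A segmentation divides **X** into K segments with means θ_{1}, ..., θ_{K}. Each X_{i} is normally distributed with mean equal to the mean of its segment and variance σ^{2}. The variance σ^{2} is a hyperparameter whose value must be set. Below we describe how we estimate this value from the data. Let k_{max} denote the maximum number of segments in the test genome.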

We define the **A **= (_{1}, ..., _{
K+1}), where _{v }
_{
K+1 }= **A**). Thus,

The unknowns in our model are the breakpoint sequence **A**, the number of segments K, and the segment means Θ = (θ_{1}, ..., θ_{K}). We place a normal prior on each segment mean θ_{s}, with mean μ_{0} and variance σ_{0}^{2}, and uniform priors on K and on **A** such that all **A** with K ≤ k_{max} are equally likely. Note that these priors do not make any strong assumptions about the data: essentially, the a priori assumption is that with probability

From the priors **A**|_{0}, _{0}, the joint distribution **X**, **A**, Θ,

Hyperparameter Estimation

The segmentation and breakpoint analysis algorithm relies on setting values for the hyperparameters _{0 }(the baseline mean), ^{2 }(the variance in probe measurements). We describe how to estimate these from the copy number profile **X **= (_{1},...,_{n}
_{0 }to be the median of the _{i}
^{2}, we form sliding windows of 10 probes. Let _{0}. We set the measurement variance ^{2 }= 2

To test the sensitivity of our results to our particular estimates of the hyperparameters - in particular our estimates of ^{2 }and

**Simulation #1** We generated an artificial chromosome with 100 probes containing a 40 probe single-copy gain (log_{2} ratio of 1) placed in the center. We then introduced various amounts of Gaussian noise

**Simulation #2** We generated an artificial chromosome with 100 probes with Gaussian noise and an aberration with one of the following log_{2} ratios: 0.5, 1, 2, 3, 4, 5, and 6. For each log_{2} ratio, we generated 100 such chromosomes.
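As a rough illustration of how such artificial chromosomes could be generated, the following Python sketch builds a 100-probe profile with a centered 40-probe aberration and additive Gaussian noise. The function name `simulate_chromosome`, the `noise_sd` parameter, and the 0-based breakpoint convention are assumptions made for illustration only. For Simulation #2, the 40-probe length of the aberrant segment is also an assumption.

```python
import random


def simulate_chromosome(log2_ratio, noise_sd, n_probes=100, gain_len=40, seed=None):
    """Return (profile, true_breakpoints) for one artificial chromosome.

    A breakpoint index b means the break lies between probes b-1 and b (0-based).
    """
    rng = random.Random(seed)
    start = (n_probes - gain_len) // 2   # centre the aberrant segment
    end = start + gain_len               # exclusive end of the aberration
    profile = []
    for i in range(n_probes):
        mean = log2_ratio if start <= i < end else 0.0
        profile.append(rng.gauss(mean, noise_sd))
    return profile, [start, end]


# 100 replicates of a single-copy gain (log2 ratio of 1); the noise level
# used here is a hypothetical value, not one taken from the paper.
replicates = [simulate_chromosome(1.0, 0.25, seed=s) for s in range(100)]
```

Passing a different `log2_ratio` for each batch of 100 replicates would give datasets in the spirit of Simulation #2.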

A representative sample of the datasets for Simulation #1 and Simulation #2 are shown in Additional File

We ran NBC on datasets from the two simulations with different estimates for the variances ^{2}, detailed below. To assess the quality of the resulting breakpoint predictions, we consider probe locations with Pr(breakpoint) ≥ 0.5 to be a predicted breakpoint. We assume that a predicted breakpoint detects a true breakpoint if the predicted breakpoint location is ≤ 2 probes away from the true breakpoint location. We count the number of true positive predictions (0, 1, or 2). Additionally, we count the number of false positive predictions for each dataset. We average the true positives and false positives over the 100 artificial chromosomes.
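The evaluation rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `probs[b]` is taken to be the estimated probability of a breakpoint at index `b`, and a false positive is assumed to be any predicted breakpoint that is more than two probes from every true breakpoint. The function name `score_predictions` is hypothetical.

```python
def score_predictions(probs, true_breakpoints, threshold=0.5, tolerance=2):
    """Return (true_positives, false_positives) for one simulated chromosome."""
    predicted = [i for i, p in enumerate(probs) if p >= threshold]
    # A true breakpoint is detected if some prediction lies within `tolerance` probes.
    true_pos = sum(
        1 for t in true_breakpoints
        if any(abs(p - t) <= tolerance for p in predicted)
    )
    # Assumption: a prediction far from every true breakpoint is a false positive.
    false_pos = sum(
        1 for p in predicted
        if all(abs(p - t) > tolerance for t in true_breakpoints)
    )
    return true_pos, false_pos


def average_scores(all_probs, true_breakpoints):
    """Average true and false positives over a collection of chromosomes."""
    scores = [score_predictions(p, true_breakpoints) for p in all_probs]
    n = len(scores)
    return sum(s[0] for s in scores) / n, sum(s[1] for s in scores) / n
```

With two true breakpoints per chromosome, the true positive count for a single chromosome is 0, 1, or 2, as in the text.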

Simulation #1 has a fixed aberration log_{2 }ratio, so we set the segment variance ^{2}: ^{2 }= 2^{2 }= ^{2 }= 3^{2 }= ^{2 }= 2^{2 }= 3^{2}, with our estimate ^{2 }= 2

**Hyperparameter Sensitivity**. The number of true positive (TP) breakpoints (0,1, or 2) and the number of false positive (FP) breakpoints for Simulation #1 and Simulation #2 over various values of variance parameters ^{2 }(top row) and

Since Simulation #2 has fixed measurement error, we set the measurement variance ^{2 }= 2^{3 }(Figure

The simulations underscore that the ability to detect the breakpoints of a segment is related to both the copy number of the segment (governed by the segment variance ^{2}). For example, in Simulation #1 (where ^{2 }increases the average number of false positive breakpoints increases while the average number of true positives remains below one. To avoid such situations, we do not segment the data and immediately report 0 breakpoints when our estimates of _{0 }satisfy _{0}.

Computing Breakpoint Probabilities

We compute the probability of a breakpoint between pairs of adjacent probes by sampling breakpoint sequences **A** from the distribution P(**A**|**X**) and counting the proportion of samples that have a breakpoint between adjacent probes. Note that the probability of a breakpoint between adjacent probes can be analytically computed (see

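Let **X**_{[i:j]} = (X_{i}, ..., X_{j}) denote the probes from i to j, inclusive. Similarly, let **X**_{(i:j]} = (X_{i+1}, ..., X_{j}) and **X**_{[i:j)} = (X_{i}, ..., X_{j-1}). The probability of **X** is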

where

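Here, P(**X**|k) is the probability of **X** given that the test genome is divided into k segments. Computing P(**X**|k) directly requires summing over all possible breakpoint sequences **A** and is computationally infeasible. A dynamic program allows the efficient computation of this term.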

Dynamic program

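Let P(**X**_{[i:j]}|1) denote the probability of **X**_{[i:j]} given that it is generated from a single segment. The dynamic program computes P(**X**_{[1:j]}|k) for 1 ≤ k ≤ k_{max} and 1 ≤ j ≤ n.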

The final row of the dynamic programming table contains P(**X**|k) for 1 ≤ k ≤ k_{max}, which is used in Equation (1) to compute P(**X**).

Recursive sampling

We use P(**X**|k), the base cases P(**X**_{[i:j]}|1), and the intermediate terms P(**X**_{[1:j]}|k) to sample breakpoint sequences **A** using a backward sampling technique

1. Draw k from P(k|**X**), determined by inverting P(**X**|k)

2. Set A_{k+1} =

3. Draw A_{k}, A_{k-1}, ..., A_{1} recursively using the conditional distributions computed by the recurrences in Equation (4). Given A_{q}, A_{q-1} is obtained as follows.

From a set of breakpoint sequences sampled in proportion to P(**A**|**X**), we determine the probability of a breakpoint occurring between two adjacent probes by counting the proportion of samples that contain a breakpoint at that locus. Other probabilities derived from these sampled breakpoint sequences are described in subsequent sections.
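The forward dynamic program, the backward sampling pass, and the counting step can be illustrated with the following Python sketch. It is not the authors' implementation. It assumes a normal likelihood with variance `sigma2`, a normal prior on each segment mean (mean `mu0`, variance `sigma0_2`), and uniform priors over the number of segments and over segmentations. The recurrence shown is a generic change-point recursion that may differ in detail from Equations (1) to (4). All function and variable names are hypothetical.

```python
import math
import random


def sample_breakpoints(x, mu0, sigma2, sigma0_2, k_max, n_samples, seed=0):
    """Sample breakpoint sequences; index b means a break between probes b-1 and b."""
    rng = random.Random(seed)
    n = len(x)
    # Prefix sums of (x - mu0) and (x - mu0)^2 give O(1) segment likelihoods.
    s1, s2 = [0.0], [0.0]
    for value in x:
        y = value - mu0
        s1.append(s1[-1] + y)
        s2.append(s2[-1] + y * y)

    def seg(i, j):
        """log P(x[i:j] | one segment), with the segment mean integrated out."""
        m, sy, syy = j - i, s1[j] - s1[i], s2[j] - s2[i]
        quad = syy - sigma0_2 * sy * sy / (sigma2 + m * sigma0_2)
        return (-0.5 * m * math.log(2 * math.pi * sigma2)
                - 0.5 * math.log(1 + m * sigma0_2 / sigma2)
                - 0.5 * quad / sigma2)

    def logsumexp(values):
        top = max(values)
        return top + math.log(sum(math.exp(v - top) for v in values))

    def draw(log_weights):
        """Draw an index with probability proportional to exp(log_weights)."""
        top = max(log_weights)
        weights = [math.exp(w - top) for w in log_weights]
        return rng.choices(range(len(weights)), weights=weights)[0]

    # Forward pass: table[k][j] is the log of the likelihood of x[0:j] summed
    # over all ways of cutting it into k segments.
    table = [[float("-inf")] * (n + 1) for _ in range(k_max + 1)]
    for j in range(1, n + 1):
        table[1][j] = seg(0, j)
    for k in range(2, k_max + 1):
        for j in range(k, n + 1):
            table[k][j] = logsumexp(
                [table[k - 1][v] + seg(v, j) for v in range(k - 1, j)])

    # Uniform prior over k, and over the C(n-1, k-1) segmentations given k.
    log_post_k = [table[k][n] - math.log(math.comb(n - 1, k - 1))
                  for k in range(1, k_max + 1)]

    samples = []
    for _ in range(n_samples):
        k = 1 + draw(log_post_k)
        end, breaks = n, []
        for q in range(k, 1, -1):  # backward pass: place the last break first
            candidates = list(range(q - 1, end))
            v = candidates[draw([table[q - 1][c] + seg(c, end)
                                 for c in candidates])]
            breaks.append(v)
            end = v
        samples.append(sorted(breaks))
    return samples


def breakpoint_probabilities(samples, n):
    """Fraction of samples with a breakpoint at each index 0..n."""
    counts = [0] * (n + 1)
    for breaks in samples:
        for b in breaks:
            counts[b] += 1
    return [c / len(samples) for c in counts]
```

The sketch works in log space to avoid underflow on long profiles. It requires `k_max` to be at most the number of probes.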

Runtime analysis

The base cases _{[i:j]}|1) require ^{2}) computations and the dynamic program requires _{max}) computations; thus computing **X**|_{max})) time. All computations necessary to sample a breakpoint sequence **A **are already computed in the dynamic program, so sampling is linear in the number of breakpoints **X**|

Identifying Recurrent Breakpoints

After sampling breakpoint sequences for a set of individuals, we identify recurrent breakpoints that appear in many individuals at the same genomic locus. Let _{j }
_{1}, ..., _{n}
_{j }
_{
j'
}. We analyze recurrent breakpoints at two levels of resolution.

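• Recurrent probe breakpoints: breakpoints that occur between the same pair of adjacent probes in multiple individuals.

• Recurrent interval/gene breakpoints: breakpoints that occur within the same genomic interval, such as a gene, in multiple individuals.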

In addition to analyzing these types of recurrent breakpoints, we also consider pairs of recurrent breakpoints to identify recurrent CNVs. Note that these pairs may indicate

Recurrent probe breakpoints

For each probe, we define a score that measures the presence of a breakpoint in a subset of individuals. We design this score to account for the observation that the number of breakpoints in copy-number profiles, particularly in a set of cancer samples, is highly variable. That is, in a set of cancer samples, even from the same cancer type, there will typically be highly rearranged cancer genomes with many breakpoints, and less rearranged genomes with relatively few breakpoints. This variability in the number of breakpoints is maintained following our Bayesian segmentation approach - despite the fact that we use the same flat prior for each individual - because there is strong evidence to support a larger number of breakpoints in some samples. Since there is a greater chance of recurrent breakpoints occurring randomly in a collection of highly rearranged genomes than a collection of less rearranged genomes, it is advantageous to consider the number of breakpoints in each profile when scoring recurrent breakpoints. Because the variability of number of breakpoints across different individuals is typically not well matched by a standard distribution, one approach is to use a permutation test that preserves the number and probability of breakpoints in each profile while permuting their location. We instead derive a score for recurrent probe breakpoints based on a binomial order statistic

Let _{i }
_{i}
_{j}
_{j}
**A **that have a breakpoint between _{
j
}(_{j }

Let _{j}

where we are only interested in scoring those breakpoints that are present in at least _{min }patients. Note that because the binomial order statistic is computed from the empirical distribution _{j }

Finally, we assume that a recurrent breakpoint is also conserved in the direction of the copy number change: in all samples with a recurrent breakpoint, the breakpoint goes either from relatively low copy number to high copy number or vice versa. A breakpoint sequence **A** defines a segmentation, and we use the mean values of each segment to determine the direction of copy number change. The copy number change is positive if the mean of the segment to the right of the breakpoint is higher than the mean of the segment to the left. We test both cases for each recurrent breakpoint, doubling the number of hypotheses we test. We control the False Discovery Rate (FDR) using the method of Benjamini and Hochberg.
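For readers unfamiliar with the Benjamini and Hochberg step-up procedure, a minimal Python sketch follows. It takes a list of p-values (one per hypothesis, so two per candidate breakpoint when both directions are tested) and returns the indices that are significant at the chosen FDR. The function name and the default level of 0.01 are illustrative assumptions.

```python
def benjamini_hochberg(p_values, fdr=0.01):
    """Return the indices of hypotheses rejected at the given FDR level."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    # Find the largest rank r such that the r-th smallest p-value <= r * fdr / m.
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * fdr / m:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    return sorted(order[:cutoff])
```

For example, `benjamini_hochberg([0.001, 0.5, 0.004, 0.02, 0.2], fdr=0.05)` rejects the hypotheses at indices 0, 2, and 3.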

Recurrent interval/gene breakpoints

We extend our approach to find recurrent breakpoints that lie within a genomic interval _{
j
}(_{j }

The conditional probabilities _{j}
_{j}
_{j}
_{j}
**A **and counting the number of samples that contain one or more breakpoints in the interval _{j}
_{j}

Here, the last term in Equation (9) counts the number of ways to choose _{
j
}(W) for all

Finally, using the _{j}
_{j }

For the experiments below, we define the copy number change for an interval
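The per-individual probability that one or more breakpoints fall within an interval can be estimated from sampled breakpoint sequences by simple counting, as in this Python sketch. It assumes each sample is a list of breakpoint indices, where index `b` denotes a break between probes `b-1` and `b`, and that an interval is given by the 0-based indices of its first and last probe. These conventions and the function name are assumptions for illustration.

```python
def interval_breakpoint_probability(samples, first, last):
    """Fraction of samples with at least one breakpoint inside probes first..last."""
    hits = 0
    for breaks in samples:
        # A break at b separates probes b-1 and b; both must lie in the interval.
        if any(first < b <= last for b in breaks):
            hits += 1
    return hits / len(samples)
```

Restricting the interval to two adjacent probes recovers the per-probe breakpoint probability.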

Pairs of recurrent interval/gene breakpoints

We identify pairs of non-overlapping recurrent interval breakpoints using a log-odds score similar to Equation (8) that scores two breakpoints occurring in intervals _{1 }and _{2}. An important case we will consider is when _{1 }and _{2 }are genes. Let _{1 }be the event that a breakpoint lies between any pair of adjacent probes within _{1}, and Let _{2 }be the event that a breakpoint lies between any pair of adjacent probes within _{2}. We define the score for intervals _{1 }and _{2 }for a particular patient _{j}

Each term is computed similarly to Equation (8). If _{1 }and _{2 }are on different chromosomes, the events _{1}) and _{2}) are independent and Equations (9) and (10) are used to compute the scaling factor

Where

The denominator in the scaling factor is then

The _{1}, _{2}) is computed by normalizing as in Equation (11) according to the empirical distribution of log-odds scores over all pairs of non-overlapping intervals and then using the binomial order statistic to determine the final _{1 }and _{2 }by considering the four combinations of direction of copy number change: {(+, +), (+, -), (-, +), (-, -)}. Note that restricting _{1 }and _{2 }to each contain a single probe identifies pairs of recurrent probe breakpoints.

Predicting Structural Variants, Gene Truncations, and Fusion Genes

Our statistics for single recurrent breakpoints and for pairs of recurrent breakpoints provide a flexible framework to predict particular rearrangement configurations. In this paper, we classify predictions into structural variants, gene truncations, and fusion genes.

Structural variants

Pairs of recurrent probe breakpoints may indicate germline or somatic rearrangements that have recurrent breakpoints at the highest resolution allowed by the spacing of probes. To identify these rearrangements, we compute the pairs of recurrent probe breakpoint statistic for every pair of probes within each chromosomal arm. Note that this limits the structural variant predictions to intrachromosomal rearrangements only.

Gene truncations

Recurrent breakpoints found within a single gene may indicate a gene truncation, resulting in the loss of functionality for a particular gene. To predict gene truncations, we compute the recurrent interval breakpoint detection statistic, using the set of gene regions from RefSeq as our intervals of interest.

Fusion genes

Pairs of recurrent interval breakpoints found within genes suggest potential fusion genes. We compute pairs of recurrent interval breakpoints using all pairs of gene regions from RefSeq as our intervals of interest. Note that not all pairs of recurrent genes suggest functional fusion genes. For example, a rearrangement that joins the 3' end of one gene to the 3' end of another gene is typically not a functional fusion gene. Thus, we restrict our attention to pairs of interval breakpoints with particular configurations (Figure

**Fusion Gene Configurations**. Fusion genes are pairs of recurrent genes that have the following configuration. (A) Each gene _{1 }and _{2 }has an associated orientation, _{1}) and _{2}). Additionally, each recurrent breakpoint has an associated change in relative copy number, _{1}) and _{2}). (B) A fusion gene joins the ends of _{1 }and _{2 }such that the 5' end of one gene is joined to the 3' end of the other gene.

Specifically, consider a pair of recurrent intervals _{1 }and _{2 }that represent gene regions. Each gene has an orientation, _{1}) ∈ {+, -} and _{2}) ∈ {+, -}. Additionally, the breakpoint that lies within each recurrent interval has an associated direction of copy number change, _{1}) ∈ {+, -} and _{2}) ∈ {+, -}. We assume that a fusion gene contains the 5' end of one gene joined to the 3' end of the other gene and thus satisfies the following rule.

Filtering and Ranking Predictions

We apply a number of additional steps to remove and prioritize predictions. In the case of fusion genes, if there are many predictions remaining we rank these predictions by the preservation of copy number across the fusion point.

Removing single probe aberrations

Single probe aberrations are segments consisting of a single probe. Since these are difficult to distinguish from experimental artifacts, we remove them from further consideration. Single probe aberrations are characterized by two large changes in copy number in adjacent probes, where the segments adjacent to this aberration have a similar copy number. We identify these probes and remove them from the analysis.
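One simple way to flag such probes is sketched below in Python. The thresholds `jump` and `similar` are hypothetical values, and the two neighboring probes are used as stand-ins for the copy number of the adjacent segments. The source does not specify the exact criterion, so this is an assumption-laden illustration rather than the procedure used in the paper.

```python
def single_probe_aberrations(profile, jump=1.0, similar=0.3):
    """Return indices of probes that look like isolated single-probe spikes."""
    flagged = []
    for i in range(1, len(profile) - 1):
        left, mid, right = profile[i - 1], profile[i], profile[i + 1]
        # The probe must sit above (or below) both of its neighbours.
        same_direction = (mid - left) * (mid - right) > 0
        large_changes = abs(mid - left) >= jump and abs(mid - right) >= jump
        neighbours_agree = abs(left - right) <= similar
        if same_direction and large_changes and neighbours_agree:
            flagged.append(i)
    return flagged
```

Flagged probes could then be dropped from the profile before recurrent breakpoints are scored.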

Removing known CNVs

We remove predictions that are near known CNVs. We say that a single probe is "near" a known CNV in the Database of Genomic Variants (DGV)

Ranking predictions

Since fusion genes (and other recurrent pairs of breakpoints) are physically joined in the test genome, we expect the copy number of either side of the breakpoint to be the same. Thus, we rank these predictions by calculating the root mean squared difference (RMS) between the copy number levels of probes surrounding the breakpoint. Consider fusion gene predictions: we know the configuration of the gene partners, but we do not know exactly where the breakpoint lies. Thus, we determine the copy number on each side of the fusion as the average of the three flanking probes of the left gene partner and the three flanking probes of the right gene partner. If
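The ranking statistic can be illustrated with a short Python sketch. It assumes that, for each patient carrying the predicted fusion, the three flanking log_{2} ratios of each gene partner have already been extracted. The RMS is then taken over patients. The exact formula in the paper is not reproduced here, and the input layout and function name are assumptions.

```python
import math


def rms_difference(flank_pairs):
    """RMS of the copy number difference across the fusion point.

    flank_pairs: one (left_flank, right_flank) pair per patient, where each
    flank is a list of the three flanking log2 ratios of that gene partner.
    """
    squared = []
    for left, right in flank_pairs:
        diff = sum(left) / len(left) - sum(right) / len(right)
        squared.append(diff * diff)
    return math.sqrt(sum(squared) / len(squared))
```

Predictions with a smaller RMS preserve copy number better across the fusion point and would be ranked first.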

Results

We applied NBC to two aCGH datasets: a collection of 36 primary prostate tumors, and 227 glioblastoma (GBM) tumors. For each dataset, we computed recurrent probe breakpoints, recurrent gene breakpoints, pairs of recurrent probe breakpoints, and pairs of recurrent gene breakpoints.

Prostate Dataset

We applied NBC to Agilent aCGH data from a collection of 36 primary prostate tumors. Each sample contained copy number ratios for 235,719 aCGH probes that were mapped to the hg17 human reference genome. We examined recurrent gene breakpoints using the gene regions from 16,162 hg17 RefSeq genes. Table **A**. We predict one novel gene truncation, which occurs in the Complement factor H (CFH) gene (Figure ^{-33}, Figure ^{-10 }(Figure

Predicted Recurrent Breakpoints in 36 Prostate Samples.

| **Breakpoint Type** | **Rearrangement Type(s)** | **# Predicted** | **# in DGV** | **# Novel** |
| --- | --- | --- | --- | --- |
| Recurrent Probes | Highly Conserved Breakpoints | 80 | 66 | 14 |
| Recurrent Genes | Gene Truncations | 6 | 5 | 1 |
| Pairs of Recurrent Probes | Germline or Somatic Structural Variants | 38 | 28 | 10 |
| Pairs of Recurrent Genes With Fusion Gene Config.* | Intrachromosomal Fusion Genes | 2 | 1 | 1 |
| Pairs of Recurrent Genes With Fusion Gene Config.* | Interchromosomal Fusion Genes | 2 | 2 | 0 |

Breakpoint types are described by the indicated rearrangement type. '# Predicted' is the number of predictions that are significant with FDR < 0.01. '# in DGV' counts the breakpoints near known structural variants in the Database of Genomic Variants (DGV). '# Novel' is the number of predictions that are not near any known variant in DGV.

* Novel pairs of recurrent gene breakpoints consistent with the fusion gene configuration.

**Tables of all the breakpoints and pairs of breakpoints predicted for the prostate dataset and the GBM dataset**. Note that the values reported for the prostate dataset (e.g. the RMS difference) are log base 10, while the values reported for the GBM dataset are log base 2.

**A Predicted Gene Truncation in Prostate Cancer**. The Complement Factor H (CFH) gene on Chromosome 1 contains a recurrent gene breakpoint, suggesting the truncation of the 3' region in 9 individuals.

**A Predicted Rearrangement Highly Conserved at the Probe Level in Prostate Cancer**. This amplified region on Chromosome 8 lies in the DEFB locus, and the recurrent breakpoints are conserved at the probe level in 17 individuals. Arrows indicating DEFB genes are approximate.

**The TMPRSS2-ERG Fusion Gene in Prostate Cancer**. We identify the TMPRSS2-ERG fusion gene in 5 prostate cancer patients. The mean segmentations for each patient (shown in blue) are computed by finding the segment parameters for each breakpoint sequence **A** drawn from the posterior distribution P(**A**|**X**) and then averaging these values across all segmentations.

Comparison to Segmentation Approaches

To demonstrate the importance of breakpoint uncertainty in computing recurrent breakpoints, we compared our fusion gene predictions to those obtained using a single segmentation for each individual. We segmented copy number profiles from each individual using Circular Binary Segmentation (CBS)

Glioblastoma Dataset

We next applied our method to Agilent 244 K aCGH data of glioblastoma (GBM) tumors from The Cancer Genome Atlas

Predicted Recurrent Breakpoints in 227 GBM Samples and 107 Blood Samples.

| **Breakpoint Type** | **Rearrangement Type(s)** | **# Predicted** | **# in DGV** | **# in Blood** | **# Novel** |
| --- | --- | --- | --- | --- | --- |
| Recurrent Probes in Tumor | Highly Conserved Breakpoints | 538 | 343 | 13 | 189 |
| Recurrent Genes in Tumor | Gene Truncations | 92 | 69 | 23 | 23 |
| Pairs of Recurrent Probes in Blood* | Germline Structural Variants | 88 | 53 | N/A | 35 |
| Pairs of Recurrent Genes in Tumor w/Fusion Gene Config.** | Intrachromosomal Fusion Genes | 75 | 45 | 5 | 7 |
| Pairs of Recurrent Genes in Tumor w/Fusion Gene Config.** | Interchromosomal Fusion Genes | 396 | 316 | 53 | 26 |

Columns are described in Table 1, except for '# in Blood' which indicates the number of predictions that also appear in the blood samples and are ignored as somatic predictions.

* FDR is increased to < 0.1 for blood samples.

** Novel pairs of recurrent gene breakpoints consistent with the fusion gene configuration.

We predict 23 gene truncations from the tumor samples, three of which are shown in Figure _{2 }copy number ratios at the 3' end, many fusion gene predictions consist of a deletion of the 3' end (i.e. Figure

**Predicted Gene Truncations in GBM**. These three recurrent gene breakpoints found on Chromosome 7, Chromosome X, and Chromosome 6 respectively suggest truncations of genes associated with glioblastoma or other neuronal diseases. (A) The recurrent breakpoint in ECOP has a large change in copy number; this gene is near EGFR and is the breakpoint location for the EGFR amplification. (B) PCDH11X appears to arise from a short deletion within a relatively amplified region, though the deletion breakpoint varies within the PCDH11X gene region. (C) RUNX2 contains two probe locations with recurrent probe breakpoints that each have small copy number change at approximately 45.42 Mb and 45.58 Mb.

**Predicted Intrachromosomal Fusion Genes in GBM**. (A) The INTS2-MED13 rearrangement on Chromosome 17 is identified in 9 individuals and arises from an amplification. A tandem duplication that affects the 3' end of MED13 and the 5' end of INTS2 will fuse the promoter region of INTS2 to MED13. (B) The PPP1R9A-PSMC2 rearrangement on Chromosome 7 is identified in 6 individuals and arises from a deletion.

Predicted Rearrangements involving PTPN12 in GBM.

**Recurrent Gene PTPN12**

| **Gene** | **Genomic Location** | **# Patients** |
| --- | --- | --- |
| PTPN12 | chr7:77004708-77106533 | 16 |

Intrachromosomal Fusion Gene Predictions

| 5' End Gene | Genomic Location | 3' End Gene | Genomic Location | # Patients | RMS |
| --- | --- | --- | --- | --- | --- |
| PTPN12 | chr7:77005287-77106533 | RSBN1L | chr7:77163678-77246421 | 8 | 0.1081 |
| PTPN12 | chr7:77004708-77106533 | LUC7L2 | chr7:138695173-138757626 | 8 | 0.2605 |

Interchromosomal Fusion Gene Predictions

| 5' End Gene | Genomic Location | 3' End Gene | Genomic Location | # Patients | RMS |
| --- | --- | --- | --- | --- | --- |
| TMEM30A | chr6:76019357-76051074 | PTPN12 | chr7:77005287-77106533 | 6 | 0.1306 |
| RNF150 | chr4:142006174-142273412 | PTPN12 | chr7:77005287-77106533 | 5 | 0.1409 |
| PTPN12 | chr7:77005287-77106533 | MED13 | chr17:57374747-57497348 | 9 | 0.1906 |
| CLK1 | chr2:201425977-201434830 | PTPN12 | chr7:77005287-77106533 | 8 | 0.3168 |
| ZRANB2 | chr1:71301561-71319266 | PTPN12 | chr7:77005287-77106533 | 9 | 0.3250 |
| PTPN12 | chr7:77005287-77106533 | UBR1 | chr15:41022389-41185512 | 9 | 0.3475 |
| PTPN12 | chr7:77005287-77106533 | LINGO1 | chr15:75692423-75711712 | 8 | 0.3787 |
| PPIL3 | chr2:201443923-201460583 | PTPN12 | chr7:77004708-77106533 | 6 | 0.4741 |

The phosphatase PTPN12 appears in 10 predicted fusion genes, and is also a predicted gene truncation for 16 patients. The predictions are ranked according to the root mean squared difference (RMS) of the copy number on either side of the fusion point.

**Predicted Fusion Genes with PTPN12 as a Gene Partner**. (A) The predicted intrachromosomal fusion gene PTPN12/RSBN1L is one of two predicted intrachromosomal fusion genes. This fusion gene arises from a deletion within an amplified region, and is only present in 8 individuals out of 16 that have some rearrangement with PTPN12. (B) The predicted interchromosomal fusion gene TMEM30A-PTPN12 is one of 8 predicted interchromosomal fusion genes. While the breakpoint in TMEM30A appears to arise due to a short amplification, a translocation occurring after an amplification (where all of TMEM30A is amplified) may also explain this fusion gene signature.

Discussion

NBC successfully identifies known fusion genes and structural variants. For fusion genes, NBC's consideration of uncertainty and variability in the locations of breakpoints provides an advantage over methods that compare individual segmentations of copy number profiles. This advantage is mitigated for variants with highly conserved breakpoints such as germline structural variants that are common in a population. However, it is possible that NBC would be helpful for complex, or overlapping, structural variants, where recurrent breakpoints might be a stronger signal than recurrent aberrant intervals.

NBC relies on a Bayesian change point algorithm, which requires specifying both prior distributions and a few hyperparameters. The weak priors that we use do not make strong assumptions about the data. However, hyperparameter estimation for Bayesian change point algorithms remains a difficult problem, and is sensitive to the particular type of data to be segmented. While our method chooses the hyperparameters systematically from the data rather than requiring user-defined input, poor parameter estimation leads to excessive breakpoint calling if there are no breakpoints to find or if the experimental error cannot be modeled by a constant variance σ^{2}. We presented one approach to estimate hyperparameters from aCGH data, but more sophisticated methods (e.g. empirical Bayesian approaches) could be used

In this paper, we focused on applications of NBC to aCGH data. But NBC is equally applicable to copy number profiles generated by mapping DNA sequence reads to a reference genome

Conclusions

We have introduced Neighborhood Breakpoint Conservation (NBC), an algorithm that identifies recurrent breakpoints in data from multiple individuals. NBC correctly identifies a known fusion gene (TMPRSS2-ERG) in aCGH data from 36 prostate tumors and predicts gene truncations, structural variants, and fusion genes in aCGH data from glioblastoma. We expect that application of our method to additional samples will allow us to uncover and categorize other recurrent germline and somatic rearrangements.

Authors' contributions

PLP, MMI, and CC provided aCGH data from prostate cancer samples. AR implemented the algorithm and performed experiments. BJR conceived of the project and supervised the work. AR and BJR wrote the manuscript. All authors read and approved the manuscript.

Acknowledgements

We thank Chip Lawrence, Bill Thompson, and Eric Ruggieri for technical discussions, and Brendan Hickey and Hsin-Ta Wu for their contributions to preliminary analysis of fusion genes. We also thank the anonymous reviewers of an earlier version of the manuscript for helpful suggestions. AR is supported by a National Science Foundation Graduate Research Fellowship. BJR is supported by a Career Award at the Scientific Interface from the Burroughs Wellcome Fund, DOD/CDMRP Breast Cancer Synergy Award W81XWH-07-1-0710, and the Susan G. Komen Breast Cancer Foundation. This work was made possible in part with funding from the ADVANCE Program at Brown University, under NSF Grant No. 0548311. Prostate data sample collection was funded by the National Cancer Institute to the Baylor Prostate Cancer SPORE (P50CA058204)