Abstract
Background
Copy number variants (CNVs) have been demonstrated to occur at a high frequency and are now widely believed to make a significant contribution to the phenotypic variation in human populations. Arraybased comparative genomic hybridization (arrayCGH) and newly developed readdepth approach through ultrahigh throughput genomic sequencing both provide rapid, robust, and comprehensive methods to identify CNVs on a wholegenome scale.
Results
We developed a Bayesian statistical analysis algorithm for the detection of CNVs from both types of genomic data. The algorithm can analyze such data obtained from PCRbased bacterial artificial chromosome arrays, highdensity oligonucleotide arrays, and more recently developed highthroughput DNA sequencing. Treating parameterse.g., the number of CNVs, the position of each CNV, and the data noise levelthat define the underlying data generating process as random variables, our approach derives the posterior distribution of the genomic CNV structure given the observed data. Sampling from the posterior distribution using a Markov chain Monte Carlo method, we get not only best estimates for these unknown parameters but also Bayesian credible intervals for the estimates. We illustrate the characteristics of our algorithm by applying it to both synthetic and experimental data sets in comparison to other segmentation algorithms.
Conclusions
In particular, the synthetic data comparison shows that our method is more sensitive than other approaches at low false positive rates. Furthermore, given its Bayesian origin, our method can also be seen as a technique to refine CNVs identified by fast pointestimate methods and also as a framework to integrate arrayCGH and sequencing data with other CNVrelated biological knowledge, all through informative priors.
Background
Stable but not static, the DNA of human genome is subject to a variety of heritable changes of different types, which significantly contribute to the phenotypic differences of individuals in human populations. In addition to the single nucleotide polymorphisms (SNPs), these genetic changes also include the chromosomal structural variations, such as insertions, deletions, duplications, inversions, and translocations, on various genomic scales. Recent studies showed that insertions, deletions, and duplications of DNA segments of 1 kb or longer in the genome collectively referred to as the copy number variants (CNVs)occur at a much higher frequency than previously expected [14]. A recent global study of CNVs in the human genome showed that the regions of CNVs covered more nucleotide content per genome than SNPs [1]. It is now widely believed that CNVs are as important as SNPs and other small genomic changes in their contribution to the phenotypic variation in human populations.
Currently, unbalanced structural variants can be experimentally identified by methods based on microarray technology, polymerase chain reaction, or DNA sequence comparison. Arraybased method is a natural, highthroughput extension of the comparative genomic hybridization (CGH) analysis, which was originally developed as a method to reveal any regions of allele loss or aneuploidy by fluorescence microscopy [5]. Highdensity oligonucleotide microarrays, which offer high genomic resolution, have been used in several recent arrayCGH studies [4,6,7]. The last several years have seen rapid advancement in the field of sequencing technology. Novel methods [810] are being developed to reduce the cost and increase the throughput by generating massive amounts of sequence that can be aligned to the genomic reference. This development has made it possible to resequence whole genomes from multiple individuals.
Indeed, a major sequencing project, the 1000 Genomes Project, has been launched to resequence the genomes of at least a thousand people from around the world using the new sequencing technologies to produce the most detailed map of human genetic variation for disease studies. As the technologies mature and their uses spread, new sequencingbased methods to detect structural variations have been developed to take advantage of the massively parallel sequencing. In the readdepth approach, after DNA fragments are sequenced from one or both ends, the reads are mapped to the genome and then counted in a nonoverlapping sliding window. Both methods provide a rapid, robust, and comprehensive approach to identify CNVs on the wholegenome scale.
Both arrayCGH and readdepth sequencing generate genomic copy number (GCN) data in a very similar format: they consist of genomic signal output indexed by the genomic locations. The signals are logratios of normalized intensities from the test sample to those from the reference sample for arrayCGH and sequence read counts after mean subtraction for readdepth sequencing, respectively. The goal of analyzing such data is to detect CNVs by identifying regions with signals that are consistently higher or lower than the normalized baseline. Implicitly, there are two distinct and yet closely related estimation problems: one is to estimate the number of CNVs, and the other is to determine the boundaries and the average signal strength of each of them. Many statistical and computational methods have been developed to identify CNVs in individual genomes. They include approaches built on hidden Markov model [1113] or in a Bayesian framework [1416]. Recently a method to identify recurrent CNVs within a group of individuals has also been proposed [17]. Based on their data analysis approaches, algorithms that have been developed to analyze such data can be roughly grouped into three types: some only smooth the raw logratio data and the regions with logratios higher or lower than a preset threshold are identified as CNVs [18,19], others estimate the number of CNVs and their boundaries directly using the original logratio data [2023], and the rest use a combined approach [2427]. The relative performance of these algorithms has been assessed [28].
Here we present a Bayesian statistical framework to analyze both arrayCGH and readdepth data. Treating parameters that define the underlying genomic copy number variation encoded in the data as random variables, our approach derives the posterior distribution of those parameters given observed data. This statistical method models the location of regional changes and their corresponding associated copy number, and estimates the overall noise level in the data at the same time. Sampling from the posterior distribution using Markov chain Monte Carlo (MCMC) simulation is able to give both the best estimate and a corresponding Bayesian credible interval for each parameter in the model. We discuss how our model was derived and implemented, and the empirical results from applying our method to both microarray and sequencing data for CNV discovery.
Statistical model
In the life sciences we are often faced with the task of making inferences about some object of interest given incomplete and noisy experimental data. In the case of CNV study, we are primarily concerned with inferring the number of the DNA copy number variations, their locations, sizes, and corresponding copy numbersassociated amplitude measurements, given the genomic copy number data, which are logratios of sample and control intensities measured on microarrays or read depths generated by shotgun genomic sequencing. To demonstrate the application of our method to the readdepth data, we take a set of sequence reads from the 1000 Genomes Project and construct a 'readdepth intensity signal' spectrum by first mapping the reads to the human genome reference sequence and then counting the number of reads in a sliding window, a procedure that transforms sequencing data into arraylike intensity signal. We capture these unknown quantities in a probability model that relates them to the observed data. Our model is Bayesian in essence as we assign prior distributions to parameters and use the posterior distribution function to estimate the underlying data generating process. Given the posterior distribution, we then use the Markov chain Monte Carlo method to fit the model. By doing so, we get not only the best estimates for these unknown parameters but also Bayesian credible intervals for the estimations at the same time.
Given a set of genomic copy number data D = {g_{k}, x_{k}}, k = 1, 2, ..., M, in which g_{k }is the sorted genomic location of the kth data point, x_{k }the signal at this location, and M the number of data points, we try to infer the genomic 'spectrum' ℱ, which is the unknown function defined by the CNVs encoded in the data set with the same measurement unit as x_{k}. Assuming that the measurements of CNVs are all step functions, the spectrum ℱ can be written as
where N is the number of 'smoothed' CNVs detectable in the data set, and s_{j}, w_{j}, a_{j }are the start genomic location, the width, and the amplitude of the jth CNV respectively. Thus the 'ideal' data set corresponding to D based on this model is F = {g_{k}, f_{k}}, k = 1, 2, ..., M. For simplicity, we assume that X_{1}, X_{2}, ..., X_{M }measured in D are independent random variables each subjected to additive Gaussian noise around ℱ with a standard deviation σ.
Given the aforementioned model (Figure 1), the set of parameters to be inferred from D is θ = {N, (s_{j}, w_{j}, a_{j}), σ^{2}}, j = 1, 2, ..., N. Sometime for convenience, instead of reporting the estimate of w_{j}, we report the estimate of e_{j}, the end of the jth CNV (e_{j }= s_{j }+ w_{j } 1). The conditional probability distribution function p(θD) summarizes our inference about θ given the data and our prior knowledge about the CGH spectrum ℱ.
Figure 1. The graphical representation of the proposed Bayesian model. All quantities are shown as nodes in this directed graph. The parameter triplets s_{j}, w_{j}, and a_{j }(j = 1, 2, ..., N) are combined together. The solid arrows indicate the modeled data set F is determined by a logical function with parameters s_{j}, w_{j}, and a_{j }(j = 1, 2, ..., N), and the dashed arrows signify the stochastic dependence of the real data set D on both F and the overall data noise level σ.
Bayes' theorem relates the posterior probability distribution function p(θD) to the likelihood probability distribution function p(Dθ) that can be calculated from the data and the prior probability distribution function p(θ) that encodes the prior knowledge,
where the normalization constant p(D) is omitted for simplicity.
 Likelihood
where
 Prior
▫ Uniform distributions for N, s_{j}, and w_{j }(j = 1, 2, ..., N):
• p(N) = 1/N_{max}
• p(s_{j}) = 1/(s_{max}s_{min}) = 1/M
• p(w_{j}) = 1/M
▫ Normal distribution for a_{j}: p(a_{j}) ~ (τ_{j}, κ_{j}^{2})
▫ Inverse gamma distribution for
Thus the prior probability distribution function is assigned as
After rearrangement and removal of the constant N_{max}, we have
where α, β, τ_{j }and κ_{j }are the hyperparameters that characterize the prior distribution. See the Implementation subsection below for their parameterization.
 Posterior
For a given model {ℳ: θ ∈ Θ}, where N is known and thus θ = {(s_{j}, w_{j}, a_{j}), σ^{2}}, j = 1, 2, ..., N, the posterior distribution of θ given the data D and the model ℳ can be expressed as
 Informative prior
▫ Uniform distributions for s_{1 }and w_{1}:
Keeping priors on other parameters the same as before, we have
With this informative prior, the posterior is the same as (7), but only nonezero for s_{1 }∈[a, b] and w_{1 }∈ [c, d ]. This condition simplifies subsequent parameter estimation, as s_{1 }and w_{1 }only need to be sampled in these two intervals during MCMC simulation.
In some case, we only know the start and the length of a particular CNV (similar to the case above) but still have to estimate N and the parameters of the other CNVs. This is a case that mixes the general and the special ones presented above. It is easy to show the informative prior is a mix of (7) and (8):
When analyzing clean readdepth data, if the amplitude of the jth CNV, a_{j}, occurs discretely at several different values (for example a_{j }∈ {c, c, 2c}, where c is the genomewide average haploid read depth), the prior distribution p(a_{j}) of a_{j }can be modeled naturally by a multinomial distribution.
Algorithm and implementation
Parameter estimation by Markov chain Monte Carlo simulation
Analytically summarizing the posterior distribution p(θD) is difficult. For example, even though in theory the posterior expectation of an arbitrary function of θ, g(θ), can be computed as
the calculation is usually impracticable for two reasons. Firstly, p(θD) is only known up to some multiplicative constant due to the proportionality form of equation (8). Secondly, even if the exact form of p(θD) is known, given the number of parameters in θ (at least four in a nontrivial case), the high dimensional integral required in equation (8) is very difficult to be carried out in practice and soon becomes intractable as the number of parameters increases. However, Markov chain Monte Carlo (MCMC) provides an alternative whereby the posterior can be directly sampled to obtain sample estimates of the quantities of interest. Thus using a random sequence of K draws θ^{(1)}, θ^{(2)}, ..., θ^{(K) }from p(θD), E(g(θ)D) can be approximated by simply taking the average of these draws. Similar methods can be used to compute the posterior standard deviation or quantiles, probabilities that parameters take particular values, and other quantities of interest.
The Gibbs sampling algorithm [29] was implemented to sample from the target distribution {p(θD,ℳ): θ ∈ Θ ⊆ ℝ^{3N+1}}. To do so, the Gibbs sampler first constructs an aperiodic and irreducible Markov chain whose stationary distribution is p(θD,ℳ) in the state space Θ, and then draws a sequence of random samples from conditional distributions to characterize the joint target distribution. More precisely, it was implemented by (i) taking some initial values θ^{(0)}; (ii) repeating for each t = 1, 2, ..., T, where T is the preset number of iterations, generating θ_{i}^{(t) }from p(θ_{i}^{(t)} θ_{1}^{(t)}, ..., θ_{i1}^{(t)}, θ_{i+1}^{(t1)}, ..., θ _{θ}^{(t1)},D,ℳ) for i = 1, 2, ..., θ; (iii) continuing the previous step for T times after the estimated target distribution converges.
To calculate the conditional probabilities of s_{j }and w_{j }required by the second step of the Gibbs sampling stated above, all possible s ∈ [g_{1}, g_{M}] and w ∈ [w_{min}, w_{max}] are evaluated. Given the normality assumption about the data, conjugate prior distributions of a_{j }and σ^{2 }can be used to simplify the calculation of their conditional probabilities. If the prior distribution of a_{j }takes the conjugate from p(a_{j}) ~ (τ_{j}, κ_{j}^{2}), the conditional distribution of a_{j }given other parameters, the data D, and the model ℳ is also a normal distribution as
, where is the average logratios of probe intensities in the jth CNV. Given the conjugate prior distribution of σ^{2}, p(σ^{2}) ~ ℐnvmma(α, β), the conditional distribution of σ^{2 }given other parameters, the data D, and the model ℳ is also an inverse gamma distribution, .
Model selection using Bayes factor
Model selection is required to determine N, the number of CNVs, as different N changes the model parameterization θ. Suppose that the data D have arisen under one of the two models, {ℳ_{1}: θ_{1 }∈ Θ_{1}} and {ℳ_{2}: θ_{2 }∈ Θ_{2}}, according to a probability density p(D ℳ_{1}) or p(D ℳ_{2}).
Given prior probabilities p(ℳ_{1}) and p(ℳ_{2}) = 1  p(ℳ_{1}), the data produce posterior probabilities p(ℳ_{1}D) and p(ℳ_{2}D) = 1 p(ℳ_{1}D). From Bayes' theorem, we obtain
so that
For a given model {ℳ: θ ∈ Θ}, p(D ℳ) can be approximated by the sample harmonic mean likelihoods,
based on K MCMC draws θ^{(1)}, θ^{(2)}, ..., θ^{(K) }from the posterior distribution p(θD). The harmonic mean estimator is consistent since as K → + ∞. It may, however, have infinite variance across simulations. To solve this problem, Newton and Raftery [30] proposed an alternative estimator,
which does not display any of the instability of . We implemented to calculate the Bayes factor for model comparison
Implementation
Our method, including the Markov chain Monte Carlo simulation and the model comparison, is currently implemented in R [31]. To use noninformative priors for our Bayesian inference, we set τ_{j }= 0 and κ_{j }= 100, which effectively makes the prior distribution of a_{j }flat around 0. We also assign 1 to both α and β for the inverse gamma distribution of σ^{2}. We tested various values of the hyperparameters (τ_{j}, κ_{j}, α, and β), and the simulation results showed that the parameter inference was insensitive to the values assigned to these hyperparameters, which is expected given the large number of data points. A 500iteration MCMC simulation of the posterior distribution (7) given a data set with M = 500 and N = 1 took 126 seconds of CPU time on a personal computer with one 1400Mhz x86 Intel processor and 500 MB RAM. To assess the convergence of the Markov chain after 500 iterations, we started multiple chains from different values of θ^{(0)}. The simulations showed that after initial dozens of iterations all chains converged to the same solution. Based on this observation, we concluded that 500 iterations were sufficient for Bayesian inference in this case. We used the same convergence diagnostic for all inferences.
Because of great computational intensity of the MCMC simulation, to process a large GCN data set, we use a 'divide and conquer' strategy. We first sort array/sequencing features from each chromosome according to their genomic locations and then group 1000 consecutive features into subsets for parallel processing on a computer cluster.
Results
Simulated arrayCGH data
We first used simulated arrayCGH data sets to test our Bayesian model and its implementation. To generate such synthetic data, we first specified values for the parameters in θ = {N, (s_{j}, w_{j}, a_{j}), σ^{2}}, j = 1, 2, ..., N, in which N and (s_{j}, w_{j}, a_{j}) define the artificial genomic CNV structure encoded as a step function and σ^{2 }determines the overall noise level in the data. The simulated data were then generated by superimposing this predefined step function with random Gaussian noise. Typical simulated array data with one and multiple CNVs are shown in Figures 2A and 3A respectively.
Figure 2. Parameter estimation by MCMC simulation for a simulated arrayCGH data set. (A) The logratio vs. probe genomic index plot of a simulated oneCNV arrayCGH data set. The data D (M = 500) were generated with θ = {N = 1, (s_{1 }= 200, w_{1 }= 50, a_{1 }= 1.5), σ^{2 }= 0.4^{2}}. (B) The logarithm of the posterior probability (calculated up to some multiplicative constant) at consecutive 500 MCMC sampling iterations. In the stationary phase, the posterior probability of the MCMCsampled parameter values given data D, , fluctuates closely beneath the maximum value p(θD). (CF) Histograms of the 500 estimates of s_{1}, w_{1}, a_{1}, and σ respectively. (GJ) Traces of the estimates of s_{1}, w_{1}, a_{1}, and σ through the consecutive 500 MCMC sampling iterations.
Figure 3. Model selection. (A) A simulated fourCNV arrayCGH data set. The data D (M = 1000) were generated with θ = {N = 4, (s_{1 }= 100, w_{1 }= 30, a_{1 }= 0.7), (s_{2 }= 200, w_{2 }= 20, a_{2 }= 0.3), (s_{3 }= 400, w_{3 }= 80, a_{3 }= 1.5), (s_{4 }= 600, w_{4 }= 90, a_{4 }= 0.6), σ^{2 }= 0.1^{2}}. The scatter plot of the multiCNV arrayCGH data are overlaid with the segmentation found by our algorithm using different models. The step function lines in red, green, blue, and orange represent models with N = 1, 2, 3, and 4, respectively. For figure clarity, the estimated model with N = 5 is not shown. (B) Model comparison. The comparison between each of the models with N = 1, 2, 3, 4 and the basal, null model (N = 0) was quantified by the logarithm of the Bayes factor.
The simulated array data (M = 500) plotted in Figure 2A were generated with θ = {N = 1, (s_{1 }= 200, w_{1 }= 50, a_{1 }= 1.5), σ^{2 }= 0.4^{2}}, which was to be estimated. Taking N = 1, we started the Markov chain at some random θ^{(0) }= {(s_{1 }= 100, w_{1 }= 0, a_{1 }= 0), σ^{2 }= 0.1^{2}} and ran it for 500 iterations. The sampling results are shown in Figure 2CF. As the parameter trace plots (Figure 2GJ) show, the Markov chain quickly converged to stationarity after approximately ten iterations (Figure 2B). To err on the side of caution, we discarded the samples from the first 100 iterations as the 'burnin' samples and estimated the parameter values from the rest 400 samples, which gave given N = 1.
Remarkably, all these samples have the very similar s_{1 }and w_{1}, which are 200 and 50 respectively. Because of this small variation in their estimation, the estimates of s_{1 }and w_{1 }from the data are of extremely high confidence. The distributions of a_{1 }and σ in the 400 samples are approximately normal as (1.57, 0.057^{2}) and (0.38, 0.012^{2}) respectively. Based on their normal distributions, we can easily calculate a Bayesian credible interval for both a_{1 }and σ. For example, a 95% Bayesian credible interval for a_{1 }is [1.46, 1.68], which suggests that, after observing the data, there is a 95% chance that the average logratio of intensities in this CNV falls between 1.46 and 1.68.
We also simulated arrayCGH data (M = 1000) with multiple CNVs (Figure 3A) using θ = {N = 4, (s_{1 }= 100, w_{1 }= 30, a_{1 }= 0.7), (s_{2 }= 200, w_{2 }= 20, a_{2 }= 0.3), (s_{3 }= 400, w_{3 }= 80, a_{3 }= 1.5), (s_{4 }= 600, w_{4 }= 90, a_{4 }= 0.6), σ^{2 }= 0.1^{2}}. To identify the CNVs encoded in this data set, first the modelspecific parameters {(s_{j}, w_{j}, a_{j}), σ^{2}}, j = 1, 2, ..., N were estimated under different models with N = 0, 1, ..., 5. In Figure 3A, the scatter plot of the multiCNV arrayCGH data are overlaid with the segmentation found by our algorithm using different models. The figure shows that the most prominent CNV was identified first when the number of CNVs, N, was set to 1 and less prominent CNVs were progressively identified as the model became more permissive (i.e., N was increased). To select the most plausible model from which the observed data were generated, each of the models with N = 1, 2, 3, 4, 5 was then compared with the basal, null model (N = 0). Quantification of these comparisons by the logarithm of the Bayes factor, which gives 691.02, 926.94, 1091.13, 1556.23, and 1173.67 respectively, clearly indicates that the model with N = 4 is the best model among the ones tested (Figure 3B). It is noteworthy that since the numbers aforementioned are the logarithms of the Bayes factor the actual increase in the marginal evidence p(DM) between neighboring models is very substantial. For example, the increase in the marginal evidence from N = 3 to N = 4 is e^{1556.231091.13 }= e^{465.11 }≈ 9.86 × 10^{201 }fold.
Lai et al. [28] examined the performance of 11 arrayCGH data analysis methods: CGHseg, quantreg, CLAC, GLAD, CBS, HMM, wavelet, lowess, ChARM, GA, and ACE. To assess the performance of our algorithm in conjunction with these methods, we used the same simulated data as Lai et al. used for the assessment in their study to calculate the true positive rates (TPR) and the false positive rates (FPR) as the threshold for determining a CNV is varied. See Lai at al. for the definitions of TPR and FPR and the details of the simulated data sets. We calculated the receiver operation characteristic (ROC) curve of our algorithm using the most noisy (thus the lowest signaltonoise ratio, SNR = 1) data set with the CNV width of 40 probes. This ROC curve, together with the ROC curves of other arrayCGH methods based on the same data set, was plotted in Figure 4. These curves show that our Bayesian algorithm is appreciably more sensitive than all other methods at low (< 10%) false positive rates. We need to point out that the comparison was conducted in a fair manner, if not to the disadvantage of our method: all the results from Lai et al. were used directly without modification and our method has no free parameters to tune.
Figure 4. The ROC curves of arrayCGH data analysis methods. (A) The complete plot. All curves were calculated using the same data set. They show that our Bayesian algorithm (black line) is appreciably more sensitive than all other methods (gray lines) at low (< 10%) false positive rates. (B) Details of the ROC curves in the low FPR region of (A) inside the box with dashed border. See Lai et al. [28] for the identities of the gray ROC curves. TPR, the true positive rate; FPR, the false positive rate.
Glioblastoma Multiforme arrayCGH data
Lai et al. [28] compared 11 different arrayCGH data analysis algorithms that are based on diverse statistical or computational techniques. In addition to testing those methods using simulated data, they also characterized their performance on chromosomal regions of interest in real data sets obtained from patients with Glioblastoma Multiforme (GBM) [32]. These cDNA microarraybased data sets were generated by CGHprofiling copy number alterations across 42,000 mapped human cDNA clones, in a series of 54 gliomas of varying histogenesis and tumor grade.
It was observed that the GBM data contain a mixture of larger CNV regions with low amplitude and smaller ones with high amplitude. These two types of arrayCGH data are nicely represented by data sets GBM31 and GBM29 respectively (Figure 5AB). In sample GBM31, a large region on chromosome 13 was lost, and the overall magnitude of this loss is very low due to the low penetrance of this genetic variation in tumor cells in this sample. In sample GBM29, on the other hand, there are three highamplitude small duplications. To evaluate our Bayesian approach in a comparable way, we also used these two GBM data sets processed and utilized by Lai et al. to test our method.
Figure 5. CNV profiles of three experimental arrayCGH data sets. Probe segmentation of (A) GBM31, (B) GBM29, and (C) βglobin small deletion arrayCGH data. In all cases, the MCMC sampler was run for 1000 iterations. The first 100 samples were discarded as the 'burnin' samples, and the mean and the standard deviation of each parameter were estimated from the rest 900 samples. The estimated means are plotted as a step function (the black line), and the estimated standard deviations are indicated as yellow boxes, each defined by mean ± 1.96 × standard deviation, which corresponds to a 95% credible interval.
Figures 5A and 5B show the arrayCGH profiles of chromosomes 13 and 22 in Glioblastoma Multiforme samples GBM31 and GBM29, respectively, overlaid with the segmentation found by our algorithm. As seen in Figure 5A, our algorithm detected the single broad proximal deletion of part of chromosome 13 in GBM31, spanning from the 59^{th }to the 542^{nd }probe with a logration intensity at 0.30 with corresponding standard deviations, , for calculating each Bayesian credible interval). The breakpoint at the probe genomic index 542 was also identified by all the programs that detected this deletion in the test conducted by Lai et al. The other breakpoint at 59 was again found by CLAC and ACE evaluated in the same test [28]. The small sample standard deviations in connote the reliability of the parameter estimation despite a rather low signal to noise ratio of the GBM31 data. Our algorithm also detected all three highamplitude amplifications of parts of chromosome 22 in GBM29 (Figure 5B). Even though there are only four probes separating the first two amplifications, our method still segmented them clearly. Moreover, our method also pinpointed the six breakpoints of these three CNVs (their sample standard deviations are all zeros), which makes these predictions highly reliable.
βglobin highdensity arrayCGH data
One recent significant development in the microarray technology is the emergence of the tiling array technology, which can be used to cover large genomic regions or even an entire genome on one or several microarrays in an unbiased fashion by using oligonucleotides (a.k.a. tiles) uniformly sampled from presented genomic sequences. The current trend is to migrate from PCRbased arrays to tiling arrays for a much higher resolution and a comprehensive genomic coverage.
In a recent study [7], in order to test the resolution limits of tiling arrays when they are used with CGH for CNV discovery, Urban et al. designed microarrays that tile through 100 kb of the βglobin locus with overlapping isothermal oligonucleotides spaced 9 bp apart alone the tiling path. They compared the test DNA from a patient with a known heterozygous deletion of 622 bp in the βglobin locus and the reference DNA pooled from seven individuals without this chromosomal aberration. Figure 5C shows the arrayCGH profile of the βglobin locus of the patient overlaid with the segmentation found by our algorithm. This deletion in the βglobin locus was detected, and the estimate of its length, , corresponding to 641 bp in the genomic coordinate system, is highly accurate in comparison with the actual length of the deletion (622 bp).
Readdepth genome resequencing data
The genome of a Utah resident with Northern and Western European ancestry from the CEPH collection (NA12878) has been sequenced by the 1000 Genomes Project using both the 454 pairedend and the Illumina shotgun sequencing technologies, which produced long (120bp) sequence reads with low coverage (0.5×) and short (50bp) ones with high coverage (40×), respectively.
After using these two sequence sets to generate the 'known' genomic deletions in and the readdepth data from this individual, we apply our method to the readdepth data and compare the finding with the 'known' genomic deletions. Despite a very low sequencing depth, we are able to use 454 reads to detect several large genomic deletions in this individual based on the gapped (i.e., 'split') alignment of some of these long reads. These deletions are taken as known, and we use a 2653bp deletion on chromosome 6 from 32,669,938 to 32,672,591 to illustrate the application of our readdepth method. After mapping approximately 2.4billion 50bp Illumina reads to the human reference genome, we count the number of reads in a 200bp nonoverlapping sliding window to produce the readdepth data. Figure 6A shows the read distribution profile based on the Illumina short reads surrounding the 2653bp deletion locus.
Figure 6. Identification of CNV in the readdepth data. (A) CNV profile of a deletion locus in a readdepth data set. Short reads generated for a CEPH individual were mapped to the human reference genome and then counted in a 100bp nonoverlapping sliding window to produce the readdepth data. The counts are centered to their mean. The MCMC sampler was set up in a similar fashion as in the previous cases. The estimated means are plotted as a step function, and the 95% credible intervals as yellow boxes. The 2653bp deletion, which is found by long sequence reads that encompass this deletion locus and thus split around it when aligned to the reference sequence, is taken as known and shown as the thin green line on a lower track. (B) The averaged standard deviation in the estimates of the start and the end positions (s and e) at different sequencing depths. The S.D. unit is the window size, which is 200 bp in this case.
Our method detected this deletion in the readdepth data and estimated its parameters to be . To investigate how the sequencing depth affects the estimation of the start and the end positions of a CNV, we simulate a series of sequencing depths by randomly sampling (without replacement) different numbers of mapped Illumina reads and then apply our method to the simulated data. The standard deviation in the estimates of the start and the end positions, s and e, reflects how well these two parameters can be estimated from the readdepth data. In figure 6B we plot the averaged standard deviation in the estimates of the s and e at different sequencing depths. It is clear as the sequencing depth decreases from the original depth (37×) the estimates of the terminal positions become less accurate. In fact, when the coverage is below 1×, it becomes very difficult to find the deletion at all.
Discussion and Conclusion
The MetropolisHastings and the Gibbs sampling algorithms, two Markov chain Monte Carlo simulation methods, have been widely used for Bayesian inference. Developed by Metropolis et al. [33] and subsequently generalized by Hastings [34], the MetropolisHastings algorithm generates, based on the current state and a preselected proposal density, candidate states that are accepted (or rejected) stochastically with a certain acceptance probability but then retains the current value when rejection takes place. Gibbs sampling [29] draws a sequence of random samples from conditional distributions of unknown parameters to characterize their joint target distribution. In fact, the Gibbs sampling can be regarded as a special case of the MetropolisHastings algorithm as the acceptance probability is always onei.e., every proposal is automatically accepted.
For our Bayesian analysis of genomic copy number data, we implemented both the random walk MetropolisHastings (RWMH) and the Gibbs sampling algorithms and observed that in this application Gibbs sampling is much more suitable for parameter inference. RWMH worked well for oneCNV data. However, if the data contain two widely separated CNVs, it can only identify one of them but not both. To investigate this limitation, we plotted the landscape of the posterior probability distribution in a twodimensional parameter space. A twoCNV data set D (M = 700) with θ = {N = 2, (s_{1 }= 100, w_{1 }= 20, a_{1 }= 2), (s_{2 }= 600, w_{2 }= 20, a_{2 }= 2), σ^{2 }= 0.4^{2}} was first simulated, and then the posterior probability was evaluated with various combinations of s_{1 }and s_{2 }while all other parameters were kept fixed at their true values.
The surface plot in Figure 7A shows a global maximum peak located at s_{1 }= 100 and s_{2 }= 600 as expected and an overall very rugged posterior distribution 'terrain': the landscape is full of local maxima with, especially, two prominent 'ridges' of local maxima at s_{1 }= 100 and s_{2 }= 600, respectively. It is clear from Figures 7A and 7B that if the Markov chain of RWMH gets to a local maximum on the ridge at s_{1 }= 100 or s_{2 }= 600 but fortuitously far from the global maximum, it will be trapped on the ridge and practically cannot reach the global peak if the random update interval is small (which is almost always the case). Based on these observations, we chose the Gibbs sampling algorithm for our Bayesian analysis of the genomic copy number data as the Gibbs sampler is well suitable to explore this 'ridged' terrain by using full conditionals to scan the landscape along ridges to find the global maximum.
Figure 7. The posterior distribution of model parameters. (A) A view of the posterior probability distribution landscape in the twodimensional s_{1}s_{2 }parameter space. The simulated data D (M = 700) were generated with θ = {N = 2, (s_{1 }= 100, w_{1 }= 20, a_{1 }= 2), (s_{2 }= 600, w_{2 }= 20, a_{2 }=  2), σ^{2 }= 0.4^{2}}. The posterior probability was evaluated with various combinations of s1 ands2 while all other parameters were kept fixed at their true values. The terrain color varies from green to yellow to red and then to gray as the posterior probability increases. (B) The contour of the same posterior distribution. A closed contour line traces points (s_{1}, s_{2}) of equal posterior probability density. As expected, the global maximum of the posterior probability occurs where s_{1 }= 100 and s_{2 }= 600, two values that were used to generate the underlying data.
As the ROC curves in Figure 4 show, our Bayesian algorithm is the most sensitive method at low (< 10%) false positive rates. This means that at a given low FPR our method can identify more true positive probes inside CNVs than other methods. When the FPR is higher, it is less sensitive than several methods, most of which find CNVs through data smoothing. However, this is hardly a disadvantage, as at high false positive rates the list of identified CNVs is awash with false positives, rendering the whole list practically unusable.
In addition to the improved sensitivity, our method also has several distinct advantages innate to its Bayesian approach. The confidence on an estimated parameter value can be assessed through its Bayesian credible interval. Akin to a confidence interval but with an intuitive probabilistic interpretation, a Bayesian credible interval is a range of values that contains the true parameter value with a certain probability. Through stochastic simulation, it is straightforward to summarize the otherwise analytically intractable joint posterior distribution of the unknown parameters and compute both the best estimate and a corresponding Bayesian credible interval for each parameter in the model. The availability of the intervals for s_{j}, e_{j}, and a_{j}the start and the end genomic locations and the copy number of each CNVis unique to our Bayesian method, and these credible intervals can be especially useful.
Recent years have seen fast development of methodologies in different frameworks to detect CNVs in arrayCGH data. For example, to detect CNV breakpoints, Shah et al. used a modified hidden Markov model (HMM) that is robust to outlier probes [13], while Rueda and D'azUriarte used a nonhomogeneous HMM fitted via reversible jump MCMC [12]. PiqueRegi et al. used piecewise constant (PWC) vectors to represent genome copy numbers and sparse Bayesian learning to detect CNV breakpoints [16]. Other methods for segmenting arrayCGH data have also been implemented, including using Bayesian change point analysis [15], a spatially correlated mixture model [14], a Bayes regression model [35], and wavelet decomposition and thresholding [36].
Due to the computational intensiveness of its MCMC simulation, the method that we present here can be most advantageously used to refine CNVs detected by fast pointestimate methods. It could also be seen as a basic genomic copy number data analysis framework, amenable for several possible extensions. Firstly, due to the nature of the genomic sequence duplications and deletions, the signal measurements of CNVs will aggregate to certain expected values. Such information could be incorporated into the model for better signal detection from background noise. Secondly, more complicated likelihood function, such as a truncated Gaussian, could be used to handle outliers in genomic copy number data. Thirdly, informative priors could be used for better CNV detection. The formation of CNVs in a genome is potentially affected by many local genomic features, such as conservation and repeat content on the sequence level. Compared with the aforementioned methods for arrayCGH data, our Bayesian approach has the advantage to readily incorporate such sequence information through the prior distributions, as it treats the start and the width of CNVs as parameters and thus directly models the genomic CNV state. For this initial Bayesian analysis of genomic copy number data, we used flat priors for both the CNV start site and width. However, instead of using such noninformative prior, we can assign a prior for the start site inversely proportional to the conservation level of the probe sequence. (This incorporates our belief that the more conserved a sequence is the less likely it is to be duplicated or deleted.) For the width, we can assign the width distribution of known CNVs in the database as a prior. The incorporation of such knowledge through the priors does not need to be done only once: it can be sequential (orderinsensitive) as more relevant information becomes available. Using such informative priors, our method can be seen as a framework that enables integration of genomic copy number data and the CNVrelated biological knowledge.
Authors' contributions
ZDZ conceived of the study, implemented the method, performed the analysis, and drafted the manuscript. MBG participated in the design of the study and helped to draft the manuscript. All authors read and approved the final manuscript.
Acknowledgements
ZDZ was funded by an NIH grant (T15 LM07056) from the National Library of Medicine.
References

Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, et al.: Global variation in copy number in the human genome.
Nature 2006, 444(7118):444454. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tuzun E, Sharp AJ, Bailey JA, Kaul R, Morrison VA, Pertz LM, Haugen E, Hayden H, Albertson D, Pinkel D, et al.: Finescale structural variation of the human genome.
Nat Genet 2005, 37(7):727732. PubMed Abstract  Publisher Full Text

Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of largescale variation in the human genome.
Nat Genet 2004, 36(9):949951. PubMed Abstract  Publisher Full Text

Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M, et al.: Largescale copy number polymorphism in the human genome.
Science 2004, 305(5683):525528. PubMed Abstract  Publisher Full Text

Forozan F, Karhu R, Kononen J, Kallioniemi A, Kallioniemi OP: Genome screening by comparative genomic hybridization.
Trends Genet 1997, 13(10):405409. PubMed Abstract  Publisher Full Text

Jobanputra V, Sebat J, Troge J, Chung W, AnyaneYeboa K, Wigler M, Warburton D: Application of ROMA (representational oligonucleotide microarray analysis) to patients with cytogenetic rearrangements.
Genet Med 2005, 7(2):111118. PubMed Abstract  Publisher Full Text

Urban AE, Korbel JO, Selzer R, Richmond T, Hacker A, Popescu GV, Cubells JF, Green R, Emanuel BS, Gerstein MB, et al.: Highresolution mapping of DNA copy alterations in human chromosome 22 using highdensity tiling oligonucleotide arrays.
Proc Natl Acad Sci USA 2006, 103(12):45344539. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Fedurco M, Romieu A, Williams S, Lawrence I, Turcatti G: BTA, a novel reagent for DNA attachment on glass and efficient generation of solidphase amplified DNA colonies.
Nucleic Acids Res 2006, 34(3):e22. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, et al.: Genome sequencing in microfabricated highdensity picolitre reactors.
Nature 2005, 437(7057):376380. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, Rosenbaum AM, Wang MD, Zhang K, Mitra RD, Church GM: Accurate multiplex polony sequencing of an evolved bacterial genome.
Science 2005, 309(5741):17281732. PubMed Abstract  Publisher Full Text

Cahan P, Godfrey LE, Eis PS, Richmond TA, Selzer RR, Brent M, McLeod HL, Ley TJ, Graubert TA: wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data.
Nucleic Acids Res 2008, 36(7):e41. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Rueda OM, DiazUriarte R: Flexible and accurate detection of genomic copynumber changes from aCGH.
PLoS Comput Biol 2007, 3(6):e122. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP: Integrating copy number polymorphisms into array CGH analysis using a robust HMM.
Bioinformatics 2006, 22(14):e431439. PubMed Abstract  Publisher Full Text

Broet P, Richardson S: Detection of gene copy number changes in CGH microarrays using a spatially correlated mixture model.
Bioinformatics 2006, 22(8):911918. PubMed Abstract  Publisher Full Text

Erdman C, Emerson JW: A fast Bayesian change point analysis for the segmentation of microarray data.
Bioinformatics 2008, 24(19):21432148. PubMed Abstract  Publisher Full Text

PiqueRegi R, Ortega A, Asgharzadeh S: Joint estimation of copy number variation and reference intensities on multiple DNA arrays using GADA.
Bioinformatics 2009, 25(10):12231230. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wu LY, Chipman HA, Bull SB, Briollais L, Wang K: A Bayesian segmentation approach to ascertain copy number variations at the population level.
Bioinformatics 2009, 25(13):16691679. PubMed Abstract  Publisher Full Text

Eilers PH, de Menezes RX: Quantile smoothing of array CGH data.
Bioinformatics 2005, 21(7):11461153. PubMed Abstract  Publisher Full Text

Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P: Denoising arraybased comparative genomic hybridization data using wavelets.
Biostatistics 2005, 6(2):211226. PubMed Abstract  Publisher Full Text

Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain ANAN: Hidden Markov models approach to the analysis of array CGH data.
Journal of Multivariate Analysis 2004, 90(1):132153. Publisher Full Text

Jong K, Marchiori E, van der Vaart A, Ylstra B, Meijer G, Weiss M: Chromosomal breakpoint detection in human cancer. In Lecture Notes in Computer Science. Volume 2611. Berlin: SpringerVerlag; 2003::54651. Publisher Full Text

Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of arraybased DNA copy number data.
Biostatistics 2004, 5(4):557572. PubMed Abstract  Publisher Full Text

Picard F, Robin S, Lavielle M, Vaisse C, Daudin JJ: A statistical approach for array CGH data analysis.
BMC Bioinformatics 2005, 6:27. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Hupe P, Stransky N, Thiery JP, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions.
Bioinformatics 2004, 20(18):34133422. PubMed Abstract  Publisher Full Text

Lingjaerde OC, Baumbusch LO, Liestol K, Glad IK, BorresenDale AL: CGHExplorer: a program for analysis of arrayCGH data.
Bioinformatics 2005, 21(6):821822. PubMed Abstract  Publisher Full Text

Myers CL, Dunham MJ, Kung SY, Troyanskaya OG: Accurate detection of aneuploidies in array CGH and gene expression microarray data.
Bioinformatics 2004, 20(18):35333543. PubMed Abstract  Publisher Full Text

Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R: A method for calling gains and losses in array CGH data.
Biostatistics 2005, 6(1):4558. PubMed Abstract  Publisher Full Text

Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data.
Bioinformatics 2005, 21(19):37633770. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Geman S, Geman D: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images.
IEEE Transactions of Pattern Analysis and Machine Intelligence 1984, 6:721741. Publisher Full Text

Newton MA, Raftery AE: Approximate Bayesian inference with the weighted likelihood bootstrap.
Journal of Royal Statistical Society (B series) 1994, 56:348.

R Development Core Team: R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2006.

Bredel M, Bredel C, Juric D, Harsh GR, Vogel H, Recht LD, Sikic BI: Highresolution genomewide mapping of genetic alterations in human glial brain tumors.
Cancer Res 2005, 65(10):40884096. PubMed Abstract  Publisher Full Text

Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E: Equation of state calculations by fast computing machines.
The Journal of Chemical Physics 1953, 21:10871092. Publisher Full Text

Hastings WK: Monte Carlo sampling methods using Markov chains and their applications.
Biometrika 1979, 57:97109. Publisher Full Text

Wen CC, Wu YJ, Huang YH, Chen WC, Liu SC, Jiang SS, Juang JL, Lin CY, Fang WT, Hsiung CA, et al.: A Bayes regression approach to arrayCGH data.
Stat Appl Genet Mol Biol 2006., 5(Article 3) PubMed Abstract  Publisher Full Text

BenYaacov E, Eldar YC: A fast and flexible method for the segmentation of aCGH data.
Bioinformatics 2008, 24(16):i139145. PubMed Abstract  Publisher Full Text