Department of Mathematics and Statistics, Oakland University, Rochester, MI 48309, USA

Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA 52246, USA

Department of Biostatistics, University of Iowa, Iowa City, IA 52246, USA

Abstract

Background

Deletions and amplifications of the human genomic DNA copy number cause numerous diseases, such as various forms of cancer. Therefore, the detection of DNA copy number variations (CNV) is important in understanding the genetic basis of many diseases. Various techniques and platforms have been developed for genome-wide analysis of DNA copy number, such as array-based comparative genomic hybridization (aCGH) and high-resolution mapping with high-density tiling oligonucleotide arrays. Because complicated biological and experimental processes are often associated with these platforms, the data can be contaminated by outliers.

Results

We propose a penalized least absolute deviations (LAD) regression model with the adaptive fused lasso penalty for detecting CNV. This method is robust to outliers and incorporates both the spatial dependence and sparsity of CNV into the analysis. Our simulation studies and real data analysis indicate that the proposed method can correctly detect the numbers and locations of the true breakpoints while appropriately controlling the false positives.

Conclusions

The proposed method has three advantages for detecting CNV change points: it is robust to outliers; it incorporates both spatial dependence and sparsity; and it estimates the true values at each marker accurately.

Background

Deletions and amplifications of the human genomic DNA copy number cause numerous diseases. They are also related to phenotypic variation in the normal population. Therefore, the detection of DNA copy number variation (CNV) is important in understanding the genetic basis of diseases such as various types of cancer. Several techniques and platforms have been developed for genome-wide analysis of DNA copy number, including comparative genomic hybridization (CGH), array-based comparative genomic hybridization (aCGH), single nucleotide polymorphism (SNP) arrays and high-resolution mapping using high-density tiling oligonucleotide arrays (HR-CGH).

Several methods have been proposed to identify the breakpoints of copy number changes. A genetic local search algorithm was developed to localize the breakpoints along the chromosome.

Recently, several penalized regression methods have been proposed for detecting change points. In the framework of penalized regression, a least squares (LS) regression model was used with the least absolute penalty on the differences between the relative copy numbers of the neighboring markers.

In this manuscript, we propose a penalized LAD regression with the adaptive fused lasso penalty to analyze noisy data sets. We call this method LAD-aFL. The proposed LAD-aFL method has three advantages in detecting CNV change points. First, it is expected to be resistant to outliers because it uses the LAD loss function. Second, the adaptive fused lasso penalty can incorporate both the spatial dependence and the sparsity of CNV data sets into the analysis. Third, the adaptive procedure is expected to significantly improve the estimates of the true intensity at each marker.

Methods

LAD-aFL model for CNV analysis

For a CGH profile array, let y_{i} be the log2 ratio of the intensity of the red over green channels at marker i, and let β_{i} be the true relative copy number, so that β_{i} - β_{i-1} is the true jump value at marker i. We assume β_{0} = 0, so the jump at the first marker equals β_{1}. The observed y_{i} can be written as

y_{i} = β_{i} + ε_{i}, i = 1, ⋯, n,

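where ε_{i} is the random error at marker i. The LAD-aFL estimates of β_{1}, ⋯, β_{n} minimize the sum of the absolute residuals |y_{i} - β_{i}| plus an adaptive fused lasso penalty, which consists of a weighted ℓ_{1} penalty on the β_{i} scaled by λ_{1} and a weighted ℓ_{1} penalty on the successive differences β_{i} - β_{i-1} scaled by λ_{2}.

Here, λ_{1} and λ_{2} are two tuning parameters controlling the sparsity and smoothness of the estimates, respectively. The adaptive weights are computed from initial values of the estimates and of their successive differences.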

Computation

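Let **y** = (y_{1}, ⋯, y_{n})'. The sparsity penalty is represented by a diagonal matrix whose entries are the adaptive weights multiplied by λ_{1} (by λ_{1}/2 for the first marker), and the smoothness penalty by a matrix that maps (β_{1}, ⋯, β_{n})' to the weighted successive differences multiplied by λ_{2}.

Consider a new response vector **y*** = (**y**', **0**', **0**')' and a new design matrix **X*** formed by stacking the identity matrix on top of these two penalty matrices. With β = (β_{1}, ⋯, β_{n})' and x*_{j}' denoting the jth row of **X***, the penalized criterion becomes

Σ_{j} |y*_{j} - x*_{j}'β|. (3)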

For every fixed λ_{1 }and λ_{2}, (3) is the objective function of a LAD regression problem with a new sparse design matrix **X***. Therefore, an existing program such as the
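The stacking idea can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the function names and the weight vectors `w1` and `w2` are hypothetical, the fusion penalty is assumed to start at the second marker, and no marker-specific scaling of the first diagonal entry is reproduced. It builds an augmented response and design matrix and verifies that the ℓ_{1} norm of the stacked residual equals the LAD loss plus the two weighted penalties.

```python
def augment(y, lam1, lam2, w1, w2):
    """Return (y_star, x_star) for an unpenalized LAD fit.

    Assumes non-negative lam1, lam2 and weights. w1 has length n
    (one weight per marker); w2 has length n - 1 (one weight per
    successive difference). All names here are illustrative.
    """
    n = len(y)
    x_star = []
    # Identity block: its residuals reproduce the LAD loss |y_i - beta_i|.
    for i in range(n):
        x_star.append([1.0 if j == i else 0.0 for j in range(n)])
    # Diagonal block: residuals equal lam1 * w1_i * |beta_i|.
    for i in range(n):
        x_star.append([lam1 * w1[i] if j == i else 0.0 for j in range(n)])
    # Difference block: residuals equal lam2 * w2_i * |beta_i - beta_{i-1}|.
    for i in range(1, n):
        row = [0.0] * n
        row[i] = lam2 * w2[i - 1]
        row[i - 1] = -lam2 * w2[i - 1]
        x_star.append(row)
    # The response is padded with zeros for the two penalty blocks.
    y_star = list(y) + [0.0] * (2 * n - 1)
    return y_star, x_star


def l1_residual(y_star, x_star, beta):
    """Sum of absolute residuals of the stacked regression."""
    total = 0.0
    for target, row in zip(y_star, x_star):
        fitted = sum(x * b for x, b in zip(row, beta))
        total += abs(target - fitted)
    return total


def penalized_lad(y, beta, lam1, lam2, w1, w2):
    """LAD loss plus weighted sparsity and fusion penalties, computed directly."""
    loss = sum(abs(yi - bi) for yi, bi in zip(y, beta))
    sparsity = lam1 * sum(w * abs(b) for w, b in zip(w1, beta))
    fusion = lam2 * sum(
        w2[i - 1] * abs(beta[i] - beta[i - 1]) for i in range(1, len(beta))
    )
    return loss + sparsity + fusion


# Toy data: four markers with one elevated region.
y = [0.1, 0.9, 1.1, -0.2]
beta = [0.0, 1.0, 1.0, 0.0]
w1 = [1.0, 2.0, 2.0, 1.0]
w2 = [1.0, 0.5, 1.0]
y_star, x_star = augment(y, 0.5, 2.0, w1, w2)
direct = penalized_lad(y, beta, 0.5, 2.0, w1, w2)
stacked = l1_residual(y_star, x_star, beta)
```

Because the two quantities agree for any β, minimizing the stacked residual with any LAD (median regression) solver yields the penalized estimate for that pair of tuning parameters.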

Determining the tuning parameters

The magnitudes of the tuning parameters λ_{1} and λ_{2} determine the smoothness and sparsity of the estimates. If λ_{1} = 0 and λ_{2} = 0, then the estimate of β_{i} is simply y_{i}, which obviously leads to too many estimated non-zero relative ratios. At the other extreme, if λ_{1} is very large, then all estimates are shrunk to zero.

We provide a fast algorithm to choose the tuning parameters in LAD-aFL. For every fixed combination of λ_{1} and λ_{2}, we obtain a LAD-aFL solution and evaluate its SIC. We determine λ_{1} and λ_{2} using the following two steps.

1. Fix the SIC constant for this step at a value of at least 1. For a fixed small value of λ_{1}, say λ_{1} = 0.001, we search for the "best" λ_{2} over a uniform grid by minimizing SIC.

2. Fix the SIC constant for this step at a value of at least 1. For the above "best" λ_{2}, we increase λ_{1} by small increments over a uniform grid and search for the "best" value by minimizing SIC.

Here λ_{2} controls the frequency of altered regions, and λ_{1} controls the number of nonzero log2 ratios. Because a CGH array data set contains far fewer alterations than nonzero log2 ratios, we can select λ_{2} more aggressively by setting the constant to 1.5 in step 1 and to 1 in step 2 in our computation.
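The two-step search can be sketched as follows. This is only an illustration of the search order, under the assumption that a criterion function is available: `score_step1` and `score_step2` are hypothetical callables standing in for the SIC evaluated with the constant used in each step, and `toy_score` is an artificial function used purely to exercise the code.

```python
def two_step_search(score_step1, score_step2, lam2_grid, lam1_grid,
                    lam1_start=0.001):
    """Coordinate-wise grid search over (lam1, lam2).

    score_step1(lam1, lam2) and score_step2(lam1, lam2) return the
    criterion to minimize in each step (hypothetical stand-ins for SIC).
    """
    # Step 1: hold lam1 at a small value and scan the lam2 grid.
    best_lam2 = min(lam2_grid, key=lambda lam2: score_step1(lam1_start, lam2))
    # Step 2: hold the selected lam2 and scan the lam1 grid.
    best_lam1 = min(lam1_grid, key=lambda lam1: score_step2(lam1, best_lam2))
    return best_lam1, best_lam2


def toy_score(lam1, lam2):
    """Artificial criterion with its minimum near lam1 = 0.3, lam2 = 2.0."""
    return (lam1 - 0.3) ** 2 + (lam2 - 2.0) ** 2


lam2_grid = [0.5 * k for k in range(1, 9)]        # 0.5, 1.0, ..., 4.0
lam1_grid = [0.001 + 0.1 * k for k in range(10)]  # 0.001, 0.101, ..., 0.901
best_lam1, best_lam2 = two_step_search(toy_score, toy_score,
                                       lam2_grid, lam1_grid)
```

Searching one parameter at a time requires a number of fits equal to the sum of the two grid sizes rather than their product, which is what makes the procedure fast.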

Even though many cancer profiles contain large aberrations, so that their relative intensities are not sparse, the sparsity of the jumps (only a few jumps exist in the relative intensities) still favors the penalized method. To reflect the true relative intensities accurately, we can choose a small λ_{1}, say, λ_{1} = 0.001. Our simulations show that LAD-aFL maps these true segments efficiently.

Estimation of FDR

Suppose the nonzero estimates are indexed by k = 1, 2, ⋯, K. We consider a test statistic for them, compute the corresponding p-values, and use these p-values to estimate the FDR.

Detecting the breakpoints

The procedure of detecting breakpoints can be summarized into two steps.

S1. First, we use the SIC to select the tuning parameters and compute the LAD-aFL estimates. Markers whose estimates exceed the cutoff c_{0} in magnitude are identified as candidate breakpoints, where c_{0} is an empirical cutoff threshold for possible amplifications and deletions. Some work has suggested that possible chromosome amplifications and deletions should satisfy log2-ratio > 0.225, which corresponds to values between 2 and 3 standard deviations from the mean. We conservatively choose c_{0} = 0.1 in our experiment.

S2. For the potential breakpoints in S1, we calculate p-values and estimate FDR. The significant breakpoints are identified by controlling FDR.
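Step S1 can be illustrated with a small sketch. The rule used here is an assumption made for illustration, not the paper's exact procedure: each marker is called a gain, a loss, or neutral by comparing its estimate with the cutoff, and a candidate breakpoint is reported wherever the call changes between neighboring markers. The function name and the example estimates are hypothetical, and step S2 (p-values and FDR) is not covered.

```python
def candidate_breakpoints(estimates, cutoff=0.1):
    """Return 0-based indices i where the call at marker i differs from
    the call at marker i - 1.

    Calls: +1 (gain) if estimate > cutoff, -1 (loss) if estimate < -cutoff,
    0 (neutral) otherwise. This thresholding rule is an assumption.
    """
    def call(value):
        if value > cutoff:
            return 1
        if value < -cutoff:
            return -1
        return 0

    calls = [call(value) for value in estimates]
    return [i for i in range(1, len(calls)) if calls[i] != calls[i - 1]]


# Hypothetical fitted values: one gain region and one loss region.
estimates = [0.0, 0.0, 1.0, 1.0, 0.05, 0.0, -1.0, -1.0]
candidates = candidate_breakpoints(estimates, cutoff=0.1)
```

A limitation of this simple rule is that it does not report a change between two different gain levels (for example from 0.59 to 1), since both sides receive the same call.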

Results and Discussion

Simulation studies

We evaluate the performance of the LAD-aFL method for detecting CNV using three simulation examples. In the first two examples, we consider 500 markers equally spaced along a chromosome.

All observed log2 ratios are generated from

y_{i} = β_{0i} + ε_{i},

where the β_{0i}'s are the true log2 ratios of all 500 markers, which have three altered regions corresponding to quadraploid, triploid and monoploid states, and the ε_{i}'s are the noise terms.

**Example 1**. The ε_{i}'s are generated from the following three models such that they have the same standard deviations.

The true log2 ratios for Examples 1 and 2

| **i** | **1-100** | **101-110** | **111-450** | **451-460** | **461-980** | **981-1000** |
|---|---|---|---|---|---|---|
| β_{0i} | 0 | 1 | 0 | 0.59 | 0 | -1 |

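Independent model: ε_{i} = e_{i0},

AR(1) model: ε_{i} = … ε_{i-1} + e_{i1},

AR(2) model: ε_{i} = … ε_{i-1} + 0.20ε_{i-2} + e_{i2},

where e_{i0}, e_{i1} and e_{i2} are normal innovations whose variances are chosen so that the ε_{i}'s have the same standard deviation under the three models.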

**Example 2**. The e_{ij}'s are generated from double exponential (DE) distributions such that the ε_{i}'s have equal standard deviation:

Independent model: ε_{i} = e_{i0},

AR(1) model: ε_{i} = … ε_{i-1} + e_{i1},

AR(2) model: ε_{i} = 0.60ε_{i-1} + 0.20ε_{i-2} + e_{i2},

where e_{i0}, e_{i1} and e_{i2} follow DE distributions.
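A data set of this kind can be simulated with a few lines of code. The sketch below is a hedged illustration rather than the authors' simulation code: it builds the piecewise-constant signal from the table of true log2 ratios and adds AR(2) noise with coefficients 0.60 and 0.20 and double exponential innovations. The innovation scale of 0.1, the burn-in length, and all function names are assumptions introduced here.

```python
import random

# (segment length, true log2 ratio), following the table of true values
# for markers 1-100, 101-110, 111-450, 451-460, 461-980 and 981-1000.
SEGMENTS = [(100, 0.0), (10, 1.0), (340, 0.0), (10, 0.59), (520, 0.0), (20, -1.0)]


def true_signal():
    """Piecewise-constant true log2 ratios."""
    signal = []
    for length, level in SEGMENTS:
        signal.extend([level] * length)
    return signal


def laplace(rng, scale):
    """Double exponential draw: the difference of two independent
    exponentials with mean `scale` is Laplace with location 0."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def ar2_noise(n, rng, scale=0.1, burn_in=200):
    """AR(2) noise eps_i = 0.60 eps_{i-1} + 0.20 eps_{i-2} + e_i.

    The innovation scale and the burn-in length are assumed values.
    """
    prev1 = prev2 = 0.0
    noise = []
    for t in range(n + burn_in):
        eps = 0.60 * prev1 + 0.20 * prev2 + laplace(rng, scale)
        prev2, prev1 = prev1, eps
        if t >= burn_in:  # discard the start-up transient
            noise.append(eps)
    return noise


def simulate(seed=0):
    """Return (observed log2 ratios, true log2 ratios)."""
    rng = random.Random(seed)
    signal = true_signal()
    noise = ar2_noise(len(signal), rng)
    observed = [s + e for s, e in zip(signal, noise)]
    return observed, signal


observed, signal = simulate(seed=1)
n_breakpoints = sum(
    1 for i in range(1, len(signal)) if signal[i] != signal[i - 1]
)
```

The signal has five change points, matching the number of true breakpoints stated for each simulated data set.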

We generate 40 data sets for each model defined in Examples 1 and 2. Our simulated data sets are sparse, with two amplifications and one deletion, and only 5 true breakpoints for each data set. Both the LAD-aFL and LS-FL methods are applied to all three models. In Figure

**Analysis for simulated data in Example 2**. 1(a)-(b) AR(2) model. 1(c)-(d) AR(1) model. 1(e)-(f) Independent model. The left and right panels are the results from LAD-aFL and LS-FL respectively. Black dots are the observed log 2 ratios. The estimates from each method are connected by solid lines.

Simulation results for Examples 1 and 2

| **Methods** | **AR(2)** | **AR(1)** | **Ind**. |
|---|---|---|---|
| Example 1 | | | |
| LAD-aFL | 5.225 (0.831)^{1} | 5.375 (0.806) | 4.750 (0.669) |
| | 4.925^{2}, 0.300^{3} | 4.975, 0.400 | 4.750, 0 |
| LS-FL | 4.250 (1.149) | 4.750 (0.707) | 4.550 (0.959) |
| | 4.250, 0 | 4.725, 0.025 | 4.525, 0.025 |
| LAD-FL | 5.025 (0.479) | 4.975 (0.806) | 4.350 (1.167) |
| | 4.850, 0.175 | 4.900, 0.075 | 4.275, 0.075 |
| Example 2 | | | |
| LAD-aFL | 5.275 (0.598) | 5.475 (0.784) | 4.925 (0.350) |
| | 5.000, 0.275 | 4.925, 0.550 | 4.900, 0.025 |
| LS-FL | 2.850 (1.189) | 3.750 (1.171) | 3.125 (1.362) |
| | 2.850, 0 | 3.750, 0 | 3.125, 0 |
| LAD-FL | 4.850 (0.533) | 4.800 (0.791) | 4.575 (0.874) |
| | 4.850, 0 | 4.575, 0.225 | 4.450, 0.125 |

^{1}The average number (with standard deviation) of all detected breakpoints;

^{2}The average number of correctly detected true breakpoints;

^{3}The average number of falsely detected breakpoints.

Our simulation results show that the LAD-aFL method can detect copy number variations accurately. Compared to the LS-FL method, LAD-aFL is more stable and robust, even when the simulated data are generated from an independent model. The LS-FL method tends to over-smooth the data and is not robust to outliers. To gain some robustness, the Loess technique was imposed

In Table

In the following Example 3, we apply LAD-aFL to large aberrations with 10,000 markers equally spaced along a chromosome.

**Example 3**. The noise is generated from the AR(1) model in Example 2. We consider three cases of large aberrations containing 99.8%, 80% and 50% of the probes, respectively, in each profile.

We summarize the simulation results in Table

Simulation results for Example 3

| **Methods** | **Case 1** | **Case 2** | **Case 3** |
|---|---|---|---|
| True Number^{1} | 2.000 | 1.000 | 1.000 |
| LAD-aFL | 1.900 (0.410)^{2} | 1.000 (0) | 1.000 (0) |
| | 1.900^{3}, 0^{4} | 1.000, 0 | 1.000, 0 |
| LS-FL | 1.750 (0.444) | 1.150 (0.366) | 1.000 (0) |
| | 1.750, 0 | 0.850, 0.300 | 1.000, 0 |

^{1}The true number of breakpoints for each data set;

^{2}The average number (with standard deviation) of all detected breakpoints;

^{3}The average number of correctly detected true breakpoints;

^{4}The average number of falsely detected breakpoints.

**Analysis for simulated data in Example 3**. The top, middle and bottom panels are for case 1, 2 and 3, respectively. Gray dots are the observed log2 ratios. Black, red, and blue lines represent the true signal, estimates from LAD-aFL, and estimates from LS-FL, respectively.

We investigate the FDR estimate using the above examples. For example, when the FDR is controlled at level 0.002, 90% of 100 iterations of the AR(1) model in Example 2 and 95% of 100 iterations of Case 1 in Example 3 have a true FDR below 0.002.

Furthermore, we perform a sensitivity analysis of the LAD-aFL model with respect to the cutoff values. In Figure

**ROC curve**. The True Positive Rate is computed by the number of probes with true nonzero log2 ratios divided by the total number of probes detected in the aberration region. The False Positive Rate is computed by the number of probes with true zero log2 ratios divided by the total number of probes detected in the aberration region. Three curves are plotted for the AR(1) and AR(2) models in Example 2 and Case 1 in Example 3, respectively.

Bacterial Artificial Chromosome (BAC) array

The BAC data set consists of single experiments on 15 fibroblast cell lines

We applied both LAD-aFL and LS-FL to four chromosomes: Chromosome 8 of GM03134, Chromosome 14 of GM01750, Chromosome 22 of GM13330, and Chromosome 23 of GM03563. Results are shown in Figure

**Analysis of BAC data**. 2(a)-(b) Chromosome 8 of GM03134. 2(c)-(d) Chromosome 14 of GM01750. 2(e)-(f) Chromosome 22 of GM13330. 2(g)-(h) Chromosome 23 of GM03563. The left and right panels represent results using the LAD-aFL and LS-FL methods, respectively. Black dots are the observed log2 ratios. The estimates from each method are connected by solid lines. The breakpoints detected by each method are identified by vertical lines.

Colorectal cancer data

Colorectal cancer data was reported and analyzed for the genomic alterations in tumors of colorectal cancer

**Colorectal cancer data**. 3(a) X59. 3(b) X186. 3(c) X204. 3(d) X524. Black dots are the observed log 2 ratios. The estimates from the LAD-aFL method are connected by solid lines.

Human chromosome 22q11 data

High-resolution CGH (HR-CGH) technology was applied to analyze CNVs on chromosome 22q11

These Human chromosome 22q11 data sets consist of measurements on chromosome 22 of 12 patients, with approximately 372,000 features in the microarray data set for each patient. To apply the LAD-aFL method, we partitioned the whole chromosome into several segments and then applied the method to each segment. We selected the cutoff value of

**Human Chromosome 22q11 data sets**. 4(a) Location 0-372,070 of Patient 03-154. 4(b) Location 0-372,352 of Patient 97-237. 4(c) Location 0-50,000 of Patient 03-154. 4(d) Location 0-50,000 of Patient 97-237. 4(e) Location 190,000-240,000 of Patient 03-154. 4(f) Location 190,000-240,000 of Patient 97-237. Here we plot the LAD-aFL analysis of the Human Chromosome 22q11 data sets. The left and right panels are the results for patient 03-154 and patient 97-237, respectively. For each panel, the top, middle and bottom plots show the results for the whole genome, the first significant segment (markers 0-50,000) and the second significant segment (markers 190,000-240,000). The observed log2 ratios are represented by gray dots; the estimates at all markers are connected by solid lines. The cutoff value

Conclusions

We propose using a smoothing technique, LAD-aFL, to detect the breakpoints and then divide all the probes into different segments for noisy CGH data. Very recently, a median smoothing median absolute deviation method (MSMAD) was proposed to improve the performance of breakpoint detection.

The appealing features of the proposed LAD-aFL method include its resistance to outliers, its improved accuracy in mapping the true intensities, and its fast and accurate computation algorithm. The robustness property is inherited from LAD regression, which significantly reduces the possibility of false positives due to outlying intensity measurements. These properties are demonstrated in the generating models used in our simulation studies. The adaptive fused lasso penalty in the LAD-aFL method incorporates both the sparsity and smoothness properties of the copy number data. The adaptive procedure generates solutions with some oracle properties. Computationally, the LAD-aFL estimator can be obtained by transforming the problem into an unpenalized LAD regression, since both the loss and penalty functions use the same ℓ_{1} norm. Our simulation and real data analysis indicate that the LAD-aFL method is a useful and robust approach for CNV analysis. However, some important questions require further investigation. For example, the proposed LAD-aFL method assumes that the reported intensity data are properly normalized. It would be useful to examine the sensitivity of the method to different normalization procedures, or perhaps to consider incorporating normalization into an integrated model. Furthermore, regarding the theoretical properties of LAD-aFL, it would be of interest to determine under what conditions on the smoothness and sparsity of the underlying copy number LAD-aFL is able to correctly detect the breakpoints with high probability.

Authors' contributions

XG and JH conceived of the research and designed the study. XG carried out the computational analysis and wrote the paper. JH helped to improve the computational analysis and manuscript preparation. Both authors read and approved the final manuscript.

Acknowledgements

XG was supported by an OU faculty research fellowship. JH was supported in part by the grants CA120988 from the National Cancer Institute and DMS 0805670 from the National Science Foundation.