Virtual CGH: an integrative approach to predict genetic abnormalities from gene expression microarray data applied in lymphoma

Geng, Huimin; Iqbal, Javeed; Chan, Wing C; Ali, Hesham H

doi:10.1186/1755-8794-4-32

Research article
Open access
Published: 12 April 2011

BMC Medical Genomics volume 4, Article number: 32 (2011)

Abstract

Background

Comparative Genomic Hybridization (CGH) is a molecular approach for detecting DNA Copy Number Alterations (CNAs) in tumors, which are among the key causes of tumorigenesis. However, in the post-genomic era, most studies in cancer biology have focused on Gene Expression Profiling (GEP) rather than CGH, and as a result, an enormous amount of GEP data has accumulated in public databases for a wide variety of tumor types. We exploited this resource of GEP data to define possible recurrent CNAs in tumors. In addition, the CNAs identified from GEP would be more functionally relevant to disease pathogenesis, since the functional effects of CNAs can be reflected by altered gene expression.

Methods

We proposed a novel computational approach, coined virtual CGH (vCGH), which employs hidden Markov models (HMMs) to predict DNA CNAs from their corresponding GEP data. vCGH was first trained on the paired GEP and CGH data generated from a sufficient number of tumor samples, and then applied to the GEP data of a new tumor sample to predict its CNAs.

Results

Using cross-validation on 190 Diffuse Large B-Cell Lymphomas (DLBCL), vCGH achieved 80% sensitivity, 90% specificity and 90% accuracy for CNA prediction. The majority of the recurrent regions defined by vCGH are concordant with the experimental CGH, including gains of 1q, 2p16-p14, 3q27-q29, 6p25-p21, 7, 11q, 12 and 18q21, and losses of 6q, 8p23-p21, 9p24-p21 and 17p13 in DLBCL. In addition, vCGH predicted some recurrent functional abnormalities which were not observed in CGH, including gains of 1p, 2q and 6q and losses of 1q, 6p and 8q. Among those novel loci, 1q, 6q and 8q were significantly associated with the clinical outcomes in the DLBCL patients (p < 0.05).

Conclusions

We developed a novel computational approach, vCGH, to predict genome-wide genetic abnormalities from GEP data in lymphomas. vCGH can be generally applied to other types of tumors and may significantly enhance the detection of functionally important genetic abnormalities in cancer research.

Background

DNA Copy Number Alterations (CNAs), or chromosomal gains and losses, play an important role in regulating gene expression and constitute a key mechanism in cancer development and progression [1–3]. Comparative Genomic Hybridization (CGH) was developed as a molecular cytogenetic method for detecting and mapping such CNAs in tumor cells by comparing the hybridization intensity of a tumor DNA sample with that of a normal DNA sample [4, 5]. More recently, array CGH (aCGH) has improved the resolution and sensitivity of CGH by hybridizing to arrayed genomic DNA or cDNA clones [6–9]. However, in the post-genomic era, most cancer studies have focused on Gene Expression Profiling (GEP) rather than CGH. As a result, a tremendous amount of GEP data has been accumulated and made publicly accessible [10–14], but few CGH studies have been performed on large series of tumor samples [15]. This enormous amount of GEP data represents an important resource for cancer research, yet it has not been fully exploited for its links to CNAs. In the literature, most studies that include both GEP and CGH have focused on the impact of one on the other, or on combining the two to identify candidate tumor suppressor genes or oncogenes [16–28]. We hypothesized that, with a well-designed computational model, GEP data can be readily used to derive functionally relevant genetic abnormalities in tumors.

In this paper, we propose a novel computational approach, virtual CGH (vCGH), to predict DNA CNAs from GEP data; such CNAs may be functionally important because their impact is evaluated at the expression level. The biological foundation for vCGH lies in the observation that a region with a chromosomal gain or loss generally shows correspondingly increased or decreased mRNA expression along the aberrant loci, as reported in Diffuse Large B-Cell Lymphoma (DLBCL) [17], Mantle Cell Lymphoma (MCL) [18], Natural Killer-Cell Lymphoma (NKCL) [19], Acute Myeloid Leukemia (AML) [20], sarcoma [25], glioblastoma [27], breast cancer [21, 22, 28], prostate cancer [23] and gastric cancer [24]. We recently used CGH to study genetic abnormalities in a large group of DLBCL and MCL samples that had previously been profiled by GEP with Lymphochip [29–31], and found that DNA CNAs had a substantial impact on the expression of genes in the involved chromosomal regions [17, 18]. In another study on a number of tumor specimens and cell lines of NKCL using high-resolution aCGH and Affymetrix GEP microarrays, we observed a similar relationship between DNA CNAs and mRNA expression; a considerable percentage of the variance in mRNA expression is directly attributable to underlying variation in gene copy numbers [19]. This association between GEP and CGH allows the development of vCGH when it is trained on a sufficient number of tumor samples. To our advantage, we had 190 DLBCL and 64 MCL samples examined by both CGH (Vysis CGH kits, Downers Grove, IL) and GEP (Affymetrix Inc., Santa Clara, CA). The paired GEP and CGH data on a large number of tumor samples provide a unique resource for developing and verifying the vCGH model.

vCGH was built on hidden Markov models (HMMs). HMMs are well-developed statistical models for capturing hidden patterns in observable sequential data, and have been successfully applied in biology to tasks such as finding CpG islands and predicting protein secondary structure [32]. HMMs have recently been applied in aCGH for segmentation, a procedure that divides the signal ratios of the clones on the array into states, where all of the clones in a state have the same underlying copy number [33, 34]. In this paper, HMMs were used for the first time in an integrative analysis for GEP-to-CGH prediction, intended to capture two primary sources of uncertainty embedded in genomic data: (1) the significant but subtle correlations between GEP and CGH; (2) the sequential transitions of DNA CNAs along a chromosome. Hertzberg et al. developed a method for predicting chromosomal aneuploidy from GEP data using fold change and a chromosomal relative expression calculation for each chromosome [35]. The major limitation of this approach is that it can only call whole-chromosome gain or loss. Nilsson et al. proposed a method that employs total variance minimization techniques for chromosomal segmentation based on altered gene expression patterns [36]. Our proposed vCGH method differs from these previous methods in two important respects. First, vCGH is based on HMMs, which are classical pattern recognition methods with a rich set of existing estimation and inference algorithms for sequential observations. Second, vCGH is specifically designed to be trained on paired CGH and GEP datasets and to predict CNAs using GEP data only. The special requirement of vCGH is to ensure the specificity of CNA calling from the GEP data.

vCGH aims to enhance the limited CGH data with the wealth of GEP data and to provide an integrative genomic-transcriptomic approach for identifying functionally relevant CNAs in tumor pathogenesis. Many of the common CNAs are pathogenetically significant and provide additional information on a tumor which may not be immediately evident from the CGH data. CGH in principle defines only the chromosomal structural changes, but the functional effects of CNAs can be reflected by altered gene expression. This information is important in cancer research for identifying the target genes in regions of CNAs and the biological effect of the CNAs.

Methods

In vCGH, HMMs are used to address the following question: "Given a sequence of GEP data as observations along a chromosome, predict the hidden CGH status of the chromosomal gains or losses."

vCGH model structure

An HMM is a Bayesian network that describes a doubly embedded stochastic process with one observable process and one hidden process. In vCGH, the observable process {x_i} describes GEP observations along a chromosome, where x_i = "H", "L" or "M" for high, low or medium expression of a gene; the hidden process {π_i} describes the underlying CNAs, where π_i = "+", "-" or "o" for gain, loss or normal copy number status of a gene. Figure 1A illustrates the vCGH model as a Bayesian network, where the shaded nodes S_1, S_2, ..., S_n represent hidden state variables and the visible nodes E_1, E_2, ..., E_n represent the observations for those variables. The emission space consists of the three GEP symbols {H, L, M}, and the hidden state space consists of nine states in which each GEP level is superimposed on each CNA status, {H_+, L_+, M_+, H_-, L_-, M_-, H_o, L_o, M_o}; a state E_α emits E, where E ∈ {H, L, M} and α ∈ {+, -, o}. A hidden state H_+ can only emit H; however, an emission H could come from any of the three underlying hidden states H_+, H_- or H_o. We limited the number of levels to three for GEP (L, M, H) and three for CGH (-, o, +) to control model complexity. Five levels for CGH (--, -, o, +, ++) and GEP (LL, L, M, H, HH) would give 5*5 = 25 hidden states (i.e., the five GEP observations superimposed on the five CNA levels), and the transition matrix would have 25*25 = 625 parameters, far more than the 9*9 = 81 parameters of the current model. Since we generally have a limited number of training samples, the three-level model is more appropriate in the current framework.
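The state space described above can be enumerated in a few lines. The following is a minimal sketch, not the authors' code: the names `build_states`, `emitted_symbol` and `states_emitting` are hypothetical, and each hidden state is represented as an (expression, copy number) pair.

```python
from itertools import product

GEP_LEVELS = ("H", "L", "M")   # high, low, medium expression
CNA_LEVELS = ("+", "-", "o")   # gain, loss, normal copy number

def build_states(gep_levels, cna_levels):
    """Return every (expression, copy-number) pair as one hidden state."""
    return list(product(gep_levels, cna_levels))

def emitted_symbol(state):
    """A state such as ('H', '+') can only emit its own expression symbol."""
    return state[0]

def states_emitting(symbol, states):
    """All hidden states that could have produced an observed symbol."""
    return [s for s in states if emitted_symbol(s) == symbol]

states3 = build_states(GEP_LEVELS, CNA_LEVELS)
states5 = build_states(("LL", "L", "M", "H", "HH"), ("--", "-", "o", "+", "++"))

print(len(states3), len(states3) ** 2)   # 9 states, 81 transition entries
print(len(states5), len(states5) ** 2)   # 25 states, 625 transition entries
print(states_emitting("H", states3))     # H superimposed on +, - and o
```

The counts reproduce the 81 versus 625 transition entries quoted in the text, which is the argument for staying with three levels when training samples are limited.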

Figure 1B shows the state transition diagram of vCGH. The model is a single chain incorporating three Markov sub-chains. Each sub-chain has a complete set of state transitions, describing a continuous DNA segment within a gain, loss or normal CNA status. State transitions between sub-chains are also allowed, describing a change of status among gain, loss and normal. This design of intra- and inter-sub-chain transitions makes it possible for vCGH to identify alternating gain, loss and normal regions of variable length automatically.

vCGH training and prediction

For a specific tumor type, genomic aberrations often occur in a specific set of chromosomal hotspots. For example, DLBCL has frequent aberrations involving gains of 2p, 6p and 18q and losses of 6q and 17p [17], and the hallmark aberrations of MCL are gains of 3q and 8q and losses of 1p, 6q, 8p, 9p, 9q, 11q and 13q [18]. To reflect these chromosomal differences accurately, we developed and trained a separate HMM for each chromosome, so that each chromosome can have its own transition and emission distributions. Our training dataset includes the paired GEP and CGH data, and hence the hidden state path for each observation sequence is known. Therefore, the transition and emission probabilities can be estimated using Maximum Likelihood Estimation (MLE) as in Eqs. (1) and (2),

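\[ a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \tag{1} \]

\[ e_l(b) = \frac{E_l(b)}{\sum_{b'} E_l(b')} \tag{2} \]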

where a_kl is the transition probability from state k to state l, e_l(b) is the emission probability of output symbol b in state l, and A_kl and E_l(b) are the numbers of times that the transition from k to l and the emission of b from state l, respectively, occurred in the training data. Here k, l, l' ∈ {H_+, H_-, H_o, L_+, L_-, L_o, M_+, M_-, M_o} and b, b' ∈ {H, L, M}. The initial probabilities of the states at the beginning of the chain for each chromosome are also estimated using MLE, pi(l) = N_l/N, where pi(l) is the initial probability for state l, N_l is the number of samples with state l at the beginning of the chain, and N is the total number of samples in the training data.
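Because the hidden path is known for every training sample, the estimates described above reduce to counting and normalizing. The sketch below illustrates this for one chromosome under stated assumptions: the function name `train_mle` and the string encoding of states (for example "H+" for H superimposed on a gain) are hypothetical, and no pseudocounts are added, so transitions never seen in training get probability zero.

```python
from collections import Counter, defaultdict

def train_mle(paired_samples):
    """Estimate initial, transition and emission probabilities by counting.

    paired_samples: list of (gep, cgh) pairs for one chromosome, where gep is
    a sequence over {H, L, M} and cgh an equally long sequence over {+, -, o}.
    """
    init_counts = Counter()               # N_l
    trans_counts = defaultdict(Counter)   # A_kl
    emit_counts = defaultdict(Counter)    # E_l(b)
    for gep, cgh in paired_samples:
        # Known hidden path: expression level superimposed on copy-number status.
        path = [g + c for g, c in zip(gep, cgh)]
        if not path:
            continue
        init_counts[path[0]] += 1
        for state, symbol in zip(path, gep):
            emit_counts[state][symbol] += 1
        for k, l in zip(path, path[1:]):
            trans_counts[k][l] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    pi = normalize(init_counts)
    a = {k: normalize(row) for k, row in trans_counts.items()}
    e = {l: normalize(row) for l, row in emit_counts.items()}
    return pi, a, e

# Two toy training samples (invented data, for illustration only).
pi, a, e = train_mle([("HHM", "++o"), ("HM", "+o")])
print(a["H+"])   # H+ -> H+ seen once, H+ -> Mo seen twice
```

In this state space the emission estimate is degenerate (each state emits only its own expression symbol), so the informative parameters are the initial and transition probabilities.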

With the vCGH parameters trained on the paired GEP and CGH data in the training dataset, we used the Viterbi and Posterior (also called Forward-Backward) decoding algorithms [32] to predict the hidden CGH states from the GEP observations of a new tumor sample in the testing dataset. The Viterbi algorithm finds the single most probable hidden state path, whereas the Posterior algorithm finds the most likely state at each position and then concatenates those states into a hidden state path. The detailed Viterbi and Posterior algorithms are given in Additional file 1. Preliminary versions of the vCGH Viterbi and vCGH Posterior methods were presented at conferences by Geng et al. [37–39].
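As an illustration of the Viterbi step, the following is a generic log-space sketch rather than the implementation from Additional file 1. The toy model is an assumption made for the example: it has only two copy-number states ("+" and "o") and two expression levels ("H" and "M"), with invented probabilities, and the observation sequence is assumed to be non-empty.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """Natural log that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi(obs, states, pi, a, e):
    """Most probable hidden state path for a non-empty observation sequence."""
    score = {s: safe_log(pi.get(s, 0)) + safe_log(e.get(s, {}).get(obs[0], 0))
             for s in states}
    backpointers = []
    for symbol in obs[1:]:
        new_score, pointer = {}, {}
        for l in states:
            # Best predecessor k for landing in state l at this position.
            best_k = max(states,
                         key=lambda k: score[k] + safe_log(a.get(k, {}).get(l, 0)))
            new_score[l] = (score[best_k]
                            + safe_log(a.get(best_k, {}).get(l, 0))
                            + safe_log(e.get(l, {}).get(symbol, 0)))
            pointer[l] = best_k
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]

# Toy model (invented numbers): P(expression | copy status), sticky copy status.
LEVEL_GIVEN_COPY = {"+": {"H": 0.7, "M": 0.3}, "o": {"H": 0.2, "M": 0.8}}
states = [g + c for c in LEVEL_GIVEN_COPY for g in LEVEL_GIVEN_COPY[c]]
pi = {s: 0.5 * LEVEL_GIVEN_COPY[s[1]][s[0]] for s in states}
a = {k: {l: (0.9 if k[1] == l[1] else 0.1) * LEVEL_GIVEN_COPY[l[1]][l[0]]
         for l in states} for k in states}
e = {s: {s[0]: 1.0} for s in states}   # each state emits only its own symbol

print(viterbi("HHHHH", states, pi, a, e))   # ['H+', 'H+', 'H+', 'H+', 'H+']
print(viterbi("MMMMM", states, pi, a, e))   # ['Mo', 'Mo', 'Mo', 'Mo', 'Mo']
```

The predicted copy-number call at each position is the second character of the decoded state. Posterior decoding would replace the maximization with forward and backward sums and a per-position probability cut-off.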

An alternative inference method for HMMs when only emissions are given as training data, i.e., only GEP observations in training, is the Baum-Welch algorithm [32]. The Baum-Welch algorithm estimates the model parameters (transition and emission probabilities) together with the unknown CGH states by an iterative procedure. We chose not to use this algorithm, as there are many parameters in the model but relatively few data points at each gene position with which to estimate them. Instead, we used the Viterbi or Posterior algorithms, in which the true CGH states were used to guide the HMM prediction.

vCGH validation

The vCGH procedure is illustrated in Figure 2. The entire dataset was split into training and testing datasets. In the training dataset, the paired GEP and CGH data were used for HMM parameter estimation, and in the testing dataset, only the GEP data of a tumor sample were used to predict the CNAs. To validate vCGH, the predicted gain, loss or normal status of each gene was compared with that from the experimental CGH on the same tumor samples, using sensitivity, specificity and accuracy as criteria. The entire process was repeated and the model performance was evaluated by Leave-One-Out Cross Validation (LOOCV). Sensitivity, specificity and accuracy were calculated from the 2 × 2 contingency tables for gain and for loss. For example, in the contingency table for gain, true positive (TP) is the number of genes called as a gain by both CGH and vCGH, true negative (TN) is the number of genes not called as a gain by either CGH or vCGH, false positive (FP) is the number of genes called as a gain by vCGH but not by CGH, and false negative (FN) is the number of genes called as a gain by CGH but not by vCGH. Then, Sensitivity = TP/(TP+FN), Specificity = TN/(TN+FP), and Accuracy = (TP+TN)/(TP+TN+FP+FN). The same statistics were calculated for loss as well.
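The contingency-table statistics defined above can be computed as follows. This is a minimal sketch with hypothetical names (`gain_loss_metrics`, `target`); the per-gene calls in the usage example are invented.

```python
def gain_loss_metrics(cgh_calls, vcgh_calls, target):
    """Sensitivity, specificity and accuracy for one call type.

    cgh_calls, vcgh_calls: equally long sequences over {'+', '-', 'o'};
    target: '+' to score gains or '-' to score losses.
    """
    tp = tn = fp = fn = 0
    for truth, pred in zip(cgh_calls, vcgh_calls):
        if truth == target and pred == target:
            tp += 1
        elif truth != target and pred != target:
            tn += 1
        elif pred == target:
            fp += 1          # predicted by vCGH but not observed by CGH
        else:
            fn += 1          # observed by CGH but missed by vCGH

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }

# Invented example: six genes on one chromosome.
metrics = gain_loss_metrics("++oo-o", "+ooo-+", target="+")
print(metrics)   # sensitivity 0.5, specificity 0.75, accuracy 4/6
```

In a leave-one-out setting, these counts would be accumulated over all held-out samples before the ratios are taken.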

We also created two other methods to compare with vCGH, named rGEP (raw GEP) and sGEP (smoothing GEP), by simply mapping GEP status to CGH status without an intelligent learning and predicting process. By rGEP, we mean that a high expression status of a gene is mapped to a gain ("H" → "+"), low expression mapped to loss ("L" → "-"), and medium expression mapped to normal ("M" → "o"). In sGEP, a smoothing method (a multinomial model, as described below) was applied after rGEP to get a gain or loss status for a chromosomal cytoband, which contains a number of consecutive genes.
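For comparison, the rGEP baseline amounts to a fixed lookup with no learning. A minimal sketch follows; the names `RGEP_MAP` and `rgep` are hypothetical.

```python
# Direct mapping from expression status to copy-number status (rGEP baseline).
RGEP_MAP = {"H": "+", "L": "-", "M": "o"}

def rgep(gep):
    """Map each expression symbol to a copy-number call, gene by gene."""
    return [RGEP_MAP[symbol] for symbol in gep]

print(rgep("HLMH"))   # ['+', '-', 'o', '+']
```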

Smoothing algorithm

Since the gains and losses identified by our experimental CGH were at cytoband resolution, we also determined gains and losses at cytoband resolution for vCGH by applying a smoothing method. A multinomial probability was used to measure the likelihood of a cytoband harboring a gain or loss. In Eq. (3), L is the likelihood under a hypothesis H, where H_0 is the null hypothesis that "a cytoband is not harboring a gain or loss" and H_1 is the alternative hypothesis that "a cytoband is harboring a gain or loss"; n_+, n_- and n_o are the numbers of genes in the gain, loss or normal status, and n is the total number of genes on this cytoband (n = n_+ + n_- + n_o); θ_+, θ_- and θ_o are the corresponding multinomial parameters, which can be estimated using MLE as in Eq. (4). Under H_1, θ_{1,+}, θ_{1,-} and θ_{1,o} are estimated from the numbers of genes n_+, n_- and n_o on the cytoband; under H_0, θ_{0,+}, θ_{0,-} and θ_{0,o} are estimated from the numbers of genes N_+, N_- and N_o on the whole genome as the background (N = N_+ + N_- + N_o). The log-of-odds (LOD) score, which is the log10 of the ratio of the two likelihoods, was used to measure the likelihood that a cytoband harbors a gain or loss, as described in Eq. (5). The higher the LOD score, the more likely a cytoband harbors a genomic gain or loss.

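\[ L(H) = \frac{n!}{n_+!\, n_-!\, n_o!}\; \theta_+^{\,n_+}\, \theta_-^{\,n_-}\, \theta_o^{\,n_o} \tag{3} \]

\[ \theta_{1,s} = \frac{n_s}{n}, \qquad \theta_{0,s} = \frac{N_s}{N}, \qquad s \in \{+, -, o\} \tag{4} \]

\[ \mathrm{LOD} = \log_{10} \frac{L(H_1)}{L(H_0)} \tag{5} \]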
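The LOD score described in this section can be computed in log space to avoid overflow in the factorials. The sketch below is an assumption-laden illustration: the function names and the example counts are invented, and counts are ordered as (gain, loss, normal).

```python
import math

def log10_multinomial(counts, thetas):
    """log10 of the multinomial likelihood of counts under parameters thetas."""
    n = sum(counts)
    log_l = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    for c, theta in zip(counts, thetas):
        if c > 0:
            if theta == 0:
                return float("-inf")   # observed a category the model forbids
            log_l += c * math.log(theta)
    return log_l / math.log(10)

def lod_score(band_counts, genome_counts):
    """LOD of H1 (band-specific rates) against H0 (genome-wide background rates)."""
    n, big_n = sum(band_counts), sum(genome_counts)
    theta1 = [c / n for c in band_counts]          # estimated from the cytoband
    theta0 = [c / big_n for c in genome_counts]    # estimated from the whole genome
    return (log10_multinomial(band_counts, theta1)
            - log10_multinomial(band_counts, theta0))

# Invented counts: a cytoband with 8 gain, 1 loss and 1 normal gene call,
# against a genome-wide background of 100 gain, 100 loss and 800 normal calls.
print(lod_score((8, 1, 1), (100, 100, 800)))   # about 6.32, above a cutoff of 2
```

The multinomial coefficient cancels in the ratio, so the score depends only on how far the band's proportions depart from the background. The score alone does not say whether the departure is a gain or a loss; that has to be read from the band's counts.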

Sample description and data processing

The GEP and CGH experiments were performed on 190 DLBCLs [17] and 64 MCLs [18]. Survival data were also available for the 190 DLBCL patients, who were all treated with CHOP (a regimen of cyclophosphamide, doxorubicin, vincristine and prednisone). The GEP data were obtained using Affymetrix HG-U133 plus 2 arrays and normalized (global median normalization) using BRB-Array Tool [40]. The gene expression values (a continuous variable) were discretized into three distinct levels, "H", "L" or "M", representing high, low or medium gene expression, respectively. A 1.5-fold change was used as the threshold to determine high (>1.5-fold increase), low (>1.5-fold decrease) or medium (between a 1.5-fold increase and a 1.5-fold decrease) expression of a gene in a tumor, as compared to the median expression of the gene across the tumor cohort. The CGH experiments were performed with Vysis CGH kits (Downers Grove, IL). aCGH-Smooth [41] was used to determine breakpoints and relative levels of DNA copy number. The company-recommended tumor-to-normal signal ratios of 1.25 and 0.75 were used to segregate gain (>1.25), loss (<0.75) and normal (between 0.75 and 1.25) chromosomal regions. Small-sized chromosomes and sex chromosomes (chromosomes 19-22, X and Y) were excluded from the study due to technical limitations and lack of gender data.
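The two thresholding rules above can be sketched as follows. The function names are hypothetical, and the code assumes that expression values and CGH ratios are on a linear (not log-transformed) scale; the source does not state the scale used at this step.

```python
def discretize_expression(value, cohort_median, fold=1.5):
    """Return 'H', 'L' or 'M' relative to the gene's median across the cohort.

    Assumes linear-scale, strictly positive expression values.
    """
    ratio = value / cohort_median
    if ratio > fold:
        return "H"           # more than 1.5-fold increase
    if ratio < 1 / fold:
        return "L"           # more than 1.5-fold decrease
    return "M"

def call_copy_number(signal_ratio, gain_cutoff=1.25, loss_cutoff=0.75):
    """Return '+', '-' or 'o' from a tumor-to-normal CGH signal ratio."""
    if signal_ratio > gain_cutoff:
        return "+"
    if signal_ratio < loss_cutoff:
        return "-"
    return "o"

print([discretize_expression(v, 100.0) for v in (300.0, 50.0, 120.0)])  # ['H', 'L', 'M']
print([call_copy_number(r) for r in (1.3, 0.7, 1.0)])                   # ['+', '-', 'o']
```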

When we refer to a gene on GEP, we actually refer to the probeset-level data, without averaging multiple probesets within the same gene. A probeset in the GEP data was marked with "+" or "-" if its chromosomal location was covered by the start and the end of a gain or a loss region from the CGH data; otherwise it was marked with "o", representing no coverage by a gain or loss region. The chromosomal locations of probesets, genes and cytobands were obtained from Affymetrix probeset alignments and the NCBI Human Genome database Build 36.1. The vCGH model is based on HMMs that consider expression probesets as a sequence of hidden states without considering the distance between probesets. The vast majority of the expression probesets were near the 3' end of the coding region, and probesets located in other regions were treated equally. An LOD score of 2 was used as the cutoff to call a gain or loss for a cytoband after the smoothing algorithm.
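The labeling rule for probesets can be sketched as an interval check. This is an illustration under an assumption: "covered" is interpreted here as the probeset lying entirely inside a CGH segment on the same chromosome, and the function name and coordinates are invented.

```python
def label_probeset(start, end, segments):
    """Return '+', '-' or 'o' for a probeset on one chromosome.

    segments: list of (segment_start, segment_end, status) tuples for the CGH
    gain ('+') and loss ('-') regions of that chromosome in one sample.
    Assumption: a probeset is labeled only if it lies fully inside a segment.
    """
    for seg_start, seg_end, status in segments:
        if seg_start <= start and end <= seg_end:
            return status
    return "o"

# Invented coordinates for illustration.
segments = [(1_000_000, 5_000_000, "+"), (9_000_000, 12_000_000, "-")]
print(label_probeset(2_000_000, 2_000_500, segments))    # '+'
print(label_probeset(10_000_000, 10_000_400, segments))  # '-'
print(label_probeset(7_000_000, 7_000_300, segments))    # 'o'
```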

Association of gene expression and survival time with recurrent abnormalities

To determine whether the additional recurrent abnormalities identified by vCGH are associated with altered gene expression, we performed a permutation test as follows. 1) Consider all probesets (genes) that are in the region of a recurrent abnormality. 2) For each probeset, calculate a one-sided Student's t-test p-value for the difference in gene expression between the samples that exhibit the recurrent abnormality and those that are wild type for that abnormality, in the direction of increased gene expression being associated with increased copy number, or decreased gene expression being associated with decreased copy number. 3) Generate a statistic equal to the sum of the log(p-values) for the genes in the region. 4) Randomly permute the sample labels (gain, loss or normal according to the abnormality) and repeat steps 1-3 1000 times. 5) Count how many times the unpermuted statistic is smaller than the same statistic calculated with the permuted data. For example, the significance of the association between a recurrent abnormality and the gene expression in the region is 0.05 if, 95% of the time, the sum of log(p-values) for the real data is less than that for the permuted data.
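The label-permutation procedure in steps 1-5 can be sketched with the standard library alone. One substitution is made and should be read as such: instead of summing log p-values from one-sided t-tests, the sketch sums the Welch t-statistics themselves (oriented so that larger means stronger support), which avoids needing a t-distribution CDF. All names and the example data are hypothetical.

```python
import math
import random
from statistics import mean, variance

def welch_t(a, b):
    """Welch t-statistic for mean(a) - mean(b).

    Needs at least two values per group and non-zero variance in at least one.
    """
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

def region_statistic(expression, altered, direction=1):
    """Sum of per-probeset t-statistics over the region.

    expression: {probeset: [value per sample]}; altered: [bool per sample];
    direction: +1 for a gain region, -1 for a loss region.
    """
    total = 0.0
    for values in expression.values():
        hit = [v for v, flag in zip(values, altered) if flag]
        rest = [v for v, flag in zip(values, altered) if not flag]
        total += direction * welch_t(hit, rest)
    return total

def permutation_p_value(expression, altered, direction=1, n_perm=1000, seed=0):
    """Fraction of label permutations at least as extreme as the real labels."""
    rng = random.Random(seed)
    observed = region_statistic(expression, altered, direction)
    labels = list(altered)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if region_statistic(expression, labels, direction) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_perm

# Invented data: 12 samples, the first 6 carry a predicted gain of the region.
expression = {
    "probeset_1": [5.1, 5.3, 4.9, 5.2, 5.0, 5.4, 1.0, 1.2, 0.9, 1.1, 1.3, 0.8],
    "probeset_2": [7.0, 7.2, 6.8, 7.1, 6.9, 7.3, 3.0, 3.1, 2.9, 3.2, 2.8, 3.3],
}
altered = [True] * 6 + [False] * 6
p_value = permutation_p_value(expression, altered, direction=1, n_perm=200)
print(p_value)
```

Swapping the statistic back to a sum of log p-values changes only `region_statistic` and the direction of the comparison; the permutation loop stays the same.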

In order to determine whether the additional recurrent abnormalities identified by vCGH were associated with survival time or not, we performed survival analysis on the patient groups defined by the recurrent abnormality. Overall survival (OS) distributions were estimated using the Kaplan-Meier method and the patient groups were compared with the log-rank test.
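For readers unfamiliar with the Kaplan-Meier estimator, the following minimal sketch computes a survival curve for one patient group. It is a generic textbook implementation with invented follow-up data, not the analysis code used in the study, and it does not implement the log-rank comparison between groups.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times: follow-up time per patient; events: 1 if death was observed at that
    time, 0 if the patient was censored. Returns [(time, survival)] at each
    time where at least one death occurred.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        at_this_time = [event for time, event in data if time == t]
        deaths = sum(at_this_time)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= len(at_this_time)   # deaths and censored both leave the risk set
        i += len(at_this_time)
    return curve

# Invented follow-up times (years); the second patient is censored.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))   # [(1, 0.75), (3, 0.375), (4, 0.0)]
```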

The vCGH source code and the GEP and CGH data for DLBCL and MCL can be accessed at: http://vcgh.sourceforge.net.

Results and discussion

Using cross-validation, vCGH was applied to 190 DLBCLs and 64 MCLs for which both GEP and CGH data were available [17, 18]. vCGH was first trained on the paired GEP and CGH data from the same tumor samples in the training dataset, and then applied to the GEP data of a new tumor sample in the testing dataset to predict its CNAs. The predicted gains and losses were compared with those identified by experimental CGH at both the gene level and the cytoband level.

Gene-level validation of vCGH

We first evaluated vCGH and, for comparison, rGEP and sGEP, using sensitivity, specificity and accuracy against experimental CGH in predicting gains and losses for all the DLBCL or MCL samples using LOOCV. Tables 1 and 2 summarize the sensitivity, specificity and accuracy over all chromosomes on the DLBCL and MCL datasets, respectively. Figures 3 and 4 show the performance on individual chromosomes for the DLBCL and MCL datasets, respectively.

Table 1 Sensitivity, specificity and accuracy of vCGH*, rGEP and sGEP on DLBCL dataset

Table 2 Sensitivity, specificity and accuracy of vCGH*, rGEP and sGEP on MCL dataset

On the DLBCL dataset, each box in Figure 3 represents one chromosome. Good predictions lie at the upper right corner, where both sensitivity and specificity are high, whereas poor predictions lie at the lower left corner. Figure 3 shows that vCGH, with both the Viterbi (in red) and Posterior (in multiple colors representing different posterior probability cut-offs) methods, predicts better than rGEP (in light green) and sGEP (in dark green), lying closest to the upper right corner. On most of the chromosomes, vCGH achieved 70-80% sensitivity and 90%-95% specificity for both gain and loss prediction, while sensitivity was much lower in rGEP (30%) and sGEP (40%-50%), and specificity was also lower in rGEP (80%) and sGEP (90%). We also observed that vCGH Viterbi and vCGH Posterior had similar performance (the Viterbi point lay among the series of Posterior points), and that, as expected, in vCGH Posterior specificity increases and sensitivity decreases as the posterior probability cut-off increases. The results on the MCL dataset were similar to those on the DLBCL dataset (Figure 4). On average, vCGH achieved 75% sensitivity and 90% specificity for gain, and 60% sensitivity and 90% specificity for loss, while sensitivity was 40% for gain and 30% for loss in rGEP, and 40% for gain and 50% for loss in sGEP, and specificity was 70% for gain and 80% for loss in rGEP, and 85% for gain and 90% for loss in sGEP. The performance of vCGH, rGEP and sGEP is summarized in Tables 1 and 2. The best predictions are highlighted in bold; all of them fell into the vCGH category except one, where sGEP was marginally better than vCGH. Tables S1 and S2 in Additional file 2 show the detailed sensitivity, specificity and accuracy of vCGH on each chromosome.

These results suggest that vCGH was able to capture the hidden genomic CNA information buried in the GEP data, whereas rGEP and sGEP, which directly map GEP status to CGH status without any learning process, did not work well. We noticed that vCGH did not predict well on some chromosomes, such as gain on chromosome 4 and loss on chromosome 11 for DLBCL (Figure 3), and gain on chromosomes 1, 6, 9, 10 and 13 and loss on chromosomes 4, 5, 15 and 18 for MCL (Figure 4). This is due to infrequent aberrations, and hence insufficient training data, for the gains or losses on those chromosomes. For example, in the 190 DLBCLs, the number of samples with chr4 gain is n = 7 and with chr11 loss is n = 1; in the 64 MCLs, the numbers of samples with gains are chr1 (n = 1), chr6 (n = 3), chr9 (n = 1), chr10 (n = 2) and chr13 (n = 1), and with losses are chr4 (n = 2), chr5 (n = 1), chr15 (n = 1) and chr18 (n = 2).

Cytoband-level validation of vCGH

Cytobands are defined as chromosomal areas that are distinguishable from other segments by appearing darker or lighter under one or more banding techniques used for karyotype description. Our experimental CGH detected chromosomal gains and losses on cytobands. To compare vCGH with experimental CGH at the same resolution, we also determined the gains and losses on cytobands by applying a smoothing algorithm in vCGH, as described in the Methods section.

Figures 5 and 6 showed the results of cytoband level gains and losses on DLBCL and MCL, respectively. The two vCGH decoding methods, Viterbi and Posterior, were shown in panels A and B, respectively. In each panel, loss frequencies were shown on left-sided bars and gain frequencies on right-sided bars. We found in Posterior decoding, as expected, the frequencies of gains and losses decrease as posterior probability increases (p = 0.5, 0.6, 0.7, 0.8 and 0.9) (panel B in Figures 5 and 6), and the frequencies at different posterior probability cut-offs are highly correlated, with Pearson's correlation coefficients around 0.99 (Tables 3 and 4). Comparing the results from Viterbi (panel A in Figures 5 and 6) and Posterior (panel B in Figures 5 and 6), a high concordance was also observed with Pearson's correlation coefficients around 0.95 (Tables 3 and 4). In panel C (Figures 5 and 6), the Viterbi method was used to represent vCGH to compare with the experimental CGH side by side. Gains and losses were shown separately. CGH results were above the X-axis in yellow and vCGH results were below the X-axis in red. Apparently, the majority of the recurrent gains and losses predicted by vCGH are in good concordance with those identified by experimental CGH, such as gains of 1q, 2p16-p14, 3q27-q29, 6p25-p21, 7, 11q, 12 and 18q21 and losses of 6q, 8p23-p21, 9p24-p21 and 17p13 on DLBCL. The Pearson's correlation coefficients between vCGH and CGH are around 0.8 for gains and losses (Tables 3 and 4).

Table 3 Correlation of gain and loss frequencies* on cytobands among CGH, vCGH Viterbi and vCGH Posterior on DLBCL dataset

Table 4 Correlation of gain and loss frequencies* on cytobands among CGH, vCGH Viterbi and vCGH Posterior on MCL dataset

As described in the model design in the Methods section, with intra- and inter-sub-chain Markov transitions, vCGH can identify alternating gain, loss or normal DNA segments automatically. vCGH is basically a segment-level prediction tool, and the genes within a segment can be considered as the units of that segment. The sensitivity, specificity and accuracy of vCGH at the gene level and at the cytoband level are compared in Tables S3 and S4 (Additional file 2) for DLBCL and MCL, respectively. As expected, the gene-level and cytoband-level vCGH gave very similar prediction sensitivity, specificity and accuracy.

Additional recurrent gains and losses predicted by vCGH

In addition to the common recurrent gains and losses between vCGH and CGH, vCGH also predicted some recurrent gains or losses that were not observed in CGH, such as gains of 1p (in 33 out of 190 samples), 2q (39/190) and 6q (37/190) and losses of 1q (25/190), 6p (44/190) and 8q (19/190) on the DLBCL dataset (Figure 5C). We checked those additional recurrent abnormalities predicted by vCGH and the corresponding gene expression within those regions in Figure 7. We observed higher expression of genes for the gain region and lower expression of genes for the loss region, as compared to the normal group.

We further evaluated the significance of the association between each recurrent gain or loss region and altered gene expression by the permutation test described in the Methods section. We performed 1000 permutations for each region and found that in all of the 1000 permutations, the test statistic for the real data was less than the test statistic for the permuted data (p < 0.001, Figure 8A). We also examined the association of those regions with the clinical characteristics of the patients. We plotted the overall survival (OS) time of the DLBCL patients characterized by those abnormalities, and found that three of those regions are significantly correlated with the survival time of the patient groups: 1q (p = 0.025), 6q (p = 0.04) and 8q (p = 0.009) (Figure 8B). These associations suggest that the additional recurrent abnormalities identified by vCGH may be functionally important, since the genes in those regions have consistently elevated or decreased levels of expression and reflect clinical characteristics of DLBCL patients.

Experimental CGH might report false negative CNAs for several reasons: CGH kits have technical limitations; the optimal cut-off values for calling a "gain" or "loss" may vary among samples; and normal cells in stromal or other reactive elements of the tumor microenvironment may contribute to the tumor-versus-normal signal ratio. Beyond that, one reason that vCGH identified additional recurrent abnormalities is that biological mechanisms other than chromosomal structural changes can control the expression of a group of syntenic genes. For example, epigenetic modifications, such as DNA methylation and histone modifications, may turn genes on and off independently of structural changes in the DNA. It may be important to check the predicted amplified or deleted regions of these tumor samples for epigenetic alterations. Transcriptional units can also be turned on or off as a group of spatially contiguous genes, which may resemble, but is not due to, chromosomal structural changes. As another example, UniParental Disomy (UPD) occurs when a cell has two copies of a chromosome, or part of a chromosome, from one parent and no copies from the other parent. UPD can result in over- or under-expression of genes in the affected regions if these genes have undergone genomic imprinting. Therefore, vCGH may identify not only the gain and loss regions caused by chromosomal structural changes, but also regions that are activated (an apparent "gain") or silenced (an apparent "loss") by other biological mechanisms. Those recurrent abnormalities may also be important to cancer biology and the clinical outcome of the patients. Additionally, with increasing evidence of polymorphic variation in the genome, it is becoming more important to look critically at structural changes and their influence on gene expression status.

vCGH prediction on an independent dataset of 176 DLBCLs

We applied vCGH, trained on the paired GEP and CGH data of the 190 DLBCLs, to an independent dataset of 176 DLBCLs with GEP data [42]. The GEP data of the 176 DLBCLs were downloaded from http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=102 [42]. Since CGH data were not available for the 176 DLBCLs, we compared the vCGH-predicted CNAs for the 176 DLBCLs with the CGH-identified CNAs for the 190 DLBCLs, because a specific tumor type would feature specific genetic abnormalities even in different patient cohorts. Figure 9 shows the prediction results on the 176 DLBCLs in comparison with the CGH data on the 190 DLBCLs. Since the two patient cohorts are completely independent, we observed some differences in recurrent abnormalities between them, especially in losses. However, we did observe overall similarity between the two cohorts, such as gains of 1q, 2p14-p16, chr3, chr5, 6p, chr7 and chr9, and losses of chr4, 6q, 13q and 17p. Those recurrent regions have also been reported in another independent aCGH study on 99 DLBCLs [43].

vCGH has some limitations that stem from its use of transcript-based GEP data. For example, it may not predict well for regions with few genes (such as "gene deserts"), or for regions whose genes are generally not expressed at a sufficiently high level on GEP even in the normal state. vCGH is also limited by the design of the GEP arrays. For example, Affymetrix HG-U133 plus 2 microarrays have no probes on the p arms of chromosomes 13, 14, 15, 21 and 22. Therefore, vCGH cannot predict gains or losses in those chromosomal regions.
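
The array-design limitation can be checked before any prediction is attempted. The following minimal sketch, assuming a hypothetical per-probe list of chromosome-arm labels, reports the arms that have too few probes to support a call. The function name and the `min_probes` threshold are illustrative and are not part of vCGH; the toy annotation is invented.

```python
from collections import Counter


def predictable_arms(probe_arms, all_arms, min_probes=1):
    """Split chromosome arms by whether they carry enough probes.

    probe_arms: iterable with one arm label per probe (e.g. "13q").
    all_arms:   arms of interest, in the order they should be reported.
    min_probes: assumed minimum number of probes needed for a call.
    """
    counts = Counter(probe_arms)  # missing arms count as 0
    covered = [arm for arm in all_arms if counts[arm] >= min_probes]
    uncovered = [arm for arm in all_arms if counts[arm] < min_probes]
    return covered, uncovered


# Toy annotation only: a handful of invented probes.
probe_arms = ["13q", "13q", "14q", "1p", "1q", "1q"]
arms = ["1p", "1q", "13p", "13q", "14p", "14q"]

covered, uncovered = predictable_arms(probe_arms, arms)
print(uncovered)  # ['13p', '14p']
```

Raising `min_probes` would also flag sparsely covered regions such as gene deserts, at the cost of excluding more of the genome from prediction.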

Conclusions

We proposed a novel computational approach, vCGH, to predict genetic abnormalities from the GEP data in tumors. In addition to exploiting the wealth of GEP data already publicly available, vCGH takes advantage of paired GEP and CGH data on the same tumor samples during training to infer functionally relevant CNA regions. CNA regions identified by CGH alone in principle define only chromosomal structural changes; however, the functional effects of CNAs can be reflected by altered gene expression and might be more important to tumorigenesis. vCGH was constructed on HMMs to capture two primary sources of uncertainty embedded in genomic data: the significant but subtle correlations between GEP and CGH, and the sequential transitions of CNAs along a chromosome. We applied vCGH to two large cohorts of lymphoma samples on which both GEP and CGH experiments were performed, comprising 190 DLBCLs and 64 MCLs. Using cross-validation, vCGH achieved 80% sensitivity, 90% specificity and 90% accuracy in predicting gains and losses as compared to the experimental CGH on the same tumor samples. In addition to the recurrent gains and losses that are concordant with those found by experimental CGH, vCGH also identified a few recurrent abnormalities not shown by CGH, such as gains of 6q and losses of 1q and 8q in DLBCL, and those regions are significantly correlated with the patients' outcomes. As vCGH utilizes both genomic and transcriptomic data, it can identify not only gains and losses caused by chromosomal structural changes, but also abnormal genomic regions activated or silenced by other mechanisms. We presented the results of vCGH on lymphoma samples, but vCGH is a general computational tool that can be applied to other tumor types and may significantly enhance the identification of functionally important abnormal genomic regions in cancer research.
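
As a rough illustration of how an HMM can combine the two sources of uncertainty named above, the sketch below decodes a most-likely sequence of copy-number states along one chromosome with the Viterbi algorithm. The three states, the discretized expression symbols ("low", "mid", "high") and every probability are illustrative assumptions; they are not the trained vCGH parameters and are not meant to reproduce the exact vCGH model structure.

```python
import math

STATES = ("loss", "normal", "gain")

# Assumed toy parameters, chosen only for illustration.
START_P = {"loss": 0.1, "normal": 0.8, "gain": 0.1}
# Transitions: copy-number states tend to persist along a chromosome.
TRANS_P = {s: {t: (0.9 if s == t else 0.05) for t in STATES} for s in STATES}
# Emissions: expression level is correlated with, not determined by, the state.
EMIT_P = {
    "loss": {"low": 0.6, "mid": 0.3, "high": 0.1},
    "normal": {"low": 0.2, "mid": 0.6, "high": 0.2},
    "gain": {"low": 0.1, "mid": 0.3, "high": 0.6},
}


def viterbi(observations, start_p=START_P, trans_p=TRANS_P, emit_p=EMIT_P):
    """Return the most likely hidden-state path for the observed symbols."""
    # Log score of the best path ending in each state at the first position.
    scores = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s.
            prev = max(STATES, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][obs])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Genes ordered by chromosomal position, with discretized expression levels.
print(viterbi(["mid", "mid", "high", "mid", "mid"]))
print(viterbi(["high"] * 5))
```

Under these toy parameters a single "high" reading flanked by "mid" readings is decoded as "normal" throughout, while a sustained run of "high" readings is decoded as "gain". This is the smoothing effect that the transition probabilities are meant to provide.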

References

Cahill DP, Kinzler KW, Vogelstein B, Lengauer C: Genetic instability and darwinian selection in tumours. Trends Cell Biol. 1999, 9 (12): M57-60. 10.1016/S0962-8924(99)01661-X.
Vogelstein B, Kinzler KW: Cancer genes and the pathways they control. Nature medicine. 2004, 10 (8): 789-799. 10.1038/nm1087.
Vogelstein B, Fearon ER, Hamilton SR, Kern SE, Preisinger AC, Leppert M, Nakamura Y, White R, Smits AM, Bos JL: Genetic alterations during colorectal-tumor development. The New England journal of medicine. 1988, 319 (9): 525-532. 10.1056/NEJM198809013190901.
du Manoir S, Speicher MR, Joos S, Schrock E, Popp S, Dohner H, Kovacs G, Robert-Nicoud M, Lichter P, Cremer T: Detection of complete and partial chromosome gains and losses by comparative genomic in situ hybridization. Human genetics. 1993, 90 (6): 590-610.
Kallioniemi A, Kallioniemi OP, Sudar D, Rutovitz D, Gray JW, Waldman F, Pinkel D: Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science. 1992, 258 (5083): 818-821. 10.1126/science.1359641.
Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, et al: Assembly of microarrays for genome-wide measurement of DNA copy number. Nature genetics. 2001, 29 (3): 263-264. 10.1038/ng754.
Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, Collins C, Kuo WL, Chen C, Zhai Y, et al: High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nature genetics. 1998, 20 (2): 207-211. 10.1038/2524.
Solinas-Toldo S, Lampel S, Stilgenbauer S, Nickolenko J, Benner A, Dohner H, Cremer T, Lichter P: Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Genes, chromosomes & cancer. 1997, 20 (4): 399-407.
Pollack JR, Perou CM, Alizadeh AA, Eisen MB, Pergamenschikov A, Williams CF, Jeffrey SS, Botstein D, Brown PO: Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nature genetics. 1999, 23 (1): 41-46. 10.1038/12640.
Stanford Microarray Database. [http://smd.stanford.edu/]
Gene Expression Omnibus, NCBI. [http://www.ncbi.nlm.nih.gov/geo/]
UPenn RAD database. [http://www.cbil.upenn.edu/RAD/php/index.php]
caArray, NCI. [https://cabig.nci.nih.gov/tools/caArray]
ArrayExpress at EBI. [http://www.ebi.ac.uk/microarray-as/ae/]
SKY/M-FISH & CGH Database at NCBI. [http://www.ncbi.nlm.nih.gov/sky/]
Lenz G, Wright GW, Emre NC, Kohlhammer H, Dave SS, Davis RE, Carty S, Lam LT, Shaffer AL, Xiao W, et al: Molecular subtypes of diffuse large B-cell lymphoma arise by distinct genetic pathways. Proceedings of the National Academy of Sciences of the United States of America. 2008, 105 (36): 13520-13525. 10.1073/pnas.0804295105.
Bea S, Zettl A, Wright G, Salaverria I, Jehn P, Moreno V, Burek C, Ott G, Puig X, Yang L, et al: Diffuse large B-cell lymphoma subgroups have distinct genetic profiles that influence tumor biology and improve gene-expression-based survival prediction. Blood. 2005, 106 (9): 3183-3190. 10.1182/blood-2005-04-1399.
Salaverria I, Zettl A, Bea S, Moreno V, Valls J, Hartmann E, Ott G, Wright G, Lopez-Guillermo A, Chan WC, et al: Specific secondary genetic alterations in mantle cell lymphoma provide prognostic information independent of the gene expression-based proliferation signature. J Clin Oncol. 2007, 25 (10): 1216-1222. 10.1200/JCO.2006.08.4251.
Iqbal J, Kucuk C, Deleeuw RJ, Srivastava G, Tam W, Geng H, Klinkebiel D, Christman JK, Patel K, Cao K, et al: Genomic analyses reveal global functional alterations that promote tumor growth and novel tumor suppressor genes in natural killer-cell malignancies. Leukemia. 2009, 23 (6): 1139-1151. 10.1038/leu.2009.3.
Virtaneva K, Wright FA, Tanner SM, Yuan B, Lemon WJ, Caligiuri MA, Bloomfield CD, de La Chapelle A, Krahe R: Expression profiling reveals fundamental biological differences in acute myeloid leukemia with isolated trisomy 8 and normal cytogenetics. Proceedings of the National Academy of Sciences of the United States of America. 2001, 98 (3): 1124-1129. 10.1073/pnas.98.3.1124.
Hyman E, Kauraniemi P, Hautaniemi S, Wolf M, Mousses S, Rozenblum E, Ringner M, Sauter G, Monni O, Elkahloun A, et al: Impact of DNA amplification on gene expression patterns in breast cancer. Cancer Res. 2002, 62 (21): 6240-6245.
Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Borresen-Dale AL, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences of the United States of America. 2002, 99 (20): 12963-12968. 10.1073/pnas.162471999.
Phillips JL, Hayward SW, Wang Y, Vasselli J, Pavlovich C, Padilla-Nash H, Pezullo JR, Ghadimi BM, Grossfeld GD, Rivera A, et al: The consequences of chromosomal aneuploidy on gene expression profiles in a cell line model for prostate carcinogenesis. Cancer Res. 2001, 61 (22): 8143-8149.
Varis A, Wolf M, Monni O, Vakkari ML, Kokkola A, Moskaluk C, Frierson H, Powell SM, Knuutila S, Kallioniemi A, et al: Targets of gene amplification and overexpression at 17q in gastric cancer. Cancer Res. 2002, 62 (9): 2625-2629.
Linn SC, West RB, Pollack JR, Zhu S, Hernandez-Boussard T, Nielsen TO, Rubin BP, Patel R, Goldblum JR, Siegmund D, et al: Gene expression patterns and gene copy number changes in dermatofibrosarcoma protuberans. The American journal of pathology. 2003, 163 (6): 2383-2395. 10.1016/S0002-9440(10)63593-6.
Hughes TR, Roberts CJ, Dai H, Jones AR, Meyer MR, Slade D, Burchard J, Dow S, Ward TR, Kidd MJ, et al: Widespread aneuploidy revealed by DNA microarray expression profiling. Nature genetics. 2000, 25 (3): 333-337. 10.1038/77116.
Nigro JM, Misra A, Zhang L, Smirnov I, Colman H, Griffin C, Ozburn N, Chen M, Pan E, Koul D, et al: Integrated array-comparative genomic hybridization and expression array profiles identify clinically relevant molecular subtypes of glioblastoma. Cancer Res. 2005, 65 (5): 1678-1686. 10.1158/0008-5472.CAN-04-2921.
Clark J, Edwards S, John M, Flohr P, Gordon T, Maillard K, Giddings I, Brown C, Bagherzadeh A, Campbell C, et al: Identification of amplified and expressed genes in breast cancer by comparative hybridization onto microarrays of randomly selected cDNA clones. Genes, chromosomes & cancer. 2002, 34 (1): 104-114.
Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, et al: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000, 403 (6769): 503-511. 10.1038/35000501.
Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Muller-Hermelink HK, Smeland EB, Giltnane JM, et al: The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. The New England journal of medicine. 2002, 346 (25): 1937-1947. 10.1056/NEJMoa012914.
Alizadeh A, Eisen M, Davis RE, Ma C, Sabet H, Tran T, Powell JI, Yang L, Marti GE, Moore DT, et al: The lymphochip: a specialized cDNA microarray for the genomic-scale analysis of gene expression in normal and malignant lymphocytes. Cold Spring Harb Symp Quant Biol. 1999, 64: 71-78. 10.1101/sqb.1999.64.71.
Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis: probabilistic models of proteins and nucleic acids. 1998, New York: Cambridge University Press
Marioni JC, Thorne NP, Tavare S: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics (Oxford, England). 2006, 22 (9): 1144-1146. 10.1093/bioinformatics/btl089.
Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov Models Approach to the Analysis of Array CGH Data. J Multivariate Anal. 2004, 90: 132-153. 10.1016/j.jmva.2004.02.008.
Hertzberg L, Betts DR, Raimondi SC, Schafer BW, Notterman DA, Domany E, Izraeli S: Prediction of chromosomal aneuploidy from gene expression data. Genes, chromosomes & cancer. 2007, 46 (1): 75-86.
Nilsson B, Johansson M, Heyden A, Nelander S, Fioretos T: An improved method for detecting and delineating genomic regions with altered gene expression in cancer. Genome Biol. 2008, 9 (1): R13. 10.1186/gb-2008-9-1-r13.
Geng H, Iqbal J, Deng X, Chan WC, Ali HH: Virtual CGH: Prediction of Novel Regions of Chromosomal Alterations in Natural Killer Cell Lymphoma from Gene Expression Profiling. Proceedings of the 40th Annual Hawaii International Conference on System Sciences (HICSS'07). 2007, 129a.
Geng H, Ali HH, Chan WC: A Hidden Markov Model Approach for Prediction of Genomic Alterations from Gene Expression Profiling. Proceedings of the 4th International Symposium on Bioinformatics Research and Applications (ISBRA 2008), Lecture Notes in Computer Science 4983. 2008, 414-425.
Geng H, Chan WC, Ali HH: A Computational Method to Predict DNA Copy Number Alterations from Gene Expression Data in Tumor Cases. Proceedings of the 42nd Annual Hawaii International Conference on System Sciences (HICSS'09). 2009, 1-10.
BRB-Array Tool. [http://linus.nci.nih.gov/BRB-ArrayTools.html]
Jong K, Marchiori E, Meijer G, Vaart AV, Ylstra B: Breakpoint identification and smoothing of array comparative genomic hybridization data. Bioinformatics (Oxford, England). 2004, 20 (18): 3636-3637. 10.1093/bioinformatics/bth355.
Monti S, Savage KJ, Kutok JL, Feuerhake F, Kurtin P, Mihm M, Wu B, Pasqualucci L, Neuberg D, Aguiar RC, et al: Molecular profiling of diffuse large B-cell lymphoma identifies robust subtypes including one characterized by host inflammatory response. Blood. 2005, 105 (5): 1851-1861. 10.1182/blood-2004-07-2947.
Tagawa H, Suguro M, Tsuzuki S, Matsuo K, Karnan S, Ohshima K, Okamoto M, Morishima Y, Nakamura S, Seto M: Comparison of genome profiles for identification of distinct subgroups of diffuse large B-cell lymphoma. Blood. 2005, 106 (5): 1770-1777. 10.1182/blood-2005-02-0542.

Pre-publication history

The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1755-8794/4/32/prepub

Acknowledgements

This work was supported in part by an NIH grant P20 RR16469 from the INBRE program of the National Center for Research Resources, and by U.S. Public Health Service grants CA36727, CA84967 and U01 CA114778 from the National Cancer Institute, Department of Health and Human Services. HG was supported by the Bukey Fellowship and the Blanche Widaman Fellowship from the University of Nebraska Medical Center.

Author information

Authors and Affiliations

Department of Computer Science, University of Nebraska at Omaha, Omaha, NE, 68182, USA
Huimin Geng & Hesham H Ali
Department of Pathology and Microbiology, University of Nebraska Medical Center, Omaha, NE, 68198, USA
Huimin Geng, Javeed Iqbal & Wing C Chan
Institute for Computational Biomedicine, Weill Medical College of Cornell University, New York, NY, 10065, USA
Huimin Geng

Authors

Huimin Geng
Javeed Iqbal
Wing C Chan
Hesham H Ali

Corresponding authors

Correspondence to Wing C Chan or Hesham H Ali.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

HG designed the study, implemented the model, performed the data analysis and drafted the manuscript. JI assisted in GEP and CGH data analysis. WCC contributed to the conceptual development of the model, provided the GEP and CGH data and supervised the study. HHA contributed to the algorithm design and supervised the study. All authors read and approved the final manuscript.

Electronic supplementary material

Additional file 1: Viterbi, Forward and Backward Algorithms. Word DOC file. (DOC 80 KB)

Additional file 2: Supplemental Tables. Word DOC containing Supplemental table S1, S2, S3 and S4. (DOC 300 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Geng, H., Iqbal, J., Chan, W.C. et al. Virtual CGH: an integrative approach to predict genetic abnormalities from gene expression microarray data applied in lymphoma. BMC Med Genomics 4, 32 (2011). https://doi.org/10.1186/1755-8794-4-32

Received: 23 June 2010
Accepted: 12 April 2011
Published: 12 April 2011
DOI: https://doi.org/10.1186/1755-8794-4-32
