Analysis of DNA copy number alterations and gene expression changes in human samples have been used to find potential target genes in complex diseases. Recent studies have combined these two types of data using different strategies, but focusing on finding gene-based relationships. However, it has been proposed that these data can be used to identify key genomic regions, which may enclose causal genes under the assumption that disease-associated gene expression changes are caused by genomic alterations.
Following this proposal, we undertake a new integrative analysis of genome-wide expression and copy number datasets. The analysis is based on the combined location of both types of signals along the genome. Our approach takes into account the genomic location in the copy number (CN) analysis and also in the gene expression (GE) analysis. To achieve this we apply a segmentation algorithm to both types of data using paired samples. Then, we perform a correlation analysis and a frequency analysis of the gene loci in the segmented CN regions and the segmented GE regions; selecting in both cases the statistically significant loci. In this way, we find CN alterations that show strong correspondence with GE changes. We applied our method to a human dataset of 64 Glioblastoma Multiforme samples finding key loci and hotspots that correspond to major alterations previously described for this type of tumors.
Identification of key altered genomic loci constitutes a first step to find the genes that drive the alteration in a malignant state. These driver genes can be found in regions that show high correlation in copy number alterations and expression changes.
Acquisition of somatic genetic alterations plays an important role in the development of cancer. Several systematic efforts have addressed the study of genetic alterations to characterize human cancers [1,2], including: copy-number alterations (CNAs), translocations, insertions or single-nucleotide polymorphisms (SNPs). Most of these approaches are focused on finding frequent alterations, which occur in a high number of cases. According to the selective pressure theory, a genomic alteration that confers an advantage to a malignant state is likely to be found in more tumors than expected by chance . However, most methods that look for recurrent aberrations using copy number information find many regions, containing many genes [4,5]. Therefore, to identify recurrently altered genomic regions -biologically relevant- it is necessary to integrate gene and genome information, as proposed by Akavia et al. . Several reports have recently shown that integrative strategies can be very useful to identify driver genes, considering the hypothesis that disease-associated gene expression changes are frequently induced by genomic alterations [3,6-10]. Most of these reports are focused on finding gene-based relationships.
Built on these hypotheses -that relate transcriptomic and genomic alterations-, we propose a new integrative method based on the location of both types of signals along the genome. Our method takes into account the genomic loci, both in the copy number (CN) analysis and also in the gene expression (GE) analysis, and applies the segmentation step proposed by Ortiz-Estevez et al. . These authors designed a method for robust comparison between CN and GE using paired samples. Such approach is based on a search for correlation between segmented CN regions and segmented GE regions to find the most significant simultaneous alterations. We follow this approach introducing two new steps to asses the matching between CN and GE loci: (i) first, a signal correlation analysis; (ii) second, an alteration frequency analysis. Using these analyses we propose a set of significantly altered genomic regions in the studied pathological state. In order to show the performance and demonstrate the value of our method, we use a dataset of 64 Glioblastoma Multiforme (GBM) samples with paired measurements of GE and CN (taken from [7,8]).
Results and discussion
The method is designed for combined analysis of datasets from two types of genome-wide arrays: DNA genomic microarrays and RNA expression microarrays. These arrays provide copy number and expression quantitative data, respectively. The analysis places both types of signals along the genome, taking into account the gene loci for the CN data and the GE data. The rationale of the method is to search for copy number alterations with a major influence in the expression levels of the genes encoded. As a distinctive element from other integrative approaches we do not consider only SNPs or genes individually. We take into account the gene loci following the strategy described in , that is based on the application of the same smoothing and segmentation algorithm to CN and GE in order to establish comparable regions. Once we get the smoothed segments, we perform two independent analyses for each gene loci: a signal correlation analysis and an alteration frequency analysis. (The workflow described in Materials and Methods, presented in last figure, illustrates the procedure of the method including these two independent analyses).
Analysis of correlation between gene expression and copy number levels
The method matches the CN and GE segmented signals within each chromosomal region -i.e., the log2 ratio signals of the corresponding segments- and selects the gene loci that show a significant correlation. These loci can be considered candidate hotspots. In Figure 1 we present the results of this analysis done for the GBM dataset, marking in purple the number of gene loci with Pearson Correlation Coefficient r ≥ 0.60 (that corresponds to a Bonferroni-adjusted p-value < 0.005). Such cutoff (r ≥ 0.60) includes around 55 % of the human gene loci, providing a good coverage with a significant p-value. Setting more stringent cutoffs reduces the coverage too much: r ≥ 0.70 includes only ~26 % of the gene loci; r ≥ 0.80 includes only ~6 %.
Figure 1. Density distribution of the correlation coefficients between GE and CN for the GBM dataset. Purple represents the number of gene loci that present significant correlation (r ≥ 0.60) between gene expression and copy number signals, counted considering all the samples. Blue are the rest, not considered significant.
The number of probes in the SNP arrays -used to calculate the segmented signals for CN- is large and uniform along the genome. However, in the expression arrays some genomic regions do not have enough allocated gene loci and the number of probes is sparse. This fact is a problem when a GE segment includes outliers (i.e. gene locus which have expression levels very different from the mean of their neighbours). To solve this problem, we look for statistically significant outliers within the GE segments -which were at least in 1/3 of the samples- and we recalculate the signal correlation between their unsegmented GE and the corresponding CN segments. In this way, we find a new set of gene loci with correlation r ≥ 0.60, which is added to the initial set of candidate hotspots identified. This step of the procedure is important to recover some gene loci with quite significant correlation (e.g. EGFR or SEC61G), which were missed in the first step due to the described problem.
Analysis of frequencies for the categorical states Up-Gain and Down-Loss
The method also proposes to find the genomic regions that present a significant GE and CN alteration in the same direction. To assess this, we included a second selective step based on stratification of the segmented data. The genomic regions are stratified in several categories: up-regulation (U), down-regulation (D) or no-change (N) for expression; and gain (G), loss (L) or no-change (N) for copy number. This approach allows a discretization of the genomic regions into 9 different categories as shown in Figure 2 (inserted table): U-G, N-G, D-G, U-N, N-N, D-N, U-L, N-L, D-L. Figure 2 also presents the empirical cumulative distributions for these 9 categories of the GBM samples per gene loci, counting the frequency of samples for all the gene loci in each category. As expected, the distributions show that the "no change" (i.e. N-N, neutral-neutral) is the most frequent state. The analysis of distributions also finds some regions that show a clear correspondence between GE and CN alterations: i.e. the scenario where GE up-regulation is observed co-located with a CN gain (U-G category) and the scenario where GE down-regulation is co-located with a CN loss (D-L category). Our interest focuses on these regions, since they are the ones altered in the same way in both types of data. The analysis of the empirical frequency distributions for the U-G and D-L categories allows identifying the frequency cutoffs that correspond to the 10% upper quantiles. These cutoffs were: 13 samples for U-G and 11 samples for D-L (out of 64 in GBM dataset). The set up of these thresholds identifies those genomic regions that are the most frequently assigned to such altered categories (U-G and D-L) in the studied dataset.
Figure 2. Observed frequency distributions of the 9 categorical states for the GBM dataset: U-G, N-G, D-G, U-N, N-N, D-N, U-L, N-L, D-L. Boxplots corresponding to the distributions of number of samples assigned to each category for all the gene loci. Insert: Contingency table showing the 9 possible categories for the gene expression and copy number. The total number of GBM samples analysed was 64.
Genome-wide identification of hotspots: candidate key genomic regions
Our method identifies candidate key regions that show high correlation between CN and GE and that are frequently altered in the same direction, in both types of signals. The overlapping between the regions with the most significant correlation and the ones with the highest frequencies of simultaneous alteration (CN and GE) along the genome, will constitute hotspots where putative driver genes are likely to be encoded.
Figure 3 presents the combined view of GE and CN alterations on the complete genome obtained for the GBM dataset. The graph shows the alteration frequency, either in CN or in GE independently, along all the genome (22 human chromosomes). The dark colors correspond to GE up-regulated regions (red) or down-regulated regions (green), and the light colors -placed on top- correspond to CN gains (pale red) and losses (pale green). These results show that the method finds the alterations previously described for CN in GBM cancer [12,13]. In fact, the most frequent alterations in glioblastoma are the gain of chromosome 7 and the loss of chromosome 10. Our analysis finds such alterations in CN, and also finds their correlation with GE up-regulation for chromosome 7 and with GE down-regulation for chromosome 10. Figure 4 presents a detailed view of the alterations that occur in chromosome 7. It includes a profile of the regions with significant correlation (purple dots along the chromosome) and a profile of the frequency of U-G regions (pale red). They cover nearly the complete chromosome. A figure with the representation for all the 22 human chromosomes for the GBM samples is included as Additional File 3.
Figure 3. Combined view of GE and CN alterations obtained for the GBM dataset. The regions for all the human chromosomes (chr) that are altered either in CN or in GE are presented along the whole genome keeping the proportional size of the chromosomes. The graph shows the frequency of such alterations in the GBM samples. The colors correspond to GE up-regulated regions (in red) or down-regulated regions (in green), and -plotted on top- the CN gains (in pale red) or the CN losses (in pale green). Blue lines mark the regions that change in more than 25% of the GBM patients. The chromosomes with most significant changes (that present large regions included in the categories U-G or D-L) are labeled: U-G chr 7, chr 19, chr 20; D-L chr 10, chr 13, chr 22.
Figure 4. Detailed view of chromosome 7 showing the CN and GE correlation and the U-G category frequency for the GBM dataset. The genomic regions for chromosome 7 are represented in X-axis. Blue and purple dots show the correlation coefficients between CN and GE for each gene loci (purple when r ≥ 0.60). Pink profile represents the frequency values for the Up-Gain category (U-G).
Key genomic regions found for the 64 paired GBM cancer samples
As shown in Figure 3, the method presented in this work allows the identification of relevant altered genomic regions suffering significant changes in most of the GBM samples. The results also show that many of the detected CN alterations and GE changes overlap along the genome. These regions can be proposed as relevant "hotspots". In Table 1 and Table 2 we present a detailed description of the common genomic regions found in GBM; indicating the correlation and frequency of the U-G regions (Table 1, which includes 19 regions), and the D-L regions (Table 2, which includes 24 regions). The tables include the correlation between GE and CN for each region (as average correlation for all the gene loci); and the percentage of samples -frequency- in each region, counting only the samples where simultaneous GE and CN alterations occur: either up gene expression and gain in copy number (U-G) or down gene expression and loss in copy number (D-L). The regions detected are in the chromosomes that suffer the most significant changes in GBM samples: U-G, chr 7 and chr 20; D-L chr 10, chr 13, chr 14 and chr 22. The tables also include the genes enclosed in these regions. The most remarkable changes correspond to a large part of chr 7 (U-G) and to a large part of chr 10 (D-L). Two important genes are precisely located in these chromosomes: EGFR (in chr 7) usually up-regulated and PTEN (in chr 10) usually down-regulated [12,13]. PTEN is not found in our analysis, but it has been reported an absence of PTEN alterations in more than half of de novo glioblastomas and more than 90 % of glioblastomas developed from a pre-existing lower grade gliomas , which has been linked to the presence of additional tumor suppressor genes on chr 10, such as LGI1  and MXI1 . We found these two genes in regions 8 and 10 of the D-L list (Table 2), and we observed a very variable profile of PTEN in the GBM samples. These facts may indicate that PTEN is not the best genomic marker for this altered region. By contrast, we found RB1 tumor suppressor in region 13 of the D-L list; and this gene -included in chr 13- is a clear candidate to drive the alteration of tumor cells. With respect to EGFR, it has the highest U-G frequency observed (60.9%, Table 1) and therefore the method reveals this gene locus as the most common GE up-regulated and CN gained in the GBM samples. The alteration of EGFR can be associated with other genes that regulate its function, also found by the method. This is the case of VOPP1 and RAB11FIP2. VOPP1 is also known as ECOP (EGFR-coamplified and overexpressed protein) or GASP (Glioblastoma-amplified secreted protein), and is found in region 12 of the U-G list (Table 1). RAB11FIP2 is a suppressor of the endocytic internalization of EGFR and it is found in region 10 of the D-L list (Table 2) . The presence of these genes in the hotspots found for GBM supports the value of the method described. There are many other interesting genes in the identified altered genomic regions, that can be useful for further investigations on the disease studied.
Table 1. Significant U-G regions with the associated genes.
Additional file 1. Spreadsheet with the complete data corresponding to Table 1.
Format: XLS Size: 66KB Download file
This file can be viewed with: Microsoft Excel Viewer
Table 2. Significant D-L regions with the associated genes.
Additional file 2. Spreadsheet with the complete data corresponding to Table 2.
Format: XLS Size: 76KB Download file
This file can be viewed with: Microsoft Excel Viewer
Additional file 3. Detailed view of all the 22 chromosomes showing the CN and GE correlation and the U-G or D-L categories frequency for the GBM dataset. The genomic regions are represented in X-axis. Blue and purple dots show the correlation coefficients between CN and GE for each gene loci (purple when r ≥ 0.60). Pink and green profiles represent the frequency values for the Up-Gain (U-G) category or the Down-Loss (D-L) category respectively.
Format: PDF Size: 5.2MB Download file
This file can be viewed with: Adobe Acrobat Reader
Complete information corresponding to the genes found in the significant U-G regions and D-L regions is included respectively as supplementary material in Additional-file-1 (for the data corresponding to Table 1) and Additional-file-2 (for the data corresponding to Table 2).
The combined analysis of CN and GE data obtained using DNA genome and RNA expression microarrays for paired samples is a very powerful approach to uncover key altered regions in a biological state studied. We present a robust method to find genomic regions that show simultaneous significant changes in both CN and GE. Our calculations applied to a cancer dataset find expected known genomic alterations and many others identified as key altered genomic regions. This approach is also proposed as an adequate strategy to identify driver or causal genes under the hypothesis that disease-associated gene expression changes are frequently induced by genomic alterations.
Materials and methods
In this study we use a dataset of 64 human samples from Glioblastoma Multiforme (GBM)  that includes for each sample: Affymetrix DNA microarrays applied to detect of genome-wide CN changes and Affymetrix RNA expression microarrays applied to detect of GE changes. We used the same subgroup of samples that was previously analysed in Ortiz-Estevez et al. .
GE and CN normalization and signals calculation
GE data were processed using RMA algorithm  applied to the human gene expression microarrays: Affymetrix HGU133 plus 2.0 (using the same strategy followed in [19,20]). CRMAv2 algorithm  was applied to normalize the raw data and obtain the signals from the Affymetrix Human Mapping 500K SNP arrays. The processed signals were divided by the median of the normal samples for each element (SNP or gene) and then the log2 was computed. These log2 ratio signals were smoothed and segmented using Circular Binary Segmentation (CBS) algorithm  with the default parameters implemented in the DNAcopy R package.
Correlation between GE and CN
Pearson Correlation Coefficients (r) of the segmented GE and CN data were calculated taking the values of the segmented copy number and gene expression at the central point of the genomic position for each gene. P-values for the correlation coefficient of every gene loci were computed and adjusted by Bonferroni method. The established threshold for the selection of significantly correlated gene loci was correlation coefficient r ≥ 0.60, which corresponds to adjusted p-value < 0.005. When using the gene loci GE unsegmented signal, the same correlation threshold and p-value cutoff were applied.
Frequency of U-G and D-L alterations
The thresholds that define DNA copy number gains and losses and up and down gene regulation were established applying k-Means algorithm, fixing three clusters (k = 3) on the segmented data, and done independently for the CN data and for the GE data. The CN data values were classified into gained (G), lost (L) or no-change (N) and the GE values were classified as up-regulated (U), down-regulated (D) or no-change (N). The thresholds found by k-Means for CN in the GBM dataset were > 0.19 (of the log2 ratio signals) for gain and < -0.15 for loss. The thresholds found for GE in the GBM were > 0.10 (of the log2 ratio signals) for up-regulation and < -0.12 for down-regulation. A contingency table with the 9 possible categorical states for the two types of data was built for every gene locus. A cutoff threshold was set up for the frequency of up-regulated and gained (U-G) and for the down-regulated and lost (D-L) categories, based on the empirical cumulative distributions of the categories. Taking into account the gene loci, the significant altered regions were defined as the ones that had a frequency ≥ than the upper 10% quantile of the distribution of U-G or the distribution of D-L.
General workflow for identification of key regions in the genome
Following the steps described above, we present a general workflow (Figure 5) that illustrates the strategy to achieve a combined paired analysis of datasets from genome-wide microarrays, both for GE and CN. The workflow includes the different steps, the applied methods and the progression of the analysis. The strategy designed searches for high correlation between chromosomal regions that present a significant CN alteration (as gain or loss) and regions with significant GE change (as up or down). In this way, it determines which CN alterations have a strong influence on GE patterns. Key regions, i.e. hotspots in the genome, are defined as those regions simultaneously chosen as significantly correlated and frequently altered in both GE and CN.
Figure 5. Workflow of the method for the analysis of gene expression and copy number changes. The figure illustrates the strategy to achieve a combined paired analysis of data sets from genome-wide microarrays both for GE and CN. The method searches for high correlation between segmented chromosomal regions that present a significant CN alteration (gain or loss) and segmented regions with significant GE change (up or down). The frequency of alteration is also analysed. The selected loci (hotspots) are regions chosen as significantly correlated and frequently altered in both GE and CN.