How to get the most from microarray data: advice from reverse genomics

Gorlov, Ivan P; Yang, Ji-Yeon; Byun, Jinyoung; Logothetis, Christopher; Gorlova, Olga Y; Do, Kim-Anh; Amos, Christopher

doi:10.1186/1471-2164-15-223

Research article
Open access
Published: 21 March 2014

How to get the most from microarray data: advice from reverse genomics

Ivan P Gorlov¹,
Ji-Yeon Yang²,
Jinyoung Byun³,
Christopher Logothetis¹,
Olga Y Gorlova⁴,
Kim-Anh Do⁵ &
…
Christopher Amos³

BMC Genomics volume 15, Article number: 223 (2014) Cite this article

2922 Accesses
10 Citations
13 Altmetric
Metrics details

Abstract

Background

Whole-genome profiling of gene expression is a powerful tool for identifying cancer-associated genes. Genes differentially expressed between normal and tumorous tissues are usually considered to be cancer associated. We recently demonstrated that the analysis of interindividual variation in gene expression can be useful for identifying cancer associated genes. The goal of this study was to identify the best microarray data–derived predictor of known cancer associated genes.

Results

We found that the traditional approach of identifying cancer genes—identifying differentially expressed genes—is not very efficient. The analysis of interindividual variation of gene expression in tumor samples identifies cancer-associated genes more effectively. The results were consistent across 4 major types of cancer: breast, colorectal, lung, and prostate. We used recently reported cancer-associated genes (2011–2012) for validation and found that novel cancer-associated genes can be best identified by elevated variance of the gene expression in tumor samples.

Conclusions

The observation that the high interindividual variation of gene expression in tumor tissues is the best predictor of cancer-associated genes is likely a result of tumor heterogeneity on gene level. Computer simulation demonstrates that in the case of heterogeneity, an assessment of variance in tumors provides a better identification of cancer genes than does the comparison of the expression in normal and tumor tissues. Our results thus challenge the current paradigm that comparing the mean expression between normal and tumorous tissues is the best approach to identifying cancer-associated genes; we found that the high interindividual variation in expression is a better approach, and that using variation would improve our chances of identifying cancer-associated genes.

Background

Global profiling of gene expression by microarray technology is widely used to study molecular mechanisms of cancer. Even though a number of more sophisticated methods have been developed [1, 2] a typical approach to analyze gene expression data is to compare the expression level between normal and primary tumor tissues [3–6]; the genes showing the largest differences in expression are usually considered to be the top candidates as cancer genes.

Recently we hypothesized that high inter-tumor variation in gene expression may more effectively identify cancer-associated genes [7]. Tumors are heterogeneous at the molecular level: in different tumors, different subsets of cancer genes are drivers and therefore are upregulated or downregulated. This leads to a higher inter-tumor variation of the drivers and only slight differences in mean expression values between normal and tumorous tissues. The goal of this study was to comprehensively evaluate that hypothesis.

A previous census of cancer-associated genes identified 400 human cancer genes [8]. The real number of cancer-associated genes is likely to be higher because the cited studies used presence of recurrent somatic mutations in tumor tissue as the only criteria to define cancer-associated genes. We can use known cancer-associated genes to identify which microarray data–derived variables are the best predictors of known cancer genes. We call this a “reverse genomics approach”, and we used it to identify the best predictors of cancer genes for the 4 most common cancers: breast, colorectal, lung, and prostate.

Methods

Datasets and data processing

Figure 1 outlines the design of the study. Table 1 describes the datasets we used. We applied 2 criteria for selecting the datasets: the sample size should be large enough to allow reliable estimates of interindividual variance, and the gene expression data for both tumor and adjacent normal tissues should be available. We used only those probes that could be linked to a single gene. Because different datasets used different gene identifiers, we converted them to Entrez gene identification numbers by using bioDBnet [9]. GeWorkbench 2.3.0 was used to download Simple Omnibus Format in Text files [10]; the data were log2 transformed and normalized by variance-stabilizing algorithm [11] using VCN package in R from bioconductor http://bioconductor.org/.

Table 1 Brief description of the datasets used

Full size table

Identification of known cancer genes

Our working hypothesis was that inter-individual variance for gene expression values are higher for the genes associated with cancer development because in different tumors different sets of genes are used to drive cancer development. To test this hypothesis we needed to identify genes known to be associated with a given type of cancer. We used text mining tools to identify cancer-associated genes. We evaluated 3 text mining tools: (1) Pathway Studio, (2) Ingenuity Systems, and (3) KnowledgeNet. Results were consistent across the 3 algorithms. Throughout this paper we use Pathway Studio [19] because it uses the most updated databases. The lists of known cancer-associated genes for the 4 cancers we analyzed can be found in Additional file 1. We excluded genes recently reported to be associated with cancer (defined as those reported from January 1, 2011, through July 25, 2012, the date we completed our analysis) because we used recently reported cancer genes for validation.

Recently reported cancer genes

Our approach to retrieving recently reported cancer genes is exemplified by breast cancer genes. First we retrieved articles on breast cancer published during the period we defined as recent (from January 1, 2011, through July 25, 2012). In total, 23,550 papers were retrieved by using PubMed. We then scanned the abstracts by using MedScan [20]. One hundred forty-six associations between genes and breast cancer were identified. Seventy genes were novel and not previously reported (Additional file 2); 13 of them were microRNAs.

The same approach was used to retrieve novel colorectal, lung, and prostate cancer genes. The list of the recently reported cancer genes and the corresponding publications is found in Additional file 2.

Microarray data–derived predictors of known cancer-associated genes

We used 6 microarray data–derived predictors: (1) mean gene expression in adjacent normal tissue (m(AN)); (2) mean gene expression in tumor tissue, (m(T)); (3) the degree of change (fold change, or FC) in expression level between tumorous and normal tissue; (4) –LOG(P), in which P is the type-I error identified by using Student’s t test to compare the mean expression between adjacent normal and tumor tissues; (5) the standard deviation of the gene expression in adjacent normal tissue (SD(AN)); and (6) the standard deviation of the gene expression in tumor tissue (SD(T)). Nonparametric Mann-Whitney U (MW) test was used to compare those 6 predictors between cancer-associated genes and all other genes in the human genome.

To estimate how efficiently microarray data–derived predictors predict cancer genes, we ranked the probes by the predictors’ values and estimated the percentage of the cancer-associated genes among the top 5% of the probes. An enrichment factor (EF) was used as a measure of identification efficacy: EF = P/0.05, in which P is the proportion of known cancer genes among the top 5% genes ranked by a given predictor. When EF equals 1, the predictors provide no advantage over random selection of cancer-associated genes; the higher the EF, the better the identification efficacy.

Analysis of outliers

We then identified gene expression outliers in tumor samples separately for cancer and noncancer genes. Outliers for a given gene were defined as tumors with expression level < $\hat{m (N)} - 4 * SD (N)$ or > $\hat{m (N)} + 4 * SD (N)$ , in which $\hat{m (N)}$ is the mean expression and SD(N) is the standard deviation of the gene expression in adjacent normal tissue. Known cancer genes are more likely to be differently expressed and therefore are more likely to be outliers. To account for the effect of differential gene expression on its probability to be an outlier we first i) sorted all genes according to –LOG(P) from largest to smallest, and for each known cancer gene, and then ii) took non-cancer genes from the list immediately above and below of a given known cancer genes. Those neighbouring non-cancer genes were used as a comparison group.

Computer simulation

We used computer simulation to compare the efficacy of identification of cancer genes by SD(T) and –LOG(P). We simulated the expression levels of 1,000 genes: 50 cancer genes and 950 noncancer genes. The total sample size was 40 tissues: 20 adjacent normal tissues and 20 tumor tissues. Expression values were sampled from the normal distribution with mean = 7 and SD = 0.6 {N(7.0,0.6)}, which are typical means and SDs for the datasets we used.

Two models were compared: the “shifting means model” and the “outlier’s model” (Figure 2). In the shifting means model, tumors are homogeneous: differences in mean expression levels between normal and tumor tissues are due to the shift of the distribution to the right (upregulation) or the left (downregulation). In the outlier’s scenario, a cancer gene is differently expressed in only a fraction of tumors. In different tumors, different cancer genes can be outliers.

In the shifting means model, expression values were sampled from the normal distribution: N(7.0 + s, 0.6), in which s is the difference in mean expression values between tumor and adjacent normal tissues. Those mean and variance are typical for the studies we have used. Three different values of s (0.07, 0.35, and 0.7) were considered. The number of simulated outliers was defined to make the mean expression value in tumors similar to that of the shifting means model.

Binary logistic regression model

To explore whether the combination of several predictors can provide better identification of known cancer genes than any single predictor can, we ran a binary logistic regression model. A stepwise-forward likelihood ratio model was used to identify significant predictors in each dataset. This analysis was done for each cancer type separately. Known cancer genes were considered as outcome and –LOG(P), m(AN), m(T), FC, SD(AN), and SD(T) as predictors.

Raw and processed data

To estimate the effect of the data processing on the variance estimates, we computed variance in adjacent normal and tumor samples of (1) raw gene expression data, (2) log(2)-transformed data, and (3) data normalized by using a variance-stabilizing approach. We compared variance through these 3 levels using Kendall’s rank-correlation and estimated the efficacy of the variance-based identification of cancer genes for each level.

Results

SD(T) shows the strongest association with known cancer genes

The lists of the known breast, colorectal, lung, and prostate cancer genes are shown in Additional file 2. To check how known cancer genes differ from other genes, we used nonparametric MW testing; Table 2 shows the results of the comparison. SD(T) consistently revealed the strongest differences between cancer and noncancer genes: in 6 of 7 datasets, SD(T) had the largest MW statistics.

Table 2 Differences between cancer and all other genes for 4 cancer gene predictors in 7 datasets

Full size table

We also ranked the probes according to the predicting variables and estimated percentage of the known cancer genes among the top 5% of the probes. Under the null hypothesis, one can expect 5% of the known cancer genes to be among the top 5% of the ranked probes. We found that all predictors identified more known cancer genes than what one could expect by chance. Figure 3 shows the results of that analysis.

Outliers

In all datasets, the percentage of outliers for cancer genes was higher than it was for the comparison group (Table 3). In 3 datasets, the differences were significant. In the overall analysis (nonparametric MW test), the percentage of outliers was overall higher for cancer genes than it was for the paired controls: 3.3 ± 0.2 vs. 2.6 ± 0.1; MW test Z = 3.5; P = 0.0004.

Table 3 Percentage of outliers in tumor samples

Full size table

Recently reported cancer-associated genes

We next compared the newly reported cancer genes (i.e., reported from January 1, 2011, through July 25, 2012) with all other genes in the human genome (Table 4). SD(T) was the most significant variable for recently identified cancer genes.

Table 4 Differences between recently reported cancer genes and all other genes in the human genome

Full size table