Graduate Program in Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, Houston, Texas, USA

Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, Texas, USA; and Department of Radiation Oncology, the University of Texas M. D. Anderson Cancer Center, Houston, Texas, USA

Department of Radiation Oncology, the University of Texas M. D. Anderson Cancer Center, Houston, Texas, USA

Department of Pathology, the University of Texas M. D. Anderson Cancer Center, Houston, Texas, USA

Graduate Program in Structural and Computational Biology and Molecular Biophysics; and Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA

Abstract

Background

Assays of multiple tumor samples frequently reveal recurrent genomic aberrations, including point mutations and copy-number alterations, that affect individual genes. Analyses that extend beyond single genes are often restricted to examining pathways, interactions and functional modules that are already known.

Methods

We present a method that identifies functional modules without any information other than patterns of recurrent and mutually exclusive aberrations (RME patterns) that arise due to positive selection for key cancer phenotypes. Our algorithm efficiently constructs and searches networks of potential interactions and identifies significant modules (RME modules) by using the algorithmic significance test.

Results

We apply the method to the TCGA collection of 145 glioblastoma samples, resulting in extension of known pathways and discovery of new functional modules. The method predicts a role for

Conclusions

We have developed a sensitive, simple, and fast method for automatically detecting functional modules in tumors based solely on patterns of recurrent genomic aberration. Due to its ability to analyze very large amounts of diverse data, we expect it to be increasingly useful when applied to the many tumor panels scheduled to be assayed in the near future.

Background

Tumor characterization projects are beginning to produce a large volume of data about genomic, epigenomic, and gene expression aberrations in tumor samples. This unprecedented volume of information has the potential to transform our understanding of cancer biology, reveal new biomarkers and drug targets, and accelerate the development of new cancer therapies. One recent genome-wide tumor characterization effort revealed recurrent somatic aberrations in 91 glioblastoma (GBM) tumors

A key question is how to extend integrative analysis of somatic genomic aberration data to expand known cancer pathways and interactions, or discover completely new modules (sets of related genes). Such inference has been done extensively using gene expression arrays, both in yeast and humans

Specifically, we focus on patterns of recurrent and mutually exclusive aberrations (RME patterns). Previous analyses of large tumor panels have discovered that alterations of genes comprising a specific functional module are often observed across a sample collection, but are almost never found concurrently in the same tumor. Examples of these modules include

The key insight is that these RME patterns may be used to identify groups of genes that are functionally related. This concept was explored in 2008 in the context of cancer by Yeang et al., who utilized data from the Catalogue of Somatic Mutations in Cancer (COSMIC) to identify functional relations among mutated genes

To address these issues, we developed a new method for detecting RME patterns, which we formalized by using structural reliability models

Overview of RME Module Detection

**Overview of RME Module Detection**. **a)** An example of a structural reliability model of progression of a particular tumor type. Cancer progression in this example requires aberrations in each of the three distinct functional modules (three horizontal lines). If mutated genes (crossed out in red) occur in all three modules, the connection between the left and right parts of the structural model will be lost, indicating failure (cancer). **b)** A module may be disrupted by different aberrations in distinct tumor samples. One measure of an RME pattern is coverage, defined as the percentage of samples that contain at least one aberration within the module. Another measure of the pattern is exclusivity, defined as the percentage of covered samples that contain exactly one aberration within the module. An aberration in one of the genes within a specific RME module removes the selective pressure for aberrations in other genes within the same module, giving rise to the exclusivity. **c)** Example network where nodes represent genes and edge thickness represents the level of exclusivity. The search for RME patterns starts by constructing such a graph using the Winnow algorithm. This graph indicates three potential RME modules. The node colors and numbers correspond to those in panel **a**. **d)** The significance score for RME patterns is dependent on both exclusivity (y-axis) and coverage (x-axis). Shown is the RME algorithmic compression score, d, for a three-gene RME module across 100 samples with aberrations equally distributed, assuming a background frequency of 13.38 aberrant genes per sample (see section 2.3 and Additional file). Significance is equal to 2^{-d}.

Methods

Creating a mutation matrix

We designed our algorithm to be capable of utilizing many disparate sources of mutational data, including single-nucleotide polymorphisms, copy-number alterations, and epigenomic modifications. In a pre-processing step, these diverse data types were converted into a single two-dimensional binary "mutation" matrix (Figure

Analysis Pipeline

**Analysis Pipeline**. In a preprocessing step, validated SNPs and focal CNAs are combined into a mutation matrix. This matrix is fed into the Winnow algorithm, which scores each gene pair by exclusivity, indicated by edge scores in a graph. This graph is then searched for modules up to a specified size and the algorithmic significance is calculated for each potential module. Finally, the most significant modules are reported.

Data were obtained from The Cancer Genome Atlas Data Portal (

**Supplemental Methods and Results**.


These two forms of data were then merged into a two-dimensional mutation matrix. Each gene in each sample was checked against these single-nucleotide and copy-number mutations, and a matrix was created such that if sample i carried a mutation in gene j, entry (i, j) of the matrix was set to 1; otherwise it was set to 0. This matrix is available at
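The merging step can be illustrated with a minimal sketch. The input format (lists of sample/gene pairs), the function name, and the sample and gene identifiers are assumptions made for illustration; they are not taken from the published pipeline.

```python
def build_mutation_matrix(samples, genes, point_mutations, copy_number_alterations):
    """Return a samples x genes matrix of 0/1 entries.

    point_mutations and copy_number_alterations are iterables of
    (sample, gene) pairs; this input format is an assumption.
    """
    # A gene counts as altered in a sample if either data type reports it.
    altered = set(point_mutations) | set(copy_number_alterations)
    return [[1 if (s, g) in altered else 0 for g in genes] for s in samples]


samples = ["S1", "S2", "S3"]   # hypothetical sample identifiers
genes = ["geneA", "geneB"]     # hypothetical gene identifiers
matrix = build_mutation_matrix(
    samples,
    genes,
    point_mutations=[("S1", "geneA")],
    copy_number_alterations=[("S1", "geneA"), ("S3", "geneB")],
)
```

A gene hit by both a point mutation and a copy-number alteration in the same sample still contributes a single 1, which matches the binary nature of the matrix described above.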

Constructing a gene network with Winnow

The first step in our module detection pipeline was to filter the mutation matrix and retain only genes that meet a set frequency of recurrence, as genes altered in only one or a few samples do not contain enough information to calculate meaningful exclusivity scores.
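As a rough illustration of this filtering step, the sketch below keeps only genes altered in at least a given fraction of samples. The function name and the row-per-sample matrix layout are assumptions for the example.

```python
def filter_recurrent(matrix, genes, min_fraction):
    """Keep genes altered in at least min_fraction of the samples.

    matrix is a list of rows (one per sample) of 0/1 values, with
    columns in the same order as genes.
    """
    n_samples = len(matrix)
    keep = [
        j for j in range(len(genes))
        if sum(row[j] for row in matrix) / n_samples >= min_fraction
    ]
    filtered = [[row[j] for j in keep] for row in matrix]
    return filtered, [genes[j] for j in keep]


toy_matrix = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 1],
]
toy_genes = ["geneA", "geneB", "geneC"]  # hypothetical identifiers
kept_matrix, kept_genes = filter_recurrent(toy_matrix, toy_genes, min_fraction=0.5)
```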

A possible next step would be to calculate the exclusivity score between each pair of genes, defined as the number of samples where exactly one of the pair is mutated divided by the number of samples where at least one of the pair is mutated. (Figure
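The pairwise exclusivity score defined in this paragraph can be computed directly from two columns of the mutation matrix. The sketch below is a straightforward reading of that definition; the function name and the convention of returning 0.0 when neither gene is ever mutated are assumptions.

```python
def pair_exclusivity(matrix, a, b):
    """Exclusivity of gene columns a and b in a samples x genes 0/1 matrix."""
    # Samples where exactly one of the two genes is mutated.
    exactly_one = sum(1 for row in matrix if row[a] != row[b])
    # Samples where at least one of the two genes is mutated.
    at_least_one = sum(1 for row in matrix if row[a] or row[b])
    return exactly_one / at_least_one if at_least_one else 0.0


toy = [
    [1, 0],
    [0, 1],
    [1, 1],
    [0, 0],
]
score = pair_exclusivity(toy, 0, 1)  # 2 exclusive samples out of 3 covered
```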

Thus, we used an online-learning linear threshold algorithm called Winnow to detect signals of exclusivity against the noisy background of passenger mutations in many irrelevant genes

The Winnow algorithm was run in an online setting, using one gene as a classifier and the rest of the mutation array as training data. In the first Winnow run, all the bits in the array were flipped, such that we calculated how well each aberration in the classifier is predictive of non-aberration in each gene of the matrix. Then, we flipped the bits of the classifier, such that we calculated how well each non-aberration in the classifier was predictive of aberration in each gene of the matrix. The resulting weights were used to score the edges of the graph, then low-scoring edges were removed.
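To make the first of these runs concrete, the following is a minimal sketch of the generic Winnow2 update rule (multiplicative promotion and demotion) applied to one classifier gene. The promotion factor of 2, the threshold of half the number of features, the single pass over the samples, and all function names are assumptions; the text does not specify the parameters that were actually used, and the second run and the way the two runs are combined are not shown.

```python
def winnow_weights(examples, labels, alpha=2.0, threshold=None, epochs=1):
    """Generic Winnow2: multiplicative weight updates on mistakes only."""
    n_features = len(examples[0])
    theta = n_features / 2 if threshold is None else threshold
    weights = [1.0] * n_features
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            score = sum(w for w, xi in zip(weights, x) if xi)
            predicted = 1 if score >= theta else 0
            if predicted == y:
                continue
            # Promote active features on a missed positive, demote on a false alarm.
            factor = alpha if y == 1 else 1.0 / alpha
            weights = [w * factor if xi else w for w, xi in zip(weights, x)]
    return weights


def exclusivity_weights(matrix, classifier):
    """First run: flipped bits of the other genes predict the classifier gene."""
    others = [j for j in range(len(matrix[0])) if j != classifier]
    examples = [[1 - row[j] for j in others] for row in matrix]
    labels = [row[classifier] for row in matrix]
    return dict(zip(others, winnow_weights(examples, labels)))


# Columns: 0 = classifier gene, 1 = gene exclusive with it, 2 and 3 = unrelated genes.
toy = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
]
weights = exclusivity_weights(toy, classifier=0)
```

In this toy matrix the gene that is perfectly exclusive with the classifier keeps its weight, while the two unrelated genes are demoted after a single false alarm, so the exclusive partner ends up with the highest edge score.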

Since the range of weights for each run was determined by how quickly Winnow finds an optimal classifier, we did not use an absolute threshold value when removing edges. Instead, for each classifier gene, we took the second highest weight and retained all edges with a score greater than or equal to that value.
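The relative cutoff described here is simple to state in code. The sketch below assumes the weights for one classifier gene are held in a dictionary keyed by partner gene; that representation and the gene names are illustrative only.

```python
def retain_edges(weights):
    """Keep edges whose weight is at least the second-highest weight."""
    if len(weights) < 2:
        return dict(weights)
    cutoff = sorted(weights.values(), reverse=True)[1]
    return {gene: w for gene, w in weights.items() if w >= cutoff}


# Hypothetical Winnow weights for one classifier gene.
edge_weights = {"geneA": 4.0, "geneB": 2.0, "geneC": 2.0, "geneD": 0.5}
kept = retain_edges(edge_weights)
```

Because ties at the cutoff are retained, a classifier gene can keep more than two edges, as in this example.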

Identifying candidate modules

We then used each gene in the network as a starting point in a greedy local combinatorial search for RME modules, such that we evaluated all possible connected modules with size below the specified limit. We report those that have algorithmic significance above a predetermined threshold, based on the size of the input data (Figure
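One way to enumerate connected candidate modules up to a size limit is to grow each module one neighbouring gene at a time. The sketch below is an exhaustive enumeration under the assumption that the network is stored as a dictionary of neighbour sets and that the size limit is inclusive; it is not the published search procedure, and scoring of each candidate is omitted.

```python
def connected_modules(adjacency, max_size):
    """Return all connected gene sets with 2..max_size members."""
    found = set()
    frontier = {frozenset([gene]) for gene in adjacency}
    for _ in range(max_size - 1):
        grown = set()
        for module in frontier:
            # Genes adjacent to any member of the module but not yet in it.
            neighbours = set().union(*(adjacency[g] for g in module)) - module
            for gene in neighbours:
                grown.add(module | {gene})
        found |= grown
        frontier = grown
    return found


# Hypothetical exclusivity network: A-B-C form a chain, D is isolated.
network = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": set()}
candidates = connected_modules(network, max_size=3)
```

Using frozensets means that a module reached from several starting genes is stored only once.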

Evaluating modules by performing an algorithmic significance test

The problem of determining whether a module (subset of genes) contains a significant RME pattern of aberrations can be addressed using probabilistic models or heuristic scores. Both approaches would generally require establishment of extremely low significance values (pre-Bonferroni correction), which would in turn require many cycles of computationally demanding permutation testing. To eliminate this bottleneck, we employ a new implementation of the computationally much less demanding algorithmic significance test

Let _{i,j} = 1) or absence (_{i,j} = 0) of an aberration of the

The presence of an RME pattern (Figure _{i,0}, _{0,j}, j = 1,...,

The algorithm then examines the sorted matrix row by row in left-to-right order, keeping track of how many aberrations have been observed. For each cell, it calculates the probability of observing an aberration in that cell and encodes the bit optimally according to the calculated probability. To describe how the probability is calculated, we first introduce additional notation. Let _{i,j} = 1) denote the number of unobserved mutations divided by the number of unobserved positions remaining in the matrix. Let _{i,j} and _{i,j} denote the number of unobserved aberrations in the current gene and sample respectively.

Then, we can encode elements of X according to the following probability distribution: If _{i,j} and _{i,j} are both larger than 0, and a one has not been observed in this row yet, we use the following formula (derived by applying Bayes' rule):

else we estimate that the probability is very low (but not equal to zero in order to avoid infinitely large penalties):

In contrast, the Null algorithm encodes optimally assuming that the _{NULL}(1), and that mutations occur independently in each of the k genes.

The encoding length difference between the null and RME algorithms and the algorithmic significance are calculated in the following two steps:

**Step 1**. Encode the binary aberration matrix.

Set

If _{i,j} = 1 then:

Else,

where log denotes binary logarithm.

**Step 2**. Account for additional information (including implicit correction for multiple testing) and calculate significance.

Calculate significance value 2^{-d}.
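The relationship between encoding lengths and the significance value 2^{-d} can be illustrated with a small sketch. The per-cell probabilities of the RME and Null encoders and the Step 2 penalty are not reproduced in this text, so here they are supplied as plain inputs; the toy probabilities, the `overhead_bits` parameter, and the function names are assumptions made only to show the arithmetic.

```python
import math


def code_length(bits, prob_one):
    """Bits needed to encode `bits` when cell t is 1 with probability prob_one[t]."""
    return sum(-math.log2(p if b else 1.0 - p) for b, p in zip(bits, prob_one))


def algorithmic_significance(null_length, model_length, overhead_bits=0.0):
    """Return (d, 2**-d), where d is the net number of bits saved over the null."""
    d = null_length - model_length - overhead_bits
    return d, 2.0 ** -d


bits = [1, 0, 0, 1]                      # toy flattened aberration matrix
null_len = code_length(bits, [0.5] * 4)  # toy null model: each cell is a coin flip
model_len = code_length(bits, [0.75, 0.25, 0.25, 0.75])  # toy model that fits better
d, significance = algorithmic_significance(null_len, model_len, overhead_bits=0.5)
```

Each additional bit saved relative to the null encoding halves the significance value, which is why no permutation testing is needed once the two encoding lengths are known.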

Whole-genome simulations

In order to benchmark the performance of this algorithm, we ran simulations on synthetic data sets. When generating sets with the same size as the current glioblastoma data (145 samples, 1290 genes), the actual distribution of mutations from the TCGA data was used to create random matrices. We simulated larger data sets using the knowledge that the current gene list is heavily biased towards known and frequently-altered oncogenes, so we compensated by assuming that 0.1% of newly considered genes will have a mutation frequency greater than 0.2, 0.9% will have frequency between 0.2 and 0.1, and 99% will have frequency less than 0.1.

We then used a binning procedure, where we started with the empirical GBM distribution and calculated the proportion of mutations in each bin. To compensate for the fact that the distribution is heavily biased towards low-frequency mutations, we used bins of size 1% until we reached the tenth percentile, then used bins of size 5% to allow for some variability. We then distributed the specified proportion of aberrations randomly within each bin. We tested coverage levels between 50% and 100%, and generated RME modules such that the number of alterations matched the given coverage level, exclusivity was 100%, and each gene was altered in a random number of samples that exceeded the minimum threshold.
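A simplified sketch of the seeding idea is given below: a random background is drawn from per-gene aberration frequencies, and a module is generated with a chosen coverage and 100% exclusivity. The binning procedure is not reproduced, the minimum recurrence threshold per gene is not enforced, and the function names, frequencies, and random seed are assumptions for illustration.

```python
import random


def random_background(n_samples, gene_frequencies, rng):
    """One column per gene; each cell is 1 with that gene's aberration frequency."""
    return [
        [1 if rng.random() < freq else 0 for _ in range(n_samples)]
        for freq in gene_frequencies
    ]


def seed_rme_module(n_samples, module_size, coverage, rng):
    """Columns for a module with the requested coverage and 100% exclusivity."""
    covered = rng.sample(range(n_samples), round(coverage * n_samples))
    columns = [[0] * n_samples for _ in range(module_size)]
    for position, sample in enumerate(covered):
        # The first module_size covered samples give every gene one aberration
        # (when enough samples are covered); the rest go to a random gene.
        gene = position if position < module_size else rng.randrange(module_size)
        columns[gene][sample] = 1
    return columns


rng = random.Random(0)
background = random_background(100, [0.05, 0.12, 0.30], rng)  # hypothetical frequencies
module = seed_rme_module(100, module_size=3, coverage=0.7, rng=rng)
hits_per_sample = [sum(column[s] for column in module) for s in range(100)]
```

Because each covered sample receives exactly one aberration, no sample ever carries two aberrations within the seeded module, regardless of the random seed.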

Determination of prognostic significance

Affymetrix HG133-based GeneChip mRNA expression profiling data from two published datasets, the TCGA ("TCGA", n = 260) and the Erasmus Medical Center, Netherlands ("Erasmus", n = 153) were obtained as raw intensity files (.CEL files) and normalized

Implementation and availability

The algorithm was implemented in Ruby and Bash. The core algorithm is available for download at

Results

Discoverability of RME modules using current and anticipated TCGA project data

To determine how well our method detects RME modules over the background noise of passenger mutations, we ran experiments on synthetic data using several different parameter sets. As described in the Methods, we created a randomized mutation matrix and then added an RME module consisting of two to five genes. One thousand simulations were run for each parameter set to determine whether the seeded RME module could be detected. We measured sensitivity as the fraction of simulations in which the seeded module was detected above the significance threshold. We measured precision as the fraction of simulations in which the algorithm detected the seeded genes as more significant than any other module.
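The two performance measures can be written down compactly. The sketch below assumes that each simulation is summarized by the d-score of the seeded module and the best d-score among all other modules, with higher d meaning more significant; this summary format, the toy scores, and the threshold value are assumptions for illustration.

```python
def sensitivity_and_precision(simulations, d_threshold):
    """simulations: list of (seeded_d, best_other_d) pairs, one per simulation."""
    n = len(simulations)
    # Sensitivity: the seeded module clears the significance threshold.
    detected = sum(1 for seeded, _ in simulations if seeded >= d_threshold)
    # Precision: the seeded module outranks every other reported module.
    top_ranked = sum(1 for seeded, other in simulations if seeded > other)
    return detected / n, top_ranked / n


toy_runs = [(60, 40), (30, 45), (55, 55), (70, 10)]  # hypothetical d-scores
sensitivity, precision = sensitivity_and_precision(toy_runs, d_threshold=50)
```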

Genes altered in only a few samples did not contain enough information to calculate meaningful exclusivity scores, so we tested two different recurrence thresholds. When considering only genes that are altered in at least 10% of the samples, the algorithm had high sensitivity and precision, with smaller modules being more susceptible to false positives that arise by chance (Figure

Simulation Results

**Simulation Results**. One thousand simulations were run using varied numbers of genes and samples, for 5% and 10% recurrence thresholds. As sample size and the number of genes assayed increase, our algorithm retains the ability to detect RME modules with high sensitivity and precision.

We then evaluated the characteristics of pathways that are discoverable using the data that is to be generated by future stages of the TCGA project. We increased the number of samples to 500 and increased the number of resequenced genes to either the 6000 currently being evaluated in TCGA Phase 2, or the ~18000 that may be examined with whole-exome coverage (Figure

Comparison to other methods

We also compared the performance of our algorithm to previously published methods based on calculating p-values for exclusivity from a hypergeometric test or from a log-likelihood ratio

This can be explained as follows: The formula for calculating the likelihood ratio between the frequency of joint mutations relative to the best simpler model, as given in Yeang, is:

where the denominator is the empirical frequency of mutations in the first and second genes respectively, and the numerator is the empirical frequency of co-mutation. Thus, for two genes that are not mutated in the same samples, the likelihood ratio is 0 whether they each have one mutation, or they both have many mutations. Because of this characteristic, the likelihood method almost always reports false positive modules of genes that are exclusive by chance. Such modules usually have much lower coverage than the true seeded module. Since our algorithmic significance test considers recurrence as well as mutual exclusivity, it much more reliably excludes these false positives. The hypergeometric p-value calculations described in Yeang suffer from a similar problem.
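The argument above can be checked numerically. The sketch below assumes the ratio takes the form of the co-mutation frequency divided by the product of the two single-gene frequencies, which is one reading of the description in this paragraph; the exact expression and notation used by Yeang et al. are not reproduced here, and the function names and toy data are illustrative.

```python
def likelihood_ratio(col_a, col_b):
    """Co-mutation frequency over the product of single-gene frequencies."""
    n = len(col_a)
    freq_a = sum(col_a) / n
    freq_b = sum(col_b) / n
    freq_both = sum(1 for a, b in zip(col_a, col_b) if a and b) / n
    if freq_a == 0 or freq_b == 0:
        return float("nan")  # undefined when a gene is never mutated
    return freq_both / (freq_a * freq_b)


def pair_coverage(col_a, col_b):
    """Fraction of samples with an aberration in at least one of the two genes."""
    return sum(1 for a, b in zip(col_a, col_b) if a or b) / len(col_a)


n = 100
# Two rarely mutated genes that never overlap purely by chance.
rare_a = [1 if s == 0 else 0 for s in range(n)]
rare_b = [1 if s == 1 else 0 for s in range(n)]
# Two frequently mutated genes that never overlap.
common_a = [1 if s < 40 else 0 for s in range(n)]
common_b = [1 if 40 <= s < 80 else 0 for s in range(n)]

ratio_rare = likelihood_ratio(rare_a, rare_b)
ratio_common = likelihood_ratio(common_a, common_b)
```

Both pairs receive the same ratio of zero even though one pair covers 2% of samples and the other covers 80%, which is the behaviour that leads to chance-exclusive, low-coverage false positives.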

These other methods are also orders of magnitude slower than algorithmic significance, since they require many rounds of permutation testing to do multiple testing correction. Averaged across ten trials, the likelihood-based method had an average runtime of 899.377s, the hypergeometric method had an average runtime of 409.543s, and the algorithmic significance method had an average runtime of 0.779s. As described in the Methods section, algorithmic significance handles the problem of multiple testing using a penalty that takes very little time to compute. In contrast, both of the other methods require a step where the input data is permuted 1000 times and the number of combinatorial patterns is assessed. Unsurprisingly, this step makes these methods much slower.

Application to glioblastoma tumors

We next applied this method to data from genomic assays run on 145 primary GBM tumor samples, using a conservative recurrence threshold of 10%. The modules were ranked by their algorithmic significance scores. The top six modules, listed in the pathway-context figure, have significance values below 2^{-50} (~= 8.88 × 10^{-16}). Three of the six modules contain components of core GBM pathways reported by the TCGA consortium

Pathway context for RME modules found in glioblastoma

**Pathway context for RME modules found in glioblastoma**. Genes colored red are recurrently mutated in such a way that we expect loss of function, and those colored green are amplified or contain putatively activating mutations. The d-score is the algorithmic significance value, with significance being equal to 2^{-d}. 1st row: Alterations in

Rediscovery and expansion of known modules

The highest-scoring rediscovered module consisted of alterations to the genes

We expect that alterations in any of these three module components would disrupt the tumor-suppressive activity of

We also observed that the pattern of regulation in this module is concordant with our knowledge about the tumor-suppressive activity of the

The second rediscovered module consists of the genes

The first module that does not directly map to a known pathway consists of the genes

Newly Discovered Modules

In addition to finding RME modules that are components of known pathways, we discovered modules that were previously unreported. Several have intriguing functional similarities, such as the pro-apoptotic roles of both

To show how our gene module discovery process might be leveraged to produce clinically useful results, we investigated EP300 further. Our method suggests that EP300 plays a role in the p53 pathway, a hypothesis supported by previous studies that show its interaction with

EP300 expression predicts survival for patients with glioblastoma

**EP300 expression predicts survival for patients with glioblastoma**. The role of

Conclusions

We have developed a sensitive, simple, and fast method for automatically detecting functional modules in tumors based on patterns of recurrent genomic aberration alone. The results indicate that integrative analyses of genome characterization data have the potential to identify groups of genes that have related roles in producing cancer phenotypes. Furthermore, it is possible to generate hypotheses about pathway membership, or about the functional relevance of unexpected or uncharacterized genes by using co-occurrence in an RME module as an indicator of function.

Our experiments do show that RME patterns are not perfect. The fact that 30-70% of samples are not covered by individual modules may be explained by several factors, including low-frequency mutations that fall below our recurrence thresholds, the small proportion of genes that were assayed for somatic point mutations, and lack of comprehensive epigenomic assays, which could give information on gene silencing. As the costs of massively parallel sequencing drop, we expect more complete coverage of a larger number of samples, which may resolve the first two issues. A larger number of genes are also slated to be assayed for abnormal methylation patterns soon, and this algorithm can incorporate such data into future analyses. These comprehensive whole-genome data will undoubtedly improve our ability to detect functional modules and eliminate any bias that comes from operating on a reduced gene set.

We also note that while our method does not use pathway, interactome, and other network information, we do not suggest this method as a complete replacement for analyses that do use these data. In fact, we envision extensions of this method that may use background knowledge in a controlled and explicit way. At this point, we also do not make use of aberration co-occurrence, which may suggest a lack of functional similarity. Such overlapping aberrations do not lend themselves to the same kind of clear and compelling interpretation as RME patterns, but may be nonetheless useful in future expansions of this method.

As the throughput of technologies and the capacity of data-producing projects increase, so will the significance and abundance of RME patterns. In anticipation of this trend, this method has been designed at the outset to accommodate an increasing diversity and volume of genome characterization information. We therefore anticipate that the method will be increasingly useful in generating hypotheses that will drive specific experiments and increase understanding of cancer progression.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

CM and AM: project conception, algorithm design. CM: implementation, simulations, and application. SS, ES, KA: survival analysis. CM and AM: manuscript preparation. AM: project leadership and supervision. All authors read and approved the final manuscript.

Acknowledgements

We gratefully acknowledge Dr. Pim J. French, Dr. Lonneke M. Gravendeel, and Dr. Peter S. Smitt of the Erasmus Medical Center, The Netherlands for providing raw intensity files for the Erasmus dataset. We also thank Dr. Chen-Hsiang Yeang of the Institute of Statistical Science of Academia Sinica for providing code that enabled us to do algorithmic comparisons.

The results published here are in whole or part based upon data generated by The Cancer Genome Atlas pilot project established by the NCI and NHGRI. Information about TCGA and the investigators and institutions who constitute the TCGA research network can be found at

Funding: This research has been funded by the NIH grants R01-HG004009 and R21-HG004554 from the National Human Genome Research Institute and R33-CA114151 from the National Cancer Institute to AM.

Pre-publication history

The pre-publication history for this paper can be accessed here: