Center for Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA

Abstract

Background

Biomedical ontologies have become an increasingly critical lens through which researchers analyze the genomic, clinical and bibliographic data that fuels scientific research. Of particular relevance are methods, such as enrichment analysis, that quantify the importance of ontology classes relative to a collection of domain data. Current analytical techniques, however, remain limited in their ability to handle many important types of structural complexity encountered in real biological systems including class overlaps, continuously valued data, inter-instance relationships, non-hierarchical relationships between classes, semantic distance and sparse data.

Results

In this paper, we describe a methodology called Markov Chain Ontology Analysis (MCOA) and illustrate its use through an MCOA-based enrichment analysis application based on a generative model of gene activation. MCOA models the classes in an ontology, the instances from an associated dataset and all directional inter-class, class-to-instance and inter-instance relationships as a single finite ergodic Markov chain. The adjusted transition probability matrix for this Markov chain enables the calculation of eigenvector values that quantify the importance of each ontology class relative to other classes and the associated data set members. On both controlled Gene Ontology (GO) data sets created with Escherichia coli, Drosophila melanogaster and Homo sapiens annotations and real gene expression data extracted from the Gene Expression Omnibus (GEO), the MCOA enrichment analysis approach provides the best performance among comparable state-of-the-art methods.

Conclusion

A methodology based on Markov chain models and network analytic metrics can help detect the relevant signal within large, highly interdependent and noisy data sets and, for applications such as enrichment analysis, has been shown to generate superior performance on both real and simulated data relative to existing state-of-the-art approaches.

Background

Ontologies have become a crucial component for the analysis, retrieval and integration of the data underpinning modern biomedical science

Of particular importance in the biomedical space are the family of applications, including enrichment analysis

To help address these limitations, we have developed a new methodology, Markov Chain Ontology Analysis (MCOA), for analyzing hierarchical models relative to a collection of domain data. Our approach represents the combination of an ontology and the instances in an associated dataset as a single finite ergodic Markov chain whose adjusted transition probability matrix is used to compute modified eigenvector centralities, or steady-state probabilities, for each class and instance. The negative log of these modified eigenvector centralities, a quantity we call the **information rank** of the class, represents the importance of each class relative to both the data set and other classes in the ontology.

In the remainder of this paper, we outline the analytical challenges that motivated the development of our methodology, detail the mathematical model of our technique and demonstrate its utility in the context of GO enrichment analysis. Following a standard benchmarking process, we demonstrate the ability of an MCOA-based enrichment analysis method to outperform existing state-of-the-art enrichment methods on simulated gene enrichment datasets. To evaluate the performance of MCOA on real experimental data, we compare the enrichment results generated by MCOA with other comparable methods using gene expression data from a study of Parkinson's disease. Finally, we discuss other applications that could benefit from the MCOA approach and our plans for future investigations.

Enrichment Analysis

Although the analysis approach we propose is relevant to any application that quantifies the importance of ontology classes relative to a dataset, we frame the discussion in this paper in the context of enrichment analysis. This focus is motivated both by the widespread use of enrichment analysis in the biomedical field and by the fact that the technical challenges faced by enrichment analysis methods are directly relevant to many other ontology-based data analysis activities.

Enrichment analysis assesses whether classes in an ontology are statistically over or under-represented in a specific dataset based on the semantic annotations of dataset members relative to some baseline distribution. In the biomedical field, enrichment analysis methods are commonly employed to determine the statistical enrichment of GO categories for gene expression data by comparing the annotation frequency in a target gene list with the annotation frequency in a background collection of genes. The widespread use of the method in this context has motivated the extensive manual annotation of genomic and proteomic data with GO categories and the development of a wide range of enrichment analysis techniques and tools

Whether analyzing genomic data for enrichment of GO categories or bibliographic data for enrichment of classes in a clinical ontology, the same set of enrichment methods can be employed. Huang

The MEA category includes the MCOA-based enrichment analysis approach described in this paper as well as a number of state-of-the-art techniques developed since the publication of the Huang

Despite the extensive use and high utility of enrichment analysis applications and the important recent advances made in the GSEA and MEA categories, existing analytical methods remain limited in their ability to successfully analyze the full spectrum of ontological and dataset complexity. Challenging structural features include overlaps between ontology classes, continuous instance and annotation weights, relationships between instances, non-hierarchical relationships between classes, semantic distance and sparse data. These analytical challenges, and how current enrichment methods attempt to address them, are discussed in further detail below.

Analysis Challenges

Class overlaps

Methods in the SEA and GSEA categories commonly generate enrichment results comprising long lists of highly correlated classes, leaving users to determine which of multiple, largely redundant, classes are actually relevant. This problem is due to both the overlaps between class members and the fact that SEA and GSEA methods evaluate each class independently for enrichment and thus fail to take class interdependencies into account. Overlaps between the member sets of different classes can result from several structural features:

• **Inheritance**: one class is an ancestor of the other class and therefore all dataset members annotated to the descendant are implicitly annotated to the ancestor.

• **Multiple parents**: both classes share a common descendant and therefore are implicitly annotated with the same dataset members.

• **Multiple annotations**: a dataset member is annotated to both classes (or descendants of both classes).
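The three sources of overlap listed above can be illustrated with a minimal Python sketch. The toy ontology, the gene identifiers and the helper names (`ancestors`, `materialize`, `overlapping_pairs`) are hypothetical and chosen only for illustration; the sketch simply propagates direct annotations to all ancestors and reports which class pairs end up sharing members.

```python
from itertools import combinations

# Hypothetical toy ontology: class -> set of direct parents.
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},        # inheritance: A is an ancestor of A1
    "AB": {"A", "B"},   # multiple parents: A and B share a descendant
}

# Hypothetical direct annotations: dataset member -> directly annotated classes.
DIRECT = {
    "g1": {"A1"},
    "g2": {"AB"},
    "g3": {"A1", "B"},  # multiple annotations
}


def ancestors(cls, parents):
    """Return every class reachable from cls by following parent links."""
    seen = set()
    stack = list(parents[cls])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def materialize(direct, parents):
    """Map each class to the members annotated to it directly or via a descendant."""
    members = {cls: set() for cls in parents}
    for member, classes in direct.items():
        for cls in classes:
            members[cls].add(member)
            for anc in ancestors(cls, parents):
                members[anc].add(member)
    return members


def overlapping_pairs(members):
    """Return the class pairs whose member sets share at least one member."""
    return {
        (a, b): members[a] & members[b]
        for a, b in combinations(sorted(members), 2)
        if members[a] & members[b]
    }


members = materialize(DIRECT, PARENTS)
overlaps = overlapping_pairs(members)
for pair, shared in sorted(overlaps.items()):
    print(pair, sorted(shared))
```

Even in this five-class example, nearly every pair of classes overlaps once annotations are propagated up the hierarchy, which is why methods that test each class independently tend to return long lists of correlated classes.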

Overlaps between classes are very common in practice with each GO term overlapping with an average of 1078 other terms based on common human gene annotations (see Additional File

**Gene Ontology term overlap statistics with Homo sapiens annotations**.


The class overlap problem has been explored by several existing enrichment analysis approaches including MGSA, GenGO, parent-child union by Grossmann

Continuously valued data

A key drawback of methods in the SEA category and most methods in the MEA category is their inability to model continuously valued data. For most biological data of interest in an enrichment analysis scenario, dataset members have varying levels of experimental significance and continuous weights can be associated directly with each instance (e.g., differential gene expression, test statistic associated with SNP-to-gene analysis, etc.) or with each instance-to-class annotation (e.g., probabilistic confidence score generated via statistical classification, GO annotations weighted according to source of evidence, etc.). Continuous weights can also be associated directly with classes or with inter-instance and inter-class relationships (e.g., protein-protein interaction scores, gene co-expression scores, etc.). Analyzing continuously valued datasets using SEA or MEA methods requires the use of an arbitrary cut-off with all dataset members or annotations above the cut-off given equal weighting in the analysis, potentially leading to significantly skewed enrichment results. Addressing this shortcoming is the primary objective of methods in the GSEA category including Gene Set Enrichment Analysis (GSEA)

Although the GSEA methods avoid a potentially arbitrary dataset "cut-off" through the use of continuous dataset weights, this requirement can be problematic in cases where a single biologically meaningful value for each gene does not exist. GSEA methods are further limited by their one-at-a-time analysis of ontology classes and, in practice, have been found to generate enrichment results very similar to those output by SEA methods on actual experimental data

Inter-instance relationships

Meaningful relationships often exist between the members of the datasets targeted for enrichment analysis (e.g., citation links between publications, protein-protein interaction links, gene-gene links in gene regulatory networks, etc.). Network models are particularly well suited for representing the interconnections in real biological systems

Non-hierarchical class relationships

Standard enrichment analysis only considers hierarchical relations between classes (is-a, part-of). However, many relevant biomedical ontologies, including GO, include non-hierarchical class relationships (e.g., regulates). Accounting for such inter-class relationships may be even more relevant in scenarios where multiple inter-related ontologies are jointly analyzed and inter-class relationships are used to capture mappings between classes in different ontologies (e.g., relationship between GO categories and KEGG pathways). Although the same network analytical methods used to analyze instance-level links can be applied on the ontology graph, the current set of state-of-the-art enrichment methods does not do so, and, for most enrichment approaches, their incorporation is not feasible due to the nature of the underlying statistical tests.

Semantic distance

When analyzing data against hierarchical ontologies, it is generally desirable to favor more specific classes over more general classes when both classes are associated with the same number of dataset members. Standard SEA category methods like Fisher's exact test measure significance based solely on annotation frequency and ignore semantic distance. Although semantic distance is incorporated into methods such as parent-child union, elim and weight, the state-of-the-art MEA methods GenGO and MGSA use flattened representations of the ontology and therefore fail to explicitly incorporate semantic distance.

Sparse data

Real datasets frequently suffer from sparsity due to a variety of data collection and experimental design issues

Methods

Our approach represents the combination of the classes in an ontology and the instances in an associated dataset as a single finite ergodic Markov chain whose adjusted transition probability matrix is used to compute modified eigenvector centralities, or steady-state probabilities, for each class. The negative log of these modified eigenvector centralities, a quantity we term the information rank, provides a measure of the importance of each class relative to both a dataset and the other classes in the ontology. Similar to annotation frequency, the information rank of a class can be used to support applications that compare the importance of a class in a target dataset with a baseline dataset (e.g., enrichment analysis).

Ontology Model

For defining our approach and discussing other related methods, we follow Bade

**Definition 1 (Ontology)**:

• A set **C** of class identifiers

• A function **parent(c)** that maps each class c in C to the set of direct parents of c in the class hierarchy

**Definition 2 (Ontology Extension)**:

• A set **I** of instance identifiers

• A function **type(i)** that maps each instance in I to a set of one or more classes in C

• A function **rel(i)** that maps each instance in I to a set of zero or more other related instances in I

• A function **weight(i)** that maps each instance in I to a normalized weight between 0 and 1
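As a rough illustration of Definitions 1 and 2, the simplified ontology model can be expressed as two small Python data structures. This is a sketch under stated assumptions rather than the authors' implementation: the class names `Ontology` and `OntologyExtension`, the dictionary-backed storage, the `validate` helper and the default weight of 1.0 for unweighted instances are all choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Definition 1: a set C of classes and a parent(c) function."""
    classes: set
    parent_map: dict = field(default_factory=dict)

    def parent(self, c):
        # Classes with no entry (e.g., a root) have no direct parents.
        return self.parent_map.get(c, set())


@dataclass
class OntologyExtension:
    """Definition 2: instances I with type(i), rel(i) and weight(i)."""
    instances: set
    type_map: dict
    rel_map: dict = field(default_factory=dict)
    weight_map: dict = field(default_factory=dict)

    def type(self, i):
        return self.type_map[i]

    def rel(self, i):
        return self.rel_map.get(i, set())

    def weight(self, i):
        # Assumption: an unweighted instance defaults to the maximum weight 1.0.
        return self.weight_map.get(i, 1.0)

    def validate(self, ontology):
        """Raise ValueError if the extension violates Definition 2."""
        for i in self.instances:
            types = self.type_map.get(i, set())
            if not types or not types <= ontology.classes:
                raise ValueError(f"{i}: needs one or more classes from C")
            if not self.rel(i) <= self.instances:
                raise ValueError(f"{i}: related instances must be in I")
            if not 0.0 <= self.weight(i) <= 1.0:
                raise ValueError(f"{i}: weight must lie in [0, 1]")


# Hypothetical toy data.
ontology = Ontology({"root", "A", "B"}, {"A": {"root"}, "B": {"root"}})
extension = OntologyExtension(
    instances={"g1", "g2"},
    type_map={"g1": {"A"}, "g2": {"A", "B"}},
    rel_map={"g1": {"g2"}},
    weight_map={"g1": 0.9},
)
extension.validate(ontology)
```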

Markov Chain Model

Our proposed methodology for analyzing an ontology relative to a collection of domain data represents the combination of an ontology and its extension as a finite ergodic Markov chain. A finite Markov chain is a finite stochastic process in which the probability of transitioning from a state i to a state j is only dependent on the state i and not on the path taken through the chain to arrive at state i

**Definition 3 (Finite Ergodic Markov Chain)**:

•

• Element p_{ij} of the transition probability matrix P represents the probability that the state will be j if the current state is i

• The transition probabilities p_{ij} are only dependent on the current state i. Therefore:

•

•

Core MCOA Process

At the core of our methodology is a process for computing an eigenvector-based score for each class in an ontology relative to an extension of that ontology (i.e., a collection of data annotated using the ontology classes). We call this the information rank based on its similarity to the well-known PageRank algorithm for computing the ranks of web pages using a Markov model of a random walk with jumps through web page links

• **Step 1**: Model the ontology and extension as a single finite ergodic Markov chain.

• **Step 2**: Create an adjusted transition probability matrix for the Markov chain.

• **Step 3**: Use the transition probability matrix to compute the eigenvector-based steady-state probability and information rank for each ontology class.

Algorithmic details for each of these steps are outlined below and formalized in Definitions 4, 5, 6 and 7. Figure

**MCOA mapping between ontology, ontology extension and Markov chain**. (A) Simple ontology and extension. (B) Markov chain representing simple ontology and extension according to MCOA method. (C) Adjusted transition probability matrix for Markov chain according to MCOA method. (D) Information rank values generated from adjusted transition probability matrix using α = 0.15 and ω = 0.01.

Step 1: Model Ontology and Extension as Markov Chain

Our approach builds a Markov chain model of an ontology and its extension by mapping classes in the ontology and the instances of those classes to states in the Markov chain and by mapping all instance-to-class relations and hierarchical relations between classes to state transitions. Given the simplified model of an ontology and its extension specified in Definitions 1 and 2 and the model of a finite ergodic Markov chain specified in Definition 3, the process for building a Markov chain from an ontology and its extension is formalized in Definition 4 below. Figure

**Definition 4 (Ontology-to-Markov Chain Mapping)**:

• The state set is divided into two subsets: S_{C}, which contains the states corresponding to ontology classes, and S_{I}, which contains the states corresponding to ontology instances:

• A bijection exists between the states in S_{C} of the Markov chain and the classes in set C (i.e., there is a one-to-one mapping between each class and each Markov chain state in S_{C})

• A bijection exists between the states in S_{I} of the Markov chain and the instances in set I (i.e., there is a one-to-one mapping between each instance and each Markov chain state in S_{I})
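The mapping described in Step 1 can be sketched in Python as follows. This is a hedged illustration, not the published algorithm: the function name `build_chain`, the dictionary inputs and the integer state indexing are assumptions made for the example, as is the direction of the instance-level transitions (instance to class, instance to related instance). The child-to-parent direction of class transitions follows from the text, which notes that root nodes have an out-degree of 0.

```python
def build_chain(parent, type_of, rel):
    """Map an ontology and its extension to Markov chain states and transitions.

    parent:  class -> set of direct parent classes
    type_of: instance -> set of classes the instance is annotated to
    rel:     instance -> set of related instances
    Class and instance identifiers are assumed to be distinct.
    """
    class_states = sorted(parent)         # S_C: one state per class
    instance_states = sorted(type_of)     # S_I: one state per instance
    states = class_states + instance_states
    index = {state: k for k, state in enumerate(states)}

    edges = set()
    for c, parents in parent.items():     # class -> parent class
        for p in parents:
            edges.add((index[c], index[p]))
    for i, classes in type_of.items():    # instance -> annotated class
        for c in classes:
            edges.add((index[i], index[c]))
    for i, related in rel.items():        # instance -> related instance
        for j in related:
            edges.add((index[i], index[j]))
    return states, len(class_states), sorted(edges)


# Hypothetical toy input: one root, one child class, two instances.
states, n_classes, edges = build_chain(
    parent={"root": set(), "A": {"root"}},
    type_of={"g1": {"A"}, "g2": {"A"}},
    rel={"g1": {"g2"}},
)
print(states)   # class states first, then instance states
print(edges)    # (source index, destination index) pairs
```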

Step 2: Create Adjusted Transition Probability Matrix

Calculating the transition probability matrix for the Markov chain defined above involves three key adjustments:

• A random jump probability α. This is the complement of the damping factor, d, used in the PageRank algorithm: α = 1 - d.

• A parameter, ω, that controls how much of the random jump probability is distributed among class states, S_{C}, vs. instance states, S_{I}

• The weights of each individual instance, as specified by the function weight(i)

Using these parameters, the creation of the adjusted transition probability matrix can be formalized according to Definition 5 below. Figure

**Definition 5 (Adjusted Transition Probability Matrix)**:

•

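• A parameter ω that controls how the random jump probability is distributed between states representing classes, S_{C}, and states representing instances, S_{I}, following each random jump. If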

•

• The elements p_{ij} of the N × N transition probability matrix P are defined as follows (where i represents the source state and j represents the destination state of the transition):

The use of the random jump and non-uniform distribution parameters defined above has several benefits in the context of our method:

• It ensures that the Markov chain is ergodic (it would otherwise be absorbing given the 0 out-degree for any root node).

• It allows for prior probability smoothing. Classes without instances can be assigned a configurable portion of the random jump probability. By varying the ω parameter between 0 and 1, the relative weight of a uniform prior probability distribution can be adjusted relative to the analyzed dataset distribution.

• It enables the use of class and instance weighting. Similar to the topic-sensitive PageRank approach

• It allows semantic distance to be quantified. The amount of transferred rank naturally decays as one moves up the hierarchy.
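To make the role of α, ω and the instance weights concrete, the following Python sketch builds a row-stochastic matrix using a generic PageRank-style construction. It should be read as an assumption-laden illustration and may differ from the exact formula of Definition 5: here a fraction ω of each random jump is spread uniformly over class states, the remaining 1 - ω is spread over instance states in proportion to weight(i), and states without outgoing transitions (such as root classes) always jump. The function name `adjusted_matrix` and the state ordering (classes first, then instances) are hypothetical.

```python
def adjusted_matrix(n_classes, n_instances, edges, weights, alpha=0.15, omega=0.01):
    """Build a row-stochastic transition matrix with weighted random jumps.

    edges:   (source, destination) state index pairs, classes indexed first
    weights: one weight per instance state, in instance index order
    """
    n = n_classes + n_instances
    total_weight = sum(weights)
    # Jump distribution: omega shared uniformly by classes,
    # (1 - omega) shared by instances in proportion to their weights.
    jump = [omega / n_classes] * n_classes
    jump += [(1 - omega) * w / total_weight for w in weights]

    out = [[] for _ in range(n)]
    for i, j in edges:
        out[i].append(j)

    matrix = []
    for i in range(n):
        if out[i]:
            # Follow an outgoing transition with probability 1 - alpha,
            # otherwise take a random jump.
            row = [alpha * v for v in jump]
            share = (1 - alpha) / len(out[i])
            for j in out[i]:
                row[j] += share
        else:
            # No outgoing transitions (e.g., a root class): always jump.
            row = list(jump)
        matrix.append(row)
    return matrix


# Hypothetical toy chain: state 0 = root class, 1 = child class, 2 and 3 = instances.
P = adjusted_matrix(
    n_classes=2,
    n_instances=2,
    edges=[(1, 0), (2, 1), (3, 1), (2, 3)],
    weights=[1.0, 0.5],
)
for row in P:
    print([round(x, 4) for x in row])
```

Because every entry of the jump distribution is positive when 0 < ω < 1 and all weights are positive, every state can reach every other state in one step, which is one simple way to guarantee ergodicity.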

Step 3: Compute Information Rank

Given an adjusted transition probability matrix as specified in Definition 5 above, the importance of each class relative to the dataset can be quantified using the components of the principal left eigenvector that correspond to classes in the ontology. These eigenvector components represent the steady-state probabilities of the class states in the associated Markov chain. Normalizing these steady-state probabilities relative to the probabilities for all class states and then taking the negative log of the normalized probabilities generates the information rank. The definitions of steady-state class probability and information rank are formalized in Definitions 6 and 7 below. Figure

**Definition 6 (Adjusted Steady-State Class Probability)**:

**Definition 7 (Information Rank)**:
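The computation described in Step 3 can be sketched as follows: find the steady-state distribution of the chain, renormalize the class components so they sum to 1, and take the negative log. The sketch makes several assumptions that are not taken from the source: it uses plain power iteration to approximate the principal left eigenvector, it uses the natural logarithm, it assumes class states are indexed before instance states, and the 3 × 3 matrix is a made-up ergodic example.

```python
import math


def steady_state(matrix, tol=1e-12, max_iter=10_000):
    """Approximate the principal left eigenvector (pi = pi P) by power iteration."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")


def information_rank(matrix, n_classes):
    """Negative log of the class steady-state probabilities, normalized over classes."""
    pi = steady_state(matrix)
    class_mass = sum(pi[:n_classes])
    return [-math.log(p / class_mass) for p in pi[:n_classes]]


# Hypothetical ergodic chain: states 0 and 1 are classes, state 2 is an instance.
P = [
    [0.10, 0.10, 0.80],
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
]
ranks = information_rank(P, n_classes=2)
print(ranks)
```

Under this definition a lower information rank corresponds to a higher normalized steady-state probability, so more important classes receive smaller values.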

MCOA Enrichment Analysis

Our initial application of the MCOA method to enrichment analysis adopts the probabilistic generative model of gene activation used by both GenGO and MGSA. It specifically extends the GenGO maximum likelihood approach by adding MCOA-based terms to the objective function used in the original GenGO algorithm. Although our initial enrichment analysis method extends GenGO, MCOA can be integrated with other enrichment methods or used directly to determine enrichment significance by employing permutation tests to compute a distribution of possible information rank values. Our choice of GenGO as a base approach was motivated by several factors:

• **GenGO is one of the best state-of-the-art methods**. GenGO and MGSA are two state-of-the-art MEA approaches shown to provide overwhelmingly superior enrichment performance on simulated data.

• **GenGO is feasible to extend**. Integration of MCOA through modification of the objective function was both feasible and straightforward.

• **GenGO returns intuitive results with flexible statistics**. The GenGO process outputs p-values, using the statistical test of choice, for the set of categories that maximize the log likelihood objective function. Use of p-values, as opposed to the marginal posterior probabilities used by MGSA, makes the results of this method more intuitive to researchers and more easily comparable to the results from other enrichment methods. Use of multiple hypothesis correction is also optional.

Execution of the MCOA enrichment analysis algorithm involves three steps:

• **Step 1**: Compute steady state probability scores for the ontology relative to both the reference and target datasets.

• **Step 2**: Find the set of ontology classes that maximizes the likelihood of the observed dataset given a probabilistic generative model.

• **Step 3**: Compute p-values and apply multi-hypothesis correction.

Algorithmic details for each of these steps are outlined below.

Step 1: Compute steady state probability scores for the ontology relative to both the reference and target datasets

This step follows the core MCOA process outlined above.

Step 2: Find the set of ontology classes that maximizes the likelihood of the observed dataset given a probabilistic generative model

The MCOA approach modifies the GenGO objective function by replacing the

**Definition 8 (MCOA Objective Function)**:

•

•

•

•

• _{g }is the set of active instances annotated with at least one active class

• _{n }is the set of active instances not annotated with any active classes

• _{g }is the set of annotations (materialized according to the ontology hierarchy) between inactive instances and active classes

• _{n }is the set of annotations (materialized according to the ontology hierarchy) between inactive instances and inactive classes

• _{ref }is the steady state probability for ontology class c computed using the reference dataset

• _{tar }is the steady state probability for ontology class c computed using the target dataset

•

Step 3: Compute p-values and apply multi-hypothesis correction

For the set of ontology classes that maximizes the objective function, p-values can be computed using any desired statistical test. Similar to the original GenGO method, the current implementation of MCOA computes p-values using the hypergeometric distribution. If desired, multiple hypothesis correction methods can also be applied to the generated p-values. An important benefit of this approach is that multiple hypothesis correction only needs to consider the subset of classes that maximize the objective function rather than all classes in the ontology.
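A minimal sketch of this final step, assuming a one-sided (over-representation) hypergeometric test and using Bonferroni correction purely as an example of a multiple hypothesis correction, is given below. The function names, the toy counts and the choice of Bonferroni are assumptions for illustration; the source does not specify which correction is applied.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, overlap):
    """Upper-tail probability P(X >= overlap) under the hypergeometric distribution.

    population: number of genes in the reference set
    annotated:  number of reference genes annotated to the class
    sample:     number of genes in the target list
    overlap:    number of target genes annotated to the class
    """
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, x) * comb(population - annotated, sample - x)
        for x in range(overlap, upper + 1)
    )
    return favourable / comb(population, sample)


def bonferroni(pvalues):
    """Scale each p-value by the number of tested classes, capped at 1."""
    m = len(pvalues)
    return {cls: min(1.0, p * m) for cls, p in pvalues.items()}


# Hypothetical counts for the classes selected by the optimization step:
# class -> (genes annotated in reference, genes annotated in target list).
counts = {"class_a": (4, 2), "class_b": (5, 0)}
raw = {
    cls: hypergeom_pvalue(10, n_annotated, 3, n_overlap)
    for cls, (n_annotated, n_overlap) in counts.items()
}
adjusted = bonferroni(raw)
print(raw)
print(adjusted)
```

Because the correction factor is the number of classes in the selected subset rather than the size of the whole ontology, the adjusted p-values are far less conservative than they would be if every class were tested.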

GO Enrichment Analysis of Simulated Data

To demonstrate the utility of the MCOA methodology for enrichment analysis of biomedical data, we compared the performance of the MCOA method against GenGO (the Ontologizer implementation), MGSA, Alexa

To enable comparison with prior work, our benchmarking process follows the general approach adopted by Bauer

• **Source of GO annotations**: Creation and analysis of the simulated datasets was performed using the following ontology and species annotation files downloaded from the source control repository links on the Gene Ontology website

• **Selection of active GO categories**: Following prior work

1. **Generate datasets using a more accurate distribution of categories**. Filtering on the total number of annotations results in the disproportionate removal of leaf categories. For the versions of GO and the Drosophila melanogaster annotations used for our benchmarking, 42.4% of the 7,855 directly and indirectly annotated GO categories are leaf terms. If all categories with fewer than 5 total annotations are removed from this set, the total proportion of leaf categories falls to 20.6% of the remaining 3,953 annotated categories. If filtering is instead based on direct annotations, the proportion of leaf categories remains essentially constant at 43.9% with 1,855 categories left in the set. Both types of filtering effectively maintain the overall distribution of categories by level (see Figure


**Distribution of annotated GO categories by hierarchical level**. Distribution of Gene Ontology categories annotated with Drosophila melanogaster genes by hierarchical level. Shown are distributions for all annotated categories, categories with at least 5 total annotations and categories with at least 5 direct annotations.

2. **Create simulated datasets that are more consistent with a generative model of gene activation**. Categories with very few or no direct annotations are more likely to be high-level grouping constructs with low analytical value than categories with at least a few direct annotations. A direct annotation for a high-level category provides evidence that the category, rather than one of its subcategories, has been found by curators to provide the best explanation for a specific piece of experimental data. We believe that requiring such evidence for active categories results in datasets that better reflect a generative model of gene activation and represent more biologically meaningful categories.

3. **Create simulated datasets that highlight key analytical challenges**. Filtering based on either direct or total annotations creates a dataset with a high mean annotation level and increased level of class overlaps. Filtering by direct annotations has the added benefit of generating datasets with a larger ratio of direct-to-indirect annotations, highlighting the challenge of differentiating between these types of annotations during enrichment analysis, a distinction ignored by most enrichment methods. With no filtering, each GO category with Drosophila annotations has an average of 7 direct and 61 total annotations. Requiring a minimum of 5 direct annotations results in a set of potentially active categories with an average of 29 direct and 115 total annotations. If a minimum of 5 total annotations is required, the set of active categories has an average of 14 direct annotations and 120 total annotations.

• **False positive rate (q)**: Probability that a gene not associated with an active category is activated. GenGO tested with fairly low false positive rates of 0.01 and 0.15. MGSA reported results for false positive rates of 0.1 and 0.4. The results shown below use a value of 0.1, which corresponds to one of the MGSA values and is between the two GenGO values. Simulations were also performed for false positive rates of 0.01 and 0.4 and results can be found in Additional Files

**Benchmarking results on simulated Escherichia coli data sets for false positive rate (q) of 0.01 and false negative rate (1-p) of 0.1**.

**Benchmarking results on simulated Drosophila melanogaster data sets for false positive rate (q) of 0.01 and false negative rate (1-p) of 0.1**.

**Benchmarking results on simulated Homo sapiens data sets for false positive rate (q) of 0.01 and false negative rate (1-p) of 0.1**.

**Benchmarking results on simulated Escherichia coli data sets for false positive rate (q) of 0.4 and false negative rate (1-p) of 0.25**.

**Benchmarking results on simulated Drosophila melanogaster data sets for false positive rate (q) of 0.4 and false negative rate (1-p) of 0.25**.

**Benchmarking results on simulated Homo sapiens data sets for false positive rate (q) of 0.4 and false negative rate (1-p) of 0.25**.

• **False negative rate (1-p)**: Probability that a gene associated with an active category is deactivated. GenGO reported primary results for false negative rates of 0.1 and 0.5. MGSA reported results for false negative rates of 0.25 and 0.4. The results shown below use a value of 0.25, which matches one of the MGSA settings and lies between the two GenGO values. Simulations were also performed for false negative rates of 0.1 and 0.4 and results can be found in Additional Files

**Benchmarking results on simulated Escherichia coli data sets for false positive rate (q) of 0.1 and false negative rate (1-p) of 0.1**.

**Benchmarking results on simulated Drosophila melanogaster data sets for false positive rate (q) of 0.1 and false negative rate (1-p) of 0.1**.

**Benchmarking results on simulated Homo sapiens data sets for false positive rate (q) of 0.1 and false negative rate (1-p) of 0.1**.

**Benchmarking results on simulated Escherichia coli data sets for false positive rate (q) of 0.1 and false negative rate (1-p) of 0.4**.

**Benchmarking results on simulated Drosophila melanogaster data sets for false positive rate (q) of 0.1 and false negative rate (1-p) of 0.4**.

**Benchmarking results on simulated Homo sapiens data sets for false positive rate (q) of 0.1 and false negative rate (1-p) of 0.4**.

• **Enrichment threshold for precision/recall calculations (σ)**: The prior benchmarking work by Bauer
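The roles of the false positive rate (q) and false negative rate (1-p) in the simulation parameters above can be illustrated with a short Python sketch of a generative gene activation model. The function name `simulate_active_genes`, the toy annotations and the fixed random seed are hypothetical, and the annotations are assumed to be already materialized along the ontology hierarchy; the actual benchmarking procedure may differ in its details.

```python
import random


def simulate_active_genes(annotations, active_categories, q=0.1,
                          false_negative=0.25, seed=0):
    """Sample a set of activated genes from a generative model of gene activation.

    annotations:       gene -> set of categories (materialized annotations)
    active_categories: categories chosen as the true source of activation
    q:                 probability that an unassociated gene is activated
    false_negative:    probability (1 - p) that an associated gene is deactivated
    """
    rng = random.Random(seed)
    active = set()
    for gene, categories in sorted(annotations.items()):
        if categories & active_categories:
            # Associated gene: activated unless a false negative occurs.
            if rng.random() >= false_negative:
                active.add(gene)
        elif rng.random() < q:
            # Unassociated gene: activated only as a false positive.
            active.add(gene)
    return active


# Hypothetical toy annotations.
annotations = {
    "g1": {"A", "root"},
    "g2": {"A", "B", "root"},
    "g3": {"B", "root"},
    "g4": {"C", "root"},
}
print(sorted(simulate_active_genes(annotations, {"A"})))
```

With q = 0 and a false negative rate of 0, the sampled set is exactly the set of genes annotated to the active categories; raising either rate adds noise that an enrichment method must see through to recover the active categories.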

GO Enrichment Analysis of Parkinson's Gene Expression Data

To demonstrate the utility of the MCOA method on real experimental data, we compared the enrichment results generated by MCOA, GenGO, MGSA and the standard hypergeometric test on differentially expressed genes from a study of Parkinson's post-mortem brain samples available in the Gene Expression Omnibus (GEO)

The R GEOquery package

Implementation

To validate our approach, generate experimental results for this paper and analyze real biomedical data, we have created a prototype implementation of the MCOA core methodology and MCOA enrichment analysis method described above. The core MCOA method was implemented in Java™ (version 1.6) using JUNG

The MCOA-based enrichment analysis method was implemented in Java™ as an extension to the Ontologizer 2 framework

The MCOA enrichment analysis application can be accessed at the project homepage

Results

Analysis Challenge Examples

To illustrate the computational behaviour of the MCOA method and the ability of this method to detect complex structural features, we computed information rank and information content values for a set of simple, domain-independent models that represent the analytical challenges outlined in the introduction section above. Each model was generated as a synthetic OWL ontology with associated instance data and, for all examples, the MCOA method was run with α = 0.15 and ω = 0.01. The ontology, dataset and analysis results for each example are shown in Figure


**Analysis challenge examples**. (A) Overlapping classes due to multiple annotations. (B) Overlapping classes due to multiple parents. (C) Continuously valued instance weights. (D) Inter-instance relationships. (E) Semantic distance. (F) Sparse data. For all examples, MCOA run with α = 0.15 and ω = 0.01.

• **Class overlaps**. Figures

• **Continuously valued data**. Figure

• **Inter-instance relationships**. In Figure

• **Semantic distance**. Figure

• **Sparse data**. Figure

Results of GO Enrichment Analysis of Simulated Data

Using the benchmarking process outlined above, we tested MCOA enrichment analysis and the other state-of-the-art methods on simulated Escherichia coli, Drosophila melanogaster and Homo sapiens datasets. Figures


**Benchmarking on simulated Escherichia coli data sets**. Performance of MCOA, MGSA, GenGO, weight, parent-child union and hypergeometric methods on simulated Escherichia coli data sets created with false positive rate (q) of 0.1, false negative rate (1-p) of 0.25. (A) Precision/recall statistics are computed using all categories. (B) Precision/recall statistics are computed using only significantly enriched categories.


**Benchmarking on simulated Drosophila melanogaster data sets**. Performance of MCOA, MGSA, GenGO, weight, parent-child union and hypergeometric methods on simulated Drosophila melanogaster data sets created with false positive rate (q) of 0.1, false negative rate (1-p) of 0.25. (A) Precision/recall statistics are computed using all categories. (B) Precision/recall statistics are computed using only significantly enriched categories.


**Benchmarking on simulated Homo sapiens data sets**. Performance of MCOA, MGSA, GenGO, weight, parent-child union and hypergeometric methods on simulated Homo sapiens data sets created with false positive rate (q) of 0.1, false negative rate (1-p) of 0.25. (A) Precision/recall statistics are computed using all categories. (B) Precision/recall statistics are computed using only significantly enriched categories.

**Relative execution time statistics on simulated Homo sapiens data**.

As the precision/recall curves in Figures

When precision/recall metrics are calculated irrespective of enrichment values, as shown in Figures

Overall, the MCOA method provides superior enrichment performance across a range of species and experimental parameters. Note that, to support comparison against other state-of-the-art methods, these benchmarking tests only reflect performance on data sets that exercise the class overlap and semantic distance challenges. On datasets that incorporate continuous data values, inter-instance relationships, non-hierarchical class relationships or sparse data, we expect the relative advantage of the MCOA method to be even greater.
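The benchmark parameters quoted in the figure captions can be illustrated with a short sketch. It assumes the common generative setup in which a gene annotated to an active category is observed with probability p (so the false negative rate is 1-p) and any other gene is observed with probability q. The function names and toy data are hypothetical and are not taken from the MCOA implementation; the precision/recall routine simply walks down a ranked list of categories.

```python
import random


def simulate_observed_genes(annotations, active_categories, p=0.75, q=0.1, seed=0):
    """Draw an observed gene set from a simple activation model (assumed setup).

    annotations: dict mapping category -> set of annotated genes.
    Genes in any active category are observed with probability p,
    all other genes with probability q.
    """
    rng = random.Random(seed)
    all_genes = sorted(set().union(*annotations.values()))
    truly_on = set().union(*(annotations[c] for c in active_categories))
    observed = set()
    for gene in all_genes:
        prob = p if gene in truly_on else q
        if rng.random() < prob:
            observed.add(gene)
    return observed


def precision_recall_curve(ranked_categories, active_categories):
    """Return (precision, recall) after each position of a ranked category list."""
    active = set(active_categories)
    if not active:
        raise ValueError("at least one active category is required")
    hits = 0
    curve = []
    for k, category in enumerate(ranked_categories, start=1):
        if category in active:
            hits += 1
        curve.append((hits / k, hits / len(active)))
    return curve


# Hypothetical toy annotations, for illustration only.
toy_annotations = {
    "cat_a": {"g1", "g2", "g3"},
    "cat_b": {"g3", "g4"},
    "cat_c": {"g5", "g6"},
}
observed = simulate_observed_genes(toy_annotations, ["cat_a"], p=0.75, q=0.1)
curve = precision_recall_curve(["cat_a", "cat_b", "cat_c"], ["cat_a", "cat_c"])
```

Computing the curve over all categories corresponds to panel (A) of the benchmarking figures; restricting `ranked_categories` to the categories a method reports as significantly enriched corresponds to panel (B).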

Results of GO Enrichment Analysis of Parkinson's Gene Expression Data

The top ten enriched GO terms returned by MCOA, hypergeometric, MGSA and GenGO are listed in Figure

**Analysis of Parkinson's gene expression data from GEO GDS3129**. GO enrichment results on significantly differentially enriched genes in Parkinson's postmortem brain tissue (GEO dataset GDS3129). The top 10 GO terms generated by MCOA, the standard hypergeometric method, GenGO and MGSA are shown for comparison. GO terms are ranked by uncorrected p-value for MCOA, GenGO and hypergeometric and by marginal posterior probability for MGSA. See Additional Files 15, 16, 17 and 18 for complete results.

**Full analysis results for MCOA on GEO dataset GDS3129**.

**Full analysis results for hypergeometric method on GEO dataset GDS3129**.

**Full analysis results for MGSA method on GEO dataset GDS3129**.

**Full analysis results for GenGO method on GEO dataset GDS3129**.

**Research linking top ten GO terms returned by MCOA on GEO dataset GDS3129 and Parkinson's disease**.

The top GO terms returned by the standard hypergeometric method are all at a very high level in the GO tree (the fourth-ranked result is the root biological process) and a number of terms are redundant due to hierarchical overlap. Although both GenGO and MGSA generate results that are generally similar in content and specificity to those returned by MCOA, a close inspection reveals important differences impacting result quality and utility to experimental scientists. The second term in the GenGO results,

The results returned by the MGSA method have similar issues, when compared to MCOA, as the GenGO results (e.g., MGSA also fails to identify

Discussion

The Challenge of Biological Complexity

Ontology-based data analysis methods such as enrichment analysis and semantic similarity clustering have become critical tools for processing the experimental results of modern biomedical science. Without the abstract lens of classifications such as GO and KEGG, the large gene and protein lists generated by molecular biological research would be difficult to analyze manually and almost impossible to compare meaningfully across experimental populations or species. Despite the important role that these methods play in interpreting and guiding biomedical research, their utility has been hampered by the limited ability of traditional analytical methods to handle the complex interdependencies present in real biomedical data and associated data models. The members of real biological datasets do not cleanly sort into independent classes but instead group into complex collections of nested and overlapping categories, with direct relationships between dataset members and a mixture of continuous and categorical data values.

Tackling this complexity requires methods that perform a global, rather than local, analysis of the ontology and dataset to capture the full range of structural interdependencies and data values. Although recent methods in the GSEA and MEA categories have made notable advances in this area, specifically in addressing class overlaps and continuously valued data, the interesting features of many biological datasets remain inaccessible to analytical tools. To help address the challenge of biological complexity, we developed the MCOA method as a network analytic framework capable of addressing the class overlap and continuously valued data challenges targeted by MEA and GSEA methods as well as supporting continuous relationship values, inter-instance relations, non-hierarchical class relations, semantic distance and sparse data.

Advantages of the MCOA Markov Chain Model

Underlying the MCOA method's analytical behaviour and its ability to successfully detect structural complexity is the method employed for building a Markov chain model and computing steady state probabilities. Several features of the MCOA Markov chain model are critical to its functionality:

• **Assignment of probabilistic weight per instance rather than per annotation**. Under the MCOA Markov chain model, the weight for each dataset instance is divided among all of the classes to which the instance is annotated. This weight is initially divided among all direct annotations of the instance and, as it propagates through the Markov chain, consolidates in a progressively smaller number of classes until the entire instance weight is concentrated at the root. The MCOA approach contrasts with the annotation frequency approach, in which the full instance weight is assigned to each annotated class, with the effect that instances shared by many classes contribute the same weight as instances annotated to only a single class. MCOA uses the differential contribution of instances with a large number of class annotations and those with a small number of annotations to help detect class overlaps resulting from multiple annotations and multiple parents.

• **Flexible relationships**. Traditional analysis methods only model hierarchical class relationships and class-to-instance annotations. Some methods, such as GenGO and MGSA, ignore most hierarchical information by analyzing a collapsed representation of the ontology graph. The MCOA method, in contrast, analyzes the full ontology and dataset network and can additionally handle relationships, such as inter-instance relationships and non-hierarchical relationships between classes, that are important for modelling real biomedical data but are not directly supported by existing MEA approaches.

• **Semantic distance computation**. The use of a random jump parameter allows semantic distance to be quantified and hierarchical overlaps to be detected, since the amount of transferred rank naturally decays with each transition up the ontology hierarchy. Although semantic distance is captured at some level by enrichment methods such as elim and weight, it is ignored by the more recent MEA approaches GenGO and MGSA as well as by techniques in the GSEA category.

• **Continuous values for instances, classes and relationships**. A non-uniform distribution of random jump probabilities can be used in the MCOA method to mirror differential class and instance weights. The Markov chain model also enables continuous values to be applied to inter-class, class-to-instance or inter-instance relationships. With existing state-of-the-art analysis methods, support for continuous data values is usually limited to dataset instances.

• **Prior weighting**. The non-uniform distribution of random jump probability also allows the MCOA method to apply any desired prior probability distribution to achieve smoothing of sparse data or to align with a Bayesian analysis approach.
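The features listed above can be made concrete with a minimal sketch of a random-jump Markov chain solved by power iteration. It assumes a PageRank-style formulation, π = (1 - α) π P + α v, where v is the random jump distribution; the source does not give the exact construction of the adjusted transition matrix here, and the ω parameter is not modelled. The function name, node names and toy network are hypothetical.

```python
def steady_state(edges, jump, alpha=0.15, tol=1e-12, max_iter=10_000):
    """Power iteration for pi = (1 - alpha) * pi P + alpha * jump.

    edges: dict mapping node -> {target: weight}; weights are normalised per node,
           so an instance's weight is split across its annotations.
    jump:  dict mapping every node -> random jump probability (sums to 1).
           A non-uniform jump can encode instance weights or a prior.
    Nodes without outgoing edges (e.g. the root) return their mass via `jump`.
    """
    nodes = list(jump)
    pi = dict(jump)
    for _ in range(max_iter):
        nxt = {n: 0.0 for n in nodes}
        dangling = 0.0
        for n in nodes:
            out = edges.get(n, {})
            total = sum(out.values())
            if total == 0:
                dangling += pi[n]
                continue
            for target, weight in out.items():
                nxt[target] += (1 - alpha) * pi[n] * weight / total
        for n in nodes:
            nxt[n] += (alpha + (1 - alpha) * dangling) * jump[n]
        delta = sum(abs(nxt[n] - pi[n]) for n in nodes)
        pi = nxt
        if delta < tol:
            break
    return pi


# Hypothetical toy network: gene1 is annotated to two classes, gene2 to one.
edges = {
    "gene1": {"classA": 1.0, "classB": 1.0},
    "gene2": {"classA": 1.0},
    "classA": {"root": 1.0},
    "classB": {"root": 1.0},
    "root": {},
}
nodes = list(edges)
jump = {n: 1.0 / len(nodes) for n in nodes}  # uniform random jump
pi = steady_state(edges, jump)
for node in nodes:
    print(f"{node}: {pi[node]:.4f}")
```

In this toy network classA ends up with more steady-state probability than classB because gene1 contributes only half of its weight to each of its two classes, whereas a plain annotation count would credit the full instance to both. Replacing the uniform `jump` with a non-uniform distribution is one way to express continuous instance weights or a smoothing prior under the same assumptions.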

MCOA for Enrichment Analysis

We chose enrichment analysis as the context in which to explore and validate the functionality of the MCOA method. In developing and benchmarking a MCOA-based enrichment analysis approach, we aimed to create an enrichment tool with the best performance among existing state-of-the-art methods on simulated datasets created to highlight the complexities encountered in real biomedical data. We also aimed to create a practical methodology capable of generating enrichment results on real data sets that are specific, non-overlapping and of high utility to experimental biologists. The superior performance achieved by the MCOA enrichment analysis approach can be understood in terms of the kinds of type I and type II errors encountered by the other generative methods (GenGO and MGSA) but avoided by MCOA.

In this context, type I, or false positive, errors represent cases where an enrichment method incorrectly identifies a non-active category as enriched. There were two varieties of type I errors commonly made by the other generative methods that were avoided by MCOA:

• **Incorrectly flagging non-active categories that are more general than an active category**. In these cases, the more general category appears enriched because it inherits all of the annotations from the active category along with a significant number of additional annotations introduced by noise. MCOA is able to correctly ignore these categories because the contributions from the active category are discounted due to both semantic distance and overlaps with other classes. GenGO and MGSA, because they collapse the ontology graph and give each annotation equal weight regardless of the number of annotations, do not discount the contributions from the active category and incorrectly flag the more general category as enriched.

• **Incorrectly flagging non-active categories that are not hierarchically related to an active category, have a small number of associated genes and few or no direct annotations**. In these cases, the non-active category appears enriched due to noise. Because these categories have few annotated genes and almost no directly annotated genes, MCOA assigns the category a low steady-state probability and does not include it in the set of significantly enriched categories. Because the other generative methods assign weight per annotation and ignore semantic distance, they give the category an incorrectly high weight and mark it as enriched.

Type II, or false negative, errors represent cases where an enrichment method fails to identify an active category as enriched. In our experiments, the other generative methods commonly failed to identify active categories with a small number of directly annotated genes as enriched. When analyzed by MCOA, these categories have a higher relative steady-state probability due to both the lack of a semantic distance discount for the direct annotations and the fact that direct annotations will not have overlaps due to multiple parents. Because of this higher relative steady-state probability, MCOA is able to successfully mark these categories as enriched. GenGO and MGSA, on the other hand, do not give any special weight to the direct annotations and therefore fail to detect the relatively higher enrichment of these categories.
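The discounting behind both error types can be illustrated numerically. The sketch below is a simplification, not the MCOA computation itself: it assumes that an instance's weight is split evenly over its direct annotations and that each transition along a single path retains a fraction (1 - α) of the rank, and it ignores further splitting at classes with multiple parents. The function name is hypothetical.

```python
def discounted_contribution(instance_weight, n_annotations, hops, alpha=0.15):
    """Weight reaching a class `hops` transitions away from an instance.

    Assumes an even split over the instance's direct annotations and a
    retention factor of (1 - alpha) per transition along one path.
    """
    if n_annotations < 1 or hops < 1:
        raise ValueError("n_annotations and hops must be at least 1")
    return instance_weight / n_annotations * (1 - alpha) ** hops


# A gene annotated directly to one category (one transition away).
direct = discounted_contribution(1.0, n_annotations=1, hops=1)
# The same gene seen from a more general category three transitions away.
inherited = discounted_contribution(1.0, n_annotations=1, hops=3)
# A gene whose weight is shared across four direct annotations.
shared = discounted_contribution(1.0, n_annotations=4, hops=1)
print(direct, inherited, shared)
```

Under these assumptions a direct annotation carries more weight than the same annotation inherited by a distant ancestor, and a heavily annotated gene contributes less to each of its classes. A per-annotation count would assign the value 1 in all three cases.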

MCOA Limitations

Limitations of the MCOA method and MCOA-based enrichment analysis include a comparatively high computational complexity relative to other methods (see Additional File

Other MCOA Applications

Although the discussion and examples in this paper have primarily focused on the use of the MCOA method for enrichment analysis, the same general approach can be used to support other ontology-based analysis applications, such as:

• **Semantic similarity clustering**: Semantic similarity algorithms that use the information content of classes (e.g., Resnik

• **Ontology evaluation**: Similar to the modification of semantic similarity algorithms, existing statistical ontology evaluation approaches that leverage information content (e.g., Alterovitz

• **Ontology-driven information retrieval**. If the Markov chain is created such that state transitions flow from classes in the ontology to instances, instance-level steady-state probability values can be computed that quantify the importance of each instance relative to the classes in the ontology.

• **Ontology comparative analysis**. If state transitions flow from the classes, through a set of associated instances and into the classes in another ontology, it becomes possible to use the MCOA method to quantify the importance of one set of classes relative to another set of classes based on the annotations of a common dataset. Comparative analysis of multiple ontologies can also be enabled through non-hierarchical relationships between the classes in one ontology and the classes in another ontology.

Conclusion

Biomedical ontologies have become increasingly critical for the analysis, retrieval and integration of large and complex datasets. Of particular importance are applications, such as enrichment analysis, that measure the importance of ontology classes relative to a collection of domain data. Current analysis methods, however, remain limited in their ability to detect and accurately quantify a range of complex structural features at the ontological and dataset levels. To help address these challenges, we developed the Markov Chain Ontology Analysis (MCOA) methodology and used this method to create the MCOA extension of the GenGO enrichment analysis approach.

The core MCOA method can detect structural features including class overlaps, continuous data values, relationships between data instances, semantic distance and sparse data that are difficult to detect using standard annotation frequency analysis. In benchmarking studies on simulated Escherichia coli, Drosophila melanogaster and Homo sapiens datasets highlighting the complexities of biomedical data, the MCOA enrichment analysis method provides the best performance of comparable state-of-the-art Gene Ontology enrichment methods. On real experimental data, MCOA has been shown to provide specific, non-redundant and scientifically valid results.

As next steps, we plan to conduct benchmarking on datasets that capture a wider range of analytical challenges (e.g., continuous weights and inter-instance relationships), use the MCOA enrichment analysis method to analyze and interpret additional experimental data sets, and perform enrichment against ontologies other than the Gene Ontology. We also plan to explore the use of the MCOA information rank value for applications that have traditionally employed information content, such as ontology evaluation and semantic similarity clustering.

An implementation of the MCOA-based enrichment analysis tool can be accessed at the project homepage

Authors' contributions

HRF conceived of the methodology, implemented the MCOA algorithm and MCOA enrichment analysis method, performed the reported data analysis and drafted the manuscript. ATM participated in the development of the methodology, selection and analysis of use cases and revision of the manuscript. Both HRF and ATM have read and approved the final manuscript.

Acknowledgements

This work was supported by an anonymous foundation and the Harvard Catalyst | The Harvard Clinical and Translational Science Center (NIH Grant #1 UL1 RR 025758-01 and financial contributions from Harvard University and participating academic health care centers). We thank the anonymous reviewers for their insightful comments and suggestions.