Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA

Department of Computer Science, University of Maryland-College Park, College Park, MD 20742, USA

Abstract

Background

Enabled by rapid advances in sequencing technology, metagenomic studies aim to characterize entire communities of microbes, bypassing the need to culture individual bacterial members. One major goal of metagenomic studies is to identify specific functional adaptations of microbial communities to their habitats. The functional profile of a sample, and the associated abundances, can be estimated by mapping metagenomic sequences to the global metabolic network, which consists of thousands of molecular reactions. Here we describe a powerful analytical method (MetaPath) that can identify differentially abundant pathways in metagenomic datasets, relying on a combination of metagenomic sequence data and prior metabolic pathway knowledge.

Methods

First, we introduce a scoring function for an arbitrary subnetwork and find the max-weight subnetwork in the global network by a greedy search algorithm. Then we compute two significance values, p_{abund} and p_{struct}, for the max-weight subnetwork.

Results

In order to validate our methods, we have designed a simulated metabolic pathways dataset and show that MetaPath outperforms other commonly used approaches. We also demonstrate the power of our methods in analyzing two publicly available metagenomic datasets, and show that the subnetworks identified by MetaPath provide valuable insights into the biological activities of the microbiome.

Conclusions

We have introduced a statistical method for finding significant metabolic subnetworks from metagenomic datasets. Compared with previous methods, results from MetaPath are more robust against noise in the data, and have significantly higher sensitivity and specificity (when tested on simulated datasets). When applied to two publicly available metagenomic datasets, the output of MetaPath is consistent with previous observations and also provides several new insights into the metabolic activity of the gut microbiome. The software is freely available at

Background

Metagenomics is a new scientific field that involves the analysis of organismal DNA sequences obtained directly from an environmental sample, enabling studies of microorganisms that are not easily cultured in a laboratory

To address these problems, we introduce a general method (MetaPath) for searching the global metabolic network to find differentially abundant finer-level subnetworks. For the purposes of this paper we define a subnetwork to be a connected set of genes that is statistically enriched or depleted in one group of samples. Underlying our approach is a statistical scoring system that captures the differential abundance for a given subnetwork, combined with a greedy search algorithm for a maximum weighted subgraph, to identify the highest scoring subnetworks. Unlike previous approaches, MetaPath explicitly searches for significant subnetworks in the global metabolic network (rather than in the KEGG-defined pathways), enabling us to detect subnetworks spanning predefined “containers”. In addition, we developed rigorous statistical methods that take into account the topology of the network when testing the significance of the subnetworks.

Using simulated datasets, we demonstrate that MetaPath outperforms previously described approaches for comparing biological networks based on abundance data. We show that our findings are more robust to noisy data than the results of single gene comparisons, and that MetaPath can find finer-level subnetworks than can be found by comparing predefined KEGG pathways. We also discuss the biological significance of the results derived from the application of MetaPath to actual metagenomic datasets, demonstrating that the output from MetaPath is easy to interpret and provides valuable biological insights. The software is freely available at

Methods

Datasets

We tested our methods on two previously published metagenomic datasets, which were downloaded from the NCBI Trace Archive or Short Read Archive databases: (1) gut microbiomes from obese and lean twins

**Schematic diagram of the MetaPath methods**.
Sequences from each sample are annotated against the KEGG genes database and
mapped to reactions in metabolic networks, resulting in an abundance matrix where the
rows are reactions and columns are samples. Then the p_{abund} and p_{struct} significance values of the max-weight subnetwork are computed.

Scoring metabolic subpathways

To score the biological activity of a particular subnetwork, we first use Metastats

where

Identifying high-scoring pathways

As proposed in

This algorithm tries to find a connected metabolic subnetwork with maximum weight, which can have an arbitrary structure. However, chains are believed to be especially biologically meaningful in metabolic networks, because they capture a series of successively connected reactions. To accommodate this idea, we modify line 8 of the above algorithm to “Pick an edge e_{j} which has the highest weight among the edges that are adjacent to, and have the same direction as, e_{j-1}”. Both search algorithms are implemented in our program and can be selected through command-line parameters. To find all significant subnetworks (computing significance is discussed below), we iteratively remove from the global network the edges contained in previously found significant subnetworks, and rerun our greedy search on the rest of the network until no additional significant subnetworks can be found. Note that, unlike the original version of our code
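The two greedy search variants described in this section can be illustrated with a minimal Python sketch. It is not the MetaPath implementation. It assumes that reactions are directed edges between compounds, that each edge already carries a weight, that a subnetwork's score is the plain sum of its edge weights, and that the search stops at a fixed number of edges `k`. The function name `greedy_subnetwork` and the toy weights are hypothetical.

```python
from typing import Dict, List, Tuple

# A reaction is modelled as a directed edge (substrate, product). Assumption.
Edge = Tuple[str, str]


def greedy_subnetwork(weights: Dict[Edge, float], k: int,
                      chain: bool = False) -> List[Edge]:
    """Grow a connected subnetwork of at most k edges from every seed edge
    and return the one with the highest summed weight (assumed score)."""
    best: List[Edge] = []
    best_score = float("-inf")
    for seed in weights:
        picked = [seed]
        while len(picked) < k:
            if chain:
                # Chain variant: only edges that leave the head of the
                # previously picked edge, i.e. same direction as e_{j-1}.
                head = picked[-1][1]
                candidates = [e for e in weights
                              if e not in picked and e[0] == head]
            else:
                # General variant: any unused edge touching the subnetwork.
                nodes = {n for e in picked for n in e}
                candidates = [e for e in weights
                              if e not in picked
                              and (e[0] in nodes or e[1] in nodes)]
            if not candidates:
                break
            picked.append(max(candidates, key=weights.get))
        score = sum(weights[e] for e in picked)
        if score > best_score:
            best, best_score = picked, score
    return best


# Toy weights, invented for illustration only.
toy = {("A", "B"): 3.0, ("B", "C"): 2.0, ("C", "D"): 5.0,
       ("X", "Y"): 4.0, ("B", "E"): 1.0}
print(greedy_subnetwork(toy, k=3))
print(greedy_subnetwork(toy, k=3, chain=True))
```

To collect several subnetworks, one could delete the edges of each significant result from `weights` and call the function again, mirroring the iterative removal described above.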

Computing the significance of a subnetwork

The null score distribution for a specific subnetwork can be estimated by permuting the sample labels (columns of the abundance matrix) of the reactions and computing the subnetwork scores from the permuted abundance matrix. The significance values computed for a subnetwork are denoted p_{abund} and p_{struct}.
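The label-permutation idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MetaPath code: the scoring function used here (summed absolute difference in group means) is a hypothetical stand-in for the actual subnetwork score, and the names `subnetwork_score` and `permutation_p_value` are invented for this example.

```python
import random
from statistics import mean
from typing import Sequence


def subnetwork_score(rows: Sequence[Sequence[float]],
                     labels: Sequence[int]) -> float:
    """Stand-in score (assumption): summed absolute difference between the
    two group means, taken over the reactions (rows) of the subnetwork."""
    total = 0.0
    for row in rows:
        group0 = [x for x, lab in zip(row, labels) if lab == 0]
        group1 = [x for x, lab in zip(row, labels) if lab == 1]
        total += abs(mean(group1) - mean(group0))
    return total


def permutation_p_value(rows: Sequence[Sequence[float]],
                        labels: Sequence[int],
                        n_perm: int = 1000, seed: int = 0) -> float:
    """Estimate how often a random relabelling of the samples scores at
    least as high as the observed labelling."""
    rng = random.Random(seed)
    observed = subnetwork_score(rows, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute sample labels (matrix columns)
        if subnetwork_score(rows, shuffled) >= observed:
            hits += 1
    # The +1 correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Two reactions, five samples per group; values invented for illustration.
rows = [[1, 1, 1, 1, 1, 9, 9, 9, 9, 9],
        [2, 2, 2, 2, 2, 8, 8, 8, 8, 8]]
labels = [0] * 5 + [1] * 5
print(permutation_p_value(rows, labels))
```

Shuffling the labels rather than the values keeps the group sizes fixed and preserves the correlation between reactions within a sample.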

**Significant subnetworks that are caused by structural biases**.
On the left side, both pathways have equal weight, indicating equal
significance of differential abundance. The high weight of the second pathway,
however, mainly comes from the thick middle edge, which has weight 7. On the right side,
in a densely connected network, any random high-weight edges will form a
subnetwork with high weight (correlated noise).

MetaPath methods summary

To summarize the methods described above, the MetaPath algorithm proceeds as follows:

1. Differential abundance is assessed on an edge-by-edge basis (reaction-by-reaction) using Metastats;

2. The significance estimates (

3. The significance of each subnetwork detected by the greedy search algorithm is assessed using both a topology-independent bootstrapping approach (p_{abund}) and a topology-dependent approach (p_{struct});

4. The subnetworks determined to be significant (_{abund}_{struct}_{abund}

Results and discussion

Performance evaluation using simulated datasets

In order to validate our methods, we have designed a simulated metagenomic study and compared the results with three previous approaches: (i) identifying significantly active subnetworks using simulated annealing and greedy search

We designed a simulated metabolic pathways dataset in which five subjects are created for each of the two groups with distinct phenotypes. To generate the artificial reaction abundance matrix (where rows represent reactions and columns represent subjects), a Gaussian distribution is created for each reaction, whose mean is randomly chosen from a real metagenomic dataset (gut microbiome from obese and lean subjects
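A simulated abundance matrix of this kind can be generated along the following lines. The sketch assumes details that are not given here: the standard deviation is set as a fixed fraction of the mean (`cv`), a chosen set of reactions (`planted`) has its mean multiplied by `effect` in the second group, and negative draws are clipped to zero. The pool of means stands in for values taken from a real dataset. All names and parameter values are hypothetical.

```python
import random
from typing import List, Sequence, Set


def simulate_abundance(mean_pool: Sequence[float], n_reactions: int,
                       planted: Set[int], effect: float = 2.0,
                       cv: float = 0.2, n_per_group: int = 5,
                       seed: int = 0) -> List[List[float]]:
    """Return a reactions x subjects matrix; the first n_per_group columns
    are group 1 and the remaining columns are group 2."""
    rng = random.Random(seed)
    matrix = []
    for r in range(n_reactions):
        mu1 = rng.choice(mean_pool)  # mean drawn from the reference pool
        # Planted reactions are shifted in group 2 (assumed effect model).
        mu2 = mu1 * effect if r in planted else mu1
        group1 = [max(0.0, rng.gauss(mu1, cv * mu1))
                  for _ in range(n_per_group)]
        group2 = [max(0.0, rng.gauss(mu2, cv * mu2))
                  for _ in range(n_per_group)]
        matrix.append(group1 + group2)
    return matrix


# Reactions 0-2 form the differentially abundant set in this toy run.
sim = simulate_abundance([10.0, 50.0, 100.0], n_reactions=20,
                         planted={0, 1, 2})
print(len(sim), len(sim[0]))
```

Because the planted reactions are known, any method's calls can be scored as true or false positives, which is what an ROC comparison requires.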

The receiver operating characteristic (ROC) curve is plotted for each method (Fig.

**Comparison of statistical methods for discovering significant
reactions in simulated datasets**.
Four methods are evaluated: discovering active subnetworks using simulated
annealing (Anneal) and greedy search (Greedy)

Obese and lean twins

We used MetaPath to compare the abundances of the metabolic networks of the gut microbiome in lean and obese subjects, relying on data from ^{-5}, bitscore > 50, and %identity > 50; parameters suggested in the original study), resulting in a total of 1832 unique reactions within the 12 metagenomic samples. First, we computed

**p value distributions from comparing individual metabolic
reactions by Metastats and from comparing metabolic networks by MetaPath**.
The top histogram is the distribution of the p values of individual metabolic reactions
calculated by Metastats. The bottom histogram is the distribution of the p

We then applied MetaPath to this dataset and found 9 differentially abundant subnetworks (Fig. _{abund}_{struct}

**9 statistically significant subnetworks are found in the comparison
of the gut microbiome from the obese and lean subjects**.
All of these subnetworks are enriched in the obese subjects. p_{abund} and p_{struct} significance
values are shown above each subnetwork. p values for each reaction are shown with
the KEGG reaction number. Five pathways (a)-(e) belong to the Fatty Acid
Metabolism pathway in KEGG. Four pathways (f)-(i) contain
L-homocysteine.

Five subnetworks (Fig.

Another interesting significant subnetwork consists of 10 reactions (Fig.

Infant and adult individuals

A second dataset comprises gut microbiome samples from 4 infants and 9 adult individuals, which were sequenced by Kurokawa, ^{-8}, hit length coverage ≥ 50% of a query sequence), resulting in a total of 1781 unique reactions within the 13 metagenomic samples. Based on 10 runs of Metastats, 383.7±1.56 reactions are significant using

Applying MetaPath to search for significant subnetworks using the same parameters as before, we found that 6 subnetworks are enriched in infant subjects (Fig.

**10 statistically significant subpathways are found in the infant and
adult individuals dataset**.
6 subpathways are enriched in the infant subjects (Fig. 4a-4f), and 4 subpathways are
enriched in the adult subjects (Fig. 4g-4j). p_{abund} and p_{struct} significance values are
shown above each pathway.

The pathway in Fig.

Conclusions

We have introduced a statistical method for finding significant metabolic subpathways from metagenomic datasets. Compared with previous methods, results from MetaPath are more robust to noise in the data, and have significantly higher sensitivity and specificity (when tested on simulated datasets). When applied to two publicly available metagenomic datasets, the output of MetaPath is consistent with previous observations and also provides several new insights into the metabolic activity of the gut microbiome. Finally, MetaPath is efficient: a typical metagenomic dataset and the corresponding metabolic network (about 2000 edges) can be analyzed in half an hour on a single processor.

While showing promising results, our methods have several limitations that we plan to address in the near future. First and foremost, we restrict ourselves to pathways of a fixed length, a restriction necessary for accurately computing the null distribution of pathway scores. This can severely affect our ability to discover long pathways whose abundance differs only slightly, but significantly, between samples. Second, we currently estimate gene abundances by simply counting the number of sequencing reads that map to a certain gene. Such an approach ignores differences in the length of genes, potentially leading to incorrect conclusions. We plan to address this issue by incorporating a recently published

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

BL and MP conceived the project, designed the algorithm and wrote the manuscript. BL implemented the algorithm and analyzed the data. Both authors read and approved the final manuscript.

Acknowledgements

We thank Niranjan Nagarajan, Carl Kingsford, James White, Saket Navlakha, and Theodore Gibbons for helpful discussions. This work was supported in part by grants R01-HG004885 from the NIH, and IIS-0812111 from the NSF, both to MP.

This article has been published as part of