Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Yokohama 223-8522, Japan

Abstract

Background

Prediction of biochemical (metabolic) pathways has a wide range of applications, including the optimization of drug candidates and the elucidation of toxicity mechanisms. Several methods have recently been developed that predict a pathway deriving a goal compound from a start compound. However, these methods are computationally expensive and cannot comprehensively predict novel metabolic pathways. The aim of this study is to develop a

Results

We formulated pathway prediction between a start compound and a goal compound as a shortest path search problem in terms of the number of enzyme reactions applied. We propose an efficient search method based on the A* algorithm, with heuristics that use a Linear Programming (LP) solution to estimate the distance to the goal. First, a chemical compound is represented by a feature vector that counts the frequencies of substructure occurrences in the structural formula. Second, an enzyme reaction is represented as an operator vector by detecting the structural changes between the compounds before and after the reaction. By defining compound vectors as nodes and operator vectors as edges, prediction of the reaction pathway is reduced to a shortest path search problem in the vector space. In experiments on the DDT degradation pathway, we verify that the shortest paths predicted by our method are biologically correct pathways registered in the KEGG database. The results also demonstrate that the LP heuristic achieves a significant reduction in computation time. Furthermore, we apply our method to a secondary metabolite pathway of plant origin and successfully find a novel biochemical pathway that cannot be predicted by the existing method. For the reconstruction of a known biochemical pathway, our method is over 40 times as fast as the existing method.

Conclusions

Our method enables fast and accurate

Background

Identification of the metabolic pathway of a chemical compound and discovery of new metabolic pathways are important in various fields. In general, an enzyme reaction pathway is a sequence of applications of enzymes (represented by EC number) that derives a goal compound from a given compound. In the field of drug discovery

To solve the problem of predicting various metabolic pathways, many bioinformatics approaches have been proposed. Existing approaches can be broadly divided into three categories: the fingerprint-based method, the maximum common substructure search method, and the reaction rule-based method.

Fingerprint-based method

A chemical compound is represented by a fingerprint of its molecular structure, and the Tanimoto coefficient between the fingerprints of two compounds is calculated as a measure of their similarity. The method then predicts that there is a metabolic pathway between the compounds if the similarity exceeds a certain threshold. The necessary calculations are fast, but accurate pathway prediction is difficult.
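As a minimal sketch of the similarity test described above, the following Python snippet computes the Tanimoto coefficient for fingerprints stored as sets of "on" bit positions. The function names, the set-based fingerprint encoding, and the threshold value are assumptions made for illustration; they are not taken from any of the cited methods.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared 'on' bits divided by all 'on' bits."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / union


def predicted_link(fp_a, fp_b, threshold=0.4):
    """Predict a pathway link when the similarity exceeds the threshold.

    The default threshold is a hypothetical value for this example only.
    """
    return tanimoto(fp_a, fp_b) > threshold


# Two toy fingerprints sharing two of four distinct bits.
similarity = tanimoto({1, 2, 3}, {2, 3, 4})
print(similarity)
```

The check is a handful of set operations per compound pair, which is why this family of methods is fast; the score, however, carries no information about which reactions connect the two compounds.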

Maximum common substructure search method

This approach focuses on the maximum common substructure between compounds to predict a metabolic pathway. The maximum common substructure search is an NP-hard problem, and requires enormous computation time in order to evaluate the similarity between compounds of complex structures

Rule-based method

This requires a database of reaction rules constructed from known metabolic reactions, and attempts to predict a metabolic pathway as a sequence of reaction rules. As a feature of reaction rules, some techniques focus on physicochemical properties and structures

This study aims at a comprehensive and

Methods

First, a chemical compound is represented by a feature vector that counts the frequencies of substructures in the structural formula. Second, a set of enzyme reaction rules is collected from the KEGG pathway database. Third, a reaction rule is represented as an operator vector by detecting the structural change between the compounds before and after the reaction. Fourth, by defining compound vectors as nodes and operator vectors as edges, prediction of a reaction pathway from a start compound to a goal compound is reduced to a shortest path search problem in the vector space. The output of reaction pathway prediction is then a sequence of applied reaction rules. The A* algorithm is used to search for the shortest path efficiently. Finally, the solution of a Linear Programming (LP) problem is used as an admissible heuristic for estimating the distance to the goal.

KEGG reaction data

The data for compounds and metabolic enzyme reaction information used in this method all come from KEGG. First, we extracted the information pathways from KEGG pathway

Representation of chemical compounds and enzyme reactions

A key idea in our method is that a chemical compound is converted to a feature vector that represents substructure statistics extracted from the structural formula of the compound. This feature-vector representation evaluates whether a feature, such as a specific substructure, exists in a chemical compound or how many times that feature appears. This converts information about compounds into numerical vectors, called feature vectors, whose

Substructures or paths extracted from chemical structures, which are regarded as graphs with atoms as nodes and bonds as edges, can be an effective descriptor of chemical compounds

where _{c}

For example, methane (CH_{4}),

can be represented by the following feature vector:

We call the path length range specified by
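The path-count representation can be illustrated with a short Python sketch. It assumes that a feature is the sequence of atom labels along a simple path, that a path and its reverse count as the same feature, and that hydrogens are explicit atoms; bond types are ignored. These conventions, and the name `path_features`, are assumptions made for this example and may differ from the exact definition used in this study.

```python
from collections import Counter


def path_features(labels, bonds, max_len):
    """Count label sequences along simple paths with 0..max_len bonds."""
    adjacency = {i: [] for i in range(len(labels))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    raw = Counter()

    def walk(path):
        seq = tuple(labels[i] for i in path)
        raw[min(seq, seq[::-1])] += 1      # same key for both reading directions
        if len(path) - 1 < max_len:
            for nxt in adjacency[path[-1]]:
                if nxt not in path:        # simple paths only: no atom is revisited
                    walk(path + [nxt])

    for atom in adjacency:
        walk([atom])

    # Paths with at least one bond are reached from both ends, so halve them.
    return {key: (n if len(key) == 1 else n // 2) for key, n in raw.items()}


# Methane: one carbon (index 0) bonded to four hydrogens.
methane_atoms = ["C", "H", "H", "H", "H"]
methane_bonds = [(0, 1), (0, 2), (0, 3), (0, 4)]
print(path_features(methane_atoms, methane_bonds, max_len=1))
```

Under these assumptions, path lengths 0 to 1 give one C, four H, and four C-H paths for methane; raising `max_len` to 2 adds six H-C-H paths, which shows how a larger path length range increases the dimensionality of the vector.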

According to the feature-vector representation of chemical compounds, every enzyme reaction rule in the 14570 KEGG enzyme reactions is represented as an operator vector. An operator vector expresses the change in chemical structure before and after the reaction, which is computed as the subtraction of the substrate compound vector from the product compound vector: Let _{a }

**Operator vector of enzyme reaction and a sequence of applications of operator vectors**. (A) An operator vector expresses the change in chemical structure before and after the reaction, which is computed as the difference between the product compound vector and the substrate compound vector. (B) An application of an enzyme reaction to a compound can be done simply by "addition" of the operator vector to the compound vector. Therefore, a reaction pathway is represented by a sequence of additions of operator vectors.
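The two operations in this caption reduce to component-wise arithmetic. The sketch below uses plain tuples over a toy three-feature space; the function names and the example vectors are hypothetical.

```python
def operator_vector(substrate, product):
    """Structural change of a reaction: product vector minus substrate vector."""
    return tuple(p - s for s, p in zip(substrate, product))


def apply_operator(compound, operator):
    """Apply a reaction to a compound by adding its operator vector."""
    return tuple(c + o for c, o in zip(compound, operator))


def apply_pathway(compound, operators):
    """A reaction pathway is a sequence of operator-vector additions."""
    for operator in operators:
        compound = apply_operator(compound, operator)
    return compound


# Toy reaction: one unit of feature 0 is converted into one unit of feature 2.
op = operator_vector(substrate=(2, 1, 0), product=(1, 1, 1))
print(op)
print(apply_pathway((3, 1, 0), [op, op]))
```

Because addition is commutative, the final vector depends only on which operators were applied and how often, not on their order. The order matters only through the constraints that decide whether a rule may be applied at a given step.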

Further, every reaction rule _{a }_{a }_{a}, O_{a}_{a}

In this method, different reactions may sometimes be represented by the same operator vector, because counts of short paths alone do not always capture enough of the compound structure to tell them apart.

Two constraint conditions for applying enzyme reaction rules

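As a constraint for applying a reaction rule to a compound, the substrate inclusion condition is set as inclusion of the substrate vector. When attempting to apply the operator vector of a reaction rule $R_k$ to a compound vector $(c_1, \ldots, c_n)$, the substrate vector $(s_1, \ldots, s_n)$ of $R_k$ must be included in the compound vector, that is, $s_i \le c_i$ for every $i = 1, \ldots, n$.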

**Substrate satisfaction condition for applying enzyme reactions**. As a constraint for applying a reaction rule to a compound, the substrate inclusion condition is modeled by inclusion of the substrate vector. When attempting to apply the reaction rule $R_k$ to a compound vector $(c_1, \ldots, c_n)$, the substrate vector $(s_1, \ldots, s_n)$ of $R_k$ must satisfy $s_i \le c_i$ for every $i = 1, \ldots, n$.

Note that this computationally easy procedure for substrate inclusion is a great advantage of our method using vector representation, because the graph inclusion problem for determining whether a compound structure contains a substrate structure is computationally hard (NP-hard).
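To make the cost argument concrete, here is a minimal sketch of the vector form of the inclusion test, assuming that inclusion means a component-wise comparison of the two count vectors. The function name and the toy vectors are hypothetical.

```python
def includes_substrate(compound, substrate):
    """True when every substructure count of the substrate fits in the compound."""
    return all(s <= c for c, s in zip(compound, substrate))


# The compound has two units of feature 0, so a rule needing one unit applies.
print(includes_substrate(compound=(2, 0, 1), substrate=(1, 0, 0)))
# A rule needing feature 1 does not apply, because the compound has none.
print(includes_substrate(compound=(2, 0, 1), substrate=(0, 1, 0)))
```

The test takes one comparison per vector dimension. It is a necessary condition on substructure counts rather than an exact subgraph match, which is the price paid for avoiding the graph inclusion problem.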

The second constraint is the "non-negative" compound-vector condition. Since the operator vector

Search algorithm between two compounds

The purpose of this study is, given a start compound

**Search space of pathway prediction and A* algorithm for shortest path search**. (A) The metabolic pathway prediction problem between compounds can be replaced by a mathematical shortest path search problem. That is, finding the shortest path to reach the integer vector of a goal compound by adding the integer vectors of reaction rules to the integer vectors of intermediate compounds can be considered a shortest path problem in an integer-vector space. (B) In the A* algorithm, the evaluation function (the distance-plus-cost heuristic)

A* algorithm and heuristics

The A* algorithm uses a best-first search and finds a least-cost path from a given start node to a goal node. It uses a distance-plus-cost heuristic function to determine the order in which the search algorithm visits nodes to be explored in the search space. In the A* algorithm, the evaluation function (the distance-plus-cost heuristic)

In addition, the condition that ensures the A* algorithm finds a shortest path is expressed by the following formula:

where

Breadth-first search (exhaustive search)

By setting the heuristic function
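The search procedure can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the system used in this study: rules are given as hypothetical `(substrate_vector, operator_vector)` pairs, every reaction costs one step, and `max_depth` is a safety bound added here so that the search terminates when the goal is unreachable. With the default heuristic, which always returns 0, the search expands nodes in order of path length and so behaves as a breadth-first search.

```python
import heapq
from itertools import count


def a_star(start, goal, rules, heuristic=lambda node, goal: 0, max_depth=10):
    """Return the shortest list of rule indices from start to goal, or None.

    rules is a list of (substrate_vector, operator_vector) pairs of
    equal-length integer tuples.
    """
    tie = count()                  # tie-breaker so the heap never compares vectors
    frontier = [(heuristic(start, goal), next(tie), 0, start, [])]
    best_cost = {start: 0}         # fewest known steps to reach each node
    while frontier:
        _, _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if cost > best_cost[node] or cost >= max_depth:
            continue               # stale queue entry, or depth bound reached
        for index, (substrate, operator) in enumerate(rules):
            # Constraint 1: the substrate vector must be included in the node.
            if any(s > c for c, s in zip(node, substrate)):
                continue
            successor = tuple(c + o for c, o in zip(node, operator))
            # Constraint 2: the resulting compound vector must be non-negative.
            if any(x < 0 for x in successor):
                continue
            new_cost = cost + 1    # each enzyme reaction counts as one step
            if new_cost < best_cost.get(successor, float("inf")):
                best_cost[successor] = new_cost
                priority = new_cost + heuristic(successor, goal)
                heapq.heappush(
                    frontier,
                    (priority, next(tie), new_cost, successor, path + [index]),
                )
    return None


toy_rules = [
    ((1, 0, 0), (-1, 1, 0)),   # rule 0: feature 0 -> feature 1
    ((0, 1, 0), (0, -1, 1)),   # rule 1: feature 1 -> feature 2
    ((1, 0, 0), (-1, 0, 1)),   # rule 2: feature 0 -> feature 2
]
print(a_star((2, 0, 0), (0, 0, 2), toy_rules))
```

Passing an admissible heuristic as the `heuristic` argument leaves the returned path length unchanged and only reduces the number of nodes expanded. Nodes are re-queued whenever a shorter route to them is found, so the sketch does not require the heuristic to be consistent.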

Manhattan distance

Since each node

However, naive use of the MH distance is inadmissible and does not guarantee the shortest path solution. Therefore, we use the following modified MH heuristic function

where ||_{max}
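One admissible way to rescale the Manhattan distance, sketched here as an assumption rather than as the definition used in this study, is to divide it by the largest Manhattan norm among the operator vectors. A single rule application changes the Manhattan distance to the goal by at most that amount, so the quotient never overestimates the number of remaining steps. The function name and the toy vectors are hypothetical.

```python
import math


def modified_manhattan(node, goal, operators):
    """Lower bound on remaining steps: Manhattan distance over the largest step size."""
    distance = sum(abs(g - c) for c, g in zip(node, goal))
    largest_step = max(sum(abs(x) for x in op) for op in operators)
    if largest_step == 0:
        return 0                   # degenerate case: no operator changes anything
    # Rounding up keeps the bound valid because step counts are integers.
    return math.ceil(distance / largest_step)


toy_operators = [(-1, 1, 0), (0, -1, 1), (-1, 0, 1)]
print(modified_manhattan((2, 0, 0), (0, 0, 2), toy_operators))
```

In this toy case the Manhattan distance is 4 and every operator has norm 2, so the estimate is 2 steps, whereas the unscaled distance of 4 would overestimate a goal that is reachable in two steps.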

Linear programming (LP) heuristics

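A path from the current node $C = (c_1, \ldots, c_n)$ to the goal node $G = (g_1, \ldots, g_n)$ applies each of the operator vectors $O_1, \ldots, O_m$ a non-negative integer number of times $w_k$, so that

$$G - C = \sum_{k=1}^{m} w_k O_k,$$

and the sum of the coefficients $w_k$ is the number of reaction steps on the path from $(c_1, \ldots, c_n)$ to $(g_1, \ldots, g_n)$. An estimate of the remaining distance is therefore obtained by solving the following problem: minimize $\sum_k w_k$ subject to the equation above, with every $w_k$ a non-negative integer.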

This optimization problem is an Integer Programming (IP) problem. The solution to this problem is similar to that for the shortest reaction path problem between the start node and the goal node, except that it does not take into account the order of application of the reaction rules and it ignores the constraint conditions when applying reaction rules. Nevertheless, the solution to "minimize ∑_{k }w_{k}

Our approach is to relax the constraints on the optimization problem "minimize ∑_{k }w_{k}_{k }_{k }_{k }w_{k}_{k }w_{k}

For solving the LP heuristic, we used IBM ILOG CPLEX in
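The reason the relaxed problem can serve as an admissible heuristic can be written out as a short derivation. The symbols below are assumptions introduced for this sketch: $C$ is the current compound vector, $G$ the goal vector, $O_1, \ldots, O_m$ the operator vectors, $w_k$ the number of times rule $k$ is used, and $d^{*}(C, G)$ the true shortest path length.

```latex
\begin{aligned}
h_{\mathrm{IP}}(C) &= \min \Big\{ \textstyle\sum_{k=1}^{m} w_k \;:\;
    \sum_{k=1}^{m} w_k O_k = G - C,\; w_k \in \mathbb{Z}_{\ge 0} \Big\}, \\
h_{\mathrm{LP}}(C) &= \min \Big\{ \textstyle\sum_{k=1}^{m} w_k \;:\;
    \sum_{k=1}^{m} w_k O_k = G - C,\; w_k \in \mathbb{R}_{\ge 0} \Big\}, \\
h_{\mathrm{LP}}(C) &\le h_{\mathrm{IP}}(C) \le d^{*}(C, G).
\end{aligned}
```

The first inequality holds because every integer solution is also a feasible real-valued solution, so the relaxed minimum cannot be larger. The second holds because any valid pathway of length $d$ yields integer usage counts with $\sum_k w_k = d$ that satisfy the equality constraint, while the integer program ignores rule order and the applicability constraints. The LP value therefore never overestimates the remaining distance, and an infeasible LP indicates that the goal cannot be reached from $C$.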

Results

Datasets and target pathways

KEGG Reaction dataset

Table

Reaction rules for the whole KEGG pathway database

| Representation-depth | 0-1 | 0-2 | 0-3 |
|---|---|---|---|
| Dimensionality of vector representation | 76 | 254 | 653 |
| Number of operator vectors | 4240 | 5542 | 8108 |

This table shows the relationship between the representation-depth of enzyme reactions and the number of unique reaction rules.

1. Some reactions are registered as different in KEGG, but the changes in structure are the same and only the substrates are different.

2. Some reactions are actually different but are represented by the same vector.

3. The structure registered as "main" is unchanged by the reaction.

The problem described in the second reason can be mitigated by increasing the representation-depth of the vectors: the added expressive power increases the number of reactions that can be distinguished, from 4240 operator vectors at depth 0-1 to 8108 at depth 0-3.

DDT degradation pathway

In this study, we used the well-known DDT degradation pathway data set

Taking into account the number of involved pathways and compounds, as well as the fact that the pathway is a closed circuit, we consider the DDT degradation pathway ideal for verifying our approach. The pathway consists of 20 compounds and 46 enzyme reactions (Figure

**DDT degradation pathway**. DDT stands for dichlorodiphenyltrichloroethane. It came into use as an insecticide after it was shown to act against many insects in very small quantities. Because it is important to evaluate its negative impact on the environment and on human health, the metabolism of DDT has been studied in recent years. This pathway consists of 20 compounds and 46 enzyme reactions.

Table

Reaction rules only for the DDT degradation pathway

| Representation-depth | 0-1 | 0-2 | 0-3 |
|---|---|---|---|
| Number of operator vectors | 38 | 44 | 46 |

This table shows the number of reaction rules focusing only on the DDT degradation pathway in the KEGG database.

In our experiments, 20 × 19 = 380 pathway routes were selected for the search problem. The first validation experiment only used the 46 enzyme reaction rules contained in the DDT degradation pathway. In the second "more general" experiment, all KEGG reaction rules were used to search the DDT pathway.

Reconstruction of DDT pathway by shortest path finding

We first verified that the shortest path between the start node and the goal node implied the true metabolic pathway, identifying the shortest path using a BF search. Table

Agreement rate with the true pathway

| Depth | 0-1 | 0-2 | 0-3 |
|---|---|---|---|
| Agreement of the distance (%) | 93.2 | 98.4 | 100 |
| Agreement of the route (%) | 81.3 | 93.2 | 100 |

This table shows the agreement percentage between the true distance and the shortest distance (points on

Computational times for heuristics

Table

Average computational time (seconds/pair) for finding 380 pathway routes

| Depth | BF | MH | LP |
|---|---|---|---|
| 0-1 | 1534 | 27.9 | 0.872 |
| 0-2 | 52.1 | 0.255 | 0.0325 |
| 0-3 | 0.0240 | 0.0314 | 0.0310 |

This table shows the computational time for each heuristic and each representation-depth to search for the shortest paths in the DDT pathway.

(BF: breadth-first search, MH: MH heuristic, LP: LP heuristic.)

Comparing the heuristic functions in this table shows that the LP heuristic achieved a significant reduction in computational time: at depth 0-1, the average time fell from 1534 seconds with BF to 27.9 seconds with MH and 0.872 seconds with LP. At depth 0-3, on the other hand, no heuristic reduced the computation time (0.0240 seconds for BF, against 0.0314 for MH and 0.0310 for LP). This implies that, as the representation depth increases, the substrate inclusion condition works more effectively and the number of branches in the search space becomes smaller.

Table

Average number of branchings in the search (#branch/pair)

| Depth | BF | MH | LP |
|---|---|---|---|
| 0-1 | 8389 | 1572 | 225 |
| 0-2 | 1575 | 129.4 | 17.4 |
| 0-3 | 12.8 | 11.6 | 7.6 |

This table indicates the number of times that the search algorithm branched for each heuristic and each representation-depth.

Prediction of DDT pathway using all KEGG reaction rules

A more general reconstruction problem for DDT pathway was carried out using all KEGG reaction rules, to verify whether the method is practical for comprehensively reproducing the DDT degradation pathways. Table

Average computational time (seconds/pair) using all KEGG reaction rules

| Depth | BF | MH | LP |
|---|---|---|---|
| 0-1 | N/A | N/A | N/A |
| 0-2 | N/A | N/A | N/A |
| 0-3 | N/A | N/A | 61.9 |

This table shows the computational time for each heuristic and each representation-depth using all KEGG reaction rules to search for the shortest paths in the DDT pathway.

With the LP heuristic, the agreement rates for both the true distance and the true pathway route were 100% (380/380). Thus, even with the generic operators (all KEGG reaction rules), the method showed high reproducibility.

Prediction of Lutein biosynthesis pathway using all KEGG reaction rules

Another pathway prediction using all KEGG reaction rules was carried out for the Lutein biosynthesis pathway, a secondary metabolic pathway from the start compound "Lycopene" to the goal compound "Lutein". Lycopene is a red carotenoid and Lutein is a plant carotenoid. There are two routes from Lycopene to Lutein in the KEGG pathway database: one is via Zeinoxanthin and the other is via

Our method with the LP heuristic succeeded in precisely predicting all pathways between every pair of compounds on the Lutein biosynthesis pathway. The average computational time for the LP heuristic to predict the shortest paths for all pairs was 10.9 seconds. In contrast, PathPred failed to predict the pathway between Lycopene and Lutein when run with its default parameters: "Simcomp Threshold" was set at 0.4, "Prediction cycle" was set at 1, and Reference pathway was set at "Biosynthesis of Secondary Metabolites (Plants)".

Finding novel biochemical pathways for secondary metabolites of plant origin

To demonstrate the effectiveness of our method for finding novel pathways, we applied our method to predict a biochemical pathway for the start node "Delphinidin" and the goal node "Gentiodelphin". Both compounds are present in the KEGG database. Gentiodelphin is a plant-derived secondary metabolite associated with blue dye, and is known to be synthesized from Delphinidin

Our method with the LP heuristics predicted the two shortest path solutions shown in Figure

**Novel pathway finding for plant biochemical pathways**. "Gentiodelphin" is a plant-derived secondary metabolite associated with blue dye, and is known to be synthesized from "Delphinidin". The biochemical pathway was predicted with a start node of Delphinidin and a goal node of Gentiodelphin. The LP heuristic predicted the two shortest path solutions shown in this figure. Each arrow indicates an applied reaction rule, labeled with its KEGG reaction number. Both predicted pathways consist of four enzyme reactions. The first path (blue) is a metabolic pathway present in the KEGG pathway database. The second path (orange) is not registered in the KEGG database; it suggests a possible new route in which the operator "R6798" is applied at the end.

Overall, our A*-based algorithm with the LP heuristic is a more comprehensive and computationally efficient prediction method for biochemical pathway finding.

Discussion

We have achieved high-speed pathway predictions using a vector-based search that simply focuses on the 2D structures of compounds. The A* algorithm guarantees the discovery of the shortest path, and the efficient search is achieved by the Linear Programming heuristic that estimates the distance to the goal. Results of verification experiments show the high reproducibility of KEGG pathways, the validity of the novel predicted pathway, and the versatility of our method.

Search space for pathway predictions

An exponential increase in the search space accompanies an increase in the true distance. This is represented by the equation:

where

In addition, taking into account the effect of the substrate inclusion condition that bounds the branching, the search space is improved as follows:

where
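As a rough illustration of this growth, the following estimate uses the textbook branching-factor bound for tree search. It is a sketch under assumptions and not necessarily the form of the equations used in this study: $b$ denotes the number of operator vectors, $\tilde{b}$ the average number of rules that pass the substrate inclusion test at a node, $d$ the true distance, and $N$ the number of nodes generated.

```latex
\begin{aligned}
N(d) &\approx \sum_{i=0}^{d} b^{\,i} = O\!\left(b^{\,d}\right), \\
N_{\mathrm{incl}}(d) &\approx \sum_{i=0}^{d} \tilde{b}^{\,i}
    = O\!\left(\tilde{b}^{\,d}\right), \qquad \tilde{b} \le b, \\
b = 8108,\; d = 4 \;&\Rightarrow\; b^{\,d} = 8108^{4} \approx 4.3 \times 10^{15}.
\end{aligned}
```

With all 8108 operator vectors at depth 0-3 and a pathway of four reactions, as in the Delphinidin to Gentiodelphin example, an unconstrained exhaustive search would face on the order of $10^{15}$ candidate sequences. This is consistent with the breadth-first search being impractical when all KEGG reaction rules are used.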

Reproducibility of KEGG Pathway

Our experimental results for comprehensive predictions using all 8108 KEGG reaction rules show that our proposed method is able to reproduce enzyme reaction pathways in the KEGG pathway database with high accuracy. This is presumably due to the LP heuristic and to the bound on branching that the substrate inclusion constraint imposes on the vector representation.

De novo prediction of known and unknown biosynthetic pathways

Our proposed method in this paper is a

Conclusions

We have proposed a computationally efficient method to predict biochemical reaction pathways that derive a goal compound from a start compound. A chemical compound is represented by a feature vector that counts the frequencies of substructure occurrences in the structural formula. A set of enzyme reaction rules collected from the KEGG pathway database was represented using operator vectors, by determining the structural change in the compounds before and after the reaction. Two constraint conditions were imposed when applying reaction rules: substrate inclusion and compound formation. By defining each compound vector as a node and each operator as an edge, prediction of reaction pathways was reduced to a shortest path search problem in a vector space. We proposed an efficient search method that uses the A* algorithm for the shortest path search problem, with an LP solution as the heuristic estimate of the distance to the goal. The results showed that our method had high reproducibility for KEGG pathways and a high possibility of predicting new reaction pathways. We recognize that larger-scale experiments are needed to test the general performance and stability of our method on a variety of known pathways; this is important future work. In addition, the resulting shortest distance can be thought of as a kind of similarity measure between compounds that represents metabolic information, and hence applications to determining similarity of compounds for drug discovery such as

List of abbreviations

LP: Linear Programming; MH: Manhattan; BF: Breadth-first; IP: Integer Programming; DDT: dichlorodiphenyltrichloroethane.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

M.N. and Y.Sakakibara designed the study and analyzed the data. M.N. developed the system and performed the experiments. T.H., Y.Saito, and K.S. proposed the heuristics and analyzed the data. Y.Sakakibara wrote the manuscript. All authors read and approved the final manuscript.

Author's information

Department of Biosciences and Informatics, Faculty of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan.

Acknowledgements

This work was supported in part by a Grant program for bioinformatics research and development from the Japan Science and Technology Agency. This work was also supported by Grant-in-Aid for KAKENHI (Grant-in-Aid for Scientific Research) on Innovative Areas (No.221S0002) and Scientific Research (A) No.23241066 from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

This article has been published as part of