Abstract
Background
Prediction of biochemical (metabolic) pathways has a wide range of applications, including the optimization of drug candidates, and the elucidation of toxicity mechanisms. Recently, several methods have been developed for pathway prediction to derive a goal compound from a start compound. However, these methods require high computational costs, and cannot perform comprehensive prediction of novel metabolic pathways. Our aim of this study is to develop a de novo prediction method for reconstructions of metabolic pathways and predictions of unknown biosynthetic pathways in the sense that it does not require any initial network such as KEGG metabolic network to be explored.
Results
We formulated pathway prediction between a start compound and a goal compound as the shortest path search problem in terms of the number of enzyme reactions applied. We propose an efficient search method based on A* algorithm and heuristic techniques utilizing Linear Programming (LP) solution for estimation of the distance to the goal. First, a chemical compound is represented by a feature vector which counts frequencies of substructure occurrences in the structural formula. Second, an enzyme reaction is represented as an operator vector by detecting the structural changes to compounds before and after the reaction. By defining compound vectors as nodes and operator vectors as edges, prediction of the reaction pathway is reduced to the shortest path search problem in the vector space. In experiments on the DDT degradation pathway, we verify that the shortest paths predicted by our method are biologically correct pathways registered in the KEGG database. The results also demonstrate that the LP heuristics can achieve significant reduction in computation time. Furthermore, we apply our method to a secondary metabolite pathway of plant origin, and successfully find a novel biochemical pathway which cannot be predicted by the existing method. For the reconstruction of a known biochemical pathway, our method is over 40 times as fast as the existing method.
Conclusions
Our method enables fast and accurate de novo pathway predictions and novel pathway detection.
Background
Identification of the metabolic pathway of a chemical compound and discovery of new metabolic pathways are important in various fields. In general, an enzyme reaction pathway is a sequence of applications of enzymes (represented by EC number) that derives a goal compound from a given compound. In the field of drug discovery [1], the mechanism of side effects based on information about metabolic pathways has been investigated to clarify the movement of drugs in the body and to optimize drug candidate compounds. In the field of toxicity prediction, exploration of the dynamics of in vivo chemical substances identified the metabolic pathway, leading to the elucidation of the mechanisms of toxicity [2]. Prediction of the biochemical pathways for secondary metabolites has received the most attention in recent years. Although secondary metabolites have been used as lead compounds for food and medicines, most of their biosynthetic pathways still remain unknown. Further, computational methods that support de novo design of biosynthetic pathways are expected in the field of synthetic biology. In the synthetic biology approach, the de novo design is not necessarily limited to biochemical routes that already exist in nature [3].
To solve the problem of predicting various metabolic pathways, many attempts from bioinformatics have been made so far. Existing approaches can be broadly divided into three methods: the fingerprintbased method, the maximum common substructure search method, and the reaction rulebased method.
Fingerprintbased method [4]
A chemical compound is represented by a fingerprint of the molecular structure, and the Tanimoto coefficient between fingerprints for compounds is calculated to indicate similarity. It then predicts that there is a metabolic pathway between compounds if the similarity exceeds a certain threshold. The necessary calculations are fast, but accurate path prediction is difficult.
Maximum common substructure search method [5,6]
This approach focuses on the maximum common substructure between compounds to predict a metabolic pathway. The maximum common substructure search is an NPhard problem, and requires enormous computation time in order to evaluate the similarity between compounds of complex structures [7,8]. Various approximation algorithms have been studied in the search for a computationally tractable approach [916].
Rulebased method [1724]
This requires a database of reaction rules constructed from known metabolic reactions, and attempts to predict a metabolic pathway as a sequence of reaction rules. As a feature of reaction rules, some techniques focus on physicochemical properties and structures [25], while other methods focus on enzyme and gene information [26,27]. Since prediction ability depends on the size and type of metabolites used to build reaction rules, comprehensive prediction is difficult using the exhaustive search algorithm such as breadthfirst search [23,24], and the approach has only been used to predict specific pathways and enzymes. In addition, the complexity of the features used to construct the reaction rules is a factor that has made comprehensive prediction difficult.
This study aims at a comprehensive and de novo approach to predict metabolic pathways between two arbitrary known or unknown compounds, and belongs to the rulebased methods. Using a simple feature that focuses only on the structural formula of compounds, our method enables comprehensive prediction that has been difficult for the conventional methods. Enzyme reactions on the metabolic pathways are used as the reaction rules, and are extracted from the KEGG database [28]. The feature on which we focus is the information before and after the structural change of the compound caused by the enzyme reaction. By using the information about this structural change, our method predicts the enzyme reactions that give the shortest path between two compounds for a given query. Further, a constraint for applying an enzyme reaction rule to a compound is set as the substrate inclusion condition, that is, the compound must include the substrate of the enzyme reaction as part of its own structure. This constraint and shortestpath strategy lead to de novo prediction of unknown biosynthetic pathways that a knowledgebased approach [17,18] cannot predict. In this study, the metabolic pathway prediction problem is reduced to the shortest path problem, and the search method is based on A* algorithm to traverse nodes in the order of priority and employs the LP solution as an admissible heuristics for estimating the distance to the goal.
Methods
First, a chemical compound is represented by a feature vector which counts the frequencies of substructures in the structural formula. Second, a set of enzyme reaction rules is collected from the KEGG pathway database. Third, a reaction rule is represented as an operator vector by detecting the structural change to compounds before and after the reaction. Fourth, by defining compound vectors as nodes and operators as edges, prediction of a reaction pathway from a start compound to a goal compound is reduced to the shortest path search problem in the vector space. Then, "the output for reaction pathway prediction consists of a sequence of applied reaction rules". The A* algorithm is used to efficiently search for the shortest path. Finally, the Linear Programming (LP) algorithm is used as an admissible heuristic for estimating the distance to the goal.
KEGG reaction data
The data for compounds and metabolic enzyme reaction information used in this method all come from KEGG. First, we extracted the information pathways from KEGG pathway [29]. By using the KEGG API, we concretely collected all enzyme reactions registered in the global map on the KEGG pathway. In the KEGG Reaction, a pair of compounds that are registered as "main" before and after the reaction indicate that it is a metabolic reaction present in KEGG enzyme or the KEGG pathway global map. In this study, the 2DSDF structure was extracted only for those pairs registered as "main" to ensure the focus on compound metabolic reactions. Further, most KEGG reactions are registered as reversible reaction, and therefore, the forward and reverse directions were treated as a separate reaction. As a result, the 14570 enzyme reactions and the 6073 related compounds were obtained.
Representation of chemical compounds and enzyme reactions
A key idea in our method is that a chemical compound is converted to a feature vector that represents substructure statistics extracted from the structural formula of the compound. This featurevector representation evaluates whether a feature, such as a specific substructure, exists in a chemical compound or how many times that feature appears. This converts information about compounds into numerical vectors, called feature vectors, whose ith value corresponds to the existence or frequency of the ith feature considered. This featurevector representation enables us to reduce the pathway search problem to a computationally feasible problem in the vector space, as will be discussed later in detail.
Substructures or paths extracted from chemical structures, which are regarded as graphs with atoms as nodes and bonds as edges, can be an effective descriptor of chemical compounds [3032]. In this study, the feature vector based on the 2D chemical structure is defined as the number of appearances of each path in the structure of the chemical compound as follows:
where is a set of paths whose length (depth), or number of bonds, is between l and u (u ≥ l) and which appear at least once in the chemical structures in the dataset. f_{c}(p) is the number of appearances of path p in the structure of a chemical compound c.
For example, methane (CH_{4}),
can be represented by the following feature vector:
We call the path length range specified by l and u the "representationdepth", and denote it by "depth lu", which is crucial for the expressiveness of vector representation.
According to the featurevector representation of chemical compounds, every enzyme reaction rule in the 14570 KEGG enzyme reactions is represented as an operator vector. An operator vector expresses the change in chemical structure before and after the reaction, which is computed as the subtraction of the substrate compound vector from the product compound vector: Let i denote the substrate compound of an enzyme reaction a and j denote the product compound of a. Let and denote the vector representation of i and j, respectively. Then, the operator vector O_{a }for the reaction a is defined as (see also Figure 1):
Figure 1. Operator vector of enzyme reaction and a sequence of applications of operator vectors. (A) An operator vector expresses the change in chemical structure before and after the reaction, which is computed as the difference between the product compound vector and the substrate compound vector. (B) An application of an enzyme reaction to a compound can be done simply by "addition" of the operator vector to the compound vector. Therefore, a reaction pathway is represented by a sequence of additions of operator vectors.
Further, every reaction rule R_{a }for the reaction a is defined as a pair R_{a }= (U_{a}, O_{a}) of the substrate vector and the operator vector O_{a}. As a result, an application of the enzyme reaction to a compound can be achieved simply by "addition" of the operator vector to the compound vector (see also Figure 1), and a reaction pathway from a start compound S to a goal compound G is represented by a sequence of applications (additions) of operator vectors:
In this method, different reactions may sometimes be represented by the same vector because of insufficient shortlength path counts in the compound vector.
Two constraint conditions for applying enzyme reaction rules
As a constraint for applying a reaction rule to a compound, the substrate inclusion condition is set as inclusion of the substrate vector. When attempting to apply the operator vector of a reaction rule R = (U, O) to a compound vector C, the compound vector C must include the substrate vector U as part of its own structure (see Figure 2). The precise determination procedure requires the entries c_{k }in the compound vector C = (c_{1}, ..., c_{n}) to exceed the corresponding value u_{k }in the substrate vector U = (u_{1}, ..., u_{n}), as represented by the following formula:
Figure 2. Substrate satisfaction condition for applying enzyme reactions. As a constraint for applying a reaction rule to a compound, the substrate inclusion condition is modeled by inclusion of the substrate vector. When attempting to apply the reaction rule R = (U, O) to a compound vector C, the compound vector C must include the substrate vector U as part of its own structure. That is, for any k, the value c_{k }in the compound vector C = (c_{1}, ..., c_{n}) must exceed the value u_{k }in the substrate vector U = (u_{1}, ..., u_{n}).
Note that this computationally easy procedure for substrate inclusion is a great advantage of our method using vector representation, because the graph inclusion problem for determining whether a compound structure contains a substrate structure is computationally hard (NPhard).
The second constraint is the "nonnegative" compoundvector condition. Since the operator vector O denotes a change in structure before and after the reaction, it sometimes contains negative values and the application (addition) of this to a compound vector C may produce a vector containing negative values. A compound vector is a vector that we define as representing the frequency of occurrence of partial structures, so negative values are not appropriate. Therefore, we set the following nonnegative condition to filter out compound vectors containing negative values, preventing inappropriate products that contain negative values as a result of applying an operator vector.
Search algorithm between two compounds
The purpose of this study is, given a start compound S and a goal compound G, to find the pathway leading to G by applying reaction rules. Here, the distance of a pathway between S and G is defined as "the number of rule applications (i.e., path length)". By introducing this distance definition, the metabolic pathway prediction problem between compounds can be replaced by a mathematical shortest path search problem. That is, finding the shortest path for reaching the integer vector of the goal compound by successively adding integer vectors of reaction rules to integer vectors of intermediate compounds can be considered as the problem of finding the shortest path in the integervector space. In this search space, a node is an intermediate compound vector and an edge is an applied reaction operator vector (see Figure 3).
Figure 3. Search space of pathway prediction and A* algorithm for shortest path search. (A) The metabolic pathway prediction problem between compounds can be replaced by a mathematical shortest path search problem. That is, finding the shortest path to reach the integer vector of a goal compound by adding the integer vectors of reaction rules to the integer vectors of intermediate compounds can be considered a shortest path problem in an integervector space. (B) In the A* algorithm, the evaluation function (the distancepluscost heuristic) f(t) for each node t in the search space is defined as the sum of two functions: the pastcost function p(t), which is the cost (distance) it has taken to get from the starting node to the current node t, and a heuristic estimate h'(t) of the distance to the goal.
A* algorithm and heuristics
The A* algorithm uses a bestfirst search and finds a leastcost path from a given start node to a goal node. It uses a distancepluscost heuristic function to determine the order in which the search algorithm visits nodes to be explored in the search space. In the A* algorithm, the evaluation function (the distancepluscost heuristic) f(t) for each node t in the search space is defined as the sum of two functions, the pastcost function p(t), which is the cost (distance) it has taken to get from the start node to the current node t, and a heuristic estimate h'(t) of the distance to the goal (see Figure 3):
In addition, the condition that ensures the A* algorithm finds a shortest path is expressed by the following formula:
where h(t) is the true distance to the goal. That is, a heuristic function h'(t) that always underestimates the distance to the goal is required. Such a heuristic function h'(t) is referred to as an "admissible heuristics". If a given heuristic is admissible, the A* algorithm will reliably find a shortest path. The A* algorithm was implemented using the data structure "sorted priority queue" for maintaining the nodes to be traversed with weights of the evaluation function value f(t), while the breadthfirst search uses the simple "queue" for the nodes to be traversed with no weight.
Breadthfirst search (exhaustive search)
By setting the heuristic function h'(t) to zero for any node t, the A* algorithm becomes equivalent to the breadthfirst (BF) search as exhaustive search.
Manhattan distance
Since each node t is an ndimensional vector representing a compound, the most simple heuristic function h'(t) is to use the Manhattan (MH) distance, denoted by MH(t, G), between the current node vector t and the goal node vector G:
However, naive use of the MH distance is inadmissible and does not guarantee the shortest path solution. Therefore, we use the following modified MH heuristic function h'(t):
where O_{max} represents the maximum norm among all of the operator vectors. The MH distance divided by this norm becomes an admissible heuristic, because this modified MH distance indicates the number of times the goal node G is reached by only applying the largest norm operator, and hence does not exceed the true distance to the goal node.
Linear programming (LP) heuristics
A path from the current node t to the goal node G is represented by a sequence of reaction rules R_{1}, ..., R_{m}, and hence the difference ΔT = G  t between the current node vector t = (t_{1}, ..., t_{n}) and the goal node vector G = (g_{1}, ..., g_{n}) can be represented by a linear sum of operator vectors . Let w_{k }denote the number of times that the operator vector O_{k }is applied. Then the difference ΔT between the current node t and the goal node G may be expressed as follows:
and the sum of the coefficients w_{k }is exactly equal to the distance (the number of applications of operators) between the current node and the goal node. Now, consider the following optimization problem for the current node vector t = (t_{1}, ..., t_{n}) and the goal node vector G = (g_{1}, ..., g_{n}):
This optimization problem is an Integer Programming (IP) problem. The solution to this problem is similar to that for the shortest reaction path problem between the start node and the goal node, except that it does not take into account the order of application of the reaction rules and it ignores the constraint conditions when applying reaction rules. Nevertheless, the solution to "minimize ∑_{k }w_{k}" provides the tightest underbound for estimating the distance from the current node t to the goal node G, and it is obviously admissible. However, a critical defect is that the IP problem is computationally hard.
Our approach is to relax the constraints on the optimization problem "minimize ∑_{k }w_{k}" and to treat w_{k }as a real number rather than an integer, that is, "continuous relaxation". The optimization problem now becomes an LP problem that can be solved in polynomial time. Note that allowing w_{k }to be a real number means that we may apply an operator a real number of times, for example "apply the operator 1.5 times". In other words, a realvalued solution for the optimization problem "minimize ∑_{k }w_{k}" can be considered as the shortest distance to the goal node in a real vector space. In addition, it is well known and obvious that the real solutions for the optimization problem "minimize ∑_{k }w_{k}" with linear equation constraints are always smaller than the integer solutions. Therefore, the LP solution is admissible for guaranteeing the shortest path. We use this value as the LP heuristic function, which is another advantage of our method using the vector representation.
For solving the LP heuristic, we used IBM ILOG CPLEX in [33]. CPLEX is one of the fastest optimization problem solvers, and can be used for linear programming, quadratic programming, constraint programming, mixed integer programming, and is applicable to largescale problems.
Results
Datasets and target pathways
KEGG Reaction dataset
Table 1 shows the number of operator vectors represented from the 14570 KEGG enzyme reaction dataset. There are three reasons that the total number of operators represented is less than 14570, the number of enzyme reactions registered in KEGG:
Table 1. Reaction rules for the whole KEGG pathway database
1. Some reactions are registered as different in KEGG, but the changes in structure are the same and only the substrates are different.
2. Some reactions are actually different but are represented by the same vector.
3. The structure registered as "main" is unchanged by the reaction.
The weakness of the second reason can be reduced by increasing the representationdepth for the vectors, which increases the number of reactions distinguished due to the improved expressive power.
DDT degradation pathway
In this study, we used the wellknown DDT degradation pathway data set [34] as pathway data to verify the validity of our method. DDT is a chemical substance that can be synthesized for minimal cost, and began to be used as an insecticide during the 1940s because of its insecticidal action against many insects. However, the human carcinogenicity of DDT and its longterm persistency in the environment has since been pointed out [35]. It is important to evaluate the negative impact on the environment, and human health studies analyzing the metabolism of DDT has continued in recent years [36].
Taking into account the number of involved pathways and compounds, as well as the fact that the pathway is a closed circuit, we consider the DDT degradation pathway ideal for verifying our approach. The pathway consists of 20 compounds and 46 enzyme reactions (Figure 4).
Figure 4. DDT degradation pathway. DDT stands for dichlorodiphenyltrichloroethane. DDT is a chemical that began to be used as an insecticide after showing insecticidal action against many insects in very small quantities. It is important to evaluate the negative impact on the environment, and human health studies on the metabolism of DDT have been done in recent years. This pathway consists of 20 compounds and 46 enzyme reactions.
Table 2 shows the number of reactions represented as operator vectors associated with the 46 enzyme reactions from the DDT pathway.
Table 2. Reaction rules only for the DDT degradation pathway
In our experiments, 20 × 19 = 380 pathway routes were selected for the search problem. The first validation experiment only used the 46 enzyme reaction rules contained in the DDT degradation pathway. In the second "more general" experiment, all KEGG reaction rules were used to search the DDT pathway.
Reconstruction of DDT pathway by shortest path finding
We first verified that the shortest path between the start node and the goal node implied the true metabolic pathway, identifying the shortest path using a BF search. Table 3 shows the percentage for which the true distance and the shortest distance are equal, and the rates at which the true pathway and the shortestdistance path match, for each representationdepth. The agreement rate with the true pathway increased as the representationdepth became larger, and the depth 03 gave an agreement rate of 100%.
Table 3. Agreement rate with the true pathway
Computational times for heuristics
Table 4 shows the computational time for searching for the shortest paths in the DDT pathway for each heuristic and each representationdepth.
Table 4. Average computational time (seconds/pair) for finding 380 pathway routes
Comparing the efficiency of the heuristic functions in this table showed in particular that a significant reduction in computational time was achieved by the LP heuristic. On the other hand, in the depth 03, reduction in computation time was not seen for most heuristics. This implies that, as the representation depth increases, the substrate inclusion condition works more effectively, and the number of branches in the search space becomes smaller.
Table 5 indicates the number of times that the search algorithm branched for each heuristic and each representationdepth. From Table 5, we observed a decrease in the search space with the improved heuristics at depth 01 and depth 02. A similar trend was observed at depth 03, but because the effect of the bound by the substrate inclusion condition was large for the more expressive representations at depth 03, the reduction in actual computation time was limited in contrast.
Table 5. Average number of branchings in the search (#branch/pair)
Prediction of DDT pathway using all KEGG reaction rules
A more general reconstruction problem for DDT pathway was carried out using all KEGG reaction rules, to verify whether the method is practical for comprehensively reproducing the DDT degradation pathways. Table 6 shows the computational time for each heuristic and each representationdepth when all KEGG reaction rules were used to search for the shortest DDT pathway. Since increasing the number of operators increases the search space exponentially, none of the heuristics was able to accomplish the task within a realistic computational time (time out, or memory out) at depth 01 and depth 02. At depth 03, the BF search and the MH heuristics were unable to explore the solution space in a realistic computational time, and only the LP heuristic was able to calculate the shortest path solutions within a practical computational time.
Table 6. Average computational time (seconds/pair) using all KEGG reaction rules
The agreement rate between the true distance and the true pathway route using the LP heuristic were 100% (380/380). Thus, despite using the generic operators (all KEGG reaction rules), the results showed that the method had high reproducibility.
Prediction of Lutein biosynthesis pathway using all KEGG reaction rules
Another pathway prediction using all KEGG reaction rules was executed for Lutein biosynthesis pathway. Lutein biosynthesis pathway is a secondary metabolic pathway from the start compound "Lycopene" to the goal compound "Lutein". Lycopene is a red carotenoid and Lutein is a plant carotenoid, and there are two routes from Lycopene to Lutein in KEGG pathway database: the one is via Zeinoxanthin and the other is via αCryptoxanthin. The Lutein biosynthesis pathway has other difficulty compared with DDT pathway prediction: the structures of chemical compounds in the pathway are significantly larger than the ones in DDT pathway, and the KEGG pathway predictive tool PathPred [23,37] could not predict this pathway.
Our method with the LP heuristics succeeded to precisely predict all pathways between every pair of compounds on the Lutein biosynthesis pathway. The average computational time for the LP heuristic to predict the shortest paths for all pairs was 10.9 seconds. On the other hand, PathPred failed to predict the pathway between Lycopene and Lutein, where the default parameters of PathPred were used: "Simcomp Threshold" was set at 0.4, "Prediction cycle" was set at 1, and Reference pathway was set at "Biosynthesis of Secondary Metabolites (Plants)".
Finding novel biochemical pathways for secondary metabolites of plant origin
To demonstrate the effectiveness of our method for finding novel pathways, we applied our method to predict a biochemical pathway for the start node "Delphinidin" and the goal node "Gentiodelphin". Both compounds are present in the KEGG database. Gentiodelphin is a plantderived secondary metabolite associated with blue dye, and is known to be synthesized from Delphinidin [23]. The KEGG pathway predictive tool PathPred was also used for performance comparison.
Our method with the LP heuristics predicted the two shortest path solutions shown in Figure 5. Each arrow indicates the enzyme reaction as routing information, accompanied by the KEGG reaction number. Both predicted pathways consist of four enzyme reactions. The first path (blue) is a metabolic pathway present in the KEGG pathway database. The second path (orange) is new and not registered in the KEGG database, and there is the possibility of a new route where the enzyme reaction "R6798" is applied at the end. The computational times for the LP heuristic to predict the shortest paths were 35.0 seconds for the first solution, and 58.4 seconds for the second solution. PathPred only predicted the first pathway with a computational time of 1462 seconds, where the default parameters of PathPred were used.
Figure 5. Novel pathway finding for plant biochemical pathways. "Gentiodelphin" is a plantderived secondary metabolite associated with blue dye, and is known to be synthesized from "Delphinidin". The biochemical pathway was predicted with a start node of Delphinidin and a goal node of Gentiodelphin. The LP heuristic predicted the two shortest path solutions shown in this figure. The arrow indicates the reaction rule for routing information, accompanied by the KEGG reaction number. Both predicted pathways consist of four enzyme reactions. The first path (blue) is a metabolic pathway present in the KEGG pathway database. On the other hand, the second path (orange) is new and not registered in the KEGG database, and there is a possibility of a new route where the operator "R6798" is applied at the end.
Overall, our A*based algorithm with the LP heuristic is more comprehensive and computationally efficient prediction method for biochemical pathway finding.
Discussion
We have achieved highspeed pathway predictions using a vectorbased search that simply focuses on the 2D structures of compounds. The A* algorithm guarantees the discovery of the shortest path, and the efficient search is achieved by the Linear Programming heuristic that estimates the distance to the goal. Results of verification experiments show the high reproducibility of KEGG pathways, the validity of the novel predicted pathway, and the versatility of our method.
Search space for pathway predictions
An exponential increase in the search space accompanies an increase in the true distance. This is represented by the equation:
where P is the size of the search space if all solutions are explored, N is the number of reaction rules, and d is the distance from the start node to the goal node. In the heuristic search, the search space can be reduced by visiting the nodes on the true path on a priority basis. Table 5 shows that the LP heuristic can significantly reduce the search space compared to searching all possible solutions.
In addition, taking into account the effect of the substrate inclusion condition that bounds the branching, the search space is improved as follows:
where B is the ratio that bounds the branching, that is, the ratio at which operator applications are eliminated. When B increases, the base of the exponential function becomes smaller and hence the exponential increase can be reduced. That is, B plays a role in minimizing the exponential expansion of the search space. The significant reduction in computational time achieved by increasing the representationdepth for the vector representation is considered to be due to this reason. In other words, designing a highspeed searching method requires both an accurate heuristic function that estimates the distance to the goal and an effective bound on the branching to reduce the search space.
Reproducibility of KEGG Pathway
Our experimental results for comprehensive predictions using all 8108 KEGG reaction rules show that our proposed method is able to reproduce enzyme reaction pathways in the KEGG pathway database with high accuracy. This is presumably due to the LP heuristic and bound on branching due to the substrate inclusion constraint on the vector representation.
De novo prediction of known and unknown biosynthetic pathways
Our proposed method in this paper is a de novo prediction method in the sense that it does not require any initial network such as KEGG metabolic network as input and it is not a method just to traverse the pathway network. Our method takes as input the set of enzyme reaction rules collected from the KEGG pathway database. However, this does not necessarily imply that the pathway prediction using the list of all reaction rules is equal to the path search on KEGG pathway network. For each compound occurring at a node in KEGG pathway network, the KEGG network only contains the enzyme reactions whose substrate is exactly equal to the compound as an edge connected to the node. On the other hand, our method applies all reaction rules to a given compound if the compound is not only equal to the substrate of the reaction rule but also contains the substrate as a substructure (the substrate inclusion condition). Therefore, the search space of our method is exponentially larger than the KEGG pathway network. Further, our method is able to predict unknown biosynthetic pathways between two arbitrary known or unknown compounds.
Conclusions
We have proposed a computationally efficient method to predict biochemical reaction pathways that derives a goal compound from a start compound. A chemical compound is represented by a feature vector that counts the frequencies of substructure occurrences in the structural formula. A set of enzyme reaction rules collected from the KEGG pathway database was represented using operator vectors, by determining the structural change in the compounds before and after the reaction. Two constraint conditions when applying reaction rules were substrate inclusion and compound formation. By defining each compound vector as a node and each operator as an edge, prediction of reaction pathways was reduced to the shortest path search problem in a vector space. We proposed an efficient search method that uses the A* algorithm for the shortest path search problem. We used an LP solution for heuristic estimation of the distance to the goal. The results showed that our method had high reproducibility for KEGG pathways and a high possibility of predicting new reaction pathways. We understand that we need largerscale experiments to test the general performance and stability of our method on a number of various known pathways. This is one of our important future works. Also in the future work, the resulting shortest distance can be thought of as a kind of similarity measure between compounds that represents metabolic information, and hence applications to determining similarity of compounds for drug discovery such as [3840] can be also expected.
List of abbreviations
LP: Linear Programming; MH: Manhattan; BF: Breadthfirst; IP: Integer Programming; DDT: dichlorodiphenyltrichloroethane.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
M.N. and Y.Sakakibara designed the study and analyzed the data. M.N. developed the system and performed the experiments. T.H., Y.Saito, and K.S. proposed the heuristics and analyzed the data. Y.Sakakibara wrote the manuscript. All authors read and approved the final manuscript.
Author's information
Department of Biosciences and Informatics, Faculty of Science and Technology, Keio University, 3141 Hiyoshi, Kohokuku, Yokohama 2238522, Japan.
Acknowledgements
This work was supported in part by a Grant program for bioinformatics research and development from the Japan Science and Technology Agency. This work was also supported by GrantinAid for KAKENHI (GrantinAid for Scientific Research) on Innovative Areas (No.221S0002) and Scientific Research (A) No.23241066 from the Ministry of Education, Culture, Sports, Science and Technology of Japan.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 17, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S17.
References

Cho A, Yun H, Park J, Lee S, Park S: Prediction of novel synthetic pathways for the production of desired chemicals.
BMC Systems Biology 2010, 4:35. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Nicholson J, Connelly J, Lindon J, Holmes E, et al.: Metabonomics: a platform for studying drug toxicity and gene function.
Nature Reviews Drug Discovery 2002, 1(2):153162. PubMed Abstract  Publisher Full Text

Medema M, van Raaphorst R, Takano E, Breitling R: Computational tools for the synthetic design of biochemical pathways.
Nature Reviews Microbiology 2012, 10(3):191202. PubMed Abstract  Publisher Full Text

Tohsato Y, Nishimura Y: Metabolic pathway alignment based on similarity between chemical structures.

Kotera M, McDonald A, Boyce S, Tipton K: Eliciting possible reaction equations and metabolic pathways involving orphan metabolites.
Journal of Chemical Information and Modeling 2008, 48(12):23352349. PubMed Abstract  Publisher Full Text

Leber M, Egelhofer V, Schomburg I, Schomburg D: Automatic assignment of reaction operators to enzymatic reactions.
Bioinformatics 2009, 25(23):31353142. PubMed Abstract  Publisher Full Text

Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The Chemistry development kit (CDK): An opensource Java library for chemoand bioinformatics.
Journal of chemical information and computer sciences 2003, 43(2):493500. PubMed Abstract  Publisher Full Text

Rahman S, Bashton M, Holliday G, Schrader R, Thornton J: Small molecule subgraph detector (SMSD) toolkit.
Journal of cheminformatics 2009, 1:113. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

McGregor J, Willett P: Use of a maximum common subgraph algorithm in the automatic identification of ostensible bond changes occurring in chemical reactions.
Journal of Chemical Information and Computer Sciences 1981, 21(3):137140. Publisher Full Text

Stahl M, Mauser H: Database clustering with a combination of fingerprint and maximum common substructure methods.
Journal of chemical information and modeling 2005, 45(3):542548. PubMed Abstract  Publisher Full Text

Takahashi Y, Sukekawa M, Sasaki S: Automatic identification of molecular similarity using reducedgraph representation of chemical structure.
Journal of chemical information and computer sciences 1992, 32(6):639643. Publisher Full Text

Sussenguth E Jr: A graphtheoretic algorithm for matching chemical structures.
Journal of Chemical Documentation 1965, 5:3643. Publisher Full Text

Raymond J, Willett P: Effectiveness of graphbased and fingerprintbased similarity measures for virtual screening of 2D chemical structure databases.
Journal of computeraided molecular design 2002, 16:5971. PubMed Abstract  Publisher Full Text

Raymond J, Willett P: Maximum common subgraph isomorphism algorithms for the matching of chemical structures.
Journal of computeraided molecular design 2002, 16(7):521533. PubMed Abstract  Publisher Full Text

Raymond J, Gardiner E, Willett P: Heuristics for similarity searching of chemical graphs using a maximum common edge subgraph algorithm.
Journal of chemical information and computer sciences 2002, 42(2):305316. PubMed Abstract  Publisher Full Text

Cao Y, Jiang T, Girke T: A maximum common substructurebased algorithm for searching and predicting druglike compounds.
Bioinformatics 2008, 24(13):i366. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hatzimanikatis V, Li C, Ionita J, Henry C, Jankowski M, Broadbelt L: Exploring the diversity of complex metabolic networks.
Bioinformatics 2005, 21(8):16031609. PubMed Abstract  Publisher Full Text

Li C, Henry C, Jankowski M, Ionita J, Hatzimanikatis V, Broadbelt L: Computational discovery of biochemical routes to specialty chemicals.
Chemical engineering science 2004, 59(2223):50515060. Publisher Full Text

Hou B, Ellis L, Wackett L: Encoding microbial metabolic logic: predicting biodegradation.
Journal of industrial microbiology & biotechnology 2004, 31(6):261272. PubMed Abstract  Publisher Full Text

Langowski J, Long A: Computer systems for the prediction of xenobiotic metabolism.
Advanced drug delivery reviews 2002, 54(3):407415. PubMed Abstract  Publisher Full Text

Oh M, Yamada T, Hattori M, Goto S, Kanehisa M: Systematic analysis of enzymecatalyzed reaction patterns and prediction of microbial biodegradation pathways.
Journal of chemical information and modeling 2007, 47(4):17021712. PubMed Abstract  Publisher Full Text

Talafous J, Sayre L, Mieyal J, Klopman G: META. 2. A dictionary model of mammalian xenobiotic metabolism.
Journal of chemical information and computer sciences 1994, 34(6):13261333. PubMed Abstract  Publisher Full Text

Moriya Y, Shigemizu D, Hattori M, Tokimatsu T, Kotera M, Goto S, Kanehisa M: PathPred: an enzymecatalyzed metabolic pathway prediction server.
Nucleic acids research 2010, (38 Web Server):W138W143. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gao J, Ellis L, Wackett L: The university of Minnesota pathway prediction system: multilevel prediction and visualization.
Nucleic acids research 2011, (39 Web Server):W406W411. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

GonzalezLergier J, Broadbelt L, Hatzimanikatis V: Theoretical considerations and computational analysis of the complexity in polyketide synthesis pathways.
Journal of the American Chemical Society 2005, 127(27):99309938. PubMed Abstract  Publisher Full Text

Yamanishi Y, Vert J, Kanehisa M: Supervised enzyme network inference from the integration of genomic data and chemical information.
Bioinformatics 2005, 21(suppl 1):i468i477. PubMed Abstract  Publisher Full Text

Feist A, Henry C, Reed J, Krummenacker M, Joyce A, Karp P, Broadbelt L, Hatzimanikatis V, Palsson B: A genomescale metabolic reconstruction for Escherichia coli K12 MG1655 that accounts for 1260 ORFs and thermodynamic information.
Molecular Systems Biology 2007, 3:121. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M: KEGG for integration and interpretation of largescale molecular data sets.
Nucleic Acids Research 2012, 40:D109D114. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

KEGG PATHWAY Database [http://www.kegg.jp/kegg/pathway.html] webcite

Swamidass SJ, Chen J, Bruand J, Phung P, Ralaivola L, Baldi P: Kernels for small molecules and the prediction of mutagenicity, toxicity and anticancer activity.
Bioinformatics 2005, 21(Supple 1):359368. PubMed Abstract  Publisher Full Text

Nagamine N, Sakakibara Y: Statistical prediction of protein chemical interactions based on chemical structure and mass spectrometry data.
Bioinformatics 2007, 23(15):20042012. PubMed Abstract  Publisher Full Text

Sakakibara Y, Hachiya T, Uchida M, Nagamine N, Sugawara Y, Yokota M, Nakamura M, Popendorf K, Komori T, Sato K: COPICAT: A software system for predicting interactions between proteins and chemical compounds.
Bioinformatics 2012.
doi:10.1093/bioinformatics/bts031
PubMed Abstract  Publisher Full Text 
IBM ILOG CPLEX [http:/ / www06.ibm.com/ software/ jp/ websphere/ ilog/ optimization/ coreproductstechnologies/ cplex/ ] webcite

DDT degradation  Reference pathway [http://www.kegg.jp/keggbin/show_pathway?map00351] webcite

Higginson J: DDT: Epidemiological evidence.
IARC scientific publications 1985, (65):107117. PubMed Abstract

Manaca M, Grimalt J, Gari M, Sacarlal J, Sunyer J, Gonzalez R, Dobaño C, Menendez C, Alonso P: Assessment of exposure to DDT and metabolites after indoor residual spraying through the analysis of thatch material from rural African dwellings.
Environmental Science and Pollution Research 2011, 19(3):756762. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

PathPred: Pathway Prediction server [http://www.genome.jp/tools/pathpred/] webcite

Hattori M, Okuno Y, Goto S, Kanehisa M: Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways.
Journal of the American Chemical Society 2003, 125(39):1185311865. PubMed Abstract  Publisher Full Text

Tsuda K, Kin T, Asai K: Marginalized kernels for biological sequences.
Bioinformatics 2002, 18(suppl 1):S268. PubMed Abstract  Publisher Full Text

Nagamine N, Shirakawa T, Minato Y, Torii K, Kobayashi H, Imoto M, Sakakibara Y: Integrating statistical predictions and experimental verifications for enhancing proteinchemical interaction predictions in virtual screening.
PLoS Computational Biology 2009, 5(6):e1000397. PubMed Abstract  Publisher Full Text  PubMed Central Full Text