The two most common models for the evolution of metabolism are the patchwork evolution model, where enzymes are thought to diverge from broad to narrow substrate specificity, and the retrograde evolution model, according to which enzymes evolve in response to substrate depletion. Analysis of the distribution of homologous enzyme pairs in the metabolic network can shed light on the respective importance of the two models. We here investigate the evolution of the metabolism in E. coli viewed as a single network using EcoCyc.
Sequence comparison between all enzyme pairs was performed and the minimal path length (MPL) between all enzyme pairs was determined. We find a strong over-representation of homologous enzymes at MPL 1. We show that the functionally similar and functionally undetermined enzyme pairs are responsible for most of the over-representation of homologous enzyme pairs at MPL 1.
The retrograde evolution model predicts that homologous enzymes pairs are at short metabolic distances from each other. In general agreement with previous studies we find that homologous enzymes occur close to each other in the network more often than expected by chance, which lends some support to the retrograde evolution model. However, we show that the homologous enzyme pairs which may have evolved through retrograde evolution, namely the pairs that are functionally dissimilar, show a weaker over-representation at MPL 1 than the functionally similar enzyme pairs. Our study indicates that, while the retrograde evolution model may have played a small part, the patchwork evolution model is the predominant process of metabolic enzyme evolution.
In 1945 one of the first theories regarding the evolution of metabolic pathways, often referred to as the retrograde evolution model, was proposed by Horowitz . It states that during evolution pathways assembled backward compared to the direction of the pathway in response to depletion of substrates from the environment. As an example consider the following scenario: Enzyme E1 catalyzes reaction A → B, where B is essential to the organism. A is depleted from the environment, which means that an organism harboring an enzyme E2 that can catalyze a reaction producing A from some other substrate in the environment would be at an advantage. Since E1 can already bind A there is a greater chance that El rather than an enzyme without an affinity for A will be duplicated and mutated into E2.
In 1976 Jensen  proposed the recruitment evolution theory, more often referred to as the patchwork evolution model . The patchwork evolution model states that enzymes initially have broad substrate specificities and that specialization takes place by way of gene duplication. As an example consider the following: An enzyme E1 catalyzes a reaction where either one of the substrates S1 and S2 is accepted. Through gene duplication and random mutation two different versions of E1 evolve; E1', which only accepts S1 as a substrate, and E1" which only accepts S2 as a substrate.
The retrograde evolution model was at first supported by the discovery of operons, where the functionally related genes in operons were thought to have evolved through tandem duplications . The theory that operons are the remnants of tandem duplication has recently been criticized by Lawrence and Roth , who instead proposed that horizontal gene transfer may be the underlying mechanism for the occurrence of gene clusters. There are a few known homologous genes coding for enzymes which catalyze consecutive reactions and which therefore represent possible cases of retrograde evolution: trpC, trpA and trpB (in E. coli trpA and trpB are fused) that catalyze consecutive steps of the tryptophan biosynthesis , hisF and hisA in the histidine biosynthetic pathway  and metB and metC in the methionine biosynthesis .
A few recent studies give some support for the retrograde evolution model. Saqi & Sternberg  showed that a super-family has a general tendency to appear in one or two particular pathway(s). Rison et al  showed that homologous enzymes are found at close distances within the (extended) pathways of E. coli and Alves et al  showed that homologous enzymes are also found close to each other in the whole metabolic network using a modified version of the KEGG database[12,13].
The patchwork evolution model holds that there should be many pairs of homologous enzymes that catalyze basically the same kind of reaction, where one or more substrates are non-identical but similar. Support for this theory is more abundant than for the retrograde evolution model [14-17]. The TIM-barrel containing enzymes have been found in many different pathways  and the homologous pairs of small molecule metabolism enzymes of E. coli have been shown to be evenly distributed within and across pathways [15,16].
Metabolic networks are often partitioned into pathways, which are considered to be functionally separate units of the network. The partitioning of the metabolic network into pathways is not always straightforward . As a result there may be correlations that are not visible in a pathway oriented perspective which will emerge in a whole-network oriented view. There is also an element of arbitrariness involved in which compounds are considered promiscuous (compounds involved in many reactions, e. g. H2O, ATP and cofactors) and which are not. We have chosen to apply a simple network-based criterion. We count the number of reactions a compound participates in within the complete metabolic network of E. coli. The most common compounds are then considered promiscuous and are excluded from part of the analysis.
In the study presented here we investigate whether homologous E. coli enzymes can be found close to each other in the complete, unpartitioned metabolic network of E. coli as derived from the EcoCyc database . We subsequently investigate the homologous enzyme pairs found close to each other in the network and classify these enzyme pairs as cases of retrograde evolution and patchwork evolution respectively. Finally we investigate whether the correlation between metabolic network distance and homology differs for the enzyme pairs classified as retrograde evolution cases compared to the patchwork evolution cases.
Results and Discussion
Metabolic pathway information is available in several different databases such as EcoCyc , WIT , BRENDA  and KEGG . EcoCyc is an E. coli-specific database and contains the metabolic complement of E. coli. We chose EcoCyc over the other available databases for three reasons: 1) EcoCyc contains information about the directionality of the enzymatic reactions, 2) Some enzymes have not yet been classified according to the Enzyme Commission (EC) system  (EcoCyc contains both EC-classified enzymes and enzymes that fall outside of the EC classification; There are 1172 enzyme entries in the database and 781 of these have EC numbers.) and 3) EcoCyc is freely available for universities and non-profit research institutes.
The vertices of the graph we use represent the enzymes catalyzing the reactions. One edge represents one or more compounds. There will be an edge leading from an enzyme E1 to an enzyme E2 if E1 catalyzes a reaction where compound A is produced and E2 takes A as a substrate. Reversible reactions, such as A B, are separated into two reactions, A → B and B → A. There can be at most one edge in each direction between a pair of enzymes. Note that the representation used herein is different from the common representation of metabolic pathways where the substrates and products are the vertices and the enzymes catalyzing the reactions are the edges (Figure 1). The type of network representation used in our study has been used before for metabolic network analysis where it has been referred to as 'protein-centric' graphs  and 'reaction graphs'  (Figure 2).
Figure 1. Common representation of the biotin metabolism. In the most common representation of biochemical reactions the reactants represent the vertices of the network and the enzymes represent the edges. The drawing of the biotin metabolism was redrawn from EcoCyc .
Figure 2. Protein-centric representation of the biotin metabolism. The protein-centric or reaction graph representation which was used in our study has enzymes as vertices and reactants as edges. EC 184.108.40.206 is a neighbor of EC 220.127.116.11 because there is an edge leading from EC 18.104.22.168 to EC 22.214.171.124. In the same manner, EC 126.96.36.199 is a neighbor of 188.8.131.52. In the biotin metabolism example there are 8 enzyme neighbor pairs. Note that in our definition an edge can consist of one or more compounds and there can only be one edge in each direction between two enzymes. The edges in blue font, representing CO2, between EC 184.108.40.206 and EC 220.127.116.11, are eliminated when the 20 most promiscuous compounds are removed from the network.
Some reactions are physiologically irreversible and should be represented accordingly. To encompass the directionality information we use a directed graph to represent the metabolic network of E. coli. As a result there are not necessarily paths from every enzyme vertex in the network to every other enzyme vertex.
One of the most problematic aspects with metabolic network analysis is how to handle the promiscuous compounds. One may argue that because promiscuous compounds are usually not the limiting factors of reactions, the network would become more biochemically meaningful if these compounds are removed . However, to our knowledge there is no generally accepted criterion to determine whether a compound participating in a reaction is a current compound, cofactor or main metabolite. In this study, we have chosen to formulate and apply a simple network-based criterion. We count the number of times a compound occurs as part of an edge in the network. The most common compounds are then considered as promiscuous compounds . As in the study performed by Wagner and Fell  we here conduct our study with one network that includes all compounds, including the promiscuous compounds, and another network where the most promiscuous compounds have been removed.
Determination of network parameters
The directed graph derived from EcoCyc contains 1,172 vertices and 224,972 edges (Table 1). The network consists of two components; one contains 2 vertices and a single edge while the other component consists of the remaining vertices. Except for the two enzymes (two chloride ion transporters) the whole metabolic network of E. coli is therefore connected as one would expect from a unicellular, autonomous and non-parasitic organism.
Table 1. Network parameters. The table contains some important characteristics of the network (see text for more detailed description). The second column contains the network parameters for the whole network while the third column contains the network parameters for the network where the 20 most promiscuous compounds have been removed. The average path length (D) is determined by taking the average over the MPLs between all vertex pairs. The connectivity of a vertex is defined as the number of vertices which it is connected to by an outgoing edge.
We examined which compounds are the most promiscuous in the metabolic network graph by determining the number of times the compounds occur as part of edges connecting two vertices in the network. The most promiscuous compound is H2O, which appears as part of an edge between 79,485 enzyme pairs (Table 2). The compound frequency plot shows some scale-free network characteristics in that there are a few compounds that occur very often and most compounds occur as parts of edges 1, 2 or 3 times (Figure 3). We also determine which enzymes are the most highly connected. The most highly connected enzyme has 735 enzyme neighbors (Carbamoyl phosphate synthase) (Table 3).
Table 2. The most promiscuous compounds in the network. The frequency of a compound is the number of times the compound occurs as part of an edge in the network.
Figure 3. Compound frequency distribution histogram for the metabolic network of E. coli. The 20 most promiscuous compounds have been removed in the network analyzed here. The histogram shows how many compounds (y-axis, logged scale) that occur with a certain frequency (x-axis) in the network. Compound frequency is defined as the number of times a compound occurs as part of an edge.
Table 3. The enzymes with the highest connectivities (k) of the network for the whole and the reduced networks. The connectivity of an enzyme is here defined as the number of enzymes which it is connected to by an outgoing edge.
It has been shown in previous studies that the metabolic network of E. coli is a small world network [24-26]. The definition of a small world graph is that its average path length (D) is on the same order as the average path length for a random graph, with the same number of vertices (n) and mean connectivity (), but its clustering coefficient (Cn) exceeds that of the random graph by far . Cn is a measure between 0 and 1 of the cliquishness of the graph . Each vertex' clustering coefficient (C) is calculated by taking the number of edges between the vertex' neighbors (m) divided by the maximum number of edges between the vertex' neighbors, i.e if u is the number of neighbors and the graph is directed C = m/(u(u - 1)). Cn is the average clustering coefficient for the graph.
Our network graph has Cn = 0.72 which is large compared to the clustering coefficient of the random graph Cr = ( - 1)/n = 0.16 (Table 1). The average path length for the Erdös-Rényi random graph  is Dr ~ ln(n)/ln() = 1.3, which is smaller than but on the same order as the average path length for the metabolic network (Table 1). Given the large Cn and the small D we can conclude that the metabolic network of E. coli constructed from EcoCyc shows some characteristics of a small world network.
Correlation between MPL and homology
We used PSI-BLAST  with an E-value cut-off of 10-6 and 3 iterations for an all-against-all sequence comparison of the 1105 protein sequences coding for the metabolic enzymes in E. coli. We chose the relatively strict E-value cut-off in order to minimize the number of false positives. We also ran PSI-BLAST against the SCOP database  to increase the sensitivity and collect further pairs of homologous enzymes. The proteins in the SCOP database are ordered into a hierarchy consisting of class, fold, super-family and family, with an increasing level of structural similarity between the proteins. In our study, enzymes belonging to the same super-family are considered homologous. Using these methods 8,218 homologous enzyme pairs were found.
There are 72 cases where two homologous genes code for the same enzyme. The reason may for instance be that the genes code for different subunits of the enzyme, that there are multiple copies of one gene or that the genes code for two isozymes. In addition there are 169 pairs of enzymes which are clearly isozymes because they are homologous, identified as separate enzymes in EcoCyc and catalyze the same reaction. It is not clear what relationship these 241 pairs should have in our graph representation of the metabolic network. They may be included at minimal path length (MPL, Figure 4) 1 since they catalyze identical reactions, or they may be included as a special category at MPL 0. For our analysis however, since we are interested in the relationship between clearly different enzymes, we chose not to include these probably fairly recent gene duplications.
Figure 4. Minimal path length (MPL) between enzymes. a) The reactions catalyzed by EC 18.104.22.168 (alcohol dehydrogenase), EC 22.214.171.124 (1-pyrroline-5-carboxylate dekydrogenase) and EC 126.96.36.199 (aldehyde dehydrogenase). b) There are severeral paths leading from EC 188.8.131.52 to EC 184.108.40.206. The shortest path goes from EC 220.127.116.11 through aldehyde to EC 18.104.22.168 and is of length 1 (MPL = 1).
There are 209 E. coli genes which are each associated with at least two enzymatic functions. Most of these multifunctional enzymes are enzymes with broad substrate specificities. 18 of these genes consist of two separate regions which are clearly associated with two (or more) different enzymatic functions (see 1 and for instance [31,32]). We used EcoCyc to identify these genes and Pfam  to localize the domains on the genes. The domains were separated from each other and included in the analysis as partial genes.
Additional File 1. The text file additional_file_1.txt contains a list of some of the genes in E. coli which code for multifunctional enzymes. In this study these genes were separated into two partial genes because the functions could be clearly associated with different domains. Each line in the text file consists of the name for the partial gene, the EcoCyc ID for the enzyme associated with the gene, the EcoCyc gene ID and the gene's Blattner ID.
Format: TXT Size: 2KB Download file
There are 644,656 pairs of enzymes in the whole network (Table 4). Of these, only 7,989 (1.2%) are homologous enzyme pairs. There are 121,295 enzyme pairs at MPL 1. 4,662 (3.8%) of these are homologous enzyme pairs. There appears to be a 3-fold over-representation of homologous enzyme pairs at MPL 1 in the whole network. In order to determine whether the observed over-representation of homologous enzyme pairs at MPL 1 is statistically significant randomized networks were constructed. There are many alternative ways to construct randomized networks. We have chosen an approach that preserves the topological properties of the network since it has been shown that the metabolic network of many organisms have some of the properties of scale-free networks . We constructed many randomized networks by starting from the original real network and shuffling the enzyme identities between randomly chosen pairs of vertices in the graph which preserves the network topology (Figure 5).
Table 4. The numbers of homologous enzyme pairs and enzyme pairs found at MPLs 1–12. The fraction is simply the ratio between the number of homologous enzyme pairs and the enzyme pairs. The reduced network is the network where the 20 most promiscuous compounds have been removed.
Figure 5. Preservation of network topology. The left side of the figure shows the numbers of vertices and edges in the original example graph. The right side of the figure shows the numbers of vertices and edges for the randomized graph where the vertex identities (A, B, C. D, E, F, G, H) have been shuffled. The topological properties of the original graph are preserved in the randomized graph.
The number of homologous enzyme pairs plotted against MPL for the whole metabolic network of E. coli and the standard deviations for the 1,000 randomized networks are shown in Figure 6. In the real network there are 4,662 pairs of homologous genes at MPL 1, which is 58% of all homologous pairs found (Table 4). In contrast typically 1,454 homologous enzyme pairs (18%) are found at MPL 1 in the randomized networks. Compared to the randomized networks there is again a 3-fold over-representation of homologous enzyme pairs at MPL 1. If the homologies were randomly distributed in the network we would expect to see the line representing the real network within the boundaries of the standard deviations for the randomized networks in Figure 6. Instead, we find that there is a significantly greater likelihood of enzymes at distance 1 from each other to be homologous in the metabolic network of E. coli than in the randomized network.
Figure 6. Homology vs minimal path length (MPL) for the whole metabolic network of E. coli. The plot shows the correlation between homology and MPL when all compounds are included. The solid line represents the real metabolic network of E. coli. The dotted vertical lines represent three standard deviations of the number of homologous enzyme pairs for the randomized networks. The number of homologous enzyme pairs has been normalized by the average number of homologous enzyme pairs for the randomized networks.
The network includes promiscuous compounds such as H2O, ATP and NAD. There are therefore homologous enzyme pairs containing for instance NAD-binding domains that are at MPL 1 because their gene products catalyze reactions involving that cofactor. It could be argued that such coenzyme binding domains give rise to skewed results in our analysis. To remedy this complication we removed 20 compounds, starting from the most promiscuous compound (H2O) down to the 20th most promiscuous compound (O2). We find that the correlation between MPL and homology is preserved (Figure 7), indicating that the abundance of homologous enzyme pairs at MPL 1 is not the result of common cofactor-binding domains alone. We could also detect a marginally significant correlation between MPL and homology at MPL 2 when the 20 most promiscuous compounds had been excluded from the network (Figure 7). No correlation could be detected at MPLs greater than 2. These results were robust for variations in the number of compounds that were removed from the network, i.e. removing between 17 and 23 of the most promiscuous compounds generated the same result (data not shown).
Figure 7. Homology vs minimal path length (MPL) without the 20 most promiscuous compounds. The plot shows the correlation between homology and MPL when the 20 most promiscuous compounds (Table 2) have been removed. The solid line represents the metabolic network of E. coli.The dotted vertical lines represent three standard deviations of the number of homologous enzyme pairs for the randomized networks. The number of homologous enzyme pairs has been normalized by the average number of homologous enzyme pairs for the randomized networks.
It could be argued that the over-representation of homologous enzyme pairs at MPL 1 is due to the fact that not all cofactors have been removed by removing the 20 most promiscuous compounds. To investigate this possibility an alternative network was constructed where all the cofactors, as defined by EcoCyc, were removed from the network. In this alternative network the number of homologous enzyme pairs at MPL 2 is just within the boundaries of 3 standard deviations for the randomized networks. The correlation at MPL 1 remained unchanged (data not shown). Hence, the correlation between homology and MPL at MPL 1 is not due to cofactor-binding regions alone.
Rison et al  found an over-representation within the extended pathways of E. coli at pathway distances 1, 2 and 3. Alves et al found that there is clustering of homologous enzyme pairs at MPL 1 and 2 in the metabolic networks of several organisms. We detect a clearly significant over-representation only at MPL 1 in the metabolic network of E. coli. Two possible explanations for the differences between our results are that Alves et al performed a multi-organism analysis while we analyze only E. coli and that Rison et al looked at extended pathways rather than at the whole network.
Analysis of the homologous enzyme pairs
We wished to investigate to which extent the retrograde evolution model and the patchwork evolution model respectively are responsible for the over-representation of homologous enzyme pairs seen at MPL 1. In order to be able to analyze the homologous enzyme pairs at MPL 1 further we use a rough criterion for discriminating between cases that fit the retrograde evolution model or the patchwork evolution model:
1) Patchwork model: Homologous enzymes with similar functions probably evolved through patchwork evolution events. Therefore homologous enzymes that evolved through the patchwork evolution model should share the same primary EC number.
2) Retrograde evolution: Homologous enzymes with dissimilar functions are less likely to have evolved through patchwork evolution events. Therefore homologous enzymes that have different primary EC numbers are candidates for retrograde evolution.
We find that 304 enzymes (26%) have not been EC classified. Reactions that do not have EC numbers have been classified by EcoCyc according to an EcoCyc-specific scheme. We consider enzymes that belong to the same EcoCyc reaction type as functionally similar. There are some reactions that remain unclassified in EcoCyc. Pairs including such enzymes are regarded as 'undetermined' in our analysis.
240 (56%) of the homologous enzyme pairs at MPL 1 have similar functions (Table 5). One instance is 1-phosphofructokinase and 6-phosphofructokinase, which catalyze similar reactions with the difference that 1-phosphofructokinase catalyzes a reaction involving fructose-1-phosphate while 6-phosphofructokinase has fructose-6-phosphate as a substrate (Figure 8a). These are clearly analogous reactions whose homology can readily be explained by the patchwork evolution model.
Table 5. The numbers of functionally similar /dissimilar/undetermined homologous enzyme pairs at MPLs 1–11 in the network where the 20 most promiscuous compounds have been removed.
Figure 8. Different types of homologous enzyme pairs at minimal path length (MPL) 1. a) Enzymes that catalyze similar reactions. The enzymes are at MPL 1 because they both catalyse reactions involving the metabolite fructose-1,6-bisphosphate. b) Enzymes that catalyze reactions that are mechanistically different. These enzymes are at MPL 1 because they both catalyse reactions involving chorismate.
90 (21%) of the homologous enzyme pairs at MPL 1 are functionally dissimilar (Table 5). One instance is component I of anthranilate synthase (trpE) which is homologous to the two isozymes of isochorismate synthase (entC and menF) (Figure 8b). If substrate depletion was the primary selective pressure for the E. coli ancestor of these enzymes, chorismate was probably the substrate being depleted because it is the only compound that these two reactions have in common. It is primarily among the 297 homologous enzyme pairs with dissimilar functions at MPL 1 and 2 that the candidates for retrograde theory enzymes can be found. However, it should be noted that some of the candidates for retrograde evolution that have been identified before are not included among our retrograde evolution candidates: We classify the enzymes, coding for the last two consecutive steps in the tryptophan biosynthesis (trpA/trpB and trpC) as functionally conserved because these two enzymes have the same primary EC numbers (22.214.171.124 and 126.96.36.199). In the same manner we classify the enzymes, coding for two consecutive steps in methionine biosynthesis (metB and metC) (188.8.131.52 and 184.108.40.206) as functionally similar as well as the four homologous consecutive enzymes in the peptidoglycan biosynthesis (220.127.116.11, 18.104.22.168, 22.214.171.124, 126.96.36.199).
The functionally similar, dissimilar and undetermined homologous enzyme pairs were plotted against MPL and normalized against 1,000 randomized networks by the same method as described above. From this procedure we could determine that most of the correlation between homology and MPL at MPL 1 can be attributed to enzymes with similar functions (Figure 9) and enzymes with undetermined function (data not shown). However, there is still an over-representation of functionally dissimilar homologous enzyme pairs at MPL 1 (Figure 10) indicating that there is a statistically significant proportion of the homologous enzyme pairs whose homology cannot be attributed to chemical similarity. We also note an over-representation at MPL 10 for the functionally similar homologous pairs (Figure 9). The over-representation consists of 9 pairs of homologous inner membrane MFS transporters which show 15–20% sequence identity.
Figure 9. Homology vs minimal path length (MPL) without the 20 most promiscuous compounds for functionally similar enzyme pairs. The plots show the correlation between homology and MPL for functionally similar (shared primary EC number) enzyme pairs when the 20 most promiscuous compounds have been removed. The solid line represents the metabolic network of E. coli. The dotted vertical lines represent three standard deviations of the number of homologous enzyme pairs for the randomized networks. The number of homologous enzyme pairs has been normalized by the average number of homologous enzyme pairs for the randomized networks.
Figure 10. Homology vs minimal path length (MPL) without the 20 most promiscuous compounds for functionally dissimilar enzyme pairs. The plots show the correlation between homology and MPL for functionally dissimilar (different primary EC number) enzyme pairs when the 20 most promiscuous compounds have been removed. The solid line represents the metabolic network of E. coli. The dotted vertical lines represent three standard deviations of the number of homologous enzyme pairs for the randomized networks. The number of homologous enzyme pairs has been normalized by the average number of homologous enzyme pairs for the randomized networks.
The correlation between homology and network distance at MPL 1 is mostly due to the homologous enzyme pairs with similar functions which have evolved through patchwork evolution and to homologous enzyme pairs with undetermined function. While there is some statistically significant over-representation of functionally dissimilar homologous enzyme pairs at MPL 1 which cannot easily be explained by the patchwork evolution model, it appears that the evolution of an enzyme that catalyzes a new type of reaction is rare and increased enzyme specificity driven evolution is more common which is in general agreement with recent studies [16,15].
We constructed a representation of the whole metabolic network of E. coli from EcoCyc and analyzed the distribution of homologous enzyme pairs over the network. We conclude from our study that homologous pairs of enzymes are more common at minimal path length (MPL) 1 than expected by chance. This correlation persists after the systematic removal of the most promiscuous compounds from the network.
The retrograde evolution model predicts that homologous enzyme pairs will be found at close distances in the metabolic network. Like previous studies our study seems to lend some support to the retrograde evolution model. To investigate the support for the retrograde evolution model further we analyzed the homologous pairs of enzymes in order to distinguish between on the one hand cases of patchwork evolution (broad-to-narrow evolution of enzyme substrate specificity) and on the other hand cases of retrograde evolution (evolution of a different reaction mechanisms). We found that the correlation between homology and network distance at MPL 1 is mostly due to homologous enzyme pairs with similar functions which have evolved through patchwork evolution and to homologous enzyme pairs with undetermined function. However, there is a statistically significant over-representation of functionally dissimilar homologous enzyme pairs at MPL 1 which cannot easily be explained by the patchwork evolution model. In conclusion, our study indicates that while the retrograde evolution model may have played a small part, the patchwork evolution model is the predominant process of metabolic enzyme evolution.
The record of evolutionary history that is present in modern genome sequences does not give much support for the retrograde evolution model  while Jensen's patchwork evolution model  has substantially more support. Horowitz aimed at explaining the emergence of metabolic pathways at the origin of life. It is possible that the subsequent mutations and gene rearrangements have obliterated the traces of ancient retrograde evolution. The patchwork evolution cases that we identify could be the examples of more recent events in evolutionary history. Further phylogenetic analysis of other genomes may shed light on this issue.
EcoCyc is arranged into several different flat-files. We used the enzrxns.dat file to extract the enzyme identifiers and the reaction directions, the proteins.dat file to extract the proteins coding for the enzymes, the genes.dat file to find the genes coding for the proteins, the genes.col file to find the Blattner identification number and the compounds.dat file to extract the compounds that are included in the reactions.
There were some reactions that were EC-number classified but did not have a link to the corresponding enzyme in the enzrxns.dat file. We could find 46 such cases and used KEGG  and BRENDA  to correct this problem (see 2).
Additional File 2. The text file additional_file_2.txt contains enzymatic reactions from EcoCyc's reactions.dat file which were associated with complete EC numbers and did not have links to the enzyme objects in EcoCyc's enzrxns.dat file. These reactions were added to our study using information from KEGG  and BRENDA . The enzymatic reactions are included in this file with the EcoCyc reaction ID, EC number, common name, left side reactants, right side reactants, genes and reaction direction.
Format: TXT Size: 15KB Download file
Of the 1,172 enzyme identifiers available in EcoCyc, 44 were not connected to the rest of the network. Some other enzymes have not yet been located in the genome, which makes sequence comparison impossible, leaving the final number of enzymes for our study at 1,085. The 1,105 protein sequences coding for these enzymes were extracted from the Wisconsin-Madison E. coli genome project's flat-file . There were 519 compounds that were involved in the reactions extracted.
A set of programs were constructed for determination of the minimal path lengths (MPLs) for all enzyme pairs in the network graph. For each E. coli enzyme the neighboring enzymes were determined (Figure 11a). The main program is a breadth-first search implementation which determines all possible MPLs between all pairs of enzymes (Figure 11b). The general idea for the algorithm is the same as described in Jeong et al . It should be noted that because the graph representing the metabolic network of E. coli is directed there may not be paths between all enzyme pairs in one graph component. All the scripts used in this study were written in Python.
Figure 11. Main algorithm. a) Determination of neighbors for each enzyme. b) Determination of minimal path length (MPL). The algorithm is a bread-first search for the shortest distance between all pairs of enzymes. MAX_MPL is the maximum MPL under investigation.
SL performed the analysis as a graduate student under the supervision of PK.
This work was supported by the Foundation for Strategic Research (SSF).
Wilmanns M, Hyde CC, Davies DR, Kirschner K, Jansonius JN: Structural conservation in parallel beta/alpha-barrel enzymes that catalyze three sequential reactions in the pathway of tryptophan biosynthesis.
Biochemistry 1991, 30:9161-9169. PubMed Abstract
J Mol Evol 1995, 41:760-774. PubMed Abstract
Belfaiza J, Parsot C, Martel A, de la Tour CB, Margarita D, Cohen GN, Saint-Girons I: Evolution in biosynthetic pathways: two enzymes catalyzing consecutive steps in methionine biosynthesis originate from a common ancestor and possess a similar regulatory region.
Proc Natl Acad Sci USA 1986, 83:867-871. PubMed Abstract
Overbeek R, Larsen N, Pusch GD, D'Souza M Jr, Selkov E Jr, Kyrpides N, Fonstein M, Maltsev N, Selkov E: WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction.
Schaffer AA, Aravind L, Madden TL, Shavirin S, Sponge JL, Wolf YI, Koonin EV, Altschul SF: Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements.
Veron M, Falcoz-Kelly F, Cohen GN: The threonine-sensitive homoserine dehydrogenase and aspartokinase activities of Escherichia coli K12. The two catalytic activities are carried by two independent regions of the polypeptide chain.
Eur J Biochem 1972, 28:520-527. PubMed Abstract
Blattner FR, Plunkett G 3rd, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, Gregor J, Davis NW, Kirkpatrick HA, Goeden MA, Rose DJ, Mau B, Shao Y: The complete genome sequence of Escherichia coli K-12.