Computer Science Laboratory of Lille, UMR USTL/CNRS 8022, INRIA, F59655, Villeneuve d'Ascq, France

ProBioGEM (UPRES EA 1026), University of Sciences and Technologies of Lille, F59655, Villeneuve d'Ascq, France

Abstract

Background

Nonribosomal peptides (NRPs), bioactive secondary metabolites produced by many microorganisms, show a broad range of important biological activities (e.g. antibiotics, immunosuppressants, antitumor agents). NRPs are mainly composed of amino acids but their primary structure is not always linear and can contain cycles or branchings. Furthermore, there are several hundred different monomers that can be incorporated into NRPs. The N

Results

We developed an efficient method that allows for a quick search for a structural pattern in the N

Conclusion

The method has been incorporated into the N

Background

Nonribosomal Peptides (NRPs) are bioactive compounds having various important biological functions (e.g. as antibiotics, siderophores, antitumor agents, immunosuppressants). NRPs are synthesized by large multi-enzymatic complexes called Nonribosomal Peptide Synthetases (NRPSs) that are modularly organized

Until about fifteen years ago, the number of known NRPs remained relatively low. However, many new molecules have been reported in the literature in recent years, associated with different biological activities and having a broad range of potential applications. This has triggered considerable interest among the research community in the nonribosomal synthesis pathway.

Among potential applications of such studies, redesigning natural products by genetic engineering of NRPSs opens an interesting new way in drug discovery

NRPS enzymes have been well studied for several years. Stachelhaus

NRP molecules show several important particularities. The first one is related to the incorporation of non-proteinogenic amino acids. Indeed, in addition to the twenty standard amino acids found in proteins, several hundred other residues can be encountered in final NRPS products. Incorporated residues can further undergo chemical modifications such as epimerisation or methylation. Products of other biosynthesis pathways, like lipids or carbohydrates, can also be introduced. Because of this diversity in the composition of NRPs, we will use the term 'monomer' rather than 'amino acid' for NRP structural units. Another interesting property of NRPs is their structure. Unlike regular proteins, the primary structure of NRPs is not always linear but can also be cyclic (partially or totally), branched or even poly-cyclic. The computational treatment of these molecules therefore differs substantially from that of standard proteins and requires the development of specific computational methods and resources.

There exist, however, very few computational resources specifically devoted to NRPs and, until recently, none of them provided a complete inventory of these molecules. To fill this gap, we have developed the N

Similar to the search for sequence patterns in genomic and protein databases, N

In some analyses, one needs to identify a part of the pattern, rather than the whole pattern, occurring in a given peptide. For example, the order of monomers in the resulting peptide can be changed with respect to the order of modules in the synthetase (so-called nonlinear biosynthesis

In this paper, we present an efficient method to identify a substructure of a given structural pattern that occurs in a given NRP, where both the pattern and the peptide are represented by undirected labeled graphs. From the computational viewpoint, this can be expressed as a variant of the Maximum Common Subgraph (MCS) problem, which is NP-complete

Our method is based on the commonly used construction of a Compatibility Graph (CG), also called association or product graph, in which the largest clique represents a solution to the MCS problem (see

Results and discussion

Theory and algorithms

Graph representation of NRP structure

We encode the monomeric structure of NRPs by an undirected labeled graph. A

**Examples of peptide and pattern graphs**. This figure shows examples of (a) peptide graphs and (b) pattern graphs. Nodes and edges represent monomers and chemical bonds respectively. Labels are the monomer names.

A structural pattern is also represented by a graph. Let P = (V_{P}, E_{P}, L_{P}) be a pattern graph, where V_{P} is a set of nodes, E_{P} ⊆ V_{P} × V_{P} a set of undirected edges and L_{P}: V_{P} →
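As a minimal sketch of this representation, the Python fragment below stores a peptide or a pattern as an undirected labeled graph. The class name `MonomerGraph`, the helper `labels_compatible` and the toy peptide are illustrative assumptions and are not taken from the database. Treating 'X' as a joker label follows the use of 'X' monomers in the patterns discussed in the Testing section.

```python
from dataclasses import dataclass, field


@dataclass
class MonomerGraph:
    """Undirected labeled graph: nodes are monomers, edges are chemical bonds."""
    labels: dict = field(default_factory=dict)      # node id -> monomer name
    adjacency: dict = field(default_factory=dict)   # node id -> set of neighbour ids

    def add_node(self, node_id, label):
        self.labels[node_id] = label
        self.adjacency.setdefault(node_id, set())

    def add_edge(self, u, v):
        # Undirected: record the bond in both directions.
        self.adjacency[u].add(v)
        self.adjacency[v].add(u)

    def degree(self, node_id):
        return len(self.adjacency[node_id])


def labels_compatible(pattern_label, peptide_label):
    # Assumption: 'X' in a pattern matches any monomer of the peptide.
    return pattern_label == "X" or pattern_label == peptide_label


# Hypothetical peptide: a three-monomer cycle with one branch.
peptide = MonomerGraph()
for node_id, name in enumerate(["Ala", "Orn", "Ser", "Asp"]):
    peptide.add_node(node_id, name)
for u, v in [(0, 1), (1, 2), (2, 0), (2, 3)]:
    peptide.add_edge(u, v)
```

Keeping labels and adjacency in separate dictionaries makes it cheap to query node degrees and neighbours, which are the quantities the later building rules rely on.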

Computing a maximal common substructure using the compatibility graph

The construction of the compatibility graph (CG) is often used in chemoinformatics to establish a structure mapping between two molecule graphs

Compatibility graph

The classical definition of the CG of two graphs

• the set of nodes of the CG is the Cartesian product V_{P} ×

• nodes

For our purposes, we modify the classical CG definition to only require that associated nodes have compatible labels. If

Figure

**A simple example of a compatibility graph**. The figure shows a pattern graph, a peptide graph and the corresponding CG. Each node of the pattern and peptide graphs has a label (for example 'Ala') and a number that uniquely identifies this node. Identifiers of pattern nodes are underlined in order to distinguish them from peptide nodes. A node of the CG corresponds to the association of a node of the pattern graph (underlined number) and a node of the peptide graph with the same label. Nodes of the CG are named by letters. For example, node 'a' corresponds to the association of pattern node

- 0
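The construction described above can be sketched in Python as follows. This is the textbook compatibility (association) graph restricted to label-compatible pairs, not necessarily the exact variant used in the implementation; the function name `classical_cg`, the argument layout and the two small graphs are assumptions made for illustration.

```python
from itertools import combinations


def classical_cg(p_labels, p_edges, g_labels, g_edges):
    """Build a compatibility graph of a pattern (p) and a peptide (g).

    A CG node pairs a pattern node with a peptide node carrying a compatible
    label. Two CG nodes are linked when they use distinct pattern nodes and
    distinct peptide nodes, and the two pattern nodes are bonded exactly when
    the two peptide nodes are bonded.
    """
    p_bonds = {frozenset(e) for e in p_edges}
    g_bonds = {frozenset(e) for e in g_edges}
    nodes = [(u, v) for u in p_labels for v in g_labels
             if p_labels[u] in ("X", g_labels[v])]   # 'X' assumed to be a joker
    edges = set()
    for (u1, v1), (u2, v2) in combinations(nodes, 2):
        if u1 == u2 or v1 == v2:
            continue
        if (frozenset((u1, u2)) in p_bonds) == (frozenset((v1, v2)) in g_bonds):
            edges.add(((u1, v1), (u2, v2)))
    return nodes, edges


# Hypothetical example: pattern Ala-Ser against peptide Ala-Ser-Ala.
nodes, edges = classical_cg(
    {0: "Ala", 1: "Ser"}, [(0, 1)],
    {0: "Ala", 1: "Ser", 2: "Ala"}, [(0, 1), (1, 2)],
)
print(len(nodes), len(edges))  # 3 2
```

In this toy case the pattern maps onto the peptide in two ways (Ala 0 with Ser 1, or Ala 2 with Ser 1), and each mapping appears as one CG edge.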

Clique computation

The CG represents all potential mappings between graphs
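For illustration, a largest clique can be found with the Bron-Kerbosch algorithm with pivoting, a standard exact method. The sketch below is an assumption about how such a search could look; it is not claimed to be the clique procedure used in this work, and the function name `max_clique` and the example graph are hypothetical.

```python
def max_clique(adjacency):
    """Return one largest clique of a graph given as {node: set(neighbours)}."""
    best = []

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            # The current clique cannot be extended: it is maximal.
            if len(clique) > len(best):
                best = list(clique)
            return
        # Pivot on the node covering most candidates to prune branches.
        pivot = max(candidates | excluded,
                    key=lambda n: len(adjacency[n] & candidates))
        for node in list(candidates - adjacency[pivot]):
            expand(clique + [node],
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand([], set(adjacency), set())
    return best


# Hypothetical CG: a triangle a-b-c with a pendant node d attached to c.
example = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
largest = max_clique(example)
```

The search is exponential in the worst case, which is why reducing the number of CG nodes and edges before the clique step matters.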

Refining CG building rules

Our goal is to detect efficiently and exactly whether a part (connected subgraph) of a size

We first modify the definition of compatibility graph, taking into account that if two nodes in

CG nodes

In order to decrease the number of nodes in the CG, we associate a node

CG edges

According to our definition of common substructure, we have to modify the above definition of an edge in the CG. Conditions (1) and (2) are replaced by the following:

In other words, if two nodes in the pattern graph are connected, then the corresponding nodes in the peptide graph must be connected too, but the opposite is not necessarily true. With this definition we achieve that if two nodes
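The difference between the classical edge condition and the relaxed one can be made explicit with a small truth table. The sketch below encodes only what the paragraph above states (a bond in the pattern implies a bond in the peptide); the function names are hypothetical.

```python
def classical_rule(pattern_bonded: bool, peptide_bonded: bool) -> bool:
    # Classical CG: bonded in both graphs, or bonded in neither.
    return pattern_bonded == peptide_bonded


def relaxed_rule(pattern_bonded: bool, peptide_bonded: bool) -> bool:
    # Relaxed CG: a pattern bond requires a peptide bond, but not the converse.
    return (not pattern_bonded) or peptide_bonded


# Cases on which the two rules disagree.
differences = [(p, g) for p in (False, True) for g in (False, True)
               if classical_rule(p, g) != relaxed_rule(p, g)]
print(differences)  # [(False, True)]
```

The only case that changes is a pair of peptide nodes that are bonded while the corresponding pattern nodes are not, which the relaxed rule accepts.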

An elementary path (EP) in a graph is a path without loops. For each node in _{G}, where the _{G }[

Figure

**Matrix of elementary path sizes**. This figure shows the matrices of elementary path sizes for (a) P1 and (b) G1 of Figure 1.
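A matrix of elementary path sizes can be computed by a depth-first enumeration of paths that never revisit a node. The sketch below makes two assumptions that should be checked against the definitions in the text: the size of a path is counted in bonds (edges), and the diagonal is left empty. The function name `elementary_path_sizes` and the example graph are hypothetical.

```python
def elementary_path_sizes(adjacency):
    """sizes[u][v] = set of sizes (in bonds) of elementary paths from u to v."""
    sizes = {u: {v: set() for v in adjacency} for u in adjacency}

    def walk(start, current, visited, length):
        for nxt in adjacency[current]:
            if nxt in visited:
                continue  # an elementary path never visits a node twice
            sizes[start][nxt].add(length + 1)
            walk(start, nxt, visited | {nxt}, length + 1)

    for start in adjacency:
        walk(start, start, {start}, 0)
    return sizes


# Hypothetical graph: a cycle 0-1-2 with a branch 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
sizes = elementary_path_sizes(toy)
print(sizes[0][1], sizes[0][3])  # {1, 2} {2, 3}
```

The enumeration is exponential in general, but it is run once per graph and the graphs considered here are small (the largest peptide mentioned in the Testing section has 49 monomers).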

We then define an edge between

Figure

**Example of compatibility graph constructed with classical and new methods**. The CG of pattern graph

New CG building rules: summary

We conclude this section by summarizing the CG building rules for a pattern graph

• each CG node

• two nodes _{P }[_{G }[
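The rules summarized above can be sketched as follows for the search of a complete pattern. Because the exact conditions are not fully legible here, the sketch rests on explicitly assumed readings: a CG node (u, v) is kept when the labels are compatible and the degree of u does not exceed the degree of v, and two CG nodes are linked when every elementary path size between the pattern nodes also occurs between the peptide nodes. All names and the example graphs are hypothetical.

```python
from itertools import combinations


def path_sizes(adj):
    """sizes[u][v] = set of sizes (in bonds) of elementary paths from u to v."""
    sizes = {u: {v: set() for v in adj} for u in adj}

    def walk(start, node, seen, length):
        for nxt in adj[node] - seen:
            sizes[start][nxt].add(length + 1)
            walk(start, nxt, seen | {nxt}, length + 1)

    for start in adj:
        walk(start, start, {start}, 0)
    return sizes


def refined_cg(p_labels, p_adj, g_labels, g_adj):
    p_sizes, g_sizes = path_sizes(p_adj), path_sizes(g_adj)
    nodes = [(u, v) for u in p_labels for v in g_labels
             if p_labels[u] in ("X", g_labels[v])     # label test, 'X' = joker
             and len(p_adj[u]) <= len(g_adj[v])]      # degree test (assumed)
    edges = {(a, b) for a, b in combinations(nodes, 2)
             if a[0] != b[0] and a[1] != b[1]
             # path-size test (assumed): pattern sizes included in peptide sizes
             and p_sizes[a[0]][b[0]] <= g_sizes[a[1]][b[1]]}
    return nodes, edges


# Hypothetical pattern Ala-X-Ser against the linear peptide Ala-Orn-Ser-Ala.
nodes, edges = refined_cg(
    {0: "Ala", 1: "X", 2: "Ser"}, {0: {1}, 1: {0, 2}, 2: {1}},
    {0: "Ala", 1: "Orn", 2: "Ser", 3: "Ala"}, {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}},
)
print(len(nodes), len(edges))  # 5 4
```

In this example the three CG nodes (0, 0), (1, 1) and (2, 2) are pairwise linked, so they form a clique of the size of the pattern, corresponding to the occurrence Ala-Orn-Ser.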

Search for a

The presence of a

To search for a

Another heuristic we use to speed up the clique search is based on the fact that once we identified more than (|_{P}| - _{P}|), each pattern node has to contribute to the clique. For example, in Figure

Testing

Case study of structural properties of NRPs

We studied the distribution of patterns of size 4 in all peptides of the database. The results are shown in Figure

**Structural properties of NRPs contained in NORINE**. Distribution of (a) 4-patterns and (b) peptide sizes in the N

Efficiency of the method

In order to test the efficiency of our method, we compared the number of nodes and edges in the CGs obtained with the classical and the new building rules on different examples in the case of search for an entire pattern. The results are shown in Table

Number of nodes and edges of the CG constructed with classical and new building rules

| pattern | peptide | # CG nodes (classical/**new**) | # CG edges (classical/**new**) |
|---|---|---|---|
| P1 | G1 | 13/**12** | 22/**19** |
| P2 | G1 | 16/**16** | 43/**29** |
| P3 | G1 | 35/**30** | 210/**100** |
| P3 | G2 | 25/**15** | 100/**0** |
| P4 | G2 | 10/**8** | 14/**9** |
| Ala-1^{(a)} | Ala^{(b)} | 73/**73** | 1918/**286** |
| (X)19^{(c)} | Ala^{(b)} | 380/**346** | 53010/**3948** |

Patterns P1–P4 and peptides G1–G2 refer to Figure 1. In all examples,

^{(a) }linear pattern of size 19 corresponding to alamethicin F50 without the last monomer

^{(b) }alamethicin F50 [N

^{(c) }linear pattern of 19 'X' monomers
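The (X)19 row of the table can be reproduced with a few lines of Python under two assumptions that are not spelled out at this point of the text: alamethicin F50 is modelled as a linear chain of 20 monomers, and the new rules amount to a degree test on CG nodes plus equality of the unique elementary path size on CG edges (in a linear chain there is exactly one elementary path between two nodes). The function name `count_cg` is hypothetical.

```python
from itertools import combinations


def count_cg(pattern_len, peptide_len, refined):
    """Count CG nodes and edges for an all-'X' linear pattern and a linear peptide.

    Both chains are assumed to have at least two nodes.
    """
    def degree(i, n):
        return 1 if i in (0, n - 1) else 2   # chain ends have a single bond

    nodes = [(u, v) for u in range(pattern_len) for v in range(peptide_len)
             if not refined or degree(u, pattern_len) <= degree(v, peptide_len)]
    edges = 0
    for (u1, v1), (u2, v2) in combinations(nodes, 2):
        if u1 == u2 or v1 == v2:
            continue
        dp, dg = abs(u1 - u2), abs(v1 - v2)   # the only path size in a chain
        if refined:
            edges += dp == dg                 # path sizes must coincide
        else:
            edges += (dp == 1) == (dg == 1)   # bonded in both graphs or in neither
    return len(nodes), edges


print(count_cg(19, 20, refined=False))  # (380, 53010)
print(count_cg(19, 20, refined=True))   # (346, 3948)
```

Under these assumptions the counts agree with the table: the number of nodes drops only slightly, while the number of edges falls by more than an order of magnitude, which is what makes the subsequent clique search faster.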

In order to validate this speed-up in running time, we measured the search time for different complete patterns in the N

**Search time for different complete patterns in the NORINE database**. Here,

For the linear pattern of size 7 (example 7), which is contained in more than 70% of the database peptides, the classical rules show an 8-fold slow-down of the running time compared to the new rules. For a linear pattern composed of 14 'X' (example 9), the classical method required 7 hours to produce the result whereas our method took less than 300 ms. Example 12 is the search for a pattern composed of 49 'X', the size of the largest peptide of the database. About 5 minutes were needed for the classical method to obtain the result whereas our method took only about 600 ms. Example 14 represents a negative test as this pattern does not occur in N

These experiments illustrate the efficiency and adequacy of our method for the search for structural patterns in NRPs.

Examples of practical applications

In this part, we give some examples of using structural pattern matching of NRPs in biological studies.

Structural features

Structural search can allow one to identify a structural motif common to peptides of a given family. As an example, a search for a cyclic 8-node pattern composed of seven 'X' and a fatty acid moiety (represented by the monomer code '

Another example is the search for a pattern associated with a biological activity. For example, pattern P4 occurs in G2, which represents ornibactin. Ornibactin is a siderophore, an iron-chelating molecule. This type of molecule needs bidentate functions that can ensure a six-fold coordination of the ferric iron. Ornithine and its derivatives can harbour this function. A search for the complete pattern P4 returns a list of six siderophore peptides such as ornibactin, pyoverdin or foroxymithine. A search for the pattern R-CO_*OH-Orn_*Asp_*Ser_*Orn derived from ornibactin, with

Product identification

Another application of structural pattern matching is the search for a predicted peptide. Several studies (see

Analysis of a putative peptide

From the analysis of protein sequence similarity, some proteins can be predicted as putative NRPSs. Examples of such predictions can be found in the UniProtKB database. Even though the produced peptide has not been identified, one can infer some properties of a putative NRPS product using the structural search. An example can be provided by the putative NRPS [UniProtKB:Q1I964] from UniProtKB found in

Conclusion

Nonribosomal peptides are bioactive compounds with a variety of important biological activities, and they are increasingly studied. With this motivation, we developed the N

In this paper, we presented an efficient dedicated method to search for a structural pattern in the database. We refined the CG building rules previously used in the literature and adapted them to our problem. The main idea of the refinement is to use the information on elementary path sizes and on node degrees in order to decrease the number of nodes and especially the number of edges in the resulting CG. This, in turn, leads to a considerable speed-up in the search for a clique in the CG, which is the final step in the identification of a pattern occurrence.

As a result, a search for a pattern in the N

Searching for a structural pattern in the database can be used in different biological studies. For example, it can help to identify members of a peptide family that share common structural properties. It can also help to identify a predicted peptide by searching for it in the N

An obvious weakness of the method is that, in general, it may be unable to identify a common structure if the correspondence is not exact, i.e. if some monomers "get replaced" by others (without this being specified explicitly with the joker or alternative labels), or do not have counterparts at all. Therefore, an interesting direction for future research would be to extend the method to "error-tolerant" pattern matching that deals with possible deletions, insertions or substitutions of monomers.

Methods

N

The method presented in this paper is included in N

Implementation

The method has been implemented in Java within the N

Authors' contributions

SC carried out most of the work on the algorithm design, implementation in N

Acknowledgements

This work was supported by the