Molecular Bioscience Graduate Program, Arkansas State University, Arkansas, USA

Bioinformatics Graduate Program, University of Arkansas at Little Rock, Arkansas, USA

School of Computing, DePaul University, Illinois, USA

Department of Computer Science, Lafayette College, Pennsylvania, USA

Department of Computer Science, Arkansas State University, Arkansas, USA

Abstract

Background

Protein structure comparison and classification is an effective method for exploring protein structure-function relations. This problem is computationally challenging. Many different computational approaches for protein structure comparison apply the secondary structure elements (SSEs) representation of protein structures.

Results

We study the complexity of the protein structure comparison problem based on a mixed-graph model with respect to different computational frameworks. We develop an effective approach for protein structure comparison based on a novel independent set enumeration algorithm. Our approach (named: ePC, **e**fficient **e**numeration-based **P**rotein structure **C**omparison) is tested for general purpose protein structure comparison as well as for specific protein examples. Compared with other graph-based approaches for protein structure comparison, the theoretical running-time ^{rn}n^{2}) of our approach ePC is significantly better, where

Conclusion

Through the enumeration algorithm, our approach can identify different substructures from a list of high-scoring solutions of biological interest. Our approach can flexibly compare protein structures whose SSEs are aligned in either sequential or non-sequential order. Supplementary data of additional testing and the source of ePC will be available at

Background

Protein structure comparison is an effective method for exploring protein structure-function relations and for studying evolutionary relations of different species. It can also be applied to identify the active sites of carrier proteins, the binding sites of antibodies, the inhibition sites of enzymes, and the common structural motifs of proteins, all of which have significant applications in biological and biomedical research.

The computational methods for protein structure comparison usually represent a protein structure by atomic coordinates in the Euclidean space, as a distance matrix

We first show the problem of comparing a query structure to another structure is intractable with respect to several computational frameworks. For example, we show that the problem is

Whereas the above results are negative, hinting at the challenging nature of the problem, the graph-based approach we use allows us to model the problem as a maximum independent set problem, for which a repertoire of effective exact algorithms exists in the literature. We use an algorithm developed by (some of) the authors ^{n}n^{2}), where ^{rn}n^{2}) of our approach ePC is the current best, where n is the smaller number of SSEs of the two proteins,

Many different approaches for protein structure comparison apply the secondary structure elements (SSEs) representation and database searching, such as deconSTRUCT

Methods

A mixed graph for a protein structure is constructed from the PDB file as follows: each vertex represents a core/secondary structure element (i.e., an alpha helix element or a beta strand element), each undirected edge represents the interaction between two cores, and each directed edge (arc) represents the loop between two consecutive cores (from the N-terminal to the C-terminal). A mixed graph representation is used for protein structure prediction in

**Structure graph for 6ldh**. Alpha helix elements are represented by circles and beta strand elements are represented by squares.
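The mixed-graph representation described above can be sketched as a small data structure. This is a minimal illustration, not the ePC implementation: the names `MixedGraph` and `build_mixed_graph`, the `"H"`/`"E"` labels, and the toy interaction list are all hypothetical, and parsing a PDB file to detect SSEs and their interactions is assumed to have happened elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class MixedGraph:
    """Mixed graph of one protein (hypothetical layout, for illustration only)."""
    kinds: list                                 # kinds[i] is "H" (helix) or "E" (strand), N- to C-terminal
    edges: set = field(default_factory=set)     # undirected interactions, stored as frozenset({i, j})
    arcs: set = field(default_factory=set)      # directed loops, stored as (i, i + 1)

    def add_interaction(self, i, j):
        self.edges.add(frozenset((i, j)))

    def has_interaction(self, i, j):
        return frozenset((i, j)) in self.edges


def build_mixed_graph(kinds, interactions):
    """Build the graph from an ordered SSE list and a list of interacting SSE pairs."""
    graph = MixedGraph(kinds=list(kinds))
    # One arc per loop between consecutive cores, directed from N- to C-terminal.
    for i in range(len(kinds) - 1):
        graph.arcs.add((i, i + 1))
    for i, j in interactions:
        graph.add_interaction(i, j)
    return graph


# Toy input (made up): strand, helix, strand, helix with two interactions.
g = build_mixed_graph(["E", "H", "E", "H"], [(0, 2), (1, 3)])
```

Storing undirected edges as frozensets makes the interaction test symmetric, while arcs stay ordered pairs so the backbone direction is preserved.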

Goldman et al.

Song et al. _{v }_{v}_{v}

The graph embedding problem and complexity results

In this section, we study the complexity of the mixed graph embedding problem, which corresponds to the problem of identifying the query protein structure (e.g., a motif structure) as a substructure in a larger protein structure.

We define the

Given two mixed graphs

(i)

(ii) for any two vertices

(iii) for any two vertices

We shall call an injective embedding

Informally speaking, the

We define the restriction of the

If one cannot embed the whole graph _{≥}, by introducing a nonnegative parameter

(i)

(ii) for any two vertices

(iii) for any two vertices

The optimization/maximization version of the _{≥ }problem, denoted _{≥ }and

It was shown in

**Theorem 0.1 **

For every vertex

For every two vertices

For every two vertices

This completes the construction of

It is not difficult to verify that (

The above theorem, together with the result in

If we consider the ^{r}

We investigate next the complexity of the _{≥ }problem.

**Theorem 0.2 **_{≥ }

_{≥ }problem. We only prove the NP-hardness, as it is very easy to show the membership of the problem in

Let _{≥ }as follows. The set of vertices _{1}, ... ,_{n}_{1}, ... ,_{n}_{1 },..., _{n }_{1}, ... ,_{n }_{i}_{j }_{i}u_{j }

It is not difficult to verify that _{≥}. This completes the proof. □

The reduction in the above theorem is an fpt-reduction, from the _{≥}, where the parameter is the size of the subgraph sought

**Theorem 0.3 **The _{≥ }

Finally, we observe that the same reduction in Theorem 0.2 provides an

**Theorem 0.4 **

Graph embedding to independent set

In this section we show that the

Let (_{1}, _{2}, ... ,_{n}_{i }_{i}_{+1}, for 1 ≤ _{1}, _{2}, ... ,_{m}_{i }_{i}_{+1}, for 1 ≤

**Theorem 0.5 **^{cn}^{crn }time

P_{i }_{j}_{ij}_{ij }_{kl}

1.

2.

3. There is an undirected edge between _{i }_{k }_{j }_{l }

Note that Condition 2 can be removed when the order of the mapped vertices is not required to be the same for the two graphs.
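The auxiliary-graph construction can be sketched as follows. Because the exact conditions are not fully legible in this text, the three conflict rules below are assumptions chosen to be consistent with what survives: two SSE pairs conflict if they reuse an SSE, if they reverse the backbone order (dropped when `sequential=False`, mirroring the note on Condition 2), or if an interaction between the two query SSEs has no counterpart in the target. The function name and argument layout are hypothetical.

```python
from itertools import combinations


def build_conflict_graph(n, m, edges_p, edges_q, sequential=True):
    """Return (vertices, conflicts) for matching query SSEs 0..n-1 to target SSEs 0..m-1.

    Each vertex (i, j) means "query SSE i is matched to target SSE j".
    The conflict rules are assumptions made for this sketch.
    """
    ep = {frozenset(e) for e in edges_p}
    eq = {frozenset(e) for e in edges_q}
    vertices = [(i, j) for i in range(n) for j in range(m)]
    conflicts = set()
    for (i, j), (k, l) in combinations(vertices, 2):
        # Assumed rule 1: an SSE may be used in at most one matched pair.
        shares_sse = i == k or j == l
        # Assumed rule 2: matched pairs keep the same N-to-C order (optional).
        order_flip = sequential and (i - k) * (j - l) < 0
        # Assumed rule 3: a query interaction must also exist in the target.
        lost_edge = frozenset((i, k)) in ep and frozenset((j, l)) not in eq
        if shares_sse or order_flip or lost_edge:
            conflicts.add(frozenset(((i, j), (k, l))))
    return vertices, conflicts


# Toy input (made up): two SSEs per protein, one interaction in each.
verts, conf_seq = build_conflict_graph(2, 2, [(0, 1)], [(0, 1)], sequential=True)
_, conf_free = build_conflict_graph(2, 2, [(0, 1)], [(0, 1)], sequential=False)
```

Under these rules, any independent set of the resulting graph is a set of mutually compatible SSE pairs, which is the property the independent-set formulation relies on.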

It is clear that any independent set of ^{cn}^{crn }

If we use the current-best exact algorithm for ^{n}^{/4}), we conclude that:

**Theorem 0.6 **^{rn}^{/4}),

Algorithm for structure comparison

The problem of protein structure comparison can be modeled as the problem of finding an independent set in an auxiliary graph. When aligning two protein structures, the auxiliary graph

Refer to the following for the outline of the algorithm for protein structure comparison.

1

2

3

4

We analyze the time complexity of the algorithm:

Step 1: The algorithm processes the two proteins to generate the corresponding two structure graphs, where each vertex of a graph represents an SSE of the corresponding protein. Suppose the number of the vertices of each structure graph is bounded by

Step 2: We introduce a parameter

Step 3: Through calling the enumeration algorithm developed in ^{rn}

Step 4: It takes time ^{rn}n^{2}) to evaluate the generated independent sets and identify the independent set that corresponds to the SSE pairs with the best score of the two proteins.
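Steps 3 and 4 (enumerate independent sets, then score them and keep the best) can be illustrated with a short sketch. The enumeration here is plain include/exclude branching, used only as a stand-in; it is not the enumeration algorithm the approach relies on and it does not achieve its running time. The function names, the scoring callback, and the toy graph are hypothetical.

```python
import heapq


def enumerate_independent_sets(vertices, conflicts):
    """Yield every independent set as a tuple (naive branching, exponential time)."""
    def extend(idx, chosen):
        if idx == len(vertices):
            yield tuple(chosen)
            return
        v = vertices[idx]
        # Branch 1: leave v out.
        yield from extend(idx + 1, chosen)
        # Branch 2: take v only if it conflicts with nothing chosen so far.
        if all(frozenset((v, u)) not in conflicts for u in chosen):
            chosen.append(v)
            yield from extend(idx + 1, chosen)
            chosen.pop()

    yield from extend(0, [])


def best_alignments(vertices, conflicts, score, k):
    """Return the k highest-scoring independent sets under the given score function."""
    candidates = enumerate_independent_sets(vertices, conflicts)
    return heapq.nlargest(k, candidates, key=score)


# Toy conflict graph (made up): "a" and "b" cannot be chosen together.
toy_vertices = ["a", "b", "c"]
toy_conflicts = {frozenset(("a", "b"))}
all_sets = list(enumerate_independent_sets(toy_vertices, toy_conflicts))
top_two = best_alignments(toy_vertices, toy_conflicts, score=len, k=2)
```

Keeping a list of the `k` best sets rather than a single optimum is what lets alternative substructure matches surface; here `score=len` simply favors larger sets, whereas a real run would plug in a structural score.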

Refer to ^{n}^{n}^{+1})^{rn}n^{2}) of our approach ePC is the current best, where

Testing results

Our approach ePC is designed for general-purpose protein structure comparison. In this section we test our approach for this purpose using SABmark-sup and SABmark-twi

Given two proteins, _{ij }_{ij }_{1}, _{2}) = 0.1 − |_{1 }− _{2}|/(_{1 }+ _{2}).
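The one formula that survives in the paragraph above, 0.1 − |·₁ − ·₂|/(·₁ + ·₂), can be written out as a small function. The argument names `x1` and `x2` are placeholders, since the symbols and the quantity they measure are not legible here; only the arithmetic is taken from the text.

```python
def pair_score(x1, x2):
    """Score two positive quantities by their normalized difference.

    Returns 0.1 - |x1 - x2| / (x1 + x2). What x1 and x2 measure is not
    stated in the surviving text, so they are treated as generic values.
    """
    total = x1 + x2
    if total == 0:
        raise ValueError("x1 + x2 must be nonzero")
    return 0.1 - abs(x1 - x2) / total


identical = pair_score(5.0, 5.0)     # equal inputs give the maximum value
borderline = pair_score(9.0, 11.0)   # normalized difference is exactly 0.1
dissimilar = pair_score(1.0, 3.0)    # normalized difference is 0.5
```

The score is positive only when the normalized difference stays below 0.1, so near-identical values are rewarded and larger mismatches are penalized.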

Let _{A }_{B }_{c }_{c }_{n }_{n }_{c}_{c}

Testing different parameter values

There are two important parameters of our algorithm _{ij }

We present our testing results for accuracy (using the score

**The running times for different r values**. Note that for all these tests, our approach uses the same parameter K = 1000.

**The scores for different r values**. Note that for all these tests, our approach uses the same parameter K = 1000.

For the enumeration of independent sets, we have introduced a parameter

The running times and scores for different

| **K** | **125** | **250** | **500** | **1000** |
|---|---|---|---|---|
| time | 1.90 | 3.57 | 6.76 | 12.73 |
| score | 8.89 | 9.08 | 9.17 | 9.28 |

Note that for all these tests, our approach uses the same parameter
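The trade-off in the table of running times and scores for different K can be made explicit with a little arithmetic. The numbers below are copied from that table; the helper names are hypothetical, and the unit of time is left unstated because the text does not give one.

```python
ks = [125, 250, 500, 1000]
times = [1.90, 3.57, 6.76, 12.73]   # running times from the table
scores = [8.89, 9.08, 9.17, 9.28]   # scores from the table


def successive_ratios(values):
    """Ratio of each entry to the one before it."""
    return [b / a for a, b in zip(values, values[1:])]


def successive_gains(values):
    """Difference between each entry and the one before it."""
    return [b - a for a, b in zip(values, values[1:])]


time_ratios = successive_ratios(times)    # cost multiplier per doubling of K
score_gains = successive_gains(scores)    # score improvement per doubling of K

for k, ratio, gain in zip(ks[1:], time_ratios, score_gains):
    print(f"K={k}: time x{ratio:.2f}, score +{gain:.2f}")
```

Each doubling of K multiplies the running time by a factor between 1.8 and 1.9, while the score rises by less than 0.2 each time, so the choice of K is essentially a trade of near-linear extra cost for small accuracy gains.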

Performing structure comparison

Specific examples

We test our approach on specific examples for common substructures and novel folds which share common substructures with non-sequential SSEs.

Please refer to the following testing results of our approach, obtained when 1a02N is compared with 1iknA, 1nfiA, and 1a3qA. Our testing results match the results in

**The 3D Structure of 1a02N with its two domains: p53-like transcription factors and E set domains**. There are 18 cores/SSEs (0-17) with conserved SSEs marked with *. Matched SSEs of 1a02N and 1iknA: (0,1) (1,2) (3,3) (7,5) (13,7) (14,8) (17,11); Matched SSEs of 1a02N and 1nfiA: (0,1) (1,2) (3,3) (7,5) (12,7) (13,10) (15,11) (17,13); Matched SSEs of 1a02N and 1a3qA: (3,0) (5,3) (6,4) (7,5) (13,7) (14,8) (16,12) (17,13).

Structure search and comparison of the three novel folds with the structural analogs

| **New fold** | **Detected analog** | **DaliLite** | **TM-align** | **GANGSTA+** | **deconSTRUCT** | **SSM** | **ePC** |
|---|---|---|---|---|---|---|---|
| 2JMK/7/57 | 1GO4H/4/93 | 11.0/0/75 | 4.0/1/67 | 1.8/7/61 | 0/0 | 1/14 | 4/100%/8.3 |
| 2AJE/7/44 | 1J7NB/40/738 | 3.9/3/45 | 3.4/3/45 | 2.1/4/53 | 3/31 | 3/61 | 7/100%/10.9 |
| 2ES9/5/58 | 1SXJH/15/267 | 2.5/4/57 | 4.0/5/65 | 1.8/5/69 | 3/36 | 2/38 | 5/100%/9.9 |

The results for DaliLite, TM-align and GANGSTA+ are from

**Structure alignment of PDB:2AJE and PDB:1J7NB**. Structure alignment of the new fold PDB:2AJE and the structural analog PDB:1J7NB, showing nonsequential order of aligned SSEs.

**Aligned SSEs of PDB:2AJE and PDB:1J7NB**. The amino acid sequences of the new fold PDB:2AJE and the structural analog PDB:1J7NB, showing the non-sequential order of aligned SSEs of the two protein sequences.

Discussion

We use an SSE-based graph model for general-purpose protein structure comparison. We present computational complexity results for the protein structure comparison problem and develop an effective algorithm that integrates a novel enumeration of independent sets with parameterized computation. Our approach is tested for protein structure comparison using benchmark testing sets. Compared with other SSE-based approaches, our approach has comparable performance for general-purpose protein structure comparison. We also demonstrate that our approach can be applied to identify common substructures with non-sequential SSEs and proteins sharing more than one common substructure.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

XH, IK and GX carried out the study on the complexity and the design of the approach for the protein structure comparison problem, and drafted the manuscript. CA, DJ and KW participated in the implementation and the testing of the algorithm. All authors have approved the final manuscript.

Acknowledgements

This research is supported by National Institutes of Health grants from the National Center for Research Resources (5P20RR016460-11) and the National Institute of General Medical Sciences (8P20GM103429-11).

Declarations

The publication costs for this article were funded by the corresponding author's institution.

This article has been published as part of