Protein structure prediction is an important but unsolved problem in biological science. Predicted structures vary much with energy functions and structure-mapping spaces. In our simplified ab initio protein structure prediction methods, we use hydrophobic-polar (HP) energy model for structure evaluation, and 3-dimensional face-centred-cubic lattice for structure mapping. For HP energy model, developing a compact hydrophobic-core (H-core) is essential for the progress of the search. The H-core helps find a stable structure with the lowest possible free energy.
In order to build H-cores, we present a new Spiral Search algorithm based on tabu-guided local search. Our algorithm uses a novel H-core directed guidance heuristic that squeezes the structure around a dynamic hydrophobic-core centre. We applied random walks to break premature H-cores and thus to avoid early convergence. We also used a novel relay-restart technique to handle stagnation.
We have tested our algorithms on a set of benchmark protein sequences. The experimental results show that our spiral search algorithm outperforms the state-of-the-art local search algorithms for simplified protein structure prediction. We also experimentally show the effectiveness of the relay-restart.
Proteins are essentially sequences of amino acids. They adopt specific folded three-dimensional structures to perform specific tasks. The function of a given protein is determined by its native structure, which has the lowest possible free energy level. Nevertheless, misfolded proteins cause many critical diseases such as Alzheimer's disease, Parkinson's disease, and Cancer [1,2]. Protein structures are important in drug design and biotechnology.
Protein structure prediction (PSP) is computationally a very hard problem . Given a protein's amino acid sequence, the problem is to find a three dimensional structure of the protein such that the total interaction energy amongst the amino acids in the sequence is minimised. The protein folding process that leads to such structures involves very complex molecular dynamics  and unknown energy factors. To deal with the complexity in a hierarchical way, researchers have used discretised lattice-based structures and simplified energy models [5-7] for PSP. However, the complexity of the simplified problem still remains challenging.
The state-of-the-art approaches
There are a large number of existing search algorithms that attempt to solve the PSP problem by exploring feasible structures called conformations. A memory based local search (LS-Mem)  method reportedly produced the best results on face-centred-cubic (FCC) lattice for hydrophobic-polar (HP) energy model. Before LS-Mem, the state-of-the-art results were achieved for similar model by tabu-based local search (LS-Tabu) methods [9,10]. Besides these, genetic algorithms (GA) , and tabu search  found promising results on 2D and 3D hexagonal lattice based HP models.
In general, the success of single-point and/or population-based search algorithms crucially depends on the balance of diversification and intensification of the exploration. However, these algorithms often get stuck in local minima. As a result, they perform poorly on large sized (> 100 amino acids) proteins. Any further progress to these algorithms requires addressing the above issues appropriately.
In this paper, we present a novel spiral search algorithm for ab initio protein structure prediction using HP energy model on three-dimensional (3D) FCC lattice. By using tabu heuristic, the search approaches towards the optimum solution by spinning around a dynamic hydrophobic-core centre (HCC) like a coil. We call our tabu-based spiral search algorithm SS-Tabu. In SS-Tabu, we consider the diagonal move (corner-flip) (as shown in Figure 1) to build the hydrophobic-core (H-core). We apply random-walk  to break the premature H-core. We use a novel relay-restart when the search is trapped in local minima and the random-walk fails to overcome the stagnation. On a set of benchmark proteins, SS-Tabu significantly outperforms the state-of-the-art local search algorithms [8,9] on similar models.
Figure 1. Diagonal move. Depiction of a diagonal move, for easy comprehension, shown in 2D space.
Computational methods for PSP
Homology modeling, protein threading and ab initio are three computational approaches used in protein structure prediction. Prediction quality of homology modeling and protein threading depends on the sequential similarity of previously known protein structures. However, our work is based on the ab initio approach that only depends on the amino acid sequence of the target protein. Levinthal's paradox  and Anfensen's hypothesis  are the basis of ab initio method for protein structure prediction. The idea was originated in 1970 when it was demonstrated that all information needed to fold a protein resides in its amino acid sequence. In our simplified protein structure prediction model, we use 3D FCC lattice for conformation mapping, HP energy model for conformation evaluation, and a hydrophobic-core centric local search algorithm (SS-Tabu) for conformation search. Local search approach, 3D FCC lattice, and HP energy model are described below.
Starting from an initial solution, local search algorithms move from one solution to another to find a better solution. Local search algorithms are well known for efficiently producing high quality solutions, which are difficult for systematic search approaches. However, they are incomplete , and suffer from revisitation and stagnation. Restarting the whole or parts of a solution remains the typical approach to deal with such situations. In PSP, Cebrian et al.  used a local search algorithm combined with tabu heuristic. They implemented their method on 3D FCC lattice configuration for HP model, and tested its effectiveness on Harvard instances . Later, Dotu et al.  extended the work in  by using a hybrid method that combines local search and constraint programming together. Prior to LS-Mem, these two methods [9,10] produced the state-of-the-art results for PSP on FCC lattice and HP energy model.
Tabu meta-heuristic [18,19] enhances the performance of local search algorithms. It maintains a memory structure to remember the local changes of a solution. Then any local changes for those stored positions are forbidden for certain number of subsequent iterations (known as tabu tenure).
3D FCC lattice
The FCC lattice has the highest packing density compared to the other existing lattices . In FCC, each lattice point (the origin in Figure 2) has 12 neighbours with 12 basis vectors (1, 1, 0), (-1, -1, 0), (-1, 1, 0), (1, -1, 0), (0, 1, 1), (0, 1, -1), (1, 0, 1), (1, 0, -1), (0, -1, 1), (-1, 0, 1), (0, -1, -1), and (-1, 0, -1). The hexagonal closed pack (HCP) lattice, also known as cuboctahedron, was used in . In HCP, each lattice point has 12 neighours that correspond to 12 basis vertices with real-numbered coordinates, which causes the loss of structural precision for PSP. In simplified PSP, conformations are mapped on the lattice by a sequence of basis vectors, or by the relative vectors that are relative to the previous basis vectors in the sequence.
Figure 2. 3D FCC lattice. A unit 3D FCC lattice; the 12 basis vectors are shown on the Cartesian coordinates; the vectors represent the 12 topological neighbours of the origin on the FCC lattice.
HP energy model
The 20 constituent amino acids of proteins are broadly divided into two categories based on the hydrophobicity of the amino acids: (a) hydrophobic amino acids denoted as H (Gly, Ala, Pro, Val, Leu, Ile, Met, Phe, Tyr, Trp); and (b) hydrophilic or polar amino acids denoted as P (Ser, Thr, Cys, Asn, Gln, Lys, His, Arg, Asp, Glu). In the HP model [21,22], when two non-consecutive hydrophobic amino acids become topologically neighbours, they contribute a certain amount of negative energy, which for simplicity is shown as -1 in Figure 3. The total energy (E) of a conformation based on the HP model becomes the sum of the contributions of all pairs of non-consecutive hydrophobic amino acids as shown in Equation 1.
where cij = 1 if amino acids i and j are non-consecutive neighbours on the lattice, otherwise 0; and eij = -1 if ith and jth amino acids are hydrophobic, otherwise 0.
Different types of metaheuristic have been used in solving the simplified PSP problem. These include Monte Carlo Simulation , Simulated Annealing , Genetic Algorithms (GA) [25,26], Tabu Search with GA , Tabu Search with Hill Climbing , Ant Colony Optimisation , Immune Algorithms , Tabu-based Stochastic Local Search [8,9], and Constraint Programming . Cebrian et al.  used tabu-based local search, and Shatabda et al.  used memory-based local search with tabu heuristic and achieved the state-of-the-art results. However, Dotu et al.  used constraint programming and found promising results but only for small sized (< 100 amino acids) proteins. Besides local search, Unger and Moult  applied genetic algorithms to PSP and found their method to be more promising than the Monte Carlo based methods . They used absolute encodings on the square and cubic lattices for HP energy model. Later, Patton  used relative encodings to represent conformations and a penalty method to enforce the self-avoiding walk constraint.
The GA has been used by Hoque et al.  for cubic, and 3D HCP lattices. They used Depth First Search (DFS) to generate pathways  in GA crossover for PSP. They also introduced a twin-removal operator  to remove duplicates from the population and thus to prevent the search from stalling.
In HP model, protein structures have H-cores that hide the hydrophobic amino acids from water and expose the polar amino acids to the surface to be in contact with the surrounding water molecules . H-core formation is the main objective of HP based PSP. To achieve this, the total distance of all H-H pairs is minimised in . A predefined motif based segment replacement strategy is applied in  to replace structure segments by pre-determined substructures based on matching H-P orientations in the target sequence. In SS-Tabu, we try to reduce the distance of each H-amino acid from the HCC; which eventually helps minimise the free energy level of the conformation.
Spiral search framework
In spiral search, only the diagonal move operator is used repeatedly (as shown in Figure 4) in building H-cores. A diagonal move displaces ith amino acid from its position to another position on the lattice without changing the position of its succeeding (i + 1)th and preceding (i - 1)th amino acids in the sequence. The move is just a corner-flip to an unoccupied lattice point. In SS-Tabu, we repeatedly use diagonal moves that squeeze the conformation and quickly form the H-core. The spiral search procedure (see the pseudocode in Figure 5) is composed of several sub-procedures mainly, for move selection, for handling local minima and stagnation, and for initialisation and evaluation.
Figure 4. Spiral search. Spiral search comprising a series of diagonal moves. For simplification and easy understanding, the figures are presented in 2-dimensional space.
Figure 5. Spiral search algorithm. Spiral Search framework: pseudocode of Procedure SpiralSearch.
In move selection, the hydrophobic amino acids get priority in comparison to polar amino acids. The move selection criteria are explained in the following paragraphs.
In H-move selection (see the pseudocode in Figure 6), the HCC is calculated by finding arithmetic means of x, y, and z coordinates of all hydrophobic amino acids using Equation 2. The selection is guided by the Cartesian distance di (as shown in Equation 3) between HCC and the hydrophobic amino acids in the sequence. For the ith hydrophobic amino acid, the common topological neighbours of the (i - 1)th and (i + 1)th amino acids are computed. The topological neighbours (TN) of a lattice point are the points at unit lattice-distance apart from it. For 3D FCC lattice, there are four common TN of the (i - 1)th and (i + 1)th amino acids. From the common neighbours, the unoccupied points are identified. The Cartesian distance of all unoccupied common neighbours are calculated from the HCC using Equation 3. Then the point with the shortest distance is picked. This point is listed in the possible H-move list for ith hydrophobic amino acid if its current distance from HCC is greater than that of the selected point. When all hydrophobic amino acids are traversed and the feasible shortest distances are listed in H-move list, the amino acid having the shortest distance in H-move list is chosen to apply diagonal move operator on it. A tabu list is maintained for each hydrophobic amino acid to control the selection priority amongst them. For each successful move, the tabu list is updated for the respective amino acid. The process stops when no H-move is found. In this situation, the control is transferred to select and apply P-moves.
Figure 6. H-move algorithm. Spiral Search framework: pseudocode of Procedure selectMoveForH.
where nh is the number of H amino acids in the protein.
For polar amino acids, the same kind of diagonal moves are applied as H-move. For each ith polar amino acid, all free lattice points that are common neighbours of lattice points occupied by (i - 1)th and (i + 1)th amino acids are listed. From the list, a point is selected randomly to complete a diagonal move for the respective polar amino acid. No hydrophobic-core center is calculated, no Cartesian distance is measured, and no tabu list is maintained for P-move. After one try for each polar amino acid the control is returned to select and apply H-moves.
For hard optimisation problems such as protein structure prediction, local search algorithms often face stagnation. Thus, handling such situation intelligently is important to proceed further. In our SS-Tabu, we use random-walk  and a new relay-restart technique with on-demand basis to deal with stagnation.
Premature H-cores are observed at local minima. To escape local minima, a random-walk  algorithm (see the pseudocode in Figure 7) is applied. This algorithm uses pull moves  (as shown in Figure 8) to break the H-core. We use pull-moves because they are complete, local, and reversible. Successful pull moves never generate infeasible conformations. During pulling, energy level and structural diversification are observed to maintain a balance among these two. We allow energy level to change within 5% to 10% with changes in the structure from 10% to 75% of the original. We try to accept the conformation that is close to the current conformation in terms of energy level but is diverse in terms of structure.
Figure 7. Random-walk algorithm. Spiral Search framework: pseudocode of Procedure randomWalk.
Instead of using a fresh restart or restarting from the current best solution [8,9], we use a new relay-restart technique (see the pseudocode in Figure 9) when the search stagnation situation arises. We use relay-restart when random-walk fails to escape from local minima. The relay restart starts from an improving solution. We maintain an improving solution list that contains all the improving solutions after initialisation. When a solution with energy level better than the current global best is found, the solution is added to the top of the list pushing existing solutions back. For relay-restart, a random conformation from the top 10% solutions of the list is selected to start with. The selected solution is then sent back to the bottom of the list to keep it away from the scope of reselection in very near future.
Figure 9. Relay-restart algorithm. Spiral Search framework: pseudocode of Procedure relayRestart.
Further implementation details
Like other local search algorithms, our spiral search requires initialisation. It also needs evaluation of the solution in each iteration. Further, it needs to maintain a tabu meta-heuristic to guide the search.
Our algorithm starts with a feasible conformation. We generate an initial conformation following a self-avoiding walk (SAW) on FCC lattice points. The pseudocode of the algorithm is presented in Figure 10. It places the first amino acid at (0, 0, 0). It then randomly selects a basis vector to place the successive amino acid at a neighbouring free lattice point. The mapping proceeds until a self-avoiding walk is found for the whole protein sequence.
Figure 10. Initialisation algorithm. Spiral Search framework: pseudocode of Procedure initialise.
Intuitively we use different tabu-tenures based on the number of hydrophobic amino acids (hCount) in the sequence. We intuitively calculate tabu-tenure using the formula in Equation 4:
After each iteration, the conformation is evaluated by counting the H-H contacts (topological neighbour) where the two amino acids are non-consecutive. The pseudocode in Figure 11 presents the algorithm of calculating the free energy of a given conformation. Note that energy value is negation of the H-H contact count.
Figure 11. Evaluation algorithm. Spiral Search framework: pseudocode of Procedure evaluate.
Results and discussion
In our experiment, the protein instances (as shown in Table 1), F180 and R instances are taken from Peter Clote laboratory website (bioinformatics.bc.edu/clotelab/FCCproteinStructure). Cebrian et al. , Dotu et al. , and Shatabda et al.  used these instances in evaluating their algorithms. We also use six more larger sequences that are taken from the CASP (predictioncenter.org/casp9/targetlist.cgi) competition. The corresponding CASP target IDs for proteins 3mse, 3mr7, 3mqz, 3no6, 3no3, and 3on7 are T0521, T0520, T0525, T0516, T0570, and T0563. These CASP targets are also used in . To fit in the HP model, the CASP targets are converted to HP sequences based on the hydrophobic properties of the constituent amino acids. The lower bounds of the free energy values (in Column LB-FreeE of Table 1) are obtained from [8,9]; however, there are some unknown values (presented as ?) of lower bounds of free energy for large sequences.
Table 1. Experimental results of LS-Mem, LS-Tabu, and SS-Tabu
In Table 1, the Size column presents the number of amino acids in the sequences, and LB-FreeE column shows the known lower bounds of free energy for the corresponding protein sequences in Column ID. However, lower bound of free energy for protein 3on7 is unknown. The best and average free energy for three different algorithms are also present in the table. The bold-faced values indicate better performance in comparison to the other algorithms for corresponding proteins. The experimental results show that our SS-Tabu wins over LS-Mem and LS-Tabu over the 21 proteins with a significant margin on average search results.
The difficulty to improve energy level is increased as the predicted energy level approaches to the lower bound. For example, if the lower bound of free energy of a protein is -100, the efforts to improve energy level from -80 to -85 is much less than that to improve energy level from -95 to -100 though the change in energy is the same (-5). Relative Improvement (RI) explains how close our predicted results to the lower bound of free energy with respect to the energy obtained from the state-of-the-art approaches.
In Table 2, we present a comparison of improvements (%) on average conformation quality (in terms of free energy levels). We compare SS-Tabu (target) with LS-Tabu and LS-Mem (references). For each protein, the RI of the target (t) w.r.t. the reference (r) is calculated using the formula in Equation 5, where Et and Er denote the average energy values achieved by target and reference respectively, and E1 is the lower bound of free energy for the protein in the HP model. We present the relative improvements only for the proteins having known lower bound of free energy values. We test our new algorithm on 21 different proteins of various length. The bold-faced values are the minimum and the maximum improvements for the same column.
Table 2. Relative improvement by SS-Tabu w.r.t. LS-Mem and LS-Tabu
Improvement w.r.t. LS-Mem
The experimental results in Table 2, at column RI (relative improvement) under LS-Mem shows that our SS-Tabu is able to improve the search quality in terms of minimizing the free energy level over all 21 proteins. The relative improvements with respect to LS-Mem range from 12.20% to 37.68%.
Improvement w.r.t. LS-Tabu
The experimental results in Table 2, at column RI under LS-Tabu shows that our SS-Tabu is able to improve the search quality in terms of minimising the free energy level over all 21 proteins. The relative improvements with respect to LS-Tabu range from 21.95% to 75.00%.
Effectiveness of relay-restart
In Table 3, we present another set of experimental results to show the effectiveness of relay-restart in the spiral search framework. The results under the headings Target and Reference are obtained by running SS-Tabu respectively with and without relay-restart. The relative improvements on average search results are presented in the last column of the table. The relative improvements after including relay-restart in our SS-Tabu, are as minimum as 1.39% and as maximum as 23.08%.
Table 3. Effectiveness of relay-restart in SS-Tabu
In Figure 12, we show the best structures found by SS-Tabu, LS-Mem and LS-Tabu for protein R1. Each algorithm runs over a period of 5 hours to achieve the results.
Figure 12. Structure comparison. The 3D structures of protein R1 obtained by a) LS-Tabu, b) LS-Mem, and c) SS-Tabu.
We compare the search progresses of different variants of local search; LS-Tabu, LS-Mem, and SS-Tabu over time. Figure 13 shows the average energy values obtained with times by the algorithms for protein R1. We observe that all of the algorithms achieve very good progress initially, but with time increasing, our spiral search SS-Tabu makes more progress than LS-Tabu and LS-Mem.
Figure 13. Search progress. Search progress for protein R1 with time for different approaches. The results are calculated over 50 different runs with identical settings for each algorithm.
In this paper, we present a new spiral search (SS-Tabu) under the local search framework for simplified protein structure prediction on 3D face-centred-cubic lattice. We use a new search guiding heuristic, which is the distance of a hydrophobic amino acid from a dynamic hydrophobic-core centre. We also use a novel relay-restart technique to break the stagnation. We compare our results with two other local search algorithms: LS-Tabu and LS-Mem, which achieved the state-of-the-art results for similar models. We found that our SS-Tabu significantly outperforms both LS-Mem and LS-Tabu. We aim to apply our algorithm in high resolution protein structure prediction in future.
The authors declare that they have no competing interests.
MAR conceived the idea of Spiral Search algorithm. MAHN, MTH, SS, DNP, and AS helped MAR modeling, implementing, and testing the algorithm. All authors equally participated in analysing the test results to improve the algorithm and were significantly involved in the process of writing and reviewing the manuscript. SS also provided experimental data from his memory based local search (LS-Mem).
NICTA, the sponsor of the article for publication, is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 2, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S2 webcite.
We would like to express our great appreciation to the people managing the Cluster Computing Services at National ICT Australia (NICTA) and Griffith university. They helped a lot in preparing this article on time by taking care of our submitted jobs in clusters.
Shatabda S, Newton M, Pham DN, Sattar A: Memorybased local search for simplified protein structure prediction. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM; 2012::345-352.
Hoque MT, Chetty M, Sattar A: Protein folding prediction in 3D FCC HP lattice model using genetic algorithm. In IEEE Transactions on Computational Biology and Bioinformatics. Volume 2007. IEEE Congress on Evolutionary Computation; 2007::4138-4145.
Böckenhauer HJ, Ullah AZMD, Kapsokalivas L, Steinhöfel K: A local move set for protein folding in triangular lattice models. In WABI. Volume 5251. Lecture Notes in Computer Science, Springer; 2008::369-381.
Rashid MA, Shatabda S, Newton M, Hoque MT, Pham DN, Sattar A: Random-walk: a stagnation recovery technique for simplified protein structure prediction. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM; 2012::620-622.
ORSA Journal on Computing 1989, 1(3):190-206. Publisher Full Text
The Annals of Mathematics 2005, 162(3):1065-1185. Publisher Full Text
Macromolecules 1989, 22(10):3986-3997. Publisher Full Text
Unger R, Moult J: A genetic algorithm for 3D protein folding simulations. In Soft Computing - A Fusion of Foundations, Methodologies and Applications. Morgan Kaufmann Publishers, The 5th International Conference on Genetic Algorithms; 1993::581.
Physics of Life Reviews 2005, 2(4):353-373. Publisher Full Text
Patton AL, Punch WF III, Goodman ED: A standard GA approach to native protein conformation prediction. In IEEE Transaction on Evolutionary Computing. International Conference on Genetic Algorithms; 1995.
Neurocomputing 2010, 73(13-15):2308-2316. Publisher Full Text