JHK Co., Ltd., 2049 Heping Road, Shenzhen, Guangdong, 518010, China

Departments of Computing Science and Biological Sciences, 2–21 Athabasca Hall, University of Alberta, Edmonton, Alberta, T6G 2E8, Canada

Abstract

Background

Structure-based clustering is commonly used to identify correct protein folds among candidate folds (also called decoys).

Results

We propose a new scheme that performs rapid but incomplete clustering on protein decoys. Our method detects structurally similar decoys (measured using either C_{α} RMSD or GDT-TS score) and extracts representatives from them without assigning every decoy to a cluster. We integrated our new clustering strategy with several different scoring functions to assess both performance and speed in identifying correct or near-correct folds. Experimental results on 35 Rosetta decoy sets and 40 I-TASSER decoy sets show that our method can improve the correct fold detection rate as assessed by two different quality criteria. This improvement is significantly greater than that obtained with two recently published clustering methods, Durandal and Calibur-lite. Speed and efficiency testing shows that our method can handle much larger decoy sets and is up to 22 times faster than Durandal and Calibur-lite.

Conclusions

The new method, named HS-Forest, avoids the computationally expensive task of clustering every decoy, yet still allows superior correct-fold selection. Its improved speed, efficiency and decoy-selection performance should enable structure prediction researchers to work with larger decoy sets and significantly improve their protein structure predictions.

Background

Predicting the 3D structure of a protein from its sequence continues to be one of the most challenging, unsolved problems in biology. However, thanks to the development of programs such as Rosetta

Many studies

While clustering can be quite useful in both candidate fold selection and correct fold detection, its speed deteriorates rapidly as the size of the decoy set increases

To accelerate the clustering and selection steps on large decoy sets, several methods have been developed in recent years. These include SCUD

In contrast to SCUD, Calibur
uses alpha-carbon (C_{α}) RMSD properties and the triangle inequality in metric distances to reduce the number of structure comparisons. Extensive testing has shown that Calibur is faster than both the Rosetta clustering program and SPICKER (used in I-TASSER). This is especially true when the decoy set is larger than 4,000 structures. The quality of the candidates selected by Calibur has been found to be similar to that of SPICKER.

Durandal

With continuous algorithmic improvements over the last decade, many

In this work we describe a faster, more efficient and more accurate approach to detecting correct protein folds using a technique called partial clustering. Given that the ultimate goal of clustering in decoy selection is to return representatives of the decoy clusters, rather than the clusters themselves, we argue that partial clustering should be sufficient to solve the problem at hand. In particular, we have designed a partial clustering algorithm that does not need to define cluster membership for every decoy but is able to extract key representatives quickly. Our method also combines the speed of this fast clustering approach with a more intelligent approach to scoring or ranking candidate folds. In contrast to other quality assessment methods, ours uses the scoring energy to rank a small set of cluster centers instead of the whole decoy set. We have found that our clustering strategy can be applied to any scoring function to enhance the rate of correct fold detection. Tests conducted on decoy sets generated by Rosetta and I-TASSER show that our method selects a higher proportion of correct folds than energy functions alone or other fast (or traditional) structure clustering algorithms. Speed and efficiency testing also shows that our method is up to 22 times faster than Durandal (the fastest clustering method described to date), and that it is significantly more memory-efficient.

Methods

Our method, called HS-Forest, is based on the concept of partial clustering and then “intelligently” combining this partial clustering with energy function evaluation to detect the most correct fold from a given decoy set. The first step in the program involves using a novel structural clustering scheme to detect structurally “dense” regions in the given decoy set. These regions are thought to be local minima in the protein folding energy landscape according to the hypothesis that most candidate structures generated by a protein structure prediction program will tend to cluster near the correct fold if the structure generation program is reasonably good
By default, our method uses the C_{α} RMSD distance (the RMSD calculated using only C_{α} atoms). However, it is important to note that our method is applicable to both metric and non-metric distance functions. Once the representatives are extracted, we select the representative with the smallest total distance to the lowest-energy decoys as a candidate for the best decoy. In the clustering stage we introduce a random factor, described later, so that each clustering run will typically generate a different optimal candidate. We run the clustering process multiple times to generate a set of candidates. From this set of candidates, we return a consensus decoy that has the smallest total distance to all other candidates. The algorithm is outlined step-by-step below:

1. Select a given number of pivots randomly;

2. For each pivot create a hashing function;

3. Build the root node of a tree, which is the first leaf of that tree;

4. Randomly select a hashing function to split the leaves of the tree;

5. Go to Step 4 until the tree reaches a certain height as described below;

6. Determine the cluster nodes;

7. For each of the largest cluster nodes, select a representative;

8. Rank the representatives by their total distances to the top energy decoys. The top one is the candidate of the tree;

9. Go to Step 3 until a given number of trees are constructed;

10. Return the consensus of the candidates as the best decoy.
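As a concrete illustration, the ten steps above can be sketched in Python. Note that the hash design (splitting at the median distance to a pivot), the parameter defaults and the helper names here are our own assumptions for illustration, not the published implementation (which is written in C++):

```python
import random

def make_hash(pivot, decoys, dist):
    # Binary hash: is a decoy below/above the median distance to the pivot?
    # (The exact hash design is an assumption; the paper's form may differ.)
    ds = sorted(dist(x, pivot) for x in decoys)
    thresh = ds[len(ds) // 2]
    return lambda x: dist(x, pivot) <= thresh

def split(leaf, h):
    # Partition one leaf into the two buckets of a hashing function.
    return [[x for x in leaf if h(x)], [x for x in leaf if not h(x)]]

def representative(cluster, pivots, dist):
    # Step 7: decoy with the smallest total distance to the pivots.
    return min(cluster, key=lambda x: sum(dist(x, p) for p in pivots))

def hs_forest(decoys, dist, energy, n_pivots=5, h_max=3,
              n_trees=5, n_clusters=3, n_top=10):
    candidates = []
    for _ in range(n_trees):
        # Steps 1-2: random pivots, one hashing function per pivot.
        pivots = random.sample(decoys, min(n_pivots, len(decoys)))
        hashes = [make_hash(p, decoys, dist) for p in pivots]
        # Steps 3-5: repeatedly split leaves with randomly chosen hashes.
        leaves = [list(decoys)]
        for h in random.sample(hashes, min(h_max, len(hashes))):
            leaves = [part for leaf in leaves for part in split(leaf, h) if part]
        # Steps 6-7: treat the largest leaves as clusters, pick representatives.
        leaves.sort(key=len, reverse=True)
        reps = [representative(c, pivots, dist) for c in leaves[:n_clusters]]
        # Step 8: rank representatives by total distance to lowest-energy decoys.
        top = sorted(decoys, key=energy)[:n_top]
        candidates.append(min(reps, key=lambda r: sum(dist(r, t) for t in top)))
    # Step 10: consensus candidate with the smallest total distance to the rest.
    return min(candidates, key=lambda c: sum(dist(c, o) for o in candidates))
```

With a metric distance and a reasonable energy function, the consensus over repeated randomized trees tends to land in the densest low-energy region without ever clustering every decoy.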

To efficiently detect locally “dense” regions in a given decoy set we decided to use a recently developed database-searching technique called Locality-Sensitive Hashing

Our design of the hashing function is based on a well-known metric property
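One standard pivot-based construction consistent with this idea (the band width `w` and the exact form are our assumptions) hashes each structure by which distance band around a pivot it falls into. By the triangle inequality, d(x, y) ≥ |d(x, p) − d(y, p)|, so structures hashed to distant bands are guaranteed to be dissimilar:

```python
def pivot_hash(pivot, dist, w=1.0):
    """Hash a structure by its distance band from the pivot.

    Structures in the same band are *candidates* for similarity; structures
    whose bands are far apart must be at least that far apart themselves,
    by the triangle inequality. The band width w is an assumed parameter.
    """
    return lambda x: int(dist(x, pivot) // w)
```

Only one distance evaluation per structure is needed to hash it, which is what makes pivot-based hashing so much cheaper than all-vs-all comparison.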

To organize the hashing functions, our algorithm constructs a tree, denoted as an

The next step is to decide which nodes contain a dense region, otherwise known as a cluster. Ideally, a cluster is a region balanced with density and size (i.e., number of decoys). Since the density of a node is indicated by its height in an HS-Tree, the ideal clusters are in nodes balanced by both height and size. We define such a node as a

DEFINITION 1 (Cluster node). In an HS-Tree

Intuitively, as shown in Figure

An example of an HS-Tree

**An example of an HS-Tree.** Cluster nodes are nodes balanced with density (height) and size. Sub-trees under each cluster node are omitted.

LEMMA 1. In an HS-Tree

Having extracted the clusters from the non-redundant set of cluster nodes, the next step is to extract a representative from each cluster. At this stage, the challenge is two-fold: (1) even though each cluster is likely to consist of decoys structurally similar to each other, it is possible that some dissimilar decoys are mixed into the cluster; and (2) the size of clusters can vary significantly so that the traditional approach of computing a medoid with the minimal total distance to the decoys within a cluster can involve a quadratic number of distance computations. In our algorithm, we select a representative of a cluster by choosing the decoy with the smallest total distance to the set of pivots we picked when constructing the hashing functions. No additional distance computation is needed.
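Because the distances from every decoy to the pivots were already computed when the hashing functions were built, this representative-selection rule costs no extra distance evaluations. A minimal sketch (data-structure names are illustrative):

```python
def pick_representative(cluster, pivot_dists):
    """Choose the decoy with the smallest total distance to the pivots.

    pivot_dists maps each decoy to its list of distances to the pivots,
    computed once when the hashing functions were constructed, so no new
    distance computations are needed here.
    """
    return min(cluster, key=lambda x: sum(pivot_dists[x]))
```

This avoids the quadratic cost of computing a true medoid within each cluster.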

To select the best decoy from the pool of cluster representatives, we borrow an idea used by

Due to the random factor we introduced when constructing our HS-Tree, different HS-Trees can return different candidates for the best decoy. To compute a consensus, our algorithm generates multiple HS-Trees. First, a set of h_{max} hashing functions is randomly chosen to build an HS-Tree, so that the maximal height of the tree is h_{max}. This process is repeated until a given number (typically between 5 and 50) of HS-Trees has been constructed. Finally, our algorithm ranks the candidates from all the trees by their total distance to all the candidates, and returns the highest-ranked candidate.

In our implementation, when constructing HS-Trees and extracting cluster nodes, we use h_{max} in place of the actual tree height and re-run the program.

Results

We tested our method, HS-Forest, on 35 Rosetta decoy sets and 40 I-TASSER decoy sets, each containing at least one decoy with a C_{α} RMSD below 4 Å relative to the native (i.e. correct) structure. The exact sizes of the Rosetta and I-TASSER decoy sets are shown in Additional files

**Details of the Rosetta decoy sets.** The size of each decoy set is larger than 1000, and each decoy has the same length as the native structure in PDB.


**Details of the I-TASSER decoy sets.**


Speed and efficiency

To evaluate the speed and efficiency of HS-Forest, we tested it on both the Rosetta and I-TASSER decoy sets. Given that the results presented by

The first test set we investigated was the 1shfA set from the I-TASSER decoys. We varied the decoy set size from 1,000 to 20,000 by randomly sampling a portion of the entire decoy set. The runtimes for the three methods are shown in Figure

Runtime on the 1shfA decoy set

**Runtime on the 1shfA decoy set.** The size varies from 1,000 to 20,000 structures.

The second test set we studied was the 1bm8 set from the Rosetta decoys. This decoy set contains 64,307 decoys and poses a significant challenge to most clustering algorithms. As in the first experiment, we varied the decoy set size from 1,000 to 60,000. The runtimes for all three methods are shown in Figure

Runtime on the 1bm8 Rosetta decoy set

**Runtime on the 1bm8 Rosetta decoy set.** The size varies from 1,000 to 60,000 structures. Durandal runs out of memory when the size exceeds 40,000 proteins. Calibur-lite takes more than 2 hours to finish a run when the number of proteins exceeds 45,000.

Figures

HS-Forest is also very memory-efficient. While Durandal ran out of memory when tested on the 1bm8 set from the Rosetta decoys on our testing machine with 16 GB RAM, HS-Forest was able to finish analyzing all the decoys in this set with a maximal memory usage of around 3 GB.

The computational time complexity for HS-Forest can be calculated as follows: let the number of decoys be

Performance for correct fold detection

In this section, we will show that HS-Forest is able to improve the performance of energy functions in correct fold detection. We will also compare HS-Forest with two state-of-the-art decoy clustering programs: Durandal and Calibur-lite. Since both Rosetta and I-TASSER have their own clustering programs and since these clustering programs might have an advantage when tested on their own decoy sets, we also included these two clustering programs in our comparison. The results show that HS-Forest exhibits better performance than all the above-mentioned methods.

Measuring the performance for correct fold detection is challenging because the same method usually performs differently on different decoy sets. For example, a method can significantly outperform a standard energy function on one decoy set but consistently fail to outperform the same energy function on other decoy sets. To measure performance we adopted two different criteria. The first, denoted Criterion-1, measures the percentage of decoys in the set that the top decoy selected by a given method outperforms (i.e., that have a higher C_{α} RMSD to the native structure than the selected decoy). The second, denoted Criterion-2, counts the number of decoy sets for which the top decoy selected by a given method has a lower C_{α} RMSD than the top decoy selected by the energy function alone. Since both HS-Forest and Durandal contain random seeding functions, unless otherwise specified, we averaged their results over 50 runs.
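The two criteria can be made precise with a short sketch (function and variable names are ours, not the authors'):

```python
def criterion_1(selected, decoys, rmsd_to_native):
    """Percentage of decoys in the set that the selected decoy outperforms,
    i.e. that sit further (higher RMSD) from the native structure."""
    worse = sum(1 for d in decoys
                if rmsd_to_native[d] > rmsd_to_native[selected])
    return 100.0 * worse / len(decoys)

def criterion_2_hit(method_pick, energy_pick, rmsd_to_native):
    """True when the method's top decoy beats the energy-only top decoy."""
    return rmsd_to_native[method_pick] < rmsd_to_native[energy_pick]
```

Criterion-2 is then the count of decoy sets (out of 35 or 40) for which `criterion_2_hit` returns True.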

For our experiments on the Rosetta decoys, we applied HS-Forest to enhance the decoy selection using different energy functions, including the Rosetta energy values (that came with the decoys), the GaFolder pseudo-energy values, and a Consensus score, defined as a decoy's total C_{α} RMSD distance to a number of reference structures. To calculate the Consensus score, we chose the same number of reference structures as the number of pivots used in HS-Forest. When HS-Forest was combined with an energy function, we used that function to select the 10 lowest-energy decoys in HS-Forest. The results are shown in Table

**Energy/Program**

**Method**

**Criterion-1 (%)**

**Criterion-2 (/35) ± Std Dev.**

Criterion-1 and Criterion-2 are defined in the manuscript. Criterion-1 measures the percentage of decoys that the top-scoring decoy outperformed; Criterion-2 measures the frequency with which the top decoy, selected by a given method, has a lower C_{α} RMSD than the one selected by the energy alone. The values behind ± are the standard deviations among different runs.

Rosetta

Energy only

80.0

NA (baseline)

with HS-Forest

85.2

24.7 ± 1.8

GaFolder

Energy only

71.8

15.0

with HS-Forest

80.4

20.0 ± 1.2

Consensus

Energy only

81.2

17.0

with HS-Forest

83.3

18.0 ± 0.7

Durandal

82.5

16.0 ± 1.8

Calibur-lite

78.9

13.6 ± 0.7

Rosetta

Clustering

80.0

20

On the Rosetta decoys, we also compared HS-Forest with Durandal, Calibur-lite, and the clustering program included in the recently released Rosetta 3.4. When testing on the largest decoy set, 1bm8, Durandal ran out of memory and Calibur-lite took more than 2 hours to finish. Consequently, we used the largest sample that Durandal and Calibur-lite could handle on the 1bm8 set to calculate their performance for this particular decoy set (since Calibur-lite was much slower than the other methods on large decoy sets, we repeated its runs only 10 times). The average result for Durandal using Criterion-1 is 82.5%, which is better than all of the energy-only methods but is still 2.7% less than what the Rosetta energy function combined with HS-Forest achieved. For Criterion-2, the average result for Durandal was 16.0 out of 35 decoy sets (46%) versus 24.7 (71%) for Rosetta with HS-Forest. This result for Durandal is somewhat worse than what HS-Forest achieved even when it used the GaFolder and Consensus energy functions. Similar to Durandal, the performance of Calibur-lite and the Rosetta clustering program in Table

The results on the I-TASSER decoy sets are shown in Table

**Energy/Program**

**Method**

**Criterion-1 (%)**

**Criterion-2 (/40) ± Std Dev.**

Criterion-1 measures the percentage of decoys that the top scoring decoy outperformed; Criterion-2 measures the frequency with which the top decoy selected by a given method has a lower C_{α} RMSD than the one selected based on energy alone. The values behind ± are the standard deviations among different runs.

I-TASSER

Energy only

59.7

NA (baseline)

with HS-Forest

68.4

27.2 ± 1.9

GaFolder

Energy only

57.1

19

with HS-Forest

75.3

27.3 ± 1.8

Consensus

Energy only

67.4

22

with HS-Forest

72.1

26.5 ± 1.2

Durandal

72.0

25.7 ± 1.2

Calibur-lite

74.1

26.5 ± 0.5

SPICKER 2.0

72.6

25.0

Because the performance of HS-Forest on the I-TASSER data appears to be modestly better than the performance of Durandal and Calibur-lite, we performed an unpaired Student’s t-test to assess the statistical significance
For Durandal, the p-values are on the order of 10^{-13} and 6.1x10^{-7} for Criterion-1 and Criterion-2, respectively; for Calibur-lite the p-values are 3.4x10^{-4} and 6.3x10^{-3} for Criterion-1 and Criterion-2, respectively. Using the standard p<0.05 criterion for statistical significance, these differences are statistically significant. We also calculated the standard deviation of the C_{α} RMSD values, relative to the native structure, among different runs for each decoy set. These are found in Additional files
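For reference, the unpaired (pooled-variance) Student's t statistic used in such a comparison can be computed as follows; this is a generic sketch, not the authors' analysis script:

```python
from statistics import mean, variance

def unpaired_t(a, b):
    """Pooled-variance Student's t statistic for two independent samples.

    The p-value is then obtained from the t distribution with
    len(a) + len(b) - 2 degrees of freedom.
    """
    na, nb = len(a), len(b)
    # Pooled sample variance (assumes roughly equal variances).
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5
```

In practice one would use a library routine (e.g. a two-sample t-test in a statistics package) rather than re-deriving the p-value by hand.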

We also tested HS-Forest using another distance metric known as the GDT-TS distance
As with the C_{α} RMSD experiments, we calculated the GDT-TS scores using TMscore for the top decoys selected by Durandal and Calibur-lite. The following results were averaged over 10 runs. On the Rosetta decoys, the average performance of the Rosetta energy with HS-Forest for Criterion-1 and Criterion-2 was 81.9% and 22.0/35, respectively. These values are better than those achieved with Durandal (77.0% and 18.8/35) and Calibur-lite (71.2% and 14.3/35). On the I-TASSER decoys, the average performance using the GaFolder energy with HS-Forest for Criterion-1 and Criterion-2 was 75.2% and 27.5/40, respectively, which is also better than that of Durandal (72.9% and 25.7/40) and Calibur-lite (73.9% and 25.9/40).
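GDT-TS averages, over the 1, 2, 4 and 8 Å cutoffs, the fraction of residues whose C_{α} position falls within each cutoff of the native structure. A simplified sketch that takes precomputed per-residue deviations (the search over alternative superpositions performed by TMscore is omitted here):

```python
def gdt_ts(residue_deviations):
    """GDT-TS (in %) from per-residue C-alpha deviations in Angstroms,
    computed after a single fixed superposition. Real GDT implementations
    maximize each fraction over many superpositions, so this is a
    simplified lower-bound sketch."""
    n = len(residue_deviations)
    fracs = [sum(d <= cut for d in residue_deviations) / n
             for cut in (1, 2, 4, 8)]
    return 100.0 * sum(fracs) / 4
```

Unlike RMSD, which a few badly placed loop residues can dominate, GDT-TS rewards the fraction of the model that is locally correct.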

Discussion

Parameter settings

HS-Forest contains several parameters that can be adjusted, including the number of pivots P, the maximal tree height h_{max}, the number of trees T, the number of largest clusters S, and the number of top-energy decoys E. We set h_{max} =

If the h_{max} values are too large, inequality (1) in Definition 1 becomes less effective at extracting clusters. Based on these data we recommend a

Criterion-2 performance (in %) of I-TASSER with HS-Forest, for the number of pivots

**Criterion-2 performance (in %) of I-TASSER with HS-Forest, for the number of pivots P.**

Figure

Criterion-2 performance (in %) of I-TASSER with HS-Forest, for the number of trees

**Criterion-2 performance (in %) of I-TASSER with HS-Forest, for the number of trees T.**

Criterion-2 performance (in %) of I-TASSER with HS-Forest, for the number of largest clusters

**Criterion-2 performance (in %) of I-TASSER with HS-Forest, for the number of largest clusters S.**

Criterion-2 performance (in %) of I-TASSER with HS-Forest, for the number of top decoys

**Criterion-2 performance (in %) of I-TASSER with HS-Forest, for the number of top decoys E.**

For all the experiments presented in the Results section, the parameters were fixed using the I-TASSER data with the C_{α} RMSD distance, and then tested on the leave-out Rosetta data set. As seen in the last section, the performance of HS-Forest remained superior even on the leave-out data. To further validate this result, we also tested the same parameters on the I-TASSER and Rosetta data using the GDT-TS distance, which served as another leave-out testing set. Again, our results show that HS-Forest's performance was not compromised. The superior performance of HS-Forest on all leave-out sets, as well as the data shown in Figures

The role of clustering

Our method combines the power of partial clustering with the use of energy functions to help detect correct folds. It is interesting to see how clustering contributes to the improved performance of HS-Forest. To explore this further, we tested the Rosetta decoys by simply ranking the decoys by the total C_{α} RMSD distance to the 10 lowest energy decoys. When using the 10 lowest Rosetta energy decoys, the performance is 81.8% for Criterion-1. Compared with the results in Table
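The clustering-free baseline described above can be written directly (names and the stable tie-breaking are our own):

```python
def rank_by_distance_to_low_energy(decoys, energy, dist, k=10):
    """Clustering-free baseline: score each decoy by its total distance
    to the k lowest-energy decoys, then rank in ascending order of that
    total (smallest total distance first)."""
    anchors = sorted(decoys, key=energy)[:k]
    return sorted(decoys, key=lambda x: sum(dist(x, a) for a in anchors))
```

Comparing this ranking against the full HS-Forest pipeline isolates how much of the improvement comes from the partial clustering step itself rather than from the low-energy anchoring.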

Essentially, the clustering step in HS-Forest narrows the collection of decoy candidates to a smaller set of decoys that are close to the cluster centers. As hypothesized by

On the other hand, clustering on its own may also have problems discerning the correct fold in certain situations, particularly when there are multiple clusters of nearly equal size. In these cases, choosing the wrong cluster could lead to a very poor result. This is where other information, such as the energy score ranking, can help. It is for these reasons that HS-Forest combines both structure clustering and energy score information to detect correct folds.

Improvements to energy functions

Our results on the Rosetta and I-TASSER decoys show that HS-Forest is more effective at detecting correct folds for certain energy functions than for others. As seen in Table
For two of the decoy sets, the C_{α} RMSD to the native structure drops from 12.6 to 3.9 Å and from 7.7 to 4.0 Å, respectively.

Comparison of C_{α} RMSD to the native structure for the selected decoys on the I-TASSER decoy sets before and after applying HS-Forest to GaFolder energy, sorted by the C_{α} RMSD difference

**Comparison of C**_{α }**RMSD to the native structure for the selected decoys on the I-TASSER decoy sets before and after applying HS-Forest to GaFolder energy, sorted by the C**_{α }**RMSD difference.** HS-Forest results are averaged over 50 runs.

The significant improvements seen with GaFolder indicate that our HS-Forest concept may open the door for computational biologists to reconsider using some previously discarded heuristic energy functions in structure prediction and refinement. In some cases, the inferior overall performance of an energy function may be due to very poor performance on a small subset of proteins. Such weaknesses could be corrected by using HS-Forest as part of the energy function or as part of the evaluation criteria.

This result also illustrates the advantage of HS-Forest over traditional clustering algorithms in that it is able to combine different energy functions and adapt to different types of decoy sets. Decoys generated by different structure prediction programs can often have very different properties. Therefore it is important to be able to use different energy functions to detect the most correct folds. While traditional clustering algorithms only rely on structure comparisons within a given decoy set to select correct folds, HS-Forest combines efficient structure clustering with different energy functions to generate better decoy selections.

Performance on small decoy sets

To test the performance of HS-Forest on small decoy sets, we downloaded such a set from the I-TASSER website

Drawbacks of HS-Forest and decoy clustering programs

HS-Forest is not without some limitations. First of all, HS-Forest may not work particularly well with poor quality decoy sets. Following the previous work of Durandal, we limited our testing data to those decoy sets with at least one decoy being less than 4 Å C_{α} RMSD away from the native structure. To see how well HS-Forest would perform on decoy sets that do not satisfy this condition, we tested HS-Forest on 16 I-TASSER decoy sets from

A second shortcoming with HS-Forest is that the use of a random factor leads to a non-deterministic output. The speed gains with HS-Forest rely primarily on randomized hashing and the fact that it does not perform a full clustering. These changes make its clustering results a little less stable than other clustering methods. To reduce this effect, HS-Forest creates multiple trees and computes the consensus. As we can see in Tables

Conclusion

In this work, we have proposed a novel partial clustering scheme for decoy selection in protein structure prediction. This method, called HS-Forest, avoids the computationally expensive task of clustering every decoy, yet still allows superior correct-fold selection. The basic idea behind HS-Forest is to take advantage of Locality-Sensitive Hashing and the generation of multiple, independent trees to create a consensus result. Our method is able to adapt to different decoy sets by utilizing decoy-specific energy functions to detect correct protein folds. Extensive tests on both Rosetta and I-TASSER decoy sets show that our method is up to 22 times faster than two recently published clustering/decoy-selection methods, Durandal and Calibur-lite. Our method also achieves better accuracy using both C_{α} RMSD and GDT-TS distance metrics for two different decoy sets.

While no clustering method or scoring function has yet been developed that can consistently identify the most correct structure among large decoy sets, we believe HS-Forest is a step in the right direction. We hope this idea can inspire the development of even better methods for correct fold detection, and that this concept of partial clustering may be seen to have applications to other scientific fields facing similar clustering challenges.

Availability and requirements

**Project name**: HS-Forest

**Project homepage**:

**Operating System**: Tested on Linux.

**Programming Language**: C++.

**Other requirements**: None.

**License**: GNU General Public License

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

All authors jointly developed the methods and wrote the article. They read and approved the final manuscript.

Acknowledgements

We are very grateful to Dr. David Baker and his group for providing the Rosetta decoy sets and for their helpful suggestions. HS-Forest uses portions of the C_{α} RMSD and GDT-TS code implemented in Durandal, Calibur and TMscore.

Funding

This work is supported by the Alberta Prion Research Institute (APRI) and Alberta Innovates BioSolutions.