Mathematical and Computer Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955, KSA

Abstract

Background

The protein side-chain packing problem has remained one of the key open problems in bioinformatics. The three main components of protein side-chain prediction methods are a rotamer library, an energy function and a search algorithm. Rotamer libraries quantitatively summarize the existing knowledge from experimentally determined structures. Depending on how much contextual information is encoded, there are backbone-independent rotamer libraries and backbone-dependent rotamer libraries. Backbone-independent libraries encode only sequential information, whereas backbone-dependent libraries encode both sequential and local structural information. However, side-chain conformations are determined by spatially local information rather than sequentially local information. Since in the side-chain prediction problem the backbone structure is given, spatially local information should ideally be encoded into the rotamer libraries.

Methods

In this paper, we propose a new type of backbone-dependent rotamer library, which encodes structural information of all the spatially neighboring residues. We call these protein-dependent rotamer libraries. Given any rotamer library and a protein backbone structure, we first model the protein structure as a Markov random field. The marginal distributions are then estimated by inference algorithms, without global optimization or search. The rotamers from the given library are finally re-ranked and associated with the updated probabilities.

Results

Experimental results demonstrate that the proposed protein-dependent libraries significantly outperform the widely used backbone-dependent libraries in terms of the side-chain prediction accuracy and the rotamer ranking ability. Furthermore, without global optimization/search, the side-chain prediction power of the protein-dependent library is still comparable to the global-search-based side-chain prediction methods.

Background

Protein molecules are indispensable in most cellular functions, such as metabolism, gene regulation, signal transduction, and cell cycle control. This functional diversity arises mainly from their structures. Therefore, predicting protein structures accurately is important for both function determination and protein design.

Side-chain prediction

A protein structure contains both the backbone structure and the side-chain structure. Protein structures are typically represented in either coordinate space or angular space. Based on the assumption that the lengths of the covalent bonds are approximately constant, protein structures are usually modeled in angular space, which reduces the number of variables by about one third. The dihedral angles can be calculated from the coordinates and define the corresponding twists of the protein's backbone as well as of its side chains. There are three backbone dihedral angles, namely $\phi$, $\psi$ and $\omega$, and up to four side-chain dihedral angles, namely $\chi_1$, $\chi_2$, $\chi_3$ and $\chi_4$. Figure 1 illustrates these dihedral angles. Frequently observed combinations of the side-chain dihedral angles are called **rotamers**.

Protein dihedral angle

**Protein dihedral angle.** This figure illustrates the different protein dihedral angles. $\chi_1$, $\chi_2$ and $\chi_3$ denote side-chain dihedral angles.

Due to the difficulty of predicting complete protein structures simultaneously, structure determination remains a multi-phase task, with sub-tasks including backbone prediction, side-chain prediction, loop modeling, and refinement. In this paper, we focus on predicting the side-chain conformation for a given backbone structure, i.e., the protein side-chain prediction problem. Using the concept of rotamers, this is essentially the problem of assigning a correct rotamer to every amino acid so that the overall structure is thermodynamically stable. It is assumed that stability corresponds to low internal energy states. That is why side-chain prediction is traditionally cast as an optimization problem that seeks a rotamer assignment minimizing the total internal energy of the protein molecule. Since in most cases rotamers take discrete values, the problem has been reduced to a combinatorial search problem in previous work.

To solve an optimization problem, two components are needed: the objective function to be maximized/minimized and the search strategy that looks for the global maximum/minimum. In side-chain prediction, the rotamer solution space is exponential in the size of the protein, and the objective function, an energy function in this case, has numerous local minima. This combination forces practitioners to prioritize the candidate rotamers in order to design a practical search strategy, which is where rotamer libraries come into play. In the past three decades there have been many studies in each direction, and different kinds of energy functions have been tried and developed.

Rotamer library

Rotamer libraries provide, for each amino acid, a list of candidate rotamers together with their prior probabilities. In this work, we re-estimate these probabilities for a specific protein, yielding what we call a **protein-dependent rotamer library**, without global optimization or search. To the best of our knowledge, this is a novel idea in the domain of rotamer libraries. In a traditional backbone-dependent rotamer library, the probability of a certain rotamer of a certain amino acid depends only on the local backbone conformation.

Markov random field model

Given a backbone-dependent rotamer library, e.g., Dunbrack's libraries published in 2002 or 2010, and the backbone structure of a query protein, we first model the backbone and side-chain structures of the protein as a Markov random field (MRF), where the residues are modeled as vertices of the interaction graph. We then employ widely used energy functions, e.g., the Scwrl3 energy function, to set the potentials of the MRF.

One thing to notice is that modeling protein structures using probabilistic graphical models is not new.

Another thing to notice is that our protein-dependent rotamer library computes the joint marginal distribution over all the side-chain torsion angles (up to four) of a specific residue position, rather than considering them independently. This makes sense due to the high correlation between the torsion angles belonging to the same amino acid.

Contributions

Our contributions can be summarized as follows:

1. We introduce the idea of the protein-dependent rotamer library and show the superiority of this library with respect to the widely used backbone-dependent rotamer libraries.

2. We model the protein structure as an MRF, encode the Scwrl3 energy function, and compare different sum-product BP algorithms to re-rank the rotamers. Our method does not involve a learning process and is therefore more likely to perform consistently well on other data sets and other energy functions.

3. The proposed protein-dependent rotamer library can easily be used as a side-chain predictor if we threshold each marginal distribution to its most probable rotamer. We compare our library with the most widely used side-chain predictors.

Methods

We use the backbone structure of a protein in PDB format as input. The output is a rotamer library with a format similar to that of Dunbrack's library.

When a protein backbone conformation is given, our method constructs an interaction graph where each residue is a vertex. We place an edge between a pair of residues if at least one pair of atoms from the two residues is closer than a distance threshold. After that we set up the energy potentials for each vertex as well as each edge. Using these potentials, an inference algorithm is applied to calculate the marginal distributions of rotamer choices.

Our discussion of methods can be logically split into the following three phases:

1. Creating the interaction graph

2. Setting up energy potentials

3. Inferring marginal distributions

Creating the interaction graph

From the coordinates of the backbone atoms we create an interaction graph for the given protein. For every amino acid in the protein a vertex is added. We join each residue pair with an edge if the distance between any possible pair of their side-chain atoms, placed according to the candidate rotamers, falls below a threshold; the candidate atom positions are derived from the backbone $C^{\alpha}$ and $C^{\beta}$ atoms and the dihedral angles $\chi_1$, $\chi_2$, $\chi_3$ and $\chi_4$, respectively.
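As a concrete illustration, the graph construction described above can be sketched as follows. This is a minimal sketch, not the paper's implementation: representing residues as plain lists of 3-D atom coordinates and the 4.5 Å threshold are our own illustrative assumptions.

```python
from itertools import combinations

def build_interaction_graph(residues, threshold=4.5):
    """Build the interaction graph: one vertex per residue, and an edge
    whenever any pair of atoms from two residues is closer than `threshold`
    (in Angstroms). `residues` is a list of lists of (x, y, z) tuples."""
    def min_dist(a_atoms, b_atoms):
        # Smallest Euclidean distance over all atom pairs of two residues.
        return min(
            sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
            for a in a_atoms for b in b_atoms
        )
    vertices = list(range(len(residues)))
    edges = [
        (i, j) for i, j in combinations(vertices, 2)
        if min_dist(residues[i], residues[j]) < threshold
    ]
    return vertices, edges

# Toy example: three "residues", each a list of 3-D atom coordinates.
residues = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],  # residue 0
    [(3.0, 0.0, 0.0)],                   # residue 1: 1.5 A from residue 0
    [(20.0, 0.0, 0.0)],                  # residue 2: far from both
]
V, E = build_interaction_graph(residues)
```

In a real setting the atom coordinates would come from the PDB file, and the threshold would depend on the largest possible extent of the candidate rotamers.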

Interaction graph for residue chain

**Interaction graph for residue chain.** This figure gives an example of the interaction graph for a protein sequence of seven amino acids. Vertex $v_i$ represents residue $i$; besides sequential neighbors, spatially close residue pairs are also connected, e.g., $(v_2, v_4)$ and $(v_4, v_6)$.

Setting potentials

After creating the interaction graph, we calculate the energy potentials for the vertices and the edges. In an MRF, potential functions are a measure of the likelihood of configurations of the random variables. We denote the entire protein structure by a set of random variables $X = \{X_b, X_s\}$, where $X_b$ describes the backbone conformation and $X_s$ describes the side-chain conformation, i.e., for each residue the dihedral angles $\chi_1$, $\chi_2$, $\chi_3$, and $\chi_4$. Since the backbone $X_b$ is given, we need to approximate the marginal probabilities of $X_s$ conditioned on $X_b$. Over the interaction graph, this conditional distribution factorizes as

$$p(X_s \mid X_b) = \frac{1}{Z} \prod_{i} \Phi_i(X_{s_i}) \prod_{(i,j) \in E} \Psi_{ij}(X_{s_i}, X_{s_j}),$$

where $\Phi_i$ is the vertex potential of residue $i$, $\Psi_{ij}$ is the edge potential of the residue pair $(i, j)$, and $Z$ is the normalization constant.

Markov random field for the interaction graph in Figure 2

**Markov random field for the interaction graph in Figure 2.** $X_{b_i}$ and $X_{s_i}$ denote the backbone and side-chain random variables of residue $i$, respectively.

Here the vertex potential $\Phi_i$ captures the preference of residue $i$ for each rotamer and its interaction with the backbone, whereas the edge potential $\Psi_{i,j}$ is contributed by the interactions among the side-chain atoms of a certain residue pair $(i, j)$. Following the Boltzmann distribution, the vertex potential of residue $i$ is defined as

$$\Phi_i(x_i) = \exp\!\left(-\frac{E_i(x_i)}{k_B T}\right).$$

Similarly, the edge potential for a pair of vertices $(i, j)$ is

$$\Psi_{ij}(x_i, x_j) = \exp\!\left(-\frac{E_{ij}(x_i, x_j)}{k_B T}\right).$$

Here $k_B$ is the Boltzmann constant, $T$ is the absolute temperature, and $E_i$ and $E_{ij}$ are the vertex and edge energies, respectively. Following the Scwrl3 energy function, the vertex energy combines the library probability of rotamer $x_i$ with the clashes between the side-chain atoms of residue $i$ and the backbone:

$$E_i(x_i) = -K \ln \frac{p_i(x_i)}{p_{i\max}} + \sum_{m} E(d_m),$$

where $p_i(x_i)$ is the probability of rotamer $x_i$ in the input library, $p_{i\max}$ is the probability of the most probable rotamer of residue $i$, $K$ is a scaling constant, and the sum runs over the distances $d_m$ between the side-chain atoms of residue $i$ and the nearby backbone atoms. The clash energy for two atoms $a$ and $b$ at distance $d_{ab}$ is a piecewise linear repulsion:

$$E(a, b) = \begin{cases} 0 & d_{ab} \ge R_{ab}, \\ 10 & d_{ab} \le 0.8254\, R_{ab}, \\ 57.273\,(1 - d_{ab}/R_{ab}) & \text{otherwise}, \end{cases}$$

where $R_{ab} = R_a + R_b$ is the sum of the hard-sphere radii of atoms $a$ and $b$. The edge energy $E_{ij}(x_i, x_j)$ sums this atomic energy over all pairs $(a_{im}, a_{jn})$, where $a_{im}$ is the $m$-th side-chain atom of residue $i$ and $a_{jn}$ is the $n$-th side-chain atom of residue $j$, together with a hydrogen-bond term $E_{hb}$ between residues $i$ and $j$ where applicable.
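The Scwrl3-style piecewise linear clash term can be sketched in code. This is a hedged sketch: the constants 0.8254 and 57.273 follow the published Scwrl3 energy function, while the uniform 1.6 Å radius and the function names are placeholder assumptions of ours.

```python
def atom_pair_energy(d, r_a, r_b):
    """Piecewise linear steric repulsion between two atoms at distance d (A),
    with hard-sphere radii r_a and r_b (Scwrl3-style constants)."""
    r_ab = r_a + r_b                      # contact distance
    if d >= r_ab:
        return 0.0                        # no contact, no penalty
    if d <= 0.8254 * r_ab:
        return 10.0                       # deep clash, capped penalty
    return 57.273 * (1.0 - d / r_ab)      # linear ramp in between

def edge_energy(atoms_i, atoms_j, radius=1.6):
    """Edge energy: sum of atomic repulsions over all atom pairs of two
    residues' side chains (radius is a single placeholder value here)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return sum(
        atom_pair_energy(dist(a, b), radius, radius)
        for a in atoms_i for b in atoms_j
    )
```

In practice each atom type would carry its own radius, and a hydrogen-bond term would be added where applicable.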

Inferring marginal distributions

After assigning all the vertex and edge potentials, the interaction graph becomes an MRF. To re-rank the rotamer choices of each side chain in this MRF, marginal distributions need to be computed. We employ different inference algorithms, namely loopy belief propagation (LBP), generalized belief propagation (GBP) with a region graph, mean field approximation (MF) and tree re-weighted belief propagation (TRBP). Among them, LBP performs better than the others, as we will show in the Results section. We give a brief description of each of them in the following.

Loopy belief propagation

In LBP, we initialize the vertices with some random marginal distributions called beliefs. In each iteration, depending on the potential function and the messages passed by its neighbors, every vertex updates its belief, which is assumed to be an approximation of the marginal distribution of rotamer choices for this vertex. After updating its belief, the vertex forms a new message for each of its neighbors and passes them accordingly. This procedure is repeated by every vertex at each iteration. For connected acyclic graphs it gives the exact marginal distributions of the random variables associated with the vertices. For graphs with loops it gives a good estimate when the procedure converges. We set a maximum of 100 iterations to detect whether it converges or not; if two successive iterations do not differ by more than a threshold in their beliefs, the algorithm is considered converged. For scheduling we use asynchronous updates. The calculated belief, i.e., the estimated marginal distribution for a vertex $v_i$, is

$$b_i(x_i) \propto \Phi_i(x_i) \prod_{k \in N(i)} m_{ki}(x_i),$$

where $N(i)$ denotes the set of neighbors of $v_i$ and $m_{ki}$ is the message sent from $v_k$ to $v_i$.

The message update rule is defined by the following equation:

$$m_{ij}(x_j) \propto \sum_{x_i} \Phi_i(x_i)\, \Psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus \{j\}} m_{ki}(x_i).$$

The first equation intuitively captures the marginal likelihood by combining the local potential of a vertex with the incoming messages sent by all of its neighbors. From this information a vertex can calculate new outgoing messages, each of which estimates the marginal distribution of a destination neighbor by combining the potential of the source vertex with the messages from all of its neighbors except the destination. Specifically, $m_{ij}(x_j)$ is the message that vertex $v_i$ sends to vertex $v_j$ about state $x_j$, obtained by summing over all states $x_i$ of $v_i$.
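The belief and message updates can be sketched for a generic pairwise MRF as follows. This is a simplified illustrative sketch (dictionary-based data layout, uniform message initialization, and function names are our assumptions; the paper's implementation details may differ).

```python
import numpy as np

def loopy_bp(phi, psi, edges, n_iter=100, tol=1e-6):
    """Sum-product loopy belief propagation on a pairwise MRF.

    phi:   dict vertex -> 1-D array of vertex potentials
    psi:   dict (i, j) -> 2-D array, psi[(i, j)][xi, xj] is the edge potential
    edges: list of undirected edges (i, j)
    """
    nbrs = {i: set() for i in phi}
    pot = {}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
        pot[(i, j)] = psi[(i, j)]       # oriented copies of each edge potential
        pot[(j, i)] = psi[(i, j)].T
    # m[(i, j)] is the message from vertex i to vertex j, initialized uniform.
    m = {(i, j): np.ones(len(phi[j])) / len(phi[j]) for i in phi for j in nbrs[i]}
    for _ in range(n_iter):
        delta = 0.0
        for i in phi:                   # asynchronous update schedule
            for j in nbrs[i]:
                others = [m[(k, i)] for k in nbrs[i] if k != j]
                incoming = np.prod(others, axis=0) if others else np.ones(len(phi[i]))
                msg = pot[(i, j)].T @ (phi[i] * incoming)   # sum over x_i
                msg /= msg.sum()
                delta = max(delta, float(np.abs(msg - m[(i, j)]).max()))
                m[(i, j)] = msg
        if delta < tol:                 # beliefs stopped changing: converged
            break
    beliefs = {}
    for i in phi:
        msgs = [m[(k, i)] for k in nbrs[i]]
        b = phi[i] * (np.prod(msgs, axis=0) if msgs else 1.0)
        beliefs[i] = b / b.sum()
    return beliefs
```

On a tree-structured graph (as in the two-node example below) this reproduces the exact marginals; on loopy graphs it gives the approximation discussed in the text.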

Other inference algorithms

**Generalized belief propagation** is a family of approximate inference algorithms that divide the original graph into several regions to decrease the computational complexity. The belief expression and message update rule remain the same, with one subtle difference: due to the division into regions, one node can occur in multiple regions, so we need to set weights for the contributions of these border nodes to the different regions so that their overall contributions remain correct.

**Mean field approximation** tries to approximate the overall joint probability distribution by a product of independent marginals. It does not explicitly pass any messages; instead, at each iteration it updates the beliefs with the following equation:

$$b_i(x_i) \propto \exp\!\left(-\frac{1}{k_B T}\Big(E_i(x_i) + \sum_{j \in N(i)} \sum_{x_j} b_j(x_j)\, E_{ij}(x_i, x_j)\Big)\right).$$
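A rough sketch of this fixed-point update follows, with $k_B T$ absorbed into the energies; the dictionary-based layout and names are illustrative assumptions of ours.

```python
import numpy as np

def mean_field(node_energy, edge_energy, edges, n_iter=50):
    """Naive mean-field approximation: each belief is refreshed from the
    node energy plus the expected edge energy under the neighbors' beliefs."""
    nbrs = {i: [] for i in node_energy}
    E = {}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
        E[(i, j)] = edge_energy[(i, j)]    # E[(i, j)][xi, xj]
        E[(j, i)] = edge_energy[(i, j)].T
    # Start from uniform beliefs.
    b = {i: np.ones_like(e) / len(e) for i, e in node_energy.items()}
    for _ in range(n_iter):
        for i in node_energy:
            # Effective field: own energy + expected pairwise energies.
            field = node_energy[i] + sum(E[(i, j)] @ b[j] for j in nbrs[i])
            w = np.exp(-field)
            b[i] = w / w.sum()
    return b
```

For an isolated node this reduces to a plain Boltzmann distribution over its own energies.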

In **tree re-weighted belief propagation**, the regular loopy belief propagation is augmented with another set of constants called edge appearance probabilities $\rho_{ij}$, i.e., the probability that the edge $(v_i, v_j)$ appears in a spanning tree drawn from a chosen distribution over spanning trees of the graph.

The message update rule can be written as the following equation:

$$m_{ij}(x_j) \propto \sum_{x_i} \Phi_i(x_i)\, \Psi_{ij}(x_i, x_j)^{1/\rho_{ij}}\, \frac{\prod_{k \in N(i) \setminus \{j\}} m_{ki}(x_i)^{\rho_{ki}}}{m_{ji}(x_i)^{\,1 - \rho_{ji}}}.$$

After computing the marginal distribution of the side-chain conformation for every vertex, the rotamers in the input rotamer library are re-ranked for each side chain. We create a protein-dependent rotamer library with the same structure as the input backbone-dependent rotamer library, so that it can be used by other global optimization algorithms.

Dataset and software

To show the efficacy of our idea, we use the same data set of 379 proteins used in the Scwrl4 benchmark.

Results

In this section, we evaluate the performance of our proposed protein-dependent rotamer library. First, we compare the side-chain packing power of our library with that of the widely used backbone-dependent libraries. We then evaluate the rotamer ranking ability, and finally compare the different inference algorithms.

To calculate the accuracy of a rotamer choice, the most widely used criterion is adopted: if the mean dihedral angle of this rotamer is within 40 degrees of the actual dihedral angle, the rotamer is considered correct; otherwise, it is considered wrong. For $\chi_{1+2}$ to be correct, both $\chi_1$ and $\chi_2$ have to be correct. We judge the correctness of $\chi_{1+2+3}$ and $\chi_{1+2+3+4}$ similarly.
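The 40-degree criterion can be implemented with a periodicity-aware angle difference; this is a small sketch and the function names are ours.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two dihedral angles in degrees,
    accounting for periodicity (e.g. -179 and 179 differ by 2, not 358)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def rotamer_correct(pred_chis, true_chis, cutoff=40.0):
    """A rotamer is correct up to chi_k only if every chi_1..chi_k is within
    `cutoff` degrees of the corresponding native angle."""
    return all(angle_diff(p, t) <= cutoff for p, t in zip(pred_chis, true_chis))
```

For example, a prediction of (60, 120) against a native (65, 170) is counted wrong, since the second angle is off by 50 degrees even though the first is within tolerance.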

Performance on side-chain prediction

We first evaluate the side-chain packing power of our protein-dependent rotamer library. As input, we choose the widely used backbone-dependent rotamer libraries proposed by Dunbrack's lab in 2002 and 2010.

In this experiment, we use LBP as the inference algorithm because, as we will show later in this section, LBP outperforms the other three inference algorithms. Similar conclusions can be drawn if the other inference algorithms are used.

Table 1 reports the side-chain prediction accuracy of the four rotamer libraries.

Comparison of rotamer libraries for side-chain prediction

| **Amino acid** | **Dihedral angle** | **P10** | **P02** | **D10** | **D02** |
|---|---|---|---|---|---|
| **CYS** | $\chi_1$ | 55.76 | **56.40** | 50.16 | 50.09 |
| **SER** | $\chi_1$ | **67.34** | 67.13 | 61.94 | 61.84 |
| **THR** | $\chi_1$ | **88.46** | 87.81 | 86.13 | 85.85 |
| **VAL** | $\chi_1$ | **90.79** | 90.58 | 86.94 | 86.99 |
| **ASN** | $\chi_1$ | **79.21** | 78.33 | 69.53 | 69.31 |
|  | $\chi_{1+2}$ | **56.18** | 53.34 | 49.34 | 47.15 |
| **ASP** | $\chi_1$ | 78.27 | **79.33** | 72.47 | 73.12 |
|  | $\chi_{1+2}$ | **60.80** | 60.36 | 57.16 | 56.18 |
| **HIS** | $\chi_1$ | **79.12** | 77.93 | 63.33 | 62.06 |
|  | $\chi_{1+2}$ | **45.01** | 43.29 | 33.33 | 32.86 |
| **ILE** | $\chi_1$ | **91.56** | 91.18 | 86.91 | 87.05 |
|  | $\chi_{1+2}$ | **77.71** | 77.20 | 68.18 | 68.02 |
| **LEU** | $\chi_1$ | **84.21** | 83.70 | 74.89 | 74.20 |
|  | $\chi_{1+2}$ | **74.22** | 73.19 | 68.59 | 67.94 |
| **PHE** | $\chi_1$ | **88.26** | 86.90 | 72.95 | 73.03 |
|  | $\chi_{1+2}$ | **53.17** | 52.28 | 42.00 | 42.23 |
| **PRO** | $\chi_1$ | **83.11** | 82.20 | 80.96 | 80.92 |
|  | $\chi_{1+2}$ | **79.01** | 78.19 | 76.70 | 76.74 |
| **TRP** | $\chi_1$ | **69.42** | 68.06 | 53.49 | 53.01 |
|  | $\chi_{1+2}$ | 55.60 | 50.40 | 35.61 | 34.98 |
| **TYR** | $\chi_1$ | **87.29** | 86.38 | 72.64 | 72.68 |
|  | $\chi_{1+2}$ | 51.30 | **51.59** | 42.67 | 43.18 |
| **GLN** | $\chi_1$ | **72.37** | 70.67 | 63.62 | 62.47 |
|  | $\chi_{1+2}$ | **50.25** | 48.91 | 34.05 | 38.72 |
|  | $\chi_{1+2+3}$ | **25.71** | 23.08 | 17.20 | 16.03 |
| **GLU** | $\chi_1$ | **67.46** | 66.39 | 62.36 | 61.71 |
|  | $\chi_{1+2}$ | **47.86** | 46.65 | 41.97 | 41.14 |
|  | $\chi_{1+2+3}$ | **26.17** | 25.03 | 21.21 | 20.48 |
| **MET** | $\chi_1$ | 71.54 | **72.40** | 60.03 | 60.85 |
|  | $\chi_{1+2}$ | 56.50 | **56.66** | 36.31 | 34.55 |
|  | $\chi_{1+2+3}$ | **39.95** | 39.51 | 20.12 | 19.91 |
| **ARG** | $\chi_1$ | **71.52** | 71.35 | 63.60 | 61.41 |
|  | $\chi_{1+2}$ | **56.83** | 56.60 | 47.18 | 47.47 |
|  | $\chi_{1+2+3}$ | **29.92** | 29.62 | 21.33 | 21.27 |
|  | $\chi_{1+2+3+4}$ | **17.60** | 16.82 | 9.82 | 9.14 |
| **LYS** | $\chi_1$ | 72.02 | **72.11** | 66.43 | 66.28 |
|  | $\chi_{1+2}$ | 58.54 | **58.79** | 50.82 | 50.73 |
|  | $\chi_{1+2+3}$ | **44.83** | 44.80 | 36.86 | 36.89 |
|  | $\chi_{1+2+3+4}$ | **28.42** | 27.85 | 23.33 | 23.48 |
| **Overall** | $\chi_1$ | **80.45** | 80.05 | 73.80 | 73.43 |
|  | $\chi_{1+2}$ | **61.50** | 60.74 | 53.72 | 53.60 |
|  | $\chi_{1+2+3}$ | **32.81** | 31.82 | 24.62 | 24.03 |
|  | $\chi_{1+2+3+4}$ | **23.25** | 22.55 | 16.94 | 16.61 |

The first column contains amino acid names. The second column denotes the combination of dihedral angles for which the accuracy is reported. Starting from the third column, the accuracy for our proposed library with Dunbrack’s 2010 library as input, our proposed library with Dunbrack’s 2002 library as input, Dunbrack’s backbone-dependent library proposed in 2010, and Dunbrack’s backbone-dependent library proposed in 2002 is reported, respectively.

**D02** Dunbrack’s backbone-dependent rotamer library proposed in 2002

**D10** Improved version of Dunbrack’s library proposed in 2010

**P02** Our protein-dependent rotamer library with D02 as the input library

**P10** Our protein-dependent rotamer library with D10 as the input library

The accuracy of $\chi_1$ up to $\chi_4$ (where applicable) for the different amino acids, as well as the overall accuracy of the four rotamer libraries, is shown in Table 1. The $\chi_1$ accuracy of P10 improves upon the better of D10 and D02 by at least 5% on 15 out of the 18 amino acids, and the improvement is at least 10% on five amino acids. The overall $\chi_1$ accuracy of both P10 and P02 is above 80%, which improves upon the corresponding input library by about 6.5%. We also run TreePack, a well-known global-search-based side-chain prediction method, on the same data set for comparison.

One thing to notice is that the improvement in accuracy of our libraries is not consistent across amino acids. Some amino acids improve significantly (around 15-20%), while a few improve below average. We investigate this and discover that the accuracy of all the amino acids with a big aromatic ring, namely HIS, PHE, TRP and TYR, has improved greatly. A possible explanation is that, because of the size of the aromatic rings, the conformations of these amino acids depend highly on the local geometric environment, rather than only on backbone information. These amino acids are more constrained in choosing a particular rotamer even if the rotamer is heavily represented within the database. Therefore, encoding the spatially local information of the whole protein benefits these amino acids the most.

One interesting observation is that on MET, which does not have a big aromatic ring, our protein-dependent rotamer libraries still achieve about 10% improvement on $\chi_1$ and about 20% improvement on $\chi_{1+2}$ and $\chi_{1+2+3}$. It turns out that MET is the only amino acid with a sulfur atom inside its side chain (rather than at the end of the side chain). Sulfur has a bigger atomic radius than carbon and nitrogen, so the dihedral angles around sulfur are more constrained than those around carbon or nitrogen, and thus depend largely on the specific protein structure. However, this explanation can be questioned given the low improvement in accuracy for CYS, which also has a sulfur atom in its side chain. This is due to the fact that in proteins, when suitable conditions are found, two CYS amino acids normally form a disulfide bond, which changes the regular conformation. Such a trend can already be partially captured by the statistics on the protein databases; therefore, on CYS the protein-dependent rotamer libraries do not encode much more information than the backbone-dependent rotamer libraries. On the other hand, the energy function used in our method does not contain a specific term for disulfide bonds, whereas side-chain prediction programs that apply global search techniques normally encode such a term. It can therefore be expected that our method does not improve upon the backbone-dependent libraries on CYS as much as the global search methods do.

It is also shown in Table 1 that P10 achieves a higher overall accuracy than P02, suggesting that a more accurate input library leads to a more accurate protein-dependent library.

Performance on rotamer ranking

To demonstrate the potential for global optimization/search algorithms to benefit from our protein-dependent rotamer library, we further evaluate the rotamer re-ranking ability of our library. Table 1 has shown the prediction accuracy when only the top-ranked rotamer is used; here we examine the quality of the entire ranked list.

We first evaluate the average rank of the first correct rotamer for P10 and D10. The average rank is calculated by taking the mean rank of the first correct rotamer for each side chain, where the rotamers are ordered by the probabilities assigned by the corresponding library. This indicates the expected rank within which a correct rotamer should be found. In the ideal case the average rank is 1, which means the rotamer library ranks the correct rotamer as the first choice. The comparison of the average rank between P10 and D10 is shown in Table 2: the average rank decreases from 1.74 to 1.63 for $\chi_1$ and from 2.95 to 2.67 for $\chi_{1+2}$. This improvement indicates that our method re-ranks the original input rotamer library in such a way that the correct rotamers bubble up the list. This is an important measurement since most global search procedures give priority to highly probable rotamers from the library; usually such prior knowledge is encoded in the energy functions of the side-chain prediction methods. The result confirms that our library indeed prioritizes correct rotamers on average.

Comparison of rotamer libraries for rotamer ranking

Average rank of the first correct rotamer

| **Dihedral angle** | **Rank of P10** | **Rank of D10** |
|---|---|---|
| $\chi_1$ | **1.6301** | 1.738 |
| $\chi_{1+2}$ | **2.6663** | 2.9517 |

Average probability of finding correct rotamers at the top 1 position

| **Dihedral angle** | **Probability of P10** | **Probability of D10** |
|---|---|---|
| $\chi_1$ | **0.8018** | 0.6470 |
| $\chi_{1+2}$ | **0.6111** | 0.4127 |

Average probability of finding correct rotamers in the top 2 positions

| **Dihedral angle** | **Probability of P10** | **Probability of D10** |
|---|---|---|
| $\chi_1$ | **0.8984** | 0.8899 |
| $\chi_{1+2}$ | **0.7248** | 0.7053 |

Average probability of finding correct rotamers in the top 3 positions

| **Dihedral angle** | **Probability of P10** | **Probability of D10** |
|---|---|---|
| $\chi_1$ | **0.9313** | 0.9265 |
| $\chi_{1+2}$ | **0.7655** | 0.7479 |

The top part of the table shows the comparison of the average rank of the first correct rotamers between our protein-dependent rotamer library with Dunbrack’s 2010 library as input (P10) and Dunbrack’s backbone-dependent library proposed in 2010 (D10). The other part of the table shows the comparison of the average probability of finding correct rotamers in the top 1, 2 and 3 rotamers of P10 and D10, respectively.

Another set of criteria we evaluate is whether our top choices are populated by correct rotamers. We calculate the average probability of finding correct rotamers in the top 1, 2 and 3 choices. As shown in Table 2, the average probability at the top 1 position increases from 0.65 to 0.80 for $\chi_1$ and from 0.41 to 0.61 for $\chi_{1+2}$. For the top 2 and top 3 choices, even though the probabilities of both libraries are high, our library still outperforms the backbone-dependent library. Note that the probability here is the prior probability assigned by the corresponding library, which is different from the accuracy of the library. Such prior probabilities are widely used in the energy functions of global search algorithms to direct the search procedure. Therefore, with higher average probability, the energy functions can be more accurate, which can in turn reduce the search space of the side-chain packing methods.
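Both ranking metrics are straightforward to compute from each side chain's probability-sorted rotamer list. A sketch with illustrative names follows; `is_correct` would encapsulate the 40-degree criterion described earlier.

```python
def average_first_correct_rank(ranked_lists, is_correct):
    """Mean rank (1-based) of the first correct rotamer over all side chains.
    ranked_lists: one list of rotamers per side chain, sorted by decreasing
    library probability. is_correct: predicate for matching the native rotamer."""
    ranks = []
    for rotamers in ranked_lists:
        for rank, r in enumerate(rotamers, start=1):
            if is_correct(r):
                ranks.append(rank)
                break  # only the first correct rotamer counts
    return sum(ranks) / len(ranks)

def top_k_hit_rate(ranked_lists, is_correct, k):
    """Fraction of side chains with a correct rotamer among the top k choices."""
    hits = sum(
        1 for rotamers in ranked_lists
        if any(is_correct(r) for r in rotamers[:k])
    )
    return hits / len(ranked_lists)
```

Side chains with no correct rotamer anywhere in the list are simply skipped by the rank metric in this sketch.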

Combining the results from Tables 1 and 2, we conclude that our protein-dependent rotamer library not only predicts side-chain conformations more accurately, but also ranks the correct rotamers higher, and can thus benefit the global optimization/search-based side-chain prediction methods.

Comparison of inference algorithms

We finally report the comparison between the different inference algorithms for MRF on our problem. We compare the performance of four approximate inference algorithms, namely,

**LBP** Loopy belief propagation

**GBP** Generalized belief propagation

**MF** Mean field approximation

**TRBP** Tree re-weighted belief propagation

Table 3 summarizes the comparison in terms of accuracy, average rank, and running time.

Comparison of inference algorithms

| **Attribute** | **LBP** | **GBP** | **MF** | **TRBP** |
|---|---|---|---|---|
| Accuracy of $\chi_1$ | **80.07** | 80.03 | 79.54 | 76.45 |
| Accuracy of $\chi_{1+2}$ | **60.78** | 60.72 | 60.33 | 55.58 |
| Average rank of $\chi_1$ | **1.50** | 1.51 | 1.54 | 1.59 |
| Average rank of $\chi_{1+2}$ | **2.23** | 2.25 | 2.33 | 2.45 |
| Average execution time (in seconds) | 29.99 | 63.19 | **14.17** | 189.94 |

Comparison of the average accuracy for side-chain prediction, the average rank of the first correct rotamers, and the average running time for loopy belief propagation (LBP), generalized belief propagation (GBP), mean field approximation (MF), and tree re-weighted belief propagation (TRBP).

Discussion

We have demonstrated that by modeling protein structures as an MRF and applying inference algorithms to estimate the marginal distributions of the side chains, we can obtain a much more accurate rotamer library, which we refer to as a protein-dependent rotamer library. One may argue that although we do not use global optimization/search algorithms, our method encodes the energy information. However, the energy information we use mainly serves to set the potentials that build the MRF, rather than to direct any search procedure. In this sense, the traditional backbone-dependent rotamer libraries also encode energy information, in another form: they are mainly based on statistics of solved protein structures, which are assumed to be the global minimum conformations of the natural energy function, so doing statistics on such structures also encodes energy information. This is further confirmed by the fact that if high-resolution protein structures are used to build the traditional libraries, or if only the core regions with high electron density are used for the statistics, the accuracy of the traditional libraries can be increased significantly.

Although probabilistic graphical models are a relatively new tool for protein structure modeling, they have already proved their efficacy. However, they are not immune to drawbacks. In our use of belief propagation, it is not guaranteed that the inference algorithm will converge. We avoid this problem by setting a maximum limit on the number of iterations. Nevertheless, on our dataset loopy belief propagation converges within 100 iterations for around 97% of the input proteins, and for the cases where LBP fails to converge, we still obtain moderately good results. Thus, this limitation is not as daunting as it first seems. Other deterministic methods, such as TreePack, can also suffer from errant input.

One important application of our method is side-chain prediction for flexible backbone conformations. In many applications, a large number of backbone structures are available, such as in protein structure sampling, in collecting protein structures from different protein structure prediction servers, or in protein backbone refinement tasks. In such cases, there is a large number of close-to-native backbone structures, but none of them is the native structure. The traditional side-chain packing methods usually take only a single backbone structure as input and thus cannot be applied here, because the whole set of structures contains important information about the native structure; all of these close-to-native structures should be considered simultaneously. Our method can easily take a set of flexible backbone structures as input. In this case, the backbone structures are also modeled as random variables, and the standard belief propagation algorithms can still be used to infer the marginal distributions of side-chain rotamers under the condition of flexible backbones.

Conclusion

In this paper, we have proposed a novel type of backbone-dependent rotamer library, i.e., protein-dependent rotamer library, which encodes structural information of all the spatially neighboring residues. By estimating the marginal distributions of the side-chains in a Markov random field model, the proposed library significantly boosts the accuracy of the input rotamer library, without global optimization or search. The proposed library can hopefully lead to the performance improvements of the side-chain prediction methods.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

XG initiated the idea. MSIB and XG wrote and revised the manuscript. MSIB carried out the experiments.

Acknowledgements

We thank the anonymous reviewers whose suggestions improved the manuscript. We are grateful to the Dunbrack lab for issuing us academic licenses for both Scwrl4 and the backbone-dependent rotamer libraries. We thank Jinbo Xu for making the TreePack executable publicly available. This work is supported by a grant from King Abdullah University of Science and Technology.

This article has been published as part of