Abstract
Background
Protein sidechain packing problem has remained one of the key open problems in bioinformatics. The three main components of protein sidechain prediction methods are a rotamer library, an energy function and a search algorithm. Rotamer libraries summarize the existing knowledge of the experimentally determined structures quantitatively. Depending on how much contextual information is encoded, there are backboneindependent rotamer libraries and backbonedependent rotamer libraries. Backboneindependent libraries only encode sequential information, whereas backbonedependent libraries encode both sequential and locally structural information. However, sidechain conformations are determined by spatially local information, rather than sequentially local information. Since in the sidechain prediction problem, the backbone structure is given, spatially local information should ideally be encoded into the rotamer libraries.
Methods
In this paper, we propose a new type of backbonedependent rotamer library, which encodes structural information of all the spatially neighboring residues. We call it proteindependent rotamer libraries. Given any rotamer library and a protein backbone structure, we first model the protein structure as a Markov random field. Then the marginal distributions are estimated by the inference algorithms, without doing global optimization or search. The rotamers from the given library are then reranked and associated with the updated probabilities.
Results
Experimental results demonstrate that the proposed proteindependent libraries significantly outperform the widely used backbonedependent libraries in terms of the sidechain prediction accuracy and the rotamer ranking ability. Furthermore, without global optimization/search, the sidechain prediction power of the proteindependent library is still comparable to the globalsearchbased sidechain prediction methods.
Background
Protein molecules are indispensable in most of the cellular functions, such as metabolism, gene regulation, signal transduction, and cell cycle. The capability of being such a diverse worker arises mainly due to their structures. Therefore, predicting protein structures accurately is important for both function determination and protein design purposes.
Sidechain prediction
A protein structure contains both the backbone structure and the sidechain structure. Protein structures are typically represented in either coordinate space or angular space. Based on the assumption that the length of the covalent bonds is approximately constants, protein structures are usually modeled in angular space, which can reduce the number of variables by about one third. The dihedral angles can be calculated from coordinates that define the corresponding twists of the protein’s backbone as well as side chains. There are three backbone dihedral angles namely ϕ, ψ and ω. Each of them defines the twist between two corresponding neighboring planes defined by backbone atoms. To explain the detailed sidechain conformations, dihedral angles χ are used. Different amino acids have different number of sidechain dihedral angles with the maximum number of four, namely χ_{1}, χ_{2}, χ_{3} and χ_{4} respectively. Figure 1 illustrates the definition of the backbone and sidechain dihedral angles. The interesting thing is these angles cannot take any arbitrary values due to atomic clashes and orientations. They appear to take values only from discrete domains. These discrete conformations which are available to the sidechain dihedral angles are called rotamers [1].
Figure 1. Protein dihedral angle. This figure illustrates different protein dihedral angles. ϕ, ψ and ω constitute backbone dihedral angles and χ_{1}, χ_{2} and χ_{3} denote sidechain dihedral angles.
Due to the difficulty of predicting complete protein structures simultaneously, structure determination remains as a multiphase task. There are different subtasks including backbone prediction, sidechain prediction, loop modeling, and refinement. In this paper, we focus on the prediction of the sidechain conformation for a given backbone structure, i.e., protein sidechain prediction problem. By using the concept of rotamers, this is essentially the problem of correct rotamer assignment for every amino acid so that the overall structure is thermodynamically stable. It is assumed that stability comes at low internal energy states. That is why the problem of sidechain prediction is traditionally considered to be an optimization problem which strives to find a rotamer assignment which will minimize the total internal energy of the protein molecule. Since in most cases rotamers are discrete values, the problem is reduced to a combinatorial search problem in previous work [25].
To solve an optimization problem, two components are needed, the objective function which has to be maximized/minimized and the search strategy which tries to search for the global maximum/minimum. In sidechain prediction, the rotamer solution space is exponential in the size of the protein and the objective function, which is an energy function in this case, has numerous local minima. This combination dictates people to prioritize the candidate rotamers to design a practical search strategy, which is the place where rotamer libraries come to play a role. In the past three decades there have been lots of studies in each direction. Different kinds of energy functions have been tried and developed [410]. In the domain of search strategy, a broad range of combinatorial search algorithms, both exact [1115] and approximate [1620] ones, have been applied. To incorporate prior knowledge, different kinds of rotamer libraries have been developed. In this paper, we propose a novel idea in the context of rotamer library.
Rotamer library
Rotamer libraries [16,2124] are important components not only in sidechain prediction but also in several other areas including protein design. They summarize the existing knowledge of the experimentally determined structures quantitatively. Along with other information, rotamer libraries contain estimated probabilities of the discrete conformations of sidechain dihedral angles calculated from the structure databases. Depending on how much contextual information is taken into account, there can be different kinds of libraries. Initially the libraries consider only amino acid specific context and the probabilities are given for rotamers of different amino acids [2530]. They are called backbone independent rotamer libraries. However, the discriminative power of the backbone independent libraries is not enough to eliminate sufficient amount of rotamer choices. Therefore, the backbone dependent rotamer libraries have been introduced [1,2124,3136]. These libraries consider the local backbone context through the ϕ and ψ angles along with the amino acid information. Backbone dependent rotamer libraries have been demonstrated to be able to boost the accuracy of a sidechain predictor equipped with the global optimizer over energy function landscape and guide them to avoid local minima [33]. Theoretically, the more contextspecific information the library can encode, the more precise rotamer choices it can deliver. In this paper, we combine the general purpose backbonedependent rotamer library with the detailed backbone atom coordinates of a specific protein, to introduce a proteindependent rotamer library, without global optimization or search. To the best of our knowledge, this is a novel idea in the domain of rotamer library. For traditional backbonedependent rotamer library, for a certain amino acid, the probability of its certain rotamer depends only on the local backbone ϕ and ψ angles. In our case the probabilities of two different rotamers of the same amino acid with the same ϕ and ψ angles can have different marginal distributions depending on their interactions with the surrounding environments.
Markov random field model
Given a backbone dependent rotamer library, e.g., Dunbrack’s libraries published in the year of 2002 or 2010, and the backbone structure of a query protein, we first model the backbone and sidechain structures of the protein in Markov random field (MRF), where the residues are modeled as vertices of the interaction graph. We then employ widely used energy functions, e.g., Scwrl3 [4] energy function, to set up the potential for inference algorithms, e.g., sumproduct belief propagation, to compute the marginal distributions of the residuespecific rotamers. In this way, all the rotamers are reranked for each residue in the query protein, according to the marginal distributions. We will demonstrate that this reranking can significantly improve the accuracy of the input backbone dependent rotamer library, which can hopefully benefit the global search algorithms for sidechain packing, such as the deadendelimination algorithm proposed in [11] and the tree decomposition algorithm proposed in [2,3].
One thing to notice is that modeling protein structures using probabilistic graphical models is not new [3740]. Kamisetty et al. modeled protein structures by MRF and applied generalized belief propagation (GBP) to compute the free energy of a protein structure [37]. Our graphical model of protein structures is similar to their model. However, our focus is to calculate the marginal distributions and rerank the rotamers, without calculating the free energy. We will demonstrate that loopy belief propagation (LBP) outperforms GBP for this purpose. Besides, we have encoded an energy function that is more suitable for reranking the rotamers than the ROSETTA energy function used in [37]. Yanover et al. modeled protein structures by conditional random field (CRF) and applied maxproduct belief propagation (BP) algorithms for sidechain prediction [39]. Our work is different from theirs in several ways. Firstly, their purpose is to apply maxproduct BP as a global search algorithm, which means they are interested in finding the optimal rotamer combination of all the side chains simultaneously, i.e., the combination that corresponds to the maximum joint probability. Therefore, their method is a sidechain predictor by itself, which can hardly be used by more powerful search algorithms, such as the one proposed recently in Scwrl4 [5]. We model protein structures as a MRF and apply sumproduct BP, which provides the detailed marginal distribution for each sidechain, without global optimization. That means, if one selects the highest probability rotamer for each sidechain in our method, it may not yield a valid sidechain packing due to atomic clashes. Therefore, our method should be considered as a proteindependent rotamer library which serves as the input for global search algorithms. Secondly, [39] used the ROSETTA energy function and demonstrated that the tree reweighted BP algorithm performed very well to minimize and learn this energy function. However, our results demonstrate that this is not a general case. We use the simpler Scwrl3 [4] energy function and for that tree reweighted BP does not perform better than the other BP algorithms.
Another thing to notice is that our proteindependent rotamer library computes the marginal distributions of all the sidechain torsion angles (up to four) for a specific residue position, rather than considering them independently. This makes sense due to the high correlation between the torsion angles belong to the same amino acid.
Contributions
Our contributions can be summarized as follows:
1. We introduce the idea of proteindependent rotamer library and show the superiority of this library with respect to the widely used backbonedependent rotamer libraries [1,24] in terms of both the accuracy of rotamer ranking and the probability assigned to the correct rotamers, on a large benchmark data set proposed recently by [5].
2. We model the protein structure as a MRF, encode the Scwrl3 energy function, and compare different sumproduct BP algorithms to rerank the rotamers. Our method does not contain a learning process, which is more likely to perform consistently well on other data sets and other energy functions.
3. The proposed proteindependent rotamer library can be easily used as a sidechain predictor if we threshold each marginal distribution to its most probable rotamer. We compare our library with the most widely used sidechain predictors [25] and demonstrate that the accuracy is acceptable without using any global search/optimization techniques. Moreover, our library gives a probability distribution among rotamer choices instead of producing a single choice.
Methods
We use the backbone structure of a protein in PDB format as our input. The output is a rotamer library with a format similar to that of Dunbrack’s library [1]. In Dunbrack’s library for each combination of an amino acid and a particular (ϕ, ψ) backbone dihedral conformation, there is one and only one distribution of rotamer conformations. However, in our generated output, for every amino acid in the protein sequence we have a distribution. This implies we can have distinct marginal distributions for same type of amino acids with similar (ϕ, ψ) angles. The distributions differ because of the consideration of surrounding environment of a certain amino acid.
When a protein backbone conformation is given, our method constructs an interaction graph where each residue is a vertex. We place an edge between a pair of residues if at least one pair of atoms from them is found to be closer than a minimum threshold. After that we set up the energy potentials for each node as well as each edge. Using the potentials, an inference algorithm is applied to calculate the marginal distributions of rotamers choices.
Our discussion of methods can be logically split into the following three phases:
1. Creating the interaction graph
2. Setting up energy potentials
3. Inferring marginal distributions
Creating the interaction graph
From the coordinates of the backbone atoms we create an interaction graph for the given protein. For every amino acid in the protein a vertex is added. We join each residue pair with an edge for which the distance between any possible pair of their corresponding C^{α}, C^{β} and carbonyloxygen atoms is less than the contact threshold. In our experiment, we set the contact threshold to 10Å which is the value used by ROSETTA. Figure 2 shows an illustration of an interaction graph for a protein sequence of seven amino acids. We denote the interaction graph as G = (V, E), where V is the set of all vertices and E is the set of all edges. For every vertex we calculate the backbone ϕ and ψ angles and load the corresponding aminoacid specific rotamer conformations from the input backbonedependent rotamer library with the detail description of rotamers. The description contains possible discrete conformations along with their estimated prior probabilities, and the mean values of χ_{1}, χ_{2}, χ_{3}, χ_{4}, respectively.
Figure 2. Interaction graph for residue chain. This figure gives an example of the interaction graph for a protein sequence of seven amino acids. C_{i} denotes the alpha carbon of the ith residue. Apart from the neighboring relationships, the are two more edges in this map namely (C_{2}, C_{4}) and (C_{4}, C_{6}).
Setting potentials
After creating the interaction graph, we calculate the energy potentials for the vertices and the edges. In MRF, potential functions are a measure of the likelihood for the random variables. We denote the entire protein structure by a set of random variables X = {X_{b}, X_{s}} where X_{b} is our given backbone structure consists of ϕ and ψ angles and X_{s} is the side chain structure consists of χ_{1}, χ_{2}, χ_{3}, and χ_{4}. We need to approximate the marginal probabilities for X_{s}. Figure 3 shows the corresponding Markov random field model for the interaction graph shown in Figure 2. The probability for a specific assignment of rotamers is given by P(X_{s} = x_{s}X_{b} = x_{b}, θ), where x_{s} is a particular assignment of sidechain variables for a given backbone assignment x_{b}, and θ is all the other parameters needed in prior to calculate the probabilities. Observations suggest that the sidechain conformation of a specific residue does not depend on every part of the protein. Therefore, it is assumed that only the residue pairs with an edge in the interaction graph can affect the sidechain conformation of each other. Using this conditional independence, for each vertex V_{i} we can write the probability of a particular assignment by p(X_{i} = x_{i}Neighbors(X_{i}), θ). The probability of a complete sidechain assignment can be given by following equation:
Figure 3. Markov random field for interaction graph in Figure 2. This figure illustrates the corresponding Markov random field structure of the interaction graph from Figure 2. Square nodes are backbone random variables for which conformation is known and circular nodes are sidechain random variables for which conformation is unknown. x_{bi} denotes the backbone conformation of the ith residue. Since this is given, the variable is not capitalized. X_{si} denotes the sidechain conformation R_{i} of the ith residue. Since this is not given, the variable is capitalized.
Here C(G) is the set of all cliques in G, f is a potential function denoting the likelihood of a specific assignment of the backbone and the side chain conformation, and Z is called a normalizing factor. Our potential function includes two components, i.e., vertex potential and edge potential. Vertex potential E_{i} is contributed by the interactions between all the atoms of a certain residue x_{i} and the backbone atoms of all the other neighboring residues. Edge potential E_{i,j} is contributed by the interactions among the sidechain atoms of a certain residueresidue pair (x_{i}, x_{j}) where x_{i} and x_{j} are connected by an edge in the interaction graph. To define potential function we use the Boltzmann distribution. According to the Boltzmann distribution the vertex potential for a node x_{i} can be written as:
Similarly the edge potential for a pair of vertices (x_{i}, x_{j}) can be written as:
Here the k_{B} is the Boltzmann constant and T is the absolute temperature. k_{B}T = 0.6 kcal/mol is used. To define the vertex potential of a vertex x_{i} for a particular rotameric state r_{ij}, we use the following equation:
Here k_{d} is a constant factor, which is set to k_{d} = 3.0. p(r_{ij}) is the prior probability of the rotamer r_{ij} and p(r_{imax}) is the probability of the most densely populated rotamer, both of which are for a specific backbone conformation of the vertex x_{i} and are read from the input rotamer library. We use this component to prioritize more likely rotamers by giving them an energetically favorable advantage. Similar technique was used in Scwrl4 [5]. However, they used amino acid specific constants and learned them from training data. E_{sci}(r_{ij}) denotes the summation of all pairwise energy components resulting from the interaction between the atoms of residue xi and backbone atoms of all the other neighboring residues. To calculate the energy between two atoms a and b we use the energy function used by the SCWRL3.0 [4]. It is a piecewise function used to approximate the repulsive portion of the LennardJones 126 potential. It can be defined as:
Here d is the Euclidean distance between the coordinates of the two atoms. If r_{a} is the interaction radius of atom a and r_{b} is the interaction radius of atom b then r_{ab} = r_{a} + r_{b}. The default radius values used by Scwrl4 [5] are used here, i.e., E_{max} = 10 and k_{sc} = 0.8254. To define the edge potential between two vertices x_{i} and x_{j} for a particular rotameric pair (r_{im}, r_{jn}) we use the following equation:
Here E_{scij} (r_{im}, r_{jn}) denotes the summation of all pairwise energy components resulting from the interaction between the side chain atoms of residue x_{i} in rotameric state r_{im} and the side chain atoms of residue x_{j} in rotameric state r_{jn}. For calculation of energy of an atomic pair we use the SCWRL3.0 function described above. The E_{hb}(x_{i}, x_{j}) denotes the energy due to hydrogen bonding between residue pair. We use the hydrogen bond component of the ROSETTA energy function.
Inferring marginal distributions
After assigning all the vertex and edge potentials, the interaction graph becomes a MRF. To rerank the rotamer choices for each sidechain in this MRF, marginal distributions need to be computed. We employ different inference algorithms such as loopy belief propagation (LBP), generalized belief propagation (GBP) with a region graph, mean field approximation (MF) and tree reweighted belief propagation (TRBP). Among them, LBP performs better than others, as we will show in the Results section. We give a brief description of them in the following.
Loopy belief propagation
In LBP, we initialize the vertices with some random marginal distributions called beliefs. In each iteration, depending on the potential function and the messages passed by the neighbors, every vertex updates its belief, which is assumed to be an approximation of the marginal distribution of rotamer choices for this vertex. After updating to new belief, the vertex forms a set of new messages for each of its neighbors and passes them accordingly. This procedure is repeated by every vertex at each iteration. For connected acyclic graphs it gives the exact marginal distributions for the random variables associated with the vertices of the graph. However, for the graphs with loops it gives a good estimate when the procedure converges. We set a maximum number of 100 iterations to detect whether it converges or not. If two successive iterations do not differ more than a threshold in their beliefs, the algorithm is considered to be converged. For scheduling we use an asynchronous update. The calculated belief or marginal distribution for a vertex x_{i} in LBP is defined by the following equation:
Message update rule is defined by the following equation:
The first equation intuitively captures the marginal likelihood by combining old belief of a vertex and the old incoming messages sent by all of its neighbors. From this information a vertex can calculate new outgoing messages which capture an estimation of the marginal distribution of destination neighbors by combining old belief of the source vertex and the beliefs of source vertex estimated by all of its neighbors except the destination. Specifically, b(x_{i}) denotes the approximated marginal distribution of the node x_{i}. The set of all the neighbors of x_{i} is represented by N(i). m_{ij}(x_{j}) indicates a message sent from node x_{i} to node x_{j} and contains a marginal distribution of node x_{j} estimated by node x_{i}. We use the sumproduct algorithm for message passing where each node collects messages from all of its neighboring nodes and calculates new messages for each of its neighbors by taking the product of messages sent by other neighbors and summing over its current distribution.
Other inference algorithms
Generalized belief propagation is a family of approximate inference algorithms which divide the original graph into several regions to decrease the computational complexity. However, the belief expression and message update rule remain same with one subtle difference. Due to the division among regions one node can occur in multiple regions. So, we need to set weights for the contributions of these border nodes to different regions so that their overall contributions remain correct.
Mean field approximation tries to approximate the overall joint probability distribution by a product of independent marginals. This does not explicitly pass any messages, however at each iteration it tries to update its beliefs with the following equation:
In tree reweighted belief propagation, the regular loopy belief propagation is given another set of constants called edge appearance probabilities or ρ_{ij} which represents the probability of an edge (x_{i}, x_{j}) that it will appear in a spanning tree of the graph. This is a mechanism for edge prioritization which affects the belief equation and the message update rule. The equation for belief can be written as the following equation:
The message update rule can be written as the following equation:
After computing the marginal distribution of sidechain conformation for every vertex, the rotamers in the input rotamer library are reranked for each sidechain. We create a proteindependent rotamer library according to the same structure of the input backbonedependent rotamer library which can be used by other global optimization algorithms.
Dataset and software
To show the efficacy of our idea, we use the same data set of 379 proteins used in the Scwrl4 [5] paper. This is a larger and more recent data set comparing to the ones used in [2,4]. After downloading the PDB files, there are 355 of them which do not contain any duplicate backbone atoms. For this set we run our program and construct the proteindependent rotamer library, then compare the performance with two other widely used backbonedependent rotamer libraries [1,24]. For calculating actual dihedral angles from the original PDB files, we use the program Dangle [41]. To create the interaction graph of the protein we use the molecular biology toolkit [42]. Please note that since our method does not involve any training process, we do not subdivide the data set.
Results
In this section, we evaluate the performance of our proposed proteindependent rotamer library. First of all, we compare the sidechain packing power of our library and the widely used backbonedependent libraries [1,24]. For our proteindependent library, we threshold the marginal distribution of each sidechain to its most probably rotamer, which is considered as the prediction of our library for sidechain packing purpose. For the backbonedependent libraries, we select the rotamer with the highest probability as the prediction. Please note that neither our library nor the backbonedependent libraries involve global optimization. However, strong sidechain packing power gives more potential for the global optimization/search algorithms to benefit from the library. Secondly, we evaluate the reranking accuracy of our proteindependent rotamer library. We compare both the average rank of the first correct rotamers and the average probability of finding correct rotamers within the top 1, 2 and 3 rotamers, respectively. A lower average rank and a higher probability can clearly reduce the search space of the following global optimization/search algorithms and boost the likelihood of such algorithms to pack sidechains correctly. Finally, we compare the accuracy and the speed of the four inference algorithms.
To calculate the accuracy of a rotamer choice, the most widely used criterion is used, i.e., if the mean dihedral angle of this rotamer is within 40 degree of the actual dihedral angle, this rotamer is considered to be correct; otherwise, it is considered to be wrong. For χ_{1+2} to be correct, both χ_{1} and χ_{2} have to be correct. We judge the correctness of χ_{1+2+3} and χ_{1+2+3+4} similarly.
Performance on sidechain prediction
We first evaluate the sidechain packing power of our proteindependent rotamer library. We choose the widely used backbonedependent rotamer libraries proposed by Dunbrack’s lab in 2002 and 2010 [1,24] for comparison. The backbonedependent library is used as input for our method and the corresponding rotamers are reranked according to the marginal distributions. For both our proteindependent libraries and the backbonedependent libraries, the rotamer with the highest probability for each sidechain is considered as the prediction by the corresponding library. The predictions are then compared with the real sidechain angles to calculate the accuracy.
In this experiment, we use LBP as the inference algorithm, because as we will show later in this section, LBP outperforms the other three inference algorithms. Similar conclusion can be drawn if other inference algorithms are used.
Table 1 shows the performance of four rotamer libraries namely
Table 1. Comparison of rotamer libraries for sidechain prediction
D02 Dunbrack’s backbonedependent rotamer library proposed in 2002 [1]
D10 Improved version of Dunbrack’s library proposed in 2010 [24]
P02 Our proteindependent rotamer library with D02 as the input library
P10 Our proteindependent rotamer library with D10 as the input library
The accuracy of χ_{1} until χ_{4} (if there exists) for different amino acids as well as the overall accuracy of the four rotamer libraries is shown in Table 1. It can be seen that our proteindependent library clearly outperforms both D10 and D02 on all the amino acids. In fact, the χ_{1} accuracy of P10 improves the higher one of D10 and D02 by at least 5% on 15 out of all the 18 amino acids, whereas the improvement is at least 10% on five amino acids. The overall χ_{1} accuracy of both P10 and P02 is above 80%, which improves the corresponding input library by about 6.5%. We also run a wellknown sidechain prediction method, TreePack, proposed in [2], which is based on a global search algorithm, i.e., tree decomposition, on the same data set. The overall accuracy of TreePack is about 82%. This demonstrates that without global optimization/search, our proteindependent rotamer library is still comparable to the global search methods.
One thing to notice is that the improvement of the accuracy of our libraries is not consistent on different amino acids. There are some amino acids whose accuracy has been improved significantly (around 1520%). There are also few amino acids whose improvement is below average. We investigate the fact and discover that accuracy of all the amino acids with a big aromatic ring has been improved greatly. They are HIS, PHE, TRP and TYR. A possible explanation is that because of the size of the aromatic rings, the conformations of the amino acids with aromatic rings highly depend on the local geometric environments, rather than depending only on backbone information. These amino acids are more constrained in choosing a particular rotamer even if the rotamer is heavily represented within the database. Therefore, ϕ and ψ angles, which are the only information used by backbonedependent rotamer libraries, are not enough to reveal the conformation preference of such sidechains. Therefore, the simple statistics from the generic protein databases can be misleading.
One interesting thing is that on MET, which does not have any big aromatic ring, our proteindependent rotamer libraries still have about 10% improvement on χ_{1} and about 20% improvements on χ_{1+2} and χ_{1+2+3}. It turns out that MET is the only amino acid which has a sulfur atom inside its side chain (not the end of the sidechain). Sulfur has a bigger atomic radius with respect to carbon and nitrogen. So the conformation of sulfur dihedral angles are more constrained than the carbon or nitrogen dihedral angles, thus largely depend on the specific protein structure. However, this explanation can be questioned because of the low improvement of the accuracy for CYS, which also has a sulfur atom in its sidechain. This is due to the fact that in proteins, if suitable condition found, two CYS amino acids normally form a disulfide bond which changes its regular conformation. Such trend can already be partially captured by the statistics on the protein databases. Therefore, the proteindependent rotamer libraries do not encode much more information than the backbonedependent rotamer libraries. On the other hand, the energy function used in our method does not contain a specific term for disulfide bond, whereas sidechain prediction programs, which apply global search techniques, normally encode such a term. Therefore, it can be expected that our method does not improve the backbonedependent libraries on CYS as much as the global search methods do.
It is shown in Table 1 that the overall accuracy of D10 is slightly higher than D02. Consequently, the overall accuracy of P10 is also higher than P02, which demonstrates that the improvement of our method is consistent and not input library dependent. Therefore, with an improved backbonedependent or backboneindependent rotamer library, a better overall accuracy can be expected for our method.
Performance on rotamer ranking
To demonstrate the potential for the global optimization/search algorithms to benefit from our proteindependent rotamer library, we further evaluate the ability to rerank the input rotamers of our library. It has been shown in Table 1 that P10 is better than P02 and D10 is better than D02. Therefore, from now on, we will use only P10 and D10 for comparison.
We first evaluate the average rank of the first correct rotamers for P10 and D10. The average rank of correct rotamers is calculated by taking the mean rank of the first correct rotamer for each sidechain by the corresponding library according to their probability. This indicates the expected rank within which a correct rotamer should be found. In ideal case, the average rank should be 1, which means the rotamer library is able to rank the correct rotamers as the first choice. The comparison of the average rank between P10 and D10 is shown in Table 2. It can be seen that P10 is able to improve the average rank of the first correct rotamers from 1.74 to 1.63 for χ_{1} and from 2.95 to 2.67 for χ_{1+2}. This improvement denotes that our method reranks the original input rotamer library in such a way that the correct rotamers bubble up across the list. This is an important measurement since most of the global search procedures give a priority towards highly probable rotamers from the library. Usually such prior knowledge is encoded in the energy functions of the sidechain prediction methods. The result confirms that our library indeed prioritizes correct rotamers on average.
Table 2. Comparison of rotamer libraries for rotamer ranking
Another set of criteria which has been evaluated is whether our top choices are populated by correct rotamers or not. We calculate the average probability of finding correct rotamers in the top 1,2 and 3 choices. As shown in Table 2, if we only consider the first choice, the average probability of correct rotamers boosts up from 0.65 to 0.80 for χ_{1} and from 0.41 to 0.61 for χ_{1+2}. In the cases of top 2 and top 3 choices, even though the probability of both libraries are high, our library still outperforms the backbonedependent library. Note that the probability here is the prior probability by the corresponding library, which is different from the accuracy of the library. Such prior probabilities are widely used in the energy functions of the global search algorithms to direct the search procedure. Therefore, with high average probability, the energy functions can be more accurate, which can thus reduce the search space of the sidechain packing methods.
Combining the results from Table 1 and Table 2, our proteindependent rotamer library significantly increases the average accuracy for sidechain prediction, reduces the average rank of the first rotamers, and assigns higher prior probabilities to correct rotamers. All of these improvements are done without doing global optimization/search, which clearly shows the potential of the proteindependent library to benefit the sidechain packing methods.
Comparison of inference algorithms
We finally report the comparison between different inference algorithms for MRF on our problem. We compare the performances of four approximate inference algorithms namely,
LBP Loopy belief propagation
GBP Generalized belief propagation
MF Mean field approximation
TRBP Tree reweighted belief propagation
Table 3 shows the average accuracy for sidechain prediction, average rank of the first correct rotamers, and the average running time of these four inference algorithms. With the exception of TRBP, all the other three algorithms perform similarly while LBP maintains a consistent superiority. This may look contradicting to the results reported in [39], in which they applied TRBP to optimize the ROSETTA energy function for sidechain prediction. However, in [39], they were interested in finding the sidechain configuration of the maximum probability and optimizing the energy function. Therefore, they applied maxproduct TRBP. Here, we are not interested in finding a final configuration. Instead, we want to estimate the marginal distribution of each sidechain accurately. Therefore, selecting the highest probability choice for each sidechain according to our library may not give a sidechain packing of the entire protein without atomic clash. Sumproduct TRBP turns out to perform poorly for this purpose. There are two possible reasons for the weak performance of TRBP. Firstly, we use a different energy function than [39]. Instead of using ROSETTA we use SCWRL energy function. Secondly, TRBP is not wellsuited for the sumproduct algorithm which we use. This can be further backed up by the fact that while LBP fails to converge for only 3% of all the input proteins, TRBP fails to converge on more that 20% input proteins.
Table 3. Comparison of inference algorithms
Discussion
We have demonstrated that by modeling protein structures by MRF and applying inference algorithms to estimate the marginal distributions of the sidechains, we can get a much more accurate rotamer library, which we refer to as proteindependent rotamer library. One may argue that although we do not use the global optimization/search algorithms, our method encodes the energy information. However, the energy information we used is mainly for the purpose of setting the potentials to build MRF, rather than for directing any search procedure. In this sense, the traditional backbonedependent rotamer libraries also encode the energy information, in another form. The traditional libraries are mainly based on the statistics of the solved protein structures, which are assumed to be the global minimum conformation of the natural energy function. Therefore, doing statistics on such structures also encode energy information. This is further confirmed by the facts that if highresolution protein structures are used to build the traditional libraries or if the core regions with high electridensity are used to do the statistics, the accuracy of the traditional libraries can be increased significantly [5]. Therefore, although our library uses energy functions in a more explicit way than the traditional libraries, it can be expected that the global search algorithms can still benefit a lot from our library. Since the source code for both SCWRL and TreePack are not publicly available, we can not directly encode our library into such global search methods. We are implementing our own deadendelimination, treedecomposition and other search algorithms, such as A* search, to test the performance of combining our library with global search algorithms.
Although probabilistic graphical models are a relatively new tool for protein structure modeling, they have already proved their efficacy. However, they are not immune from all kinds of drawbacks. In our use of belief propagation, it is not guaranteed that the inference algorithm will converge. We avoid this problem by setting maximum limit on the number of iterations. Nevertheless, for our dataset loopy belief propagation is able to converge within 100 iterations for around 97% input proteins. Moreover, for those cases where LBP fails to converge, we still have moderately good results. Thus, this limitation is not as much daunting as it first seems to be. Other deterministic methods also can suffer from errant input. For example, both TreePack [2,3] and Scwrl4 [5] use tree decomposition technique to employ exhaustive search strategy. However, there is still no guarantee that the tree width of the tree decomposition must be small. Therefore, it is also possible that for some large input proteins, such methods may also fail to produce results or find an approximate solution.
One important application of our method is sidechain prediction for flexible backbone conformations. In many applications, a large number of backbone structures are available, such as the protein structure sampling, protein structures gathered from different protein structure prediction servers, or protein backbone refinement tasks. In such cases, there are a large number of closetonative backbone structures, but none of them is the native structure. The traditional sidechain packing methods usually take only one single backbone structure as input, which cannot be applied here, because the set of structures contain important information about the native structure. Therefore, all of these closetonative structures should be considered simultaneously. Our method can easily take a set of flexible backbone structures as input. In this case, the backbone structures will also be modeled as random variables. The standard belief propagation algorithms can still be used to infer the marginal distributions for sidechain rotamers under the condition of flexible backbones.
Conclusion
In this paper, we have proposed a novel type of backbonedependent rotamer library, i.e., proteindependent rotamer library, which encodes structural information of all the spatially neighboring residues. By estimating the marginal distributions of the sidechains in a Markov random field model, the proposed library significantly boosts the accuracy of the input rotamer library, without global optimization or search. The proposed library can hopefully lead to the performance improvements of the sidechain prediction methods.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
XG initiated the idea. MSIB and XG wrote and revised the manuscript. MSIB carried out the experiments.
Acknowledgements
We thank anonymous reviewers whose suggestions improved the manuscript. We are grateful to Dunbrack lab for issuing us the academic licenses of both Scwrl4 and the backbone dependent rotamer libraries. We thank Jinbo Xu to make the TreePack executable version publicly available. This work is supported by a grant from King Abdullah University of Science and Technology.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 14, 2011: 22nd International Conference on Genome Informatics: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/12?issue=S14.
References

Dunbrack R Jr: Rotamer libraries in the 21st century.
Current Opinion in Structural Biology 2002, 12(4):431440. PubMed Abstract  Publisher Full Text

Xu J: Rapid protein sidechain packing via tree decomposition.

Xu J, Berger B: Fast and accurate algorithms for protein sidechain packing.
Journal of the ACM 2006, 53(4):533557. Publisher Full Text

Wang Q, Canutescu A, Dunbrack R Jr: SCWRL and MolIDE: Computer programs for sidechain conformation prediction and homology modeling.
Nature Protocols 2008, 3(12):18321847. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Krivov G, Shapovalov M, Dunbrack R Jr: Improved prediction of protein sidechain conformations with SCWRL4.
Proteins: Structure, Function and Bioformatics 2009, 77(4):778795. Publisher Full Text

Rohl C, Strauss C, Misura K, Baker D: Protein Structure Prediction Using Rosetta.
Methods in Enzymology 2004, 383:6693. PubMed Abstract

Liang S, Grishin NV: Sidechain modeling with an optimized scoring function.
Protein Science 2002, 11(2):322331. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Roitberg A, Elber R: Modeling side chains in peptides and proteins: Application of the locally enhanced sampling and the simulated annealing methods to find minimum energy conformations.
The Journal of chemical physics 1991, 95(12):9277 9287. Publisher Full Text

Street AG, Mayo SL: Intrinsic βsheet propensities result from van der Waals interactions between side chains and the local backbone.
Proceedings of the National Academy of Sciences of the United States of America 1999, 96(16):90749076. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mendes J, Nagarajaram HA, Soares CM, Blundell TL, Carrondo MA: Incorporating knowledgebased biases into an energybased sidechain modeling method: Application to comparative modeling of protein structure.
Biopolymers 2001, 59(2):72 86. PubMed Abstract  Publisher Full Text

Desmet J, De Maeyer M, Hazes B, Lasters I: The deadend elimination theorem and its use in protein sidechain positioning.
Nature 1992, 356(6369):539542. PubMed Abstract  Publisher Full Text

De Maeyer M, Desmet J, Lasters I: All in one: A highly detailed rotamer library improves both accuracy and speed in the modelling of sidechains by deadend elimination.
Folding and Design 1997, 2:5366. PubMed Abstract  Publisher Full Text

Canutescu A, Shelenkov A, Dunbrack R Jr: A graphtheory algorithm for rapid protein sidechain prediction.
Protein Science 2003, 12(9):20012014. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chazelle B, Kingsford C, Singh M: A semidefinite programming approach to side chain positioning with new rounding strategies.
INFORMS Journal on Computing 2004, 16(4):380392. Publisher Full Text

Kingsford C, Chazelle B, Singh M: Solving and analyzing sidechain positioning problems using linear and integer programming.
Bioinformatics 2005, 21(7):10281036. PubMed Abstract  Publisher Full Text

Zhang J, Gao X, Xu J, Li M: Rapid and accurate protein side chain prediction with local backbone information.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 2008, 4955 LNBI:285299.

Holm L, Sander C: Fast and simple Monte Carlo algorithm for side chain optimization in proteins: Application to model building by homology.
Proteins: Structure, Function and Genetics 1992, 14(2):213223. Publisher Full Text

Vasquez M: An evaluation of discrete and continuum search techniques for conformational analysis of side chains in proteins.
Biopolymers 1995, 36:5370. Publisher Full Text

Hwang J, Liao W: Sidechain prediction by neural networks and simulated annealing optimization.
Protein engineering 1995, 8(4):363370. PubMed Abstract  Publisher Full Text

Lee C, Subbiah S: Prediction of protein sidechain conformation by packing optimization.
Journal of Molecular Biology 1991, 217(2):373388. PubMed Abstract  Publisher Full Text

Dunbrack R Jr, Cohen F: Bayesian statistical analysis of protein sidechain rotamer preferences.
Protein Science 1997, 6(8):16611681. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dunbrack R Jr, Karplus M: Conformational analysis of the backbonedependent rotamer preferences of protein sidechains.
Nature Structural Biology 1994, 1(5):334340. PubMed Abstract  Publisher Full Text

Lovell S, Word J, Richardson J, Richardson D: The penultimate rotamer library.
Proteins: Structure, Function and Genetics 2000, 40(3):389408. Publisher Full Text

Shapovalov MV, Dunbrack RL Jr: A smoothed backbonedependent rotamer library for proteins derived from adaptive kernel density estimates and regressions.
Structure 2011, 19(6):844858. PubMed Abstract  Publisher Full Text

Bhat TN, Sasisekharan V, Vijayan M: An analysis of sidechain conformation in proteins.
International journal of peptide and protein research 1979, 13(2):170184. PubMed Abstract

Chandrasekaran R, Ramachandran GN: Studies on the conformation of amino acids. XI. Analysis of the observed side group conformation in proteins.
International journal of protein research 1970, 2(4):223233. PubMed Abstract

Benedetti E, Morelli G, Nmethy G, Scheraga HA: Statistical and energetic analysis of sidechain conformations in oligopeptides.
International journal of peptide and protein research 1983, 22:115. PubMed Abstract

Ponder JW, Richards FM: Tertiary templates for proteins. Use of packing criteria in the enumeration of allowed sequences for different structural classes.
Journal of Molecular Biology 1987, 193(4):775791. PubMed Abstract  Publisher Full Text

Kono H, Doi J: A new method for sidechain conformation prediction using a Hopfield network and reproduced rotamers.

De Maeyer M, Desmet J, Lasters I: All in one: A highly detailed rotamer library improves both accuracy and speed in the modelling of sidechains by deadend elimination.
Folding and Design 1997, 2:5366. PubMed Abstract  Publisher Full Text

Janin J, Wodak S, Levitt M, Maigret B: Conformation of amino acid side chains in proteins.
Journal of Molecular Biology 1978, 125(3):357386. PubMed Abstract  Publisher Full Text

McGregor MJ, Islam SA, Sternberg JE: Analysis of the relationship between slidechain conformation and secondary structure in globular proteins.
Journal of Molecular Biology 1987, 198(2):295310. PubMed Abstract  Publisher Full Text

Dunbrack RL Jr, Karplus M: Backbonedependent rotamer library for proteins. Application to sidechain prediction.
Journal of Molecular Biology 1993, 230(2):543574. PubMed Abstract  Publisher Full Text

Schrauber H, Eisenhaber F, Argos P: Rotamers: To be or not to be? An analysis of amino acid sidechain conformations in globular proteins.
Journal of Molecular Biology 1993, 230(2):592612. PubMed Abstract  Publisher Full Text

Peterson R, Dutton P, Wand A: Improved sidechain prediction accuracy using an ab initio potential energy function and a very large rotamer library.
Protein Science 2004, 13(3):735751. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dunbrack R Jr, Karplus M: Backbonedependent rotamer library for proteins. Application to sidechain prediction.
Journal of Molecular Biology 1993, 230(2):543574. PubMed Abstract  Publisher Full Text

Kamisetty H, Xing EP, Langmead CJ: Free energy estimates of allatom protein structures using generalized belief propagation.
Journal of Computational Biology 2008, 15(7):755766. PubMed Abstract  Publisher Full Text

Chu W, Ghahramani Z, Wild DL: A graphical model for protein secondary structure prediction.
2004, 161168.

Yanover C, Weiss Y: Approximate inference and proteinfolding.
Advances in Neural Information Processing Systems 2003, 15:14571464.

Yedidia JS, Freeman WT, Weiss Y: Constructing freeenergy approximations and generalized belief propagation algorithms.
IEEE Transactions on Information Theory 2005, 51(7):2282 2312. Publisher Full Text

3D Analysis: Dangle Software for Geometry Measurements [http://pibs.duke.edu/software/dangle.php] webcite

Moreland JL, Gramada A, Buzko OV, Zhang Q, Bourne PE: The Molecular Biology Toolkit (MBT): A modular platform for developing molecular visualization applications.