Abstract
Background
Accurately covering the conformational space of amino acid side chains is essential for important applications such as protein design, docking and high resolution structure prediction. Today, the most common way to capture this conformational space is through rotamer libraries  discrete collections of side chain conformations derived from experimentally determined protein structures. The discretization can be exploited to efficiently search the conformational space. However, discretizing this naturally continuous space comes at the cost of losing detailed information that is crucial for certain applications. For example, rigorously combining rotamers with physical force fields is associated with numerous problems.
Results
In this work we present BASILISK: a generative, probabilistic model of the conformational space of side chains that makes it possible to sample in continuous space. In addition, sampling can be conditional upon the protein's detailed backbone conformation, again in continuous space  without involving discretization.
Conclusions
A careful analysis of the model and a comparison with various rotamer libraries indicates that the model forms an excellent, fully continuous model of side chain conformational space. We also illustrate how the model can be used for rigorous, unbiased sampling with a physical force field, and how it improves side chain prediction when used as a pseudoenergy term. In conclusion, BASILISK is an important step forward on the way to a rigorous probabilistic description of protein structure in continuous space and in atomic detail.
Background
With the emergence of the first experimentally determined protein structures, the investigation of the conformations of their amino acid side chains began. It quickly became apparent that the χ (chi) dihedral angles  the main degrees of freedom in side chains  are not distributed freely, but tend to cluster around certain positions [13]. This behavior was already well known for dihedral bond angles in small organic molecules [4]. As more protein structures were solved to a high resolution, an increasingly accurate analysis of the adopted conformations became possible.
Today, the most common way to describe the conformational space of side chains is through socalled rotamer libraries: collections of representative side chain conformations, called rotamers [5]. These libraries are usually compiled from experimentally determined, high resolution protein structures by clustering the side chain conformations. Clusters of side chain angles thus obtained can then for example be represented by Gaussian distributions [6,7]. Traditionally, a rotamer in a library corresponds to the mean conformation of such a cluster and may be interpreted as a local energy minimum [5]. In recent studies, rotamer libraries with many thousands of rotamers were used [8,9]. Such libraries are built to cover the conformational space with high resolution.
The accurate description of side chain conformational space is an important ingredient of the solution to many biomolecular problems, such as protein design, docking and high resolution protein structure prediction. Over the past decades rotamer libraries have been successfully applied in all of these contexts. Arguably, the most well studied field is the prediction of side chain conformations for a fixed protein backbone; there is a manifold of programs in the public domain devoted to this problem [1014].
Rotamer libraries are usually applied as discrete collections of side chain conformations. Discretizing conformational space is a popular approximation to solve problems that are computationally expensive otherwise. For example, when assigning side chains in the densely packed protein core, discretizing the search space leads to exhaustive yet fast search strategies (such as dead end elimination) which can quickly find a global minimum solution [10,15,16]. However, in recent studies it was shown that a Markov chain Monte Carlo (MCMC) sampling scheme combined with a detailed rotamer library yields equally good or better results, which suggests that the combinatorial problem often poses no significant obstacle to finding a global minimum solution [8,13].
The discretization in rotamer libraries inherently leads to edge effects [17,18]. In docking problems for example, small differences in the side chain conformation may result in large differences in energy. Rotamers are often used as stiff building blocks; not considering the conformations in between rotamers may skip energetically favorable configurations [17,18]. There are three common heuristic strategies to tackle these issues. The first, which is most commonly used, is tweaking the energy functions; for example, by reducing the influence of mild steric clashes [11,13,19]. The second is to increase the number of rotamers in the library in order to improve its resolution [8,9,19]. Finally, one can combine rotamer based sampling with some form of continuous optimization [3,17]. Another problem associated with rotamer libraries is the occurrence of socalled nonrotameric states: side chain conformations that cannot be assigned to any of the standard gauche+, gauche or trans states [7]. Such conformations are typically missing in rotamer libraries and difficult to fill in correctly [20,21].
Here we present BASILISK, a dynamic Bayesian network [22] that formulates a probabilistic model of the conformational space of amino acid side chains. The model is generative: it allows sampling of plausible side chain conformations. BASILISK loosely stands for Bayesian network model of side chain conformations estimated by maximum likelihood. BASILISK represents all relevant variables in continuous space and thus avoids the problems that are due to the discretization used in rotamer libraries. Furthermore, BASILISK incorporates the ø (phi) and ψ (psi) angles. This makes it possible to condition the sampling upon the residue's backbone conformation, which is known to exert a strong influence on the side chain's conformation [3]. BASILISK represents all amino acids in a single probabilistic model, which is an entirely novel way of attacking this problem. Such an approach corresponds to a powerful machine learning technique called multitask or transfer learning, which often leads to better models with fewer parameters [23,24].
We first describe the BASILISK model, then proceed to evaluate its performance, and conclude with illustrating some potential applications.
Results and discussion
Parameterization
For our purposes, bond angles and bond lengths in amino acid side chains can be considered as fixed to their ideal values, as they show only very small variations [25]. This leaves the rotations around the bonds  the χ dihedral angles  as the main degrees of freedom. Accordingly, the sequence of χ angles is a good parameterization of the conformation of a given side chain. The same parameterization of the conformational space is also used in most popular rotamer libraries [6,7]. The number of angles necessary to describe a side chain conformation varies between zero and four for the 20 different standard amino acid types. Figure 1 illustrates the dihedral angles for glutamate.
Figure 1. Dihedral angles in glutamate: Dihedral angles are the main degrees of freedom for the backbone (ϕ and ψ angles) and the side chain (χ angles) of an amino acid. The number of χ angles varies between zero and four for the 20 standard amino acids. The figure shows a ballandstick representation of glutamate, which has three χ angles. The fading conformations in the background illustrate a rotation around χ_{1}. The figure was made using PyMOL http://www.pymol.org webcite.
In previous studies, it has been shown that the side chain conformation correlates highly with the conformation of the backbone [6]. For the backbone, the two main degrees of freedom are the ϕ and ψ dihedral angles, which when plotted against each other result in the celebrated Ramachandran plot [26]. We incorporate ϕ and ψ angles in BASILISK to be able to condition the sampling on the backbone conformation.
BASILISK: a generative model of side chain conformations
Bayesian networks are graphical models that determine the possible factorizations of the joint probability distribution of a set of random variables [2729]. A Bayesian network is a graph in which the nodes represent the random variables, and the edges encode their conditional independencies. The graph is directed and acyclic, and is often chosen based on prior, expert knowledge. If sequences of variables are modeled, the models are called dynamic Bayesian networks (DBNs) [22]. Here, dynamic originally referred to time sequences, such as speech signals, but arbitrary sequences can be modeled. Each position in the sequence is called a slice. The sequences represented by a DBN are allowed to have different lengths; a property which we use to our advantage, as explained below.
BASILISK was implemented as a DBN, and its parameters were estimated by a maximum likelihood method using a data set derived from more than 1500 crystal structures (see Data sets for training and testing). The model is shown in Figure 2.
Figure 2. The BASILISK dynamic Bayesian network: The network shown represents an amino acid with two χ angles, such as for example histidine. In this case, the DBN consists of four slices: two slices for the ϕ, ψ angles, followed by two slices for the χ angles. Sampling a set of χ angles is done as follows. First, the values of the input nodes (top row) are filled with bookkeeping indices that determine both the amino acid type (for example histidine) and the labels of the angles (for histidine, ϕ, ψ followed by χ_{1 }and χ_{2}). In the next step, the hidden node values (middle row, discrete nodes) are sampled conditioned upon the observed nodes. These observed nodes always include the index nodes (top row, discrete nodes), and optionally also the ϕ, ψ nodes (first two nodes in the bottom row) if the sampling is conditioned on the backbone. Finally, a set of χ angles is drawn from the von Mises nodes (bottom row), whose parameters are specified by the sampled values of the hidden nodes.
We briefly describe the most important features of the model. For each slice, an input node (which is a discrete variable) indicates which angle and amino acid type are modeled. For example, 3 indicates the χ_{1 }angle of aspartate, 4 indicates the χ_{2 }angle of aspartate, and so forth. We recently used a similar approach to formulate a probabilistic model of RNA structure [30].
The natural manifold of all the dihedral angles combined is the hypertorus: a torus with dimension three or higher. BASILISK constructs a probability distribution on this manifold in the following way: the model has a single output node per slice, in the form of a von Mises distribution [31], which can be considered the circular equivalent of a Gaussian distribution. The von Mises distribution takes the circular nature of the data into account: a dihedral angle in [ π, π[ is naturally represented as a point on the circle. The von Mises distribution belongs to the realm of directional statistics [31]: the statistics of angles, directions and orientations. We previously developed probabilistic models of RNA and protein structure based on the combination of DBNs and directional statistics [30,32,33]. The von Mises distribution was also used previously in a seminal study on probabilistic models of the protein backbone in terms of the ϕ and ψ angles [34], and in a preliminary study on clustering of side chainrotamers [35].
The problem of representing amino acid types that differ in the number of χ angles within one model is elegantly solved by using a DBN with a variable number of slices. There are two slices for the ϕ and ψ angles, followed by a sequence of slices that represent the χ angles. For example, to model glutamine, which has three χ angles in its side chain, one slice is added to the DBN that is shown in Figure 2.
The dependencies between the input nodes (which specify the amino acid type and the angles) and the output nodes (which specify the values of the angles) are mediated by a sequence of interconnected, discrete hidden nodes. The values of these nodes are never observed: their technical purpose is to model the dependencies between the amino acid type and the χ angles on the one hand, and the sequential dependencies between the χ angles on the other hand. The hidden nodes are socalled nuisance variables, that are integrated away in parameter estimation, sampling and inference. It should be noted that the hidden nodes thus introduce dependencies between all angles, and not just between two consecutive angles  a common misconception.
Our aim was to describe all side chain and backbone angles for all different amino acid types in a single probabilistic model. This approach is known as multitask or transfer learning in the field of machine learning [23,24] and has several advantages. As the same set of distributions is used to model all amino acids (and all angles), it leads to a lower amount of free parameters. Moreover, it makes "knowledge transfer" possible between amino acids with similar conformational properties during training [23,24]. Finally, for rotamer libraries, one needs to determine the optimal number of rotamers for each amino acid type separately, while in our approach, only the size of the hidden node needs to be determined. This can be done using an established statistical procedure for model selection (see Model training).
Several other network architectures were discarded during model selection as they could not adequately capture the joint distributions, which illustrates the importance of the network's architecture. Neither regular hidden Markov models, nor various mixture models produced satisfying results (data not shown). A similar observation was made for the probabilistic model of RNA structure mentioned previously [30].
It is important to note that the model as depicted in Figure 2 represents a single amino acid. When assigning side chains to an entire protein backbone, the model is simply applied to each position in the chain. Sampling conformations from BASILISK is fast: generating 50,000 arginine side chain conformations takes about two seconds on an average desktop computer (Intel Core 2, at 2.4 GHz).
Probability distributions
The joint probability distribution encoded by the Bayesian network is
where is the sequence of χ angles, is the angle index information which includes the amino acid type, ϕ and ψ are the backbone angles, and is the sequence of hidden node values.
For most purposes, the conditional and marginal distributions are of interest. BASILISK allows backbone dependent sampling of the χ angles by conditioning on the ϕ and ψ angles. The conditional probability distribution for this case is given by:
where the sums run over all possible hidden node sequences .
The marginal, conditional probability distribution for the backbone independent case is given by:
where the sum again runs over all possible hidden node sequences . The backbone angles are thus simply disregarded in this calculation.
The marginal, conditional probabilities P(χH) for a single angle are represented by von Mises distributions (see material and methods). The parameters of the von Mises distribution are specified by the value of the hidden node, H.
As BASILISK is a simple extension of a hidden Markov model, all relevant joint and marginal probabilities can be calculated in a straightforward manner using the forward algorithm [30,36].
Initial analysis
For generative probabilistic models, an obvious first quality check is visual inspection. Accordingly, we start by investigating whether BASILISK captures the angular preferences found in the training data, before performing a more rigorous and indepth analysis.
For these first tests, we generated over 300,000 samples with the same amino acid composition as the training set. Figure 3 compares the marginal angular distributions of the training set with those of the BASILISK samples for arginine and lysine. We show plots for arginine and lysine because they are the only amino acids with four χ angles; they were most difficult to capture accurately with alternative models (data not shown). A comparison of all remaining relevant amino acids is available as additional material (Additional files 1, 2, 3, Figures S1S3).
Figure 3. Univariate histograms for lysine and arginine: The histograms marked "Training" represent the training set. The histograms marked "BASILISK" represent BASILISK samples. For each amino acid, all histograms are plotted on the same scale.
Additional file 1. Univariate histograms for all amino acids with one χ angle. Histograms marked "Training" were generated from the training set; histograms marked "BASILISK" were generated from BASILISK samples.
Format: PDF Size: 308KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 2. Univariate histograms for all amino acids with two χ angles. Histograms marked "Training" were generated from the training set; histograms marked "BASILISK" were generated from BASILISK samples.
Format: PDF Size: 1.2MB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 3. Univariate histograms for all amino acids with three χ angles. Histograms marked "Training" were generated from the training set; histograms marked "BASILISK" were generated from BASILISK samples.
Format: PDF Size: 649KB Download file
This file can be viewed with: Adobe Acrobat Reader
As a second test, we inspected pairwise histograms of the χ_{1 }and χ_{2 }angles for all relevant amino acids. These plots provide an indication of whether the model captures the correlation between two angles correctly. Again we compare samples from BASILISK model with the training data.
Figure 4 shows the pairwise histograms for isoleucine. In the univariate, marginal plots for isoleucine, we observe three peaks for each χ angle. Hence, one might expect nine peaks with varying density in the bivariate plot. However, the training data only shows five major peaks. The comparison of the plots indicate that these features are indeed captured. Plots for the remaining 13 amino acid that have at least two χ angles are available as additional material (Additional files 4, 5, 6, Figures S4S6).
Figure 4. χ_{1 }versus χ_{2 }histogram for isoleucine: The twodimensional histogram of χ_{1 }(xaxis) against χ_{2 }(yaxis) illustrates the association between the two angles. Their univariate, marginal distributions are shown as well, attached to the respective axes. The histogram marked "Training" represents the training set, while the histogram marked "BASILISK" represents BASILISK samples.
Additional file 4. χ_{1 }versus χ_{2 }histograms for histidine, phenylalanine, glutamate, aspartate and proline. Histograms marked "Training" were generated from the training set; histograms marked "BASILISK" were generated from BASILISK samples.
Format: PDF Size: 760KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 5. χ_{1 }versus χ_{2 }histograms for asparagine, leucine, methionine and lysine. Histograms marked "Training" were generated from the training set; histograms marked "BASILISK" were generated from BASILISK samples.
Format: PDF Size: 602KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 6. χ_{1 }versus χ_{2 }histograms for arginine, tyrosine, tryptophan and glutamine. Histograms marked "Training" were generated from the training set; histograms marked "BASILISK" were generated from BASILISK samples.
Format: PDF Size: 586KB Download file
This file can be viewed with: Adobe Acrobat Reader
The results of the visual inspections provide a strong indication that the model behaves as desired. We now move on to a more qualitative, indepth analysis.
BASILISK concurs with a standard rotamer library
In the following tests, we compare BASILISK to several rotamer libraries, as these are the method of choice to explore the conformational space of amino acid side chains for many purposes.
We use the Dunbrack backbone independent library [6] (bbind02.May.lib) as a representative rotamer library, as it is widely used and based on rigorous statistical analysis of high quality protein structures. In the construction of the Dunbrack library, it was assumed that the distribution of each angle in a rotamer follows a Gaussian distribution. Accordingly, rotamers in this library are reported as a sequence of Gaussian distributions  one for each χ angle. This allows us to evaluate the probability density value for any given side chain conformation according to the Dunbrack library.
For all calculations we used the BASILISK model with the backbone angles marked as unobserved  which results in a backbone independent likelihood  for fair comparison with the backbone independent Dunbrack rotamer library.
As a first test, we determine whether the Dunbrack rotamer library and BASILISK report similar probabilities for the same conformations. We calculated the loglikelihood of the side chain conformations according to the Dunbrack backbone independent rotamer library and according to BASILISK for all side chains in the test set (see Data sets for training and testing). The results show that the two methods indeed correlate very well (Pearson correlation coefficient is 0.88). Figure 5 shows a scatterplot of the loglikelihood values for all rotamer conformations in the Dunbrack library according to the library itself, and according to BASILISK. Again we find a very good correlation (Pearson correlation coefficient is 0.91). Outliers, especially in the low probability region, are limited to very rare rotamer conformations with little to no observations according to the Dunbrack library.
Figure 5. Comparison between BASILISK and a standard rotamer library: We calculated the loglikelihood for every rotamer in the Dunbrack backbone independent rotamer library according to the Gaussian model of the library itself (yaxis), and according to BASILISK (xaxis). The Pearson correlation coefficient is 0.91.
In the following paragraphs we take a more detailed look at the probability distributions encoded in the BASILISK model, and compare with several standard rotamer libraries. For this analysis we again use the Dunbrack library, but also include two additional backbone independent libraries [7,37].
The KullbackLeibler (KL) divergence is a standard measure of the similarity between two probability distributions [28,38]. Here, the KL divergence is used to compare the quality of the rotamer libraries versus BASILISK. We compare BASILISK and the rotamer libraries with the experimental data in the test, which is here used as a reference or truth model. We then calculate the relative KL divergence between BASILISK and each library (see KullbackLeibler divergence). Table 1 shows the results for each amino acid. Positive numbers indicate that BASILISK models the conformations in the test set more accurately than the method compared to, while negative numbers indicate that BASILISK is less accurate.
Table 1. KullbackLeibler divergence analysis.
Overall, BASILISK captures the conformational preferences of amino acid side chains more accurately than any of the rotamer libraries, with few exceptions. For example, the Dunbrack library performs better for tryptophan. We speculate this is due to the relatively low abundance of tryptophan in the training data (< 2%), and the fact that we are training an all amino acid model. An amino acid with few observed data points will have a smaller impact on the estimated parameters than an amino acid that was presented more often during training. The relatively low performance of the Lovell library can be explained by the small number of rotamers in the library. Especially for long side chains such as glutamine, lysine or arginine, too many rotamers appear to be missing.
Note that only libraries reporting both mean and variance for each χ angle are suitable for this analysis. Many rotamer libraries are mere discrete sets of possible conformations [8,9]: they cannot be evaluated in the same, rigorous way.
BASILISK captures the backbone's influence
It is well known that there is a strong correlation between backbone and side chain conformations [3]. This backbone dependency is captured by the BASILISK model by incorporating the backbone's two main degrees of freedom: the ϕ and ψ dihedral angles.
To the best of our knowledge, BASILISK is the first model that captures this correlation in continuous space; that is, without resorting to the usual discretizations of the conformational space. Comparison with other models is therefore difficult. Notably, a comparison with backbone dependent rotamer libraries based on the KL divergence is not possible due to their lack of an expression for the joint probability of the backbone and the side chain angles. However, we do present two decisive tests. First, we resort to visual inspection, followed by a quantitative evaluation based on the KLdivergence between backbone dependent and independent BASILISK. In the next section, we evaluate the influence of the backbone information on side chain sampling for a fixed backbone.
Figure 6 shows a comparison between the angular preferences observed in the training data and those found in BASILISK samples. For this comparison, we generated BASILISK samples and binned them according to their backbone angle values. The bins chosen in the plot are well populated in the training set, and illustrate the drastic effect that the backbone conformation can have on the side chain's conformational space. Certain peaks are almost entirely missing for certain regions of the Ramachandran plot. These observations provide a first indication that BASILISK captures the correlation with the backbone well.
Figure 6. Backbone dependency of the χ_{1 }angle of aspartate: The χ_{1 }histograms for different areas of the Ramachandran plot indicate a strong correlation between the backbone and the side chain conformations. For some regions certain peaks disappear almost entirely. Histograms marked with "Training" represent the training set, and histograms marked "BASILISK" represent BASILISK samples. A, B and C show the histograms for the areas indicated by the three boxes, while D shows the histogram over the entire space. Note that the indicated binning of the backbone space is for visualization only, as BASILISK does not rely on discretization of the conformational space.
For a quantitative evaluation, we turn again to the KLdivergence; this time, to compare BASILISK with and without backbone dependency. As in the previous test, where BASILISK was compared to different rotamer libraries, this test determines which model is closer to the theoretical truth model as embodied by the test set (see Data sets for training and testing). Table 1 (last column) shows the differences between the KL divergences for all relevant amino acids. The negative numbers indicate that the backbone dependent model resembles the experimental data most. This indicates that BASILISK indeed captures the influence of the backbone, and that incorporating backbone information improves the accuracy for all amino acids. This conclusion is also supported by the results described in the next section.
Application: side chain sampling for a fixed backbone
In the previous sections, we evaluated the probability distributions encoded in BASILISK. We now proceed to evaluate the applicability of the model. As a proof of concept we implemented a side chain prediction algorithm that assigns side chains to a fixed protein backbone. Xiang and Honig [8] showed that given a large rotamer library, an MCMC approach leads to good accuracy in predicting buried side chain conformations. Following their study, we combine side chain sampling from BASILISK with an MCMC method and an unmodified LennardJones potential [39]. The main purpose of the potential is to avoid steric clashes between the side chains. The parameters for the potential were derived from the OPLS force field as implemented in Tinker [40,41].
For the evaluation, we considered a χ angle within ± 20° of its value in the crystal structure as correct [8,12]. We only included buried side chains in the protein core in the evaluation, because exposed side chains often have few steric restraints and we did not use any energy term that accounts for the solvent interaction. However, all side chains were included in the MCMC procedure and none were fixed, in order not to bias the simulation towards the native state. The test set contained 43 single chain crystal structures that were used to evaluate side chain prediction algorithms in other studies [8,9,42].
To avoid getting trapped in local minima, we resampled three random residues at a time [11]. This effectively enables side chains to swap their positions. After a fixed set of MCMC steps, the lowest energy structure was selected as the final prediction. For details on the energies and the sampling see Sampling strategy and energies.
BASILISK is a true probabilistic model: it can be used as a proposal distribution to implement MCMC methods that respect detailed balance, since its exact contribution can be taken into account. The first test uses a LennardJones potential as the only energy component, and brings in BASILISK solely as a proposal function. We use the BASILISK model to propose new side chain conformations, which are subsequently accepted or rejected based on their energy (see Sampling strategy and energies). Table 2b shows that already with this very simple approach, we reach a quite reasonable performance: more than 87% of the χ_{1 }angles were correctly predicted (Table 2b).
Table 2. Side chain placement for a fixed backbone.
In a second test, we use the BASILISK likelihood as a pseudoenergy component [43], combined with the LennardJones potential. Table 2c shows that the prediction quality increases (by about 2% for χ_{1}). Incorporating backbone information in BASILISK (Table 2d) increases the performance compared to the backbone independent (by more than 2%) and the LennardJones only case (by more than 4%). It should be noted that none of these tests include energies that evaluate key features such as hydrogen bonds, saltbridges or electrostatic interactions.
For comparison, Table 2e also reports the prediction accuracy of IRECS [12] and SCWRL 4 [14], which are two leading programs in the field. In both cases, the results are on a par with BASILISK. SCWRL 4 and IRECS are specialized programs optimized for both prediction accuracy and speed: they use fine tuned and optimized force fields combined with backbone dependent rotamer libraries. The comparison serves to illustrate the quality of BASILISK as a probabilistic model; with respect to computational performance, SCWRL 4 and IRECS (whose runtime is typically seconds to minutes) are clearly superior to the unoptimized application of BASILISK (where we sampled for several hours to ensure convergence).
Conclusions
In this paper, we introduce a generative, probabilistic model of the conformational space of side chains that allows sampling of realistic, nativelike side chain angles in continuous space. BASILISK incorporates a continuous backbone dependency, which to the best of our knowledge is entirely novel. Another unique feature is that BASILISK represents all relevant natural amino acids within one model. This powerful approach is known as multitask or transfer learning in the field of machine learning, and comes with several advantages.
In the first tests we showed that BASILISK is able to accurately capture the angular preferences found in proteins. Using the KL divergence, we confirmed that the model compares favorably with several standard rotamer libraries.
BASILISK also captures the effect of the backbone conformation on the side chain, and samples realistic angular distributions for various areas of the Ramachandran plot. Again using the KL divergence, we confirmed that including backbone information leads to more accurate results.
As a proof of concept, we implemented a simple side chain prediction method that assigns side chains to a fixed protein backbone, using an MCMC sampling scheme and an unmodified LennardJones potential as energy function. The results show the applicability of combining a continuous description of conformational space with a detailed energy function. Adding BASILISK as an explicit pseudoenergy term improves the prediction results. Best results are obtained when the backbone conformation is taken into account.
Specialized side chain prediction programs combine highly tuned energy functions with an efficient exploitation of the discrete nature of rotamers. This makes them very fast, yet accurate. BASILISK as such does not compete with those programs, but provides a solution where a detailed, continuous description of conformational space is required. The possibility to combine an unmodified, standard force field with BASILISK opens great possibilities for applications such as docking, structure prediction or protein design. For example, the calculation of side chain entropies is important for tasks such as protein quality assessment or protein design [4446]. BASILISK can facilitate these calculations  which are now often performed with discrete rotamer libraries  in continuous space.
BASILISK, when combined with our previously described probabilistic model of the protein backbone (TorusDBN [33]), can be used to sample protein structures in continuous space and in full atomic detail. An obvious potential application is to sample protein conformations by combining TorusDBN and BASILISK with a physical force field, with applications in protein structure prediction, simulation and design.
BASILISK illustrates the enormous potential and increasing importance of probabilistic models and probabilistic machine learning methods in structural bioinformatics [19,30,33,44,47,48]. We believe that BASILISK is an excellent solution for problems that require going beyond discrete representations of amino acid side chains in protein structure simulation, prediction and design.
Methods
Data sets for training and testing
For training, we used a high quality dataset obtained from the PISCES server [49]. The PISCES server can be used to select a set of structures from the Protein Data Bank that respect quality thresholds regarding resolution or Rvalue. We imposed a resolution of 1.6 Å or better, an Rvalue of 0.25 or better, and a pairwise sequence similarity cutoff of 25%. Subsequently, all structures were removed that had a sequence similarity higher than 25% with any of the structures used in the fixed backbone prediction test (see next paragraph).
The resulting set consisted of 1703 crystal structures. All structures were processed using the REDUCE program [50], which automatically corrects some common errors in protein structures (such as histidines with a flipped ring). From this initial set retrieved from the PISCES server, we randomly picked 10% of the structures and reserved these for testing. The test set contained 31,229 side chains from 171 structures. The training set contained angles from 277,975 residues from 1532 structures. The angles were calculated using Biopython's Bio.PDB module [51,52].
Data set for fixed backbone prediction
The data set for the fixed backbone prediction consisted of crystal structures that were previously used in three other studies for the same purpose [8,9,42].
We removed structures with a very small core (less than 10 fully buried residues) and structures with multiple chains or chain breaks, resulting in 43 remaining structures. The input files for the test only retained the backbone atoms; all side chain atoms and heteroatoms were removed.
We used DSSP [53] to calculate the relative accessible surface area in order to determine the solvent exposure state of a residue. We only included fully buried residues for which the accessible surface area is zero in our analysis, as we did not use any energy term accounting for the solvent interaction. However, all side chains were included in the experiments, in order to avoid bias towards the native structure.
Model training
The BASILISK model was implemented and trained using our freely available Mocapy ++ toolkit [54]. The model's parameters were estimated using the stochastic expectation maximization (EM) algorithm [55].
The number of hidden node values (that is, the size of the hidden node) is a hyperparameter that has to be determined separately. For that purpose, we trained 5 models for each hidden node size (15, 20, 25, 30, 35, 40 and 45) and choose the model with the lowest Akaike information criterion (AIC) score [56]. The AIC score is defined as
where L(Θ) is the likelihood of the trained model Θ given the observations , and k is the number of free parameters. The AIC is a measure of the estimated KL divergence between the trained model and a fictitious truth model which generated the training data [56]. In our case, the AIC points towards a model with 30 hidden node states (see Figure 7). The number of relevant, nonezero parameters for the selected model is 9871.
Figure 7. Selecting the optimal model: The Akaike information criterion (AIC) is used to determine the optimal number of hidden node values. The AIC score (y axis) points towards a model with 30 hidden node values (xaxis). The optimal model is shown as a filled circle.
The model also includes backbone information, which usually leads to a quick explosion of parameters due to discretization. For example, dividing the Ramachandran plot in bins of 10° by 10° creates 1296 bins, which combined with 18 rotameric amino acids and several rotamers per amino acid can quickly lead to data sparseness in training. This problem is avoided by treating the data in a natural, continuous space.
Modeling the angles
We use the von Mises distribution [31] to model both the backbone and the side chain angles. This distribution can be considered as the circular equivalent of the univariate Gaussian distribution. As an angle can be represented as a point on the circle, it is the correct approach to use a probability distribution that takes the circularity of the data into account. The probability density function of the von Mises distribution has the following form:
where x ∈ [ π, π[ is the random angle, μ is the mean, κ ≥ 0 is a concentration parameter and I_{0 }is the modified Bessel function of the first kind and of order zero.
Probability densities from rotamer libraries
For the calculation of the KL divergences, we need the probability density of a conformation according to the different rotamer libraries. According to a typical rotamer library, the probability density of a given angle sequence is:
where is the sequence of χ angles, A is the amino acid type, P(R_{A}) is the probability of the rotamer R of amino acid type A, and P(R_{A}) is the probability of the sequence of χ angles for rotamer R of amino acid type A. The sum runs over all defined rotamers for the amino acid A in a rotamer library.
Each individual angle χ_{n }in a rotamer R_{A }is distributed according to a Gaussian distribution , with mean and standard deviation . Hence, in the above expression, the conditional probability density of the sequence of angles is equal to the product of the probability densities of the individual angles:
The product runs over all n χ angles.
KullbackLeibler divergence
In the evaluation of the model, we used the KL divergence [38] to compare BASILISK with various rotamer libraries. For two probability density functions, the KL divergence is defined as
where p is usually the empirical data or a truth model and q usually represents a probabilistic model that approximates p. The KL divergence is nonnegative and only becomes zero if p = q.
In order to evaluate whether BASILISK captures the angular preferences more accurately than a standard rotamer library, we calculate the difference between the KL divergence from the experimental data to a rotamer library, and the KL divergence from the experimental data to BASILISK:
where q_{R }is the probability distribution associated with the rotamer library, q_{B }is the BASILISK model and p is given by the experimental data. In order to calculate this difference, we use the fact that the relative KL divergence can be expressed as a statistical expectation [30,56] leading to:
where is the expectation with respect to p. In our case, the empirical distribution p is a set of n observations x_{1}, x_{2}, ... x_{n}. Hence, we can calculate the expectation by averaging over these observations:
For the backbone dependent case, a fair evaluation based on the KL divergence is not possible: rotamer libraries do not allow the calculation of the joint probability of the x , ϕ and ψ angles.
Sampling strategy and energies
For the fixed backbone test, we used a LennardJones 612 potential. The energy for an atom pair is:
where d is the distance between two atoms, ε is the depth of the attractive well and σ is the optimal distance between two atoms based on their van der Waals radii. The atom dependent ε and σ values were taken from the OPLS/Tinker parameter set [40,41]. Subsequently, we used Boltzmann's law to turn the energies into probabilities:
where E_{LJ}() is the LennardJones energy of conformation , R is the ideal gas constant and T is the temperature. We set the temperature to room temperature, which results in for all calculations.
For sampling, we use the classic MetropolisHastings MCMC approach [28]:
where is the probability of accepting to replace with ; and are the probabilities of and according to the target distribution; and and are the probabilities to move to state from state and to state from state according to the proposal distribution.
In the first test, we use the LennardJones term as the only energy component, and use BASILISK solely as a proposal distribution:
where is the LennardJones derived probability of as before, and is the product of the probabilities of the side chains in according to BASILISK. Note that in this case, we sample according to the LennardJones potential in an unbiased way.
For the second test, we included BASILISK as an explicit pseudoenergy term. We approximate the probability of seeing a certain state as the product of the LennardJones derived probability, , with the probability of the side chain conformation according to BASILISK, :
Because BASILISK is used both as a proposal distribution and a pseudoenergy term, the MetropolisHasting expression reduces to:
For the backbone dependent experiments, is replaced in the above expressions by , where and are the (fixed) backbone angles of .
We proposed three new side chain conformations at random sequence positions in every iteration. This effectively enables side chains to swap places, which is important to solve the combinatorial problem in the densely packed protein core. The backbone independent sampling is done using ancestral sampling [28]. In the backbone dependent case, sampling is done using the forwardbacktrack algorithm [30,32,33,57].
In the fixed backbone experiments, we used 500,000 MCMC iterations. For the final prediction, we selected the structure with the highest probability (or, equivalently, lowest energy). In the evaluation, we considered a χ angle within 20° of the angle observed in the crystal structure as correct.
Availability
A Python implementation of BASILISK, including all the parameters of the optimal model, is freely available for download at https://sourceforge.net/projects/basiliskdbn/ webcite. The source code provides a class that implements the main features of BASILISK, namely sampling side chain angles and evaluating the probability of a given set of angles. The package also includes example scripts that show how to use the BASILISK module. The first script parses a Protein Data Bank file [58] using Biopython's Bio.PDB package [51,52], retrieves all the side chains and calculates the likelihood of their χ angles. A second script samples side chain angles for a given amino acid type and optionally a set of backbone angles. For a full description of the features, please refer to the provided manual.
Authors' contributions
TiH carried out the study. WB assisted in implementing the MCMC framework. MP and JF assisted in using the Mocapy++ toolkit. KEJ implemented the OPLS force field. TH conceived the study. TH and TiH wrote the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We thank our colleagues at the Bioinformatics Centre (University of Copenhagen) for valuable comments and suggestions. We also thank Jesper FerkinghoffBorg (DTU Elektro, Technical University of Denmark), Kanti Mardia (University of Leeds, UK), Thomas Poulsen (Novozymes) and Leonardo De Maria (Novozymes) for helpful discussions. T. Harder, J. Frellsen and M. Paluszewski are funded by the Danish Council for Strategic Research (NaBiIT, 2106060009). W. Boomsma is funded by the Danish Council for Independent Research (FNU, 272080315). KE. Johansson is funded by the Danish Council for Independent Research (FTP, 274080124).
References

Chandrasekaran R, Ramachandran GN: Studies on the conformation of amino acids. XI. Analysis of the observed side group conformation in proteins.
Int J Protein Res 1970, 2:223233. PubMed Abstract

Ponder JW, Richards FM: Tertiary templates for proteins. Use of packing criteria in the enumeration of allowed sequences for different structural classes.
J Mol Biol 1987, 193:775791. PubMed Abstract  Publisher Full Text

Dunbrack RL, Karplus M: Backbonedependent rotamer library for proteins. Application to sidechain prediction.
J Mol Biol 1993, 230:543574. PubMed Abstract  Publisher Full Text

Eyring H: Steric hindrance and collision diameters.
J Am Chem Soc 1932, 54:31913203. Publisher Full Text

Dunbrack RL: Rotamer libraries in the 21st century.
Curr Opin Struct Biol 2002, 12:431440. PubMed Abstract  Publisher Full Text

Dunbrack RL, Cohen FE: Bayesian statistical analysis of protein sidechain rotamer preferences.
Protein Sci 1997, 6:16611681. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lovell SC, Word JM, Richardson JS, Richardson DC: The penultimate rotamer library.
Proteins 2000, 40:389408. PubMed Abstract  Publisher Full Text

Xiang Z, Honig B: Extending the accuracy limits of prediction for sidechain conformations.
J Mol Biol 2001, 311:421430. PubMed Abstract  Publisher Full Text

Peterson RW, Dutton PL, Wand AJ: Improved sidechain prediction accuracy using an ab initio potential energy function and a very large rotamer library.
Protein Sci 2004, 13:735751. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Canutescu AA, Shelenkov AA, Dunbrack RL: A graphtheory algorithm for rapid protein sidechain prediction.
Protein Sci 2003, 12:20012014. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Jain T, Cerutti DS, McCammon JA: Configurationalbias sampling technique for predicting sidechain conformations in proteins.
Protein Sci 2006, 15:20292039. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hartmann C, Antes I, Lengauer T: IRECS: A new algorithm for the selection of most probable ensembles of sidechain conformations in protein models.
Protein Sci 2007, 16:12941307. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lu M, Dousis AD, Ma J: OPUSRota: A fast and accurate method for sidechain modeling.
Protein Sci 2008, 17:15761585. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Krivov GG, Shapovalov MV, Dunbrack RL: Improved prediction of protein sidechain conformations with SCWRL4.
Proteins 2009, 77:778795. PubMed Abstract  Publisher Full Text

Desmet J, DeMayer M, Hazes B, Lasters I: The deadend elimination theorem and its use in protein sidechain positioning.
Nature 1992, 356:539542. Publisher Full Text

Desmet J, Spriet J, Lasters I: Fast and accurate sidechain topology and energy refinement (FASTER) as a new method for protein structure optimization.
Proteins 2002, 48:3143. PubMed Abstract  Publisher Full Text

Wang C, SchuelerFurman O, Baker D: Improved sidechain modeling for proteinprotein docking.
Protein Sci 2005, 14:13281339. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Grigoryan G, Ochoa A, Keating AE: Computing van der Waals energies in the context of the rotamer approximation.
Proteins 2007, 68:863878. PubMed Abstract  Publisher Full Text

Yanover C, SchuelerFurman O, Weiss Y: Minimizing and learning energy functions for sidechain prediction.
Lect Notes Comput Sci 2007, 381395. Publisher Full Text

Schrauber H, Eisenhaber F, Argos P: Rotamers: to be or not to be? An analysis of amino acid sidechain conformations in globular proteins.
J Mol Biol 1993, 230:592612. PubMed Abstract  Publisher Full Text

Petrella RJ, Karplus M: The energetics of offrotamer protein sidechain conformations.
J Mol Biol 2001, 312:11611175. PubMed Abstract  Publisher Full Text

Ghahramani Z: Learning dynamic Bayesian networks.
Lect Notes Comput Sci 1998, 1387:168197. Publisher Full Text

Caruana R: Multitask learning.
Mach Learn 1997, 28:4175. Publisher Full Text

Pan SJ, Yang Q: A survey on transfer learning.
IEEE Trans Knowl Data Eng 2009, in press. PubMed Abstract  PubMed Central Full Text

Engh RA, Huber R: Accurate bond and angle parameters for Xray protein structure refinement.
Acta Crystallogr A 1991, 47:392400. Publisher Full Text

Ramachandran GN, Ramakrishnan C, Sasisekharan V: Stereochemistry of polypeptide chain configurations.
J Mol Biol 1963, 7:9599. PubMed Abstract  Publisher Full Text

Pearl J: Probabilistic reasoning in intelligent systems. Morgan Kaufmann; 1988.

Bishop CM: Pattern recognition and machine learning. Springer; 2006.

Needham CJ, Bradford JR, Bulpitt AJ, Westhead DR: A primer on learning in Bayesian networks for computational biology.
PLoS Comput Biol 2007, 3:e129. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Frellsen J, Moltke I, Thiim M, Mardia KV, FerkinghoffBorg J, Hamelryck T: A probabilistic model of RNA conformational space.
PLoS Comput Biol 2009, 5:e1000406. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mardia KV, Jupp PE: Directional statistics. John Wiley and Sons, New York, USA; 2000.

Hamelryck T, Kent JT, Krogh A: Sampling realistic protein conformations using local structural bias.
PLoS Comput Biol 2006, 2:e131. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Boomsma W, Mardia KV, Taylor CC, FerkinghoffBorg J, Krogh A, Hamelryck T: A generative, probabilistic model of local protein structure.
Proc Natl Acad Sci USA 2008, 105:89328937. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Edgoose T, Allison L, Dowe DL: An MML classification of protein structure that knows about angles and sequence.
Pac Symp Biocomput 1998, 585596. PubMed Abstract

Fetrow JS, Berg G: Using information theory to discover side chain rotamer classes: analysis of the effects of local backbone structure.
Pac Symp Biocomput 1999, 278289. PubMed Abstract

Durbin R, Eddy SR, Krogh A, Mitchison G: Biological sequence analysis. Cambridge University Press, UK; 1998.

Tuffery P, Etchebest C, Hazout S, Lavery R: A new approach to the rapid determination of protein side chain conformations.
J Biomol Struct Dyn 1991, 8:12671289. PubMed Abstract

Kullback S, Leibler RA: On information and sufficiency.
Ann Math Statist 1951, 22:7986. Publisher Full Text

LennardJones JE: On the forces between atoms and ions.
Proc R Soc Lond A Math Phys Sci 1925, 109:584597. Publisher Full Text

Jorgensen WL, Maxwell DS, TiradoRives J: Development and testing of the OPLS allatom force field on conformational energetics and properties of organic liquids.
J Am Chem Soc 1996, 118:1122511236. Publisher Full Text

Kaminski GA, Friesner RA, TiradoRives J, Jorgensen WL: Evaluation and reparametrization of the OPLSAA force field for proteins via comparison with accurate quantum chemical calculations on peptides.
J Phys Chem B 2001, 105:64746487. Publisher Full Text

Liang S, Grishin NV: Sidechain modeling with an optimized scoring function.
Protein Sci 2002, 11:322331. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mendes J, Nagarajaram HA, Soares CM, Blundell TL, Carrondo MA: Incorporating knowledgebased biases into an energybased sidechain modeling method: Application to comparative modeling of protein structure.
Biopolymers 2001, 59:7286. PubMed Abstract  Publisher Full Text

Kamisetty H, Xing EP, Langmead CJ: Free energy estimates of allatom protein structures using generalized belief propagation.
J Comp Biol 2008, 15:755766. Publisher Full Text

Kamisetty HK, Langmead CJ: A graphical model approach for predicting free energies of association for proteinprotein interactions under backbone and sidechain flexibility.

Sciretti D, Bruscolini P, Pelizzola A, Pretti M, Jaramillo A: Computational protein design with sidechain conformational entropy.
Proteins 2009, 74:176191. PubMed Abstract  Publisher Full Text

Theobald DL, Wuttke DS: Accurate structural correlations from maximum likelihood superpositions.
PLoS Comput Biol 2008, 4:e43. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hamelryck T: Probabilistic models and machine learning in structural bioinformatics.
Stat Methods Med Res 2009, 18:505526. PubMed Abstract  Publisher Full Text

Wang G, Dunbrack RL: PISCES: a protein sequence culling server.
Bioinformatics 2003, 19:15891591. PubMed Abstract  Publisher Full Text

Word JM, Lovell SC, Richardson JS, Richardson DC: Asparagine and glutamine: Using hydrogen atom contacts in the choice of sidechain amide orientation.
J Mol Biol 1999, 285:17351747. PubMed Abstract  Publisher Full Text

Hamelryck T, Manderick B: PDB file parser and structure class implemented in Python.
Bioinformatics 2003, 19:23082310. PubMed Abstract  Publisher Full Text

Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B: Biopython: freely available Python tools for computational molecular biology and bioinformatics.
Bioinformatics 2009, 25:14221423. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kabsch W, Sander C: Dictionary of protein secondary structure: Pattern recognition of hydrogenbonded and geometrical features.
Biopolymers 1983, 22:25772637. PubMed Abstract  Publisher Full Text

Paluszewski M, Hamelryck T: Mocapy++  A toolkit for inference and learning in dynamic Bayesian networks.
BMC Bioinformatics 2010, 11:126. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Nielsen S: The stochastic EM algorithm: Estimation and asymptotic results.
Bernoulli 2000, 6:457489. Publisher Full Text

Burnham KP, Anderson DR: Model selection and multimodel inference  a practical informationtheoretic approach. Second edition. Springer; 2002.

Cawley SL, Pachter L: HMM sampling and applications to gene finding and alternative splicing.
Bioinformatics 2003, 19:3641. Publisher Full Text

Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank.
Nucleic Acids Res 2000, 28:235242. PubMed Abstract  Publisher Full Text  PubMed Central Full Text