Abstract
Background
Hydrogen bonds (H-bonds) play a key role in both the formation and stabilization of protein structures. They form and break while a protein deforms, for instance during the transition from a non-functional to a functional state. The intrinsic strength of an individual H-bond has been studied from an energetic viewpoint, but energy alone may not be a very good predictor.
Methods
This paper describes inductive learning methods to train protein-independent probabilistic models of H-bond stability from molecular dynamics (MD) simulation trajectories of various proteins. The training data contains 32 input attributes (predictors) that describe an H-bond and its local environment in a conformation c; the output attribute is the probability that the H-bond will be present in an arbitrary conformation of this protein achievable from c within a time duration Δ. We model the dependence of the output variable on the predictors by a regression tree.
Results
Several models are built using 6 MD simulation trajectories containing over 4000 distinct H-bonds (millions of occurrences). Experimental results demonstrate that such models can predict H-bond stability quite well. They perform roughly 20% better than models based on H-bond energy alone. In addition, they can accurately identify a large fraction of the least stable H-bonds in a conformation. In most tests, about 80% of the 10% H-bonds predicted as the least stable are actually among the 10% truly least stable. The important attributes identified during the tree construction are consistent with previous findings.
Conclusions
We use inductive learning methods to build protein-independent probabilistic models of H-bond stability, and demonstrate that these models perform better than H-bond energy alone.
Background
A protein is a long sequence of amino acids, called residues. Under normal physiological conditions, various forces (electrostatic, van der Waals, ...) lead the protein to fold into a compact structure made of secondary structure elements, α-helices and β-strands, connected by bends (called loops). An H-bond corresponds to the attractive electrostatic interaction between a covalent pair D—H of atoms, in which the hydrogen atom H is bonded to a more electronegative donor atom D, and an electronegative acceptor atom A. Due to their strong directional character, short distance ranges, and large number in folded proteins, H-bonds play a key role in both the formation and stabilization of protein structures [1-3]. While H-bonds involving atoms from close residues along the main-chain sequence stabilize secondary structure elements, H-bonds between atoms in distant residues stabilize the overall 3D arrangement of secondary structure elements and loops.
H-bonds form and break while the conformation of a protein deforms. For instance, the transition of a folded protein from a non-functional state into a functional (e.g., binding) state may require some H-bonds to break and others to form [4]. So, to better understand the possible deformations of a folded protein, it is desirable to create a reliable model of H-bond stability. Such a model makes it possible to identify rigid groups of atoms in a given protein conformation and determine the remaining degrees of freedom of the structure [7]. Since most H-bonds in a protein conformation are quite stable, it is crucial that the model precisely identifies the least stable bonds. The intrinsic strength of an individual H-bond has been studied before from an energetic viewpoint [5,6]. However, potential energy alone may not be a very good predictor of H-bond stability. Other local interactions may reinforce or weaken an H-bond.
Methods
I. Problem statement
Let c be the conformation of a protein P at some time considered (with no loss of generality) to be 0 and H be an H-bond present in c. Let M(c) be the set of all physically possible trajectories of P passing through c and π be the probability distribution over this set. We define the stability of H in c over the time interval Δ by:

σ*(H, c, Δ) = E_{q∼π} [ (1/Δ) ∫_0^Δ I(q, H, t) dt ],     (1)

where I(q, H, t) is a Boolean function that takes value 1 if H is present in the conformation q(t) at time t along trajectory q, and 0 otherwise. The value σ*(H, c, Δ) can be interpreted as the probability that H will be present in the conformation of P at any specified time t ∈ (0, Δ), given that P is at conformation c at time 0. Our goal is to design a method for generating good approximations σ of σ*. We also want these approximations to be protein-independent.
II. General approach
We use machine learning methods to train a stability model σ from a given set Q of MD simulation trajectories. Each trajectory q ∈ Q is a sequence of conformations of a protein. These conformations are reached at times t_{i} = i × δ, i = 0, 1, 2, …, called ticks, where δ is typically on the order of picoseconds. We detect the H-bonds present in each conformation q(t_{i}) using the geometric criteria given in [8]. Note that an H-bond in a given protein is uniquely identified (across different conformations) by its donor, acceptor, and hydrogen atoms. So, we call the presence of a specific H-bond H in a conformation q(t_{i}) an occurrence of H, denoted by h.
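As an illustration, a minimal geometric H-bond test might look as follows. This is only a sketch: the actual criteria and thresholds of [8] are richer, and the 2.5 Å and 90° cutoffs below are assumptions made for the example.

```python
import math

def detect_hbond(d_pos, h_pos, a_pos, max_ha_dist=2.5, min_dha_angle=90.0):
    """Illustrative geometric H-bond test on donor (D), hydrogen (H) and
    acceptor (A) coordinates. Thresholds are assumptions for this sketch."""
    def dist(p, q):
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

    def angle_at(q, p, r):
        # angle (in degrees) at vertex q formed by segments q-p and q-r
        v1 = [pi - qi for pi, qi in zip(p, q)]
        v2 = [ri - qi for ri, qi in zip(r, q)]
        dot = sum(a * b for a, b in zip(v1, v2))
        n1 = math.sqrt(sum(a * a for a in v1))
        n2 = math.sqrt(sum(a * a for a in v2))
        return math.degrees(math.acos(dot / (n1 * n2)))

    # an H-bond is declared when H...A is short and D-H...A is near linear
    return (dist(h_pos, a_pos) < max_ha_dist
            and angle_at(h_pos, d_pos, a_pos) > min_dha_angle)
```

Running such a test on every (donor, hydrogen, acceptor) triple of a conformation yields the set of H-bond occurrences for that tick.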
For each h, we compute a fixed list of predictors, some numerical, others categorical. Some are time-invariant, like the number of residues along the main chain between the donor and acceptor atoms. Others are time-dependent. Among them, some describe the geometry of h, e.g., the distance between the hydrogen and the donor. Others describe the local environment of h, e.g., the number of other H-bonds within a certain distance from the midpoint of h.
We train σ as a function of these predictors. The predictor list defines a predictor space ∑, and every H-bond occurrence maps to a point in ∑. Given the input set Q of trajectories, we build a data table in which each row corresponds to an occurrence h of an H-bond present in a conformation q(t_{i}) contained in Q. So, many rows may correspond to the same H-bond at different ticks. In our experiments, a typical data table contains several hundred thousand rows. Each column, except the last one, corresponds to a predictor p, and the entry (h, p) of the table is the value of p for h. The entry in the last column is the measured stability y of h. More precisely, let H be the H-bond of which h is an occurrence. Let l = Δ/δ, where Δ is the duration over which we wish to predict the stability of h, and m ≤ l be the number of ticks t_{k}, k = i + 1, i + 2, …, i + l, such that H is present in q(t_{k}). The measured stability y of h is the ratio m/l. We chose l = 50 in most of the tests reported below, as this value both provides a ratio m/l large enough for the measured stability to be statistically meaningful, and corresponds to an interesting prediction timescale (50 ps). Typically, most H-bond occurrences are quite stable: over 25% have measured stability 1, about 50% have stability higher than 0.8, and only 15% have stability less than 0.3.
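The computation of the measured stability y = m/l from a per-tick presence record can be sketched as follows (a minimal illustration; the names `presence`, `i`, and `l` are ours):

```python
def measured_stability(presence, i, l=50):
    """Measured stability y of an H-bond occurrence at tick i: the fraction
    of the next l ticks at which the bond is present. `presence` is a
    per-tick Boolean list for one H-bond along one trajectory."""
    window = presence[i + 1 : i + 1 + l]   # ticks i+1, ..., i+l
    return sum(window) / l                 # m / l
```

One such value is computed per row of the data table, i.e., per (H-bond, tick) pair.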
We build σ as a binary regression tree [9]. This machine learning approach has been one of the most successful in practice, and regression trees are often simple to interpret. The method can handle both categorical and numerical predictors in a unified way, as shown in Section III. Each non-leaf node in a regression tree is a Boolean split. So, each node N of the tree determines a region of ∑ in which all the splits associated with the arcs connecting the root of the tree to N are satisfied. We say that an H-bond occurrence falls into N if it is contained in this region. The predicted stability value stored at a leaf node L is the average of the measured stability values of all the H-bond occurrences in the training data table that fall into L. We expect this average, which is taken over many pieces of trajectories, to approximate well the average defined in Equation (1).
Once a regression tree has been generated, it is used as follows. Given an H-bond H in an arbitrary conformation c of an arbitrary protein, the leaf node L of the tree into which H falls is identified by calculating the values of the necessary predictors for H in c. The predicted stability value stored at L is returned.
III. Training algorithm
We construct a model σ as a binary regression tree using the CART method [9]. The tree is generated recursively in a top-down fashion. When a new node N is created, it is inserted as a leaf of the tree if a predefined depth has been reached or if the number of occurrences falling into N is smaller than a predefined threshold. Otherwise, N is added as an intermediate node, its split is computed, and its left and right children are created. A split s is defined by a pair (p, r), where p is the split predictor and r is the split value. If p is a numerical predictor, then r is a threshold on p, and s ≜ (p < r). If p is a categorical predictor, then r is a subset of categories, and s ≜ (p ∈ r). We define the score w(p, r) of split s = (p, r) at a node N as the reduction of variance in measured stability that results from s. The algorithm chooses the split (both the predictor and the split value) that has the largest score. Only a relatively small subset of predictors selected by the training algorithm is eventually used in a regression tree.
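The variance-reduction score of a candidate split can be sketched as follows. This is a minimal illustration of the criterion described above (using sums of squared deviations, which is equivalent up to normalization), not the actual implementation:

```python
def split_score(y, left_mask):
    """Score w(p, r) of a candidate split at a node: the reduction in
    variability of the measured stabilities y when the node's occurrences
    are partitioned into a left group (left_mask[i] True) and a right group.
    Computed as SSE(parent) - SSE(left) - SSE(right)."""
    def sse(vals):
        # sum of squared deviations from the group mean
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    left = [v for v, in_left in zip(y, left_mask) if in_left]
    right = [v for v, in_left in zip(y, left_mask) if not in_left]
    return sse(y) - sse(left) - sse(right)
```

At each node, the training algorithm would evaluate this score for every candidate (predictor, value) pair and keep the maximizer.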
To prevent overfitting, we limit the tree depth to 5 in most of our experiments and require at least 10 training samples in an intermediate node. We further prune the resulting tree using the following adaptive algorithm. We initially set aside a fraction of the training data table, called the validation subset. Once a tree has been constructed, pruning proceeds iteratively. At each step, the intermediate node N whose split has minimal score becomes a leaf node by removing the subtree rooted at N. This process creates a sequence of trees with decreasing numbers of nodes. We compute the mean square error of the predictions made by each tree on the validation subset. The tree with the smallest error is selected.
Results
I. Experimental setup
We used 6 MD simulation trajectories picked from different sources, called hereafter 1c9oA, 1e85A, 1g9oA_1, and 1g9oA_2 from [10], complex from [11], and 1eia (generated by us). In all of them the time interval δ between two successive ticks is 1 ps. Table 1 indicates the protein simulated in each trajectory, its number of residues, the force field used by the simulator, and the duration of the trajectory. Each trajectory starts from a folded conformation resolved by X-ray crystallography.
Table 1. Characteristics of the MD simulation trajectories used to create the 6 datasets
From each trajectory we derived a separate data table in which the rows represent H-bond occurrences. The last two columns in Table 1 list the number of distinct H-bonds detected in each trajectory and the total number of H-bond occurrences extracted. Note that complex was generated for a complex of two molecules. All H-bonds occurring in this complex are taken into account in the corresponding data table.
The values of the time-varying predictors are subject to thermal noise. Since a model σ will in general be used to predict H-bond stability in a protein conformation sampled using a kinematic model that ignores thermal noise (e.g., by sampling the dihedral angles ϕ, ψ, and χ) [7], we chose to average the values of these predictors over l′ ticks to remove thermal noise. More precisely, the value of a predictor stored in the row of the data table corresponding to an H-bond occurrence in q(t_{i}) is the average value of this predictor over the conformations q(t_{j}), j = i − l′ + 1, …, i. Our analysis shows that l′ = 50 is near optimal.
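The averaging of a time-varying predictor over l′ ticks might be sketched as follows (the placement of the window, ending at tick i, is an assumption of this sketch):

```python
def smooth_predictor(values, i, l_prime=50):
    """Average a time-varying predictor over a window of up to l' ticks
    ending at tick i, to suppress thermal noise. `values` holds the raw
    per-tick values of one predictor for one H-bond."""
    start = max(0, i - l_prime + 1)       # truncate at the trajectory start
    window = values[start : i + 1]
    return sum(window) / len(window)
```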
The performance of a regression model can be measured by the root mean square error (RMSE) of its predictions on a test dataset. For a data table T = {(x_{1}, y_{1}), (x_{2}, y_{2}), …, (x_{n}, y_{n})}, where each x_{i}, i = 1, …, n, denotes a vector of predictor values for an H-bond occurrence and y_{i} is the measured stability of the H-bond, and a model σ, the RMSE is defined by:

RMSE(σ, T) = √[ (1/n) ∑_{i=1}^{n} (σ(x_{i}) − y_{i})² ].

As RMSE depends not only on the accuracy of σ, but also on the table T, some normalization is necessary to compare results on different tables. So, in our tests we compute the decrease of RMSE relative to a base model σ_{0}. The relative base error decrease (or RBED) is then defined by:

RBED(σ, σ_{0}, T) = [RMSE(σ_{0}, T) − RMSE(σ, T)] / RMSE(σ_{0}, T).

In most cases, σ_{0} is simply the constant model σ_{0}(x) = (1/n) ∑_{i=1}^{n} y_{i}, i.e., the average measured stability of all H-bond occurrences in the dataset. In other cases, σ_{0} is a model based on the H-bond energy.
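The RMSE and RBED computations follow directly from their definitions, e.g.:

```python
import math

def rmse(model, table):
    """Root mean square error of a model on a table of (x, y) pairs,
    where y is the measured stability."""
    n = len(table)
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in table) / n)

def rbed(model, base_model, table):
    """Relative base error decrease of `model` with respect to `base_model`."""
    e0 = rmse(base_model, table)
    return (e0 - rmse(model, table)) / e0
```

An RBED of 0 means no improvement over the base model; an RBED of 1 means a perfect predictor.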
II. Generality of models trained on multiple trajectories
Our goal is to train models that predict the stability of H-bonds in any protein. So, we trained models on data tables obtained by mixing subsets of 5 data tables and tested these models on the remaining data table. For each combination of 5 data tables, we trained 4 groups of models varying in the tree's maximal depth (5 or 15) and in the fraction of H-bond occurrences randomly taken from each data table (10% or 50%). For each group we trained 10 models. Hence, in total, 240 models were generated. Table 2 shows the mean RBED value for each combination of data tables and each group of models. In columns 3 through 8 we indicate the data table used for testing the models trained on the combination of the 5 other data tables. Figure 1 shows a partial tree trained on the combination of all tables except 1c9oA.
Table 2. Mean RBED values of models trained on multiple trajectories
Figure 1. Top 2 layers of a regression tree trained on the combination of all tables except 1c9oA. The actual tree contains 55 nodes. Each path from the root to a node defines a conjunction of criteria for H-bonds with a certain mean stability. Here, Dist_H_A (the distance between the hydrogen and the acceptor atoms) is the most differentiating predictor. For H-bonds with Dist_H_A ≥ 2.40 Å, the mean stability is only 0.38, but it increases to 0.92 if Dist_H_A < 2.40 Å.
The RBED values show that the regression tree models significantly reduce the base error and retain their predictive power when applied to a protein not present in the training data. Moreover, the variance of the RBED values is very small, meaning that the training process yields models with stable performance. However, the RBED values are lower for models tested on complex. Recall that the trajectory complex was generated for a complex made of a protein and a ligand, while every other trajectory was generated for a single protein. So, it is likely that complex contains H-bonds in situations that did not occur in any of the other trajectories. Both deeper trees and larger data fractions tend to improve model accuracy, but the very small gain is not worth the additional model or computational complexity.
III. Comparison with the FIRST_energy model
We checked whether regression models can predict the stability of H-bonds more accurately than potential energy alone. Table 3 presents the mean RBED value for the models obtained in the first row of Table 2 relative to a base model that is a regression tree built from the same training data using FIRST_energy as the only predictor. FIRST_energy is a modified Mayo potential [5] implemented in FIRST, a protein rigidity analysis software [7]. Comparisons on all 6 data tables show that the more complex models are significantly more accurate than the models based on FIRST_energy alone.
Table 3. Mean RBED values of models using single predictor FIRST_energy
IV. Identification of least stable H-bonds
Most H-bond occurrences tend to be stable. So, accurately identifying the weakest ones is important if one wishes to predict the possible deformations of a protein [7]. To evaluate how well our models identify the least stable H-bond occurrences, we first identify the subset S of the 10% H-bond occurrences with the smallest measured stability in each test table T. Using a regression tree σ obtained in Section II, we sort the H-bond occurrences in T in ascending order of predicted stability and compute the fraction w ∈ [0,1] of S that is contained in the first 100×u% occurrences in this sorted list, for successive values of u ∈ [0,1]. We call the function w(u) the identification curve of the least stable H-bonds for σ.
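A point w(u) of the identification curve can be computed as sketched below (function and variable names are ours; ties in stability values are broken arbitrarily):

```python
def identification_curve(measured, predicted, u, frac=0.10):
    """Fraction w(u) of the `frac` least stable occurrences (by measured
    stability) found among the first 100*u% occurrences when sorted in
    ascending order of predicted stability."""
    n = len(measured)
    k = max(1, int(frac * n))
    # S: indices of the k occurrences with the smallest measured stability
    order_true = sorted(range(n), key=lambda i: measured[i])
    S = set(order_true[:k])
    # top: indices of the first 100*u% occurrences by predicted stability
    order_pred = sorted(range(n), key=lambda i: predicted[i])
    top = set(order_pred[: max(1, int(u * n))])
    return len(S & top) / k
```

Evaluating this for successive values of u traces the curve; the ideal curve reaches 1 at u = frac.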
Figure 2 plots the identification curve for 1c9oA. It consists of three curves: the red curve is the (fictitious) ideal identification curve, the blue curve is obtained with one (randomly picked) regression tree computed in Section II, and the green curve is obtained by sorting H-bond occurrences in decreasing order of FIRST_energy. Plots for the other proteins have similar shapes. For models tested on all data tables except complex, about 80% of the 10% H-bond occurrences predicted as the least stable are actually among the 10% truly least stable. The results for complex are less satisfactory for the reasons discussed in Section II. The regression models are consistently better than the FIRST_energy-only models, though for 1eia the difference is small.
Figure 2. Identification curves of the least stable bonds for 1c9oA (see Results, Section IV).
Discussion
In all our regression trees the root split was made on the predictor Dist_H_A (the distance between the hydrogen and acceptor atoms), which therefore appears to be the single most discriminative attribute for predicting H-bond stability. This observation is consistent with previous findings. Levitt [6] found that most stable H-bonds have Dist_H_A less than 2.07 Å. Jeffrey and Saenger [14] also suggested that Dist_H_A is a key attribute affecting H-bond stability, with a value less than 2.2 Å for moderate to strong H-bonds. Consistent with these previous findings, the split values of the deepest Dist_H_A nodes in all our regression trees are around 2.1 Å. This distance was observed in [6] to sometimes fluctuate by up to 3 Å in stable H-bonds, due to high-frequency atomic vibration. This observation supports our decision to average predictor values over windows of l′ ticks.
The predictor FIRST_energy is often used in splits close to the root. This is not surprising, since it is a function of several other pertinent predictors: Dist_H_A, Angle_D_H_A (the angle between the donor, the hydrogen atom, and the acceptor), Angle_H_A_AA (the angle between the hydrogen atom, the acceptor, and the atom covalently bonded to the acceptor), and the hybridization state of the bond. Some other distance-based predictors (Dist_D_AA, Dist_D_A, Dist_H_D), angle-based predictors, and the Ch_type predictor (describing whether the donor and acceptor are from the main chain or a side chain) appear often in regression trees, but closer to the leaf nodes. They nevertheless play a significant role in predicting H-bond stability. For example, as shown in Figure 1, if Angle_H_A_AA is at least 105°, the stability is very high (about 0.96); otherwise, it drops to 0.71. The preference for larger angles matches well the well-known linearity of H-bonds [14].
To get a more quantitative measure of the relative impact of the predictors on H-bond stability, we define the importance of a predictor p in a regression tree by:

I(p) = ∑_{s ∈ N_{p}} w(s) × n(s),

where N_{p} is the set of nodes where the split is made using p, w(s) is the score of the split s, and n(s) is the number of H-bond occurrences falling into the node where split s is made. We trained 10 models on data tables combining 10% of each of the 6 data tables. Importance scores for each predictor were averaged over these models and then linearly scaled so that the score of the least important predictor (with nonzero average importance) equals 1. The average importance of every predictor appearing in at least one model is shown in Figure 3. The figure confirms that distance-based and angle-based predictors, as well as FIRST_energy, are the most important. It also shows that a number of other predictors, including Resi_name_H, Resi_name_A, and Range (the difference in residue numbers of the donor and acceptor), have less, but still significant, importance.
Figure 3. Predictor importance scores
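The importance computation defined above, together with its scaling, can be sketched as follows (the `splits` representation, a flattened list of a tree's splits, is a hypothetical view introduced for this example):

```python
def predictor_importance(splits):
    """Importance I(p) of each predictor: the sum over its splits s of
    w(s) * n(s), then linearly scaled so that the smallest nonzero
    importance equals 1. `splits` is a list of
    (predictor_name, split_score, n_occurrences_at_node) triples."""
    raw = {}
    for p, w, n in splits:
        raw[p] = raw.get(p, 0.0) + w * n
    scale = min(v for v in raw.values() if v > 0)
    return {p: v / scale for p, v in raw.items()}
```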
Overall, we observe that predictors describing the local environment of an H-bond play a relatively small role in predicting its stability. In particular, we had expected that descriptors such as Num_hb_spaceNbr and Num_hb_spaceRgdNbr, which count the number of other H-bonds located in the neighborhood of the analyzed H-bond, would have more importance. However, this may reflect the fact that the MD simulation trajectories used in our tests are too short to contain enough information to infer the role of such predictors. Indeed, while transitions between metastable states are rare in those trajectories, predictors describing local environments may have greater influence on the stability of H-bonds that must break for such transitions to happen. So, longer trajectories may eventually be needed to better model H-bond stability.
Conclusions
We have described machine learning methods to train protein-independent regression trees modeling H-bond stability in proteins. Test results demonstrate that the trained models can predict H-bond stability quite well. In particular, their performance is significantly better (roughly 20% better) than that of a model based on H-bond energy alone. They can accurately identify a large fraction of the least stable H-bonds in a given conformation. However, our results also suggest that better results could be obtained with a richer set of MD simulation trajectories. In particular, the trajectories used in our experiments might be too short to characterize the stability of H-bonds that break and form during a transition between metastable states.
We believe that the training methods could be improved in several ways:
- It would be better to average predictor values before subsampling MD simulation trajectories. This would reduce the risk of filtering out changes in predictor values that are important for H-bond stability. Unfortunately, in our trajectories we only had access to the data after subsampling.
- More sophisticated learning techniques could be used. For example, instead of generating a single tree, we could generate an ensemble of trees, such as gradient boosted trees [16].
- Finally, the notion of stability itself could be refined, for example by distinguishing between the case where an H-bond frequently switches on and off during a prediction window and the case where it rarely switches.
Authors' contributions
All four authors, IC, PY, MM, and JCL, participated in the formulation of the problem, the design of its solution, and the analysis and the interpretation of the results. PY prepared the experimental data. IC adapted a previously available CART software package and ran the experiments. All authors contributed to the writing of the manuscript, read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Acknowledgements
This work was supported in part by a grant from the KAUST-Stanford Academic Excellence Alliance program. The authors thank L. Kavraki (Rice Univ.), V. Pande (Stanford), M. Levitt (Stanford), and J. Wang Tsai (Univ. of the Pacific) for providing us with MD simulation trajectories and for useful comments during our work.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S1.
References
1. Baker EN: Hydrogen bonding in biological macromolecules. International Tables for Crystallography 2006, F:546-552 (Chapter 22.2).
2. Fersht AR, Serrano L: Principles in protein stability derived from protein engineering experiments. Curr Opin Struct Biol 1993, 3:75-83.
3. Schell D, Tsai J, Scholtz JM, Pace CN: Hydrogen bonding increases packing density in the protein interior. Proteins 2006, 63:278-282.
4. Bikadi Z, Demko L, Hazai E: Functional and structural characterization of a protein based on analysis of its hydrogen bonding network by hydrogen bonding plot. Arch Biochem Biophys 2007, 461:225-234.
5. Dahiyat BI, Gordon DB, Mayo SL: Automated design of the surface positions of protein helices. Protein Science 1997, 6:1333-1337.
6. Levitt M: Molecular dynamics of hydrogen bonds in bovine pancreatic trypsin inhibitor protein. Nature 1981, 294:379-380.
7. Thorpe MF, Lei M, Rader AJ, Jacobs DJ, Kuhn LA: Protein flexibility and dynamics using constraint theory. J Mol Graph Model 2001, 19:60-69.
8. McDonald IK, Thornton JM: Satisfying hydrogen bonding potential in proteins. J Mol Biol 1994, 238:777-793.
9. Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. CRC Press; 1984.
10. Joo H, Qu X, Swanson R, McCallum CM, Tsai J: Modeling the dependency of residue packing upon backbone conformation using molecular dynamics simulation. Comput Biol Chem 2010.
11. Haspel N, Ricklin D, Geisbrecht B, Lambris JD, Kavraki LE: Electrostatic contributions drive the interaction between Staphylococcus aureus protein Efb-C and its complement target C3d. Protein Sci 2008, 17:1894-1906.
12. Levitt M, Hirshberg M, Sharon R, Daggett V: Potential energy function and parameters for simulations of the molecular dynamics of proteins and nucleic acids in solution. Computer Physics Communications 1995, 91:215-231.
13. Srinivasan J, Trevathan M, Beroza P, Case D: Application of a pairwise generalized Born model to proteins and nucleic acids: inclusion of salt effects.
14. Jeffrey GA, Saenger W: Hydrogen Bonding in Biological Structures. Springer-Verlag, Germany; 1991.
15. Tuv E, Borisov A, Torkkola K: Best subset feature selection for massive mixed-type problems.
16. Friedman JH: Greedy function approximation: a gradient boosting machine. Annals of Statistics 2001, 29:1189-1232.