Integrating water exclusion theory into β contacts to predict binding free energy changes and binding hot spots

Liu, Qian; Hoi, Steven CH; Kwoh, Chee Keong; Wong, Limsoon; Li, Jinyan

doi:10.1186/1471-2105-15-57

Research article
Open access
Published: 26 February 2014

Integrating water exclusion theory into β contacts to predict binding free energy changes and binding hot spots

Qian Liu^1,2,
Steven CH Hoi²,
Chee Keong Kwoh²,
Limsoon Wong³ &
…
Jinyan Li¹

BMC Bioinformatics volume 15, Article number: 57 (2014) Cite this article

2768 Accesses
9 Citations
2 Altmetric
Metrics details

Abstract

Background

Binding free energy and binding hot spots at protein-protein interfaces are two important research areas for understanding protein interactions. Computational methods have been developed previously for accurate prediction of binding free energy change upon mutation for interfacial residues. However, a large number of interrupted and unimportant atomic contacts are used in the training phase which caused accuracy loss.

Results

This work proposes a new method, β ACV_ASA, to predict the change of binding free energy after alanine mutations. β ACV_ASA integrates accessible surface area (ASA) and our newly defined β contacts together into an atomic contact vector (ACV). A β contact between two atoms is a direct contact without being interrupted by any other atom between them. A β contact’s potential contribution to protein binding is also supposed to be inversely proportional to its ASA to follow the water exclusion hypothesis of binding hot spots. Tested on a dataset of 396 alanine mutations, our method is found to be superior in classification performance to many other methods, including Robetta, FoldX, HotPOINT, an ACV method of β contacts without ASA integration, and ACV_ASA methods (similar to β ACV_ASA but based on distance-cutoff contacts). Based on our data analysis and results, we can draw conclusions that: (i) our method is powerful in the prediction of binding free energy change after alanine mutation; (ii) β contacts are better than distance-cutoff contacts for modeling the well-organized protein-binding interfaces; (iii) β contacts usually are only a small fraction number of the distance-based contacts; and (iv) water exclusion is a necessary condition for a residue to become a binding hot spot.

Conclusions

β ACV_ASA is designed using the advantages of both β contacts and water exclusion. It is an excellent tool to predict binding free energy changes and binding hot spots after alanine mutation.

Background

A binding hot spot is a small area in a protein binding interface whose mutation can lead to a big change in binding free energy. The determination of its accurate location in the interface is a fundamental problem in structural biology, and is useful for applications such as rational drug design and protein engineering [1]. In wet labs, a residue’s contribution to binding free energy can be determined through mutation experiments. For example, alanine scanning mutagenesis [2] mutates interfacial residues individually into an alanine, and then measures the change of binding free energy (Δ Δ G) to quantify the contribution of the side chain of the mutated residue. Based on these wet-lab experimental outcomes and databases [3–6], it has been reported that binding free energy is unevenly distributed in protein interfaces [7]. In fact, there are always a small fraction of interfacial residues—the binding hot spot—which make major contribution to the binding [7, 8] with Δ Δ G≥2 kcal/mol [3]. But wet-lab experiments are both time and cost expensive. Reliable computational methods are thus needed for accurate prediction of binding free energy change.

FoldX [9, 10], Robetta [11, 12] and CC/PBSA [13] are some well-known physics-based methods for this prediction problem. These methods use empirical terms (such as hydrogen bonds), the van der Waals terms and Coulomb electrostatics to learn a linear function for estimating the effect on the change of binding free energy after residue mutations. However, the predicted energy by these methods has a large discrepancy from experimentally measured Δ Δ G[14]. Thus, other methods have been proposed to qualitatively identify binding hot spots. For example, protein sequences are used by [15] and ISIS [16], while protein tertiary structures are used together with docking techniques by [17]. Protein quaternary structures have been also widely used [18]. For example, Hotsprint [19] and HotPOINT [20] generate rules to identify binding hot spots from features such as conservation, accessible surface area (ASA), residue propensity and/or residue pairwise potentials. Machine learning models are also widely used for predicting binding hot spots. Decision trees are used in MINERVA [14] to induce rules at different levels of protein information including structure, sequence and molecular interactions. Later, machine learning algorithms SVM and its ensemble are employed to combine energetic terms such as van der Waals potentials, solvation energy, hydrogen bonds and Coulomb electrostatics, and/or other protein sequences and structure information for a better hot spot prediction performance. Recently, Bayesian Networks are used to combine three main sources of information related to conservation, FoldX-calculated Δ Δ G and atomic contacts for a novel probabilistic model of binding hot spots prediction [21]. Very recently, random forests have been proposed to predict hot spots [22] by using structural neighborhood properties of mutated residues and other conventional physicochemical features [23, 24]. Besides alanine mutations, hot spots after mutations to any other type of residues are also investigated [6] and their binding free energy changes can be predicted [13, 25] with good performance. Several of these methods are also assessed in a community-wide test for predicting mutation effects on protein-protein interaction affinity [26].

In spite of intensive research, the prediction still needs a big improvement. The existing methods usually used those atomic contacts based on Voronoi diagram or simply defined by a distance threshold with little consideration on the local atomic organization of the contacts. If the distance threshold is too large, e.g., larger than 6 Å, an atomic contact between two atoms i and j may have no direct contact area, because the space between i and j can accommodate other atoms. Such interrupted contacts constitute a large proportion of the traditionally used contacts. It is highly questionable whether they are really important to protein binding. In fact, important contacts in hot spot prediction [10, 11] or those closely related to binding hot spots [14] are generally not interrupted, such as hydrogen bonds, salt bridges and π−π contacts.

To overcome these drawbacks, we propose a novel classifier βACV_ASA for predicting Δ Δ G and binding hot spots. The main idea of βACV_ASA is to use atomic contact vector (ACV) of β contacts (that’s why our classifier is named βACV for short) instead of distance-cutoff contacts. β contact, found on β-skeletons [27], is our newly defined contact [28]. A β contact between two atoms restricts that there is no other atoms between these two atoms, and requires that the two atoms should have enough direct contact area to form an interaction. The definition of β contacts can filter out a lot of unimportant and interrupted distance-cutoff contacts. Our analysis has found that β contacts are only a small fraction number of those contacts based on a distance threshold [28], but they are effective to distinguish crystal packing from homodimers [28] and to predict protein-ligand binding affinity [29].

Another important idea is that the relative ASA properties are integrated by our βACV classifier based on the water exclusion hypothesis of binding hot spots. The water exclusion hypothesis states that the topological shape of a binding hot spot and its surrounding residues can be characterized as an O-ring structure [3]. Few residues on the O-ring, which are largely exposed to bulk solvent water, can contribute significantly to the protein binding. Thus in βACV, the energy contribution of a β contact to protein binding is required to be inversely proportional to its ASA.

Our βACV_ASA was tested on a dataset of 396 alanine mutations to show its superior performance. We compared βACV_ASA with the following methods: (i) ACV methods using distance-cutoff contacts to reveal the importance of β contacts to protein binding; (ii) a βACV method without ASA integration to confirm whether the water exclusion theory is necessary for binding hot spots; and (iii) several widely-used state-of-the-art methods such as Robetta, FoldX, HotPOINT and KFC to show the overall better prediction capability of βACV_ASA.

Methods

Dataset

The data stored in the ASEdb database [3] and the mutations in BID [4] having Δ Δ G measurements are both used for evaluating our method. In total, our dataset contains 22 protein-protein complexes (detailed in Additional file 1: Table S2). All of them have quaternary structures in PDB and meet the following three requirements. First, no redundancy exists among these protein complexes. Given two protein complexes (e.g., interacted pair A and B, and interacted pair C and D), a sequence identity is calculated through BLAST with the default setting for A and C, A and D, B and C, and B and D, denoted by S(A,C), S(A,D), S(B,C) and S(B,D) respectively. These two protein complexes are redundant if S(A,C)≥40% and S(B,D)≥40%, or S(A,D)≥40% and S(B,C)≥40%. According to this criterion, most of the protein complexes in our dataset are non-redundant. For those redundant complexes, the mutations in the similar proteins must be in different positions. Our requirement on this sequence identity is reasonable, since atomic contacts and ASA used in this work are derived from complexes only, rather than from sequences (such as required by conservation scores). Secondly, only alanine mutations are considered. Thirdly, mutated atoms before mutation must have at least one distance-cutoff atomic contact with the partner proteins. The mutated atoms are those atoms except N, CA, C, O and CB. Under these requirements, our dataset has 396 alanine mutations (detailed in Additional file 1: Table S3). Of these mutations, 86 are binding hot spot residues having Δ Δ G≥2 kcal/mol.

Atomic βcontacts in protein binding interfaces

Atomic β contact is a recently proposed notion of atomic contacts for modeling the well-organized protein 3D structures [28]. Its detail can be found in [28]. For easy reference, we give a brief description of β contacts and a simple method to produce β contacts from a protein quaternary structure.

Atomic βcontacts: a definition

Given a quaternary structure of a protein complex p, a β contact between two atoms i and j in p requires that (i) the spatial distance between i and j is less than a threshold T_d plus the sum of their van der Waals radii defined by [30] (distance-cutoff contacts for short), (ii) i and j share a Voronoi facet in p’s Voronoi diagram, and (iii) the contact cannot break p’s β-skeleton. The β-skeleton [27] of a discrete set p is an undirected graph in computational geometry. In this graph, two points i and j have an edge if angle ikj is sharper than a threshold determined by β, ∀k∈p,k≠i,j. This angle threshold is denoted as ∠ β, which actually defines a forbidden region fr of the contact between i and j, e.g., the gray regions in Figure 1. When β=1, namely ∠ β=90°, fr is the sphere with the midpoint of i and j as the center and with c’s length as the diameter as shown in Figure 1(b). This sphere is similar to van der Waals radii of atoms. The forbidden region fr of a β contact usually does not cover any other atoms. Otherwise, if there is an atom k in fr, for example as shown in Figure 1(a) when ∠ β=90° or in Figure 1(c) when ∠ β=75°, the contact between i and j is not a β contact. A β contact suggests that its two atoms should have enough direct contact area to form an important interaction. The number of atomic β contacts in protein binding interfaces is only a small fraction number of distance-based contacts or less than half the number of contacts in the Voronoi diagrams when T_d=3.3 as found by [28]. Interestingly, the use of β contacts can achieve better prediction performance for distinguishing false binding of crystal packing from homodimers.

A method to produce βcontacts

A protein complex p can be represented as an atomic β contact graph b(p), if all of the heavy atoms are represented by nodes, and the β contacts are represented by edges. To produce b(p) for p, Qhull is first used to obtain the Delaunay triangulation [31] for all nodes. After that, the distance threshold T_d is used to remove those atomic contacts whose distances are too large. T_d is set as 3.3 Å (the diameter of a water molecule 2.8 Å plus 0.5 Å). This threshold is an insensitive factor to β contacts when it is large enough. Please refer to the Additional file 1 for an analysis of β contacts under several different T_ds. Thirdly, each atomic contact is checked to guarantee that it satisfies the β skeleton requirements. To sharpen the difference of those mutations with higher Δ Δ G and those with lower Δ Δ G, the angle threshold ∠ β is set as 75° in this work, whose forbidden region fr is larger than that of ∠ β=90° as shown in Figure 1(b) and (c). That is, ∠ β=75° is a stricter condition than using ∠ β=90° to produce β contacts. The rationale to choose the stricter condition ∠ β=75° is illustrated in the following situation. Assume (i) A, B and C are the center points of three atoms with van der Waals radii 1.8 Å (for example, the radius of common Carbon atoms in protein structures), (ii) there is a covalent bond between A and B (AB for short) with spatial distance 1.5 Å, and a non-covalent bond between A and C (AC for short) and one between B and C (BC for short), and (iii) the van der Waals sphere of A and that of C are circumscribed each other with the spatial distance 3.6 Å, and the same for the van der Waals sphere of A and that of C. Then, the angle ∠ A B C=78°. A stricter threshold than 78° is 75°, which was chosen by this work.

Our prediction methods

This section describes how to construct our βACV_ASA classifier, including how to define βACV and how to integrate ASA into βACV.

βACV: a vector representation for interfacial alanine mutations

We use an atomic contact vector (ACV) [32] of β contacts to represent an interfacial alanine mutation. We also use β contacts of the interfacial bound water molecules to update the vector, and use the atomic environment of the mutation to expand the basic vector.

Constructing a basic βACV vector

To produce a basic βACV for an alanine mutation m u t r ala of residue r in a protein complex p, we build the β contact graph b(p) for p. We then remove the coordinates of the mutated atoms in r to get a quaternary structure resulted from the mutation, denoted by p r mut. Then, we produce another β contact graph b(p r mut) for p r mut.

By the mutation m u t r ala, some contacts in b(p) may disappear in b(p r mut), namely those mutated contacts, while some new contacts can be formed in b(p r mut), called new contacts. Both of those mutated and new contacts are represented in a βACV vector. As the heavy atoms from the 20 standard residues are grouped into 8 atomic types (shown in Additional file 1: Table S4) by this work, our βACV vector has 36 pairs of atomic types as elements. The value v(T i,T j) of each element in βACV with atomic types Ti and Tj is calculated, using Equation 1.

\begin{align} v (Ti, Tj) = & \sum_{(x, y) \in M (Ti, Tj)} \frac{1}{d_{(x, y)}^{2}} - \sum_{(x^{'}, y^{'}) \in N (Ti, Tj)} \frac{1}{d_{(x^{'}, y^{'})}^{2}} \end{align}

(1)

where x and x^′ are of the atomic type Ti, y and y^′ are of the atomic type Tj, (x,y) and (x^′,y^′) are two atomic pairs, d(∗,∗) is the spatial distance between a pair of atoms, and M(T i,T j)(or N(T i,T j)) is the set of all those mutated (or the set of all those new) contacts whose atomic types are Ti and Tj. Here term d²(∗,∗) is specially used to follow the same idea as Coulomb’s law which also uses the inverse of squared distance. Note that the other common contacts between b(p) and b(p r mut) are not used in βACV. Alanine mutations of Ala are assumed to have insignificant Δ Δ G and alanine mutations of Gly are not considered.

It can be seen that a basic βACV considers all mutated contacts and new contacts, including both acrossinterface contacts and those contacts from the same proteins or same biological units. However, atomic contacts between covalent-bond nearby atoms are not used in Equation 1. The covalent-bond nearby atoms of a given atom i are those atoms that have not more than three covalent-bond steps from i. For example, suppose i−j−k−l−m, where − indicates a covalent bond. From i, the covalent-bond step is 0 to i, is 1 to j, is 2 to k, is 3 to l and is 4 to m, respectively. Thus, i,j,k and l are covalent-bond nearby atoms of i, while m is not. In βACV, the contacts between i and its covalent-bond nearby atoms are excluded from M or N in Equation 1. This is reasonable, because spatially close distances between i and its covalent-bond nearby atoms are more likely due to the rigidity of their covalent bonds.

Bound water molecules in protein interfaces

Protein folding and binding occur in a solvent environment in vivo. Water molecules are heavily involved in protein binding and sometimes they can form a compulsory part of the protein interfaces. In this work, a water molecule in PDB is considered as a part of a binding interface if (i) it has at least 3 potential hydrogen-bonds contacts, or (ii) it has 2 potential hydrogen-bond contacts and also has at least 2 other contacts with spatial distances less than 4 Å. A potential hydrogen-bond contact is required to have a spatial distance less than 3.2 Å between a hydrogen donor (such as a nitrogen atom) and a hydrogen acceptor (such as an oxygen atom). Water molecules under this requirement, named bound water molecules, are such closely involved in protein folding and binding that they can play an integral part. Bound water molecules are then grouped into the Oxygen atomic type with more than one hydrogen atom (shown in Additional file 1: Table S4) to update the values of the elements in the basic βACV vector. We did not consider the contacts between any two water molecules.

The neighbourhood atoms of mutated residues

Information of neighbourhood atoms of m u t r ala is used to expand the basic βACV vector. Assume that S is a set of atoms which have β contacts with the mutated atoms under ∠ β=90°. For each atom in S including the mutated atoms, its nearby atoms are added into S. Then, an atomic vector with the above 8 atomic types is also used to represent those atoms in S in the bound state. The value of its element Tk is calculated using v(T k)^b in Equation 2. Similarly, v(T k)^u in Equation 2 is used to calculate another atomic vector for representing the atoms in S in the unbound state.

v {(Tk)}^{b} = \sum_{j \in S, t_{j} = Tk} E_{j}^{loc}; v {(Tk)}^{u} = \sum_{j \in S, t_{j} = Tk} E_{j}^{loc} (u)

(2)

where E j loc (or E j loc(u)) is the relative local ASA of atom j in the bound (or unbound) state calculated via Equation 6 below. Water molecules were not considered here. Thus, each basic βACV vector is now expanded by another 16 atomic types for representing surrounding information of mutated atoms. So, the expanded βACV is a vector representation with 52 elements.

βACV_ASA: integrating the water exclusion hypothesis into βACV

Solvent water is compulsory for protein binding, but water exclusion—small accessible surface area (ASA)—is a necessary condition for a residue to become binding hot spot [3, 33, 34]. Few highly exposed residues can make significant contribution to protein binding strength [34]. Thus, we integrate ASA information into each atomic pair of βACV in Equation 1, and name the method βACV_ASA. We note that except Equation 1, the other definitions in βACV_ASA are the same as those in βACV.

Given a protein complex p, we take the following steps to integrate the water exclusion theory into Equation 1. The first step is to use NACCESS [30] to produce ASA for all of the atoms and residues in both bound and unbound states. For p in the bound state, we then define special ASA terms for an atom i using Equation 3, and for a residue R_i using Equation 4 and Equation 5.

\begin{align} E_{i} & = \sqrt{\frac{{ASA}_{i}}{50}} & B_{i} & = max (0, 1 - E_{i}) \end{align}

(3)

\begin{align} E_{R_{i}}^{bb} & = \sqrt{\frac{{ASA}_{R_{i}}^{bb}}{max ({ASA}_{R_{i}}^{bb})}} & B_{R_{i}}^{bb} & = max (0, 1 - E_{R_{i}}^{bb}) \end{align}

(4)

\begin{align} E_{R_{i}}^{sc} & = \sqrt{\frac{{ASA}_{R_{i}}^{sc}}{max ({ASA}_{R_{i}}^{sc})}} & B_{R_{i}}^{sc} & = max (0, 1 - E_{R_{i}}^{sc}) \end{align}

(5)

In Equation 3, ASA_i is accessible surface area of atom i, while E_i is its relative ASA, and B_i is its relative ASA burial compared to the maximum ASA, where number 50 is roughly half of NACCESS-calculated ASA of a single water molecule without any neighbor atoms (the ASA of a water molecule is 98.47=4×3.14×2.8² Å²). In Equations 4 and 5, ${ASA}_{R_{i}}^{bb}$ and ${ASA}_{R_{i}}^{sc}$ are accessible surface area of backbone atoms (i.e., bb) and of side-chain atoms (i.e., sc) for a residue R_i, while $E_{R_{i}}^{*}$ and $B_{R_{i}}^{*}$ are the relative ASA and the relative ASA burial of ∗∈{b b,s c}. $max ({ASA}_{R_{i}}^{bb})$ and $max ({ASA}_{R_{i}}^{sc})$ are the maximum ASA of backbone atoms and of side-chain atoms for the residue type of R_i, which are calculated in a triplet of ALA- R_i-ALA by NACCESS. These backbone atoms and side-chain atoms are defined in the same way as those in [30].

We compute the local ASA E i loc and local ASA burial B i loc of an atom i via Equation 6.

\begin{align} E_{i}^{loc} & = \{\begin{matrix} E_{i} \times E_{R_{i}}^{bb} if i is a backbone atom of R_{i} \\ E_{i} \times E_{R_{i}}^{sc} if i is a side-chain atom of R_{i} \end{matrix} \\ B_{i}^{loc} & = \{\begin{matrix} B_{i} \times B_{R_{i}}^{bb} if i is a backbone atom of R_{i} \\ B_{i} \times B_{R_{i}}^{sc} if i is a side-chain atom of R_{i} \end{matrix} \end{align}

(6)

where the multiplication of relative ASA burial of both atom i and its residue is used to calculate local ASA burial B i loc. This is because relative ASA of both an atom and its residue are critical in describing the accessibility of an atom. For example, an atom may be buried with small ASA but its covalent-bond atoms might be exposed. When relative ASA of atoms or residues are used individually, the performance was worse (data not shown).

To integrate water exclusion theory into Equation 1, we determine the value v(T i,T j) of each element in βACV_ASA through Equation 7 instead of Equation 1.

\begin{matrix} v (Ti, Tj) = \sum_{(x, y) \in M (Ti, Tj)} \frac{B_{x}^{loc} \times B_{y}^{loc}}{d_{(x, y)}^{2}} \\ - \sum_{(x^{'}, y^{'}) \in N (Ti, Tj)} \frac{B_{x^{'}}^{loc} \times B_{y^{'}}^{loc}}{d_{(x^{'}, y^{'})}^{2}} \end{matrix}

(7)

where T_∗, x, y, x^′, y^′, M and N have the same meaning as those in Equation 1.

Comparison of βcontacts with distance-cutoff contacts

To compare the performance of β contacts with distance-cutoff contacts for predicting Δ Δ G, ACV_ASA based on distance-cutoff contacts is constructed in a similar way to constructing βACV_ASA. To further show the importance of β contacts in protein binding interfaces, non βACV_ASA is also constructed for alanine mutations at the setting of ∠ β=90°. In non βACV_ASA, the values of its elements are the difference of the values of the 52 elements between ACV_ASA and βACV_ASA. To highlight the advantage of β contacts, ACV_ASA is also evaluated with different spatial distance thresholds (from 2.9 Å to 5 Å) for defining atomic contacts across interfaces and within binding sites.

Our βACV_ASA classifier and its variants described above are summarized in Table 1.

Table 1 Description of β ACV _ASA and its variant methods

Full size table

Ridge regression: predict Δ Δ Gand binding hot spots

Ridge regression in Matlab is used here to learn a relation between atomic contact vectors and Δ Δ G. By this regression, values in each column are normalized for the training dataset. Ridge regression minimizes average square error SE between the experimental ( $ΔΔ G_{i}^{e}$ ) and the predicted $ΔΔ G_{i}^{p}$ in the training data with N mutations where $SE = \frac{\sum_{i} {(ΔΔ G_{i}^{e} - ΔΔ G_{i}^{p})}^{2}}{N - 1}$ .

In our evaluation, leave-one-out cross-validation is used for all of the 396 mutations, and then the correlation coefficient R and average standard deviation $δ = \sqrt{SE}$ are calculated. Under this evaluation framework, there is one outlier prediction by β ACV_ASA and one outlier by βACV for the whole dataset with 396 mutations. These outliers have less than -3 kcal/mol predicted Δ Δ G, or more than 11 kcal/mol predicted Δ Δ G, as shown in Additional file 1: Table S3. This may be due to limited alanine mutations of a high Δ Δ G in the dataset.

In this work, predicted hot spot residues are those residue mutations with a predicted Δ Δ G≥2 kcal/mol, same as the true hot spot definition.

Hot spot prediction and evaluation measures

βACV_ASA is also assessed by applying to the classification problem of binding hot spots. Classification performance is measured by p r e c i s i o n(p.), r e c a l l(r.), a c c u r a c y(a c c.) and F1 whose definitions are given in Equation 8.

\begin{matrix} precision (p.) = \frac{TP}{TP + FP} \\ recall (r.) = \frac{TP}{TP + FN} \\ accuracy (acc.) = \frac{TP + TN}{TP + TN + FP + FN} \\ F 1 = \frac{2 \times precision \times recall}{precision + recall} \end{matrix}

(8)

where binding hot spots are considered as the true cases, while non-hot spots as the false cases; TP, FP, TN and FN are true positives, false positives, true negatives and false negatives, respectively. Hence, precision is the number of correct hot spot predictions divided by the number of positive predictions, recall is the fraction of correct hot spot predictions over all hot spots, while accuracy is the number of correctly predicted hot spots and non-hot spots divided by the number of all mutations. These measures are also used in [14, 20, 35] with the same definitions.

Results and discussion

β contacts are better than distance-cutoff contacts for predicting Δ Δ G

Our β ACV_ASA classifier is compared with ACV_ASA and with non βACV_ASA to show the importance of β contacts in the prediction of Δ Δ G under alanine mutations. The prediction results are presented in Figure 2(a), (b) and (c).

It can be seen from Figure 2(a) and (b) that βACV_ASA has a better Δ Δ G prediction performance according to both correlation coefficient R and average standard deviation δ. The number of β contacts used by βACV_ASA is only a small fraction of the number of distance-cutoff contacts used by ACV_ASA. For example, there are 54,286 distance-cutoff contacts across binding interfaces for the 22 protein complexes, but there are only 9,830 β contacts across the binding interfaces (∠ β=90°), and 4,096 β contacts under the setting of ∠ β=75° which is actually used by βACV_ASA. So, βACV_ASA uses only 7.55% number of distance-cutoff atomic contacts but it achieves a better prediction performance.

The comparison between βACV_ASA and non βACV_ASA (Figure 2(a) and (c)) further suggests the importance of β contacts in Δ Δ G prediction. In Figure 2(c), non βACV_ASA has much lower R (0.469) and a higher δ (1.486) than βACV_ASA, but non βACV_ASA uses all non- β contacts of ∠ β=90°, that is, 81.8% number of distance-cutoff atomic contacts.

A lot of alanine mutations are not binding hot spot residues, having a small Δ Δ G, i.e., <2 kcal/mol. These mutations heavily affect the calculation of R and δ. On the other hand, the prediction of residue mutations with a high Δ Δ G is more important. Thus, the classification performance for these binding hot spots is also assessed. The results are shown in Table 2. It is noted that F1 is not the objective function to be optimized in the regression process.

Table 2 Prediction performances by different methods for the same set of binding hot spots

Full size table

In Table 2 among βACV_ASA, ACV_ASA and non βACV_ASA, βACV_ASA has the highest precision, recall, F1 and accuracy, while non βACV_ASA has the lowest. For example, βACV_ASA’s F1 is 0.604, 0.122 higher than non βACV_ASA’s F1. However, ACV_ASA and non βACV_ASA show quite similar performances. The reason would be that non- β contacts (used by non βACV_ASA) are often dominant in distance-cutoff contacts (used by ACV_ASA).

The performances of ACV_ASA with different spatial distance thresholds are shown in Table 3. We can see that the performance has a growing tendency when the threshold increases. Nevertheless, the best performance of ACV_ASA (F1 = 0.5 in Table 2) is much lower than that of βACV_ASA (F1 = 0.604). Of special interest, when the spatial distance threshold is set at 3.6 Å, ACV_ASA has a number of distance-cutoff contacts nearly the same as the number of our used β contacts. In this special case of almost the same number of contacts used, ACV_ASA has much worse performance than βACV_ASA, and only about half of the distance-based contacts are β contacts across the 22 protein-protein binding interfaces. These results affirm that β contacts are advantageous over distance-based contacts for predicting binding hot spot residues.

Table 3 Prediction performance and the numbers of used contacts by β ACV _ASA and ACV _ASA

Full size table

Water exclusion is a necessary condition of hot spot binding

Literature work has reported that water exclusion is a necessary condition for an interfacial residue to become a hot spot residue [3, 33]. To confirm the importance of water exclusion in the prediction of Δ Δ G, the performance by βACV when ASA is not integrated is assessed. This prediction performance is shown in Figure 2(d) and Table 2. Comparing Figure 2(d) with Figure 2(a), βACV_ASA has a better regression performance than βACV indeed. It also has a better F1 performance as seen in Table 2. These results confirm that water exclusion plays an important role in hot spot prediction, and it should be a necessary condition for an interfacial residue to become a binding hot spot residue in protein-protein complexes.

Our method βACV_ASAis superior to several widely-used methods

Our βACV_ASA classifier is also assessed against the state-of-the-art methods FoldX [9, 10], Robetta [11], HotPOINT [20] and KFC [36]. The prediction performances of these previous methods were obtained through their web servers (Robetta, HotPOINT and KFC) or the standalone executable program (FoldX with default settings).

Comparison results

Figures 2(e) and 2(f) show that the prediction performance of FoldX and Robetta are much worse than our βACV_ASA. These two methods have a R of 0.324 or 0.485, much smaller than βACV_ASA’s 0.569; their δ is 1.788 or 1.554, much larger than βACV_ASA’s 1.349. Table 2 also shows their classification performance on the 396 mutations: FoldX’s F1 is 0.44, while Robetta’s F1 is 0.494, both worse than βACV_ASA’s 0.604. From Table 2, our method also has better classification performance than HotPOINT and KFC. Other performance comparison results are provided in the Additional file 1 when tested on BID (including protein-peptide interfacial residues) or under the leave-one-complex-out cross-validation framework.

An example of hot spot predictions

We use 3HFM as a case study to illustrate the difference of the binding hot spot prediction results by βACV_ASA, FoldX and Robetta. The 3HFM complex is an antibody-antigen binding between HyHEL-10 and hen egg white lysozyme. According to ASEdb, a total of 25 alanine mutations were experimented, 11 of which have Δ Δ G more than 2 kcal/mol.

Our βACV_ASA correctly identified 9 binding hot spot residues with a recall of 0.818, but made 3 false positive predictions with a precision of 0.75 (Figure 3). This gives an F1 of 0.783. However, FoldX made only one hot spot prediction which is correct with a recall of 0.091, and Robetta has a recall of 0.455 (5 out of 11) and a precision of 0.833 (5 out of 6), namely an F1 of 0.588. Both of these methods have a lower prediction performance than our βACV_ASA. What is more important is that the four positive predictions correctly made only by our βACV_ASA, not by FoldX or Robetta, have a high Δ Δ G, such as Trp in position 98 of Chain H (Δ Δ G=5.5 kcal/mol) and Tyr in position 50 of Chain L (Δ Δ G=4.6 kcal/mol). Please refer to Additional file 1: Table S3 for detail.

Conclusion

A new classifier βACV_ASA has been proposed to predict Δ Δ G and binding hot spot residues. The novel idea of this classifier is to integrate the water exclusion theory into β contacts. Tested on a data set of 396 alanine mutations, βACV_ASA has been found to outperform many other methods. This confirms that β contacts are truly better than traditional distance-cutoff contacts to capture the energetic characteristics of protein folding and binding. This also confirms that water exclusion is a necessary condition for a residue to become a binding hot spot residue.

Availability of supporting data

All the supporting data are included as additional files.

References

Schreiber G, Fleishman SJ: Computational design of protein-protein interactions. Curr Opin Struct Biol. 2013, 23 (6): 903-910. 10.1016/j.sbi.2013.08.003.
Article PubMed CAS Google Scholar
Wells J: Systematic mutational analyses of protein-protein interfaces. Methods Enzymol. 1991, 202: 390-411.
Article PubMed CAS Google Scholar
Bogan AA, Thorn KS: Anatomy of hot spots in protein interfaces. J Mol Biol. 1998, 280: 1-9. 10.1006/jmbi.1998.1843.
Article PubMed CAS Google Scholar
Fischer TB, Arunachalam KV, Bailey D, Mangual V, Bakhru S, Russo R, Huang D, Paczkowski M, Lalchandani V, Ramachandra C, Ellison B, Galer S, Shapley J, Fuentes E, Tsai J: The binding interface database (BID): a compilation of amino acid hot spots in protein interfaces. Bioinformatics. 2003, 19 (11): 1453-1454. 10.1093/bioinformatics/btg163.
Article PubMed CAS Google Scholar
Kumar MDS, Gromiha MM: PINT: protein-protein interactions thermodynamic database. Nucleic Acids Res. 2006, 34 (suppl 1): D195-D198.
Article PubMed Central PubMed CAS Google Scholar
Moal IH, Fernandez-Recio J: SKEMPI: a Structural Kinetic and Energetic database of Mutant Protein Interactions and its use in empirical models. Bioinformatics. 2012, 28 (20): 2600-2607. 10.1093/bioinformatics/bts489.
Article PubMed CAS Google Scholar
Clackson T, Wells J: A hot spot of binding energy in a hormone-receptor interface. Science. 1995, 267: 383-386. 10.1126/science.7529940.
Article PubMed CAS Google Scholar
Thorn KS, Bogan AA: ASEdb: a database of alanine mutations and their effects on the free energy of binding in protein interactions. Bioinformatics. 2001, 17 (3): 284-285. 10.1093/bioinformatics/17.3.284.
Article PubMed CAS Google Scholar
Guerois R, Nielsen JE, Serrano L: Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol. 2002, 320 (2): 369-387. 10.1016/S0022-2836(02)00442-4.
Article PubMed CAS Google Scholar
Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L: The FoldX web server: an online force field. Nucl Acids Res. 2005, 33 (Web Server issue): W382-W388.
Article PubMed Central PubMed CAS Google Scholar
Kortemme T, Baker D: A simple physical model for binding energy hot spots in protein-protein complexes. Proc Natl Acad Sci USA. 2002, 99 (22): 14116-14121. 10.1073/pnas.202485799.
Article PubMed Central PubMed CAS Google Scholar
Kortemme T, Kim DE, Baker D: Computational alanine scanning of protein-protein interfaces. Sci STKE. 2004, 2004 (219): pl2-
PubMed Google Scholar
Benedix A, Becker CM, de Groot BL, Caflisch A, Bockmann RA: Predicting free energy changes using structural ensembles. Nat Methods. 2009, 6: 3-4. 10.1038/nmeth0109-3.
Article PubMed CAS Google Scholar
Cho KI, Kim D, Lee D: A feature-based approach to modeling protein-protein interaction hot spots. Nucl Acids Res. 2009, 37 (8): 2672-2687. 10.1093/nar/gkp132.
Article PubMed Central PubMed CAS Google Scholar
Chen P, Li J, Wong L, Kuwahara H, Huang J, Gao X: Accurate prediction of hot spot residues through physicochemical characteristics of amino acid sequences. Proteins: Struct Funct Bioinformatics. 2013, 81 (8): 1351-1362. 10.1002/prot.24278.
Article CAS Google Scholar
Ofran Y, Rost B: Protein-protein interaction hotspots carved into sequences. PLoS Comput Biol. 2007, 3 (7): e119-10.1371/journal.pcbi.0030119.
Article PubMed Central PubMed Google Scholar
Grosdidier S, Recio JF: Identification of hot-spot residues in protein-protein interactions by computational docking. BMC Bioinformatics. 2008, 9: 447-10.1186/1471-2105-9-447.
Article PubMed Central PubMed Google Scholar
Zhu X, Ericksen SS, Demerdash ONA, Mitchell JC: Data-driven models for protein interaction and design. Proteins: Struct Funct Bioinformatics. 2013, 81 (12): 2221-2228. 10.1002/prot.24405.
Article CAS Google Scholar
Guney E, Tuncbag N, Keskin O, Gürsoy A: HotSprint: database of computational hot spots in protein interfaces. Nucl Acids Res. 2008, 36 (Database-Issue): 662-666.
Google Scholar
Tuncbag N, Gursoy A, Keskin O: Identification of computational hot spots in protein interfaces: combining solvent accessibility and inter-residue potentials improves the accuracy. Bioinformatics. 2009, 25 (12): 1513-1520. 10.1093/bioinformatics/btp240.
Article PubMed CAS Google Scholar
Assi SA, Tanaka T, Rabbitts TH, Fernandez-Fuentes N: PCRPi: Presaging Critical Residues in Protein interfaces, a new computational tool to chart hot spots in protein interfaces. Nucl Acids Res. 2010, 38 (6): e86-10.1093/nar/gkp1158.
Article PubMed Central PubMed Google Scholar
Pallara C, Jimenez-Garcia B, Perez-Cano L, Romero-Durana M, Solernou A, Grosdidier S, Pons C, Moal IH, Fernandez-Recio J: Expanding the frontiers of protein-protein modeling: from docking and scoring to binding affinity predictions and other challenges. Proteins: Struct Funct Bioinformatics. 2013, 81 (12): 2192-2200. 10.1002/prot.24387.
Article CAS Google Scholar
Deng L, Guan J, Wei X, Yi Y, Zhou S: Boosting prediction performance of protein-protein interaction hot spots by using structural neighborhood properties. Research in Computational Molecular Biology, Volume 7821 of Lecture Notes in Computer Science. Edited by: Deng M, Jiang R, Sun F, Zhang X. 2013, Springer: Berlin, Heidelberg, 333-344.
Google Scholar
Wang L, Liu ZP, Zhang XS, Chen L: Prediction of hot spots in protein interfaces using a random forest model with hybrid features. Protein Eng Design Sel. 2012, 25 (3): 119-126. 10.1093/protein/gzr066.
Article CAS Google Scholar
Dehouck Y, Kwasigroch JM, Rooman M, Gilis D: BeAtMuSiC: prediction of changes in protein-protein binding affinity on mutations. Nucleic Acids Res. 2013, 41 (Web Server issue): W333-W339.
Article PubMed Central PubMed Google Scholar
Moretti R, Fleishman SJ, Agius R, Torchala M, Bates PA, Kastritis PL, Rodrigues JPGLM, Trellet M, Bonvin AMJJ, Cui M, Rooman M, Gillis D, Dehouck Y, Moal I, Romero-Durana M, Perez-Cano L, Pallara C, Jimenez B, Fernandez-Recio J, Flores S, Pacella M, Praneeth Kilambi K, Gray JJ, Popov P, Grudinin S, Esquivel-Rodriguez J, Kihara D, Zhao N, Korkin D, Zhu X, et al: Community-wide evaluation of methods for predicting the effect of mutations on protein-protein interactions. Proteins: Struct Funct Bioinformatics. 2013, 81 (11): 1980-1987. 10.1002/prot.24356.
Article CAS Google Scholar
Kirkpatrick DG, Radke JD: A framework for computational morphology. Computational Geometry, Machine Intelligence and Pattern Recognition. 1985, 217-248.
Google Scholar
Liu Q, Kwoh CK, Hoi SCH: Beta atomic contacts: identifying critical specific contacts in protein binding interfaces. PLoS ONE. 2013, 8 (4): e59737-10.1371/journal.pone.0059737.
Article PubMed Central PubMed CAS Google Scholar
Liu Q, Kwoh CK, Li J: Binding affinity prediction for protein-ligand complexes based onβcontacts and B factor. J Chem Inf Model. 2013, 53 (11): 3076-3085. 10.1021/ci400450h.
Article PubMed CAS Google Scholar
Hubbard SJ, Thornton JM: ‘NACCESS’, computer program. Tech. rep., Department of Biochemistry Molecular Biology, University College London, 1993,
Barber BC, Dobkin DP, Huhdanpaa H: The Quickhull algorithm for convex hulls. ACM Trans Math Softw. 1996, 22 (4): 469-483. 10.1145/235815.235821.
Article Google Scholar
Mintseris J, Weng Z: Atomic contact vectors in protein-protein recognition. Proteins. 2003, 53 (3): 629-639. 10.1002/prot.10432.
Article PubMed CAS Google Scholar
Li J, Liu Q: ‘Double water exclusion’: a hypothesis refining the O-ring theory for the hot spots at protein interfaces. Bioinformatics. 2009, 25 (6): 743-750. 10.1093/bioinformatics/btp058.
Article PubMed Central PubMed CAS Google Scholar
Martins JM, Ramos RM, Pimenta AC, Moreira IS: Solvent-accessible surface area: how well can be applied to hot-spot detection?. Proteins: Struct Funct Bioinformatics. 2013, 82 (3): 479-490.
Article Google Scholar
Xia JF, Zhao XM, Song J, Huang DS: APIS: accurate prediction of hot spots in protein interfaces by combining protrusion index with solvent accessibility. BMC Bioinformatics. 2010, 11: 174-10.1186/1471-2105-11-174.
Article PubMed Central PubMed Google Scholar
Zhu X, Mitchell JC: KFC2: a knowledge-based hot spot prediction method based on interface solvation, atomic density and plasticity features. Proteins. 2011, 79 (9): 2671-2683. 10.1002/prot.23094.
Article PubMed CAS Google Scholar

Download references

Acknowledgements

This work was partially supported by an ARC Discovery Project (DP130102124).

Author information

Authors and Affiliations

Advanced Analytics Institute and Center for Health Technologies, Faculty of Engineering and IT, University of Technology, Sydney, Australia
Qian Liu & Jinyan Li
School of Computer Engineering, Nanyang Technological University, Singapore, 639798, Singapore
Qian Liu, Steven CH Hoi & Chee Keong Kwoh
School of Computing, National University of Singapore, Singapore, 117417, Singapore
Limsoon Wong

Authors

Qian Liu
View author publications
You can also search for this author in PubMed Google Scholar
Steven CH Hoi
View author publications
You can also search for this author in PubMed Google Scholar
Chee Keong Kwoh
View author publications
You can also search for this author in PubMed Google Scholar
Limsoon Wong
View author publications
You can also search for this author in PubMed Google Scholar
Jinyan Li
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jinyan Li.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

QL designed the methods and performed the experiments. JL and SH supervised the study. JL, LW and CK participated in the analysis. QL drafted the manuscript. QL, SH, LW, CK and JL read and revised the manuscript. All authors approved the final version.

Electronic supplementary material

12859_2013_6320_MOESM1_ESM.pdf

Additional file 1: This additional file covers an analysis on β contacts of different T_ds, more evaluation results and related discussions (including the statistical significance of the difference among Figure2(a) to2(d), the dataset and evaluation on BID, the evaluation under leave-one-complex-out cross-validation, and a discussion on using the 396 mutations), the groups of atomic types, and the detail of our dataset.(PDF 126 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Liu, Q., Hoi, S.C., Kwoh, C.K. et al. Integrating water exclusion theory into β contacts to predict binding free energy changes and binding hot spots. BMC Bioinformatics 15, 57 (2014). https://doi.org/10.1186/1471-2105-15-57

Download citation

Received: 30 July 2013
Accepted: 19 February 2014
Published: 26 February 2014
DOI: https://doi.org/10.1186/1471-2105-15-57

Integrating water exclusion theory into β contacts to predict binding free energy changes and binding hot spots

Abstract

Background

Results

Conclusions

Background

Methods

Dataset

Atomic βcontacts in protein binding interfaces

Atomic βcontacts: a definition

A method to produce βcontacts

Our prediction methods

βACV: a vector representation for interfacial alanine mutations

Constructing a basic βACV vector

Bound water molecules in protein interfaces

The neighbourhood atoms of mutated residues

βACV ASA : integrating the water exclusion hypothesis into βACV

Comparison of βcontacts with distance-cutoff contacts

Ridge regression: predict Δ Δ Gand binding hot spots

Hot spot prediction and evaluation measures

Results and discussion

β contacts are better than distance-cutoff contacts for predicting Δ Δ G

Water exclusion is a necessary condition of hot spot binding

Our method βACV ASA is superior to several widely-used methods

Comparison results

An example of hot spot predictions

Conclusion

Availability of supporting data

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ contributions

Electronic supplementary material

12859_2013_6320_MOESM1_ESM.pdf

Authors’ original submitted files for images

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us

βACV_ASA: integrating the water exclusion hypothesis into βACV

Our method βACV_ASAis superior to several widely-used methods