Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

Open Access Research article

Integrating water exclusion theory into βcontacts to predict binding free energy changes and binding hot spots

Qian Liu12, Steven CH Hoi2, Chee Keong Kwoh2, Limsoon Wong3 and Jinyan Li1*

Author Affiliations

1 Advanced Analytics Institute and Center for Health Technologies, Faculty of Engineering and IT, University of Technology, Sydney, Australia

2 School of Computer Engineering, Nanyang Technological University, Singapore 639798, Singapore

3 School of Computing, National University of Singapore, Singapore 117417, Singapore

For all author emails, please log on.

BMC Bioinformatics 2014, 15:57  doi:10.1186/1471-2105-15-57


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/15/57


Received:30 July 2013
Accepted:19 February 2014
Published:26 February 2014

© 2014 Liu et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Binding free energy and binding hot spots at protein-protein interfaces are two important research areas for understanding protein interactions. Computational methods have been developed previously for accurate prediction of binding free energy change upon mutation for interfacial residues. However, a large number of interrupted and unimportant atomic contacts are used in the training phase which caused accuracy loss.

Results

This work proposes a new method, βACVASA, to predict the change of binding free energy after alanine mutations. βACVASA integrates accessible surface area (ASA) and our newly defined β contacts together into an atomic contact vector (ACV). A β contact between two atoms is a direct contact without being interrupted by any other atom between them. A β contact’s potential contribution to protein binding is also supposed to be inversely proportional to its ASA to follow the water exclusion hypothesis of binding hot spots. Tested on a dataset of 396 alanine mutations, our method is found to be superior in classification performance to many other methods, including Robetta, FoldX, HotPOINT, an ACV method of β contacts without ASA integration, and ACVASA methods (similar to βACVASA but based on distance-cutoff contacts). Based on our data analysis and results, we can draw conclusions that: (i) our method is powerful in the prediction of binding free energy change after alanine mutation; (ii) β contacts are better than distance-cutoff contacts for modeling the well-organized protein-binding interfaces; (iii) β contacts usually are only a small fraction number of the distance-based contacts; and (iv) water exclusion is a necessary condition for a residue to become a binding hot spot.

Conclusions

βACVASA is designed using the advantages of both β contacts and water exclusion. It is an excellent tool to predict binding free energy changes and binding hot spots after alanine mutation.

Background

A binding hot spot is a small area in a protein binding interface whose mutation can lead to a big change in binding free energy. The determination of its accurate location in the interface is a fundamental problem in structural biology, and is useful for applications such as rational drug design and protein engineering [1]. In wet labs, a residue’s contribution to binding free energy can be determined through mutation experiments. For example, alanine scanning mutagenesis [2] mutates interfacial residues individually into an alanine, and then measures the change of binding free energy (ΔΔG) to quantify the contribution of the side chain of the mutated residue. Based on these wet-lab experimental outcomes and databases [3-6], it has been reported that binding free energy is unevenly distributed in protein interfaces [7]. In fact, there are always a small fraction of interfacial residues—the binding hot spot—which make major contribution to the binding [7,8] with ΔΔG≥2 kcal/mol [3]. But wet-lab experiments are both time and cost expensive. Reliable computational methods are thus needed for accurate prediction of binding free energy change.

FoldX [9,10], Robetta [11,12] and CC/PBSA [13] are some well-known physics-based methods for this prediction problem. These methods use empirical terms (such as hydrogen bonds), the van der Waals terms and Coulomb electrostatics to learn a linear function for estimating the effect on the change of binding free energy after residue mutations. However, the predicted energy by these methods has a large discrepancy from experimentally measured ΔΔG[14]. Thus, other methods have been proposed to qualitatively identify binding hot spots. For example, protein sequences are used by [15] and ISIS [16], while protein tertiary structures are used together with docking techniques by [17]. Protein quaternary structures have been also widely used [18]. For example, Hotsprint [19] and HotPOINT [20] generate rules to identify binding hot spots from features such as conservation, accessible surface area (ASA), residue propensity and/or residue pairwise potentials. Machine learning models are also widely used for predicting binding hot spots. Decision trees are used in MINERVA [14] to induce rules at different levels of protein information including structure, sequence and molecular interactions. Later, machine learning algorithms SVM and its ensemble are employed to combine energetic terms such as van der Waals potentials, solvation energy, hydrogen bonds and Coulomb electrostatics, and/or other protein sequences and structure information for a better hot spot prediction performance. Recently, Bayesian Networks are used to combine three main sources of information related to conservation, FoldX-calculated ΔΔG and atomic contacts for a novel probabilistic model of binding hot spots prediction [21]. Very recently, random forests have been proposed to predict hot spots [22] by using structural neighborhood properties of mutated residues and other conventional physicochemical features [23,24]. Besides alanine mutations, hot spots after mutations to any other type of residues are also investigated [6] and their binding free energy changes can be predicted [13,25] with good performance. Several of these methods are also assessed in a community-wide test for predicting mutation effects on protein-protein interaction affinity [26].

In spite of intensive research, the prediction still needs a big improvement. The existing methods usually used those atomic contacts based on Voronoi diagram or simply defined by a distance threshold with little consideration on the local atomic organization of the contacts. If the distance threshold is too large, e.g., larger than 6 Å, an atomic contact between two atoms i and j may have no direct contact area, because the space between i and j can accommodate other atoms. Such interrupted contacts constitute a large proportion of the traditionally used contacts. It is highly questionable whether they are really important to protein binding. In fact, important contacts in hot spot prediction [10,11] or those closely related to binding hot spots [14] are generally not interrupted, such as hydrogen bonds, salt bridges and ππ contacts.

To overcome these drawbacks, we propose a novel classifier βACVASA for predicting ΔΔG and binding hot spots. The main idea of βACVASA is to use atomic contact vector (ACV) of β contacts (that’s why our classifier is named βACV for short) instead of distance-cutoff contacts. β contact, found on β-skeletons [27], is our newly defined contact [28]. A β contact between two atoms restricts that there is no other atoms between these two atoms, and requires that the two atoms should have enough direct contact area to form an interaction. The definition of β contacts can filter out a lot of unimportant and interrupted distance-cutoff contacts. Our analysis has found that β contacts are only a small fraction number of those contacts based on a distance threshold [28], but they are effective to distinguish crystal packing from homodimers [28] and to predict protein-ligand binding affinity [29].

Another important idea is that the relative ASA properties are integrated by our βACV classifier based on the water exclusion hypothesis of binding hot spots. The water exclusion hypothesis states that the topological shape of a binding hot spot and its surrounding residues can be characterized as an O-ring structure [3]. Few residues on the O-ring, which are largely exposed to bulk solvent water, can contribute significantly to the protein binding. Thus in βACV, the energy contribution of a β contact to protein binding is required to be inversely proportional to its ASA.

Our βACVASA was tested on a dataset of 396 alanine mutations to show its superior performance. We compared βACVASA with the following methods: (i) ACV methods using distance-cutoff contacts to reveal the importance of β contacts to protein binding; (ii) a βACV method without ASA integration to confirm whether the water exclusion theory is necessary for binding hot spots; and (iii) several widely-used state-of-the-art methods such as Robetta, FoldX, HotPOINT and KFC to show the overall better prediction capability of βACVASA.

Methods

Dataset

The data stored in the ASEdb database [3] and the mutations in BID [4] having ΔΔG measurements are both used for evaluating our method. In total, our dataset contains 22 protein-protein complexes (detailed in Additional file 1: Table S2). All of them have quaternary structures in PDB and meet the following three requirements. First, no redundancy exists among these protein complexes. Given two protein complexes (e.g., interacted pair A and B, and interacted pair C and D), a sequence identity is calculated through BLAST with the default setting for A and C, A and D, B and C, and B and D, denoted by S(A,C), S(A,D), S(B,C) and S(B,D) respectively. These two protein complexes are redundant if S(A,C)≥40% and S(B,D)≥40%, or S(A,D)≥40% and S(B,C)≥40%. According to this criterion, most of the protein complexes in our dataset are non-redundant. For those redundant complexes, the mutations in the similar proteins must be in different positions. Our requirement on this sequence identity is reasonable, since atomic contacts and ASA used in this work are derived from complexes only, rather than from sequences (such as required by conservation scores). Secondly, only alanine mutations are considered. Thirdly, mutated atoms before mutation must have at least one distance-cutoff atomic contact with the partner proteins. The mutated atoms are those atoms except N, CA, C, O and CB. Under these requirements, our dataset has 396 alanine mutations (detailed in Additional file 1: Table S3). Of these mutations, 86 are binding hot spot residues having ΔΔG≥2 kcal/mol.

Additional file 1. This additional file covers an analysis onβ contacts of differentTds, more evaluation results and related discussions (including the statistical significance of the difference among Figure2(a) to2(d), the dataset and evaluation on BID, the evaluation under leave-one-complex-out cross-validation, and a discussion on using the 396 mutations), the groups of atomic types, and the detail of our dataset.

Format: PDF Size: 126KB Download file

This file can be viewed with: Adobe Acrobat ReaderOpen Data

Atomic β contacts in protein binding interfaces

Atomic β contact is a recently proposed notion of atomic contacts for modeling the well-organized protein 3D structures [28]. Its detail can be found in [28]. For easy reference, we give a brief description of β contacts and a simple method to produce β contacts from a protein quaternary structure.

Atomic β contacts: a definition

Given a quaternary structure of a protein complex p, a β contact between two atoms i and j in p requires that (i) the spatial distance between i and j is less than a threshold Td plus the sum of their van der Waals radii defined by [30] (distance-cutoff contacts for short), (ii) i and j share a Voronoi facet in p’s Voronoi diagram, and (iii) the contact cannot break p’s β-skeleton. The β-skeleton [27] of a discrete set p is an undirected graph in computational geometry. In this graph, two points i and j have an edge if angle ikj is sharper than a threshold determined by β, ∀kp,ki,j. This angle threshold is denoted as β, which actually defines a forbidden region fr of the contact between i and j, e.g., the gray regions in Figure 1. When β=1, namely β=90°, fr is the sphere with the midpoint of i and j as the center and with c’s length as the diameter as shown in Figure 1(b). This sphere is similar to van der Waals radii of atoms. The forbidden region fr of a β contact usually does not cover any other atoms. Otherwise, if there is an atom k in fr, for example as shown in Figure 1(a) when β=90° or in Figure 1(c) when β=75°, the contact between i and j is not a β contact. A β contact suggests that its two atoms should have enough direct contact area to form an important interaction. The number of atomic β contacts in protein binding interfaces is only a small fraction number of distance-based contacts or less than half the number of contacts in the Voronoi diagrams when Td=3.3 as found by [28]. Interestingly, the use of β contacts can achieve better prediction performance for distinguishing false binding of crystal packing from homodimers.

thumbnailFigure 1. Examples ofβ contacts and non-β contacts. Three points, denoted by i, j and k, represent atoms. The dashed circles represent the van der Waals spheres in 2D space. The lines in yellow are of interest.

A method to produce β contacts

A protein complex p can be represented as an atomic β contact graph b(p), if all of the heavy atoms are represented by nodes, and the β contacts are represented by edges. To produce b(p) for p, Qhull is first used to obtain the Delaunay triangulation [31] for all nodes. After that, the distance threshold Td is used to remove those atomic contacts whose distances are too large. Td is set as 3.3 Å (the diameter of a water molecule 2.8 Å plus 0.5 Å). This threshold is an insensitive factor to β contacts when it is large enough. Please refer to the Additional file 1 for an analysis of β contacts under several different Tds. Thirdly, each atomic contact is checked to guarantee that it satisfies the β skeleton requirements. To sharpen the difference of those mutations with higher ΔΔG and those with lower ΔΔG, the angle threshold β is set as 75° in this work, whose forbidden region fr is larger than that of β=90° as shown in Figure 1(b) and (c). That is, β=75° is a stricter condition than using β=90° to produce β contacts. The rationale to choose the stricter condition β=75° is illustrated in the following situation. Assume (i) A, B and C are the center points of three atoms with van der Waals radii 1.8 Å (for example, the radius of common Carbon atoms in protein structures), (ii) there is a covalent bond between A and B (AB for short) with spatial distance 1.5 Å, and a non-covalent bond between A and C (AC for short) and one between B and C (BC for short), and (iii) the van der Waals sphere of A and that of C are circumscribed each other with the spatial distance 3.6 Å, and the same for the van der Waals sphere of A and that of C. Then, the angle ABC=78°. A stricter threshold than 78° is 75°, which was chosen by this work.

Our prediction methods

This section describes how to construct our βACVASA classifier, including how to define βACV and how to integrate ASA into βACV.

βACV: a vector representation for interfacial alanine mutations

We use an atomic contact vector (ACV) [32] of β contacts to represent an interfacial alanine mutation. We also use β contacts of the interfacial bound water molecules to update the vector, and use the atomic environment of the mutation to expand the basic vector.

Constructing a basic βACV vector

To produce a basic βACV for an alanine mutation mutrala of residue r in a protein complex p, we build the β contact graph b(p) for p. We then remove the coordinates of the mutated atoms in r to get a quaternary structure resulted from the mutation, denoted by prmut. Then, we produce another β contact graph b(prmut) for prmut.

By the mutation mutrala, some contacts in b(p) may disappear in b(prmut), namely those mutated contacts, while some new contacts can be formed in b(prmut), called new contacts. Both of those mutated and new contacts are represented in a βACV vector. As the heavy atoms from the 20 standard residues are grouped into 8 atomic types (shown in Additional file 1: Table S4) by this work, our βACV vector has 36 pairs of atomic types as elements. The value v(Ti,Tj) of each element in βACV with atomic types Ti and Tj is calculated, using Equation 1.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/57/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/57/mathml/M1">View MathML</a>

(1)

where x and x are of the atomic type Ti, y and y are of the atomic type Tj, (x,y) and (x,y) are two atomic pairs, d(∗,∗) is the spatial distance between a pair of atoms, and M(Ti,Tj)(or N(Ti,Tj)) is the set of all those mutated (or the set of all those new) contacts whose atomic types are Ti and Tj. Here term d2(∗,∗) is specially used to follow the same idea as Coulomb’s law which also uses the inverse of squared distance. Note that the other common contacts between b(p) and b(prmut) are not used in βACV. Alanine mutations of Ala are assumed to have insignificant ΔΔG and alanine mutations of Gly are not considered.

It can be seen that a basic βACV considers all mutated contacts and new contacts, including both acrossinterface contacts and those contacts from the same proteins or same biological units. However, atomic contacts between covalent-bond nearby atoms are not used in Equation 1. The covalent-bond nearby atoms of a given atom i are those atoms that have not more than three covalent-bond steps from i. For example, suppose ijklm, where − indicates a covalent bond. From i, the covalent-bond step is 0 to i, is 1 to j, is 2 to k, is 3 to l and is 4 to m, respectively. Thus, i,j,k and l are covalent-bond nearby atoms of i, while m is not. In βACV, the contacts between i and its covalent-bond nearby atoms are excluded from M or N in Equation 1. This is reasonable, because spatially close distances between i and its covalent-bond nearby atoms are more likely due to the rigidity of their covalent bonds.

Bound water molecules in protein interfaces

Protein folding and binding occur in a solvent environment in vivo. Water molecules are heavily involved in protein binding and sometimes they can form a compulsory part of the protein interfaces. In this work, a water molecule in PDB is considered as a part of a binding interface if (i) it has at least 3 potential hydrogen-bonds contacts, or (ii) it has 2 potential hydrogen-bond contacts and also has at least 2 other contacts with spatial distances less than 4 Å. A potential hydrogen-bond contact is required to have a spatial distance less than 3.2 Å between a hydrogen donor (such as a nitrogen atom) and a hydrogen acceptor (such as an oxygen atom). Water molecules under this requirement, named bound water molecules, are such closely involved in protein folding and binding that they can play an integral part. Bound water molecules are then grouped into the Oxygen atomic type with more than one hydrogen atom (shown in Additional file 1: Table S4) to update the values of the elements in the basic βACV vector. We did not consider the contacts between any two water molecules.

The neighbourhood atoms of mutated residues

Information of neighbourhood atoms of mutrala is used to expand the basic βACV vector. Assume that S is a set of atoms which have β contacts with the mutated atoms under β=90°. For each atom in S including the mutated atoms, its nearby atoms are added into S. Then, an atomic vector with the above 8 atomic types is also used to represent those atoms in S in the bound state. The value of its element Tk is calculated using v(Tk)b in Equation 2. Similarly, v(Tk)u in Equation 2 is used to calculate another atomic vector for representing the atoms in S in the unbound state.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/57/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/57/mathml/M2">View MathML</a>

(2)

where Ejloc (or Ejloc(u)) is the relative local ASA of atom j in the bound (or unbound) state calculated via Equation 6 below. Water molecules were not considered here. Thus, each basic βACV vector is now expanded by another 16 atomic types for representing surrounding information of mutated atoms. So, the expanded βACV is a vector representation with 52 elements.

βACVASA: integrating the water exclusion hypothesis into βACV

Solvent water is compulsory for protein binding, but water exclusion—small accessible surface area (ASA)—is a necessary condition for a residue to become binding hot spot [3,33,34]. Few highly exposed residues can make significant contribution to protein binding strength [34]. Thus, we integrate ASA information into each atomic pair of βACV in Equation 1, and name the method βACVASA. We note that except Equation 1, the other definitions in βACVASA are the same as those in βACV.

Given a protein complex p, we take the following steps to integrate the water exclusion theory into Equation 1. The first step is to use NACCESS [30] to produce ASA for all of the atoms and residues in both bound and unbound states. For p in the bound state, we then define special ASA terms for an atom i using Equation 3, and for a residue Ri using Equation 4 and Equation 5.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/57/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/57/mathml/M3">View MathML</a>

(3)

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/57/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/57/mathml/M4">View MathML</a>

(4)

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/57/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/57/mathml/M5">View MathML</a>

(5)

In Equation 3, ASAi is accessible surface area of atom i, while Ei is its relative ASA, and Bi is its relative ASA burial compared to the maximum ASA, where number 50 is roughly half of NACCESS-calculated ASA of a single water molecule without any neighbor atoms (the ASA of a water molecule is 98.47=4×3.14×2.82 Å2). In Equations 4 and 5, <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/57/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/57/mathml/M6">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/57/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/57/mathml/M7">View MathML</a> are accessible surface area of backbone atoms (i.e., bb) and of side-chain atoms (i.e., sc) for a residue Ri, while <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/57/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/57/mathml/M8">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/57/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/57/mathml/M9">View MathML</a> are the relative ASA and the relative ASA burial of ∗∈{bb,sc}. <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/57/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/57/mathml/M10">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/57/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/57/mathml/M11">View MathML</a> are the maximum ASA of backbone atoms and of side-chain atoms for the residue type of Ri, which are calculated in a triplet of ALA- Ri-ALA by NACCESS. These backbone atoms and side-chain atoms are defined in the same way as those in [30].

We compute the local ASA Eiloc and local ASA burial Biloc of an atom i via Equation 6.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/57/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/57/mathml/M12">View MathML</a>

(6)

where the multiplication of relative ASA burial of both atom i and its residue is used to calculate local ASA burial Biloc. This is because relative ASA of both an atom and its residue are critical in describing the accessibility of an atom. For example, an atom may be buried with small ASA but its covalent-bond atoms might be exposed. When relative ASA of atoms or residues are used individually, the performance was worse (data not shown).

To integrate water exclusion theory into Equation 1, we determine the value v(Ti,Tj) of each element in βACVASA through Equation 7 instead of Equation 1.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/57/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/57/mathml/M13">View MathML</a>

(7)

where T, x, y, x, y, M and N have the same meaning as those in Equation 1.

Comparison of β contacts with distance-cutoff contacts

To compare the performance of β contacts with distance-cutoff contacts for predicting ΔΔG, ACVASA based on distance-cutoff contacts is constructed in a similar way to constructing βACVASA. To further show the importance of β contacts in protein binding interfaces, non βACVASA is also constructed for alanine mutations at the setting of β=90°. In non βACVASA, the values of its elements are the difference of the values of the 52 elements between ACVASA and βACVASA. To highlight the advantage of β contacts, ACVASA is also evaluated with different spatial distance thresholds (from 2.9 Å to 5 Å) for defining atomic contacts across interfaces and within binding sites.

Our βACVASA classifier and its variants described above are summarized in Table 1.

Table 1. Description ofβACVASA and its variant methods

Ridge regression: predict ΔΔG and binding hot spots

Ridge regression in Matlab is used here to learn a relation between atomic contact vectors and ΔΔG. By this regression, values in each column are normalized for the training dataset. Ridge regression minimizes average square error SE between the experimental (<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/57/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/57/mathml/M14">View MathML</a>) and the predicted <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/57/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/57/mathml/M15">View MathML</a> in the training data with N mutations where <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/57/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/57/mathml/M16">View MathML</a>.

In our evaluation, leave-one-out cross-validation is used for all of the 396 mutations, and then the correlation coefficient R and average standard deviation <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/57/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/57/mathml/M17">View MathML</a> are calculated. Under this evaluation framework, there is one outlier prediction by βACVASA and one outlier by βACV for the whole dataset with 396 mutations. These outliers have less than -3 kcal/mol predicted ΔΔG, or more than 11 kcal/mol predicted ΔΔG, as shown in Additional file 1: Table S3. This may be due to limited alanine mutations of a high ΔΔG in the dataset.

In this work, predicted hot spot residues are those residue mutations with a predicted ΔΔG≥2 kcal/mol, same as the true hot spot definition.

Hot spot prediction and evaluation measures

βACVASA is also assessed by applying to the classification problem of binding hot spots. Classification performance is measured by precision(p.), recall(r.), accuracy(acc.) and F1 whose definitions are given in Equation 8.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/57/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/57/mathml/M18">View MathML</a>

(8)

where binding hot spots are considered as the true cases, while non-hot spots as the false cases; TP, FP, TN and FN are true positives, false positives, true negatives and false negatives, respectively. Hence, precision is the number of correct hot spot predictions divided by the number of positive predictions, recall is the fraction of correct hot spot predictions over all hot spots, while accuracy is the number of correctly predicted hot spots and non-hot spots divided by the number of all mutations. These measures are also used in [14,20,35] with the same definitions.

Results and discussion

β contacts are better than distance-cutoff contacts for predicting ΔΔG

Our βACVASA classifier is compared with ACVASA and with non βACVASA to show the importance of β contacts in the prediction of ΔΔG under alanine mutations. The prediction results are presented in Figure 2(a), (b) and (c).

thumbnailFigure 2. ΔΔG predicted by different methods. In (a)-(d), ‘o’ represents non-polar residues, while ‘x’ represents polar residues. R is the Pearson correlation coefficient and δ is the average standard deviation. R is specially calculated for (a) and (d) based on the data set after removing the one outlier prediction. R is slightly changed to 0.543 or 0.515 respectively when the outlier is not removed. The two diagonal red lines represent y=x±1.5.

It can be seen from Figure 2(a) and (b) that βACVASA has a better ΔΔG prediction performance according to both correlation coefficient R and average standard deviation δ. The number of β contacts used by βACVASA is only a small fraction of the number of distance-cutoff contacts used by ACVASA. For example, there are 54,286 distance-cutoff contacts across binding interfaces for the 22 protein complexes, but there are only 9,830 β contacts across the binding interfaces (β=90°), and 4,096 β contacts under the setting of β=75° which is actually used by βACVASA. So, βACVASA uses only 7.55% number of distance-cutoff atomic contacts but it achieves a better prediction performance.

The comparison between βACVASA and non βACVASA (Figure 2(a) and (c)) further suggests the importance of β contacts in ΔΔG prediction. In Figure 2(c), non βACVASA has much lower R (0.469) and a higher δ (1.486) than βACVASA, but non βACVASA uses all non- β contacts of β=90°, that is, 81.8% number of distance-cutoff atomic contacts.

A lot of alanine mutations are not binding hot spot residues, having a small ΔΔG, i.e., <2 kcal/mol. These mutations heavily affect the calculation of R and δ. On the other hand, the prediction of residue mutations with a high ΔΔG is more important. Thus, the classification performance for these binding hot spots is also assessed. The results are shown in Table 2. It is noted that F1 is not the objective function to be optimized in the regression process.

Table 2. Prediction performances by different methods for the same set of binding hot spots

In Table 2 among βACVASA, ACVASA and non βACVASA, βACVASA has the highest precision, recall, F1 and accuracy, while non βACVASA has the lowest. For example, βACVASA’s F1 is 0.604, 0.122 higher than non βACVASA’s F1. However, ACVASA and non βACVASA show quite similar performances. The reason would be that non- β contacts (used by non βACVASA) are often dominant in distance-cutoff contacts (used by ACVASA).

The performances of ACVASA with different spatial distance thresholds are shown in Table 3. We can see that the performance has a growing tendency when the threshold increases. Nevertheless, the best performance of ACVASA (F1 = 0.5 in Table 2) is much lower than that of βACVASA (F1 = 0.604). Of special interest, when the spatial distance threshold is set at 3.6 Å, ACVASA has a number of distance-cutoff contacts nearly the same as the number of our used β contacts. In this special case of almost the same number of contacts used, ACVASA has much worse performance than βACVASA, and only about half of the distance-based contacts are β contacts across the 22 protein-protein binding interfaces. These results affirm that β contacts are advantageous over distance-based contacts for predicting binding hot spot residues.

Table 3. Prediction performance and the numbers of used contacts byβACVASA and ACVASA

Water exclusion is a necessary condition of hot spot binding

Literature work has reported that water exclusion is a necessary condition for an interfacial residue to become a hot spot residue [3,33]. To confirm the importance of water exclusion in the prediction of ΔΔG, the performance by βACV when ASA is not integrated is assessed. This prediction performance is shown in Figure 2(d) and Table 2. Comparing Figure 2(d) with Figure 2(a), βACVASA has a better regression performance than βACV indeed. It also has a better F1 performance as seen in Table 2. These results confirm that water exclusion plays an important role in hot spot prediction, and it should be a necessary condition for an interfacial residue to become a binding hot spot residue in protein-protein complexes.

Our method βACVASA is superior to several widely-used methods

Our βACVASA classifier is also assessed against the state-of-the-art methods FoldX [9,10], Robetta [11], HotPOINT [20] and KFC [36]. The prediction performances of these previous methods were obtained through their web servers (Robetta, HotPOINT and KFC) or the standalone executable program (FoldX with default settings).

Comparison results

Figures 2(e) and 2(f) show that the prediction performance of FoldX and Robetta are much worse than our βACVASA. These two methods have a R of 0.324 or 0.485, much smaller than βACVASA’s 0.569; their δ is 1.788 or 1.554, much larger than βACVASA’s 1.349. Table 2 also shows their classification performance on the 396 mutations: FoldX’s F1 is 0.44, while Robetta’s F1 is 0.494, both worse than βACVASA’s 0.604. From Table 2, our method also has better classification performance than HotPOINT and KFC. Other performance comparison results are provided in the Additional file 1 when tested on BID (including protein-peptide interfacial residues) or under the leave-one-complex-out cross-validation framework.

An example of hot spot predictions

We use 3HFM as a case study to illustrate the difference of the binding hot spot prediction results by βACVASA, FoldX and Robetta. The 3HFM complex is an antibody-antigen binding between HyHEL-10 and hen egg white lysozyme. According to ASEdb, a total of 25 alanine mutations were experimented, 11 of which have ΔΔG more than 2 kcal/mol.

Our βACVASA correctly identified 9 binding hot spot residues with a recall of 0.818, but made 3 false positive predictions with a precision of 0.75 (Figure 3). This gives an F1 of 0.783. However, FoldX made only one hot spot prediction which is correct with a recall of 0.091, and Robetta has a recall of 0.455 (5 out of 11) and a precision of 0.833 (5 out of 6), namely an F1 of 0.588. Both of these methods have a lower prediction performance than our βACVASA. What is more important is that the four positive predictions correctly made only by our βACVASA, not by FoldX or Robetta, have a high ΔΔG, such as Trp in position 98 of Chain H (ΔΔG=5.5 kcal/mol) and Tyr in position 50 of Chain L (ΔΔG=4.6 kcal/mol). Please refer to Additional file 1: Table S3 for detail.

thumbnailFigure 3. Prediction results byβACVASA for the residues in the interface between Chain Y and Chain HL(together) in 3HFM. In (a) and (b), the true predicted hot spot residues are in magenta, the false predicted non-hot spot residues are in yellow, the false predicted hot spot residues are in orange, while the true predicted non-hot spots are in cyan; X-YZZ stands for residue Y in position ZZ of Chain X.

Conclusion

A new classifier βACVASA has been proposed to predict ΔΔG and binding hot spot residues. The novel idea of this classifier is to integrate the water exclusion theory into β contacts. Tested on a data set of 396 alanine mutations, βACVASA has been found to outperform many other methods. This confirms that β contacts are truly better than traditional distance-cutoff contacts to capture the energetic characteristics of protein folding and binding. This also confirms that water exclusion is a necessary condition for a residue to become a binding hot spot residue.

Availability of supporting data

All the supporting data are included as additional files.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

QL designed the methods and performed the experiments. JL and SH supervised the study. JL, LW and CK participated in the analysis. QL drafted the manuscript. QL, SH, LW, CK and JL read and revised the manuscript. All authors approved the final version.

Acknowledgements

This work was partially supported by an ARC Discovery Project (DP130102124).

References

  1. Schreiber G, Fleishman SJ: Computational design of protein-protein interactions.

    Curr Opin Struct Biol 2013, 23(6):903-910. PubMed Abstract | Publisher Full Text OpenURL

  2. Wells J: Systematic mutational analyses of protein-protein interfaces.

    Methods Enzymol 1991, 202:390-411. PubMed Abstract | Publisher Full Text OpenURL

  3. Bogan AA, Thorn KS: Anatomy of hot spots in protein interfaces.

    J Mol Biol 1998, 280:1-9. PubMed Abstract | Publisher Full Text OpenURL

  4. Fischer TB, Arunachalam KV, Bailey D, Mangual V, Bakhru S, Russo R, Huang D, Paczkowski M, Lalchandani V, Ramachandra C, Ellison B, Galer S, Shapley J, Fuentes E, Tsai J: The binding interface database (BID): a compilation of amino acid hot spots in protein interfaces.

    Bioinformatics 2003, 19(11):1453-1454. PubMed Abstract | Publisher Full Text OpenURL

  5. Kumar MDS, Gromiha MM: PINT: protein-protein interactions thermodynamic database.

    Nucleic Acids Res 2006, 34(suppl 1):D195-D198. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  6. Moal IH, Fernandez-Recio J: SKEMPI: a Structural Kinetic and Energetic database of Mutant Protein Interactions and its use in empirical models.

    Bioinformatics 2012, 28(20):2600-2607. PubMed Abstract | Publisher Full Text OpenURL

  7. Clackson T, Wells J: A hot spot of binding energy in a hormone-receptor interface.

    Science 1995, 267:383-386. PubMed Abstract | Publisher Full Text OpenURL

  8. Thorn KS, Bogan AA: ASEdb: a database of alanine mutations and their effects on the free energy of binding in protein interactions.

    Bioinformatics 2001, 17(3):284-285. PubMed Abstract | Publisher Full Text OpenURL

  9. Guerois R, Nielsen JE, Serrano L: Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations.

    J Mol Biol 2002, 320(2):369-387. PubMed Abstract | Publisher Full Text OpenURL

  10. Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L: The FoldX web server: an online force field.

    Nucl Acids Res 2005, 33(Web Server issue):W382-W388. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  11. Kortemme T, Baker D: A simple physical model for binding energy hot spots in protein-protein complexes.

    Proc Natl Acad Sci USA 2002, 99(22):14116-14121. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  12. Kortemme T, Kim DE, Baker D: Computational alanine scanning of protein-protein interfaces.

    Sci STKE 2004, 2004(219):pl2. PubMed Abstract | Publisher Full Text OpenURL

  13. Benedix A, Becker CM, de Groot BL, Caflisch A, Bockmann RA: Predicting free energy changes using structural ensembles.

    Nat Methods 2009, 6:3-4. PubMed Abstract | Publisher Full Text OpenURL

  14. Cho KI, Kim D, Lee D: A feature-based approach to modeling protein-protein interaction hot spots.

    Nucl Acids Res 2009, 37(8):2672-2687. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  15. Chen P, Li J, Wong L, Kuwahara H, Huang J, Gao X: Accurate prediction of hot spot residues through physicochemical characteristics of amino acid sequences.

    Proteins: Struct Funct Bioinformatics 2013, 81(8):1351-1362. Publisher Full Text OpenURL

  16. Ofran Y, Rost B: Protein-protein interaction hotspots carved into sequences.

    PLoS Comput Biol 2007, 3(7):e119. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  17. Grosdidier S, Recio JF: Identification of hot-spot residues in protein-protein interactions by computational docking.

    BMC Bioinformatics 2008, 9:447. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  18. Zhu X, Ericksen SS, Demerdash ONA, Mitchell JC: Data-driven models for protein interaction and design.

    Proteins: Struct Funct Bioinformatics 2013, 81(12):2221-2228. Publisher Full Text OpenURL

  19. Guney E, Tuncbag N, Keskin O, Gürsoy A: HotSprint: database of computational hot spots in protein interfaces.

    Nucl Acids Res 2008, 36(Database-Issue):662-666. OpenURL

  20. Tuncbag N, Gursoy A, Keskin O: Identification of computational hot spots in protein interfaces: combining solvent accessibility and inter-residue potentials improves the accuracy.

    Bioinformatics 2009, 25(12):1513-1520. PubMed Abstract | Publisher Full Text OpenURL

  21. Assi SA, Tanaka T, Rabbitts TH, Fernandez-Fuentes N: PCRPi: Presaging Critical Residues in Protein interfaces, a new computational tool to chart hot spots in protein interfaces.

    Nucl Acids Res 2010, 38(6):e86. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  22. Pallara C, Jimenez-Garcia B, Perez-Cano L, Romero-Durana M, Solernou A, Grosdidier S, Pons C, Moal IH, Fernandez-Recio J: Expanding the frontiers of protein-protein modeling: from docking and scoring to binding affinity predictions and other challenges.

    Proteins: Struct Funct Bioinformatics 2013, 81(12):2192-2200. Publisher Full Text OpenURL

  23. Deng L, Guan J, Wei X, Yi Y, Zhou S: Boosting prediction performance of protein-protein interaction hot spots by using structural neighborhood properties. In Research in Computational Molecular Biology, Volume 7821 of Lecture Notes in Computer Science. Edited by Deng M, Jiang R, Sun F, Zhang X. Springer: Berlin, Heidelberg; 2013:333-344. OpenURL

  24. Wang L, Liu ZP, Zhang XS, Chen L: Prediction of hot spots in protein interfaces using a random forest model with hybrid features.

    Protein Eng Design Sel 2012, 25(3):119-126. Publisher Full Text OpenURL

  25. Dehouck Y, Kwasigroch JM, Rooman M, Gilis D: BeAtMuSiC: prediction of changes in protein-protein binding affinity on mutations.

    Nucleic Acids Res 2013, 41(Web Server issue):W333-W339. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  26. Moretti R, Fleishman SJ, Agius R, Torchala M, Bates PA, Kastritis PL, Rodrigues JPGLM, Trellet M, Bonvin AMJJ, Cui M, Rooman M, Gillis D, Dehouck Y, Moal I, Romero-Durana M, Perez-Cano L, Pallara C, Jimenez B, Fernandez-Recio J, Flores S, Pacella M, Praneeth Kilambi K, Gray JJ, Popov P, Grudinin S, Esquivel-Rodriguez J, Kihara D, Zhao N, Korkin D, Zhu X, et al.: Community-wide evaluation of methods for predicting the effect of mutations on protein-protein interactions.

    Proteins: Struct Funct Bioinformatics 2013, 81(11):1980-1987. Publisher Full Text OpenURL

  27. Kirkpatrick DG, Radke JD: A framework for computational morphology.

    Computational Geometry, Machine Intelligence and Pattern Recognition 1985, 217-248. OpenURL

  28. Liu Q, Kwoh CK, Hoi SCH: Beta atomic contacts: identifying critical specific contacts in protein binding interfaces.

    PLoS ONE 2013, 8(4):e59737. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  29. Liu Q, Kwoh CK, Li J: Binding affinity prediction for protein-ligand complexes based onβ contacts and B factor.

    J Chem Inf Model 2013, 53(11):3076-3085. PubMed Abstract | Publisher Full Text OpenURL

  30. Hubbard SJ, Thornton JM: ‘NACCESS’, computer program. Tech. rep., Department of Biochemistry Molecular Biology, University College London, 1993

  31. Barber BC, Dobkin DP, Huhdanpaa H: The Quickhull algorithm for convex hulls.

    ACM Trans Math Softw 1996, 22(4):469-483. Publisher Full Text OpenURL

  32. Mintseris J, Weng Z: Atomic contact vectors in protein-protein recognition.

    Proteins 2003, 53(3):629-639. PubMed Abstract | Publisher Full Text OpenURL

  33. Li J, Liu Q: ‘Double water exclusion’: a hypothesis refining the O-ring theory for the hot spots at protein interfaces.

    Bioinformatics 2009, 25(6):743-750. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  34. Martins JM, Ramos RM, Pimenta AC, Moreira IS: Solvent-accessible surface area: how well can be applied to hot-spot detection?

    Proteins: Struct Funct Bioinformatics 2013, 82(3):479-490. OpenURL

  35. Xia JF, Zhao XM, Song J, Huang DS: APIS: accurate prediction of hot spots in protein interfaces by combining protrusion index with solvent accessibility.

    BMC Bioinformatics 2010, 11:174. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  36. Zhu X, Mitchell JC: KFC2: a knowledge-based hot spot prediction method based on interface solvation, atomic density and plasticity features.

    Proteins 2011, 79(9):2671-2683. PubMed Abstract | Publisher Full Text OpenURL