Protein-protein binding site identification by enumerating the configurations

Guo, Fei; Li, Shuai Cheng; Wang, Lusheng; Zhu, Daming

doi:10.1186/1471-2105-13-158

Research article
Open access
Published: 06 July 2012


Fei Guo¹,²,
Shuai Cheng Li²,
Lusheng Wang² &
…
Daming Zhu¹

BMC Bioinformatics volume 13, Article number: 158 (2012)


Abstract

Background

The ability to predict protein-protein binding sites has a wide range of applications, including signal transduction studies, de novo drug design, structure identification and comparison of functional sites. The interface in a complex involves two structurally matched protein subunits, and the binding sites can be predicted by identifying structural matches at protein surfaces.

Results

We propose a method which enumerates “all” the configurations (or poses) between two proteins (3D coordinates of the two subunits in a complex) and evaluates each configuration by the interaction between its components using the Atomic Contact Energy function. The enumeration is achieved efficiently by exploring a set of rigid transformations. Our approach incorporates a surface identification technique and a method for avoiding clashes of two subunits when computing rigid transformations. When the optimal transformations according to the Atomic Contact Energy function are identified, the corresponding binding sites are given as predictions. Our results show that this approach consistently performs better than other methods in binding site identification.

Conclusions

Our method achieved a success rate higher than other methods, with the prediction quality improved in terms of both accuracy and coverage. Moreover, our method is able to predict the configurations of two binding proteins, whereas most other methods predict only the binding sites. The software package is available at http://sites.google.com/site/guofeics/dobi for non-commercial use.

Background

Most of the existing efforts to identify the binding sites in protein-protein interaction are based on analyzing the differences between interface residues and non-interface residues, often through the use of machine learning or statistical methods. These methods differ in the features analyzed, that is, the sequence and structural or physical attributes. Chung et al.[1] used multiple structure alignments of the individual components in known complexes to derive structurally conserved residues. Sequence profile and accessible surface area information are combined with the conservation score to predict protein-protein binding sites by using a Support Vector Machine. Ofran et al.[2] employed neural networks to predict binding sites, using the sequence environment, the profile and the structural features as input. The random forest algorithm has also been applied to features derived from sequences or 3D structures for binding site prediction [3, 4]. PSIVER [5] uses sequence features to train a Naïve Bayes classifier to predict binding sites. In PSIVER, the conditional probabilities of each sequence feature are estimated using a kernel density estimation method.

Besides the machine learning and statistical approaches, 3D structural algorithms and other methods have also been used to identify binding sites by investigating protein surface structures. ProBiS [6] predicts binding sites by local surface structure alignment. It compares the query protein to 3D protein structures in a database to detect proteins with structurally similar sites on the surfaces. Burgoyne et al.[7] analyzed clefts in protein surfaces that are likely to correspond to the binding sites. They ranked them according to sequence conservation and simple measures of physical properties including hydrophobicity, desolvation, electrostatic and van der Waals potentials. Ortuso et al.[8] defined the most relevant interaction areas in complexes by deriving pharmacophore models from 3D structure information. Their approach is based on 3D maps computed by the GRID program on structurally known molecular complexes.

ProMate [9] is based on the idea of interface and non-interface circles. A circle is first created around each residue. Then, features are extracted from these circles. Statistics are performed and histograms are created for each feature. Thereafter, the probability for each circle of a test protein to be an interface is estimated. The interface circles are clustered for each test protein to identify the binding patch.

Bradford et al.[10] proposed an approach (PPI-Pred) which uses SVM (Support Vector Machine) on surface patch features to predict binding sites. PPI-Pred generates an interacting patch and a non-interacting patch for each protein. Seven features are extracted for each patch to build an SVM model, which is then used to predict if a given test patch is an interacting patch.

In PINUP [11], an empirical scoring function is presented to predict binding sites. The function is a linear combination of energy score, interface propensity and residue conservation score. A patch is formed by a residue and its spatial neighbors within the protein subunit. PINUP takes the top 5% scoring patches and ranks residues based on their occurrences in these patches. The top 15 ranked residues are predicted as the interface residues.

Li et al.[12] proposed another SVM approach (core-SVM). The residues of the proteins are divided into four classes: the interior residues, the core interface residues, the rim interface residues, and the non-interface residues. The core interface and rim interface residues are distinguished by the percentage of their neighboring residues which are interface residues. An SVM is built over eight features extracted from the interface residues, and used to compute the probability of whether a residue is a core interface residue.

Meta-servers have also been constructed to combine the strengths of existing approaches. The program called meta-PPISP [13] combines three individual servers, namely cons-PPISP, ProMate and PINUP; another program called metaPPI [14] combines five prediction methods, namely PPI-Pred, PINUP, PPISP, ProMate, and SPPIDER [15].

Another approach to binding site prediction is to examine the possible structural configurations of protein subunits, also referred to as poses, that is, how the subunits may dock. Docking methods based on the fast Fourier transform (FFT) [16, 17], geometric surface matching [18], as well as intermolecular energy [19–21] have been proposed. Fernández-Recio et al.[22] simulated protein docking and analyzed the interaction energy landscapes. Their method uses a global docking procedure based on multi-start global energy optimization of the ligand. It explores the conformational space around the whole receptor, and uses the rigid-body docking configurations to project the docking energy landscapes onto the surfaces. The low-energy regions are predicted as the binding sites.

In this paper, we propose a method which enumerates the configurations of two binding proteins (that is, the possible positions of the two subunits in a complex), and identifies binding sites by evaluating the interaction between the components using the Atomic Contact Energy (ACE) function [23]. We perform rigid transformations to enumerate the configurations of two binding proteins. The enumeration is performed in conjunction with a surface identification technique and a method for avoiding clashes between protein subunits when computing rigid transformations. The transformations which result in the minimum score according to the Atomic Contact Energy function are found; the corresponding interacting residues are reported as binding sites. Our method is implemented in a program called DoBi^a.

We perform experiments to compare DoBi with the existing methods using commonly used assessment measures. The program outperforms the other methods on these measures. DoBi achieved a success rate higher than all the other methods, improving prediction quality in terms of both accuracy and coverage. In addition, it predicts the configurations of two binding proteins, as opposed to giving only the binding sites.

Methods

The main idea of our method is to enumerate “all” configurations between two proteins, where a configuration refers to the 3D coordinates representing the relative position and orientation of two protein subunits in a complex. We use the Atomic Contact Energy (ACE) function to compute the score for a configuration. The configurations with the lowest score are chosen, and the corresponding interacting residues are predicted as binding sites. We use rigid transformations to enumerate the configurations. The key techniques required here are (1) an efficient algorithm to enumerate “all” configurations (rigid transformations) and (2) a good energy score.

Atomic contact energy

Atomic Contact Energy (ACE) is an atomic desolvation energy measure developed in [24]. It is defined over the energy of replacing a protein-atom/water contact, with a protein-atom/protein-atom contact. The ACE score takes into account 18 atom types, hence resulting in 18×18 possible atom pairs. The score for each atom pair has been determined, based on a statistical analysis of atom-pairing frequencies in known proteins. These pre-determined scores are given as log likelihood values in [24], thus allowing the summation of these values. The pre-determined score of effective contact energy between atom type i and type j is defined as

$$T[i,j] = -\ln \frac{N_{i,j} / C_{i,j}}{(N_{i,0} / C_{i,0}) \times (N_{j,0} / C_{j,0})} \qquad (1)$$

where type 0 corresponds to the solvent. The number of i-j contacts (N_i,j) and the number of i-0 contacts (N_i,0) are estimates of the actual contact numbers of known complexes. In addition, C_i,j and C_i,0 are defined as the expected numbers of i-j contacts and i-0 contacts. The quantities N_j,0 and C_j,0 are defined in the same way for atom type j.
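As an illustration of Equation (1), the following minimal sketch computes a single table entry from contact counts. The function name `contact_energy` and the numerical counts are hypothetical and chosen only for the example; they are not values from the ACE table.

```python
import math

def contact_energy(n_ij, c_ij, n_i0, c_i0, n_j0, c_j0):
    """Evaluate T[i, j] as in Equation (1) from observed (N) and expected (C) counts."""
    ratio = (n_ij / c_ij) / ((n_i0 / c_i0) * (n_j0 / c_j0))
    return -math.log(ratio)

# Made-up counts: i-j contacts are observed twice as often as expected,
# while contacts with the solvent (type 0) occur exactly as expected.
print(contact_energy(200.0, 100.0, 50.0, 50.0, 80.0, 80.0))
```

With these counts the ratio is 2, so the entry is negative, which corresponds to a favorable atom pair under the sign convention of Equation (1).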

For a given configuration, the ACE score is a summation of each of the atom pairs (one from each subunit) within threshold distance d, and d = 6Å is used in this paper. Denote the sets of atoms from the two subunits as S₁ and S₂, respectively, then the ACE is computed as

$$E_{ACE} = \sum_{s \in S_{1},\, t \in S_{2},\, \|s - t\| \leq d} T[s,t] \qquad (2)$$

where ||s − t|| is the Euclidean distance between s and t, and T[s,t] is the pre-determined score of the atom pair s and t.

The ACE score can be considered an estimate of the change in desolvation energy of the two proteins in going from the unbound state to the complex. A lower ACE value implies a lower (and hence more favorable) desolvation free energy.
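The summation in Equation (2) can be sketched as follows. This is a minimal illustration, assuming each atom is stored as a (type, coordinates) tuple; the two-type table `toy_table` contains made-up values and is not the published 18×18 ACE table.

```python
import math

def ace_score(atoms_1, atoms_2, table, d=6.0):
    """Sum T[type_s, type_t] over atom pairs (s in S1, t in S2) with ||s - t|| <= d.

    Each atom is a (type, (x, y, z)) tuple; `table` maps a (type, type)
    pair to its pre-determined contact energy.
    """
    total = 0.0
    for type_s, pos_s in atoms_1:
        for type_t, pos_t in atoms_2:
            if math.dist(pos_s, pos_t) <= d:
                total += table[(type_s, type_t)]
    return total

# Toy table with made-up values (an assumption for the example only).
toy_table = {("C", "C"): -0.5, ("C", "N"): 0.2, ("N", "C"): 0.2, ("N", "N"): 0.1}
subunit_1 = [("C", (0.0, 0.0, 0.0)), ("N", (10.0, 0.0, 0.0))]
subunit_2 = [("C", (3.0, 0.0, 0.0)), ("N", (5.0, 0.0, 0.0))]
print(ace_score(subunit_1, subunit_2, toy_table))
```

In this toy case three of the four atom pairs lie within 6Å, so only those three contribute to the sum. The double loop is quadratic in the number of atoms; a spatial grid or neighbor list would be the usual way to speed this up.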

Enumeration of the configurations

In this paper, we assume that subunits are rigid. A protein structure consists of a sequence of residues. Each residue consists of a set of atoms. We assume that the atoms in a residue are ordered as a sequence. Hence, the whole protein structure can be represented by a sequence of atoms. In the rest of this subsection, we let A and B denote two protein structures (subunits), and write A = (a₁,a₂,…,a_m) and B = (b₁,b₂,…,b_n), where a_i and b_j are atoms of structures A and B, respectively. Without loss of generality, we assume that n ≥ m. We also assume that we know the 3D coordinates of each atom in both input proteins. We use A[i:j] to denote the subsequence (a_i,…,a_j), and refer to a subsequence of atoms as a structural fragment.

To enumerate all the configurations, we assume B is fixed, and we perform rotations and translations (referred to as rigid transformations, and simply, transformations, in the rest of the paper) on A. The method proposed here is modified from the algorithms for structure comparison [25].

Assume that two points a_i and a_j of A interact with two points b_i′ and b_j′ of B; then we know that ||a_i − b_i′|| ≤ d and ||a_j − b_j′|| ≤ d. To enumerate the configurations, we enumerate the positions for atoms a_i and a_j first, and for each fixed pair of positions of a_i and a_j, we rotate A about the line through a_i and a_j. Let the d-ball of an atom a be the ball with radius d centered at a. We discretize the d-ball of b_i′ with step size εd, where ε is a small constant (we choose ε = 0.1 in this paper). Each grid point in the d-ball of b_i′ is used as a candidate position for atom a_i for the binding. When a_i is fixed at one of the grid points, the possible positions for a_j form a spherical cap, where the sphere is centered at a_i with radius ||a_i − a_j||, and the cap is the portion of the sphere enclosed in the d-ball of b_j′. Again, we discretize the spherical cap with step size εd. Each grid point on the spherical cap is a candidate position for a_j. Since the d-ball contains $O((1/ε)^{3})$ grid points and the cap contains $O((1/ε)^{2})$ grid points, this gives us a total of $O((1/ε)^{5})$ possible positions for the pair a_i and a_j. After a_i and a_j are fixed on their respective grid points, the only degree of freedom to move A[i,j] is to rotate it around the axis through a_i and a_j. We use a 1° step size; that is, we explore 360 different positions for the remaining atoms through 360 rotations. Figure 1 illustrates the steps to compute a transformation.
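The final rotation step can be sketched as below. This is only an illustration of rotating atoms about the axis through the two anchor positions in 1° steps, using Rodrigues' rotation formula, which is a choice made for this sketch rather than something stated in the text. The function names `rotate_about_axis` and `enumerate_rotations` are hypothetical, and the placement of a_i and a_j on grid points is assumed to have been done already.

```python
import math

def rotate_about_axis(point, axis_start, axis_end, angle_deg):
    """Rotate `point` by `angle_deg` around the line through axis_start and axis_end."""
    # Unit vector k along the rotation axis.
    k = [e - s for s, e in zip(axis_start, axis_end)]
    norm = math.sqrt(sum(c * c for c in k))
    k = [c / norm for c in k]
    # Coordinates of the point relative to a point on the axis.
    v = [p - s for p, s in zip(point, axis_start)]
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    k_dot_v = sum(a * b for a, b in zip(k, v))
    k_cross_v = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    # Rodrigues' formula: v cos(t) + (k x v) sin(t) + k (k . v)(1 - cos(t)).
    rotated = [
        v[i] * cos_t + k_cross_v[i] * sin_t + k[i] * k_dot_v * (1.0 - cos_t)
        for i in range(3)
    ]
    return tuple(r + s for r, s in zip(rotated, axis_start))

def enumerate_rotations(atoms, a_i, a_j, step_deg=1):
    """Yield one rotated copy of `atoms` per angular step around the a_i-a_j axis."""
    for angle in range(0, 360, step_deg):
        yield [rotate_about_axis(p, a_i, a_j, angle) for p in atoms]
```

With the default step, `enumerate_rotations` yields 360 poses. Atoms lying on the axis, including a_i and a_j themselves, are left unchanged by every rotation, so the two anchor constraints stay satisfied.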

The method works well if we know two interaction pairs (a_i,b_i′) and (a_j,b_j′). We could simply enumerate all atom pairs as interaction pair candidates. However, there are O(n⁴) such cases, which makes the computer program too slow in practice. This is perhaps one of the reasons that such a method has not been tried. The focus of the following subsection is to identify two pairs (a_i,b_i′) and (a_j,b_j′) which are more likely to be interaction pairs.

When enumerating “all” configurations, we also want to make sure that (1) only surface fragments can be candidate binding sites for a configuration and (2) there is no clash between the two proteins in such a configuration. Before presenting the details of the method, we define the surface atoms and clashes of two subunits first.

Surface atoms

The interface residues of two proteins are necessarily surface residues. Inspired by the work in LIGSITE^csc [26, 27], we propose a method to identify the surface atoms of a protein.

First, we build a 3D grid with step size 1Å around the protein. Then, each grid point is labeled as a protein point if it is within distance 2Å of any atom, and labeled as empty otherwise. We further subdivide the protein grid points into two types: interior or surface. A protein grid point is labeled as surface if at least one of its six neighboring grid points is empty, otherwise it is labeled as interior. With the grid points labeled, we can label the atoms. An atom is labeled as a surface atom if it is within distance 1.5Å of a surface grid point, otherwise it is labeled as an interior atom.

Figure 2 gives an example in 2D, where a protein grid point is labeled as interior if it has all four neighbors as protein points. In 3D, a protein grid point should be labeled as interior if all of its six neighbors are labeled as protein.
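The labeling procedure can be sketched as follows. This is a minimal illustration, assuming the 1Å grid is aligned with integer coordinates and that "within distance" means less than or equal to the cutoff; the function names `label_grid` and `surface_atoms` are hypothetical.

```python
import math
from itertools import product

# Offsets to the six axis-aligned neighbors of a grid point.
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def label_grid(atoms, protein_radius=2.0):
    """Return (surface, interior) sets of integer grid points (1 A spacing)."""
    reach = math.ceil(protein_radius) + 1
    protein = set()
    for atom in atoms:
        base = [round(c) for c in atom]
        # Only grid points near the atom can lie within protein_radius of it.
        for offset in product(range(-reach, reach + 1), repeat=3):
            point = tuple(b + o for b, o in zip(base, offset))
            if math.dist(point, atom) <= protein_radius:
                protein.add(point)
    # A protein point is a surface point if any of its six neighbors is empty.
    surface = {
        p for p in protein
        if any(tuple(c + n for c, n in zip(p, step)) not in protein
               for step in NEIGHBORS)
    }
    return surface, protein - surface

def surface_atoms(atoms, atom_cutoff=1.5):
    """Return the atoms lying within atom_cutoff of some surface grid point."""
    surface, _ = label_grid(atoms)
    return [a for a in atoms
            if any(math.dist(a, p) <= atom_cutoff for p in surface)]

# Toy "protein": a 5 x 5 x 5 cubic lattice of atoms (not a real structure).
lattice = [(float(x), float(y), float(z)) for x, y, z in product(range(5), repeat=3)]
exposed = surface_atoms(lattice)
print((0.0, 0.0, 0.0) in exposed, (2.0, 2.0, 2.0) in exposed)
```

In the toy lattice, a corner atom is labeled as a surface atom while the central atom is labeled as interior. The sketch scans all surface points for every atom, which is adequate for small inputs but would need a spatial index for full-size proteins.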

Clashes of two subunits

A valid configuration must not cause the two subunits to clash. The following method is used to detect whether a configuration results in a clash. Given a configuration, we build a 3D grid as in the previous subsection. For each of the structures A and B, we mark the grid points as interior, surface, or empty. We use a threshold θ to decide whether the two subunits clash, based on the proportion of interior points that they share. We say that the two subunits clash if they share more than θ × 100% of their interior points. That is, if X is the number of interior grid points which are shared by both proteins, and X_A and X_B are the numbers of interior grid points of each subunit, respectively, then the subunits do not clash only if X ≤ θ × min{X_A,X_B}.
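The clash criterion reduces to a set intersection once the interior grid points of both subunits are available. The sketch below assumes the interior points are given as sets of grid coordinates on a common grid; the function name `subunits_clash` is hypothetical, and the default θ = 0.17 is the value reported for the training set later in the paper.

```python
def subunits_clash(interior_a, interior_b, theta=0.17):
    """Return True if the shared interior points exceed theta * min(X_A, X_B)."""
    shared = len(interior_a & interior_b)          # X
    limit = theta * min(len(interior_a), len(interior_b))
    return shared > limit

# Toy grids: ten interior points per subunit along a line.
grid_a = {(x, 0, 0) for x in range(10)}
grid_b_far = {(x, 0, 0) for x in range(9, 19)}     # shares 1 point with grid_a
grid_b_near = {(x, 0, 0) for x in range(8, 18)}    # shares 2 points with grid_a
print(subunits_clash(grid_a, grid_b_far), subunits_clash(grid_a, grid_b_near))
```

With ten interior points per subunit the limit is 1.7 shared points, so one shared point is tolerated and two shared points count as a clash.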

Finding the two interaction pairs

In the following subsections, we present the details to explore the potential interaction pairs.

Identify candidate fragment pairs

We first select fragment pairs that are potential binding sites. As discussed in Section “Enumeration of the configurations”, there are O(n⁴) possible fragment pairs (a_i, a_i′) and (b_j, b_j′) for each binding site. To reduce the computational complexity, we adopt a local alignment algorithm to accelerate this selection. This is a rough estimate, and we hope that the actual binding sites are not discarded by this process.

We first use a heuristic to quickly discard fragment pairs that are unlikely to bind. The heuristic simplifies the problem as follows: (1) every atom is within the threshold value required in the ACE computation (that is, we ignore the geometry of the structure); (2) each atom interacts with at most one atom; (3) interacting pairs follow a sequential order. That is, for any two pairs of interacting atoms (a_i, b_i′) and (a_j, b_j′), we have either i < j and i′ < j′, or j < i and j′ < i′. With these three simplifications, the standard Smith-Waterman local alignment algorithm [28] can be employed, with the ACE scores used as the penalty (negation of the score) for alignment. We use a penalty of 1 for aligning an atom to a space. Each locally aligned segment gives us two fragments, where each atom in the fragment is either aligned to another atom from the partner, or aligned to nothing (i.e., aligned to a space).

We present the details here. For two sequences P₁ and P₂, an alignment of P₁ and P₂ can be obtained by (1) inserting spaces into the two sequences P₁ and P₂ such that the two resulting sequences with inserted spaces P′₁ and P′₂ have the same length and (2) overlaying the two resulting sequences P′₁ and P′₂. The score of the alignment is the sum of the scores for all the columns, where each column has a pair of letters (including spaces) and for each pair of letters there is a pre-defined score. A subsequence α of P₁ and a subsequence β of P₂ form a locally aligned segment such that the score between α and β is minimum. Here we want to find all (non-overlapping) pairs of subsequences with a score of at most x. For our purpose, we set x = 0 throughout the paper.
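The alignment step can be illustrated with a minimizing variant of the Smith-Waterman recurrence. The sketch below is a simplification under stated assumptions: it returns only the single best locally aligned segment rather than all non-overlapping segments with score at most x, and the function name `best_local_alignment` and the two-type table `toy_penalties` are hypothetical, with made-up values in place of the ACE table.

```python
def best_local_alignment(types_a, types_b, table, gap_penalty=1.0):
    """Return (score, (end_i, end_j)) of the minimum-score local alignment.

    h[i][j] is the lowest score of any local alignment ending at atom i of A
    and atom j of B (1-based); a score of 0 means "start a new segment here".
    """
    rows, cols = len(types_a) + 1, len(types_b) + 1
    h = [[0.0] * cols for _ in range(rows)]
    best, best_end = 0.0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            h[i][j] = min(
                0.0,                                                   # restart
                h[i - 1][j - 1] + table[(types_a[i - 1], types_b[j - 1])],  # pair atoms
                h[i - 1][j] + gap_penalty,                             # atom of A to space
                h[i][j - 1] + gap_penalty,                             # atom of B to space
            )
            if h[i][j] < best:
                best, best_end = h[i][j], (i, j)
    return best, best_end

# Toy penalties (made up): only C-C contacts are favorable.
toy_penalties = {("C", "C"): -1.0, ("C", "N"): 0.5, ("N", "C"): 0.5, ("N", "N"): 0.5}
print(best_local_alignment(list("NCCN"), list("CC"), toy_penalties))
```

In this toy case the best segment pairs the two C atoms of each sequence, and the reported end position is the last aligned atom of each sequence. Recovering the fragments themselves would require a standard traceback from that end position, which is omitted here for brevity.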

Due to the simplifications, there are many false positive results, and some of the true interaction pairs may be filtered out. The latter issue can be handled to some extent by raising the threshold. The former issue is tackled by further refinement in the next subsection. In practice, our program outputs 70 to 120 fragment pairs as potential binding sites, which is much smaller than O(n⁴), where the number of atoms n in a protein ranges from 500 to a few thousand.

Since a binding site is necessarily on the surface of a subunit, we filter out fragments with only very few atoms on the surface. To achieve this, we use a sliding window of length 15 to parse the aligned fragment pair. For each window, if the surface atoms are at least 2/3 (that is, ten atoms) for both fragments, the fragment pair of this window is kept for further processing and this fragment pair is extracted from the alignment. We continue this process on the un-extracted portion of the alignment. If the window does not contain sufficient surface atoms, we continue at the next window. Our choice of 2/3 comes from observations with a docking decoy set from the Dockground [29], where 94% of the binding sites have more than 2/3 of surface atoms.
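The window-based filter can be sketched as follows. The sketch assumes that each aligned fragment is given as a list of booleans, one per alignment column, marking whether the atom in that column is a surface atom (with spaces counted as non-surface), and that "the next window" means a shift by one column. Both choices, and the function name `extract_surface_windows`, are assumptions of this example.

```python
def extract_surface_windows(surface_a, surface_b, window=15, min_surface=10):
    """Return (start, end) column ranges whose windows pass the surface test."""
    kept = []
    start = 0
    while start + window <= len(surface_a):
        end = start + window
        enough_a = sum(surface_a[start:end]) >= min_surface
        enough_b = sum(surface_b[start:end]) >= min_surface
        if enough_a and enough_b:
            kept.append((start, end))   # extract this window
            start = end                 # continue on the un-extracted portion
        else:
            start += 1                  # try the next window
    return kept

# Toy alignment of 26 columns: the first six atoms of A are buried.
flags_a = [False] * 6 + [True] * 20
flags_b = [True] * 26
print(extract_surface_windows(flags_a, flags_b))
```

Here the window starting at column 0 contains only nine surface atoms of A, so the filter moves on by one column and keeps the window starting at column 1.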

Identify configurations of fragment pairs

From the fragment pairs obtained in the previous step, a second step is used to further filter out fragment pairs that cannot reach an ACE score below a threshold. Given two structural fragments A[i,j] = (a_i,…,a_j) and B[i′,j′] = (b_i′,…,b_j′), we assume that a_i interacts with b_i′, and a_j interacts with b_j′. Using the enumeration method described earlier, we enumerate different configurations for A and B and compute the corresponding ACE score for the atom sets A[i,j] and B[i′,j′]. We do not consider any configuration which causes A and B to clash. In this step, a pair of structural fragments which does not give any configuration with an ACE score below a specified threshold is discarded. In this paper, we define the threshold value as 400, since the ACE scores of the actual interfaces in the docking decoy set from Dockground are all less than 400. After this step, it is unlikely for two protein structures which cannot be bound to have an unfiltered fragment pair.

Identify the configuration for the two subunits

In the third step, for each pair of protein structures with at least one remaining fragment pair, we enumerate all the potential configurations for the structures. We use the first and last atoms of the identified fragments for our choice of (a_i, b_i′) and (a_j, b_j′) in the enumeration, since these are the atoms that are likely to be interacting. Assuming that there are k fragment pairs from the same two proteins left after the filtration of the second step, we will have a maximum of 2k distinct atom pairs to choose from. Thus, there is a total of at most $\binom{2k}{2}$ combinations to consider for the choice of (a_i, b_i′) and (a_j, b_j′).
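The counting argument can be made concrete with a short sketch. It assumes each remaining fragment pair is stored as ((i, j), (i′, j′)), the index ranges of A[i,j] and B[i′,j′]; the function name `anchor_pair_choices` and the index values are hypothetical.

```python
from itertools import combinations

def anchor_pair_choices(fragment_pairs):
    """List all unordered choices of two anchor atom pairs from fragment endpoints."""
    atom_pairs = set()
    for (i, j), (i_prime, j_prime) in fragment_pairs:
        atom_pairs.add((i, i_prime))    # first atoms of the two fragments
        atom_pairs.add((j, j_prime))    # last atoms of the two fragments
    return list(combinations(sorted(atom_pairs), 2))

# Three made-up fragment pairs (k = 3) give 2k = 6 atom pairs.
remaining = [((0, 14), (3, 17)), ((20, 34), (40, 54)), ((50, 64), (60, 74))]
print(len(anchor_pair_choices(remaining)))
```

For k = 3 this yields 15 candidate choices, which matches the bound of 2k choose 2. Fewer choices arise when different fragment pairs share an endpoint atom pair.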

When the best configuration is obtained, two residues, one from each subunit, are reported as the interface residues if they can be connected with a pair of atoms within distance 4.5Å. In our search for the best configuration, we also require the configurations to be free from clashes.
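Reporting the interface residues from the best configuration can be sketched as below. The sketch assumes each subunit is a dictionary mapping a residue identifier to the list of its atom coordinates, and that "within distance" means less than or equal to 4.5Å; the function name `interface_residues` and the toy coordinates are hypothetical.

```python
import math

def interface_residues(residues_a, residues_b, cutoff=4.5):
    """Return residue pairs (one per subunit) joined by an atom pair within cutoff."""
    pairs = []
    for id_a, atoms_a in residues_a.items():
        for id_b, atoms_b in residues_b.items():
            if any(math.dist(p, q) <= cutoff for p in atoms_a for q in atoms_b):
                pairs.append((id_a, id_b))
    return pairs

# Toy subunits with made-up coordinates (in Angstrom).
toy_a = {"A:10": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)], "A:11": [(0.0, 20.0, 0.0)]}
toy_b = {"B:5": [(5.0, 0.0, 0.0)], "B:6": [(30.0, 0.0, 0.0)]}
print(interface_residues(toy_a, toy_b))
```

Only the first residue of each toy subunit is reported, since its atoms are 3.5Å apart while every other atom pair is farther than 4.5Å.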

Results and discussion

Three commonly used measures are utilized to assess the performance of DoBi. Accuracy and Coverage are two common measures to assess the quality of the binding sites predicted by a method [11]. The accuracy of the predicted interface is the fraction of correctly predicted residues over the total number of predicted interface residues; the coverage of the predicted interface is the fraction of correctly predicted interface residues over the total number of actual interface residues. F-score ( $F = 2 \times \frac{Accuracy \times Coverage}{Accuracy + Coverage}$ ) is the harmonic mean of the accuracy and coverage, where an F-score reaches its best score at 1 and worst score at 0. Another common measure is success rate, which is defined in [9]. A reported result is claimed as a success if at least half of the predicted residues are actual interface residues; that is, the accuracy is no less than 50%. The success rate is the fraction of successfully predicted cases in the total number of predicted proteins.
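These measures can be computed directly from the sets of predicted and actual interface residues. The following minimal sketch uses hypothetical names (`evaluate_prediction`) and made-up residue numbers; returning 0 when a set is empty is a convention chosen for this example.

```python
def evaluate_prediction(predicted, actual):
    """Return (accuracy, coverage, f_score, success) for one predicted interface."""
    predicted, actual = set(predicted), set(actual)
    correct = len(predicted & actual)
    accuracy = correct / len(predicted) if predicted else 0.0
    coverage = correct / len(actual) if actual else 0.0
    if accuracy + coverage > 0:
        f_score = 2 * accuracy * coverage / (accuracy + coverage)
    else:
        f_score = 0.0
    success = accuracy >= 0.5           # at least half of the predictions are correct
    return accuracy, coverage, f_score, success

# Made-up residue numbers: 2 of 4 predicted residues are among 6 actual ones.
print(evaluate_prediction({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8}))
```

In this toy case the accuracy is 1/2 and the coverage is 1/3, so the prediction counts as a success even though two thirds of the actual interface is missed. The success rate over a benchmark is then the fraction of cases for which the last returned value is true.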

A protein complex may contain several subunits, and multiple binding sites. Each binding site in a protein complex consists of a pair of subunits. Two residues in a pair of subunits are called interface residues if any two atoms, one from each residue, interact. By interact, we mean that the distance between the two atoms is less than the sum of the van der Waals radii of the two atoms plus 1Å. The number of residues on the interface is referred to as the interface size.

Training set

We use the unbound protein structures from Dockground [29] as the training set to calculate the parameters of DoBi. The docking decoys from Dockground were generated by GRAMM-X scan. The GRAMM-X docking scan was used to generate 102 unbound-unbound complexes and 131 unbound-bound complexes. By excluding the proteins used in the comparison, 36 unbound-unbound complexes and 80 unbound-bound complexes can be used to calculate the value of the threshold θ. When we set θ = 0.17, the overall F-score of DoBi on the training set is 60.5%, which is the best score that DoBi achieves under different threshold values. The details on the training set are shown in Table 1.

Table 1 Details of DoBi on the training set
