Open Access Highly Accessed Methodology article

Predicting binding sites of hydrolase-inhibitor complexes by combining several methods

Taner Z Sen (1,2)*, Andrzej Kloczkowski (1), Robert L Jernigan (1,2,4), Changhui Yan (3,4), Vasant Honavar (1,3,4), Kai-Ming Ho (1,4,5), Cai-Zhuang Wang (4,5), Yungok Ihm (4,5), Haibo Cao (4,5), Xun Gu (1,4,6) and Drena Dobbs (1,4,6)

Author Affiliations

1 L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA 50011, USA

2 Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA 50011, USA

3 Department of Computer Science, Iowa State University, Ames, IA 50011, USA

4 Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA

5 Department of Physics and Astronomy, Iowa State University, Ames, IA 50011, USA

6 Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA

BMC Bioinformatics 2004, 5:205  doi:10.1186/1471-2105-5-205


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/5/205


Received: 23 September 2004
Accepted: 17 December 2004
Published: 17 December 2004

© 2004 Sen et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Protein-protein interactions play a critical role in protein function. Completion of many genomes is being followed rapidly by major efforts to identify interacting protein pairs experimentally in order to decipher the networks of interacting, coordinated-in-action proteins. Identification of protein-protein interaction sites and detection of specific amino acids that contribute to the specificity and the strength of protein interactions is an important problem with broad applications ranging from rational drug design to the analysis of metabolic and signal transduction networks.

Results

To increase the power of predictive methods for protein-protein interaction sites, we have developed a consensus methodology that combines four different methods: data mining using Support Vector Machines, threading through protein structures, prediction of conserved residues on the protein surface by analysis of phylogenetic trees, and the Conservatism of Conservatism method of Mirny and Shakhnovich. Results obtained on a dataset of hydrolase-inhibitor complexes demonstrate that the combination of all four methods yields improved predictions over the individual methods.

Conclusions

We developed a consensus method for predicting protein-protein interface residues by combining sequence and structure-based methods. The success of our consensus approach suggests that similar methodologies can be developed to improve prediction accuracies for other bioinformatic problems.

Background

Protein-protein interactions play a critical role in protein function. Completion of many genomes is being followed rapidly by major efforts to identify interacting protein pairs experimentally in order to decipher the networks of interacting, coordinated-in-action proteins. Identification of protein-protein interaction sites and detection of specific residues that contribute to the specificity and strength of protein interactions is an important problem [1-3] with broad applications ranging from rational drug design to the analysis of metabolic and signal transduction networks. Experimental detection of residues on protein-protein interaction surfaces can come either from determination of the structure of protein-protein complexes or from various functional assays. The ability to predict interface residues at protein binding sites using computational methods can be used to guide the design of such functional experiments and to enhance gene annotations by identifying specific protein interaction domains within genes at a finer level of detail than is currently possible.

Computational efforts to identify protein interaction surfaces [4-6] have been limited to date, and are needed because experimental determinations of protein structures and protein-protein complexes lag behind the number of known protein sequences. In particular, computational methods for identifying residues that participate in protein-protein interactions can be expected to assume an increasingly important role [4,5]. Based on the different characteristics of known protein-protein interaction sites [7], several methods have been proposed for predicting interface residues using a combination of sequence and structural information. These include methods based on the presence of "proline brackets" [8], patch analysis using a 6-parameter scoring function [9,10], analysis of the hydrophobicity distribution around a target residue [7,11], multiple sequence alignments [12-14], structure-based multimeric threading [15], and analysis of amino acid characteristics of spatial neighbors to a target residue using neural networks [16,17]. Our recent work has focused on prediction of interface residues by analyzing the sequence neighbors of a target residue using SVM and Bayesian classifiers [2,3].

There is an acute need for multi-faceted approaches that utilize available databases of protein sequences, structures, protein complexes, phylogenies, as well as other sources of information for the data-driven discovery of sequence and structural correlates of protein-protein interactions [4,5]. By exploiting available databases of protein complexes, the data-driven discovery of sequence and structural correlates for protein-protein interactions offers a potentially powerful approach.

Results and discussion

Here we use a dataset of 7 hydrolase-inhibitor complexes from the PDB, together with their sequence homologs. The application of our consensus method to other types of complexes, e.g., antibody-antigen complexes, is currently under study and will be published later. It should be noted, however, that prediction of binding sites for other types of protein complexes, especially those involved in cell signaling, is likely to be more difficult than for the hydrolase-inhibitor complexes.

Figure 1 shows an example of the consensus method prediction mapped on the structure of proteinase B from S. griseus in a complex with turkey ovomucoid inhibitor (PDB 3sgb [18]). The inhibitor (3sgb_I) is shown at the top in wire frame and the proteinase B chain (3sgb_E) is shown at the bottom. Actual interface residues in the proteinase B chain, i.e., amino acids that form the binding site between proteinase B and the inhibitor, were extracted from the PDB structure (see Materials and Methods). Predicted interface and non-interface residues, identified by the consensus method, are shown as color-coded atoms as follows: Red spheres = true positives (TP), actual interface residues that are predicted as such; Gray strands = true negatives (TN), non-interface residues that are predicted as such; Yellow spheres = false negatives (FN), interface residues that are misclassified as non-interface residues; Blue spheres = false positives (FP), non-interface residues that are misclassified as interface residues. Note that the binding site in proteinase B is strongly indicated, with 14 out of 15 interface residues correctly classified, along with 2 false positives.

The primary amino acid sequence of the proteinase B chain and the interface residue prediction results for the four individual methods and the consensus method are shown in Figure 2. Actual interface residues are highlighted in red. The five lines below the amino acid sequence show the locations of interface residues predicted by the different methods (described in detail below): P = Phylogeny; C = Conservatism of Conservatism (CoC); S = Data mining by SVM; T = Threading; E = Consensus. Similar figures for each of the other proteins studied in this work are provided in Supplementary Materials [see Additional Files 1, 2, 3, 4, 5, and 6].

Figure 1. Interface residue predictions mapped on the three-dimensional structure of proteinase B from Streptomyces griseus (3sgb). The target protein is shown in ribbons and atomic spheres; the inhibitor partner is shown at the top in faint wire frame. The residues are color coded as: red = true positives (TP), gray = true negatives (TN), yellow = false negatives (FN), and blue = false positives (FP). Red, yellow, and blue residues are shown in spacefill representation. Note that the actual interface residues extracted from the PDB structure include the red (TP) and yellow (FN) residues. Red and gray residues represent correct predictions of interface and non-interface residues (14 TP + 210 TN = 224 correct predictions); yellow and blue residues represent incorrect predictions (1 FN + 2 FP = 3).

Figure 2. Comparison of individual methods for interface residue prediction with the consensus method. Results are shown for proteinase B from Streptomyces griseus (3sgb_E), the same protein shown in Figure 1. Actual interface residues are highlighted in red. Interface residues predicted by each of five different methods are indicated as follows: P = Phylogeny (none predicted for this protein); C = Conservatism of Conservatism; S = Support Vector Machine; T = Threading; and E = Consensus. Amino acid residues present in the protein sequence, but not included in the PDB structure file, are indicated by "X"s in the sequence.

Additional File 1. Comparison of individual methods for interface residue prediction for bovine α-chymotrypsin (1acb_E).

Additional File 2. Comparison of individual methods for interface residue prediction for porcine pancreatic trypsin (1avw_A).

Additional File 3. Comparison of individual methods for interface residue prediction for porcine pancreatic elastase (1fle_E).

Additional File 4. Comparison of individual methods for interface residue prediction for kallikrein (1hia_A).

Additional File 5. Comparison of individual methods for interface residue prediction for subtilisin BPN' (2sic_E).

Additional File 6. Comparison of individual methods for interface residue prediction for carboxypeptidase A (4cpa).

The prediction results for all methods are shown in Table 1 and Table 2. Table 1 shows a complete summary of the classification performance on the proteinase B chain for all 5 methods including the overall Sensitivity (Sen) and Specificity (Spec); Sensitivity (Sen+) and Specificity (Spec+) for interface residues (the "positive" class); and Correlation Coefficient (see Materials and Methods for definitions of these performance parameters). Table 2 shows the overall average performance results for all seven protein complexes studied in this work. Two kinds of averages are considered: the numerical average over each of 7 proteins in the dataset, i.e., the average on a "per protein" basis (<...>p); and the average over the total number of residues, i.e., the average on a "per residue" basis (<...>r).

Table 1. Classification results for Proteinase B from S. griseus (3sgb_E). TP is the number of true positives; TN is the number of true negatives; FP is the number of false positives; and FN is the number of false negatives. Overall sensitivity, overall specificity, sensitivity+, specificity+, and correlation coefficient are defined in the text.

Table 2. Overall classification performance results averaged over 7 proteins. Average results for sensitivity+, specificity+, overall sensitivity, overall specificity, and correlation coefficient for the 7 proteins in the dataset. <...>p denotes averaging over the total number of proteins; <...>r denotes averaging over the total number of residues.

Sequence and structure conservation

Amino acid sequences are conserved for many different reasons related to the structure and function of proteins: for stability [19,20], enzyme active sites, subunit interfaces, facilitation of an essential motion (hinges), and binding sites. Developing methods to identify the reason for conservation of individual highly conserved residues is a difficult problem. This is one of the reasons that a combination of approaches may be more likely to permit identification of residues that participate in protein-protein interactions. Even identifying the conserved residues themselves is not completely straightforward, and as will be seen, different approaches will indicate the same residue being conserved to different extents. In this study, we take advantage of this by using several methods to identify sequence and structure conservation. Here we use two principal methods for this purpose, one based on phylogeny to identify sequence conservation and one based on Conservatism of Conservatism [21] to identify structure conservation. These two methods often identify different residues as being conserved.

Phylogeny

To identify protein residues that are conserved, perhaps due to their functional role in forming specific protein-protein interactions, we use ClustalX [22] multiple sequence alignments of protein sequences to generate phylogenetic trees (see Materials and Methods). Conserved residues are defined as those that are identical at a given position in more than 85% of the aligned sequences, i.e., at most 15% substitutions or gaps are allowed. This 85% cutoff value was found to give optimal results (data not shown). Because phylogenetic trees of closely related sequences yield many residues that satisfy this condition (many of them located in the protein core, where high sequence conservation is apparently important for protein folding), we filter the results to focus on surface residues by removing conserved residues residing inside the protein core, i.e., those having low solvent accessibility (see Materials and Methods).
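The conservation filter described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name, the toy alignment, and the accessibility threshold of 25 (borrowed from the CoC filter described later in the text) are assumptions, and gaps are simply counted against identity.

```python
from collections import Counter

def conserved_surface_positions(alignment, rel_accessibility,
                                identity_cutoff=0.85, min_accessibility=25.0):
    """Return 0-based alignment columns that are conserved and solvent exposed.

    alignment: list of equal-length aligned sequences ('-' marks a gap).
    rel_accessibility: per-column relative accessibility of the target protein.
    min_accessibility is an assumed threshold, not a value given for this method.
    """
    n_seqs = len(alignment)
    selected = []
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        # Gaps never count toward identity, so they behave like substitutions.
        counts = Counter(res for res in column if res != "-")
        if not counts:
            continue
        top_fraction = counts.most_common(1)[0][1] / n_seqs
        if top_fraction > identity_cutoff and rel_accessibility[col] >= min_accessibility:
            selected.append(col)
    return selected

# Toy alignment: column 0 is fully conserved and exposed, column 1 is only
# 80% identical, column 2 is fully conserved but buried.
toy_alignment = ["AGL"] * 8 + ["ASL"] * 2
toy_accessibility = [40.0, 60.0, 5.0]
print(conserved_surface_positions(toy_alignment, toy_accessibility))
```

Only the first column passes both the identity cutoff and the surface filter in this toy case.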

As shown in Figure 2, the phylogenetic method does not classify any of the amino acids in the proteinase B chain (3sgb_E) as interface residues, i.e., TP = 0 and FP = 0. Thus, for the phylogenetic method prediction, the correlation coefficient (CC), which can range from -1 to +1, converges to zero, whereas overall specificity converges to 0.905. The latter statistic is misleading: it reflects the large number of negative examples (non-interface residues), which are correctly classified. In cases such as this (with unbalanced numbers of positive and negative examples), the sensitivity+ and specificity+ measures are especially useful because they more clearly reflect the ability of a method to detect "positive" interface residues (see the Methods section for definitions and further discussion of performance measures). Note that even though Figure 2 shows that the phylogenetic method does not identify any interface residues in this particular example, the results summarized in Table 2 for all seven proteins demonstrate the ability of the phylogenetic method to correctly predict non-interface residues (reflected in the high overall sensitivity and specificity values) and, in combination with other methods, to contribute to significantly improved predictions.
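To make the effect of class imbalance concrete, the sketch below computes positive-class measures from a confusion matrix. The definitions are assumptions, since the formal definitions are not reproduced in this section: sensitivity+ is taken as TP/(TP+FN), specificity+ as TP/(TP+FP), and the correlation coefficient as the Matthews correlation coefficient, with zero returned when the denominator vanishes. Values computed this way need not match Table 1 if the published definitions differ.

```python
import math

def interface_metrics(tp, tn, fp, fn):
    """Return (sensitivity+, specificity+, correlation coefficient).

    Assumed definitions: sensitivity+ = TP/(TP+FN), specificity+ = TP/(TP+FP),
    CC = Matthews correlation coefficient (0 when the denominator is 0).
    """
    sens_plus = tp / (tp + fn) if (tp + fn) else 0.0
    spec_plus = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens_plus, spec_plus, cc

# Consensus counts for proteinase B taken from the Figure 1 caption.
print(interface_metrics(tp=14, tn=210, fp=2, fn=1))

# A method that predicts no interface residues at all (TP = FP = 0),
# using the same 15 interface and 212 non-interface residues.
print(interface_metrics(tp=0, tn=212, fp=0, fn=15))
```

In the second call every positive-class measure is zero even though most residues are "correctly" labeled as non-interface, which is why the positive-class measures are more informative here.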

Conservatism of conservatism

To detect structurally conserved residues that are possible binding sites, we have used the Conservatism of Conservatism method (CoC) developed by Mirny and Shakhnovich [21]. We use structural alignments generated by FSSP (fold classification based on structure-structure alignment of proteins), developed by Holm and Sander [23], to identify protein families with folds similar to that of each of the 7 proteins. For each family, HSSP [24] (homology-derived secondary structure of proteins) alignments are used to calculate the sequence entropy at each position of the alignment. The HSSP profile is based on the multiple alignment of a sequence and its potential structural homologues [25]. The structural alignment generated by FSSP is used to calculate the value of CoC (see Materials and Methods). Each residue in the protein chain is ranked according to its CoC value at a given position in the sequence. The top 75% of total residues ranked according to their CoC values are defined as conserved. We filter the results of the CoC ranking by removing all structurally conserved residues located inside the protein core, keeping only the residues that have a relative accessibility of at least 25 as calculated by DSSP [26] (dictionary of protein secondary structure). Interface residues in proteinase B predicted by this method are indicated by a "C" in Figure 2. The overall performance of the CoC method is summarized in the second row of Tables 1 and 2. Although the correlation coefficient of the CoC method, 0.37, is in the same range as those obtained by phylogeny and support vector machines, its sensitivity+ value, 0.71, is surpassed only by the consensus value. Therefore, a larger fraction of interface residues is predicted by CoC than by the other three methods. However, the CoC method alone is not sufficient to successfully predict binding sites, and combining this method with other prediction techniques in the consensus method gives improved results (Tables 1 and 2).

Data mining for binding residues

We have generated a support vector machine (SVM) classifier to determine whether or not a surface residue is located in the interaction site using information about the sequence neighbors of a target residue. An 11-residue window consisting of the residue and its 10 sequence neighbors (5 on each side) was chosen empirically. Each amino acid in the 11-residue window is represented using 20 values obtained from the HSSP profile of the sequence. Each target residue is therefore associated with a 220 (11 × 20) element vector. The SVM learning algorithm is given a set of labeled examples of the form (X, Y), where X is the 220-element vector representing a target residue and Y is its corresponding class label, either interface or non-interface residue. The SVM algorithm generates a classifier that takes as input a 220-element vector encoding a target residue to be classified and outputs a class label. Our previous study [2] reported results for classifiers constructed using a combined set of 115 proteins belonging to six different categories of complexes: antibody-antigen; protease-inhibitor; enzyme complexes; large protease complexes; G-proteins, cell cycle, and signal transduction proteins; and miscellaneous. In another study [3], we trained separate classifiers for each major category of complexes (e.g., protease-inhibitor complexes). In the case of protease-inhibitor complexes, leave-one-out experiments were performed on a set of 19 proteins. In each experiment, an SVM classifier was trained using a set of surface residues, labeled as interface or non-interface, from 18 of the 19 proteins. The resulting classifier was used to classify the surface residues of the remaining target protein into interface residue and non-interface residue categories. The interface residues obtained for 3sgb_E are reproduced in Figure 2 and marked by "S". The performance of the SVM classifier for the current test set of complexes is summarized in Tables 1 and 2. The results show that SVM yields relatively high sensitivity+ (0.51) and specificity+ (0.41).
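The input encoding described above can be illustrated as follows. This is a sketch under stated assumptions: the function name is hypothetical, the profile is a toy stand-in for HSSP profile rows, and zero padding at the chain termini is an assumption because the text does not say how incomplete windows are handled.

```python
def window_features(profile, target, half_window=5):
    """Concatenate profile rows for a target residue and its sequence neighbors.

    profile: list with one row of 20 values per residue (e.g. from a profile).
    Positions that fall off either end of the chain are zero padded (assumed).
    """
    n_values = len(profile[0])
    vector = []
    for pos in range(target - half_window, target + half_window + 1):
        if 0 <= pos < len(profile):
            vector.extend(profile[pos])
        else:
            vector.extend([0.0] * n_values)
    return vector

# Toy profile for a 30-residue chain: residue i carries the value i + 1.
toy_profile = [[float(i + 1)] * 20 for i in range(30)]
features = window_features(toy_profile, target=10)
print(len(features))  # 11 residues x 20 values = 220
```

Each such 220-element vector, paired with an interface or non-interface label, would form one training example for the classifier.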

Threading of sequences through structures of interface surfaces

Structural threading was performed for the set of 7 protein complexes using a recently developed threading algorithm [27], which was first used in the CASP5 [28] competition. For each complex structure, we first extract the interfacial region, essentially as described earlier. Residue-residue contacts in the interfacial region are described with contact matrices. The total energy in this threading method is the sum of all pair-wise contact energies for the conformation. Detailed residue-level contact potentials were obtained from the Li, Tang and Wingreen [29] parameterization of the Miyazawa and Jernigan [30] matrix. We represent a protein sequence vector s by the hydrophobicity values of its amino acids hi obtained in this factorization and protein structure by the contact matrix Γ. The problem of finding the best alignment of a query sequence s with a structure having contact matrix Γ is to find the transformation from s to s' that optimizes the energy function. The optimum s' is the dominant eigenvector v0 of the contact matrix Γ. There is a strong correlation between a protein sequence and the dominant eigenvector of its native structure's contact matrix. Here the transformation we seek is obtained by maximizing the correlation between s' and v0. This is an alignment problem, and a dynamic programming method from sequence alignment has been adapted to solve this problem [27].
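The relationship between a sequence vector and the dominant eigenvector of a contact matrix can be illustrated with a small sketch. It is not the threading algorithm of [27]: the dynamic programming alignment step is omitted, the contact matrix and hydrophobicity values are toy numbers, and the function names are hypothetical. Power iteration on the matrix shifted by the identity is used here simply as one standard way to obtain the dominant eigenvector of a symmetric, non-negative matrix.

```python
import math

def dominant_eigenvector(contact, iterations=500):
    """Approximate the dominant eigenvector of a symmetric contact matrix."""
    n = len(contact)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # Multiply by (contact + I); the identity shift avoids oscillation
        # when the contact graph is bipartite.
        w = [v[i] + sum(contact[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy contact matrix: residue 0 contacts residues 1, 2 and 3.
toy_contacts = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
v0 = dominant_eigenvector(toy_contacts)
toy_hydrophobicity = [2.0, 1.0, 1.1, 0.9]  # hypothetical per-residue values
print(v0)
print(pearson(toy_hydrophobicity, v0))
```

In this toy case the most highly connected residue has the largest eigenvector component, and a sequence whose most hydrophobic residue sits at that position correlates strongly with the eigenvector.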

For each sequence, threading is performed against structures in our template database, and alignment results are used only when the score exceeds a length-dependent threshold. From the alignments, residues involved in contacts at the interface are identified using a scale based on the number of times a particular residue is indicated and the strength of the threading score. The binding sites predicted for 3sgb_E by the threading method are marked in Figure 2 by "T", and the prediction results are summarized in Tables 1 and 2. The threading-based approach is somewhat more successful than the other individual methods, based on its sensitivity+, specificity+, and correlation coefficient values, but its performance is still not as good as that obtained by combining it with the other methods in the consensus approach.

Consensus method for predicting protein binding sites

Based on the results from the predictions with the four independent methods, we have developed a simple consensus method to obtain a better prediction. In the consensus method results presented here, an amino acid is considered to be an interface residue if any of the following conditions are met:

i) at least three independent methods classify it as an "interface residue"

ii) any two methods (except the Phylogeny-Threading pair) predict it
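The two conditions above translate directly into a small voting rule. The sketch below is illustrative only; the function name and the dictionary-based input are assumptions, with the method codes P, C, S and T following the labels used in Figure 2.

```python
def consensus_interface(votes):
    """Apply the consensus rule to one residue.

    votes: dict mapping a method code ('P', 'C', 'S', 'T') to True if that
    method classifies the residue as an interface residue.
    """
    positive = {method for method, hit in votes.items() if hit}
    # Condition i: at least three methods agree.
    if len(positive) >= 3:
        return True
    # Condition ii: any two methods agree, except the Phylogeny-Threading pair.
    if len(positive) == 2 and positive != {"P", "T"}:
        return True
    return False

print(consensus_interface({"P": False, "C": True, "S": True, "T": False}))
print(consensus_interface({"P": True, "C": False, "S": False, "T": True}))
```

The first residue is accepted because CoC and SVM agree; the second is rejected because only the excluded Phylogeny-Threading pair predicts it.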

For this set of proteins, the parameters for combining results in the consensus method have been empirically determined without a systematic comparison of the strengths and weaknesses of each method. We employ this simple approach because it provides demonstrable improvement in prediction performance over the individual methods. The consensus interface residue predictions are indicated by an "E" in Figure 2, and performance results are summarized in the last rows of Tables 1 and 2. The consensus method generally results in an enhanced correlation coefficient and sensitivity+, demonstrating the superior performance of the consensus method for identifying interface residues in this protein set. Predictions for each protein, provided in Supplementary Materials [see Additional Files 1, 2, 3, 4, 5, and 6], illustrate that the improvements can be even more pronounced when the individual predictions of all four methods are relatively weak. This suggests that combining diverse prediction methods may be an excellent approach for the prediction of binding sites in protein complexes.

Conclusions

Each of the four prediction methods presented in this paper sheds a different light on the conservation and prediction of protein interaction sites, but none of the methods taken separately is as powerful as the combination of all four methods. The simple consensus approach presented here could perhaps be improved by generating an ensemble predictor with more detailed probabilities. Our current work is directed at this approach. It is clear that the present subject is an active field of research [31-38].

Methods

Dataset of hydrolase-inhibitor complexes

The dataset of 7 hydrolase-inhibitor complexes used in this work has been derived from a larger dataset of 70 protein heterocomplexes extracted from the PDB by Chakrabarti and Janin [39] and used in our previous studies [2,3]. All are proteins from hydrolase-inhibitor complexes, with six being proteinases: 1acb_E [40] (chain E of PDB structure 1acb), 1fle_E [41], 1hia_A [42], 1avw_A [43], 2sic_E [44], and 3sgb_E [18]; and one being a carboxypeptidase: 4cpa [45].

Definition of surface and interface residues

Surface and interface residues for the proteins were identified based on information in the PDB coordinate files as previously described [2,3]. Briefly, solvent accessible surface areas (ASA) for each residue in the unbound protein and in the complex are calculated using DSSP [26]. A surface residue is defined as an interface residue if its calculated ASA in the complex is less than that in the monomer by at least 1 Å² [46]. In the extraction of the interfacial region for threading, however, a distance-based definition of the interface is used: a surface residue is defined as an interface residue if its side-chain center is within 6.5 Å of the side-chain center of a residue belonging to another chain in the complex.
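The ASA-based definition amounts to a simple per-residue comparison. The sketch below assumes that the set of surface residues and the per-residue ASA values (for example, parsed from DSSP output) are already available; the function name and the toy numbers are hypothetical.

```python
def interface_by_asa(surface, asa_unbound, asa_complex, min_loss=1.0):
    """Return surface residues whose ASA drops by at least min_loss on binding.

    surface: iterable of residue identifiers already classified as surface.
    asa_unbound, asa_complex: dicts mapping residue identifier to ASA in A^2.
    """
    return [res for res in surface
            if asa_unbound[res] - asa_complex[res] >= min_loss]

# Toy values: residue 0 is buried by the partner, residue 1 loses only
# 0.5 A^2, residue 2 is unaffected by complex formation.
toy_unbound = {0: 50.0, 1: 30.0, 2: 80.0}
toy_complex = {0: 10.0, 1: 29.5, 2: 80.0}
print(interface_by_asa([0, 1, 2], toy_unbound, toy_complex))
```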

Based on the ASA definitions, 41% of the residues in the set of 7 proteins were surface residues, corresponding to a total of 631 surface residues. Among these surface residues, 166 were defined as interface residues and 465 as non-interface residues (i.e. surface residues that are not in the interaction sites). Thus, on average, interface residues represent 26% of surface residues, or 11% of total residues for proteins in our dataset.

Using phylogeny to identify conserved residues

Many computational tools have been developed for identifying amino acids that are important for protein function/structure, but there is no consensus regarding the best measure for evolutionary conservation [47]. Evolutionary conservation can be decomposed into three components: i) the overall selective constraints (the number of changes observed at a site); ii) the pattern of amino acid substitutions (the number of amino acid types observed at a site); and iii) the effect of amino acid usage. We have established a reliable relationship between each measure and various aspects of structure. To explore the connection between sequence conservation and functional-structural importance, we proposed a new measure that can decompose the conservation into these three components [47]. This measure is based on phylogenetic analysis. The evolutionary rate at site k during lineage l from amino acid i to j (i, j = 1, ..., 20) can be expressed as $\lambda_{kl}(i,j) = c_k \times a_{lk} \times Q(i,j|k)$, where $c_k$ accounts for the rate variation among sites, $a_{lk}$ for the site-specific lineage (or subtree) effect caused by functional divergence [48], and the 20 × 20 matrix $Q(i,j|k)$ is the (site-specific) model for amino acid substitutions. The likelihood function for a given tree can be determined according to a Markov chain model [49]. We have developed an integrated computer program (DIVERGE [50]) that can map these predicted sites onto the protein surface to examine these relationships. We use the solvent accessibility data from DSSP [26] to restrict predicted conserved residues to those located on the protein surface.

Conservatism of conservatism

The phylogeny-based conservation of residues relies on sequence homology. It is well known, however, that many non-homologous proteins share similar folds [51]. It is therefore highly desirable to study the conservation of residues in proteins based on the structural superimposition of non-homologous proteins. In order to obtain insight into the evolutionary conservation of residues in proteins, we use the Conservatism of Conservatism method (CoC). The CoC method was developed by Mirny and Shakhnovich [21] for studying evolutionary conservation of residues in proteins with specific folds from the FSSP database [23]. With the FSSP database, Mirny and Shakhnovich performed an analysis of conserved residues in several common folds. The 20 naturally occurring amino acids were subdivided into 6 different classes, based on their physicochemical characteristics and frequencies of occurrence at different positions in multiple sequence alignments. The evolutionary conservatism within families of homologous proteins was measured through sequence entropy. Structural superimposition of different families of proteins with similar folds was used to calculate CoC for all positions of residues within a fold. Here we have applied a similar approach to identify structurally conserved residues involved in protein interactions.

For each protein, we first calculate the sequence entropy at each position within a family of related sequences from the HSSP database [25]:

$$ s(l) = -\sum_{i=1}^{6} p_i(l) \log p_i(l) $$

where $p_i(l)$ is the frequency of class i of residues (for each of the six classes) at position l in the multiple sequence alignment. Then we use the FSSP database to obtain the structural alignment. The structural superimposition of different families was used to calculate the conservatism of conservatism (CoC):

$$ S(l) = \frac{1}{M} \sum_{m=1}^{M} s^{m}(l) $$

where $s^{m}(l)$ is the intrafamily conservatism within the family m at position l, and M is the number of families. The CoC is the measure of the evolutionary conservation of the specific sites within the protein fold. Because the CoC method does not distinguish between residues at the protein surface that are evolutionarily conserved for functional reasons and residues inside the protein core that are conserved because of their importance to the folding process, we use solvent accessibility data for the unbound molecules to eliminate those conserved residues located inside the protein core.
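A minimal sketch of the two-level calculation follows, assuming that the intrafamily conservatism is the Shannon entropy of residue-class frequencies at an aligned position and that CoC is its average over families. The six-class grouping used below is an assumption made for illustration, since this section does not list the classes, and the function names are hypothetical.

```python
import math
from collections import Counter

# Assumed six-class grouping of the 20 amino acids (illustrative only).
RESIDUE_CLASSES = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]
CLASS_OF = {res: idx for idx, group in enumerate(RESIDUE_CLASSES) for res in group}

def class_entropy(column):
    """Shannon entropy of residue classes in one alignment column."""
    counts = Counter(CLASS_OF[res] for res in column if res in CLASS_OF)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def conservatism_of_conservatism(family_columns):
    """Average the per-family entropies at one structurally aligned position.

    family_columns: one alignment column (string of residues) per family.
    """
    return sum(class_entropy(col) for col in family_columns) / len(family_columns)

# Toy position: family 1 has only aliphatic residues (entropy 0),
# family 2 is split evenly between two classes (entropy ln 2).
print(conservatism_of_conservatism(["AVLI", "AAKK"]))
```

Under these assumptions, a low value indicates a position that is conserved within every family sharing the fold.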

Data mining approaches to binding site identification

Recent advances in machine learning [52] or data mining [53] offer a valuable approach to the data-driven discovery of complex relationships in computational biology [54,55]. In essence, a data mining approach uses a representative data training set to extract complex a priori unknown relationships, e.g., sequence correlates of protein-protein interactions. Examination of the resulting classifiers can help generate specific hypotheses that can be pursued using molecular and biophysical methods. For example, a classifier that is able to identify protein-protein interface residues on the basis of sequence or structural features can provide insights about sequence characteristics that contribute to important differences in function. The data mining approach for binding site identification consists of the following steps:

• Identify the surface residues in each protein.

• Label each residue in each protein as either an interface residue or a non-interface residue based on appropriate criteria for defining residues in interaction sites.

• Use a machine learning algorithm to train and evaluate a classifier to categorize a target amino acid as either an interface or a non-interface residue. Different types of information about the target residue (e.g., the identity and physicochemical properties of its sequence neighbors, whether or not the target residue is a surface residue) can be supplied as input to the classifier. A variety of machine learning algorithms [52,54] can be used for this purpose.

• Evaluate the classifier (typically using cross-validation or leave-one-out experiments) on independent test data (not used to train the classifier).

• Apply the classifier to identify putative interface residues in a protein, given its sequence (and possibly its structure), but not the sequence or structure of its interaction partner.
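The evaluation step in the list above can be sketched as a leave-one-protein-out loop. The trainer and predictor are passed in as functions so that any learning algorithm can be plugged in; the trivial majority-class learner used in the example is a stand-in for illustration only and is not the SVM used in this work. All names are hypothetical.

```python
from collections import Counter

def leave_one_protein_out(dataset, train, predict):
    """Train on all proteins but one and predict labels for the held-out one.

    dataset: dict mapping protein id -> list of (feature_vector, label).
    train: function taking a list of (feature_vector, label), returning a model.
    predict: function taking (model, feature_vector), returning a label.
    """
    results = {}
    for held_out in dataset:
        training = [example
                    for pid, examples in dataset.items() if pid != held_out
                    for example in examples]
        model = train(training)
        results[held_out] = [predict(model, x) for x, _ in dataset[held_out]]
    return results

# Stand-in learner: always predicts the most common training label.
def train_majority(examples):
    return Counter(label for _, label in examples).most_common(1)[0][0]

def predict_majority(model, features):
    return model

toy_dataset = {
    "protA": [([0.1], 1), ([0.9], 0)],
    "protB": [([0.2], 0)],
    "protC": [([0.8], 0)],
}
print(leave_one_protein_out(toy_dataset, train_majority, predict_majority))
```

Holding out whole proteins, rather than individual residues, keeps residues of the target protein out of the training set.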

Here we have used a support vector machine (SVM) learning algorithm because SVMs are well-suited for the data-driven construction of classifiers over high-dimensional patterns and are especially useful when the input is a real-valued pattern [56]. In addition, algorithms for constructing SVM classifiers effectively incorporate methods to avoid over-fitting the training data, thereby improving the generality of the resulting classifiers, i.e., their performance on test data. Support vector machine algorithms have proven effective in many applications, including text classification [57], gene expression analysis using microarray data [58], and predicting whether or not a pair of proteins is likely to interact [59].

Threading of sequences through structures of protein-protein interface surfaces

In phylogenetic and data mining approaches, the properties of the protein-protein interface are deduced by concentrating on the sequence information contained in the protein pair under investigation. However, it is well accepted that the physical origin of the specificity of protein-protein interactions comes predominantly from their structures. Thus, in any thorough investigation of protein-protein interactions, it is essential to include information from structural studies. Here we have adapted methods employed in protein structure recognition [60-63] to the problem of predicting protein-protein interface residues. In the first stage, structural models for identifying protein-protein interfaces are generated from existing Protein Data Bank (PDB) structures by extracting the portions of different protein chains that are in contact. We found that if we define the interaction region by the criterion that backbone Cα atoms on the two interacting chains are less than 15 Å apart, reasonably well connected fragments suitable for threading studies are obtained. In the second stage, after identifying a set of candidate template structures, threading is performed to examine the probability that a given model resembles the real interface. The threading algorithm is described in Cao et al. [27]. The threading alignments and scores obtained allowed us to predict which parts of each protein are in the interfacial region in the hydrolase-inhibitor complexes and to predict the most probable residue-residue contacts between the two proteins.
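
The 15 Å Cα criterion used in the first stage can be illustrated with a short sketch. The function name interface_residues and the input layout (a list of residue identifier and Cα coordinate pairs per chain) are assumptions made for this example; parsing of PDB files and the subsequent threading step are not shown.

```python
import math


def interface_residues(chain_a, chain_b, cutoff=15.0):
    """Find residues whose C-alpha atoms lie within `cutoff` of the other chain.

    chain_a, chain_b: lists of (residue_id, (x, y, z)) tuples holding the
    C-alpha coordinates (in Angstroms) of each residue.
    Returns two sorted lists of residue ids, one per chain.
    """
    near_a, near_b = set(), set()
    for id_a, xyz_a in chain_a:
        for id_b, xyz_b in chain_b:
            # Strictly "less than" the cutoff, as stated in the text.
            if math.dist(xyz_a, xyz_b) < cutoff:
                near_a.add(id_a)
                near_b.add(id_b)
    return sorted(near_a), sorted(near_b)
```

The all-pairs loop is quadratic in chain length, which is usually acceptable for a single complex; a spatial grid or k-d tree would be a natural replacement when many structures are scanned.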

Ensemble predictions for combining results from multiple methods

Different approaches for identifying binding sites from amino acid sequence information yield different (sometimes contradictory, sometimes complementary) results. In such cases, approaches for combining results from multiple predictors are potentially important. The key idea is that results obtained by using different approaches, which we will call classifiers henceforth, may be correlated (or, more generally, statistically dependent) for a variety of reasons, including the use of a common dataset for constructing or tuning classifiers, the use of intermediate variables for encoding input to the classifiers, and similarities between methods (e.g., SVM, neural networks). Regardless of the source of statistical dependency, the goal is to develop methods for weighting the output of each classifier appropriately for the purpose of producing more accurate predictions. Our method takes as input the binary (True/False) output of each classifier (e.g., SVM, CoC) and produces as output a probability that the residue under consideration is an interface residue, using the outputs produced by each of the classifiers. Algorithms for learning Bayesian (or Markov) networks can then be used to learn the network of dependences and the relevant conditional probabilities.
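
As a rough illustration of mapping binary classifier outputs to an interface probability, the sketch below estimates P(interface | output pattern) by counting over labeled training residues. It is a simplification and not the network-learning procedure described above: it tabulates the full joint pattern of outputs instead of learning a dependency structure, and the function names and the Laplace smoothing constant alpha are assumptions made for the example.

```python
from collections import defaultdict


def fit_joint_table(predictions, labels, alpha=1.0):
    """Estimate P(interface | pattern of classifier outputs) by counting.

    predictions: one tuple of booleans per residue, one entry per classifier.
    labels: one boolean per residue (True = interface residue).
    alpha: Laplace smoothing constant (an assumption of this sketch).
    """
    counts = defaultdict(lambda: [0, 0])  # pattern -> [non-interface, interface]
    for pattern, is_interface in zip(predictions, labels):
        counts[tuple(pattern)][int(is_interface)] += 1
    return {
        pattern: (pos + alpha) / (neg + pos + 2 * alpha)
        for pattern, (neg, pos) in counts.items()
    }


def predict_probability(table, pattern, default=0.5):
    """Look up the estimated probability; unseen patterns get `default`."""
    return table.get(tuple(pattern), default)
```

Because every output pattern gets its own estimate, dependences between classifiers are captured implicitly, but the number of patterns doubles with each added classifier; a learned network structure is the more economical choice when many predictors are combined.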

General evaluation measures for assessing the performance of classifiers

Let TP denote the number of true positives (residues predicted to be interface residues that are actually interface residues); TN the number of true negatives (residues predicted not to be interface residues that are in fact not interface residues); FP the number of false positives (residues predicted to be interface residues that are not interface residues); and FN the number of false negatives (residues predicted not to be interface residues that actually are interface residues). Let N = TP+TN+FP+FN. Sensitivity (recall) and Specificity (precision) are defined for the positive (+) class as well as the negative (-) class: Sensitivity+ = TP/(TP+FN), Sensitivity- = TN/(TN+FP), Specificity+ = TP/(TP+FP), Specificity- = TN/(TN+FN). Overall sensitivity and overall specificity correspond to expected values of the corresponding measures averaged over both classes. The performance of the classifier is summarized by the correlation coefficient [64], which is given by

$$\mathrm{CC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FN)(TP+FP)(TN+FP)(TN+FN)}}$$

The correlation coefficient ranges from -1 to 1 and is a measure of how predictions correlate with the actual data [64]. It is important to note that when the number of negative instances is much larger than the number of positive instances (as is the case for prediction of interface residues) the Sensitivity+ and Specificity+ measures are more appropriate for assessing prediction performance than the overall Sensitivity and Specificity measures [64]. In the extreme case when a classifier predicts every example to be negative (due to a preponderance of negative training instances), these overall performance measures would still show a high success rate despite the obvious failure of the prediction method. In such cases, the Correlation Coefficient, as well as the Sensitivity+, which is a measure of the fraction of positive instances that are correctly predicted, and Specificity+, which is a measure of the fraction of the positive predictions that are actually positive instances, may provide a better performance assessment. Of course, a meaningful comparison of the performance of different classification methods depends critically on the specific application and goal.
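
The measures defined in this section can be computed directly from the four counts, as in the following sketch. The correlation coefficient is taken here to be the Matthews form discussed in [64], CC = (TP·TN − FP·FN)/sqrt((TP+FN)(TP+FP)(TN+FP)(TN+FN)). Two choices are assumptions of the example and not part of the definitions above: a ratio with a zero denominator is reported as 0, and an "accuracy" entry, (TP+TN)/N, is added to stand in for an overall success rate.

```python
import math


def interface_metrics(tp, tn, fp, fn):
    """Per-class sensitivity/specificity and correlation coefficient."""

    def ratio(num, den):
        # Convention assumed here: an undefined ratio is reported as 0.
        return num / den if den else 0.0

    cc_den = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return {
        "sensitivity+": ratio(tp, tp + fn),  # recall of interface residues
        "specificity+": ratio(tp, tp + fp),  # precision of interface predictions
        "sensitivity-": ratio(tn, tn + fp),
        "specificity-": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),  # added for illustration
        "cc": ratio(tp * tn - fp * fn, cc_den),
    }


# A classifier that labels every residue "non-interface" on a 10:90 data set:
all_negative = interface_metrics(tp=0, tn=90, fp=0, fn=10)
```

In this example all_negative["accuracy"] is 0.9 even though no interface residue is found, whereas Sensitivity+ and the correlation coefficient are both 0, which is the failure mode described in the text.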

Authors' contributions

CY, VH and DD performed data mining calculations. XG performed phylogenetic calculations. KMH, CZW, YI, DD, and HC worked on threading. TZS, AK, and RLJ worked on the implementation of CoC and the development of consensus methodology. Every author contributed to the final draft of the paper.

Acknowledgments

The financial support through the NIH grant 1R21GM066387 is acknowledged by V. Honavar, D. Dobbs and R.L. Jernigan. We thank Dimitris Margaritis and other members of our research groups for helpful discussions. We also wish to thank the anonymous reviewers for valuable comments on the original version of this manuscript.

References

  1. Chothia C, Janin J: Principles of Protein-Protein Recognition.

    Nature 1975, 256:705-708.

  2. Yan CH, Honavar V, Dobbs D: Identification of interface residues in protease-inhibitor and antigen-antibody complexes: a support vector machine approach.

    Neural Computing & Applications 2004, 13:123-129.

  3. Yan C, Dobbs D, Honavar V: A two-stage classifier for identification of protein-protein interface residues.

    Bioinformatics 2004, 20:i371-i378.

  4. Teichmann SA, Murzin AG, Chothia C: Determination of protein function, evolution and interactions by structural genomics.

    Curr Opin Struct Biol 2001, 11:354-363.

  5. Valencia A, Pazos F: Computational methods for the prediction of protein interactions.

    Curr Opin Struct Biol 2002, 12:368-373.

  6. Valencia A, Pazos F: Prediction of protein-protein interactions from evolutionary information. In Structural Bioinformatics. Edited by Bourne PE and Weissig H. USA, John Wiley & Sons; 2003:411-426.

  7. Young L, Jernigan RL, Covell DG: A role for surface hydrophobicity in protein-protein recognition.

    Prot Sci 1994, 3:717-729.

  8. Kini RM, Evans HJ: Prediction of potential protein-protein interaction sites from amino acid sequence. Identification of a fibrin polymerization site.

    FEBS Lett 1996, 385:81-86.

  9. Jones S, Thornton JM: Prediction of protein-protein interaction sites using patch analysis.

    J Mol Biol 1997, 272:133-143.

  10. Jones S, Thornton JM: Analysis of protein-protein interaction sites using surface patches.

    J Mol Biol 1997, 272:121-132.

  11. Gallet X, Charloteaux B, Thomas A, Brasseur R: A fast method to predict protein interaction sites from sequences.

    J Mol Biol 2000, 302:917-926.

  12. Casari G, Sander C, Valencia A: A method to predict functional residues in proteins.

    Nat Struct Biol 1995, 2:171-178.

  13. Lichtarge O, Bourne HR, Cohen FE: An evolutionary trace method defines binding surfaces common to protein families.

    J Mol Biol 1996, 257:342-358.

  14. Pazos F, Helmer-Citterich M, Ausiello G, Valencia A: Correlated mutations contain information about protein-protein interaction.

    J Mol Biol 1997, 271:511-523.

  15. Lu L, Lu H, Skolnick J: MULTIPROSPECTOR: an algorithm for the prediction of protein-protein interactions by multimeric threading.

    Proteins 2002, 49:350-364.

  16. Fariselli P, Pazos F, Valencia A, Casadio R: Prediction of protein-protein interaction sites in heterocomplexes with neural networks.

    Eur J Biochem 2002, 269:1356-1361.

  17. Zhou HX, Shan Y: Prediction of protein interaction sites from sequence profile and residue neighbor list.

    Proteins 2001, 44:336-343.

  18. Read RJ, Fujinaga M, Sielecki AR, James MN: Structure of the complex of Streptomyces griseus protease B and the third domain of the turkey ovomucoid inhibitor at 1.8-A resolution.

    Biochemistry 1983, 22:4420-4433.

  19. Ptitsyn OB, Ting KL: Non-functional conserved residues in globins and their possible role as a folding nucleus.

    J Mol Biol 1999, 291:671-682.

  20. Ting KL, Jernigan RL: Identifying a folding nucleus for the lysozyme/alpha-lactalbumin family from sequence conservation clusters.

    J Mol Evol 2002, 54:425-436.

  21. Mirny LA, Shakhnovich EI: Universally conserved positions in protein folds: Reading evolutionary signals about stability, folding kinetics and function.

    J Mol Biol 1999, 291:177-196.

  22. Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG: The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools.

    Nucl Acids Res 1997, 25:4876-4882.

  23. Holm L, Sander C: Protein structure comparison by alignment of distance matrices.

    J Mol Biol 1993, 233:123-138.

  24. Sander C, Schneider R: Database of homology derived protein structures and the structural meaning of sequence alignment.

    Proteins 1991, 9:56-58.

  25. Dodge C, Schneider R, Sander C: The HSSP database of Protein Structure-Sequence Alignments and Family Profiles.

    Nucl Acids Res 1998, 26:313-315.

  26. Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features.

    Biopolymers 1983, 22:2577-2637.

  27. Cao H, Ihm Y, Wang CZ, Morris JR, Su M, Dobbs D, Ho KM: Three-dimensional threading approach to protein structure recognition.

    Polymer 2004, 45:687-697.

  28. Moult J, Fidelis K, Zemla A, Hubbard T: Critical assessment of methods of protein structure prediction (CASP)-round V.

    Proteins 2003, 53:334-339.

  29. Li H, Tang C, Wingreen NS: Nature of Driving Force for Protein Folding: A Result From Analyzing the Statistical Potential.

    Phys Rev Lett 1997, 79:765-768.

  30. Miyazawa S, Jernigan RL: Estimation of Effective Interresidue Contact Energies From Protein Crystal-Structures - Quasichemical Approximation.

    Macromolecules 1985, 18:534-552.

  31. Carugo D, Franzot G: Prediction of protein-protein interactions based on surface patch comparison.

    Proteomics 2004, 4:1727-1736.

  32. Lu H, Lu L, Skolnick J: Development of Unified Statistical Potentials Describing Protein-Protein Interactions.

    Biophys J 2003, 84:1895-1901.

  33. Lu L, Arakaki AK, Lu H, Skolnick J: Multimeric Threading-Based Prediction of Protein-Protein Interactions on a Genomic Scale: Application to the Saccharomyces cerevisiae Proteome.

    Genome Res 2003, 13:1146-1154.

  34. Martin S, Roe D, Faulon JL: Predicting protein-protein interactions using signature products.

    Bioinformatics 2004, bth483.

  35. Neuvirth H, Raz R, Schreiber G: ProMate: A Structure Based Prediction Program to Identify the Location of Protein-Protein Binding Sites.

    J Mol Biol 2004, 338:181-199.

  36. Obenauer JC, Yaffe MB: Computational prediction of protein-protein interactions.

    Methods Mol Biol 2004, 261:445-468.

  37. Ofran Y, Rost B: Predicted protein-protein interaction sites from local sequence information.

    FEBS Lett 2003, 544:236-239.

  38. Valencia A, Pazos F: Prediction of protein-protein interactions from evolutionary information.

    Methods Biochem Anal 2003, 44:411-426.

  39. Chakrabarti P, Janin J: Dissecting protein-protein recognition sites.

    Proteins 2002, 47:334-343.

  40. Frigerio F, Coda A, Pugliese L, Lionetti C, Menegatti E, Amiconi G, Schnebli HP, Ascenzi P, Bolognesi M: Crystal and molecular structure of the bovine alpha-chymotrypsin-eglin c complex at 2.0 A resolution.

    J Mol Biol 1992, 225:107-123.

  41. Tsunemi M, Matsuura Y, Sakakibara S, Katsube Y: Crystal structure of an elastase-specific inhibitor elafin complexed with porcine pancreatic elastase determined at 1.9 A resolution.

    Biochemistry 1996, 35:11570-11576.

  42. Mittl PR, Di Marco S, Fendrich G, Pohlig G, Heim J, Sommerhoff C, Fritz H, Priestle JP, Grutter MG: A new structural class of serine protease inhibitors revealed by the structure of the hirustasin-kallikrein complex.

    Structure 1997, 5:253-264.

  43. Song HK, Suh SW: Kunitz-type soybean trypsin inhibitor revisited: refined structure of its complex with porcine trypsin reveals an insight into the interaction between a homologous inhibitor from Erythrina caffra and tissue-type plasminogen activator.

    J Mol Biol 1998, 275:347-363.

  44. Takeuchi Y, Satow Y, Nakamura KT, Mitsui Y: Refined crystal structure of the complex of subtilisin BPN' and Streptomyces subtilisin inhibitor at 1.8 A resolution.

    J Mol Biol 1991, 221:309-325.

  45. Rees DC, Lipscomb WN: Refined crystal structure of the potato inhibitor complex of carboxypeptidase A at 2.5 A resolution.

    J Mol Biol 1982, 160:475-498.

  46. Jones S, Thornton JM: Principles of protein-protein interactions.

    Proc Natl Acad Sci USA 1996, 93:13-20.

  47. Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge, U.K., Cambridge University Press; 1998.

  48. Gu X: Statistical methods for testing functional divergence after gene duplication.

    Mol Biol Evol 1999, 16:1664-1674.

  49. Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach.

    J Mol Evol 1981, 17:368-376.

  50. Gu X, Vander Velden K: DIVERGE: Phylogeny-based Analysis for Functional-Structural Divergence of a Protein.

    Bioinformatics 2002, 18:500-501.

  51. Laurents DV, Subbiah S, Levitt M: Different protein sequences can give rise to highly similar folds through different stabilizing interactions.

    Prot Sci 1994, 3:1938-1944.

  52. Mitchell T: Machine Learning. New York, McGraw-Hill; 1997.

  53. Witten IH, Frank E: Data mining: Practical machine learning tools and techniques with Java implementations. San Mateo, CA, Morgan Kaufmann; 1999.

  54. Baldi P, Brunak S: Bioinformatics: The Machine Learning Approach. 2nd edition. Cambridge, MA, MIT Press; 2001.

  55. Luscombe NM, Greenbaum D, Gerstein M: What is bioinformatics? A proposed definition and overview of the field.

    Methods Inform Med 2001, 40:346-358.

  56. Vapnik V: Statistical learning theory. New York, Springer-Verlag; 1998.

  57. Hearst MA, Scholkopf B, Dumais S, Osuna E, Platt J: Trends and controversies - support vector machines.

    IEEE Intelligent Systems 1998, 13:18-28.

  58. Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet CWS, Furey T, Ares Jr. M, Haussler D: Knowledge based analysis of microarray gene expression data using support vector machines.

    Proc Natl Acad Sci USA 2000, 97:262-267.

  59. Bock JR, Gough DA: Predicting protein-protein interactions from primary structure.

    Bioinformatics 2001, 17:455-460.

  60. Godzik A, Skolnick J: Sequence-structure matching in globular proteins: application to supersecondary and tertiary structure determination.

    Proc Natl Acad Sci USA 1992, 89:12098-12102.

  61. Jones DT, Miller RT, Thornton JM: Successful protein fold recognition by optimal sequence threading validated by rigorous blind testing.

    Proteins 1995, 23:387-397.

  62. Meller J, Elber R: Linear programming optimization and a double statistical filter for protein threading protocols.

    Proteins 2001, 45:241-261.

  63. Miyazawa S, Jernigan RL: Identifying sequence-sequence pairs undetected by sequence alignments.

    Protein Eng 2000, 13:459-475.

  64. Baldi P, Brunak S, Chauvin Y, Andersen CAF, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview.

    Bioinformatics 2000, 16:412-424.