Abstract
Background
Supervised learning and many stochastic methods for predicting protein-protein interactions require both negative and positive interactions in the training data set. Unlike positive interactions, negative interactions cannot be readily obtained from interaction data, so they must be generated. In protein-protein interactions, as in other molecular interactions, taking all non-positive interactions as negative interactions produces far more negative interactions than positive interactions. Random selection from non-positive interactions is unsuitable, since the selected data may not reflect the original distribution of the data.
Results
We developed a bootstrapping algorithm for generating a negative data set of arbitrary size from protein-protein interaction data. We also developed an efficient boosting algorithm for finding interacting motif pairs in human and virus proteins. The boosting algorithm showed the best performance (84.4% sensitivity and 75.9% specificity) with balanced positive and negative data sets. The boosting algorithm was also used to find potential motif pairs in complexes of human and virus proteins, whose structural data were not used to train the algorithm. Interacting motif pairs common to multiple folds of structural data for the complexes were shown to be statistically significant. The data set for interactions between human and virus proteins was extracted from BOND and is available at http://virus.hpid.org/interactions.aspx. The complexes of human and virus proteins were extracted from PDB and their identifiers are available at http://virus.hpid.org/PDB_IDs.html.
Conclusion
When the positive and negative training data sets are unbalanced, the prediction model tends to be biased. Bootstrapping is effective for generating a negative data set whose size and distribution are easily controlled. Our boosting algorithm, trained with the balanced data sets generated via the bootstrapping method, efficiently predicted interacting motif pairs from protein interaction and sequence data.
Background
Linear motifs are known to facilitate many protein-protein interactions [1]. Despite the availability of a large volume of data about protein-protein interactions and their sequences, linear motifs are difficult to discover due to their short length, which is between three and ten amino acids [2]. Recently, several methods have been developed for discovering linear motifs of protein-protein interactions [1,3], but most focus on detecting individual linear motifs rather than interacting motif pairs. Motif pairs are more useful than single motifs for filtering the many spurious protein interactions in current high-throughput data, and for identifying a functional target.
Supervised learning or stochastic methods are often used to predict linear motifs involved in protein-protein interactions. Both negative and positive interactions are required to train the methods. Unlike positive interaction data, negative samples cannot be readily obtained from protein-protein interaction data. Assuming a negative interaction wherever there is no explicit evidence of a positive interaction results in a much larger negative data set than positive data set. Such an imbalance between the positive and negative data sets biases the prediction [4,5]. Generating a negative data set via random selection often fails to reflect the original distribution of the data, and thus does not produce a good prediction model.
There are a few methods for generating a negative data set. Jansen et al. [6] generate a data set of negative interactions by assuming that proteins in different subcellular compartments of a cell do not interact. However, different subcellular locations only indicate that the proteins have a lower chance of binding than those in the same location, and some proteins are found in more than one subcellular compartment of a cell [7]. The method developed by Gomez et al. [8] assumes a negative protein interaction, if there is no explicit evidence of an interaction. However, this assumption generates a negative data set that is too large, resulting in low sensitivity in interaction predictions. The method that uses the shortest path [7] has difficulty in obtaining a negative data set of the desired size. The method that uses sequence similarity [9] also has difficulty in controlling the size of the negative data set.
In this study, we developed a bootstrapping algorithm for generating a negative data set of protein-protein interactions, and a new boosting algorithm for finding interacting motif pairs from positive and negative data sets. The remainder of the paper describes the algorithms and their experimental results with various parameter values.
Results and discussion
We measured the prediction performance of the boosting algorithm in terms of sensitivity, specificity and accuracy.
In the following description, the sampling size S is the number of negative samples examined to generate a single negative sample via bootstrapping. When the fraction of the S sampled negative data whose m-th feature is 1 exceeds the acceptance ratio A, the m-th feature of the resampled negative sample is set to 1. The feature vector and the acceptance ratio are described in detail in the Methods section.
Effect of acceptance ratios
From the interactions between human and virus proteins, we generated four different negative data sets by executing the bootstrapping algorithm with four acceptance ratios (1/10, 1/8, 1/6, 1/4). We then used both the negative and positive data sets to test the boosting algorithm via five-fold cross validation. Motif pairs predicted from each fold were combined as follows: M_{i} = {motif pairs found in at least i folds}, where i = 1, 2, ..., 5 [7]. Table 1 shows the number of motif pairs predicted with different acceptance ratios.
Table 1. Motif pairs found during fivefold cross validation
As the acceptance ratio increases, resampled negative data have fewer nonzero features, resulting in more motif pairs. This is because the nonzero features of negative data are used to filter out the features that are also nonzero in positive data.
With the sampling size of 120, most non-interaction data were resampled to generate a negative data set. We compared the prediction performance of the algorithm with respect to four different acceptance ratios. As shown in Table 2, prediction of motif pairs with a larger acceptance ratio performs much better than with a smaller acceptance ratio. As the acceptance ratio increases, negative data have fewer nonzero features. Hence, data with many zero features are easily classified as negative samples.
Table 2. Prediction performance with respect to acceptance ratios of bootstrapping
Effect of proportions of positive and negative data sets
For the purpose of comparing the prediction performance with respect to different proportions of positive and negative data sets, we generated three negative data sets with the sampling size of 120 and acceptance ratio of 1/8. The data set for 1,712 interactions between human proteins and virus proteins was used as the positive data set. Table 3 and Figure 1 show the prediction performance with respect to three different proportions of positive and negative data sets. As the proportion of positive data increases, sensitivity increases, but specificity decreases. It is interesting to note that the size of the negative data sets alone affects the performance.
Figure 1. Sensitivity and specificity of predictions with respect to proportions of positive and the negative data. As the proportion of positive data increases, the sensitivity increases but the specificity decreases.
Table 3. Prediction performance with respect to proportions of positive and negative data
Effect of boosting algorithms
The execution time of the boosting algorithm is influenced by the number of hypotheses (T; for Yu's AdaBoost algorithm only), the number of partitioned data sets (S), and the number of randomly selected training data for weak hypotheses (R). Suppose that we set the parameters T = 4, S = 5, and R = 100,000. Yu's AdaBoost uses 5 × 4 = 20 weak hypotheses, but our boosting algorithm uses only five. While Yu's AdaBoost uses four weak hypotheses per data set, ours uses only one weak hypothesis per data set. With fewer weak hypotheses than Yu's AdaBoost algorithm, our algorithm achieves better performance, as shown in Table 4.
Table 4. Prediction performance of two boosting algorithms
Motif pairs found in complexes of human and virus proteins
Table 5 shows the p-values for each set of motif pairs. The p-value of M_{1} was 1, implying that the motif pairs of M_{1} were no more significant than random motif pairs. However, the motif pairs of M_{2} through M_{5} were more significant than random motif pairs. Figure 2 shows a complex of human and HIV-1 proteins (PDB ID: 1AGF). Among the total of 63 contact residues between chains A and C, 16 residue pairs were included in M_{2}.
Conclusion
When positive and negative training data sets are unbalanced, the prediction model tends to be biased. We developed a bootstrapping algorithm for generating a negative data set of arbitrary size from protein-protein interaction data. We also developed an efficient boosting algorithm for finding interacting motif pairs in human and virus proteins. The boosting algorithm showed the best performance (84.4% sensitivity and 75.9% specificity) with balanced positive and negative data sets. The boosting algorithm was also used to find potential motif pairs in complexes of human and virus proteins, whose structural data were not used for training the algorithm. Interacting motif pairs common to multiple folds of structural data of the complexes were shown to be statistically significant.
This method predicts protein-protein interactions and motif pairs using protein sequence data. Sequence information alone is insufficient to predict motif pairs for some proteins, but our method provides a useful model for predicting motif pairs in protein-protein interactions when the sequence is the only information available. The data set for interactions between human and virus proteins was extracted from BOND and is available at http://virus.hpid.org/interactions.aspx. The complexes of human and virus proteins were extracted from PDB and their identifiers are available at http://virus.hpid.org/PDB_IDs.html.
Methods
Data set
We extracted the latest data on interactions between human and virus proteins from BOND [10]. As of May 2008, there were 1,712 interactions between 1,029 human proteins and 603 virus proteins. These interactions were used as positive data. From the 1,712 interactions, we constructed three negative data sets of 2,252, 1,712, and 2,283 samples via the bootstrapping method.
Feature vector
Our way of extracting features was similar to that used in the studies of Gomez et al. [8] and Yu et al. [7]. In the study by Gomez et al., four-tuple features were used to identify a subsequence of four amino acids. Based on biochemical similarities, the twenty amino acids were classified into six categories: {IVLM}, {FYW}, {HKR}, {DE}, {QNTP}, and {ACGS} [11]. After classification, there were 6^{4} = 1,296 possible substrings of length four.
For a given protein sequence, a four-tuple feature is represented as a 1,296-bit binary vector, in which each bit indicates whether the corresponding length-four string occurs in the protein. The encoding scheme for the interaction binary vector is described in Table 6.
Table 6. Encoding scheme for the interacting motif pairs
Both our previous study [9] and the study of Yu et al. [7] found interacting motif pairs in yeast proteins. A binary vector representing an interacting motif pair is a palindrome, so the total number M_{symmetric} of possible motif pairs is determined by

M_{symmetric} = 6^{4} · (6^{4} + 1)/2 = 840,456

The interactions between human and virus proteins are interactions between heterogeneous proteins. Hence, the total number M_{asymmetric} of possible motif pairs is as follows.

M_{asymmetric} = 6^{4} · 6^{4} = 6^{8} = 1,679,616
Our method is intended to find motif pairs with four consecutive residues (i, i+1, i+2, and i+3) in each motif. Hence, a motif with non-consecutive residues cannot be found even if its residues are spatially close to each other. Since the total number of possible motif pairs is 6^{m} · 6^{m} = (6^{m})^{2} = 6^{2m} for a motif of size m, the total number of possible motif pairs increases exponentially as m increases. This number can be reduced by using a motif of a smaller size (e.g., 2 or 3 residues), but a small motif has too many occurrences in the sequences, which significantly reduces its selectivity.
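The growth of the motif-pair space is easy to verify numerically. The snippet below evaluates the pair counts for m = 4 under the standard unordered (symmetric) and ordered (asymmetric) pair formulas; it assumes nothing beyond the six-category alphabet described above:

```python
# Motif-pair counts over the six-category alphabet with motif size m = 4.
N = 6 ** 4                        # 1,296 possible length-four motifs
m_symmetric = N * (N + 1) // 2    # unordered pairs (homogeneous, e.g. yeast)
m_asymmetric = N * N              # ordered pairs (heterogeneous, human-virus)
print(N, m_symmetric, m_asymmetric)
```

Doubling m to 8 would raise the asymmetric count from 6^8 to 6^16, which illustrates the exponential growth mentioned above.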
Bootstrapping for resampling
As in Gomez et al.'s method [8], we assumed a negative interaction if there was no explicit evidence of an interaction. However, this assumption generates a much larger number of negative samples than positive samples. If we randomly select only some of the negative samples, we might miss information from unselected negative samples. Dupret and Koda [5] used bootstrapping to identify the optimal resampling proportions in binary classification experiments.
In our study, we used bootstrapping to generate negative data sets by resampling negative data. Algorithm 1 describes our bootstrapping method, which is controlled by the sampling size S and the acceptance ratio A. Each execution of the bootstrapping algorithm yields a single resampled negative sample from S negative samples. The resampled negative sample is represented as a feature vector Y = {y_{1}, y_{2}, ..., y_{M}} via Algorithm 1. The number of 1's in the feature vector Y is controlled by the acceptance ratio A: a larger value of A produces a feature vector with fewer nonzero elements.
Algorithm 1 – Bootstrapping algorithm
This algorithm generates the feature vector Y for a single negative sample from S samples, where S is the sampling size and A is the acceptance ratio for setting a feature to 1.
1. Randomly sample S protein pairs (P_{s1}, P_{s2}) with replacement from noninteracting protein pairs, where s = {1, 2, ..., S}.
2. Initialize n_{i }= 0 for i = {1, 2, ..., M}
3. Initialize y_{i }= 0 for i = {1, 2, ..., M}
4. For s = {1, 2, ..., S}
a. Make a binary vector X_{s} = {x_{s1}, x_{s2}, ..., x_{sM}} for a pair of proteins (P_{s1}, P_{s2})
b. For m = {1...M}
If x_{sm }= 1, n_{m }= n_{m }+ 1 {n_{m }is the number of samples for which the mth feature = 1}
5. For m = {1...M}
If n_{m}/S > A, set y_{m }= 1
6. Y = {y_{1}, y_{2}, ..., y_{M}} is a feature vector representing the resampled negative data.
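A minimal sketch of Algorithm 1 in Python is shown below; `make_feature_vector` is a hypothetical stand-in for step 4a, and the function and parameter names are ours, not the paper's:

```python
import random

def bootstrap_negative_sample(non_interacting_pairs, make_feature_vector,
                              S, A, M):
    """Resample one negative feature vector Y from S random protein pairs:
    y_m = 1 iff more than a fraction A of the sampled vectors have x_m = 1."""
    counts = [0] * M                                    # steps 2-3
    for _ in range(S):                                  # step 4
        pair = random.choice(non_interacting_pairs)     # with replacement
        x = make_feature_vector(pair)                   # step 4a (assumed helper)
        for m in range(M):
            counts[m] += x[m]                           # step 4b
    # step 5: accept feature m if n_m / S > A
    return [1 if counts[m] / S > A else 0 for m in range(M)]
```

Calling this routine once per desired negative sample yields a negative data set of any chosen size, which is how the size and distribution of the negative set are controlled.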
The boosting algorithm
In general, the boosting method finds a highly accurate hypothesis by combining weak hypotheses, each of which is only moderately accurate. Typically, each weak hypothesis is a simple classification rule. In AdaBoost (Adaptive Boosting), each weak hypothesis generates not only a classification rule but also a confidence score that estimates the reliability of the classification [12].
The study of Yu et al. [7] uses the AdaBoost algorithm for finding motif pairs in homogeneous protein interactions. One of the differences between Yu's algorithm and ours is the number of weak hypotheses used in the algorithms. In Yu's AdaBoost algorithm, if the weight (α_{s1}) of the first weak hypothesis is much greater than the weights of other hypotheses, the final hypothesis is determined mainly by the first weak hypothesis and other hypotheses have negligible effect on the final hypothesis.
Our boosting algorithm determines the weights of weak hypotheses and uses the training data in a different way from Yu's algorithm. While Yu's AdaBoost algorithm uses different weights and the same training data per weak hypothesis, our algorithm uses the same weights and different training data per weak hypothesis. Our boosting algorithm uses fewer weak hypotheses than Yu's algorithm, and requires much less time than their algorithm.
Our algorithm consists of two parts: the boosting algorithm and the WINNOW2 algorithm. The boosting algorithm described in Algorithm 2 takes as input a training set (x_{1}, y_{1}), ..., (x_{n}, y_{n}), where each x_{i} is a binary vector of length M representing an interaction and each y_{i} is a label in the label set Y = {-1, +1}, indicating whether the interaction is positive or negative. The boosting algorithm calls the WINNOW2 algorithm to obtain a weak hypothesis in an iterative series of rounds, where t = {1, ..., S}. In each round, the boosting algorithm computes the weight (α_{t}) of the weak hypothesis h_{c,t}. The final hypothesis H_{t} for Set_{t} is the weighted sum of the weak hypotheses h_{c,i} (i = 1, ..., S and i ≠ t).
We used a regulated stochastic WINNOW2 algorithm [13] with R = 200,000 as a weak classifier (Algorithm 3). The WINNOW2 algorithm is similar to that of Yu et al. [7], except for the step of updating learner factors: Yu's algorithm updates learner factors when x_{ki} (a feature) is 0, whereas ours updates them when x_{ki} is 1. Yu's algorithm takes a training set as input and computes normalized sample weights in each boosting round; when drawing sample data, data with larger weights are drawn more frequently than those with smaller weights. Since the sample weights are difficult to adjust in each round, our algorithm uses the same weight for every sample and draws samples with equal frequency. However, the training data changes in every round, so the call to the WINNOW2 algorithm produces different hypotheses depending on the training data. Finally, additional regulation is performed to discover effective components: components with large learner factors are identified as effective, and these effective components are considered the motif pairs of protein-protein interactions.
Suppose that there are five data sets (S = 5) and four weak hypotheses (T = 4 in Yu's algorithm) per round. Yu's AdaBoost algorithm requires 5 × 4 = 20 weak hypotheses to classify the data. In contrast, our boosting algorithm requires only one weak hypothesis per round, and five weak hypotheses in total, thus it does not need the parameter T. Since the execution times of the algorithms are proportional to the number of hypotheses, our algorithm is more than four times faster than Yu's algorithm for the same data set, without reducing the prediction accuracy [9]. The frameworks for both algorithms are shown in Figures 3 and 4.
Figure 3. Framework for Yu's AdaBoost algorithm. The AdaBoost algorithm requires 20 weak hypotheses for T = 4 and S = 5.
Figure 4. The framework of our boosting algorithm. Our algorithm requires only 5 weak hypotheses for S = 5.
Algorithm 2 – boosting algorithm
The boosting algorithm calls the WINNOW2 algorithm to obtain weak hypotheses. S is the number of divided data sets.
1. Given divided data sets Set_{1}, Set_{2}, ..., Set_{S}, where Set_{1} ∪ Set_{2} ∪ ... ∪ Set_{S} is the entire training data set.
2. For t = 1, ..., S
a. Given training data (x_{1}, y_{1}), (x_{2}, y_{2}), ..., (x_{n}, y_{n}) from Set_{t}, where x_{i} ∈ {0, 1}^{M} and y_{i} ∈ Y = {-1, +1} for i = {1, 2, ..., n}
b. Call the WINNOW2 algorithm to obtain the weak hypothesis h_{c,t}.
c. Compute the error r_{t }of the weak hypothesis h_{c,t }at level c.
d. Compute the weight α_{t }of the weak hypothesis
3. Output the final hypothesis for Set_{t}:
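The error and weight equations of steps 2c and 2d are not reproduced here, so the sketch below substitutes an AdaBoost-style weight alpha_t = 0.5*ln((1 - r_t)/r_t) purely for illustration, and combines all weak hypotheses into one weighted vote rather than the per-set leave-one-out combination H_t used by the algorithm; all names are our assumptions:

```python
import math

def boost(data_sets, train_winnow2):
    """data_sets: list of S lists of (x, y) pairs with y in {-1, +1}.
    train_winnow2(data) must return a classifier h(x) -> {-1, +1}."""
    hypotheses, alphas = [], []
    for subset in data_sets:                        # one weak learner per set
        h = train_winnow2(subset)
        errors = sum(1 for x, y in subset if h(x) != y)
        r = min(max(errors / len(subset), 1e-9), 1 - 1e-9)  # clamp error rate
        hypotheses.append(h)
        alphas.append(0.5 * math.log((1 - r) / r))  # assumed AdaBoost-style weight
    def final(x):                                   # weighted majority vote
        s = sum(a * h(x) for a, h in zip(alphas, hypotheses))
        return 1 if s >= 0 else -1
    return final
```

The key structural point survives in the sketch: only one weak hypothesis is trained per data subset, which is why S subsets need only S weak hypotheses in total.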
Algorithm 3 – WINNOW2 algorithm
The WINNOW2 algorithm trains the weak hypothesis. R is the number of randomly selected data.
1. Given training data (x_{1}, y_{1}), (x_{2}, y_{2})..., (x_{n}, y_{n}).
2. Initialize learner factor w_{i }= 1 for i = {1, 2, ..., M}, and threshold θ = M/2
3. For r = {1, ..., R}
a. Randomly select a sample data (x_{k}, y_{k}), and let vector x_{k }denote (x_{k1}, x_{k2}, ..., x_{kM})
b. The learner responds as follows:
4. Define a regulated classifier h_{c }at level c as follows:
where w_{i,c }= w_{i }if w_{i }≥ c, and w_{i,c }= 0 otherwise.
5. Let N_{c} denote the number of positive predictions by classifier h_{c} in the training data, and N_{0} denote the number of positive predictions with the cutoff of 0.
Output the classifier h_{C} where C = arg max {c | N_{c} = N_{0}}.
6. The features with nonzero w_{i,c }are effective motif pairs.
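A compact sketch of steps 1-3 of the WINNOW2 algorithm follows, assuming the classical promotion/demotion-by-2 update applied only to active features (x_i = 1), as described above, and omitting the regulation steps 4-6; names and the seed parameter are ours:

```python
import random

def winnow2(train_data, M, R, seed=0):
    """Train learner factors w over M features from R randomly drawn
    samples; train_data is a list of (x, y) pairs with y in {-1, +1}."""
    rng = random.Random(seed)
    w = [1.0] * M
    theta = M / 2                     # step 2: threshold
    for _ in range(R):                # step 3
        x, y = rng.choice(train_data)             # equal-frequency draw
        score = sum(wi * xi for wi, xi in zip(w, x))
        predict = 1 if score > theta else -1
        if predict == y:
            continue                  # correct: no update
        factor = 2.0 if y == 1 else 0.5           # promote or demote
        for i in range(M):
            if x[i] == 1:             # update only active features
                w[i] *= factor
    return w
```

After training, the regulation step would zero out factors below a cutoff c and keep the features with nonzero w_{i,c} as effective motif pairs.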
Verification with structural data
To further evaluate the algorithm on the structures of heterogeneous multiprotein complexes, we extracted structural data for complexes of human and virus proteins from PDB [14]. Complexes with RNA or DNA chains were not retrieved. As of June 2008, there were a total of 105 complexes of human and virus proteins in PDB.
We used five-fold cross validation to evaluate the algorithm. The data set was split into five parts of equal size. The boosting algorithm, using the WINNOW2 algorithm for weak hypotheses, was trained with one part and tested with the remaining four parts. The train-test procedure consisted of five iterations.
When a residue pair in different chains contained an atomic pair within a distance of 5 Å, we considered the residue pair a contact residue pair. If a motif pair had at least one contact residue pair, we considered it a verifiable motif pair [7]. To assess the statistical significance of the motif pairs predicted by our algorithm, we estimated their p-values by executing Algorithm 4 with m = 100,000 [9]. Motif pairs with lower p-values are more significant than those with higher p-values.
Algorithm 4 – Estimation of pvalues of motif pairs
A motif pair with a smaller p-value is more significant than a random motif pair R_{i}.
1. Given a set S of motif pairs collected by weak hypotheses.
2. Randomly draw m motif pairs {R_{1}, R_{2}, ..., R_{m}} where R_{i} has the same size as M_{k} (k = 1, 2, ..., 5)
3. Compute the pvalue of the set S as follows:
where V(S) is the number of verifiable motif pairs.
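The p-value equation in step 3 did not survive formatting, so the sketch below assumes the usual empirical definition: the fraction of the m random motif-pair sets whose number of verifiable pairs is at least V(S). All names are illustrative:

```python
import random

def estimate_p_value(predicted_set, all_motif_pairs, is_verifiable,
                     m=100_000, seed=0):
    """Empirical p-value of a predicted motif-pair set: the fraction of m
    equal-size random sets with at least as many verifiable pairs."""
    rng = random.Random(seed)
    v_s = sum(1 for pair in predicted_set if is_verifiable(pair))
    k = len(predicted_set)
    hits = 0
    for _ in range(m):
        sample = rng.sample(all_motif_pairs, k)   # random set, same size
        if sum(1 for p in sample if is_verifiable(p)) >= v_s:
            hits += 1
    return hits / m
```

Under this definition, a set whose verifiable-pair count is rarely matched by random sets of the same size receives a small p-value, matching the interpretation given above.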
Competing interests
The authors declare that they have no competing interests.
Acknowledgements
This work was supported by the Korea Research Foundation Grant funded by the Korean Government (KRF2006D00038).
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/10?issue=S1
References

Davey NE, Shields DC, Edwards RJ: SLiMDisc: short, linear motif discovery, correcting for common evolutionary domains.
Nucleic Acids Res 2006, 34:3546-3554.

Neduva V, Russell RB: Linear motifs: evolutionary interaction switches.
FEBS Letters 2005, 579:3342-3345.

Neduva V, Russell RB: DILIMOT: discovery of linear motifs in proteins.
Nucleic Acids Res 2006, 34(Web Server issue):W350-W355.

Dupret G, Koda M: Bootstrap resampling for unbalanced data in supervised learning.
European Journal of Operational Research 2001, 134:141-156.

Jansen R, Gerstein M: Analyzing protein function on a genomic scale: the importance of gold-standard positives and negatives for network prediction.
Current Opinion in Microbiology 2004, 7:535-545.

Yu H, Qian M, Deng M: Using a Stochastic AdaBoost Algorithm to Discover Interactome Motif Pairs from Sequences.

Gomez SM, Noble WS, Rzhetsky A: Learning to Predict Protein-Protein Interactions from Protein Sequences.
Bioinformatics 2003, 19:1875-1881.

Kim J, Park B, Han K: Prediction of Interacting Motif Pairs using Stochastic Boosting.
Proceedings of Frontiers in the Convergence of Bioscience and Information Technologies 2007, 95-100.

Alfarano C, Andrade CE, Anthony K, et al.: The Biomolecular Interaction Network Database and related tools 2005 update.
Nucleic Acids Res 2005, 33(Database issue):D418-D424.

Taylor WR, Jones DT: Deriving an amino acid distance matrix.
Journal of Theoretical Biology 1993, 164:65-83.

Schapire RE, Singer Y: Improved Boosting Algorithms Using Confidence-rated Predictions.
Machine Learning 1999, 37:297-336.

Littlestone N: Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm.

Deshpande N, Addess KJ, Bluhm WF, et al.: The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema.
Nucleic Acids Res 2005, 33:D233-D237.