Abstract
Background
With the development of genome sequencing technologies, protein sequences are readily obtained by translating the measured mRNAs. Predicting protein-protein interactions from sequences is therefore in great demand, because identifying protein-protein interactions is becoming a bottleneck for understanding the functions of proteins, especially in barely characterized organisms. Although a few methods have been proposed, the converse problem, whether the features used extract sufficient and unbiased information from protein sequences, is almost untouched.
Results
In this study, we interrogate this problem theoretically via an optimization scheme. Motivated by the theoretical investigation, we propose novel encoding methods for both protein sequences and protein pairs. Our new methods exploit the information of protein sequences more fully while reducing artificial bias and computational cost. As a result, they significantly outperform the available methods in sensitivity, specificity, precision, and recall under cross-validation, reaching ~80% and ~90% accuracy in Escherichia coli and Saccharomyces cerevisiae, respectively. Our findings hold important implications for other sequence-based prediction tasks, because representing biological sequences is always the first step in computational biology.
Conclusions
By considering the converse problem, we propose new representation methods for both protein sequences and protein pairs. The results show that our method significantly improves the accuracy of protein-protein interaction prediction.
Background
The concerted interactions of thousands of proteins in cells form the basis of most biological processes. Genome-wide identification of protein-protein interactions is important for understanding the mechanisms underlying many biological phenomena, e.g. cell cycles, apoptosis, signal transduction, and the pathogenesis of diseases. Recently, high-throughput experimental methodologies have been developed to screen protein-protein interactions (PPIs) in a genome-wide way, e.g. yeast two-hybrid systems [1], mass spectrometry [2,3], and protein microarrays [4,5]. But these genome-wide studies are limited to a few model organisms, for example, Escherichia coli [6], Helicobacter pylori [7], Saccharomyces cerevisiae [3,8,9], Caenorhabditis elegans [10], Drosophila melanogaster [11], and Homo sapiens [12,13]. These preliminary explorations provide valuable resources for studying the model organisms [14]. More importantly, they allow us to learn interacting rules from the available PPIs and construct a universal predictor to accelerate the mapping of the whole interactomes of organisms, especially barely characterized species.
To construct a universal predictor, we need to extract protein attributes that are crucial to PPI prediction. Among the various attributes of proteins, the primary sequence is the most basic and the easiest to obtain thanks to the rapid development of genome sequencing technologies. In addition, the primary sequences of proteins virtually specify their structures, which provide the molecular basis for PPIs. Protein primary sequences therefore hold the promise of containing virtually sufficient information for constructing the most universal predicting method [15].
Almost all proteins are composed of the same twenty amino acids, but different proteins have different lengths. The first challenge in constructing a universal PPI predictor is therefore how to represent proteins of various lengths by numerical vectors of the same dimension, if vector-based computational methods are used. Even for methods not based on vectors, which features of the protein sequences are important to PPIs must be addressed first. So far, many methods have been proposed [15-20]. However, the converse problem, that is, to what extent protein sequences can be reconstructed from their vector representations, is rarely touched. Obviously, addressing this converse problem will facilitate the comparison of various representation schemes. Here, we develop an optimization model to evaluate theoretically the quality of various representation schemes by considering both the converse problem of protein representation and the computational cost.
Based on the key ingredients revealed by the optimization model, we suggest new coding methods for both protein sequences and protein pairs. Strict evaluations on Escherichia coli and Saccharomyces cerevisiae datasets suggest that our new vector representation for protein sequences improves the prediction accuracy significantly while greatly reducing the computational complexity. The new vector representation of protein pairs further improves the prediction accuracy and has excellent theoretical properties, i.e., symmetry, reversibility, and unbiasedness.
Results
Evaluating the converse problem of protein vector representations
We consider two theoretical aspects when evaluating vector representations of protein sequences. One is to what extent the protein sequence information is extracted by the vectors; this can be evaluated by checking whether and how protein sequences can be reconstructed from the vectors. The other is how the vector dimension grows as more information is extracted; because of the curse of dimensionality, representations with low vector dimension are preferred in real applications. These criteria can be summarized as the following optimization model:
min_{f} dim(V)

s.t.  V = f(S),  g(V) = S
where S is a set of protein sequences, V is the vector representation of S generated by the mapping f, g is the inverse function of f, and dim(V) denotes the dimension of V.
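The reversibility constraint above can be checked mechanically on toy alphabets: an encoding f is invertible on a sequence set exactly when no two distinct sequences map to the same vector. A minimal sketch (the helper names are ours, not the paper's):

```python
from itertools import product

def is_reversible(encode, sequences):
    """An encoding is reversible on `sequences` iff no two distinct
    sequences share the same vector (i.e., g = f^{-1} exists)."""
    seen = {}
    for s in sequences:
        v = tuple(encode(s))
        if v in seen and seen[v] != s:
            return False  # two distinct sequences collide: information lost
        seen[v] = s
    return True

# A lossy encoding over a toy 2-letter alphabet: letter counts only.
def count_encoding(seq, alphabet="AB"):
    return [seq.count(a) for a in alphabet]

all_seqs = ["".join(p) for p in product("AB", repeat=4)]
print(is_reversible(count_encoding, all_seqs))  # False: "AABB" and "ABAB" collide
```

Recording the positions of one letter instead of mere counts makes the toy encoding reversible, which previews the position-based scheme proposed below.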
Based on this evaluation model, we compared the available k-mer-based (denoted by K) [15,16,20] and segmentation-based (denoted by P) [21] vector representations. The k-mer-based representation counts the occurrences of each k-mer in a protein sequence, so the vector dimension is 20^{k}, increasing exponentially with k. When k is large enough (often much larger than three), the protein sequence can be reconstructed uniquely from the corresponding vector by seeking an Eulerian trail in a network constructed from the overlap relationships of the k-mers. Segmentation-based methods divide a protein sequence into p pieces and then count the occurrences of each amino acid in each piece, so the resulting vector dimension is 20*p. When p equals the length of the protein sequence, the sequence can be reconstructed trivially by filling in the amino acids because each segment contains exactly one amino acid. When p is less than the length of the protein sequence, some sequence information is lost and the sequence cannot be reconstructed exactly.
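The K and P encodings can be sketched as follows, assuming even segmentation and the standard 20-letter amino acid alphabet (`kmer_vector` and `segment_vector` are hypothetical names):

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard amino acids

def kmer_vector(seq, k=2):
    """K: count each k-mer; dimension 20**k, growing exponentially with k."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    index = {m: i for i, m in enumerate(kmers)}
    v = [0] * len(kmers)
    for i in range(len(seq) - k + 1):
        v[index[seq[i:i + k]]] += 1
    return v

def segment_vector(seq, p=2):
    """P: split into p pieces, count each amino acid per piece; dimension 20*p."""
    v = []
    bounds = [round(j * len(seq) / p) for j in range(p + 1)]
    for j in range(p):
        piece = seq[bounds[j]:bounds[j + 1]]
        v += [piece.count(a) for a in AMINO_ACIDS]
    return v

print(len(kmer_vector("MKVLA", k=2)))     # 400 = 20**2
print(len(segment_vector("MKVLA", p=2)))  # 40 = 20*2
```

The dimension gap (exponential in k versus linear in p) is exactly the trade-off the evaluation model weighs against reversibility.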
Inspired by the reversibility and low-dimension requirements of the evaluation model and by the fact that protein sequences are "sequences", we propose a new vector representation scheme that records positions (denoted by Q). Q treats the positions of each type of amino acid as a distribution and records the q quantile positions of each type. A toy example is illustrated in Figure 1. The dimension of the resulting Q vectors is 20*q, increasing linearly with q. Because position information is complementary to amino acid or k-mer counts, super representation schemes, for example QP and KQP, can be constructed. For instance, QP divides a protein sequence into p pieces and then, in each piece, counts the occurrences and records the q quantile positions of each type of amino acid, resulting in a 20*(1+q)*p vector. KQP does the same for each k-mer in each piece, resulting in a 20^{k}*(1+q)*p vector. A detailed comparison of these representation methods is summarized in Table 1. In summary, QP vectors are expected to extract more information with low dimension, and the follow-up experimental results confirm the advantage of this method.
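One plausible reading of the Q scheme is sketched below; the paper's exact quantile convention may differ, `quantile_vector` is a hypothetical name, and zero-padding for absent amino acid types is our assumption:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard amino acids

def quantile_vector(seq, q=2):
    """Q: for each amino-acid type, record q quantile positions of its
    occurrences, normalized by sequence length; dimension 20*q."""
    v, n = [], len(seq)
    for a in AMINO_ACIDS:
        pos = [i + 1 for i, c in enumerate(seq) if c == a]
        if not pos:
            v += [0.0] * q  # padding for absent types (our assumption)
        else:
            for j in range(q):
                # nearest-rank quantile at level j/q; q=2 gives first and median
                v.append(pos[int(j * (len(pos) - 1) / q)] / n)
    return v

print(len(quantile_vector("AACA", q=2)))  # 40 = 20*2
```

For the toy sequence "AACA", the amino acid A occurs at positions 1, 2, and 4, so its first and median normalized positions are 0.25 and 0.5, mirroring the q = 2 case of Figure 1.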
Figure 1. A toy example illustrating the encoding schemes for protein sequences. Given a toy sequence over two letters, k-mer-based methods, denoted by K, count the number of each k-mer in the sequence. Here k = 2. The counting process is represented as a matrix in which the rows represent the first letter of the 2-mers and the columns represent the second letter. The dimension of the resulting vector is 2^{2} = 4. If k = 3, the dimension will be 2^{3} = 8. For real protein sequences, the dimension will be 20^{3} = 8,000. Segmentation-based methods, denoted by P, divide the sequence evenly into p pieces and then count the number of each letter in each piece. Here p = 2. The dimension of the resulting vector is 2*2 = 4. If p = 3, the dimension will be 2*3 = 6. For real protein sequences, the dimension will be 20*3 = 60. Quantile-based methods, denoted by Q, record the positions of q quantiles of each letter instead of its count. Here q = 2, and the first and the median positions of each letter are recorded.
Table 1. Features of various representation schemes of protein sequence according to our evaluation model
The converse problem of vector representation of protein pairs
To predict PPIs, we further need to encode each protein pair into a single vector. The reversibility requirement also applies to the vector representation of protein pairs. Here, symmetry is the first condition that must be satisfied. Protein-protein interaction is widely regarded as a symmetric interaction in biology [22], i.e., protein A interacting with protein B means the same as protein B interacting with protein A. For example, protein-protein interaction networks are always treated as undirected graphs [23], because proteins bind together without an explicit direction. In this sense protein-protein interactions are mutual, so the representation of protein pairs should naturally be symmetric; otherwise the prediction for (A, B) may be inconsistent with that for (B, A). Available symmetry solutions for protein pairs work either on the vector level, e.g., abs(ν_{A}−ν_{B}) [19], or on the kernel level, e.g., [15,24], but do not consider reversibility. Here we propose a new solution based on the symmetry of the sum and multiplication operations (denoted by SM). By applying arithmetic and geometric averaging instead, a second, refined scheme is obtained (denoted by AG). For SM, given the vector representations of protein A (ν_{A}) and protein B (ν_{B}), we construct two new vectors, ν_{A}+ν_{B} and ν_{A}*ν_{B}, where * denotes element-wise multiplication, and concatenate them into one vector. For AG, the arithmetic average of ν_{A} and ν_{B} (denoted by ν_{AM}) and the geometric average of ν_{A} and ν_{B} (denoted by ν_{GM}) are calculated. That is, the ith elements of ν_{AM} and ν_{GM} are given by

ν_{AM,i} = (ν_{A,i} + ν_{B,i}) / 2    (4)

ν_{GM,i} = sqrt(ν_{A,i} * ν_{B,i})    (5)
Once ν_{AM} and ν_{GM} are calculated, the symmetric representation of the protein pair (A, B) is the concatenation of ν_{AM} and ν_{GM}. AG has three important properties: 1) the resulting vector is symmetric with respect to the pairs (A, B) and (B, A) because of the commutative laws of addition and multiplication; 2) for each dimension i, ν_{A,i} and ν_{B,i} can be recovered from ν_{AM,i} and ν_{GM,i} by solving Equations (4) and (5); 3) each dimension of the symmetric representation is on the same scale as the original vectors ν_{A} and ν_{B} because of the averaging operations, so no artificial noise is introduced. These three properties facilitate the extraction of the information in the protein vectors and benefit learning the rules underlying PPIs (see Results for more detailed discussion).
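A minimal sketch of the AG construction and its per-dimension reversibility, assuming non-negative vector elements (as with count- or position-based encodings); the function names are ours:

```python
import math

def ag_pair_vector(vA, vB):
    """AG: concatenate element-wise arithmetic and geometric means.
    Symmetric in (A, B); each half stays on the scale of the inputs."""
    am = [(a + b) / 2 for a, b in zip(vA, vB)]
    gm = [math.sqrt(a * b) for a, b in zip(vA, vB)]
    return am + gm

def recover_pair(am_i, gm_i):
    """Invert one dimension: a and b are the roots of
    x^2 - 2*am_i*x + gm_i^2 = 0 (recovered up to ordering)."""
    disc = math.sqrt(max(am_i * am_i - gm_i * gm_i, 0.0))
    return am_i - disc, am_i + disc

vA, vB = [4.0, 1.0], [1.0, 9.0]
print(ag_pair_vector(vA, vB))                             # [2.5, 5.0, 2.0, 3.0]
print(ag_pair_vector(vB, vA) == ag_pair_vector(vA, vB))   # True: symmetric
print(recover_pair(2.5, 2.0))                             # (1.0, 4.0)
```

Note that the inversion recovers the unordered pair {ν_{A,i}, ν_{B,i}}, which is exactly what a symmetric representation can preserve.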
Overview of performances of various methods
We first compared our new proposals with two published methods (a k-mer-based method proposed by Shen et al. [15] and a segmentation-based method proposed by Luo et al. [21]) on the model organisms Escherichia coli and Saccharomyces cerevisiae with two types of negative samples. The Receiver Operating Characteristic (ROC) curves show that our approach outperforms the two published methods (Figure 2), suggesting that it extracts more of the information essential to PPIs. The advantage of our approach comes from both the new vector representation of protein sequences and the novel symmetric representation of protein pairs. Strict evaluations of the two follow.
Figure 2. The ROC curves of four available predicting methods on Escherichia coli and Saccharomyces cerevisiae datasets. A, ROC curves on Escherichia coli data with negative samples constructed by subcellular information; B, ROC curves on Escherichia coli data with negative samples sampled randomly from the complementary network; C, ROC curves on Saccharomyces cerevisiae data with negative samples constructed by subcellular information; D, ROC curves on Saccharomyces cerevisiae data with negative samples sampled randomly from the complementary network.
Figure 3. Flow chart for applying SVMs to predict PPIs from sequences. Four issues must be addressed. First, protein sequences must be represented by vectors. Second, the vector representation of protein pairs must be symmetric. Third, a set of non-interacting protein pairs (negative PPIs) must be provided because SVMs are supervised learning algorithms. Fourth, a proper kernel will facilitate nonlinear prediction. This paper focuses on the first and second issues; community-standard procedures are adopted for the third and fourth.
Comparison of symmetric representation methods of protein pairs
As mentioned, the representation of protein pairs should be symmetric; otherwise the prediction for (A, B) may be inconsistent with that for (B, A). Here we compared four symmetric representation schemes. One scheme is abs(ν_{A}−ν_{B}), denoted by dist; it operates on the vector level and was used in [19]. Another was proposed by Shen et al. and operates on the kernel level [15]. The conjoint triad method proposed by Shen et al. is used as the vector representation of protein sequences in all four schemes to guarantee a fair comparison. The conjoint triad method is a variant of the k-mer method that classifies the twenty amino acids into seven families [15]. The four solutions are denoted AGCTF (A: arithmetic, G: geometric, CTF: conjoint triad features), distCTF (dist: distance), skerCTF (sker: S-kernel, the kernel proposed by Shen et al.), and SMCTF (S: sum, M: multiplication). The comparison is conducted on the Escherichia coli and Saccharomyces cerevisiae data sets with two types of negative samples. "Benchmark negatives" means that the negative samples are constructed from subcellular localization information. "Random negatives" means that the negative samples are sampled randomly from the complementary graph.
The comparison results are given in Table 2. The AUC (area under the ROC curve) of distCTF is the lowest, because much of the information in the original vectors is discarded when its symmetric representation is constructed. The other three solutions are comparable, with small differences in AUC. On the Escherichia coli data set with benchmark negative samples, skerCTF achieves the highest AUC (0.998), while AGCTF reaches 0.996 and SMCTF 0.988. In the other three comparisons, AGCTF always reaches the highest AUC, and AGCTF is better than SMCTF because it solves the scale problem. Regarding the other indices, e.g. accuracy, sensitivity, specificity, and precision, AGCTF also outperforms the other solutions. AGCTF considers the converse problem adequately and solves the scale problem, so its good performance is expected. Because it works on the vector level, its physical meaning is easy to track and its computation is efficient. The extremely high AUC values on the benchmark negative data sets are due to the bias introduced during the construction of the negative samples, as pointed out previously [25].
Table 2. The performance of four symmetric representing schemes for protein pairs
Comparison of vector representations of protein sequences
The above comparison reveals that the symmetry solution based on arithmetic and geometric averages performs best. In this subsection, we fix this strategy while comparing vector representation schemes of protein sequences; in this way we eliminate differences introduced by the symmetric representations and make the results rigorous. In total, four vector representation schemes of protein sequences are compared: 1) the conjoint triad features proposed by Shen et al. [15], denoted by AGCTF; 2) the segmentation-based method with p = 5 [21], denoted by AGP100; 3) the position-based method with q = 17, denoted by AGQ340; and 4) the combination of segmentation and position with p = 3, q = 5, denoted by AGQP360. q is set to seventeen for AGQ340 so that the resulting vectors have almost the same dimension as AGCTF and AGQP360; p = 3 and q = 5 are chosen for AGQP360 for the same reason. We choose p = 5 for AGP100 because it is representative of this class of methods and reaches the best AUC in cross-validation.
The comparison is given in Table 3. On the benchmark negative data sets, the four representations achieve similar AUC values on both the Escherichia coli and Saccharomyces cerevisiae data sets. On the Escherichia coli benchmark negative data set, AGCTF reaches the highest AUC, 0.996; AGQP360 and AGP100 reach 0.994, slightly smaller, and AGQ340 has the lowest AUC, 0.989. On the yeast benchmark negative data set, AGQP360 has the highest AUC, 0.993, while AGCTF, AGP100, and AGQ340 reach 0.991, 0.991, and 0.989, respectively. Regarding the other indices, including accuracy, sensitivity, specificity, and precision, AGQP360 outperforms the other methods.
Table 3. The performance of four vector representing schemes for protein sequences
Because of the bias in the benchmark negative data sets, every method can achieve very high AUC values there, which limits the discriminating power of those data sets. The negative samples drawn randomly from the complementary graphs are assumed to be unbiased, so they provide more discriminating power [25]. On the Escherichia coli random negative data set, AGQP360 achieves the highest AUC, 0.899, one percentage point higher than AGP100. AGCTF has the third highest AUC (0.886) and AGQ340 the lowest (0.854). AGQP360 also has the highest accuracy, sensitivity, specificity, and precision. On the Saccharomyces cerevisiae random negative data set, AGQP360 again outperforms the other methods.
We also compared AGQP360 and AGCTF on a third type of negative samples to highlight the benefits of linearly scalable vector representations, including the segmentation-based and position-based methods and their combination (Table 4). Given a true protein sequence, uShuffle can generate artificial protein sequences with the same k-mer composition as the true sequence [26]. Such artificial proteins have been used as negative samples in previous PPI prediction studies [27]. Here we construct three negative data sets of this type by preserving the composition of 1-mers, 2-mers, and 3-mers, respectively. AGQP360 performs well on all three data sets, but AGCTF performs well only on the 1-mer and 2-mer data sets. On the 3-mer negative data set, AGCTF loses its discriminative capacity because the conjoint triad features are by nature based on 3-mers. To regain discriminative power, k must increase to 4 or more, but the vector dimension then increases exponentially, greatly aggravating the computational burden and the curse of dimensionality. In contrast, the linearly scalable vector representations handle this issue easily.
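For intuition, a 1-mer-preserving shuffle is just a random permutation of the sequence; preserving k-mer composition for k ≥ 2 requires Eulerian-path sampling, which is what uShuffle implements. The sketch below is not uShuffle itself, only the trivial 1-mer case:

```python
import random
from collections import Counter

def shuffle_preserving_1mers(seq, seed=0):
    """Random permutation preserves 1-mer (amino-acid) composition exactly.
    Preserving k-mers for k >= 2 needs Eulerian-path sampling (cf. uShuffle)."""
    rng = random.Random(seed)
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)

s = "MKVLAAGK"
t = shuffle_preserving_1mers(s)
print(Counter(t) == Counter(s))  # True: same amino-acid composition
```

Because counts are preserved but positions are scrambled, a method that encodes only composition (e.g. pure counts with small k) cannot separate such negatives from true sequences, which is what Table 4 probes.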
Table 4. AUC values of AGQP360 and AGCTF on the artificial negative data sets
Comparisons on human PPI data were also conducted under the same strict protocol (see SI Tables 1, 2, and 3). The results on random negative samples and the three types of shuffled negative samples all support the superiority of the new vector representations for both protein sequences and protein pairs.
Discussion and conclusion
Predicting PPIs from sequence information alone is an important and challenging problem in the post-genomic era. Most current computational methods try to encode protein sequences of various lengths into vectors of the same dimension, so the first inevitable question for successful prediction is how to encode protein sequences effectively and efficiently in vector spaces. Previous studies propose various encoding methods but seldom consider the converse problem. In this study, we propose an evaluation model, analyze the available k-mer-based and segmentation-based methods by investigating the converse problem, and show that when k or p is large enough, a protein sequence corresponds to a unique vector. However, the dimension of the resulting vectors increases exponentially for k-mer-based methods and linearly for segmentation-based methods. Moreover, k-mer-based methods emphasize extracting local information, while segmentation-based methods emphasize global information.
Viewing protein sequences as distributions of amino acids, we propose a new vector representation scheme for protein sequences, whose dimension increases only linearly, by recording the positions of q quantiles of each type of amino acid. It can serve as an independent encoding method and can also be combined with segmentation-based methods to form super methods whose dimension still increases linearly with the scaling parameters p and q. Experiments on the Escherichia coli and Saccharomyces cerevisiae data sets with various types of negative samples demonstrate the superior performance of the proposed super methods. Comparisons on the artificial negative samples further highlight the advantage of linearly scalable methods.
Applying the reversibility requirement to the symmetric vector representation of protein pairs yields a simple and reversible solution that is comparable to, or even outperforms, the available complicated kernels. Because it works on the vector level, it is decoupled from the kernels and facilitates designing specific kernels to capture the nature of PPIs in the future.
Considering the converse problem adequately and seeking optimal representations has both theoretical and computational significance. It can point out the advantages and drawbacks of available methods and provide insights into how to improve them. Furthermore, we only investigate dictionary-based encoding methods in this study. Methods based on physicochemical properties are not investigated, but they are ready to be incorporated into our framework as information additional to the sequence. We believe such information holds the potential to unravel the physical and chemical principles underlying the interactions.
Obviously, many other questions remain unsolved in computational PPI prediction. For example, proteins interact with each other through certain domains or building blocks rather than their global sequences; which parts are essential to protein interactions, and how to identify them computationally, need deeper investigation. A second limitation of sequence-based prediction is predicting remote PPIs across organisms: the accuracy of remote PPI prediction is currently much lower than that of intra-organism prediction. Current domain databases may provide a few clues, but their bias and incompleteness, especially information loss, must also be considered adequately. Another open question is that gold-standard negative samples of PPIs are missing. Various methods have been proposed to construct negative samples that highlight the patterns embedded in the positive data sets, but artificial biases are thereby introduced. How to construct unbiased negative samples remains a significant and controversial issue.
Methods
The benchmark data and predicting methods
Numerically, we evaluate the vector encoding methods and our improvements with support vector machines (SVMs) on the Escherichia coli and Saccharomyces cerevisiae PPI data sets. SVMs are a state-of-the-art class of supervised machine learning methods and have been used extensively in various disciplines, including bioinformatics. Here we use SVMs to evaluate the various representation schemes; details of SVMs can be found in ref. [28]. Other learning methods could also serve for evaluation, but the selection of learning methods is not the focus of this paper. Four general issues must be addressed when applying SVMs to PPI prediction (Figure 3). First, protein sequences must be represented by vectors. Second, the vector representation of protein pairs must be symmetric. Third, gold-standard negative data (a set of non-interacting protein pairs) must be provided because SVMs are supervised learning algorithms. Fourth, a proper kernel will greatly facilitate prediction. Since this paper focuses only on the first and second issues, community-standard solutions are adopted for the third and fourth. Specifically, we use three types of negative samples that have been widely used in previous PPI prediction studies. The first type is constructed manually based on the subcellular localization of proteins, assuming that proteins with different subcellular localizations are not prone to interact. The second type is sampled randomly from the complementary graph of the PPI network, assuming the PPI network is sparse. The third type is constructed by randomly shuffling the amino acid sequences of interacting protein pairs while conserving the composition of amino acids or k-mers via uShuffle [26]. Yu et al. proposed a fourth method for constructing negative PPI samples by imposing the degree distribution of the positive PPI set on the negative PPIs [29].
They raise an excellent question: what roles do the special network structures of PPI networks play in PPI prediction? However, we argue that requiring the same degree distribution for the positive and negative PPI sets is not reasonable (the complementary graph of a PPI network cannot have the same degree distribution as the PPI network itself), so this type of negative PPIs is not suitable for evaluating sequence-based PPI prediction. Although PPI networks are assumed to be sparse, we randomly select the same number of negative samples as positive samples for the evaluation; if more negative samples were included, unknown true PPIs might also be included as negatives. The positive data and the first type of negative PPI data are from [19] and were manually curated for quality. We use a soft-margin SVM to absorb the remaining errors in the data. All evaluations are conducted by five-fold cross-validation. Gaussian kernels are adopted for the fourth issue, and the parameters are tuned by a grid search.
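The second type of negative sampling (random non-interacting pairs from the complementary graph) can be sketched as follows; the function name and the toy network are illustrative only:

```python
import random

def sample_negatives(positives, proteins, n, seed=0):
    """Sample n non-interacting pairs uniformly from the complementary graph,
    relying on the sparseness assumption: a random pair is rarely a true PPI."""
    rng = random.Random(seed)
    pos = {frozenset(p) for p in positives}  # frozenset: (A, B) == (B, A)
    negatives = set()
    while len(negatives) < n:
        a, b = rng.sample(proteins, 2)
        pair = frozenset((a, b))
        if pair not in pos:
            negatives.add(pair)
    return [tuple(sorted(p)) for p in negatives]

proteins = [f"P{i}" for i in range(20)]
positives = [("P0", "P1"), ("P2", "P3")]
negs = sample_negatives(positives, proteins, 5)
print(len(negs))  # 5
```

Using unordered pairs throughout keeps the sampling consistent with the symmetry requirement discussed above.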
The protein sequences are from the RefSeq database of NCBI. PPIs involving proteins whose sequences are not available were filtered out. Finally, 6,962 positive interactions are included in the cross-validation experiments for Escherichia coli and 6,635 for Saccharomyces cerevisiae. The number of negative samples equals the number of positive samples for balance. Human PPIs were downloaded from the Human Protein Reference Database (HPRD) on Dec. 21, 2009 [30].
Protein sequences are converted into vectors by four schemes (CTF, P100, Q340, and QP360). CTF classifies the twenty amino acids into seven classes and then applies the k-mer-based method with k = 3; details can be found in [15]. P100 divides a protein sequence into five pieces and then counts the number of each type of amino acid in each piece. Q340 records seventeen quantile positions for each type of amino acid. QP360 first divides a protein sequence into three pieces, then counts the number of each type of amino acid and records five quantile positions for each type in each piece. Each vector is normalized by sequence length; that is, the elements of the resulting vector are divided by the length of the protein sequence. The symmetric representations of protein pairs include four methods (dist, Sker, SM, and AG). Given ν_{A} and ν_{B}, dist generates the symmetric vector as abs(ν_{A}−ν_{B}). Sker calculates the kernel matrix according to the S-kernel defined in [15]. SM creates the symmetric vector by concatenating ν_{A}+ν_{B} and ν_{A}*ν_{B}, where * denotes element-wise multiplication. AG obtains the symmetric representation according to Equations (4) and (5). libsvm 2.88 [31] is used to implement the support vector machines on a PC with an Intel Core 2 Duo CPU at 2.83 GHz. The Gaussian kernel is applied. The parameters are tuned by grid search; the optimal ones are (C = 10, γ = 0.025) for the CTF methods and (C = 10, γ = 0.0125) for the other methods. All evaluations are conducted by five-fold cross-validation.
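The evaluation pipeline (vectorize, symmetrize, five-fold cross-validation with a Gaussian-kernel SVM) can be sketched with scikit-learn as a stand-in for libsvm 2.88; the data below are synthetic placeholders, not the paper's PPI vectors:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-ins for symmetric pair vectors (the real study feeds
# AG/SM/dist representations of E. coli and yeast protein pairs here).
X_pos = rng.normal(1.0, 1.0, size=(200, 40))
X_neg = rng.normal(-1.0, 1.0, size=(200, 40))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 200 + [0] * 200)

# Gaussian (RBF) kernel with parameters of the kind found by grid search
# in the paper (C = 10, gamma = 0.0125 for the non-CTF methods).
clf = SVC(kernel="rbf", C=10, gamma=0.0125)
scores = cross_val_score(clf, X, y, cv=5)  # five-fold cross-validation
print(scores.shape)  # (5,)
```

Balanced positive and negative sets, as in the paper, keep accuracy a meaningful summary alongside AUC.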
Authors' contributions
XR proposed the idea for this work. XR and YCW designed the predictive methods and the experiments, prepared the experiments and wrote the paper. YW analyzed the results and revised the paper. XSZ and NYD participated in developing the methods and revised the article. All authors read and approved the final manuscript.
Acknowledgements
The authors are grateful to all members of ZHANGroup in AMSS, CAS for their valuable discussion and comments. Funding: This work is partly supported by the Natural Science Foundation of China projects 60873205, 10801131, 10631070, 10971223, 11071252 and Chinese Academy of Sciences project kjcxyws7.
References

Fields S, Song O: A novel genetic system to detect proteinprotein interactions.
Nature 1989, 340(6230):245246. PubMed Abstract  Publisher Full Text

Engen JR: Analysis of protein complexes with hydrogen exchange and mass spectrometry.
Analyst 2003, 128(6):623628. PubMed Abstract  Publisher Full Text

Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, et al.: Functional organization of the yeast proteome by systematic analysis of protein complexes.
Nature 2002, 415(6868):141147. PubMed Abstract  Publisher Full Text

Lakey JH, Raggett EM: Measuring proteinprotein interactions.
Current Opinion in Structural Biology 1998, 8(1):119123. PubMed Abstract  Publisher Full Text

Zhu H, Bilgin M, Bangham R, Hall D, Casamayor A, Bertone P, Lan N, Jansen R, Bidlingmaier S, Houfek T, et al.: Global Analysis of Protein Activities Using Proteome Chips.
Science 2001, 293(5537):21012105. PubMed Abstract  Publisher Full Text

Butland G, PeregrÃnAlvarez JM, Li J, Yang W, Yang X, Canadien V, Starostine A, Richards D, Beattie B, Krogan N, et al.: Interaction network containing conserved and essential protein complexes in Escherichia coli.
Nature 2005, 433(7025):531537. PubMed Abstract  Publisher Full Text

Rain JC, Selig L, Reuse HD, Battaglia V, Reverdy C, Simon S, Lenzen G, Petel F, Wojcik J, Schachter V, et al.: The proteinprotein interaction map of Helicobacter pylori.
Nature 2001, 409(6817):211215. PubMed Abstract  Publisher Full Text

Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, et al.: A comprehensive analysis of proteinprotein interactions in Saccharomyces cerevisiae.
Nature 2000, 403(6770):623627. PubMed Abstract  Publisher Full Text

Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive twohybrid analysis to explore the yeast protein interactome.
Proceedings of the National Academy of Sciences of the United States of America 2001, 98(8):45694574. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li S: A Map of the Interactome Network of the Metazoan C. elegans.
Science 2004, 303(5657):540-543.

Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, et al.: A Protein Interaction Map of Drosophila melanogaster.
Science 2003, 302(5651):1727-1736.

Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Kooeppen S, et al.: A human protein-protein interaction network: a resource for annotating the proteome.
Cell 2005, 122(6):957-968.

Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, et al.: Towards a proteome-scale map of the human protein-protein interaction network.
Nature 2005, 437(7062):1173-1178.

von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions.
Nature 2002, 417(6887):399-403.

Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, Li Y, Jiang H: Predicting protein-protein interactions based only on sequences information.
Proceedings of the National Academy of Sciences 2007, 104(11):4337-4341.

Ben-Hur A: Kernel methods for predicting protein-protein interactions.
Bioinformatics 2005, 21(Suppl 1):i38-i46.

Gomez SM, Noble WS, Rzhetsky A: Learning to predict protein-protein interactions from protein sequences.
Bioinformatics 2003, 19(15):1875-1881.

Bock JR, Gough DA: Predicting protein-protein interactions from primary structure.
Bioinformatics 2001, 17(5):455-460.

Najafabadi H, Salavati R: Sequence-based prediction of protein-protein interactions by means of codon usage.
Genome Biology 2008, 9(5):R87.

Leslie C, Eskin E, Noble WS: The spectrum kernel: a string kernel for SVM protein classification.
Pacific Symposium on Biocomputing 2002, 564-575.

Luo L, Zhang Sw, Chen W, Pan Q: Predicting protein-protein interaction based on the sequence-segmented amino acid composition.

Nelson D, Cox M: Lehninger Principles of Biochemistry.
4th edition. W. H. Freeman; 2004.

Barabasi AL, Oltvai Z: Network biology: understanding the cell's functional organization.
Nature Reviews Genetics 2004, 5(2):101-113.

Martin S, Roe D, Faulon JL: Predicting protein-protein interactions using signature products.
Bioinformatics 2005, 21(2):218-226.

Ben-Hur A, Noble WS: Choosing negative examples for the prediction of protein-protein interactions.
BMC Bioinformatics 2006, 7(Suppl 1):S2.

Jiang M, Anderson J, Gillespie J, Mayne M: uShuffle: A useful tool for shuffling biological sequences while preserving the k-let counts.
BMC Bioinformatics 2008, 9(1):192.

Guo Y, Yu L, Wen Z, Li M: Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences.
Nucleic Acids Research 2008, 36(9):3025-3030.

Boser BE, Guyon IM, Vapnik VN: A Training Algorithm for Optimal Margin Classifiers.
In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT '92). ACM; 1992:144-152.
Yu J, Guo M, Needham CJ, Huang Y, Cai L, Westhead DR: Simple sequence-based kernels do not predict protein-protein interactions.
Bioinformatics 2010, 26(20):2610-2614.

Prasad TSK, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, et al.: Human Protein Reference Database - 2009 update.
Nucleic Acids Research 2009, 37(Database issue):D767-D772.
Chang CC, Lin CJ: LIBSVM: A library for support vector machines.
ACM Transactions on Intelligent Systems and Technology 2011, 2(3):27.