Post-translational modifications (PTMs) have a key role in regulating cell functions. Consequently, identification of PTM sites has a significant impact on understanding protein function and revealing cellular signal transductions. Especially, phosphorylation is a ubiquitous process with a large portion of proteins undergoing this modification. Experimental methods to identify phosphorylation sites are labor-intensive and of high-cost. With the exponentially growing protein sequence data, development of computational approaches to predict phosphorylation sites is highly desirable.
Here, we present a simple and effective method to recognize phosphorylation sites by combining sequence patterns and evolutionary information and by applying a novel noise-reducing algorithm. We suggested that considering long-range region surrounding a phosphorylation site is important for recognizing phosphorylation peptides. Also, from compared results to AutoMotif in 36 different kinase families, new method outperforms AutoMotif. The mean accuracy, precision, and recall of our method are 0.93, 0.67, and 0.40, respectively, whereas those of AutoMotif with a polynomial kernel are 0.91, 0.47, and 0.17, respectively. Also our method shows better or comparable performance in four main kinase groups, CDK, CK2, PKA, and PKC compared to six existing predictors.
Our method is remarkable in that it is powerful and intuitive approach without need of a sophisticated training algorithm. Moreover, our method is generally applicable to other types of PTMs.
Post-translational modifications (PTMs) have important implication on the protein functions involved in signal transductions and many human diseases. Especially, phosphorylation is one of the most ubiquitous of these processes with a reported 30 ~50% of eukaryotic proteins undergoing this modification. For this reason, identifying phosphorylation sites is important for understanding functional role of proteins and cell signalling networks. In order to determine phosphorylation sites several experimental tools such as mass spectrometry have been used. Experimental efforts using those techniques have made it possible to construct several databases for phosphorylation sites, such as Phospho.ELM [1,2], PhosphoSite , and PhosPhAt . However, those techniques are time-consuming and high cost approaches. Due to such practical limitation, an efficient computational algorithm to recognize phosphorylation sites is highly desirable.
Previously, several methods to predict phosphorylation sites have been developed by probing evolutionary information, using physicochemical properties, or searching motif patterns. The most successful algorithms are machine learning-based approaches. Using the artificial neural network (ANN) models, NetPhosYesat  predicts phosphorylation sites in yeast, and NePhosK  provides a sequence-based phosphorylation site prediction service. Examples of support vector machine (SVM)-based approaches are PredPhospho , AutoMotif [8,9], and kinasePhos2.0  which trains SVM by using amino acid coupling patterns and solvent accessibility. Recently, probabilistic frameworks and new kernel methods were suggested. PPSP  used Bayesian decision theory to predict PK-specific phosphorylation sites, and SiteSeek  was implemented with a high search sensitivity by introducing a new adaptive locally-effective kernel method with hydrophobic information. In addition, conditional random field model was applied to predict kinase-specific phosphorylation .
Despite high performance of those machine learning or statistical approaches, development of simple, intuitive, and generally applicable algorithms has been pursued. A group-based approach, GPS, simply and intuitively recognizes phosphorylation sites by calculating peptide similarities with BLOSUM62 matrix and deciding which group is closest to the given peptide after clustering known peptide groups . Our study aimed to develop a new algorithm by inventing a new scoring method, as well as by introducing an effective noise-reducing system, which can be applied to different types of modifications. We developed a new scoring scheme to measure the sequence similarity by combining pairwise sequence similarity scores and profile-profile alignment scores. Basic assumption was that physicochemical information, motif information, and evolutionary information could be retrieved by measuring sequence similarities. We also generalized the motif scoring method, which has been conventionally used for predicting phosphorylation sites, by performing profile-profile alignments with gaps. It turned out that such generalization significantly improved the prediction accuracy. Considering both features together, we developed a new peptide sequence similarity scoring method. We then applied a noise-reducing system exploiting indirect relationships among peptide sequences. When we tested our new method on 48 different kinase groups, the results indicated that the two innovative features of our present work, i.e., a new sequence similarity scoring method and the noise-reducing system, both contributed to the outstanding performance of the new method in recognizing phosphorylation sites correctly, showing better performance than AutoMotif which is one of the best-performing methods. Also, by testing unbiased data set we can achieve better or comparable performance compared to six existing predictors.
We developed our new method using Phospho.ELM (released in December 2008) database . The database contains experimentally validated phosphorylation sites for 254 different kinases. From the database we selected kinase groups which contained more than 20 known phosphorylation sites, resulting in 48 different kinase groups in our test set. To develop and evaluate the new method, positive (phosphorylation) and negative (non-phosphorylation) peptides were needed to make the 'reference set'. For a specific phosphorylation type, positive peptides were all peptides in Phospho.ELM database that had the same type of phosphorylation. Negative peptides were randomly selected from sequences which shared the same phosphorylation residue types with positive peptides. We selected negative peptides 10 times more than the number of positive peptides. The whole dataset can be downloaded from our web server.
Peptide sequence similarity scoring scheme
Our scoring system was designed to give a high score when two peptides have high similarity, indicating that if a query peptide gets high scores with known phosphorylation peptides, the query peptide is predicted to be a peptide with the same type of phosphorylation. To calculate peptide similarities we combined two different sequence similarity measures, one using BLOSUM62 matrix  and the other using profile-profile alignment which contains evolutionary information. Both measures are widely used to calculate sequence similarities. We assumed that comparing sequence similarity with BLOSUM62 matrix could provide similarity measure for physicochemical properties of the two sequences and motif patterns indirectly. Similarity score using BLOSUM62 matrix, SBLOSUM62, between peptides A and B with fixed window size 7 was defined as
The score(Ai, Bi) is the substitution scores between two amino acids Ai and Bi in BLOSUM62 matrix. The window size 7 was determined by referencing a previously developed method, GPS . If the candidate phosphorylation sites were near the N or C terminus, we represented the absent terminal sequences as X.
The second component of our new scoring scheme is the profile-profile alignment scores. The conventional way to measure sequence similarity for the purpose of predicting PTM sites is to use motif scoring methods where gapless alignments were typically assumed. We generalized the motif scoring method by performing profile-profile alignments allowing gaps. To calculate similarity scores based on profile-profile alignment we first generated the position specific scoring matrix (PSSM) and the position specific frequency matrix (PSFM) for a protein sequence which contained a given peptide by using PSI-BLAST . We used blastpgp version 2.2.15 with default parameters except the options for the number of iterations (j = 5) and the cutoff E-value value (h = 0.001). Then we extracted PSSM and PSFM corresponding to the given peptide. Using both matrices we computed profile-profile alignment scores for the position i of a peptide A and the position j of a peptide B (PPAij) as follows,
where, fik and fjk are the frequencies of PSFM matrix at the position of i and j, and Sik and Sjk represent the scores of PSSM matrix at the position of i and j. The detail procedure was reported in our previous work . We aligned both peptides A and B and calculated similarity scores by using dynamic programming with gap penalty of 3.0 and gap extension penalty of 0.75. We referred to this profile-profile alignment scores as Sprofile. During profile-profile alignment we selected the window size as 41 after trying several different window sizes such as 7, 19, 41, and 101. In previous works, wide ranges of values from 7 to 19 have been used as a window size for calculating the sequence similarity. We increased the window size from 7 up to 101 to evaluate effect of considering long-range region surrounding phosphorylation sites. The performances for PKB-group with different window sizes were measured, and then 41 was selected as the optimal window size for calculating the profile-profile alignments. Once having calculated both types of similarity scores, SBLOSUM62 and Sprofile, we multiplied both scores to calculate the final similarity score (Scombined) of the two peptides as follows,
The positive effect of combined-measure is described in Result and discussion section. We also tried a linear combination of SBLOSUM62 and Sprofile as the final similarity score and found that the multiplicative form of the two scores showed better performance.
Noise reduction scheme utilizing indirect relationships
By using similarity scores we can rank all reference peptides for a given query peptide. If a scoring system is perfect, it would give higher scores to all true phosphorylation peptides than to any of non-phosphorylation peptides. However, our scoring system is obviously imperfect, partly because our current scoring system only considers sequence features. We may be able to improve the accuracy by adding new features, but in this work we focused on designing noise-reducing system by considering indirect relationships among reference peptides. Basic idea is that if highly ranked reference peptides tend to be known phosphorylation peptides, the query peptide is likely to be a phosphorylation site, otherwise a non-phosphorylation site.
Figure 1 illustrates the noise-reducing scoring system of this work. In step 1, for a given query (in this example, a phosphorylation peptide), we calculate Scombined scores for all peptides in the reference set, and select the top α highly ranked peptides (in this example, α = 5, consisting of 2 phosphorylation peptides (Pj) and 3 non-phosphorylation peptides (Nj)). Next in step 2, indirect relationship matrix is constructed by calculating Scombined scores between those top α hits and all peptides in the reference set (in this example, there are 5 phosphorylation and 5 non-phosphorylation peptides). When constructing the matrix, if a peptide i is not included in the top 5 hits for a peptide j, the score (j, i) of the matrix is set to zero. In step 3, indirect scores are calculated by using indirect relationship matrix. We assume that scores between positive (or negative) peptides are signal, while those between positive (or negative) and negative (or positive) are noise. According to our hypothesis, indirect scores can be calculated as follows,
Figure 1. Illustration of the noise-reducing system. Illustration of the noise-reducing system. In step 1, we find the top 5 hits for a given query, where Pj is a phosphorylation peptide and Nj is a non-phosphorylation peptide. Next, Scombined scores are calculated between the top 5 hits and all peptides in a reference set (10 peptides), where if a peptide i is not included in top 5 hits for a peptide j the score (j, i) is set to zero. In step 3, by summing each row of indirect relationship matrix we calculate indirect scores. During summation we assume that scores between positive (or negative) peptides are signal, while those between positive (or negative) and negative (or positive) are noise. Finally, we check the number of phosphorylation peptides among the top 4 hits by indirect scores. In this example P2, P3, P4, and N2 are recognized as the top 4 hits, and among them 3 peptides are phosphorylation peptides, and thereby we predict that the query peptide is a phosphorylation peptide.
where we give the weight of 10 to positive peptides since the number of negative peptides is 10 times that of positive peptides. For example, suppose that P2 recognized P1 and P4 as signal, and N3 as noise. Then, the final indirect score of P2 (5.70) is calculated by adding P1 and P4 with weight 10 (10*(0.61+0.01)) and subtracting N3 (0.50). Finally, from the indirect scores we consider the top β hits (in this example, β = 4) as query related peptides. Then, if the number of positive peptides are greater than γ (in this example, γ = 2), we predict the query peptide as a phosphorylation peptide. In this example P2, P3, P4, and N2 are the top 4 hits, and thereby we predict that the query peptide is a phosphorylation peptide.
There existed several parameter determination issues in constructing the noise-reducing system such as the number of highly ranked peptides α, the value of β and γ. To determine those parameters we searched the optimized parameter set using PKB-group. It is obvious that different kinase groups have the different optimized parameters but we applied the same parameters to all cases to avoid over-fitting. As a result, for the number of highly ranked peptides we selected the value of half number of positive peptides in the reference set. For the value of β and γ we used value of one-third and one-fourth of positive peptides, respectively.
We assessed the prediction performance with leave-one-out cross validation (LOOCV). We used all dataset except one as the reference set and tested our scoring system with left-out one peptide. The accuracy (ACC), precision (P), and recall (R) were calculated to measure the performance. The equations are as follows.
We define phosphorylation peptides as a positive class, and TP, TN, FP, and FN
denote true positives, true negatives, false positives, and false negatives, respectively. We benchmarked another prediction server, AutoMotif [8,9]. We directly compared the performance of our method to that of AutoMotif since our dataset and evaluation scheme were same as AutoMotif. The performance data of AutoMotif were extracted from website . Furthermore, to test unbiased data set we used new data set constructed by Wan et al. . We calculated ROC score defined as the areas under the ROC curves, the plot of true positives as a function of the number of false positives , and compared our result to PPSP, PredPhospho, GPS, KinasePhos 1.0, NetPhosK, and Scansite.
Results and discussion
Performance variation of various features for PKB-group
We evaluated the ability of different level of similarity measures to discriminate phosphorylation and non-phosphorylation peptides. As we described in Method section, we tested four different types of scores, SBLOSUM62, Sprofile, Scombined, and our noise-reducing system. We selected PKB-group (protein kinase B) as a toy example to select parameters and to test performance variation of various features. The serine/threonine kinase PKB has been shown to play a crucial role in the control of diverse and important cellular functions such as cell survival and glycogen metabolism . The result of performance comparison for PKB-group is shown in Table 1. In the aspect of precision and recall, Scombined was significantly better than SBLOSUM62 and Sprofile. The precision and recall were increased more than 22% and 54%, respectively from those of SBLOSUM62 by combining Sprofile. This indicates that combining both similarity measures lead to significant positive effect. The positive effect is likely to be originated from removing ambiguous cases such as high score in SBLOSUM62 but low score in Sprofile, and vice versa. Furthermore, the noise-reducing system highly increased recall (24% increased) compared to Scombined at the similar level of precisions (0.87 at Noise-reducing, 0.84 at Scombined), indicating the noise-reducing system filtered out many false positives and rescued many true phosphorylation peptides which were falsely recognized as non-phosphorylation peptides by Scombined.
Table 1. Performances of PKB-group kinases using different features
We also evaluated overall performance across 48 kinase groups. The result is summarized in Table 2, showing that in terms of overall performance our noise-reducing system is most effective for identifying phosphorylation sites. Especially, recall was increased about 15% from Scombined at the 0.68 precision. Also significant performance enhancement was occurred in Scombined compared to both SBLOSUM62 and Sprofile. The results remark that combining Sprofile with SBLOSUM62 is generally effective to increase discriminate ability for phosphorylation sites. Moreover, we expect that we can apply the same noise-reducing system to other types of post-translation modifications such as ubiquitination.
Table 2. Performances of 48 kinase groups using different features
To better understand how the noise-reducing system determines phosphorylation sites with higher accuracy, we consider one specific example, the phosphorylation peptide, [RGGSASRS]. The first and fourth hits (both are non-phosphorylation sites, false positives) of the given peptide are [DGTSLKV] and [VQDTYQI], respectively, with Scombined. However, the first hit does not have any indirect relationships with other highly ranked peptides, thereby in the noise-reducing system its rank drops to the 38th. Also the fourth hit shows indirect relationships with highly ranked positive peptides (this is a false relationship), therefore the rank of third hit also drops to 40th. On the other hand, the ranks of several positive peptides are increased since several true indirect relationships exist among highly ranked positive peptides. This example shows exactly how the noise-reducing system works to detect phosphorylation sites.
Importance of considering long-range region surrounding a phosphorylation site
Several mechanisms have been proposed to understand kinase specific binding properties. Protein kinase forms a protein complex with its substrate through recognizing phosphorylation binding domains or short sequence patterns of substrates. To recognize kinase-specific motifs, not only short sequence patterns but also local structure around a phosphorylation site may be important. We assume that this information can be measured by calculating similarities of long-range region surrounding phosphorylation sites with profile-profile alignment (PPA).
We evaluated our assumption by comparing performance with various features. We drew number of true matches (phosphorylation peptides) according to number of false matches (non-phosphorylation peptides) up to 1000 false matches among 48 kinase families. The results are shown in figure 2, and the performance varies according to window size of peptides. In figure, values in brackets represent number of windowed residues. Here, SBLOSUM62 represents a short sequence pattern comparison scheme with BLOSUM62 matrix for 7 windowed residues. When we extended window size as 41 the performance was degraded, therefore we fixed window size as 7 in SBLOSUM62. To evaluate effect of long-range similarities first, we searched proper window size of long-range region by using PPA based score (Sprofile) alone. Among three different window sizes, Sprofile with 19 windowed residues provided more precise similarities compared to 7 or 41 windowed residues. Sprofile (19 windowed residues), Sprofile (7 windowed residues), and Sprofile (41 windowed residues) recognized 610, 567, and 539 true matches up to 500 false matches. Next, we combined both SBLOSUM62 and Sprofile and measured performance variations. In this case, the best performance was occurred at the combination of SBLOSUM62 and Sprofile with 41 windowed residues. From the figure we note that if we combined SBLOSUM62 and Sprofile, the performance with Sprofile (41 windowed residues) was increased more than other two cases even though it showed worst performance when we considered Sprofile score alone. Scombined with Sprofile (41 windowed residues) detected 1613 true matches up to 1000 false matches, while considering 19 and 7 windowed residues recognized 1583 and 1485 true matches, respectively. The results indicate that considering both short and long-range properties important to increase search sensitivity. When we search phosphorylation peptides the most important property may be physicochemical properties of adjacent residues to a phosphorylation site. However, together with this information, considering long-range region similarities can provide evolutionary or structural similarities surrounding kinase binding sites. Thereby both properties effectively contribute to measures peptide similarities.
Figure 2. ROC curves with various features. The figure shows number of true matches (phosphorylation peptides) according to number of false matches (non-phosphorylation peptides) up to 1000 false matches among 48 kinase families. In figure, values in brackets represent number of windowed residues in peptides. From the figure we note that Scombined with Sprofile (41 windowed residues) shows best performance. The fact remarks that considering long-range region is effective to identify phosphorylation peptides.
Performance comparison with AutoMotif
Our new method was applied to 48 kinase groups, and the performance of 36 groups was compared to the benchmarked results of AutoMotif. The performance data of AutoMotif were extracted from AutoMotif web server. The mean accuracy, precision, and recall of AutoMotif with a polynomial kernel were 0.91, 0.47, and 0.17, respectively, while those of the new method were 0.93, 0.67, and 0.40, respectively. The performance variations among 48 kinase groups are shown in Table 3. We note that in general the new method shows better performance than AutoMotif. Especially, 22 kinase groups showed overall increased performance from AutoMotif in all three aspects (accuracy, precision, and recall).
Table 3. Performance comparison with AutoMotif for 48 kinases.
From the individual performance variations, it is notable that the new method was more effective for the kinase groups that contain relatively a small number of positive peptides. For the large kinase groups that contain more than 100 positive peptides, such as CDK-group, CK2-group, PKA-group, PKC-alpha, MAPK1, PKC_group, Src, and CK2-alpha, only in three kinase groups (MAPK1, Src, and CK2-alpha) our new method showed the performance enhancement in all three aspects, while in a majority of small kinase groups that contain less than 100 positive peptides, the new method achieved significantly better performance than AutoMotif. This tendency of performance improvement depending on the number of positive peptides may have stemmed from the amount of information. If we have many positive peptides, we can easily recognize phosphorylation peptides since the search space of phosphorylation peptides is dense. On the other hand, if we do not have enough number of positive peptides, prediction methods based on the sequence information of positive peptides inevitably contain much noise, making the predictions unreliable. In this situation applying our new method effectively reduces noise and retrieves weak signals, resulting in high search sensitivity. In this regard we expect that we can apply the new method to other small PTM groups and equally achieve performance enhancement.
Moreover, we conducted 10-fold cross validation to measure the performance variation. The results are shown in Table 4. The mean accuracy, precision, and recall (0.93, 0.59 and 0.43, respectively) were comparable or slightly degraded from LOOCV of the new method, but still the performance was better than AutoMotif (LOOCV). The result indicates that the new method performs well in an independent query set.
Table 4. Average performance of 10-fold cross validation.
Performance assessment with independent test set
We also benchmarked our results to six existing methods, GPS , KinasePhos , NetPhosK , PPSP, PredPhospho , and Scansite  by using independent test set created by Wan et al. . The test set consists of four main kinase families (CDK, CK2, PKA, and PKC) and contains phosphorylation sites derived from Phospho.ELM, PhosphoSite, and Swiss-Prot. The advantage of new test set is unbiased and independent data, and thereby we can fairly compare several different algorithms. To assess performance of new test set we generated new reference set by using remaining data after removing new test set from PhosPho.ELM.1208. We evaluated the performance by comparing area of under ROC curves. The results of the six existing methods reported by Wan et al.  were used.
The area of under ROC curves are shown in Table 5. In table, top two ranked methods are bolded. We noted that our method was ranked as top or second in four kinases families. Our method showed better performance than GPS, KinasePhos, PredPhospho, and Scansite. Especially, compared to GPS designed to search phosphorylation peptides through sequence similarity using BLOSUM62 matrix, the performance improvement in our method indicates that addressing evolutionary information could be helpful to identity phosphorylation peptides. Furthermore, compared to PPSP our method shows significant improved results in CK2 family (6% increased) but similar performance in other kinase families. To conclude it is hard to say new method is outstanding compared to other methods but our method is generally effective to recognize phosphorylation peptides in four main kinase families.
Table 5. Performance variation with seven methods. The scores indicate the area under the ROC curves
Web server construction
We constructed the web server to provide an easy access to our new phosphorylation site prediction method. Our web server, PostMod (prediction of Post-translational Modification sites) was implemented with python CGI scripts and html. Currently we provide prediction of phosphorylation sites for 48 kinases but our future direction is to apply the new method to other kinds of PTM sites. Figure 3 shows the sever input (A) and output pages (B). The search sequence can be submitted by pasting it into the text box. The server allows a user to select one of 48 different kinase types. For the query sequence the server searches all putative phosphorylation sites and generates several peptides. Meanwhile, the server generates PSSM and PSFM matrix of the query sequence by executing blastpgp 2.2.20. After that, the sever compares sequence similarity with the peptide sequences in the reference set which are phosphorylation and non-phosphorylation sites of the corresponding kinase type from phospho.ELM database. Figure 3(B) is the output page for AMPK beta-1 chain (UniProt id is P80386). All putative 36 phosphorylation sites (S, T) were shown together with confidence scores. Three predicted phosphorylation sites were bolded. The confidence score is the fraction of phosphorylation peptides among the top β hits (see Method section). We set the threshold value for phosphorylation sites as 0.5 of confidence score. User can receive search result via email. The web server is available through our website .
Figure 3. The PostMod server input page (A) and result page (B). In input page, search sequence is pasted into the text box and one of 48 kinase types is selected. The default kinase type is AMPK_group. The example sequence is AMPK beta-1 chain (UniProt id is P80386). The search result of phosphorylation sites are shown in (B). There are 36 candidate phosphorylation sites (S, T) and three of them are recognized as phosphorylation sites (bolded line).
The present method is remarkable in the sense that it is simple and computationally costless, and yet shows the outstanding performance improvement for the various kinds of kinases. We showed that when we combined the BLOSUM62 matrix-based similarity measure and the profile-profile alignment scores, the recognition results were significantly improved. Moreover, applying our noise-reducing system by exploiting indirect relationships effectively eliminated noise, and thereby increased the overall performance. The overall performance enhancement on 48 kinds of different kinases suggests that our method is generally applicable to other types of PTMs. Furthermore, it is expected that combining our method with better similarity methods would achieve higher accuracy for finding phosphorylation sites.
Performance degradation in a conventional sequence similarity measures is mainly originated from improper similarity scoring system, which gives higher scores to unrelated peptides, producing many false positives. The best solution may be developing new features which well discriminate positives from negatives. If we do not have such powerful features, we need to concentrate on removing noise. In this manner we addressed a concept of indirect relationships, and we showed that considering indirect relationships can be a powerful tool to eliminate the false positives.
To conclude, applying the new method produces good results without need of sophisticated machine learning techniques in detecting phosphorylation sites. Furthermore, we expect that applying our new method to other kinds of biological analysis would achieve high performance improvement.
The authors declare that they have no competing interests.
I.J, A.M, M.Y, and D.K designed the methods, and experimental setup. I.J carried out the implementation of the all methods. I.J wrote the manuscript under D.K, A.M and M.Y's technical supervision and mentorship. All authors have read and approved the final manuscript.
We thank Jungsul Lee and Seungpyo Hong for helpful advice in development of web browser. Inkyung Jung thanks Yusun Lee for many discussions. This work is supported by CHUNG Moon Soul Center for BioInformation and BioElectironics (CMSC), and by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (2009-0086964).
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 1, 2010: Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S1.
Heazlewood JL, Durek P, Hummel J, Selbig J, Weckwerth W, Walther D, Schulze WX: PhosPhAt: a database of phosphorylation sites in Arabidopsis thaliana and a plant-specific phosphorylation site predictor.
Wong YH, Lee TY, Liang HK, Huang CM, Wang TY, Yang YH, Chu CH, Huang HD, Ko MT, Hwang JK: KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns.
Computers Chem 1996, 20:25-34. Publisher Full Text