Abstract
Background
Protein kinases play crucial roles in cell growth, differentiation, and apoptosis. Abnormal function of protein kinases can lead to many serious diseases, such as cancer. Kinase inhibitors have potential for treatment of these diseases. However, current inhibitors interact with a broad variety of kinases and interfere with multiple vital cellular processes, which causes toxic effects. Bioinformatics approaches that can predict inhibitor-kinase interactions from the chemical properties of the inhibitors and the kinase macromolecules might aid in the design of more selective therapeutic agents that show better efficacy and lower toxicity.
Results
We applied proteochemometric modelling to correlate the properties of 317 wild-type and mutated kinases and 38 inhibitors (12,046 inhibitor-kinase combinations) to the respective combination's interaction dissociation constant (K_d). We compared six approaches for description of protein kinases and several linear and non-linear correlation methods. The best performing models encoded kinase sequences with amino acid physicochemical z-scale descriptors and used support vector machines or partial least-squares projections to latent structures for the correlations. Modelling performance was estimated by double cross-validation. The best models showed high predictive ability; the squared correlation coefficient for new kinase-inhibitor pairs ranged P^2 = 0.67-0.73, and for new kinases it ranged P^2_kin = 0.65-0.70. Models could also separate interacting from non-interacting inhibitor-kinase pairs with high sensitivity and specificity; the areas under the ROC curves ranged AUC = 0.92-0.93. We also investigated the relationship between the number of protein kinases in the dataset and the modelling results. Even when only 10% of all data were used, a valid model was still obtained, with P^2 = 0.47, P^2_kin = 0.42, and AUC = 0.83.
Conclusions
Our results strongly support the applicability of proteochemometrics for kinome-wide interaction modelling. Proteochemometrics might be used to speed up the identification and optimization of kinase-targeted and multi-targeted inhibitors.
Background
Protein kinases comprise a large family of membrane-bound and cytosolic enzymes, with 518 genes identified in the human genome [1]. All protein kinases catalyze the transfer of the γ-phosphate of adenosine triphosphate (ATP) to the hydroxyl group of tyrosine, serine, or threonine residues of protein substrates. Together with the protein phosphatases, kinases act as regulatory switches for essentially all cellular processes, including metabolic pathways, cell growth, differentiation, survival, and apoptosis. Abnormal function of protein kinases leads to the development of many serious diseases, such as cancer, diabetes, inflammatory and autoimmune disorders, and diseases of the heart. In particular, many cancers (breast, ovary, lung, liver, colon, and prostate cancer, lymphoma, glioma, melanoma, and others) may be linked with increased activity of specific growth-factor-receptor tyrosine kinases due to overexpression, or mutations leading to constitutively active forms [2].
Great hopes have been placed on the prospect that inhibition of dysfunctional kinases will lead to new, highly effective therapies. The first small-molecule kinase inhibitor, imatinib, was launched in 2001 as an anticancer agent for the treatment of chronic myeloid leukemia; it acts by inhibiting the constitutively active form of Abelson tyrosine (ABL) kinase. Since then, eight compounds targeting the kinase catalytic domain have been approved for treatment of various forms of cancer; over thirty kinase inhibitors are in the clinical phases of development, and many more are in preclinical pipelines.
A major problem in the development of kinase inhibitors is achieving specificity. Most kinase inhibitors currently in development interact with the kinases' ATP binding cleft, where they compete with ATP [3]. However, the ATP-binding site is highly conserved among all kinases, and it is therefore difficult to design a drug selective for only one kinase at a time. Other functional domains that have been exploited to target kinases are also conserved among numerous kinases, making the design of selective inhibitors problematic in these cases as well. In fact, a large-scale screening undertaken by Fabian et al. [4] revealed that the first three FDA-approved inhibitors actually interacted with about one sixth of the protein kinases included in the screen; each of them cross-interacted with between 18 and 23 of 119 evaluated protein kinases. Seventeen other kinase inhibitors in preclinical and clinical phases of development were also tested in this study and were shown to possess various degrees of promiscuity; only one of the compounds interacted with fewer than five kinases.
Many promising kinase inhibitors were abandoned early due to toxicity [5]. Another common reason for failure was lack of clinical efficacy. The latter problem can be attributed to the multitude and complexity of cellular signaling cascades, with redundant pathways and complex feedback mechanisms. Use of multi-targeted compounds that can selectively inhibit a specific group of kinases in such pathways might increase the chance of achieving clinical antitumor activity [6]. Yet another reason for lack of clinical efficacy is resistance that arises due to mutations in the targeted oncogene. For example, drug resistance in imatinib-treated leukemia patients appears due to mutations in the BCR-ABL fusion protein. This prompts the need for new generations of drugs that can override the acquired resistance by inhibiting the mutated oncogene [7,8].
A computational method widely applied in drug design is quantitative structure-activity relationship (QSAR) modelling. QSAR models are used to optimize lead compounds for target activity and other properties (e.g., ADME and toxicity) and to perform virtual screening to find new hits. However, drawbacks of QSAR are that its models consider only the properties of ligands and that it analyzes interactions with only one drug target at a time. Hence, QSAR models are unable to generalize between multiple targets.
A more general approach is proteochemometric modelling, which we introduced some time ago to study differences in mechanisms of molecular recognition for groups of related proteins [9,10]. Proteochemometric models are based on experimentally determined interaction data for series of proteins interacting with series of ligands, such as organic compounds, peptide inhibitors, and substrates. These data are correlated to descriptors of the two sets of interacting entities, which creates models that can be used to predict activities of yet untested ligand-protein combinations, as well as to foresee activity profiles of novel, unseen ligands and proteins.
Proteochemometric models take advantage of the fact that 3D structures of homologous proteins are more conserved than their primary sequences and functions. Thus, proteins that have diverged functionally during evolution may still share the same structural organization and exploit similar molecular interaction mechanisms. The principle behind proteochemometrics is simple. It requires (1) consistent interaction data, (2) numerical descriptions of relevant physicochemical and/or structural properties of both the ligands and the protein macromolecules, and (3) a non-linear correlation method that jointly uses the two sets of descriptors to explain ligand-protein complementarities and interaction profiles. We have previously successfully applied proteochemometrics to create high-resolution models for ligand interactions with several classes of G-protein coupled receptors and for inhibition of multiple mutated variants of the HIV-1 protease. The aim of this study was to evaluate several types of kinase descriptors and compare the performance of different multivariate correlation methods in large-scale proteochemometric modelling of protein kinase-inhibitor interactions.
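The three requirements above can be illustrated with a minimal sketch: each assayed pair contributes one row that concatenates a ligand descriptor block with a protein descriptor block, and a single regression model is fitted over the combined matrix. All numbers below are random placeholders, and ridge regression stands in for the PLS/SVM methods actually used in this study.

```python
import numpy as np

rng = np.random.default_rng(1)
lig = rng.normal(size=(6, 4))   # 6 hypothetical ligands, 4 descriptors each
prot = rng.normal(size=(5, 8))  # 5 hypothetical proteins, 8 descriptors each

# every ligand-protein combination becomes one row of the data matrix
pairs = [(i, j) for i in range(6) for j in range(5)]
X = np.array([np.concatenate([lig[i], prot[j]]) for i, j in pairs])
y = X @ rng.normal(size=X.shape[1])  # synthetic activities for the sketch

# ridge regression as a simple stand-in for the multivariate methods
w = np.linalg.solve(X.T @ X + 0.1 * np.eye(X.shape[1]), X.T @ y)
y_hat = X @ w
```

Because the model spans both series, a row for a never-assayed ligand-protein combination can be built the same way and scored by the fitted model.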
Results
Performance of different types of kinase descriptors in PCA and PLS-DA models
In order to compare the performance of the alignment-based approach and the five alignment-independent approaches used herein for describing protein kinase sequences, we applied principal component analysis (PCA) and partial least-squares discriminant analysis (PLS-DA). PCA was performed to visualize how different types of descriptors separate the seven groups of protein kinases contained in the data set of 317 sequences. PLS-DA was used to obtain a quantitative measure of the ability of the descriptors to discriminate these groups. The seven kinase groups were as defined in [1], namely: AGC, CaMK, CK1, CMGC, STE, TK, and TKL. (The so-called atypical and other kinases were included in the PCA analysis but were excluded from the PLS-DA modelling.)
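As a sketch of the visualization step, the principal-component scores of a kinase descriptor matrix can be computed directly from its singular value decomposition; plotting the first few score columns, colored by kinase group, gives plots of the kind shown in Figure 1. The descriptor matrix below is random placeholder data, not one of the actual description sets.

```python
import numpy as np

def pca_scores(X, n_components=3):
    """Project a (kinases x descriptors) matrix onto its first principal
    components; the scores show how well a description separates groups."""
    Xc = X - X.mean(axis=0)                       # mean-centre each descriptor
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # (kinases x n_components)

# placeholder descriptor matrix: 317 kinases, 60 descriptors
scores = pca_scores(np.random.default_rng(0).normal(size=(317, 60)))
```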
The first three principal components of the PCA models for the six sets of descriptors are visualized in Figure 1, Panels A to F. As seen from Panels A and B, SOPAA and CTD descriptors distribute the kinases in a more or less random fashion, although some of the tyrosine kinases are separated from the other groups, and the STE and CK1 groups are quite compact. Clustering into groups is more evident when the AACDC descriptors and MACCs of z-scale descriptors are used (Panels C and D). For these descriptors the location of the TK group, which is the largest group in the data set, shows almost no overlap with the other groups. Finally, the ACCs of z-scale descriptors (using the maximum lag L = 50; see below for the reason for selecting this lag) and the z-scale descriptors of aligned sequences give good separation of most of the kinase groups (Panels E and F). However, a notable difference between the last two is that ACCs separate subgroups of TKs, whereas the first three PCs of the descriptors of aligned sequences do not reveal such subclustering. On the other hand, the alignment-based descriptors are the only ones that separate the CMGC kinases as being substantially different from the other groups. As seen from Panel F, for the alignment-based approach the CMGC kinases form a distinct cluster in the first two PCs.
Figure 1. Plots representing the separation of kinase groups in the first three components of PCA models. Panels A-F show results of six PCA models using various types of alignment-independent (A-E) and alignment-based (F) kinase descriptions. Each one of the 317 protein kinases is represented by a tetrahedron, color-coded according to its kinase group: black, AGC (named after member families PKA, PKG and PKC); red, CAMK (calcium/calmodulin regulated kinases); blue, CK1 (casein kinases); electric green, CMGC (named after member families CDK, MAPK, GSK3, and CLK); orange/amber, STE (homologues of yeast Sterile kinases); magenta, TK (tyrosine kinases); sea green, TKL (tyrosine kinase-like kinases); gray, atypical/other. Note that in Panels A and B the kinase groups do not form distinct clusters, whereas in the other panels the largest kinase groups are clearly separated.
PLS-DA finds the directions in PC space where maximum separation among the classes is obtained and where each class forms a maximally compact cluster. In an ideal situation, a cross-validated correlation coefficient Q^2 = 1 indicates that all members of a class are predicted to have y = 1, whereas all non-members are predicted to have y = 0. In reality, Q^2 is always lower than 1 due to intra-class variations. Nevertheless, a Q^2 within the range 0.6-0.8 still indicates a good separation of classes, with few or no mispredictions. Should Q^2 drop to 0.4-0.6, or even less, we have a warning that classes overlap and that the model will make multiple mispredictions. (Such predictions would still be better than random; a random model has Q^2 = 0.)
Cross-validation results for each type of kinase description and each kinase group are shown graphically in Figure 2, where Panels A to F present PLS-DA results for the same descriptor types as in Figure 1, A-F. As for the PCA models, z-scale based descriptions performed the best, with the alignment-based approach performing best overall. Extremely high predictive ability was obtained, with the Q^2 values for the seven kinase groups ranging from 0.89 to 0.97 and the overall Q^2 being 0.94 (Figure 2, Panel F). Comparison of all six panels of Figure 2 reveals that, irrespective of the description type, the best separation is obtained for TKs. For all descriptions, the lowest Q^2 values were obtained for the TKL kinases, suggesting that this group is more diverse than the other groups (as its name indicates, the TKL group comprises enzymes that are phylogenetically related to TKs, although they are in fact serine/threonine kinases). Nevertheless, cross-validation showed that in the models that used MACCs or alignment-based descriptions, none of the TKL kinases was mispredicted as a non-member, and none of the other kinases was mispredicted as a member of the TKL group. In contrast, the model that exploited ACCs mispredicted one TKL kinase (the TNNI3K kinase).
Figure 2. Predictive ability of PLS discriminant analysis (PLS-DA) models estimated by five-fold cross-validation. Kinase descriptions in the six models presented in Panels A-F correspond to the respective panels in Figure 1. Q^2 values may vary from 0 to 1. A low value indicates that a kinase group is randomly mixed with the others, while a value approaching 1 indicates perfect discrimination. The overall high Q^2 values certify that the respective types of kinase descriptions have captured sequence properties that are present in members of a specific kinase group while being absent among all other kinases.
Selection of optimal lags for ACC and MACC transforms
An additional goal of the preliminary modelling was to identify the optimal complexity of the ACC and MACC descriptions (in other words, to find the maximum lag L up to which descriptors contribute to improved separation of kinase groups). As described in Methods, covariances over long distances are less helpful in finding physicochemical similarities in related protein sequences due to the differences in the length of the segments that connect their functional units. Use of very many ACC or MACC terms with large lags may then give rise to chance correlations, deteriorating the resolution of any mathematical models created from them. By comparing PLS-DA models exploiting ACC and MACC descriptors with different maximum lags (L being 10, 25, 50, and 100) we showed that for both descriptor types the results were somewhat inferior for L = 10, the overall Q^2 being 0.76 and 0.86 for the ACC and MACC based models, respectively. Increasing L to 25 gave major improvements (the overall Q^2 being 0.90 and 0.89, respectively); a further increase to L = 50 produced yet slightly better models. Finally, including very long distance covariances with L = 100 led to slightly reduced predictive ability, the Q^2 values dropping to 0.88 and 0.87 for ACCs and MACCs, respectively. An interesting finding was that the performance of the two descriptor types was quite similar when the maximum lag was set to L = 25 or larger, both in terms of overall Q^2 and with respect to the Q^2 values for the seven groups of kinases (data not shown). Based on all these results we elected to use ACC and MACC descriptors with maximum lag 50 in all further modelling of kinase-inhibitor interactions.
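The ACC transform itself can be sketched as follows: each residue is replaced by its property-scale values (z-scales in this study), and the auto- and cross-covariances of the scales are averaged over all residue pairs separated by each lag up to L, yielding a fixed-length vector regardless of sequence length. The encoding below uses random placeholder values, not real z-scale tables, and the exact normalization used in the study may differ.

```python
import numpy as np

def acc_transform(z, max_lag):
    """Auto- and cross-covariance (ACC) transform of a property-encoded
    sequence. `z` is an (N, D) array: N residues, each described by D
    property scales. Returns D*D*max_lag terms, independent of N."""
    n, d = z.shape
    zc = z - z.mean(axis=0)                  # centre each property scale
    terms = []
    for lag in range(1, max_lag + 1):
        # covariance of scale j at residue i with scale k at residue i+lag,
        # averaged along the sequence
        cov = zc[:-lag].T @ zc[lag:] / (n - lag)
        terms.append(cov.ravel())            # D*D terms per lag
    return np.concatenate(terms)

# placeholder: a 250-residue sequence encoded with 3 scales, L = 50
desc = acc_transform(np.random.default_rng(0).normal(size=(250, 3)), 50)
```

Since the output length depends only on D and L, sequences of different lengths become directly comparable without any prior alignment.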
Performance of different types of kinase descriptors and multivariate correlation methods in predicting kinase-inhibitor activity
We used several machine learning methods to correlate the descriptors of kinase inhibitors and kinases to the interaction activities. The methods used were as follows: decision trees (DT), the one-nearest-neighbour (1NN) and k-nearest-neighbour (kNN) approaches, support vector machines (SVM), and partial least-squares projections to latent structures (PLS). The first four methods induce non-linear models, whereas PLS is a linear method. When using PLS we created both linear and non-linear models; in the latter case the dataset included cross-terms derived from the kinase and inhibitor descriptions.
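Such cross-terms can be formed as the pairwise products of inhibitor and kinase descriptors; a minimal sketch (the descriptor blocks are placeholder arrays, and any scaling applied in the study is omitted):

```python
import numpy as np

def with_cross_terms(X_inh, X_kin):
    """Augment the two linear blocks with inhibitor x kinase cross-terms,
    letting a linear method such as PLS model the non-additive
    (selectivity) part of the interaction."""
    cross = (X_inh[:, :, None] * X_kin[:, None, :]).reshape(len(X_inh), -1)
    return np.hstack([X_inh, X_kin, cross])

# placeholder blocks: 10 pairs, 4 inhibitor and 6 kinase descriptors
rng = np.random.default_rng(0)
X = with_cross_terms(rng.normal(size=(10, 4)), rng.normal(size=(10, 6)))
```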
The predictive abilities for new inhibitor-kinase combinations (P^2) and for new kinases (P^2_kin), as assessed by outer-loop cross-validation, are presented in Table 1. The most predictive models were obtained using SVM, where for all three z-scale based description methods the P^2 values fell in the range 0.70-0.73 and the P^2_kin values in the range 0.67-0.70. The PLS (with cross-terms) and kNN models performed almost as well. (However, the performance of the kNN model exploiting ACC descriptions was inferior, its P^2_kin being only 0.53.) Models based on AACDC descriptors performed clearly worse than the z-scale based descriptions, but here too the SVM model was the most predictive, with P^2 = 0.68 and P^2_kin = 0.64, whereas the values of these parameters for the PLS model were only 0.58 and 0.53.
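Estimating P^2_kin requires that the outer-loop test folds contain whole kinases that were never seen during training. A sketch of such a kinase-wise split (function and variable names are our own, not from the study):

```python
import numpy as np

def kinase_folds(kinase_ids, n_folds=5, seed=0):
    """Assign every data row to an outer-loop fold by its kinase, so each
    test fold holds only kinases absent from the training rows."""
    rng = np.random.default_rng(seed)
    unique = rng.permutation(np.unique(kinase_ids))
    fold_of = {k: i % n_folds for i, k in enumerate(unique)}
    folds = np.array([fold_of[k] for k in kinase_ids])
    return [(np.flatnonzero(folds != f), np.flatnonzero(folds == f))
            for f in range(n_folds)]
```

Hyperparameters would then be tuned in an inner cross-validation loop on the training rows only, keeping the outer-loop estimate unbiased.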
Table 1. Results of proteochemometric modelling of kinaseinhibitor interactions using different types of kinase descriptions and different data analysis methods
The inferior performance of the AACDC descriptions is not surprising. In fact, it seems quite unlikely that the fraction of any single dipeptide would show a significant correlation with the functional properties of the kinases. Such correlations can, however, become evident for larger sets of dipeptide combinations (i.e., tripeptides, tetrapeptides, and longer similar sequence stretches), giving an advantage to the SVM model, which by the use of its non-linear kernel can approximate high-complexity interaction effects between the descriptors. The difference between the performances of the SVM and PLS models is even larger when proteins are described by CTD or by SOPAA descriptors; the P^2_kin for PLS models using these two sets of descriptors being, respectively, 0.45 and 0.44, compared to 0.60 and 0.63 for the SVM models.
For every set of descriptors the kNN method outperformed 1NN (see Table 1). However, the optimal number of neighbours selected in the inner loop of cross-validation was quite low, ranging from 3 to 5 in all cases. The predictions of the kNN models are thus based on local subsets of the data set, and for this reason it would be problematic to use these models to draw any general conclusions about the molecular properties that determine kinase-inhibitor complementarity.
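For reference, kNN regression in this setting amounts to averaging the activities of the k nearest pairs in descriptor space; a plain sketch (Euclidean distance and unweighted averaging are assumptions, not necessarily the exact variant used in the study):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Predict each test pair's activity as the mean activity of its k
    nearest training pairs in descriptor space."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]   # indices of k closest rows
    return y_train[nearest].mean(axis=1)
```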
Finally, as expected, PLS modelling without kinase-inhibitor cross-terms explained only a minor part of the activity variation, the P^2_kin for all three z-scale-exploiting models being 0.32 (see Table 1). This result shows that the non-linear part, which describes kinase-inhibitor selectivity, dominates over the linear part, which describes the average activity of a ligand over the protein series and the average activity of all ligands for a particular protein. The high non-linearity in the dataset is also likely the reason for the moderate success of the decision tree algorithm, which for each of the six kinase descriptions created a massive tree with over 300 leaves explaining 65-71% of the activity variation (data not shown). However, all these trees generalized poorly to novel kinases, the P^2_kin for the various descriptions ranging only 0.30-0.43.
Distribution of prediction errors in SVM, PLS and kNN models
The performance of the SVM, PLS, and kNN models exploiting z-scale descriptors of aligned sequences (i.e., the description that gave the best models) is further illustrated in Figure 3. The figure presents histograms of the prediction errors calculated in the outer loop of cross-validation for the 1/5 of the kinases that had been entirely excluded from the modelling (see Methods for details). The distributions of errors in the SVM and PLS models are very similar (cf. Panels A and B). The cumulative plot demonstrates that in the SVM model the difference between predicted and observed pK_d values ranges 0-0.25 logarithmic units for 57% of the kinase-inhibitor combinations; for 75% of the combinations it falls below 0.5 logarithmic units; for 89% it is less than one logarithmic unit, and for 99% less than two logarithmic units. The corresponding fractions in the PLS model are 49%, 70%, 88%, and 98%. To interpret these results one should keep in mind that the total span of kinase-inhibitor activities exceeded five logarithmic units, namely from pK_d = 5 to 10.62, and all non-interacting entities were assigned the numerical value pK_d = 4; hence mispredictions by more than six units would be theoretically possible.
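The cumulative fractions quoted above are simply the share of absolute prediction errors falling at or below each threshold; they can be computed as follows (the arrays here are illustrative, not the study's data):

```python
import numpy as np

def error_fractions(y_obs, y_pred, thresholds=(0.25, 0.5, 1.0, 2.0)):
    """Fraction of predictions whose absolute error in pK_d units is at
    most each threshold -- the cumulative view shown in Figure 3."""
    err = np.abs(np.asarray(y_obs) - np.asarray(y_pred))
    return {t: float(np.mean(err <= t)) for t in thresholds}

fractions = error_fractions([5.0, 6.0, 7.0, 8.0], [5.1, 6.6, 8.5, 8.0])
```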
Figure 3. Distribution of prediction errors in kinase-inhibitor interaction activity models. Shown are prediction errors in models using three different data modelling methods, namely: support vector machines (Panel A), partial least-squares projections to latent structures (Panel B), and the k-nearest neighbour approach (Panel C). Prediction errors are estimated by outer-loop cross-validation, iteratively excluding 1/5 of the kinases in the data set. The histograms represent the absolute values of prediction errors (blue bars; labelling on the left side of the panels); the cumulative plot of prediction errors is represented by red lines (labelling on the right side of the panels).
For the kNN model the pattern of error distribution is quite different (Figure 3, Panel C). Here the prediction error was zero for more than half of the non-interacting pairs (i.e., all of their nearest neighbours had also been identified as non-interacting in the primary screen and had been assigned the same numerical value pK_d = 4 in the modelling). However, 14% of the prediction errors exceed one logarithmic unit and 4% exceed two logarithmic units, indicating that predictions of the kNN model are less accurate than those obtained by SVM and PLS. In other words, activities of inhibitors interacting with overall quite similar kinases may vary considerably, and regression models explain this better than the nearest neighbour approach.
Dependence of modelling performance on the size of the dataset
Although the SVM, PLS, and kNN models all showed good predictive ability, they were based on more than 12,000 data points. It would thus be of obvious interest to know the robustness of the proteochemometric approach when less data are available. We therefore assessed the relationship between the sparseness of the data matrix used and the performance of the model. To this end we created models using 60, 40, 20, and 10 percent of all data. For example, when 10% of the data was used to calculate the P^2_kin value, the set of 317 kinases was randomly split into ten partitions of about equal size. Modelling was then performed using only one of these partitions at a time, and the nine remaining partitions were used to evaluate the model obtained. The procedure of splitting the dataset was iterated ten times in order to assure the reproducibility of the results. The P^2 and P^2_kin measures for models exploiting z-scale descriptors of aligned kinase sequences are presented in Table 2, where the values for 80% of the dataset size are in fact identical to the above-presented results of the 5-fold outer-loop cross-validation (cf. Table 1). The performance of the models decreases only slightly when 40-60% of the whole dataset is used for model building, and the models are still predictive when as few as 10% of all kinase-inhibitor combinations or 10% of all kinases are present in the dataset (estimating P^2 and P^2_kin, respectively). Moreover, the small margins between the P^2 and P^2_kin parameters indicate that the reliability of predictions for "new unassayed kinases" does not differ much from the reliability of predictions for kinases for which some interaction data had already been assayed and used in the modelling. Comparison of the results for the three data analysis methods also indicates that their performance is more similar for larger datasets. For sparsely populated datasets the performance of the kNN method deteriorates faster than that of the SVM and PLS methods.
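The splitting scheme described above can be sketched as a generator of (training, evaluation) kinase index sets, repeated with fresh random splits to check reproducibility (function name and structure are our own):

```python
import numpy as np

def sparse_splits(n_kinases, n_parts=10, n_repeats=10, seed=0):
    """Yield (train, test) index arrays: the model is built on one part
    (~1/n_parts of the kinases) and evaluated on all remaining kinases."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        order = rng.permutation(n_kinases)
        for part in np.array_split(order, n_parts):
            rest = np.setdiff1d(order, part)   # everything not in this part
            yield part, rest

splits = list(sparse_splits(317))
```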
Table 2. Results of kNN, SVM, and PLS modelling using subsets of full kinaseinhibitor dataset
Predicting interacting versus non-interacting kinase-inhibitor pairs
Although all models predict interaction activities on a continuous scale, they can also be used to predict whether new inhibitors and kinases interact or not. In the quantitative modelling we assigned the value pK_d = 4 to all inhibitor-kinase combinations that had been found not to interact in the primary screen (the screen's detection limit being pK_d = 5). Hence, if the activity predicted for an inhibitor-kinase pair falls below a pre-specified threshold level, the pair can be classified as non-interacting, while if it falls above this threshold it can be classified as interacting. The selection of the threshold value will affect the sensitivity and specificity of the classification, which can be defined as:

sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)

where TP, FP, TN, and FN denote the numbers of true positive, false positive, true negative, and false negative classifications, respectively.
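In code, the thresholding and the two measures can be computed as follows (the threshold value and all names are illustrative):

```python
import numpy as np

def sens_spec(pkd_pred, is_active, threshold=5.0):
    """Classify predicted pK_d values against a threshold and compare
    with the known interacting/non-interacting labels."""
    pred = np.asarray(pkd_pred) >= threshold
    true = np.asarray(is_active, dtype=bool)
    tp = np.sum(pred & true)
    fn = np.sum(~pred & true)
    tn = np.sum(~pred & ~true)
    fp = np.sum(pred & ~true)
    return tp / (tp + fn), tn / (tn + fp)   # sensitivity, specificity
```

Sweeping the threshold over the range of predicted activities traces out the ROC curve discussed next in the text.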
A common measure of classification quality is the Receiver Operating Characteristic (ROC) curve, which is plotted as sensitivity versus one minus specificity upon varying the discrimination threshold value. The area under the ROC curve (AUC) is a measure of the discriminatory power of a classifier that is insensitive to class distributions and the costs of misclassifications; AUC = 1 indicates perfect classification, while AUC = 0.5 means that the classifier does not perform better than random guessing.
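AUC can also be computed without tracing the curve explicitly, via its rank interpretation: the probability that a randomly chosen interacting pair receives a higher predicted activity than a randomly chosen non-interacting one. A small sketch:

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC via the Mann-Whitney formulation: the fraction of
    (active, inactive) pairs in which the active member scores higher,
    counting ties as one half."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=bool)
    pos, neg = s[y], s[~y]
    wins = (np.sum(pos[:, None] > neg[None, :])
            + 0.5 * np.sum(pos[:, None] == neg[None, :]))
    return wins / (len(pos) * len(neg))
```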
Figure 4 compares ROC curves for the kNN, SVM, and PLS models built on the largest and on the smallest sets of kinases described in the previous section (i.e., using 80% and 10% of all 317 kinases). Inspection of Figure 4 shows, for instance, that at a sensitivity of 0.80 the SVM model built on the largest set of kinases has a specificity of 0.92. In other words, using a threshold that identifies 80% of the truly active kinase-inhibitor pairs as active, the false-positive rate amounts to only 8%. The performance of the PLS and kNN models was slightly worse; at a sensitivity of 0.80 the false-positive rates amount to 11% and 13%, respectively.
Figure 4. ROC curves for SVM, PLS, and kNN models. Shown are ROC curves for SVM, PLS, and kNN models built on data for 80% (solid lines) and for 10% (dashed lines) of 317 protein kinases. The area under the ROC curve (AUC) is a measure of the discriminatory power of a classifier.
The good performance of the classification is further indicated by the ROC areas, which for the models built on 80% of the kinases were 0.93, 0.92, and 0.91 for the SVM, PLS, and kNN models, respectively. Interestingly, the models built on only 10% of the kinases also show good classification performance, the ROC areas being 0.83, 0.82, and 0.79 for the SVM, PLS, and kNN models, respectively. This finding indicates that even in cases where quantitative models do not possess very high predictive ability in terms of P^2, they may still be able to separate active and inactive kinase-inhibitor combinations. Accordingly, our models should be useful for large-scale virtual screening to select promising objects prior to their experimental testing, while filtering out objects with a lower probability of having the properties sought in a development project.
Discussion
Design of selective and multi-selective medications requires an understanding of the properties of the biological targets that distinguish the chosen target(s) from the numerous similar "anti-targets" encoded in the human genome. Contemporary drug design has to a large extent been focused on structure-based methods, where ligands are designed to fit into a binding pocket of the target. This requires knowledge of the exact 3D structures of the targets and anti-targets, which is a problem for protein kinases, as X-ray structures have been solved for only 124 human protein kinase domains [11].
Proteochemometrics, on the other hand, has a distinct advantage when the studied proteins share the same structural organization, since primary amino acid sequences can then be used without the need for high-resolution 3D structures of the targets. Proteochemometrics also has the advantage that multiple targets and anti-targets can be encompassed in one single model. Structural alignments of protein kinases have shown that they all contain universal conserved subdomains, whereas their amino acid sequences still show quite notable variation. In fact, there is generally a much higher degree of conservation of 3D structures among protein families than of their primary sequences [12]. The average pairwise sequence identity over the kinase domains falls below 30%, and only a small fraction of residues are markedly conserved across the entire superfamily [13]. Use of sequence-derived descriptions can hence be considered a rational approach for kinase representation in multivariate modelling, provided that the sequence descriptions are made in such a way that they are relevant to the structural and functional organization of the kinases. Descriptions can be derived based on prior sequence alignments or in alignment-independent ways; the latter approaches are advantageous for less similar sequences, when unambiguous alignments are impossible to obtain.
In the first phase of this study we performed PCA and PLS-DA, using one set of alignment-based and five sets of alignment-independent descriptors of protein kinase amino acid sequences. The purpose of this analysis was to evaluate the ability of the different descriptions to separate kinases into groups according to their functions. PLS-DA for the best model (which exploited alignment-based z-scale descriptions) afforded excellent separation of the seven groups of kinases; the cross-validated squared correlation coefficients fell between 0.93 and 0.98 for six of the groups, while for the more diverse tyrosine kinase-like kinase group it was 0.89.
As explained in the Methods section, PLS-DA models create regression equations for each of the modelled classes and thus identify properties that are more typical, or even unique, for a particular class compared to the other classes. Thus, inspection of the alignment-based PLS-DA regression equation exploiting z-scale descriptors reveals that in some cases the description of the physicochemical properties of very short sequence stretches, and even of single residues, is sufficient to separate all members of one kinase group from all other kinases. In one such example, inspection of the alignment-based PLS-DA model revealed that a conserved proline residue surrounded by two hydrophobic amino acids in the activation loop of the TK sequences is a sufficient pattern for class separation. In the majority of cases this triplet is embraced by two positively charged lysine or arginine residues (e.g., the sequence stretch being KFPIK in ABL1 kinase, KVPIK in EGFR kinase, and RLPVK in KIT kinase). Analysis of the alignment-independent PLS-DA model exploiting AACDC descriptors further shows that groups of kinases are often distinguished by the model through small sets of dipeptides (for instance, each of the dipeptides CW, VW, RN, and GM is present in more than 90% of TKs compared to only 15-35% of kinases from the six other groups). Such specific sequence residues or patterns, which may be identified by our models, could accordingly be addressed in the design of targeted and multi-targeted drugs. In fact, a few such amino acids (sometimes termed 'selectivity filters') have previously been exploited in drug design for kinases. This includes the so-called gatekeeper residue, a bulky amino acid in most kinases, while 20% of the kinases have a threonine at this position; this property was used in the design of selective ABL kinase inhibitors. (However, unfortunately, this residue position is also a common site for mutations that confer resistance to imatinib, gefitinib, and erlotinib [14].) Cohen et al. [15] designed inhibitors for RSK family kinases by targeting two selectivity filters in the ATP binding site, namely the threonine gatekeeper and a cysteine residue, cysteine being an uncommon amino acid in the kinases' active site. These two amino acids, which distinguish RSKs from other protein kinases, were sufficient to confer high activity on the designed inhibitor.
Although we here limited PLS-DA modelling to the separation of the seven major groups of the kinase superfamily, the analysis can be performed hierarchically at any resolution, e.g., to delineate particular families, subfamilies, and even single kinases.
In the subsequent studies we created quantitative models for kinase-inhibitor interaction activities using the six types of kinase descriptions and performing correlations with SVM, PLS, kNN, and decision trees. The small-molecule inhibitors were in all models represented by a unified set of 3D-structural and physicochemical property descriptors. Models that exploited z-scale descriptions of the alignable parts of the protein kinase sequences performed best. However, using ACC or MACC transformations gave only slightly inferior models when correlations to the activity data were done by SVM or PLS. ACC-transformed descriptors performed worse with the kNN approach, while MACC transformations resulted in a weaker model with decision trees. The advantages of the ACC and MACC transforms are that they do not require prior alignment and that they are calculated from the full-length sequences of kinase domains, which in the present dataset varied from 194 to 606 residues (although for about half of the kinases the length ranged 240-260 residues, and for less than 30% it exceeded 280 residues). Whereas ACCs reflect the covariances of amino acid properties over whole sequences, MACCs pinpoint individual pairs of residues with specific property combinations. MACC-based models may thus identify patterns that are not confined to the same location in each and every protein and/or are situated in sequence stretches that cannot be aligned unambiguously over the whole dataset. Consequently, models exploiting MACCs may complement the alignment-based models in the analysis and prediction of kinase-inhibitor interactions. The three other protein sequence descriptions used (CTD, SOPAA, and AACDC) showed inferior performance compared to z-scale based descriptions and thus appear less useful in proteochemometric modelling.
SVM outperformed the other data analysis methods, including PLS, both in the prediction accuracy for the active kinase-inhibitor combinations, as manifested by the P^{2} and P^{2}_{kin} parameters (Tables 1 and 2), and in the ability to distinguish interacting from non-interacting kinase-inhibitor pairs, as revealed by the areas under the ROC curves (Figure 4). Accordingly, SVM seems to be the optimal choice for predicting full kinome-wide selectivity profiles of existing compounds, and for virtual screening to find new hits with desired selectivities. However, an important point is that SVM is essentially a 'black box' technique, which makes interpretation of its models difficult. Thus, even if the performance of SVM in virtual screening is superior to PLS, it is problematic to comprehend which molecular properties of kinases and inhibitors are important in the model. PLS contrasts with 'black box' methods like SVM and with locally derived kNN and DT models because it expresses the correlation results in a single, straightforwardly interpretable regression equation. Moreover, PLS provides additional tools for model diagnostics, such as score and loading plots and 'distance to model' parameters, which allow identification of outliers and assessment of the reliability of extrapolations outside the modelled chemical and interaction spaces [16]. Consequently, the parallel use of PLS and SVM modelling techniques may be advantageous when one aims at obtaining models for both prediction and interpretation, and at cross-checking model performances. (In this context it ought to be mentioned that several approaches have recently been suggested to give SVM models some transparency [17-19], which may favour the use of SVM in proteochemometric modelling).
The models built on small subparts of the dataset showed the robustness of the proteochemometric modelling approach. Even for the smallest dataset, comprising only about 30 kinases, the SVM and PLS models showed acceptable predictive ability. The performance of the models based on small datasets was even more impressive in the prediction of interacting versus non-interacting kinase-inhibitor pairs; the discriminatory power of the SVM and PLS models was, respectively, 0.83 and 0.82 for the models created on 30 kinases (compared to 0.93 and 0.92 for the largest dataset size). These results may have a broad impact on the protein kinase field, as they imply that a relatively limited amount of experimental work suffices to afford qualitative and quantitative interaction models that generalize over the whole kinome.
Success of any empirical modelling depends on the quality of the data, which in proteochemometrics should comprise accurate activity measurements and descriptions of the relevant physicochemical and/or structural properties of the proteins and their ligands. Yet another prerequisite for proteochemometrics is an adequate composition of the dataset, which should be balanced and include both interacting and non-interacting protein-ligand combinations. Unfortunately, 'negative' results are often omitted in study reports. Moreover, interaction databases populated by data from multiple series typically contain activities for a fairly low fraction of all possible ligand-protein combinations, which implies that the bulk of the non-interacting entity pairs is absent. Modelling of sparse data matrices with over-represented high-activity data would inevitably give rise to false-positive predictions. Hence, the success of any modelling study depends largely on the use of a well-balanced dataset, such as the one used here, which comprises data for both active and inactive kinase-inhibitor combinations covering more than half of the human kinome.
Although the modelled dataset covered more than 12,000 interactions, the series of 38 kinase inhibitors cannot be considered large, even though it included seven of the eight presently approved anticancer agents as well as other compounds with mutually dissimilar inhibition profiles. One can thus expect to gain further improvements by analyzing data for many more chemical compounds, providing wider and denser coverage of the chemical and interaction spaces. In the present study the dataset parts for modelling and validation were selected randomly to assure an objective assessment of modelling performance. However, it is also possible to apply statistical experimental design [20] to choose small representative panels of kinases to be used for assaying and interaction modelling. One such technique is D-optimal design, which could be used to select kinases that cover most of the diversity of the kinase sequence and activity space. Designed molecular libraries have proven much more informative than random collections, and they have in some cases been shown to allow a 10^{3}-10^{4}-fold reduction of the experimental work required, while still retaining the full generalization ability of the derived interaction models [21,22]. We can hence conclude that the values in Table 1 are lower limits of the predictive abilities, which would be surpassed by models for datasets of the same size in which the kinases were selected according to the principles of statistical experimental design. Hence, for any experimental work to be undertaken in the kinase field following this study we would strongly encourage the use of experimental design. The final outcome would be kinome-wide models that can predict the interaction strength of an arbitrary chemical compound over all known protein kinases.
Conclusions
In this study we developed kinome-wide proteochemometric models for the prediction of kinase-inhibitor interaction profiles. We compared several alignment-based and alignment-independent approaches for the description of protein kinases, evaluated the performance of linear and non-linear correlation methods, and investigated the relationship between the size of the dataset and the predictive ability of the models obtained. Our best models are highly predictive on a quantitative scale and can delineate interacting from non-interacting kinase-inhibitor combinations. One finding of this study is that models built on a quite limited amount of kinase data are still capable of generalizing over the whole human kinome. We thus foresee that the routes to proteochemometric kinome-wide modelling shown here will markedly speed up the discovery and optimization of protein kinase targeted and multi-targeted drugs.
Methods
Interaction activity data
We used the dataset published by Karaman et al. [23] comprising dissociation constants (K_{d}) of 38 small-molecule kinase inhibitors tested against a panel of 317 human kinases, in total 38 × 317 = 12,046 activities. All major kinase groups, as defined by Manning et al. [1], were represented in the dataset, namely AGC, CaMK, CK1, CMGC, STE, TK, and TKL. The kinase inhibitor series included approved drugs (dasatinib, erlotinib, gefitinib, imatinib, lapatinib, sorafenib, and sunitinib), trial drugs and experimental compounds (flavopiridol, roscovitine, and others), and the natural product staurosporine. For 24.8% of the inhibitor-kinase combinations an activity better than 10 μM had been observed in a primary screen, and the exact K_{d} values were then determined. The dissociation constants found ranged from 10^{-5} M to 2.4 × 10^{-11} M and were expressed as negative logarithms of the K_{d} values (pK_{d}); the transformed values thus ranged from 5 to 10.62. In order to obtain a full data matrix we assigned the numerical value pK_{d} = 4 to the inhibitor-kinase pairs that had been identified as not interacting in the primary screen; i.e., pK_{d} was set one unit lower than the threshold value (pK_{d} = 5) of the primary screen. This was a trade-off between two qualities of the mathematical models to be derived from the data: a very high margin would prioritize discrimination between active and inactive kinase-inhibitor pairs at the expense of prediction accuracy for the active ones; on the other hand, a low margin would reduce the model's ability to discriminate between interacting and non-interacting pairs. Our selected value seemed reasonable since it allows both goals to be achieved, provided that the prediction errors of a model do not exceed one logarithmic unit.
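The pK_{d} transformation and the imputation of the non-binders described above can be sketched as follows; this is an illustrative reconstruction, not the authors' code, and the constant names are our own:

```python
import math

PRIMARY_SCREEN_PKD = 5.0  # pKd of the 10 uM primary-screen threshold
IMPUTED_PKD = 4.0         # assigned to non-interacting pairs: one log
                          # unit below the primary-screen threshold

def to_pkd(kd_molar):
    """Convert a dissociation constant (in mol/L) to pKd = -log10(Kd).

    Pairs flagged as non-interacting in the primary screen are passed in
    as None and receive the imputed value of 4.
    """
    if kd_molar is None:
        return IMPUTED_PKD
    return -math.log10(kd_molar)
```

For example, `to_pkd(1e-5)` gives 5.0 (the screen threshold) and `to_pkd(2.4e-11)` gives about 10.62, matching the range quoted above.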
Description of kinase inhibitors
The structures of the kinase inhibitors were drawn with ISIS/Draw and converted to 3D by the Corina unit of the Tsar 3.3 software (Accelrys, Inc.). Partial atomic charges were derived using the Charge2 utility and the geometries were optimized by energy minimization using the Cosmic utility of Tsar 3.3. Compounds were then characterized by various molecular descriptors using the Dragon 2.1 software (Talete S.r.l.). The following descriptor classes were calculated: constitutional descriptors, counts of functional groups and atom-centered fragments, geometrical descriptors, charge and aromaticity indices, empirical descriptors, and molecular properties. When two descriptors were highly correlated (pairwise r^{2} > 0.9), we excluded the one showing the highest correlation with any other descriptor of the descriptor set. In this way, 150 molecular descriptors were obtained for each inhibitor for the modelling. All descriptors were mean-centered and scaled to unit variance prior to use in modelling.
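The descriptor pruning rule can be sketched as a greedy loop; this is our reading of the procedure (the tie-breaking between equally correlated descriptors is our own choice, as the text does not specify it):

```python
import numpy as np

def filter_correlated(X, r2_cut=0.9):
    """Greedy descriptor filter sketch: while any descriptor pair has
    pairwise r^2 above the cut-off, drop from that pair the descriptor
    whose strongest correlation with any other remaining descriptor is
    highest. Returns the indices of the kept columns of X."""
    kept = list(range(X.shape[1]))
    while len(kept) > 1:
        R2 = np.corrcoef(X[:, kept], rowvar=False) ** 2
        np.fill_diagonal(R2, 0.0)
        i, j = np.unravel_index(np.argmax(R2), R2.shape)
        if R2[i, j] <= r2_cut:
            break
        # drop the pair member that correlates more strongly with the rest
        kept.pop(i if R2[i].max() >= R2[j].max() else j)
    return kept
```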
Description of protein kinases
The panel of protein kinases comprised 317 entities (i.e., more than half of the known human kinome). Of these, 28 contained point or cassette mutations, and a few kinases contained deletions of sequence stretches up to eight residues long. The sequences of the kinases' kinase domains were retrieved from the KinBase database http://kinase.com/kinbase. Although the length of the kinase domains varied from 194 to 606 amino acids, almost 90% of them were between 240 and 300 amino acids long.
Alignment-based physicochemical z-scale description of kinase sequences
We used two types of kinase sequence descriptions: alignment-based and alignment-independent. For the alignment-based description, a multiple sequence alignment was performed over the entire sequence set with the ClustalW 2.0 software [24], using its default settings (GONNET 250 matrix) and applying ten iteration cycles to refine the progressive alignment. Those parts of the alignment that contained gaps for more than 50% of the kinase sequences were removed, which left 264 aligned positions. (These gaps corresponded to sequence stretches that were quite unique among most kinases and were located far from the ATP binding site). The aligned positions were then described by the amino acid physicochemical properties encapsulated in the five z-scales, z_{1}-z_{5}, derived by Sandberg et al. [25]. Z-scales are quantitative descriptors obtained from principal component analysis (vide infra) of 26 measured and computed physicochemical properties of the 20 naturally encoded amino acids and 67 synthetic alpha amino acids. The first three of these z-scales describe about 70% of the variation in the original data, and all five describe more than 95% of it. Being principal components, z-scales are mean-centered and uncorrelated to each other, and can be tentatively interpreted as reflecting hydrophobicity (z_{1}), steric properties (z_{2}), polarity (z_{3}), and other electronic properties (z_{4}, z_{5}) of the amino acids. In this way, the differences in physicochemical properties of the aligned kinase sequences were represented by 264 × 5 = 1320 descriptors.
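The position-wise encoding can be sketched as below. The z-scale values in the toy table are made up for illustration (the published tables of Sandberg et al. should be used in practice), and the all-zero encoding of alignment gaps is our own assumption:

```python
# Toy z-scale table for demonstration only -- the numbers are invented.
# A real encoder would hold the five published z-scales for all 20 residues.
TOY_ZSCALES = {
    "A": (0.2, -1.0, 0.1),
    "K": (2.3, 1.0, 1.2),
    "P": (-1.7, 0.2, 0.3),
    "-": None,  # alignment gap
}

def encode_alignment(seqs, zscales):
    """Encode equal-length aligned sequences position by position with
    z-scale descriptors; gaps (and unknown symbols) become zero vectors,
    so each sequence yields n_positions * n_scales descriptors."""
    n_z = len(next(v for v in zscales.values() if v is not None))
    rows = []
    for seq in seqs:
        row = []
        for aa in seq:
            vals = zscales.get(aa)
            row.extend(vals if vals is not None else (0.0,) * n_z)
        rows.append(row)
    return rows
```

With 264 aligned positions and five z-scales this yields the 264 × 5 = 1320 descriptors per kinase mentioned above.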
Auto- and cross-covariances (ACCs) of z-scale descriptors
Z-scales are directly useful for encoding proteins provided that the proteins show substantial conservation in their 3D structural organization and that their primary sequences are conserved to the extent that alignments can be made unambiguously. However, if sequences are misaligned, attempts to find similarities and differences in the proteins' physicochemical space would be thwarted. Therefore, methods have been sought to avoid the alignment step and transform sequence descriptions directly into uniform matrices. One such method, the auto- and cross-covariance (ACC) transform, describes changes in some property or property combination over sequence stretches of different lengths [26]. This is done according to the equations:

AC_{z,lag} = Σ_{i=1}^{N-lag} V_{z,i} × V_{z,i+lag} / (N - lag)

CC_{z,z',lag} = Σ_{i=1}^{N-lag} V_{z,i} × V_{z',i+lag} / (N - lag)

where AC represents the auto-covariances of the same property (z-scale) and CC the cross-covariances of two different z-scales z and z'; z = 1, 2, ..., Z (Z is 5, i.e. the number of z-scales); i = 1, 2, ..., N-lag (i is the amino acid position in the sequence and N the total number of amino acids); lag = 1, 2, ..., L (L is the maximum lag, i.e. the longest sequence stretch used, which can be up to the length of the shortest sequence in the dataset); and V is the z-scale value. The total number of ACC terms depends on the chosen L and on the number of z-scales, and is L × Z^{2}.
Larger maximum lags L allow for a more detailed description, accounting for interactions of amino acids at distant parts of a sequence. However, even closely related proteins often differ by sequence insertions/deletions. As a result, the probability of assigning an interaction to the same ACC term decreases with increasing distance between the sequence positions. Long-distance covariances would hence be less helpful in finding physicochemical similarities in related sequences. We here calculated ACCs with maximum lags 10, 25, 50, and 100.
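The ACC transform described above can be sketched in a few lines of NumPy; the descriptor ordering (lag-major, z × z' raveled) is our own convention:

```python
import numpy as np

def acc_transform(z_matrix, max_lag):
    """Auto- and cross-covariance transform of a per-residue z-scale matrix.

    z_matrix : (N, Z) array with the Z z-scale values for each of the N
    residues of one sequence; max_lag must be smaller than N.
    Returns a flat vector of max_lag * Z * Z terms, where the term for
    scales (z, z') at a given lag is
        sum_i V[i, z] * V[i + lag, z'] / (N - lag).
    """
    N, Z = z_matrix.shape
    terms = []
    for lag in range(1, max_lag + 1):
        prod = z_matrix[:N - lag].T @ z_matrix[lag:]   # (Z, Z) matrix
        terms.append(prod / (N - lag))
    return np.concatenate([t.ravel() for t in terms])
```

Because the output length depends only on `max_lag` and the number of z-scales, sequences of different lengths yield descriptor vectors of identical size, which is the point of the transform.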
Maxima of auto- and cross-covariances (MACCs) of z-scale descriptors
ACC transformations provide a uniform set of descriptors that are independent of the length of each sequence and that are able to capture characteristic physicochemical patterns of a protein. One limitation of ACCs is that specific local sequence patterns may become concealed by the overall properties of the given sequence. Another drawback is that the terms are difficult to interpret. For example, auto-covariances of the z_{1} scale would be similar for a sequence consisting predominantly of hydrophilic amino acids (represented by positive values) and a sequence consisting predominantly of hydrophobic amino acids (negative values); in both cases the multiplications give positive values. (AC_{z1} terms would, however, separate such two sequences from sequences where hydrophilic amino acids alternate with hydrophobic ones with certain periodicities).
To cope with these limitations of ACCs, a modified algorithm was suggested in [27], where the positive and negative descriptor values are considered separately and only the maximum values over all possible interactions at each lag are used to describe the sequences. Of the two algorithms developed in [27] we applied the MACC1 transformation, giving 4 × L × Z^{2} terms; i.e., four times as many descriptors as an ACC with the same maximum lag. (The alternative MACC2 algorithm was not used as it ignores the direction of a sequence and hence seemed inappropriate for encoding proteins). We here calculated MACCs with maximum lags 10, 25, 50, and 100.
It may be pointed out that, whereas each ACC term is calculated from the whole protein sequence, the corresponding four MACC1 terms represent extremes of particular physicochemical property combinations somewhere in the sequence. The MACC descriptions thus retain full interpretability and can be traced back to individual residue pairs. However, they may overstate the role of extreme physicochemical properties for protein structure/function and understate the role of 'moderate' amino acids.
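Our reading of the MACC1 idea can be sketched as follows. This is an assumption-laden illustration of the 4 × L × Z^{2} term layout, not the exact algorithm of [27] (which may differ, e.g., in whether signed values or magnitudes are stored):

```python
import numpy as np

def macc1_transform(z_matrix, max_lag):
    """Sketch of a MACC1-style transform: for each lag and each ordered
    z-scale pair, the products V[i, z] * V[i + lag, z'] are split into the
    four sign categories of the two factors (+/+, +/-, -/+, -/-) and only
    the largest product magnitude in each category is kept (0.0 when a
    category is empty). Returns 4 * max_lag * Z * Z terms."""
    N, Z = z_matrix.shape
    out = []
    for lag in range(1, max_lag + 1):
        a, b = z_matrix[:N - lag], z_matrix[lag:]
        for z1 in range(Z):
            for z2 in range(Z):
                x, y = a[:, z1], b[:, z2]
                p = x * y
                for mask in ((x > 0) & (y > 0), (x > 0) & (y < 0),
                             (x < 0) & (y > 0), (x < 0) & (y < 0)):
                    out.append(np.abs(p[mask]).max() if mask.any() else 0.0)
    return np.array(out)
```

Unlike an ACC term, each kept value here originates from a single residue pair, which is what makes the description traceable back to the sequence.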
Composition, transition and distribution (CTD) of amino acid properties
The CTD alignment-independent descriptors were proposed by Dubchak and coworkers [28] and are based on seven amino acid properties (attributes): 1) hydrophobicity, 2) normalized van der Waals volume, 3) polarity, 4) polarizability, 5) charge, 6) secondary structure, and 7) solvent accessibility. For each of these seven attributes, the amino acids are divided into three classes. E.g., for the hydrophobicity attribute, class 1 comprises polar amino acids (RKEDQN), class 2 neutral amino acids (GASTPHY), and class 3 hydrophobic amino acids (CLVIMFW). The composition descriptors represent the overall percentage of each class in the sequence. Since there are seven attributes and three classes, 7 × 3 = 21 composition descriptors can be computed. The transition descriptors represent the frequencies with which an attribute changes class along the sequence, e.g., a class 1 amino acid is followed by a class 2 amino acid or vice versa. Since there are three possible transitions between classes, 7 × 3 = 21 transition descriptors can be computed. The distribution descriptors represent the distribution of each attribute class along the sequence. For each attribute and each class, five distribution descriptors are computed, based on the locations of the first residue and of the 25%, 50%, 75%, and 100% residues of the given class. For instance, if the total length of a sequence is N amino acids and all polar amino acids (i.e. members of hydrophobicity class 1) are among the first i residues of the sequence, then the distribution descriptor for the 100% residues of that class is calculated as i/N. Thus, the total number of distribution descriptors is 5 × 7 × 3 = 105. The CTD descriptors were computed using the PROFEAT (Protein Feature) web server [29].
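The composition and transition parts of CTD can be sketched for the hydrophobicity attribute, using the class assignments quoted above (the distribution part, and the other six attributes, follow the same pattern):

```python
# Hydrophobicity classes as given in the text:
# 1 = polar (RKEDQN), 2 = neutral (GASTPHY), 3 = hydrophobic (CLVIMFW)
HYDROPHOBICITY = {**{a: 1 for a in "RKEDQN"},
                  **{a: 2 for a in "GASTPHY"},
                  **{a: 3 for a in "CLVIMFW"}}

def composition(seq, classes=HYDROPHOBICITY):
    """Fraction of residues falling in each of the three attribute classes."""
    n = len(seq)
    return [sum(classes[a] == c for a in seq) / n for c in (1, 2, 3)]

def transitions(seq, classes=HYDROPHOBICITY):
    """Frequency of class changes between adjacent residues, one value per
    unordered class pair (1-2, 1-3, 2-3), normalized by the number of
    adjacent residue pairs."""
    pairs = [tuple(sorted((classes[a], classes[b])))
             for a, b in zip(seq, seq[1:])]
    n = len(pairs)
    return [pairs.count(p) / n for p in [(1, 2), (1, 3), (2, 3)]]
```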
Sequence-order and pseudo-amino acid (SOPAA) descriptors
The sequence-order and pseudo-amino acid descriptors were proposed by Chou [30,31] and have been used most successfully to predict protein subcellular location. We here used the PROFEAT web server to calculate 60 sequence-order-coupling numbers, 100 quasi-sequence-order descriptors, and 50 pseudo-amino acid descriptors. The sequence-order-coupling numbers are derived from a physicochemical distance matrix between pairs of amino acids. The coupling number of rank d is defined as the sum of the squared physicochemical distances between all amino acids located d residues from each other. This is mathematically described by the equation:

τ_{d} = Σ_{i=1}^{N-d} (d_{i,i+d})^{2}

where d_{i,i+d} is the physicochemical distance between the two amino acids at positions i and i+d, and N is the total length of the sequence. PROFEAT allows computing these descriptors from rank d = 1 (i.e. neighbouring residues) up to d = 30, using two different distance matrices (the physicochemical distance of Schneider-Wrede and the chemical distance of Grantham) [29].
Quasi-sequence-order descriptors are thereafter computed from the coupling numbers and from the protein's amino acid composition (see [29] for the mathematical equations). Fifty quasi-sequence-order descriptors can be derived from each set of coupling numbers. The first 20 quasi-sequence-order descriptors reflect the effects of the amino acid composition and are calculated according to the equation:

X_{a} = f_{a} / (Σ_{b=1}^{20} f_{b} + w Σ_{d=1}^{30} τ_{d})

where a is one of the twenty natural amino acids, f_{a} is the normalized occurrence of this amino acid, and w is a weighting factor (w = 0.1).
The thirty other quasi-sequence-order descriptors reflect the effects of sequence order and are defined as:

X_{20+d} = w τ_{d} / (Σ_{b=1}^{20} f_{b} + w Σ_{k=1}^{30} τ_{k})

where the rank d runs from 1 to 30.
The fifty pseudo-amino acid descriptors were computed similarly to the quasi-sequence-order descriptors; however, the coupling numbers in the equations were replaced by more complex correlation factors reflecting various physicochemical properties of the amino acids (see [31] for details). The whole set of SOPAA descriptors thus comprised 210 alignment-independent descriptors, encapsulating both quantitative (physicochemical) and qualitative (amino acid letter code) sequence properties.
Amino acid and dipeptide composition (AACDC)
Amino acid composition descriptors represent the fractions of each of the twenty natural amino acids in a protein sequence, while dipeptide composition descriptors represent the fractions of the 20 × 20 = 400 possible dipeptides in the sequence [32]. Despite its simplicity, the method has been applied successfully, e.g., for the classification of G-protein coupled receptors [33,34] and nuclear receptors [35], and for the prediction of protein folds and of the subcellular localization of proteins [36-38]. Amino acid and dipeptide composition descriptors were computed using the PROFEAT server.
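A minimal sketch of the AAC and DC calculations, computed directly here rather than via PROFEAT:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq):
    """The 20 amino acid composition fractions, in AMINO_ACIDS order."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]

def dc(seq):
    """The 400 dipeptide composition fractions, in lexicographic order of
    the 20 x 20 possible dipeptides."""
    dipeps = [seq[i:i + 2] for i in range(len(seq) - 1)]
    n = len(dipeps)
    return [dipeps.count(a + b) / n
            for a, b in product(AMINO_ACIDS, repeat=2)]
```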
Data preprocessing
All descriptors were mean-centered and scaled to unit variance prior to their use. In order to account for the differences in the numbers of inhibitor and kinase descriptors, block scaling was applied. This was done by assigning each block the weight 1/sqrt(N), where N is the number of descriptors in the block. In this way, the total sum of the variances of all descriptors in each block became equal to 1. The response variable (pK_{d}) was mean-centered prior to the data analysis.
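The centring, unit-variance scaling, and block weighting can be sketched as below (constant-valued columns would need to be removed first, since they cannot be scaled to unit variance):

```python
import numpy as np

def preprocess_block(block):
    """Mean-center each column, scale it to unit (sample) variance, then
    multiply the whole block by 1/sqrt(N_descriptors) so that the total
    variance of the block equals 1."""
    X = np.asarray(block, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    return X / np.sqrt(X.shape[1])
```

After this step the inhibitor and kinase descriptor blocks contribute equally to the model regardless of how many descriptors each contains.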
Data analysis
Principal component analysis (PCA)
PCA is a multivariate projection method that provides compression of datasets containing large numbers of variables (see [39] for algorithms and a geometrical interpretation). Contrary to the original variables, which are typically multicollinear, the so-called principal components (PCs) are orthogonal to each other; the first component extracts the largest variance in the dataset, the second component extracts the largest part of the remaining variance, and so on. The major patterns in the original data can often be captured by a small number of components.
All the variance in a dataset with N objects is explained by N-1 or fewer PCs. Thus, all descriptors of the kinase inhibitors in the present dataset could be transformed into 37 PCs without any loss of information, and with preservation of full interpretability. Similarly, any number of descriptors of the 317 kinases can be compressed into 316 PCs (in fact, already half this number explained over 90-95% of the variance in any of the six sets of kinase descriptions used herein).
Partial least-squares projections to latent structures (PLS)
PLS can be considered an extension of PCA that, along with the independent variables (X matrix), deals with one or several dependent variables (Y vector or matrix). PLS aims to find the relationship between the two matrices and to develop a predictive model. This is achieved by simultaneously projecting X and Y onto latent variables (PLS components), with the additional constraint that the projections be correlated. (Thus, compared to PCs, the PLS components are tilted so as to maximize the covariance between the projections of X and Y). PLS derives a regression equation for each y variable, where the regression coefficients reveal the direction and magnitude of the influence of the X variables on y [16].
A special case of PLS is PLS discriminant analysis (PLS-DA), where the y variables are categorical and express the class membership of the objects (members of a given class are numerically represented by the value 1, while non-members are represented by 0).
Several algorithms have been developed for performing PLS; here we used orthogonalized PLS [40] as implemented in Simca-P+ 11.5 (Umetrics AB) and NIPALS [41] as implemented in Unscrambler 9.8 (CAMO Software AS); the latter algorithm was applied for the PLS-DA modelling. An important decision in PLS is the choice of the number of PLS components. Each extracted component increases the explained variation of both X and Y. However, while the first components normally find real correlations between the two blocks, increased model complexity may give rise to chance correlations. To avoid over-fitting we applied five-fold inner-loop cross-validation (see below).
Accounting for nonlinear cooperative effects in PLS modelling
PLS is a linear correlation method. However, in proteochemometrics there is a need to describe non-linear ligand-protein interaction effects (i.e., effects that are governed by the complementarity of the interacting moieties and determine the selectivities of the interactions) [9]. This is typically done by deriving cross-terms between ligand and protein descriptors. Since the number of cross-terms equals the product of the numbers of ligand and protein descriptors, it may be unfeasible to calculate them directly. E.g., with 150 inhibitor and 1,320 z-scale descriptors at hand, computing cross-terms would result in 198,000 new variables, which would make any further analysis highly resource consuming. A practical approach is instead to compute the cross-terms from the principal components of the original descriptors. For the calculation of cross-terms we here used all 37 PCs of the ligand descriptors, but only as many PCs of the kinase descriptors as explained 95% of their total variance (this allowed us to further reduce the size of the datasets by a factor of two). Cross-terms were Pareto-scaled; the block weight for the cross-terms was initially set to 0 and thereafter increased in regular steps until an optimal PLS model (according to inner-loop cross-validation) was obtained. We have earlier shown that this approach exerts no negative influence on the final modelling results [42].
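Forming cross-terms from PC scores amounts to taking the element-wise product of every ligand PC with every kinase PC for each ligand-protein pair; a sketch (the scaling steps described above are omitted):

```python
import numpy as np

def cross_terms(ligand_pcs, protein_pcs):
    """For each row (one ligand-protein pair), form the products of every
    ligand PC score with every protein PC score. Input shapes are
    (n_pairs, n_ligand_pcs) and (n_pairs, n_protein_pcs); the output has
    n_ligand_pcs * n_protein_pcs columns."""
    return np.einsum('ij,ik->ijk', ligand_pcs, protein_pcs).reshape(
        len(ligand_pcs), -1)
```

With 37 ligand PCs and, say, 150 kinase PCs this gives 5,550 cross-terms per pair instead of the 198,000 obtained from the raw descriptors.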
Support vector machines (SVM)
SVM is a machine learning technique for classification and regression that uses linear or non-linear kernel functions to project the data into a high-dimensional feature space. The correlation is then performed in this hyperspace based on the structural risk minimization principle, i.e., aiming to increase the generalization ability of the model [43,44]. We induced non-linear proteochemometric regression models using the epsilon-SVR method and a radial basis function kernel as implemented in the libSVM 2.88 software [45]. Five-fold inner-loop cross-validation was performed to find optimal values for the width of the kernel function γ and the error penalty parameter C.
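A hedged sketch of epsilon-SVR with an RBF kernel and grid-searched γ and C. We use scikit-learn (which wraps libSVM) and toy data as stand-ins, so the data, grids, and API below are our own assumptions rather than the study's actual setup:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Toy stand-in data: in the real setting each row would concatenate the
# block-scaled inhibitor and kinase descriptors and y would hold pKd.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=120)   # non-linear response

# epsilon-SVR with an RBF kernel; gamma and C tuned by five-fold CV,
# mirroring the inner-loop optimization described in the text.
search = GridSearchCV(
    SVR(kernel="rbf", epsilon=0.1),
    {"gamma": [0.01, 0.1, 1.0], "C": [1.0, 10.0, 100.0]},
    cv=5,
)
search.fit(X, y)
model = search.best_estimator_
```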
k-nearest neighbour method (kNN)
The kNN algorithm predicts the y value of a test set object as the average (or weighted average) of the y values of its k nearest neighbours in the training set. kNN models were induced using the Weka 3.6 software [46]. We characterized the similarity between inhibitor-kinase pairs by the Euclidean distance in the X descriptor space and applied 1/distance weighting, as described [47]. In contrast to the PLS and SVM modelling, where the inhibitor and kinase descriptor blocks were scaled to equal total variance, the relative scaling of the descriptor blocks was varied systematically in the kNN modelling by multiplying the block weight for the kinase descriptors by factors 0.25, 0.5, 1, 2, and 4 (in this way, the kinase descriptors obtained lower or higher importance than the inhibitor descriptors in assessing inhibitor-kinase complex similarity). Five-fold inner-loop cross-validation was applied to find the optimal scaling and number of nearest neighbours for prediction.
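The inverse-distance-weighted prediction can be sketched in a few lines of NumPy (the guard against a zero distance is our own addition; Weka's exact handling may differ):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict y for x_new as the inverse-distance-weighted average of the
    y values of its k nearest training neighbours (Euclidean distance)."""
    d = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(d)[:k]
    w = 1.0 / (d[nearest] + 1e-12)   # guard against a zero distance
    return float(np.dot(w, y_train[nearest]) / w.sum())
```

Rescaling the kinase descriptor block before computing distances, as described above, changes which neighbours are "nearest" and hence the prediction.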
Decision trees
Decision trees were created using the M5P algorithm [48] as implemented in Weka 3.6. This algorithm derives linear regression models at the terminal nodes (leaves) of the tree. After building the tree, it was pruned and smoothing was performed. The optimal value for the minimum number of objects allowed at a leaf was determined using five-fold inner-loop cross-validation.
Double cross-validation of kinase-inhibitor interaction models
The predictive ability of models is commonly quantified by the cross-validated squared correlation coefficient, Q^{2}. In cross-validation the objects are divided into a number of groups. Models are then developed from the dataset reduced by one of the groups, and predictions for the excluded objects are calculated. The process is repeated iteratively until every group has been omitted once. Q^{2} is then calculated as:

Q^{2} = 1 - Σ_{i=1}^{N} (y_{i} - ŷ_{i})^{2} / Σ_{i=1}^{N} (y_{i} - ȳ)^{2}

where y_{i} are the measured and ŷ_{i} the cross-validation predicted outcome values, and ȳ is the average of the measured outcome values for the N objects in the dataset.
A Q^{2} > 0.4 is generally considered acceptable for modelling biological data [21]. However, some studies have pointed out that Q^{2} may give an overly optimistic assessment of model performance when the cross-validation results are used to optimize model parameters or to select the best among many alternative models [49,50]. To remedy this we applied double cross-validation (also called double-loop or nested cross-validation) [51], where the dataset was split into 25 parts in total. In each round of the inner cross-validation a model was built on 16/25 of the whole dataset and evaluated on 4/25 of it, while the remaining data were put aside for the outer loop. Once the inner-loop cross-validation had found the optimal model, its true performance was verified against the 5/25 of the data that had never been used during the optimization.
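The double (nested) cross-validation logic can be sketched generically as follows. Here `make_model` is a hypothetical factory returning any object with `fit`/`predict`, and the fold counts are parameters rather than the study's exact 25-part split:

```python
import numpy as np

def p_squared(y_obs, y_pred):
    """External predictive ability: 1 - PRESS / total sum of squares."""
    y_obs, y_pred = np.asarray(y_obs), np.asarray(y_pred)
    press = ((y_obs - y_pred) ** 2).sum()
    ss = ((y_obs - y_obs.mean()) ** 2).sum()
    return 1.0 - press / ss

def double_cv(X, y, make_model, params, n_outer=5, n_inner=5, seed=0):
    """Nested CV sketch: the inner loop picks the best parameter by inner
    Q^2, the outer loop measures performance on data never used during
    parameter selection. X and y are NumPy arrays; make_model(p) returns
    an object with fit(X, y) and predict(X)."""
    rng = np.random.default_rng(seed)
    outer = np.array_split(rng.permutation(len(y)), n_outer)
    preds = np.empty(len(y))
    for o in range(n_outer):
        test = outer[o]
        train = np.concatenate(outer[:o] + outer[o + 1:])
        inner = np.array_split(train, n_inner)
        best_p, best_q2 = None, -np.inf
        for p in params:                       # inner-loop optimization
            inner_pred, pos = np.empty(len(train)), 0
            for i in range(n_inner):
                val = inner[i]
                fit_idx = np.concatenate(inner[:i] + inner[i + 1:])
                m = make_model(p)
                m.fit(X[fit_idx], y[fit_idx])
                inner_pred[pos:pos + len(val)] = m.predict(X[val])
                pos += len(val)
            q2 = p_squared(y[np.concatenate(inner)], inner_pred)
            if q2 > best_q2:
                best_p, best_q2 = p, q2
        m = make_model(best_p)                 # refit on all inner data
        m.fit(X[train], y[train])
        preds[test] = m.predict(X[test])       # outer, unbiased predictions
    return p_squared(y, preds)
```

The returned value corresponds to the P^{2} statistics reported in the study, since it is computed only from outer-loop predictions.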
We evaluated the predictive ability both for new kinase-inhibitor combinations and for new kinases with no measured interaction data. In the former case each part of the randomly split dataset comprised about 1/25 of the 12,046 kinase-inhibitor pairs, and in the latter case it comprised all data for approximately 1/25 of the 317 kinases. The squared correlation coefficients from the outer loop of the cross-validations for these two selections are in the following denoted P^{2} and P^{2}_{kin}, respectively (the letter P is used instead of Q, as in previous studies [51], to emphasize that these are unbiased performance estimates based on external predictions).
Authors' contributions
ML conceived the study, carried out data analysis, and wrote the initial draft. JESW supervised the study and edited the manuscript. Both authors read and approved the final manuscript.
Acknowledgements
Data sets for the models herein are available at http://www.farmbio.uu.se/dokument.php?avd=2. This work was financially supported by the Swedish Society for Medical Research, the Swedish Fund for Research without Animal Experiments, Uppsala University (KoF 07), and the Swedish Research Council (VRM 04X05957).
References

Manning G, Whyte DB, Martinez R, Hunter T, Sudarsanam S: The protein kinase complement of the human genome.
Science 2002, 298:1912-1934.

Madhusudan S, Ganesan TS: Tyrosine kinase inhibitors in cancer therapy.
Clin Biochem 2004, 37:618-635.

Griffin JD: Interaction maps for kinase inhibitors.
Nat Biotechnol 2005, 23:308-309.

Fabian MA, Biggs WH, Treiber DK, Atteridge CE, Azimioara MD, Benedetti MG, Carter TA, Ciceri P, Edeen PT, Floyd M, Ford JM, Galvin M, Gerlach JL, Grotzfeld RM, Herrgard S, Insko DE, Insko MA, Lai AG, Lélias JM, Mehta SA, Milanov ZV, Velasco AM, Wodicka LM, Patel HK, Zarrinkar PP, Lockhart DJ: A small molecule-kinase interaction map for clinical kinase inhibitors.
Nat Biotechnol 2005, 23:329-336.

Scapin G: Protein kinase inhibition: different approaches to selective inhibitor design.
Curr Drug Targets 2006, 7:1443-1454.

Kamb A, Wee S, Lengauer C: Why is cancer drug discovery so difficult?
Nat Rev Drug Discov 2007, 6:115-120.

Daub H, Specht K, Ullrich A: Strategies to overcome resistance to targeted protein kinase inhibitors.
Nat Rev Drug Discov 2004, 3:1001-1010.

Shah NP, Sawyers CL: Mechanisms of resistance to STI571 in Philadelphia chromosome-associated leukemias.
Oncogene 2003, 22:7389-7395.

Wikberg JE, Lapinsh M, Prusis P: Proteochemometrics: a tool for modelling the molecular interaction space. In Chemogenomics in Drug Discovery - A Medicinal Chemistry Perspective. Edited by Kubinyi H, Müller G. Weinheim: Wiley-VCH; 2004:289-309.

Wikberg JE, Spjuth O, Eklund M, Lapins M: Chemoinformatics taking Biology into Account: Proteochemometrics. In Computational Approaches in Cheminformatics and Bioinformatics. Edited by Guha R, Bender A. Weinheim: Wiley; 2010. ISBN 9780470384411.

Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank.
Nucleic Acids Res 2000, 28:235-242.

Illergård K, Ardell DH, Elofsson A: Structure is three to ten times more conserved than sequence - a study of structural response in protein cores.
Proteins 2009, 77:499-508.

Bamborough P, Drewry D, Harper G, Smith GK, Schneider K: Assessment of chemical coverage of kinome space and its implications for kinase drug discovery.
J Med Chem 2008, 51:7898-7914.

Carter TA, Wodicka LM, Shah NP, Velasco AM, Fabian MA, Treiber DK, Milanov ZV, Atteridge CE, Biggs WH, Edeen PT, Floyd M, Ford JM, Grotzfeld RM, Herrgard S, Insko DE, Mehta SA, Patel HK, Pao W, Sawyers CL, Varmus H, Zarrinkar PP, Lockhart DJ: Inhibition of drug-resistant mutants of ABL, KIT, and EGF receptor kinases.
Proc Natl Acad Sci USA 2005, 102:11011-11016.

Cohen MS, Zhang C, Shokat KM, Taunton J: Structural bioinformatics-based design of selective, irreversible kinase inhibitors.
Science 2005, 308:1318-1321.

Wold S, Sjöström M, Eriksson L: PLS-regression: a basic tool of chemometrics.
Chemom Intell Lab 2001, 58:131-150.

Navia-Vázquez A, Parrado-Hernández E: Support vector machine interpretation.
Neurocomputing 2006, 69:1754-1759.

Ustün B, Melssen WJ, Buydens LM: Visualisation and interpretation of Support Vector Regression models.
Anal Chim Acta 2007, 595:299-309.

Devos O, Ruckebusch C, Durand A, Duponchel L, Huvenne JP: Support vector machines (SVM) in near infrared (NIR) spectroscopy: Focus on parameters optimization and model interpretation.
Chemom Intell Lab 2009, 96:27-33.

Lundstedt T, Seifert E, Abramo L, Thelin B, Nyström Å, Pettersen J, Bergman R: Experimental design and optimization.
Chemom Intell Lab 1998, 42:3-40.

Linusson A, Wold S, Nordén B: Statistical molecular design of peptoid libraries.
Mol Divers 1998, 4:103-114.

Alifrangis LH, Christensen IT, Berglund A, Sandberg M, Hovgaard L, Frokjaer S: Structure-property model for membrane partitioning of oligopeptides.
J Med Chem 2000, 43:103-113.

Karaman MW, Herrgard S, Treiber DK, Gallant P, Atteridge CE, Campbell BT, Chan KW, Ciceri P, Davis MI, Edeen PT, Faraoni R, Floyd M, Hunt JP, Lockhart DJ, Milanov ZV, Morrison MJ, Pallares G, Patel HK, Pritchard S, Wodicka LM, Zarrinkar PP: A quantitative analysis of kinase inhibitor selectivity.
Nat Biotechnol 2008, 26:127-132.

Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: Clustal W and Clustal X version 2.0.
Bioinformatics 2007, 23:2947-2948.

Sandberg M, Eriksson L, Jonsson J, Sjöström M, Wold S: New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids.
J Med Chem 1998, 41:2481-2491.

Wold S, Jonsson J, Sjöström M, Sandberg M, Rännar S: DNA and peptide sequences and chemical processes multivariately modelled by principal component analysis and partial least-squares projections to latent structures.
Anal Chim Acta 1993, 277:239-252.

Cruciani G, Baroni M, Carosati E, Clementi M, Valigi R, Clementi S: Peptide studies by means of principal properties of amino acids derived from MIF descriptors.
J Chemom 2004, 18:146-155.

Dubchak I, Muchnik I, Holbrook SR, Kim SH: Prediction of protein folding class using global description of amino acid sequence.
Proc Natl Acad Sci USA 1995, 92:8700-8704.

Li ZR, Lin HH, Han LY, Jiang L, Chen X, Chen YZ: PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence.
Nucleic Acids Res 2006, 34:W32-W37.

Chou KC: Prediction of protein subcellular locations by incorporating quasi-sequence-order effect.
Biochem Biophys Res Commun 2000, 278:477-483.

Chou KC: Prediction of protein cellular attributes using pseudo-amino acid composition.
Proteins 2001, 43:246-255.

Van Heel M: A new family of powerful multivariate statistical sequence analysis techniques.
J Mol Biol 1991, 220:877-887.

Bhasin M, Raghava GP: GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protein coupled receptors.
Nucleic Acids Res 2004, 32:W383-W389.

Gao QB, Wang ZZ: Classification of G-protein coupled receptors at four levels.
Protein Eng Des Sel 2006, 19:511-516.

Bhasin M, Raghava GP: Classification of nuclear receptors based on amino acid composition and dipeptide composition.
J Biol Chem 2004, 279:23262-23266.

Bhasin M, Raghava GP: ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST.
Nucleic Acids Res 2004, 32:W414-W419.

Reczko M, Bohr H: The DEF data base of sequence based protein fold class predictions.
Nucleic Acids Res 1994, 22:3616-3619.

Hua S, Sun Z: Support vector machine approach for protein subcellular localization prediction.
Bioinformatics 2001, 17:721-728.

Wold S, Esbensen K, Geladi P: Principal component analysis.
Chemom Intell Lab 1987, 2:37-52.

Trygg J, Wold S: Orthogonal projections to latent structures (OPLS).
J Chemom 2002, 16:119-128.

Geladi P, Kowalski BR: Partial least-squares regression: a tutorial.
Anal Chim Acta 1986, 185:1-17.

Lapinsh M, Prusis P, Uhlén S, Wikberg JE: Improved approach for proteochemometrics modeling: application to organic compound-amine G protein-coupled receptor interactions.
Bioinformatics 2005, 21:4289-4296.

Vapnik V: The Nature of Statistical Learning Theory. 2nd edition. New York: Springer; 1999.

Drucker H, Burges CJ, Kaufman L, Smola A, Vapnik V: Support Vector Regression Machines.

Chang CC, Lin CJ: LIBSVM: a library for support vector machines. 2001. [http://www.csie.ntu.edu.tw/~cjlin/libsvm] (accessed Feb 1, 2009)

Witten IH, Frank E: Data Mining: Practical machine learning tools and techniques. 2nd edition. San Francisco: Morgan Kaufmann; 2005.

Aha D, Kibler D, Albert M: Instance-based learning algorithms.

Quinlan RJ: Learning with Continuous Classes. In Proceedings of the 5th Australian Joint Conference on Artificial Intelligence. Singapore: World Scientific; 1992:343-348.

Golbraikh A, Tropsha A: Beware of q2!
J Mol Graph Model 2002, 20:269-276.

Peterson SD, Schaal W, Karlén A: Improved CoMFA modeling by optimization of settings.
J Chem Inf Model 2006, 46:355-364.

Freyhult E, Prusis P, Lapinsh M, Wikberg JE, Moulton V, Gustafsson MG: Unbiased descriptor and parameter selection confirms the potential of proteochemometric modelling.
BMC Bioinformatics 2005, 6:50.