Abstract
Background
Machine learning tools have gained considerable attention during the last few years for analyzing biological networks for protein function prediction. Kernel methods are suitable for learning from graph-based data such as biological networks, as they only require the abstraction of the similarities between objects into the kernel matrix. One key issue in kernel methods is the selection of a good kernel function. Diffusion kernels, the discretization of the familiar Gaussian kernel of Euclidean space, are commonly used for graph-based data.
Results
In this paper, we address the issue of learning an optimal diffusion kernel, in the form of a convex combination of a set of pre-specified kernels constructed from biological networks, for protein function prediction. Most prior work on this kernel learning task focuses on variants of the loss function based on Support Vector Machines (SVM). Extensions to other loss functions, such as the one based on the Kullback-Leibler (KL) divergence, which is more suitable for mining biological networks, lead to expensive optimization problems. By exploiting the special structure of the diffusion kernel, we show that this KL divergence-based kernel learning problem can be formulated as a simple optimization problem, which can then be solved efficiently. The formulation is further extended to the multi-task case, where we predict multiple functions of a protein simultaneously. We evaluate the efficiency and effectiveness of the proposed algorithms using two benchmark data sets.
Conclusion
Results show that the linearly combined diffusion kernel performs better than every single candidate diffusion kernel. When the number of tasks is large, the algorithms based on multiple tasks are favored due to their competitive recognition performance and small computational costs.
Background
Many types of genomic data can be represented as a graph (network), where the nodes represent genes or proteins, and edges may represent similarities between protein sequences, edges in a metabolic pathway, or physical interactions between proteins [1]. Machine learning tools have been commonly used to analyze biological networks for knowledge discovery and pattern analysis [2]. In this paper, we focus on learning from biological networks for protein function prediction. This problem has recently been studied extensively using computational approaches [1]. Neighborhood-based methods [3,4] assign functions to proteins based on the most frequent functions within a neighborhood of the protein, and they differ mainly in how the "neighborhood" of a protein is defined. More sophisticated prediction functions have been exploited in [5,6]. Methods based on network diffusion [7,8] view the protein network as a flow network, and functions of proteins are diffused from annotated proteins to their neighbors in various ways. Other approaches for protein function annotation from biological networks include the graph-cut-based approaches [9,10] and those derived from kernel methods [11-13].
Kernel methods are versatile tools for learning from graph-based data, as they only require the characterization of similarities between objects by the use of the kernel trick [2,14]. Diffusion kernels [15], which can be considered as the discretization of the well-known Gaussian kernel of Euclidean space, are commonly used for graph-based data. In kernel methods, the information on the data is conveyed only in the kernel function, which uniquely determines the mapping of the original inputs onto a feature space. Thus, one of the central issues in kernel methods is the selection of a good kernel function for the specific problem at hand. A recent trend in kernel learning (selection) is to formulate it as a convex program, which leads to a globally optimal solution [16]. The idea of learning a linear combination of pre-specified kernels for Support Vector Machines (SVM) was originally proposed in [17], where this problem was formulated as semidefinite programs (SDP) and Quadratically Constrained Quadratic Programs (QCQP). In general, approaches based on learning a convex combination of kernels offer the additional advantage of facilitating heterogeneous data integration from different sources [18].
The objective functions for kernel learning used in [17] are performance measures for hard margin SVM, 1-norm soft margin SVM, and 2-norm soft margin SVM. An alternative criterion for kernel matrix learning is the Kullback-Leibler (KL) divergence [19] between the two zero-mean Gaussian distributions defined by the input and output kernel matrices [20]. One particularly appealing feature of the KL divergence criterion is that unlabeled (test) data can be integrated naturally into the training process, thereby improving generalization. The formulations in [17] also use unlabeled data, but in a weak form by enforcing the trace magnitude of the kernel matrix including both training and test data in the constraint. Direct incorporation of unlabeled data by the formulations in [17] through the KL divergence criterion involves a matrix determinant term. The resulting formulation is a so-called maximum-determinant problem [21], which is a general framework that contains semidefinite programming (SDP) [16] as a special case. Despite the theoretical soundness of semidefinite programming, experience indicates that it is computationally expensive and thus cannot be scaled to large-scale problems. The maximum-determinant problem is a more general framework than SDP, and the path-following algorithms used to solve it are more expensive.
Diffusion kernels [15] capture the long-range relationships between vertices of graphs and are state-of-the-art for building kernels for graphs. In this paper, we focus on learning diffusion kernels constructed from biological networks, using the KL divergence criterion. In particular, we show that when the KL divergence criterion is used to optimize a convex combination of diffusion kernels with different parameters, the resulting optimization problem does not involve the matrix determinant term and thus can be solved by gradient descent methods. Previous studies [22,23] have shown that the removal of the matrix-determinant term in the KL divergence criterion has limited effect on its performance. When this modified criterion is used to learn a linear combination of diffusion kernels, the resulting optimization problem is convex and thus solutions by gradient descent methods are guaranteed to be globally optimal. A protein typically performs multiple functions. Most existing approaches formulate a separate task for each of the functions and learn them independently. They decouple the functions of proteins and potentially compromise the performance, as the functions of proteins are usually related. We show that our single-task kernel learning formulation based on the KL divergence criterion can be extended to the multi-task case by enforcing all tasks to share a common kernel. The resulting formulation leads to a single optimization problem, which learns multiple functions of proteins simultaneously. Experimental results show that this multiple-task kernel learning in a joint optimization framework keeps competitive prediction performance, while its computational cost is similar to that for a single task, thus dramatically reducing the time complexity.
Methods
We study the problem of protein function prediction from biological networks, which are represented as graphs. For a graph G, the vertices represent proteins and edges characterize the relationship between proteins. In the following discussion, the vertex and edge sets are denoted as V and E, respectively. The total number of proteins in the network is n = |V|. The adjacency matrix A is used to denote the similarity between vertices, where A_{i,j} describes the similarity between vertices v_{i} and v_{j}. The functions of some proteins in the network are already known, and the goal of protein function prediction is to infer the functions of unannotated proteins based on the functions of annotated proteins and the network topology. In particular, for a graph G = (V, E), the vertices in V can be partitioned into a training set and a test set. The functions of proteins in the training set are already known while those of proteins in the test set are unknown. Each edge in E reflects the local similarity between its end vertices. The learning problem is to predict the functions of proteins in the test set based on the label information of the training set and the topology of the graph.
Background and Related Work
Kernel methods are particularly suitable for learning from graph-based data, as they only require the similarities between proteins to be encoded in the kernel matrix. In kernel methods, a symmetric function κ: 𝒳 × 𝒳 → ℝ, where 𝒳 denotes the input space, is called a kernel function if it satisfies Mercer's condition [14]. When used for a finite number of samples in practice, this condition can be stated as follows: for any x_{1}, ..., x_{n} ∈ 𝒳, the Gram matrix K ∈ ℝ^{n × n}, defined by K_{ij} = κ(x_{i}, x_{j}), is positive semidefinite. Any kernel function κ implicitly maps the input set 𝒳 to a high-dimensional (possibly infinite-dimensional) Hilbert space ℋ equipped with the inner product ⟨·, ·⟩ through a mapping φ: 𝒳 → ℋ such that

$$\kappa(x, z) = \langle \phi(x), \phi(z) \rangle. \tag{1}$$
The adjacency matrix A cannot be used directly as a kernel matrix. First, the adjacency matrix contains local similarity information only, which may not be effective for function prediction. Second, the adjacency matrix may not even be positive semidefinite. To derive a kernel matrix from the adjacency matrix, the idea of random walk and network diffusion has been used. The basic idea is to compute the global similarity between vertices v_{i} and v_{j} as the probability of reaching v_{j} at some time point T when the random walker starts from v_{i}. This idea is justified at least to some extent by observing that the random walker tends to meander around its origin, as there is a larger number of paths of length T to its neighbors than to remote vertices [2].
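The finite-step random walk similarity described above can be sketched on a toy example. The path graph, the helper names (`transition_matrix`, `walk_probabilities`) and the use of a row-normalized transition matrix are assumptions made for illustration only; they are not taken from the cited work.

```python
# Toy path graph 0 - 1 - 2 - 3 (hypothetical data).
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

def transition_matrix(adj):
    """Row-normalize the adjacency matrix into one-step walk probabilities."""
    return [[a / sum(row) for a in row] for row in adj]

def matmul(X, Y):
    """Plain matrix product of two lists of lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def walk_probabilities(adj, T):
    """Entry (i, j) is the probability of being at v_j after T steps from v_i."""
    n = len(adj)
    P = transition_matrix(adj)
    out = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(T):
        out = matmul(out, P)
    return out

S2 = walk_probabilities(A, 2)
S3 = walk_probabilities(A, 3)
```

On this toy graph the two-step similarity links vertices 0 and 2 although they share no edge, but it assigns zero similarity to the direct neighbors 0 and 1, and it is not symmetric. Both effects depend on the particular choice of T, which is one way to see why a fixed number of steps is an awkward basis for a kernel.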
To avoid some potential problems such as the choice of value for T and assurance of positive semidefiniteness for the kernel matrix, a random walk with an infinite number of infinitesimally small steps is used instead. It can be formally described as:

$$K = \lim_{s \to \infty} \left( I + \frac{\beta L}{s} \right)^{s} = e^{\beta L}, \tag{2}$$

where β is a parameter for controlling the extent of diffusion and L ∈ ℝ^{n × n} is the graph Laplacian matrix defined as

$$L = A - \operatorname{diag}(Ae), \tag{3}$$

where A is the adjacency matrix, e is the vector of all ones, and diag(Ae) is a diagonal matrix whose diagonal entries are the corresponding row sums of the matrix A. It turns out that for any symmetric matrix L, e^{βL} is always positive definite and thus can be used as a kernel matrix. The diffusion effect of such a kernel can be seen explicitly when it is expanded as [2]:

$$e^{\beta L} = I + \beta L + \frac{\beta^{2}}{2!} L^{2} + \frac{\beta^{3}}{3!} L^{3} + \cdots, \tag{4}$$
where the local information encoded in L is continuously diffused by repeated multiplications. The parameter β in the diffusion kernel controls the extent of diffusion and has an effect similar to that of the scaling parameter in Gaussian kernels. If β is too small, the local information cannot be diffused effectively, resulting in a kernel matrix that only captures local similarities. On the other hand, if it is too large, the neighborhood information will be lost. Furthermore, the optimal value for β is problem- and data-dependent. Thus it is highly desirable to tune the β value adaptively from the data.
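The role of β can be illustrated with a minimal NumPy sketch. It assumes the sign convention L = A − diag(Ae), under which L is negative semidefinite, and it evaluates e^{βL} through the eigendecomposition of L. The function names and the toy path graph are hypothetical.

```python
import numpy as np

def graph_laplacian(A):
    """L = A - diag(A e); the sign convention assumed in this sketch."""
    A = np.asarray(A, dtype=float)
    return A - np.diag(A.sum(axis=1))

def diffusion_kernel(A, beta):
    """K = exp(beta * L), computed from the eigendecomposition of the symmetric L."""
    d, P = np.linalg.eigh(graph_laplacian(A))
    # Scale column j of P by exp(beta * d_j), then multiply by P^T.
    return (P * np.exp(beta * d)) @ P.T

# Toy path graph 0 - 1 - 2 - 3 (hypothetical data).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

K_small = diffusion_kernel(A, 0.1)   # close to the identity: mostly local similarity
K_large = diffusion_kernel(A, 6.0)   # close to a constant matrix: neighborhoods blur
```

With a small β the kernel is dominated by its diagonal and by direct neighbors, while with a large β all entries approach 1/n on this connected graph, which matches the two failure modes described in the text.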
We approach the kernel tuning problem by learning an optimal kernel as a linear combination of pre-specified diffusion kernels constructed with different values of β. This is motivated by the work in [17], where the optimal kernel for SVM, in the form of a linear combination of pre-specified kernels, is learned based on the large margin criteria. In particular, the generalized performance measure based on 1-norm soft margin SVM used in [17] is
where C > 0 is the regularization parameter in SVM, e is the vector of all ones, G(K) is defined by G_{ij}(K) = κ(x_{i}, x_{j})y_{i}y_{j}, and the i-th entry of y, denoted as y_{i}, is the class label (+1 or −1) of the i-th data point x_{i}. Lanckriet et al. [17] showed that when the optimal kernel is restricted to a linear combination of the given p kernels K_{1}, ..., K_{p}, the kernel learning problem can be formulated as a semidefinite program. Furthermore, when the coefficients of the linear combination are constrained to be nonnegative, the kernel learning problem can be formulated as a Quadratically Constrained Quadratic Program [16]. As was shown in [20], an alternative performance measure is the KL divergence between the two zero-mean Gaussian distributions associated with the input and output kernel matrices. We show that when this KL divergence criterion is used to learn a linear combination of diffusion kernels constructed with different values of β, the resulting optimization problem can be solved efficiently. We further show that it can be extended to the multiple-task case. Such integration of multiple tasks into one optimization problem can potentially exploit the complementary information among different tasks.
Diffusion Kernel Learning: The Single-Task Case
We first focus on learning an optimal kernel for a single task, which will then be extended to the multi-task case. The underlying idea is that the Laplacian matrix L, defined in Eq. (3), contains the connectivity information of all vertices in the graph. By adaptively tuning the kernel constructed from L on the training vertices, the entries corresponding to test vertices are expected to be tuned in some optimal way as well. To restrict the search space and improve the generalization ability, we focus on learning an optimal kernel as a linear combination of a set of diffusion kernels constructed with different values of β, indicating different extents of diffusion. In particular, we choose a sequence of values for β as β_{1}, ..., β_{p}, and the corresponding diffusion kernels can be constructed as

$$K_i = e^{\beta_i L}, \qquad i = 1, \ldots, p. \tag{6}$$
We may assume that the kernels defined in Eq. (6) reflect our (weak) prior knowledge about the problem. The goal is to integrate the tuning of the coefficients into the learning process, so that the algorithm can adaptively select an optimal linear combination of the given kernels. Note that it is numerically favorable to normalize the kernels, although this does not affect the results theoretically [14]. We normalize the kernels as follows:
and the optimal kernel can be represented as

$$K = \sum_{i=1}^{p} \alpha_i K_i \tag{8}$$

for a set of nonnegative coefficients α_{1}, ..., α_{p}.
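The construction of the candidate kernels and their nonnegative combination can be sketched as follows. The normalization used here (rescaling each kernel so that its trace equals n) is only an assumption for illustration, as are the function names, the sign convention L = A − diag(Ae), and the toy graph.

```python
import numpy as np

def diffusion_kernels(A, betas):
    """Return exp(beta_i * L) for every beta_i, sharing one eigendecomposition."""
    A = np.asarray(A, dtype=float)
    L = A - np.diag(A.sum(axis=1))          # assumed sign convention
    d, P = np.linalg.eigh(L)
    return [(P * np.exp(beta * d)) @ P.T for beta in betas]

def trace_normalize(K):
    """Hypothetical normalization: rescale K so that trace(K) equals n."""
    return K * (K.shape[0] / np.trace(K))

def combine(kernels, alpha):
    """Weighted sum of kernels with nonnegative coefficients."""
    alpha = np.asarray(alpha, dtype=float)
    if np.any(alpha < 0):
        raise ValueError("coefficients must be nonnegative")
    return sum(w * K for w, K in zip(alpha, kernels))

# Two triangles joined by one edge (hypothetical data).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

betas = [0.1 * i for i in range(1, 6)]
kernels = [trace_normalize(K) for K in diffusion_kernels(A, betas)]
K = combine(kernels, np.full(len(betas), 1.0 / len(betas)))
```

Because every candidate kernel is positive definite and the weights are nonnegative, the combined matrix is again a valid kernel matrix.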
Kullback-Leibler Divergence Formulation
Kernel matrices are positive semidefinite and thus can be used as the covariance matrices for Gaussian distributions. It was shown in [20] that the kernel matrix can be learned by minimizing the Kullback-Leibler (KL) divergence between the zero-mean Gaussian distributions associated with the input and output kernel matrices. In this paper, we focus on learning the optimal coefficients α_{i} from the data automatically by minimizing this KL divergence criterion. As described in [20], the KL divergence between the zero-mean Gaussian distributions defined by the input kernel K_{x} and output kernel K_{y} can be expressed as

$$\mathrm{KL}(N_y \,\|\, N_x) = \frac{1}{2} \operatorname{Tr}\!\left( K_y K_x^{-1} \right) + \frac{1}{2} \log |K_x| - \frac{1}{2} \log |K_y| - \frac{n}{2}, \tag{9}$$

where |·| denotes the matrix determinant, N_{x} and N_{y} denote the zero-mean Gaussian distributions associated with K_{x} and K_{y}, respectively, and n is the number of samples. When the output kernel K_{y} is defined as K_{y} = yy^{T}, the KL divergence in Eq. (9) can be expressed as

$$\mathrm{KL}(N_y \,\|\, N_x) = \frac{1}{2} y^{T} K_x^{-1} y + \frac{1}{2} \log |K_x| + \text{const}, \tag{10}$$

where "const" denotes terms that are independent of K_{x}, and K_{x} is the input kernel matrix, which is defined as a linear combination of the given p kernels as

$$K_x = \sum_{i=1}^{p} \alpha_i K_i + \lambda I. \tag{11}$$

Note that a regularization term, with λ as the regularization parameter, is added to Eq. (11) to deal with the singularity problem of kernel matrices as in [20], and we require \sum_{i=1}^{p} α_{i} = 1 as in multiple kernel learning (MKL) [17]. The optimal coefficients α = [α_{1}, ..., α_{p}]^{T} are computed by minimizing KL(N_{y} ‖ N_{x}). By substituting Eq. (11) into Eq. (10), and removing the constant term, we obtain the following optimization problem:

$$\min_{\alpha} \; a^{T} \left( \sum_{i=1}^{p} \alpha_i K_i + \lambda I \right)^{-1} a + \log \left| \sum_{i=1}^{p} \alpha_i K_i + \lambda I \right| \quad \text{subject to} \quad \sum_{i=1}^{p} \alpha_i = 1, \; \alpha \geq 0, \tag{12}$$

where α = (α_{1}, ..., α_{p})^{T}, α ⩾ 0 denotes that all components of α are nonnegative, and the vector a ∈ ℝ^{n} is the problem-specific target vector, corresponding to the general target in Eq. (9), defined as follows:

$$a_i = \begin{cases} +1 & \text{if } v_i \text{ is in the training set with a positive label}, \\ -1 & \text{if } v_i \text{ is in the training set with a negative label}, \\ 0 & \text{if } v_i \text{ is in the test set}. \end{cases} \tag{13}$$
Note that we assign the label 0 to vertices in the test set so that the target does not bias towards either class. A similar idea has been used in [24] for semi-supervised learning. In multiple kernel learning [17], the sum-to-one constraint on the weights is enforced as in Eq. (12). We present results on both constrained and unconstrained formulations in the experiments. Results show that the constrained formulations achieved better performance than the unconstrained ones.
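The target vector and a direct evaluation of the criterion can be sketched as follows. The objective is assumed here to take the form a^T K_x^{-1} a + log|K_x| with K_x = Σ α_i K_i + λI; the function names, the two stand-in kernels and the labels are hypothetical.

```python
import numpy as np

def target_vector(labels, train_idx, n):
    """a_i = label (+1 or -1) for training vertices and 0 for test vertices."""
    a = np.zeros(n)
    for i in train_idx:
        a[i] = labels[i]
    return a

def kl_objective(alpha, kernels, a, lam):
    """Assumed objective: a^T Kx^{-1} a + log|Kx|, Kx = sum_i alpha_i K_i + lam * I."""
    Kx = sum(w * K for w, K in zip(alpha, kernels)) + lam * np.eye(len(a))
    _, logdet = np.linalg.slogdet(Kx)       # Kx is positive definite, so the sign is +1
    return float(a @ np.linalg.solve(Kx, a) + logdet)

# Hypothetical three-vertex example: vertices 0 and 1 are labeled, vertex 2 is a test vertex.
labels = {0: 1, 1: -1, 2: 1}
a = target_vector(labels, train_idx=[0, 1], n=3)

kernels = [np.eye(3), np.eye(3) + 0.5 * np.ones((3, 3))]   # stand-in kernel matrices
value_identity = kl_objective([1.0, 0.0], kernels, a, lam=0.0)
value_mixed = kl_objective([0.5, 0.5], kernels, a, lam=1e-3)
```

Evaluated this way, each objective call costs a linear solve and a determinant of an n × n matrix, which is the expense that the spectral reformulation in the next part of the text avoids.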
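Recall that the graph Laplacian matrix L is symmetric, so its eigendecomposition can be expressed as

$$L = P D P^{T}, \tag{14}$$

where

$$D = \operatorname{diag}(d_1, \ldots, d_n)$$

is the diagonal matrix of eigenvalues and P ∈ ℝ^{n × n} is the orthogonal matrix of corresponding eigenvectors. According to the definition of the function of matrices [25], we have

$$e^{\beta_i L} = P D_i P^{T}, \tag{15}$$

where

$$D_i = e^{\beta_i D} = \operatorname{diag}\!\left( e^{\beta_i d_1}, \ldots, e^{\beta_i d_n} \right). \tag{16}$$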
The main result is summarized in the following theorem:
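Theorem 1. Given a set of p diffusion kernels, as defined in Eq. (7), the problem of learning the optimal kernel matrix, in the form of a convex combination of the given p kernel matrices as in Eq. (12), can be formulated as the following optimization problem:

$$\min_{\alpha} \; \sum_{j=1}^{n} \left( \frac{b_j^{2}}{g_j} + \log g_j \right) \quad \text{subject to} \quad \sum_{i=1}^{p} \alpha_i = 1, \; \alpha \geq 0,$$

where b = (b_{1}, ..., b_{n})^{T} = P^{T}a, g_{j} is the j-th diagonal entry of the diagonal matrix G, defined as

$$G = \sum_{i=1}^{p} \alpha_i D_i + \lambda I,$$

and D_{i} is the diagonal matrix defined in Eq. (16).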
Proof. The first term in Eq. (12) can be written as:
where the third equality follows from the property of the trace, that is,
Similarly, the second term in Eq. (12) can be written as:
By combining the first term in Eq. (21) and the second term in Eq. (22), we prove the theorem.
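The equivalence stated in the theorem can be checked numerically. The sketch below assumes unnormalized kernels K_i = e^{β_i L}, the sign convention L = A − diag(Ae), and an objective of the form a^T K_x^{-1} a + log|K_x|; a scalar normalization of the kernels would only rescale the diagonal factors. All names and the toy data are hypothetical.

```python
import numpy as np

# Two triangles joined by one edge (hypothetical data).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = A - np.diag(A.sum(axis=1))
d, P = np.linalg.eigh(L)                     # L = P diag(d) P^T

betas = np.array([0.2, 1.0, 3.0])
alpha = np.array([0.5, 0.3, 0.2])
lam = 1e-3
a = np.array([1.0, 1.0, 0.0, 0.0, -1.0, -1.0])

def direct_objective(alpha, betas, a, lam):
    """Matrix form: build Kx explicitly, then solve and take the log-determinant."""
    Kx = sum(w * (P * np.exp(beta * d)) @ P.T for w, beta in zip(alpha, betas))
    Kx = Kx + lam * np.eye(len(a))
    _, logdet = np.linalg.slogdet(Kx)
    return float(a @ np.linalg.solve(Kx, a) + logdet)

def spectral_objective(alpha, betas, b, lam):
    """Scalar form: g_j = sum_i alpha_i exp(beta_i d_j) + lam, b = P^T a."""
    g = np.exp(np.outer(d, betas)) @ alpha + lam
    return float(np.sum(b ** 2 / g + np.log(g)))

f_direct = direct_objective(alpha, betas, a, lam)
f_spectral = spectral_objective(alpha, betas, P.T @ a, lam)
```

The two values agree up to rounding error, while the spectral form needs only vector operations of length n once the eigendecomposition of L has been computed.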
The formulation in Theorem 1 is a nonlinear optimization problem. It involves a nonlinear objective function with p variables and linear equality and inequality constraints. Due to the presence of the log term in the objective, it is a nonconvex problem and a globally optimal solution is not guaranteed to be found. However, our experimental results show that this formulation consistently produces superior performance.
Convex Formulation
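The optimization problem in Theorem 1 is not convex. Previous studies [22,23] indicate that the removal of the log-determinant term in the KL divergence criterion in Eq. (12) has a limited effect on the performance. This leads to the following optimization problem:

$$\min_{\alpha} \; a^{T} \left( \sum_{i=1}^{p} \alpha_i K_i + \lambda I \right)^{-1} a \quad \text{subject to} \quad \sum_{i=1}^{p} \alpha_i = 1, \; \alpha \geq 0. \tag{25}$$

Following Theorem 1, we can show that the optimization problem above can be simplified as

$$\min_{\alpha} \; \sum_{j=1}^{n} \frac{b_j^{2}}{g_j} \quad \text{subject to} \quad \sum_{i=1}^{p} \alpha_i = 1, \; \alpha \geq 0, \tag{26}$$

where g_{j} and b are defined as in Theorem 1.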
The optimization problem in Eq. (26) is convex, and thus any locally optimal solution is also globally optimal. Numerical experiments indicate that a simple gradient descent algorithm converges very quickly to the optimal solution. Furthermore, the prediction performance of this convex formulation is comparable to that of the formulation proposed in Theorem 1. This convex formulation shares some similarities with the one in [26], where a set of Laplacian matrices derived from multiple networks is combined.
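A minimal sketch of a gradient-based solver for the convex problem is given below. It is not the solver used in the experiments: it swaps in projected gradient descent with a backtracking line search and a sort-based Euclidean projection onto the simplex. The objective is assumed to be Σ_j b_j²/g_j with g = Eα + λ, where column i of E holds the eigenvalues of the i-th kernel; the trace normalization, the function names and the toy data are assumptions.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {x : x >= 0, sum(x) = 1} (sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def convex_objective(alpha, E, b, lam):
    """sum_j b_j^2 / g_j with g = E @ alpha + lam."""
    return float(np.sum(b ** 2 / (E @ alpha + lam)))

def learn_weights(E, b, lam, iters=500):
    """Projected gradient descent with backtracking, started from equal weights."""
    alpha = np.full(E.shape[1], 1.0 / E.shape[1])
    for _ in range(iters):
        g = E @ alpha + lam
        grad = -(E.T @ (b ** 2 / g ** 2))
        f_old = convex_objective(alpha, E, b, lam)
        step, accepted = 1.0, False
        while step > 1e-12:
            cand = project_simplex(alpha - step * grad)
            # Sufficient-decrease test along the projected step.
            if convex_objective(cand, E, b, lam) <= f_old + 1e-4 * grad @ (cand - alpha):
                accepted = True
                break
            step *= 0.5
        if not accepted:
            break
        alpha = cand
    return alpha

# Two triangles joined by one edge (hypothetical data).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
d, P = np.linalg.eigh(A - np.diag(A.sum(axis=1)))

betas = np.array([0.1, 0.5, 1.0, 2.0, 4.0])
E = np.exp(np.outer(d, betas))               # E[j, i] = exp(beta_i * d_j)
E *= len(d) / E.sum(axis=0)                  # assumed trace normalization of each kernel
b = P.T @ np.array([1.0, 1.0, 0.0, 0.0, -1.0, -1.0])
lam = 1e-3
alpha = learn_weights(E, b, lam)
```

Every accepted step keeps the weights on the simplex and does not increase the objective, so the returned weights are at least as good as the equal-weight starting point under this criterion.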
Diffusion Kernel Learning: The Multi-Task Case
It is known that proteins often perform multiple functions, which are typically related. Many existing function prediction approaches decouple multiple functions and formulate each function prediction problem as a separate binary-class classification problem. Such methods do not consider the relationship among the multiple functions of a protein and potentially compromise the overall performance.
We propose to extend our formulation for the single-task case to deal with multiple tasks simultaneously. In particular, we formulate a single optimization problem for the simultaneous prediction of multiple functions for a protein. The joint learning of multiple functions can potentially exploit the relationship among functions and improve the performance. In terms of computational complexity, the proposed joint optimization problem is shown to be comparable to that of the single-task formulation.
A key observation is that when the prespecified diffusion kernels are computed from the same biological network with different values of β, the graph Laplacian matrices are the same for all tasks. By enforcing all tasks to share a common linear combination of kernels, we obtain the following joint optimization problem:
where a^{(k)} ∈ ℝ^{n} for k = 1, ..., t is the vector of class labels for the k-th task as in Eq. (13), and t is the number of tasks. Note that all t tasks are related in this joint formulation by enforcing a common kernel matrix for all tasks. The objective function in Eq. (27) uses an equal weight for all tasks. If some tasks are known to be more important than others, a more general objective function with varying weights for different tasks may be used instead. Following Theorem 1, we can simplify the optimization problem in Eq. (27), as summarized in the following theorem:
Theorem 2. Given a set of p diffusion kernels, as defined in Eq. (7), the problem of optimal multi-task kernel learning, in the form of a convex combination of the given p kernels, can be formulated as the following optimization problem:
where g_{j} is defined as in Theorem 1, b_{k} = P^{T}a^{(k)}, a^{(k)} is defined as in Eq. (13) for the k-th task, and t is the total number of tasks.
Proof. The first term in Eq. (27) can be rewritten as
Similarly, the second term can be rewritten as
The detailed intermediate steps of derivation are the same as those in the proof of Theorem 1 and thus are omitted. By combining these two terms together, we prove the theorem.
The optimization problem in Theorem 2 is not convex. Similar to the single-task case, the log-determinant term in Eq. (27) may be removed, which leads to the following convex optimization problem:
Experimental evidence shows that this convex optimization problem is comparable to the formulation in Theorem 2 in prediction performance.
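One way to see why the joint problem costs about as much as a single task is sketched below. It assumes that the convex multi-task objective is the sum over tasks of Σ_j b_{kj}²/g_j with a shared g = Eα + λ; under that assumption the task-dependent quantities can be pooled once before optimization. The random stand-in data and the function names are hypothetical.

```python
import numpy as np

def multitask_objective_naive(alpha, E, B, lam):
    """Sum the single-task objective over all tasks; column k of B is P^T a^(k)."""
    g = E @ alpha + lam
    return float(sum(np.sum(B[:, k] ** 2 / g) for k in range(B.shape[1])))

def multitask_objective_pooled(alpha, E, c, lam):
    """Same value, using c_j = sum_k B[j, k]^2 computed once beforehand."""
    return float(np.sum(c / (E @ alpha + lam)))

rng = np.random.default_rng(0)
n, p, t = 8, 4, 36
E = rng.uniform(0.1, 1.0, size=(n, p))       # stand-in for the kernel eigenvalues
B = rng.normal(size=(n, t))                  # stand-in for the projected label vectors
c = (B ** 2).sum(axis=1)                     # pooled once; independent of alpha
alpha = np.full(p, 1.0 / p)

f_naive = multitask_objective_naive(alpha, E, B, 1e-3)
f_pooled = multitask_objective_pooled(alpha, E, c, 1e-3)
```

After pooling, each objective evaluation has the same cost as in the single-task case, regardless of the number of tasks t.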
Results and Discussion
We evaluate the performance of the proposed formulations on two benchmark data sets, and compare them with relevant methods, including the Neighbor Counting approach [4] and the FS-Weighted Averaging approach [5,6]. We construct 60 diffusion kernels from each data set using different values for β, and the proposed formulations are applied to compute a linear combination of the pre-computed kernels. The performance of the obtained kernel is compared with that of the individual kernels. To see the relative performance of the objective functions, we also use the 1-norm soft margin SVM criterion, proposed in [17], to compute the linear combination of kernels, and the results are presented. All of the formulations proposed in this paper are solved using the MATLAB [27] function fmincon, which employs the sequential quadratic programming method [28]. The QCQP formulation for optimizing the 1-norm soft margin SVM criterion is solved using the MOSEK [29] software package. After the kernels are computed, they are fed into SVM for classification, and the LIBSVM [30] software package is used in the experiments. All of the experiments are performed on a PC with an Intel Pentium D 820 2.8 GHz CPU and 2 GB RAM.
In the following experiments, a total of 60 diffusion kernels are precomputed and the values of β used are β_{i }= 0.1 × i, for i = 1, ..., 60. In order to investigate the performance of each individual kernel, we use each kernel for the classification and compute the average Receiver Operating Characteristic (ROC) values over all of the tasks. The ROC value produced by the best averaged individual kernel is used as a baseline. It is called rBaseline as all tasks are restricted to use the same kernel. We further relax the requirement that all tasks use the same kernel and compute the sequence of ROC values achieved by the best individual kernel for each of the tasks. This is considered another baseline called uBaseline as the kernel used by each task is unrestricted. Note that the kernel matrices for both rBaseline and uBaseline represent the single best candidate kernel in the ideal case that the labels of test data are known, and their performance is not guaranteed in practice. In contrast, the kernel matrices computed by the proposed formulations are the optimal kernel matrices in the form of linear combination of the given candidate kernel matrices. In order to evaluate the effectiveness of the weights obtained by the proposed formulations, we assign each kernel the same weight and compute the performance of the combined kernel. It is called eBaseline as all kernel matrices have an equal weight.
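The baselines described above can be sketched from a table of ROC values with one row per candidate kernel and one column per task. The table below is made up for illustration, and the helper names are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

def r_baseline(roc):
    """Best mean ROC over tasks achieved by one kernel shared by all tasks."""
    return max(mean(row) for row in roc)

def u_baseline(roc):
    """Per-task best ROC when every task may pick its own kernel."""
    n_tasks = len(roc[0])
    return [max(row[k] for row in roc) for k in range(n_tasks)]

def e_baseline_weights(p):
    """Equal weights for the p candidate kernels; the combined kernel is then classified as usual."""
    return [1.0 / p] * p

# Hypothetical ROC table: 3 kernels (rows) x 3 tasks (columns).
roc = [
    [0.90, 0.60, 0.70],
    [0.80, 0.75, 0.65],
    [0.70, 0.80, 0.60],
]

r_value = r_baseline(roc)
u_values = u_baseline(roc)
weights = e_baseline_weights(len(roc))
```

Since uBaseline may select a different kernel for each task, its mean over tasks can never be lower than rBaseline; both are computed with knowledge of the test labels, as noted in the text.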
For convenience of presentation, the formulations proposed in Theorem 1, Eq. (26), Theorem 2, and Eq. (34) are denoted as DKL_{KL}, DKL, mDKL_{KL}, and mDKL, respectively. For DKL_{KL} and mDKL_{KL}, we also propose to remove the constraints in their optimization problems and the resulting formulations are denoted as and , respectively. (See the caption of Table 1 for detailed description.) The method based on optimizing the 1-norm soft margin SVM criterion by solving the QCQP proposed in [17] is denoted as SM1. The six proposed formulations are summarized in Table 1.
Table 1. Summary of the proposed formulations.
Experiments on the Ligand Data Set
The Ligand data set was derived by Vert and Kanehisa [31] from the Ligand database of chemical reactions in biological pathways [32]. It contains a graph reflecting the interactions between proteins and the function information for them. The graph is a yeast biological network in which a path between vertices implies a possible series of reactions catalyzed by proteins along it. The numbers of vertices and edges in this graph are 753 and 7860, respectively. For the functions of proteins, the functional categories of the MIPS Comprehensive Yeast Genome Database (CYGD) [33] are considered as the gold standard. These categories are not mutually exclusive, and each protein may have multiple functions. There are 36 different functions considered for this data set.
Comparison of ROC Values
We use the ROC as the performance measure and the λ value is fixed to 10^{6} in the experiments. Our experimental results show that the algorithms are not sensitive to the value of λ, as long as it is neither too large nor too small. Figure 1 plots the number of tasks with ROC value above a threshold for all methods. The average ROC values achieved by all methods are also summarized in Table 2. In order to test statistical significance, we also compute the p-values of the Wilcoxon signed-rank test and the results are reported in Table 3.

We can observe that mDKL achieves the best performance among all methods. All the proposed formulations except outperform the three baseline methods. This implies that the computed linear combination of kernels can potentially exploit the complementary information in different kernels and thus improve performance. The ROC value achieved by SM1 is lower than those of the three baseline methods, implying that the SVM criterion is less effective for such tasks. Note that the SM1 criterion also uses information from unlabeled data, but in a weak form. The formulation achieves a ROC value lower than the three baseline methods. This shows that the constraints have important normalizing effects and cannot be removed. By comparing the relative performance of formulations with and without the log term, we can conclude that removing this term usually does not affect the performance. Another important observation is that mDKL and mDKL_{KL} outperform DKL and DKL_{KL}, implying that constraining the multiple tasks to share a common kernel does not degrade the performance if the kernel used is a linear combination of kernels obtained by the proposed formulations. In contrast, if the kernel used is a single kernel, this restriction will degrade the performance, as illustrated by the relative performance of rBaseline and uBaseline.

For the eBaseline method, it can be observed that, except for , all of the other proposed formulations outperform it. This illustrates that our formulations can compute an optimal kernel matrix by assigning different weights to the candidate kernel matrices. We can observe from Table 3 that the differences between the performance of the two baseline methods (rBaseline and eBaseline) and that of DKL and mDKL are statistically significant.

All diffusion kernel-based approaches are competitive with the Neighbor Counting approach [4] and the FS-Weighted Averaging approach [5,6]. Neighbor Counting and FS-Weighted Averaging use local information for the prediction, more specifically the level-1 neighborhood (Neighbor Counting) and both level-1 and level-2 neighborhoods (FS-Weighted Averaging). The experimental results show the effectiveness of capturing the long-range relationships (global information) between proteins in the network in diffusion kernels [15].
Table 2. Mean ROC values and execution time (in seconds) of various methods on the Ligand Data Set.
Table 3. p-values obtained from the Wilcoxon signed-rank test comparing DKL and mDKL with other formulations for the Ligand data set.
Figure 1. Comparison of ROC values for various algorithms on the Ligand Data Set. The horizontal axis represents the ROC values and the vertical axis is the number of tasks with ROC values above the corresponding horizontal axis value.
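The quantities plotted in this kind of figure can be sketched as follows, assuming that the "ROC value" of a task is the area under the ROC curve, computed here from pairwise comparisons of positive and negative scores with ties counted as one half. The function names and the example numbers are hypothetical.

```python
def roc_area(scores, labels):
    """Area under the ROC curve: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

def tasks_above(roc_values, thresholds):
    """For each threshold, count the tasks whose ROC value exceeds it."""
    return [sum(1 for r in roc_values if r > thr) for thr in thresholds]

# Hypothetical decision values and labels for one task.
area_perfect = roc_area([0.9, 0.8, 0.3, 0.1], [1, 1, -1, -1])
area_partial = roc_area([0.9, 0.3, 0.8, 0.1], [1, 1, -1, -1])

# Hypothetical per-task ROC values for one method.
counts = tasks_above([0.90, 0.70, 0.55], [0.5, 0.6, 0.8])
```

A method whose curve of counts lies above another's for every threshold has more tasks exceeding each ROC level, which is how the plot is read.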
Figure 2 plots the average ROC values for the 60 kernels (the maximum mean ROC value is used in rBaseline) and Figure 3 plots the best ROC values for the 36 tasks. We can observe that for tasks 29 and 33, the best ROC values are small. This implies that all the kernels perform poorly for these two tasks. To compare the performance of the proposed formulations with that of the baseline method graphically, we plot in Figure 4 the ROC values obtained by the proposed formulations against those of uBaseline using scatter plots. We can observe that there are two points below the 45-degree line in each plot. Those two points correspond to tasks 29 and 33, which are difficult to classify by all methods. As most points in the plots are above the 45-degree line, we can conclude that the proposed formulations outperform uBaseline on most tasks.
Figure 2. Mean ROC values over 36 tasks for each kernel on the Ligand Data Set (the kernel with the maximum mean ROC value is used in rBaseline). The horizontal axis denotes the β values used to build the corresponding kernel and the vertical axis is the mean ROC value.
Figure 3. Best ROC values for tasks achieved by the best kernel (uBaseline) on the Ligand Data Set. The horizontal axis represents the tasks and the vertical axis is the corresponding best ROC value.
Figure 4. Comparison of the relative performance of the proposed formulations with that of uBaseline on the Ligand Data Set. The horizontal axis represents uBaseline and the vertical axis corresponds to DKL, DKL_{KL}, mDKL, mDKL_{KL}. Each point in the scatter plots corresponds to ROC values produced by the compared methods on the same task.
Comparison of Execution Time
In order to compare the efficiency of various kernel learning methods, we list in Table 2 the execution time of the compared methods. It can be observed that all methods based on multiple tasks are more efficient than their single-task counterparts. In particular, the execution time of mDKL is roughly 1/36 of that of DKL, which is consistent with our theoretical analysis. In general, convex formulations are more efficient than their original nonconvex formulations, and the optimization problems with the constraints removed take a longer time to converge. Taking the performance into account, DKL and mDKL may be the best choices in practice.
Stability Test
In order to obtain a robust performance estimate for the various methods, we randomly partition the data set into a training set and a test set ten times, and the average ROC values and standard deviations across splittings are reported in Table 4. Compared with the results in Table 2, we can see that the relative performance of each method in these two tables is very similar. In particular, mDKL and mDKL_{KL} achieve the best overall performance. Except for the two unconstrained formulations and , all of the other proposed formulations achieve higher ROC values than the three baseline methods. It is worth noting that the performance of uBaseline and rBaseline is obtained by using the labels of both the training and test data, and such performance is not guaranteed in practice when only the labels of the training data are used.
Table 4. Average ROC values and the corresponding standard deviations over 11 splittings on the Ligand Data Set. One of the splittings was specified by the contributor of the data and the remaining ten splittings are randomly generated.
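The stability protocol described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the training fraction of 0.5, the function names, and the toy scorer `toy_evaluate` (which stands in for "train a method on the training set and return its mean ROC value on the test set") are all hypothetical and are not taken from our experiments.

```python
import random
import statistics


def random_split(n_items, train_fraction, rng):
    """Return (train_indices, test_indices) for one random partition."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    n_train = int(round(train_fraction * n_items))
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def summarize_over_splits(n_items, n_splits, train_fraction, evaluate, seed=0):
    """Mean and sample standard deviation of a score over random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        train, test = random_split(n_items, train_fraction, rng)
        scores.append(evaluate(train, test))
    return statistics.mean(scores), statistics.stdev(scores)


def toy_evaluate(train, test):
    """Hypothetical scorer; a real one would train and test a kernel method."""
    return 0.80 + 0.001 * (sum(test) % 10)


mean_roc, std_roc = summarize_over_splits(
    n_items=100, n_splits=10, train_fraction=0.5, evaluate=toy_evaluate
)
print(f"{mean_roc:.3f} +/- {std_roc:.3f}")
```

Fixing the seed makes the ten random splittings reproducible, so that every compared method can be scored on exactly the same partitions.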
Experiments on the von Mering Data Set
The von Mering data set was created by von Mering et al. [34] from protein-protein interactions identified via six different methods. It contains a graph consisting of 2617 vertices (proteins) and 11855 edges. There are 76 different functions (tasks) associated with the proteins in the graph. The performance of different methods is reported in Figure 5. Two baseline methods, rBaseline and uBaseline, constructed in exactly the same way as those for the Ligand data set, are used, and their performance is summarized in Figure 6 and Figure 7, respectively. The value for is again set to 10^{6} in the experiments. Figure 8 compares the relative performance of the proposed formulations with that of the uBaseline graphically.
Figure 5. Comparison of ROC values for various algorithms on the von Mering Data Set. The horizontal axis represents the ROC values and the vertical axis is the number of tasks with ROC values above the corresponding horizontal axis value.
Figure 6. Mean ROC values over 76 tasks for each kernel on the von Mering Data Set. The horizontal axis denotes the β values used to build the corresponding kernel and the vertical axis is the mean ROC values.
Figure 7. Best ROC values for different tasks achieved by different kernels on the von Mering Data Set. The horizontal axis represents the tasks and the vertical axis is the corresponding best ROC values.
Figure 8. Comparison of the relative performance of the proposed formulations with that of uBaseline on the von Mering Data Set. The horizontal axis represents uBaseline and the vertical axis corresponds to DKL, DKL_{KL}, mDKL, mDKL_{KL}. Each point in the scatter plots corresponds to ROC values produced by the compared methods on the same task.
Comparison of ROC Values
We use the ROC values of each method to compare their relative performance. Similar to Figure 1 for the Ligand data set, Figure 5 plots, for each of the compared methods, how the number of tasks with ROC value above a threshold changes as the threshold varies. For ease of comparison, Table 5 also lists the average ROC values achieved by the compared methods. Similarly, the p-values of the Wilcoxon signed test for this data set are reported in Table 6. As the SM1 formulation requires excessive storage and computational time for this relatively large data set, we were not able to obtain its result in this experiment. From these results we can observe that mDKL and mDKL_{KL} achieve the best performance. In general, the performance of DKL, DKL_{KL}, mDKL, and mDKL_{KL} is very close. All but one of the proposed formulations perform better than the three baseline methods. The difference between DKL and DKL_{KL}, as well as the difference between mDKL and mDKL_{KL}, is very small, which further confirms that the removal of the log term does not affect the performance of the algorithms much. The formulations with the constraints removed show the lowest performance among the proposed formulations. Similar to the case for the Ligand data set, we conclude that constraining the multiple tasks to share a common kernel does not degrade the performance if the kernel used is a linear combination of kernels obtained by the proposed formulations. In contrast, if the kernel used is a single kernel, this restriction will degrade the performance, as illustrated by the relative performance of rBaseline and uBaseline. As for eBaseline, in which all of the kernel matrices are assigned the same weight, we can observe from Table 5 that all of our proposed formulations achieve higher ROC values.
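The threshold curves of Figure 5 and the mean ROC values of Table 5 are simple summaries of one ROC value per task. The following minimal Python sketch shows how both can be computed; the function name and the ROC values are made up for illustration and do not correspond to any method or result in this paper.

```python
def tasks_above(roc_by_task, thresholds):
    """For each threshold, count the tasks whose ROC value exceeds it."""
    return [sum(1 for r in roc_by_task if r > t) for t in thresholds]


# Made-up ROC values for five tasks, for illustration only.
method_a = [0.91, 0.84, 0.77, 0.95, 0.68]
method_b = [0.88, 0.80, 0.79, 0.90, 0.60]
thresholds = [0.6, 0.7, 0.8, 0.9]

counts_a = tasks_above(method_a, thresholds)
counts_b = tasks_above(method_b, thresholds)
mean_a = sum(method_a) / len(method_a)
mean_b = sum(method_b) / len(method_b)

print(counts_a)  # [5, 4, 3, 2]
print(counts_b)  # [4, 4, 2, 0]
```

A method whose curve lies above another at every threshold has at least as many tasks exceeding each ROC level, which is the visual comparison that such a plot supports.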
We can observe from Table 6 that the difference between the performance of all of the three baselines and that of DKL and mDKL is statistically significant. We can again observe that all diffusion kernel based approaches are competitive with the Neighbor Counting approach and the FS-Weighted Averaging approach.
Table 5. Mean ROC values and execution time (in seconds) of various methods on the von Mering Data Set.
Table 6. p-values obtained from Wilcoxon signed test comparing DKL and mDKL with other formulations for the von Mering data set.
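For readers unfamiliar with paired tests on per-task ROC values, the sketch below computes an exact two-sided p-value, assuming that the test used is the standard Wilcoxon signed-rank test. It enumerates all sign assignments and is therefore only practical for a handful of tasks; for 76 tasks a library routine such as `scipy.stats.wilcoxon` would be the usual choice. The function names and the data are hypothetical.

```python
from itertools import product


def signed_ranks(x, y):
    """Ranks of |x - y| (average ranks for ties); zero differences dropped."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return diffs, ranks


def wilcoxon_exact(x, y):
    """Return (W+, exact two-sided p-value) for paired samples (small n only)."""
    diffs, ranks = signed_ranks(x, y)
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    total = sum(ranks)
    observed = abs(w_plus - total / 2)
    hits = 0
    # Under the null hypothesis every sign pattern is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for s, r in zip(signs, ranks) if s)
        if abs(w - total / 2) >= observed - 1e-12:
            hits += 1
    return w_plus, hits / 2 ** len(ranks)


# Made-up ROC values for six tasks: a combined kernel versus a baseline.
combined = [0.91, 0.84, 0.77, 0.95, 0.68, 0.88]
baseline = [0.88, 0.80, 0.76, 0.90, 0.62, 0.81]
w_plus, p_value = wilcoxon_exact(combined, baseline)
print(w_plus, p_value)  # 21.0 0.03125
```

In this toy example the combined kernel wins on every task, so only the two most extreme sign patterns out of 64 are at least as extreme as the observed one.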
Figure 8 presents the scatter plots of four proposed formulations with respect to uBaseline. Most points lie above the 45-degree line, which implies that the linear combination of kernels is better than the ideally best individual kernel. In general, DKL_{KL}, DKL, mDKL_{KL}, and mDKL perform better than uBaseline, which is also confirmed by the mean ROC values listed in Table 5.
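The reading of such a scatter plot can be made quantitative by counting the tasks whose point lies strictly above the 45-degree line. The sketch below does this for made-up ROC values; the function name and the numbers are illustrative assumptions only.

```python
def fraction_above_diagonal(baseline, method):
    """Fraction of tasks whose point (baseline, method) lies above y = x."""
    if len(baseline) != len(method):
        raise ValueError("both methods need one ROC value per task")
    above = sum(1 for b, m in zip(baseline, method) if m > b)
    return above / len(baseline)


# Made-up ROC values for five tasks, for illustration only.
ubaseline = [0.88, 0.80, 0.79, 0.90, 0.62]
combined = [0.91, 0.84, 0.77, 0.95, 0.68]

share = fraction_above_diagonal(ubaseline, combined)
print(share)  # 0.8
```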
Comparison of Execution Time
Table 5 also lists the execution time of the various kernel learning methods. Similar conclusions can be drawn from this table as from the execution times on the Ligand data set. All methods based on multiple tasks are more efficient than their single-task counterparts. By comparing the results in Table 2 and Table 5, we can also observe that as the number of tasks increases, the time difference between methods based on multiple tasks and those based on single tasks increases too. Thus, the formulations based on multiple tasks are preferred when the number of tasks is large.
Stability Test
Similar to the Ligand data set, we generate ten random splittings of the data into training and test sets and report the average ROC values and standard deviations in Table 7. By comparing with results in Table 5, we can see that the relative performance of each method is similar in both tables. All of the proposed formulations outperform eBaseline.
Table 7. Average ROC values and the corresponding standard deviations over 11 splittings on the von Mering Data Set. One of the splittings was specified by the contributor of the data and the remaining ten splittings are randomly generated.
Conclusion
In this paper, we address the issue of learning an optimal diffusion kernel based on the KL divergence criterion for protein function prediction. By exploiting the special structure of the diffusion kernel, we show that this KL divergence based kernel learning problem can be formulated as a simple optimization problem, which can be solved efficiently. We also extend the formulation to the multi-task case where we predict multiple functions of a protein simultaneously.
We have conducted experiments on two benchmark data sets. Our results show that the performance of the linearly combined diffusion kernel is better than that of every single candidate diffusion kernel. Results also show that the removal of the log term in the KL divergence criterion does not degrade recognition performance, while it leads to a reduced computational cost. When the number of tasks is large, the algorithms based on multiple tasks are favored due to their competitive recognition performance and small computational costs. One possible extension is to incorporate the learning of the regularization parameter in the proposed formulations as in [17]. The difference between the proposed learning framework and that in [17] is that our formulations require the eigenvectors of the candidate kernel matrices to be the same. Thus the proposed formulations may not be applicable to heterogeneous data integration. We plan to apply the proposed algorithms to the analysis of other graph-based biological data.
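The shared-eigenvector requirement can be illustrated with a short derivation. As an assumption for this sketch, take each candidate kernel to be a diffusion kernel of the same graph Laplacian L with eigendecomposition L = U Λ U^T, and let θ_1, ..., θ_p denote the weights of the convex combination; the symbols θ, p, U, Λ and the sign convention in the exponent are chosen here for illustration.

```latex
K_i = e^{-\beta_i L} = U\, e^{-\beta_i \Lambda}\, U^{\top}, \qquad i = 1, \dots, p,
\\
K(\theta) = \sum_{i=1}^{p} \theta_i K_i
          = U \Big( \sum_{i=1}^{p} \theta_i\, e^{-\beta_i \Lambda} \Big) U^{\top},
\qquad \theta_i \ge 0,\ \sum_{i=1}^{p} \theta_i = 1 .
```

Under this assumption the combined kernel keeps the eigenvectors U, and the weights only reshape its eigenvalues. Kernels derived from different data sources generally do not share eigenvectors, which is why the simplification does not carry over to heterogeneous data integration.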
Authors' contributions
LS designed the methodology, implemented programs, and participated in manuscript preparation. SJ derived the KL divergence formulation, and drafted the manuscript. JY originally conceived the project, guided the implementation, and drafted the manuscript. All authors have read and approved the final manuscript.
Acknowledgements
This research is sponsored in part by Arizona State University and by the National Science Foundation under Grant No. IIS-0612069.
References

1. Pandey G, Kumar V, Steinbach M: Computational Approaches for Protein Function Prediction: A Survey. In Tech Rep TR 06-028, Department of Computer Science and Engineering. University of Minnesota, Twin Cities, MN; 2006.
2. Schölkopf B, Tsuda K, Vert JP: Kernel Methods in Computational Biology. Cambridge, MA: MIT Press; 2004.
3. Hishigaki H, Nakai K, Ono T, Tanigami A, Takagi T: Assessment of prediction accuracy of protein function from protein-protein interaction data. Yeast 2001, 18:523-531.
4. Schwikowski B, Uetz P, Fields S: A network of protein-protein interactions in yeast. Nature Biotechnology 2000, 18:1257-1261.
5. Chua HN, Sung WK, Wong L: Exploiting Indirect Neighbours and Topological Weight to Predict Protein Function from Protein-Protein Interactions. Bioinformatics 2006, 22:1623-1630.
6. Chua HN, Sung WK, Wong L: Using Indirect Protein Interactions for the Prediction of Gene Ontology Functions. BMC Bioinformatics 2007, 8:S8.
7. Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M: Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 2005, 21:302-310.
8. Weston J, Elisseeff A, Zhou D, Leslie CS, Noble WS: Protein ranking: From local to global structure in the protein similarity network. Proc Natl Acad Sci 2004, 101:6559-6563.
9. Vazquez A, Flammini A, Maritan A: Global protein function prediction from protein-protein interaction networks. Nature Biotechnology 2003, 21:697-700.
10. Karaoz U, Murali TM, Letovsky S, Zheng Y, Ding C, Cantor CR, Kasif S: Whole-genome annotation by using evidence integration in functional-linkage networks. Proc Natl Acad Sci 2004, 101:2888-2893.
11. Ben-Hur A, Noble WS: Kernel methods for predicting protein-protein interactions. Bioinformatics 2005, 21(Suppl 1):i38-i46.
12. Roth V, Fischer B: Improved functional prediction of proteins by learning kernel combinations in multilabel settings. BMC Bioinformatics 2007, 8:S12.
13. Tsuda K, Noble WS: Learning kernels from biological networks by maximizing entropy. Bioinformatics 2004, 20:326-333.
14. Schölkopf B, Smola AJ: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press; 2002.
15. Kondor RI, Lafferty JD: Diffusion Kernels on Graphs and Other Discrete Structures.
16. Boyd S, Vandenberghe L: Convex Optimization. Cambridge: Cambridge University Press; 2004.
17. Lanckriet G, Cristianini N, Bartlett P, Ghaoui LE, Jordan MI: Learning the Kernel Matrix with Semidefinite Programming.
18. Lanckriet G, Bie TD, Cristianini N, Jordan M, Noble W: A statistical framework for genomic data fusion. Bioinformatics 2004, 20:2626-2635.
19. Kullback S, Leibler RA: On Information and Sufficiency. Annals of Mathematical Statistics 1951, 22:79-86.
20. Lawrence ND, Sanguinetti G: Matching kernels through Kullback-Leibler divergence minimisation. In Technical Report CS-04-12, Department of Computer Science. The University of Sheffield; 2004.
21. Vandenberghe L, Boyd S, Wu S: Determinant Maximization with Linear Matrix Inequality Constraints. SIAM Journal on Matrix Analysis and Applications 1998, 19:499-533.
22. Smola AJ, Bartlett PL: Sparse greedy Gaussian process regression.
23. Smola AJ, Schölkopf B: Sparse greedy matrix approximation for machine learning.
24. Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B: Learning with Local and Global Consistency.
25. Golub GH, Van Loan CF: Matrix Computations. 3rd edition. Baltimore, MD: The Johns Hopkins University Press; 1996.
26. Tsuda K, Shin H, Schölkopf B: Fast protein classification with multiple networks. Bioinformatics 2005, 21:59-65.
27. The Matlab Package [http://www.mathworks.com]
28. Nocedal J, Wright S: Numerical Optimization. 2nd edition. New York: Springer; 2006.
29. The MOSEK Package [http://www.mosek.com]
30. Chang CC, Lin CJ: [http://www.csie.ntu.edu.tw/~cjlin/libsvm]
31. Vert JP, Kanehisa M: Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA.
32. The Ligand data set [http://www.genome.ad.jp/ligand/]
33. The MIPS Comprehensive Yeast Genome Database [http://mips.gsf.de/genre/proj/yeast/]
34. von Mering C, Krause R, Snel B, Cornell M, Oliver S, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002, 417(6887):399-403.