Abstract
Background
Support Vector Machines (SVMs) – using a variety of string kernels – have been successfully applied to biological sequence classification problems. While SVMs achieve high classification accuracy they lack interpretability. In many applications, it does not suffice that an algorithm just detects a biological signal in the sequence, but it should also provide means to interpret its solution in order to gain biological insight.
Results
We propose novel and efficient algorithms for solving the socalled Support Vector Multiple Kernel Learning problem. The developed techniques can be used to understand the obtained support vector decision function in order to extract biologically relevant knowledge about the sequence analysis problem at hand. We apply the proposed methods to the task of acceptor splice site prediction and to the problem of recognizing alternatively spliced exons. Our algorithms compute sparse weightings of substring locations, highlighting which parts of the sequence are important for discrimination.
Conclusion
The proposed method is able to deal with thousands of examples while combining hundreds of kernels within reasonable time, and reliably identifies a few statistically significant positions.
1 Background
Kernel based methods such as Support Vector Machines (SVMs) have proven to be powerful for sequence analysis problems frequently appearing in computational biology (e.g. [14]). They employ a socalled kernel function k(s_{i}, s_{j}) which intuitively computes the similarity between two sequences s_{i }and s_{j}. The result of SVM learning is a αweighted linear combination of kernel elements and the bias b (see Section 4.1 for more details):
where the s_{i }'s are N labeled training sequences (y_{i }∈ {± 1}). One of the problems with kernel methods compared to probabilistic methods (such as position weight matrices or interpolated Markov models [5]) is that the resulting decision function (1) is hard to interpret and, hence, difficult to use in order to extract relevant biological knowledge from it (see also [3,6]). We approach this problem by considering convex combinations of M kernels, i.e.
with β_{k }≥ 0 and ∑_{k}β_{k }= 1, where each kernel k_{k }uses only a distinct set of features of the sequence. For appropriately designed subkernels k_{k}, the optimized combination coefficients can then be used to understand which features of the sequence are of importance for discrimination. This is an important property missing in current kernel based algorithms.
In this work we consider the problem of finding the optimal convex combination of kernels (i.e. determining the optimal β's in (2)). This problem is known as the Multiple Kernel Learning (MKL) problem [4,7,8] (see also [9,10,38] for related approaches). Sequence analysis problems usually come with large numbers of examples and one may wish to combine many kernels representing many possibly important features. Unfortunately, algorithms proposed for Multiple Kernel Learning so far are not capable of solving the optimization problem for realistic problem sizes (e.g. ≥ 10,000 examples) within reasonable time. Even recently proposed decomposition algorithms for this problem, such as the one proposed in [7], are not efficient enough since they suffer for instance from the inability to keep all kernel matrices (K_{j }∈ ℝ^{N×N}, j = 1, ..., M) in memory. (Note that kernel caching can become ineffective if the number of combined kernels is large.) We consider the reformulation of the MKL problem into a semiinfinite linear problem (SILP), which can be iteratively approximated quite efficiently. In each iteration one only needs to solve the classical SVM problem (with one of the efficient and publicly available SVM implementations; cf. references in [11] and also [12,39] to gain a further speedup in case of string kernels) and then performs an update of the kernel convex combination weights β. Separating the SVM optimization from the optimization of the kernel coefficients can thus lead to significant improvements for large scale problems with general kernels (cf. Section 4 for details).
We illustrate the usefulness of the proposed algorithm in combination with a recently proposed string kernel on DNA sequences – the socalled weighted degree (WD) kernel [13]. Its main idea is to count the (exact) cooccurrence of kmers at position l of two compared DNA sequences of equal length L (e.g. a window around some signal on the DNA). The kernel can be written as a linear combination of d parts with coefficients β_{k} (k = 1, ..., d):
where L is the length of the sequences s, d is the maximal oligomer length considered and u_{k,l}(s) is the oligomer of length k at position l of sequence s. Moreover, I(true) := 1 and 0 otherwise.
One question is how the weights β_{k }for the various kmers in (3) should be chosen. So far, only heuristic settings in combination with expensive modelselection methods have been used (e.g. [13]). The MKL approach offers a clean and efficient way to find the optimal weights β: We define d kernels
and then optimize the convex combination of these kernels (with coefficients β) using the MKL algorithm (cf. (3)). The optimal weights β indicate which oligomer lengths are important for the classification problem at hand (see results in Section 2.2).
Additionally, it is interesting to introduce an importance weighting over the positions in the subsequence. Hence, we define a separate kernel for each position and each oligomer length, i.e.
k_{k,l}(s_{i}, s_{j}) = I(u_{k,l}(s_{i}) = u_{k,l}(s_{j})),
and optimize the weightings of the combined kernel, which may be written as
The simpler case would be to only consider one kernel per position in the sequence: k(s_{i}, s_{j}) = with
where γ is the default weighting as used in [13].
Obviously, if one would be able to obtain an accurate classification by a sparse weighting β_{k,l}, then one can quite easily interpret the resulting decision function. For instance for signal detection problems (such as splice site detection), one would expect a few important positions with long oligomers near the site and some additional positions within the exon capturing the nucleotide composition (short oligomers; cf. Sections 2.4 and 2.5).
While the proposed MKL algorithms are applicable to arbitrary kernels, we particularly consider the case of string kernels and show how their properties can be exploited in order to significantly speedup the computations. We extend previous work by [8,14,15] and employ tries [16] during training and testing. In Section 4 we develop a method that can avoid kernel caching and we therefore obtain very memory efficient and fast algorithms (which also speedup standard SVM training).
By bootstrapping and applying a combinatorial argument, we derive a statistical test that discovers the most important kernel weights. Using this test, we elucidate on simulated pseudoDNA sequences with two hidden 7mers which kmers in the sequence were used for the SVM decision. Additionally we apply our method to the problem of splice site classification (C. elegans acceptor sites) and to the problem recognizing alternatively spliced exons [17].
2 Results and Discussion
The main goal of this work is to provide an explanation of the SVM decision rule, for instance by identifying sequence positions that are important for discrimination. As a first test we apply our method to a toy problem where everything is known and we can directly validate the findings of our algorithm with the underlying truth. As a next step, we show that our MKL algorithm performs as well or slightly better than the standard SVM and leads to SVM classification functions that are computationally more efficient. In the remaining part we show how the weights can be used to obtain a deeper understanding of how the SVM classifies sequences and match it with knowledge about the underlying biological process.
2.1 MKL Learning Detects Motifs in Toy Data set
As a proof of concept, we test our method on a toy data set with two hidden 7mers (at positions 10 & 30) at four different noise levels (we used different numbers of random positions in the 7mers that were replaced with random nucleotides; for a detailed description of the data see Appendix A. 1). We use the kernel as defined in (5) with one subkernel per position and oligomer length. We consider sequences of length L = 50 and oligomers up to length d = 7, leading to M = 350 subkernels. For every noise level, we train on 100 bootstrap replicates and learn the 350 WD kernel parameters in each run. On the resulting 100 weightings we performed the reliability test (cf. Section 4.3). The results are shown in Figure 1 (columns correspond to different noise levels – increasing from left to right). Each figure shows a kernel weighting β, where columns correspond to weights used at a certain sequence position and rows to the kmer length used at that position. The plots in the first row show the weights that are detected to be important at a significance level of α = 0.05 in bright (yellow) color. The likelihood for every weight to be detected by the test and thus to reject the null hypothesis is illustrated in the plots in the second row (cf. Section 4.3 for details). Bright colors mean that it is more likely to reject .
Figure 1. In this "figure matrix", columns correspond to the noise level, i.e. different numbers of nucleotides randomly substituted in the motif of the toy data set (cf. Appendix A.1). Each subfigure shows a matrix with each element corresponding to one kernel weight: columns correspond to weights used at a certain sequence position (1–50) and rows to the oligomer length used at that position (1–7). The first row of the figure matrix shows the kernel weights that are significant, while the second row depicts the likelihood of every weight to be rejected under
.As long as the noise level does not exceed 2/7, longer matches of length 3 and 4 seem sufficient to distinguish sequences containing motifs from the rest. However, only the 3mer is detected with the test procedure. When more nucleotides in the motifs are replaced with noise, more weights are determined to be of importance. This becomes especially obvious in column 3 were 4 out of 7 nucleotides within each motif were randomly replaced, but still an average ROC score of 99.6% is achieved. In the last column the ROC score drops down to 83% (not shown), but only weights in the correct range 10 ... 16 and 30 ... 36 are found to be significant.
2.2 Optimization of WD Kernel Weights Speeds up Computations and Improves Accuracy
We compare the standard SVM with WD kernel (default weighting as in [13]) and kernel caching (SVMlight implementation [18]) and our MKLSVM algorithm with WD kernel (optimized weighting) and using tries (cf. Section 4). We applied both algorithms on the C. elegans acceptor splice data set using 100,000 sequences in training, 100,000 examples for validation and 60,000 examples to test the classifiers performance (cf. Appendix A.2). In this data set each sequence is a window centered around a AG dimer containing 141 nucleotides (nt), together with the corresponding label +1 for true acceptor splice sites and 1 for decoys (cf. [13] and Appendix A.2 for more details). Using this setup we perform 5fold crossvalidation over the maximal oligomer length d ∈ {10,12,15,17,20} (cf. (3)) and the SVM regularization constant C ∈ {0.5, 2, 5, 10}. A detailed comparison of the WD kernel approach with other stateoftheart methods is provided in [13] and goes beyond the scope of this work.
On the validation set we find that for the SVM using the standard WD kernel (using the default weighting), d = 20 and C = 0.5 gives best classification performance (ROC score 99.66% on validation set), while the MKLSVM using the WD kernel (optimized weighting) gives best results for d = 12 and C = 1 (ROC score also 99.66% on validation set). Figure 2 shows the WD kernel weights computed by the MKLSVM approach. It suggests that 12mers and 6mers seem to be of high importance and 14mers are also important. On the test data set the resulting SVM classifier with standard WD kernel performs as good as on the validation data set (ROC score 99.66% again), while the classifier obtained by MKLSVMs with optimized WD kernel weights achieves a 99.67% ROC score. Astonishingly training the MKLSVM (i.e. with weight optimization and tries) was 1.5 times faster than training the original SVM (with kernel caching). Also, the resulting classifier provided by the new algorithm is considerably faster than the one obtained by the classical SVM since many βweights are zero (see also [19]).
It should be noted that the obtained weighting in this experiment is only partially useful for interpretation. In the case of splice site detection, it is unlikely that kmers of length 12 play the most important role. More likely to be important are oligos of length up to six. We believe that the large weight for the longest oligo is an artifact which comes from the fact that we are combining kernels with quite different properties. (The 12th kernel leads to a kernel matrix that is most diagonally dominant, which we believe is the reason for having a large weight. This problem can be partially alleviated by including the identity matrix in the convex combination. However as ℓ_{2}norm soft margin SVMs can be implemented by adding a constant to the diagonal of the kernel [20,21], this leads effectively to an additional ℓ_{2}norm penalization.) In the following example we consider one weight per position. In this case the combined kernels are more similar to eachother and we expect more interpretable results.
Figure 2. Optimized WD Kernel Weights.
2.3 Optimal Positional Importance Weighting is Related to Positional Weight Matrices
An interesting relation of the learned weightings to the relative entropy between Positional Weight Matrices (PWMs) can be shown with the following experiment: We train an SVM with a WD kernel that consists of 60 firstorder subkernels (i.e. only single nucleotide matches are considered) on acceptor splice sites from C. elegans (100,000 sequences for training, 160,000 sequences for validation). The characteristic acceptor splice site AG dimer is at positions 31 & 32. We extracted the sequences from a window (30, +28) around the dimer. The learned weights β_{k }are shown in Figure 3 (left). For comparison we computed the PWMs (Markov chains of zeroth order) for the positive and the negative class separately (denoted by and ). Additionally, we computed the relative entropy Δ_{i }between the two probability estimates and at each position j by , leading to Figure 3 (right). The shape of both plots is quite similar, i.e. both methods consider upstream information, as well as a position directly after the splice site to be highly important. As a major difference the WDweights in the exons remain on a high level. Note that both methods use only zeroth order information. Nevertheless the classification accuracy is already quite high. On the separate validation set the SVM already achieves a ROC score of 99.07% and the Positional Weight Matrices a ROC score of 98.83%.
Figure 3. (left) Value of the learned weightings of an SVM with a WD kernel of 60 firstorder subkernels, (right) relative entropy obtained between the Positional Weight Matrices for the positive and the negative class, both trained for acceptor splice site detection.
2.4 Positional WD Kernel Weights Helps Understanding Splice Site Classification
Note that Markov chains become intractable and less accurate for high orders, which seem on the other hand necessary for achieving high accuracies in many sequence analysis tasks. SVMs, however, are efficient and accurate even for great oligomer lengths. We therefore expect that MKLSVMs may also in this case provide useful insights at which positions the discriminative information is hidden.
In order to illustrate this idea we perform another experiment: We considered the larger region from 50 nt to +60 nt around the splice site and used the WD kernel with d = 15. We defined a kernel for every position that only accounts for substrings that start at the corresponding position (up to length 15). To get a smoother weighting and to reduce the computing time we only used [111/2] = 56 weights (combining every two positions to one weight). Figure 4 shows the average computed weighting on ten bootstrap runs trained on about 65,000 examples. Several regions of interest can be identified: a) The region 50 nt to 40 nt, which corresponds to the donor splice site of the previous exon (many introns in C. elegans are very short, often only 50 nt), b) the region 25 nt to 15 nt that coincides with the location of the branch point, c) the intronic region closest to the splice site with greatest weight (8 nt to 1 nt; the weights for the AG dimer are zero, since it appears in splice sites and decoys) and d) the exonic region (0 nt to +50 nt). Slightly surprising are the high weights in the exonic region, which we suspect only model triplet frequencies. The decay of the weights seen from +15 nt to +45 nt might be explained by the fact that not all exons are actually long enough. Furthermore, since the sequence ends in our case at +60 nt, the decay after +45 nt is an edge effect as longer substrings cannot be matched.
Figure 4. Optimized WD kernel weights considering subsequences starting at different positions (one weight per two positions).
2.5 Finding Motifs for Splice Site Detection
We again consider the classification of acceptor splice sites against nonacceptor splice sites (with centered AG dimer) from the C. elegans (cf. Appendix A.2 for details on the generation of the data sets). We trained our Multiple Kernel Learning algorithm (C = 2) on 5,000 randomly chosen sequences of length 111 nt with a maximal oligomer length of d = 10. This leads to M = 1110 kernels in the convex combination. Figure 5 shows the results obtained for this experiment (similarly organized as Figure 1). We can observe (cf. Figure 5b&c) that the optimized kernel coefficients are biologically plausible: longer significant oligomers were found close to the splice site position, oligomers of length 3 and 4 are mainly used in the exonic region (modeling triplet usage) and short oligomers near the branch site. Note, however, that one should use more of the available examples for training in order to extract more meaningful results (adapting 1110 kernel weights may have lead to overfitting). In some preliminary tests using more training data we observed that longer oligomers and also more positions in the exonic and intronic regions become important for discrimination.
Figure 5. Figure a) shows the average weight (over 10 runs) of the weights per position (one weight for two positions) and d) the averaged weights per oligomer length (uniform position weighting). Figures b) displays the position and oligomer length combinations that were found to be significantly used (40 bootstrap runs). Figure c) shows the likelihood for rejecting
. In all runs we used 5, 000 training examples.Note that the weight matrix would be the outer product of the position weight vector (cf. Figure 5a) and the oligomerlength weight vector (cf. Figure 5d), if position and oligomer length would be independent. This is clearly not the case: it seems very important (according to the weight for oligomerlength 5) to consider longer oligomers for discrimination (see also Figure 2) in the central region, while it is only necessary and useful to consider monomers and dimers in other parts of the sequence.
2.6 Understanding the Recognition of Alternatively Spliced Exons
In this section we consider the problem of recognizing one major form of alternative splicing, namely the exclusion of exons from the transcript. It has been shown that alternatively spliced exons have certain properties that distinguish them from constitutively spliced exons (cf. [17] and references therein). In [17] we developed a method that only uses information that is available to the splicing machinery, i.e. the DNA sequence itself, and accurately distinguishes between alternatively and constitutively spliced exons (50% true positive rate at a 1% false positive rate; see http://www.fml.tuebingen.mpg.de/raetsch/projects/RASE webcite for more details). Using our MKL method we have identified regions near the 5' and 3' end of the considered exons that carry most of the discriminative information. We show that these regions contain many hexamers that are significantly more often present than average in constitutively spliced exons.
In order to recognize alternatively spliced exons we consider the 5' and 3' end of the exons separately and use an extended version of the WD kernel (exhibiting an improved positional invariance, cf. [17]) on a 201 nt window centered around the exon start and end together with additional kernels capturing information about the length of the exon and the flanking introns [17].
To interpret the SVM classifiers result we employ Multiple Kernel Learning to determine the weights and for the two WD kernels around the acceptor (5') and donor (3') site. In Figure 6 the learned weighting is shown (weights for other subkernels not shown). A higher weight at a certain position in the sequence corresponds to an increased importance of substrings starting at this location. Given this weighting, we can identify five regions which seem particularly important for discrimination: ab) within the upstream intron the region 70 nt to 40 nt and 30 nt to 0 nt (relative to the end of the intron), c) the exon positions +30 nt to +70 nt (relative to the beginning of the exon) and d) 90 nt to 30 nt (relative to the end of the exon). And finally e) the downstream intron positions 0 – 70 nt (relative to the beginning of the intron).
Figure 6. We use Multiple Kernel Learning to determine weights for the WD kernel. Shown is learned weighting for the WD kernel at the acceptor and at the donor site. From areas of higher weight (upstream intron: regions 70 nt to 40 nt and 30 nt to 0 nt, Exon: +30 nt to +70 nt and 30 nt to 90 nt, downstream intron 0 nt to +70 nt) overrepresented hexamers have been extracted and are shown in Table 1.
To illustrate that these regions represent distinct discriminative features for the problem at hand, we counted the occurrence of all hexamers in the positive and negative examples. Using the frequency p^{ }of occurrence of a hexamer in the negative examples as background model, we computed how likely it is to observe the frequency p^{+ }in the positive sequences (Evalue; using the binomial distribution). In Table 1 we display for each of the five regions the six hexamers with highest Evalue. In region a) the motif CTAACC frequently appears in various variations, while region b) is rich with C's and T's. Particularly interesting seem the motifs AGTGAG and CAGCAG which only appear significantly in the region near the exon start and exon end, respectively. The downstream intron contains many G's and T's. (Members of the CELF gene family bind for instance to GTrich regions; A. Zahler, personal communication).) A more complete list of the overrepresented hexamers are found on the supplementary website http://www.fml.tuebingen.mpg.de/raetsch/projects/RASE webcite.
Table 1. Shown are the top six ranked hexamers (by Evalue) extracted for the upstream intron the in between exon and the following downstream exon. The first column in the upper part shows the most important hexamers in the intron for the region 70 nt to 40 nt relative to the end of the intron. The lower part states 6mers contained 30 nt until the end of the upstream intron. Similarly the second column displays hexamers in the exon from +30 nt to +70 nt (upper half, relative to exon start) and 30 nt to  90 nt (lower part, relative to exon end) and the last column 6mers in the downstream intron from 0 nt to +70 nt.
3 Conclusion
In this work we have developed a novel Multiple Kernel Learning algorithm applicable to largescale sequence analysis problems that additionally assists in understanding how decisions are made. Using a novel reformulation of the MKL problem, we were able to reuse available SVM implementations that, in combination with using tries, have lead us to a very efficient MKL algorithm suitable for the analysis of large scale sequence analysis problems. In experiments on toy, splicesite detection and alternative exon recognition problems we have illustrated the usefulness of the Multiple Kernel Learning approach. The optimized kernel convex combination gives valuable hints at which positions discriminative oligomers of which length are located in the sequences. This solves to a certain extent one of the major problems with Support Vector Machines: now the decisions become interpretable. On the toy data set we rediscovered hidden sequence motifs even in presence of a large amount of noise. In the first experiments on the acceptor splice site detection problem we discovered patterns in the optimized weightings which are biologically plausible. For the recognition of alternatively spliced exons we have identified several regions near the 5' and 3' end of the exons that display distinguished patterns. It is future work to extend our computational evaluation and to consider other signal detection problems.
4 Methods
4.1 Support Vector Machines
We use Support Vector Machines [22] which are extensively studied in the literature (e.g. [11,20,21]). Their classification function can be written as in (1). The α_{i}'s are the Lagrange multipliers and b is the usual bias which are the results of SVM training. The kernel k is the key ingredient for learning with SVMs. It implicitly defines the feature space and the mapping Φ via
In case of the afore mentioned WD kernel, Φ maps into a feature space of all possible kmers of length up to d for each sequence position (D ≈ 4^{d+1}L). For a given sequence s, a dimension of Φ (s) is 1, if it contains a certain substring at a certain position. The dotproduct between two mapped examples then counts the cooccurrences of substrings at all positions.
For a given set of training examples (s_{i}, y_{i}) (i = 1, ..., N), the SVM solution is obtained by solving the following
optimization problem that maximizes the soft margin between both classes [23]:
where the parameter C determines the tradeoff between the size of the margin and the margin errors ξ_{i}. The dual optimization problem is as follows:
Note that there exist a large variety of different software packages that can efficiently solve the above optimization problem even for more than one hundred thousand of examples (cf. references in [11] and also [12] to gain further speedups when string kernels are used).
4.2 The Multiple Kernel Learning Optimization Problem
4.2.1 Idea
In the Multiple Kernel Learning (MKL) problem one is given N data points (, y_{i}) (y_{i }∈ {± 1}), where is subdivided into M components = (s_{i,1}, ..., s_{i,M}) with and k_{j }is the dimensionality of the jth component. Then one solves the following convex optimization problem [7], which is equivalent to the linear SVM for M = 1:
where d_{j }is a prior weighting of the kernels (in [7], has been chosen such that the combined kernel has trace one). For simplicity, we assume that d_{j }= 1 for the rest of the paper and that the normalization is done within the mapping φ (if necessary). Note that the ℓ_{1}norm of β is constrained to one, while one is penalizing the ℓ_{2}norm of w_{j }in each block j separately. The idea is that ℓ_{1}norm constrained or penalized variables tend to have sparse optimal solutions, while ℓ_{2}norm penalized variables do not [24]. Thus the above optimization problem offers the possibility to find sparse solutions on the block level with nonsparse solutions within the blocks.
4.2.2 Reformulation as a SemiInfinite Linear Program
The above optimization problem can also be formulated in terms of support vector kernels [7]. Then each block j corresponds to a separate kernel (K_{j})_{r,s }= k_{j}(s_{r,j}, s_{s,j}) computing the dotproduct in feature space of the jth component. In [7] it has been shown that the following optimization problem is equivalent to (9):
In order to solve (10), one may solve the following saddle point problem (Lagrangian):
minimized w.r.t. α ∈ , γ ∈ ℝ (subject to α ≤ C and ∑_{i }α_{i}y_{i }= 0) and maximized w.r.t. β ∈ . Setting the derivative w.r.t. to γ to zero, one obtains the constraint and (11) simplifies to:
Assume α* would be the optimal solution, then θ* := S(α*) – ∑_{i }α_{i }is minimal and, hence, S(α) – ∑_{i }α_{i }≥ θ* for all α (subject to the above constraints). Hence, finding a saddlepoint of (12) is equivalent to solving the following semiinfinite linear program:
4.2.3 A Column Generation Method
Note that there are infinitely many constraints (one for every vector α). Typically algorithms for solving semiinfinite problems work by iteratively finding violated constraints, i.e. α vectors, for intermediate solutions (β, θ). Then one adds the new constraint (corresponding to the new α) and resolves for β and θ [25]. The pseudocode is outlined in Algorithm 1. Note, however, that there are no known convergence rates for such algorithms [25], but it often converges to the optimal solution in a small number of iterations [26,27]. (It has been shown that solving semiinfinite problems like (13), using a method related to boosting (e.g. [28]) one needs at most iterations, where is the unnormalized constraint violation and the constants may depend on the kernels and the number of examples N [24,29]. At least for not too small values of this technique produces reasonably fast good approximate solutions. See [8] for more details.)
Fortunately, finding the constraint that is most violated corresponds to solving the SVM optimization problem for a fixed weighting of the kernels:
where K = ∑_{j}β_{j}K_{j}. Due to the number of efficient SVM optimizers, the problem of finding the most violated constraint can be solved efficiently, too.
Finally, one needs some convergence criterion. Note that the problem is solved when all constraints are satisfied while the β's and θ are optimal. Hence, it is a natural choice to use the normalized maximal constraint violation as a convergence criterion. In our case this would be:
where (β^{t},θ^{t}) is the optimal solution at iteration t  1 and α^{t }corresponds to the newly found maximally violating constraint of the next iteration (i.e. the SVM solution for weighting β^{t}; cf. Algorithm 1 for details). We usually only try to approximate the optimal solution and stop the optimization as soon as ε_{t }≤ ε, were ε was set to 10^{4 }or 10^{3 }in our experiments.
4.2.4 A chunking algorithm for simultaneous optimization of α and β
Usually it is infeasible to use standard optimization tools (e.g. MINOS, CPLEX, LOQO) for solving the SVM training problems on data sets containing more than a few thousand examples. Socalled decomposition techniques overcome this limitation by exploiting the special structure of the SVM problem. The key idea of decomposition is to freeze all but a small number of optimization variables (working set) and to solve a sequence of constantsize problems (subproblems of (8)).
The general idea of Chunking and Sequential Minimal Optimization (SMO) algorithm has been proposed by [30,31] and is implemented in many SVM software packages. Here we would like to propose an extension of the Chunking algorithm to optimize the kernel weights β and the example weights α at the same time. The algorithm is motivated from an insufficiency of the columngeneration algorithm described in the previous section: If the β's are not optimal yet, then the optimization of the α's until optimality is not necessary and therefore inefficient. It would be considerably faster if for any newly obtained α in the Chunking iterations, we could efficiently recompute the optimal β and then continue optimizing the α's using the new kernel weighting.
Intermediate Recomputation of β Recomputing β involves solving a linear program and the problem grows with each additional αinduced constraint. Hence, after many iterations solving the LP may become infeasible. Fortunately, there are two facts making it still possible: (1) only a small number of the added constraints are active and one may for each newly added constraint remove an old inactive one – this prevents the LP from growing arbitrarily and (2) for Simplexbased LP optimizers such as CPLEX there exists the socalled hotstart feature which allows one to efficiently recompute the new solution, if one, for instance, only adds a few additional constraints. The SVMlight optimizer which we are going to modify, internally needs the output = ∑_{i}α_{i}y_{i }k(s_{i}, s_{j}) for all training examples in order to select the next variables for optimization [18]. However, if one changes the kernel weights, then the stored values become invalid and need to be recomputed. In order to avoid the full recomputation one has to additionally store a M × N matrix f_{k,j }= ∑_{i}α _{i}y_{i}k_{k}(s_{i}, s_{j}), i.e. the outputs for each kernel separately. If the β's change, then can be quite efficiently recomputed by = ∑_{k}β_{k}f_{k,j}).
Faster α Optimization using Tries Finally, in each iteration the Chunking optimizer may change a subset of the α's. In order to update and f_{j,k }one needs to compute full rows j of each kernel for every changed α_{j}. Usually one uses kernelcaching to reduce the computational effort of this operation, which is, however, in our case not efficient enough since the effect of the kernel caches degrades drastically in the case of having many kernels. Fortunately, for the WD kernel there is a way to avoid this problem by using socalled tries (cf. [16]; similarly proposed by [14] and others). While we cannot improve a single kernel evaluation (which is already (L)), it turns out to be possible to drastically speedup the computation of a linear combination of kernels, i.e.
where I is the index set. The idea is to create a trie for each position l = 1, ..., L of the sequence. We propose to attach weights to internal nodes and the leaves of the trie, allowing an efficient storage of weights for kmers (1 ≤ k ≤ d). Now we may add all kmers (k = 1, ..., d) of s_{i }(i ∈ I) starting at position l to the trie associated with position l (using weight α_{i}β_{k}; operations per position: (dI)). Once created, the lth trie can be traversed down in order to lookup which kmers in a test sequence (starting at position l) have a nonzero contribution to g(s):
Following the path defined by the kmer u one adds all weights along the way and stops when no children exists (see Figure 7). Note that we now can compute g in (Ld) operations (compared to (ILd) in the original formulation). Empirically we noticed that the proposed Chunking algorithm is often 3–5 times faster than the columngeneration algorithm proposed in the last section, while achieving the same accuracy. In the experiments in Section 2 we only used the Chunking algorithm with a chunk size of Q = 41.
The pseudocode of the algorithm which takes the discussion of this section into account is displayed in Algorithm 2.
Figure 7. Three sequences AAA, AGA, GAA beeing added to the trie. The plot displays the resulting weights at the nodes.
4.3 Estimating the Reliability of a Weighting
Finally we want to assess the reliability of the learned weights β. For this purpose we generate T bootstrap samples and rerun the whole procedure resulting in T weightings β^{t}.
To test the importance of a weight β_{k,i }(and therefore the corresponding kernels for position and oligomer length) we apply the following method: We define a Bernoulli variable ∈ {0,1}, k = 1, ..., d, i = 1, ..., L, t = 1, ..., T by
The sum has binomial distribution Bin (T, p_{0}), p_{0 }unknown. We estimate p_{0 }with , i.e. the empirical probability to observe P( = 1), ∀k, i, t. We test whether Z_{k,i }is as large as could be expected under Bin(T, ) or larger, i.e. the nullhypothesis is : p ≤ c* (vs : p >c*). Here c* is defined as + 2Std_{k,i,t } and can be interpreted as an upper bound of the confidence interval for p_{0}. This choice is taken to be adaptive to the noise level of the data and hence the (non)sparsity of the weightings β^{t}. The hypotheses are tested with a MaximumLikelihood test on an αlevel of α = 0.05; that is c** is the minimal value for that the following inequality hold:
For further details on the test see [32] or [33]. This test is carried out for every . (We assume independence between the weights in one single β, and hence assume that the test problem is the same for every β_{k,i}). If can be rejected, the kernel learned at position i on the kmer is important for the detection and thus (should) contain biologically interesting knowledge about the problem at hand.
Authors' contributions
GR proposed and implemented the SILP formulation of the MKL problem, prepared data sets, drafted the manuscript and helped in carrying out experiments. SS invented the Weighted Degree Kernel, analyzed several weighting schemes and reformulated it as a MKL problem, helped implementing the MKL algorithms and carried out most of the experiments. CS developed the statistical significance test and critically revised the article.
A Data Generation
A.1 Toy Data
We generated 11,000 sequences of length 50, where the symbols of the alphabet {A, C, G, T} follow a uniform distribution. We chose 1,000 of these sequences to be positive examples and hid two motifs of length seven: at position 10 and 30 the motifs GATTACA and AGTAGTG, respectively. The remaining 10,000 examples were used as negatives. Thus the ratio between examples of class +1 and class 1 is ≈ 9%. In the positive examples, we then randomly replaced s ∈ {0, 2, 4, 5} symbols in each motif. Leading to four different data sets which where randomly permuted and split such that the first 1,000 examples became training and the remaining 10,000 validation examples.
A.2 Splice Site Sequences
We collected all known C. elegans ESTs from Wormbase [34] (release WS118; 236,868 sequences), dbEST [35] (as of February 22, 2004; 231,096 sequences) and UniGene [36] (as of October 15, 2003; 91,480 sequences). Using blat [37] we aligned them against the genomic DNA (release WS118). We refined the alignment by correcting typical sequencing errors, for instance by removing minor insertions and deletions. If an intron did not exhibit the GT/AG or GC/AG dimers at the 5' and 3' ends, respectively, then we tried to achieve this by shifting the boundaries up to 2 nucleotides. For each sequence we determined the longest open reading frame (ORF) and only used the part of each sequence within the ORF. In a next step we merged agreeing alignments, leading to 135,239 unique ESTbased sequences. We repeated the above with all known cDNAs from Wormbase (release WS118; 4,848 sequences) and UniGene (as of October 15, 2003; 1,231 sequences), which lead to 4,979 unique sequences. We removed all EST matches fully contained in the cDNA matches, leaving 109,693 ESTbased sequences.
We clustered the sequences in order to obtain independent training, validation and test sets. In the beginning each of the above EST and cDNA sequences were in a separate cluster. We iteratively joined clusters, if any two sequences from distinct clusters a) match to the genome at most l00 nt apart (this includes many forms of alternative splicing) or b) have more than 20% sequence overlap (at 90% identity, determined by using blat). We obtained 17,763 clusters with a total of 114,672 sequences. There are 3,857 clusters that contain at least one cDNA. Finally, we removed all clusters that showed alternative splicing.
Since the resulting data set is still too large, we only used sequences from randomly chosen 20% of clusters with cDNA and 30% of clusters without cDNA to generate true acceptor splice site sequences (15,507 of them). Each sequence is 398 nt long and has the AG dimer at position 200. Negative examples were generated from any occurring AG within the ORF of the sequence (246,914 of them were found). We used a random subset of 60,000 examples for testing, 100,000 examples for parameter tuning and up to 100,000 examples for training (unless stated otherwise).
Algorithms
Algorithm 1 The column generation algorithm employs a linear programming solver to iteratively solve the semiinfinite linear optimization problem (13). The accuracy parameter ε is a parameter of the algorithm.
D^{0 }= 1, θ^{1 }= 0, for k = 1, ..., M
for t = 1,2, ... do
obtain SVM's α^{t }with kernel k^{t }(s_{i}, s_{j}) :=
for k = 1, ..., M do
end for
(β^{t+1},θ^{t+1}) = argmax θ
end for
Algorithm 2 Outline of the Chunking algorithm (extension to SVMlight) that optimizes α and the kernel weighting β simultaneously. The accuracy parameter ε and the subproblem size Q are assumed to be given to the algorithm. For simplicity we omit the removal of inactive constraints. Also note that from one iteration to the next the LP only differs by one additional constraint. This can usually be exploited to save computing time for solving the LP.
f_{k,i }= 0, , α_{i }= 0, for k = 1, ..., M and i = 1, ..., N
for t = 1, 2, ... do
Check optimality conditions and stop if optimal
select Q suboptimal variables i_{1}, ... i_{Q }based on and α
α^{old }= α
solve (8) with respect to the selected variables and update α
create triestructures to prepare efficient computation of
f_{k,i }= f_{k,i }+ g_{k} (s_{i}) for all k = 1, ..., M and i = 1, ..., N
for k = 1, ..., M do
end for
(β^{t+1},θ^{t+1}) = argmax θ
w.r.t β ∈ , θ ∈ ℝ with ∑_{k}β_{k }= 1
else
θ^{t+1 }= θ^{t}
end if
end for
Acknowledgements
The authors gratefully acknowledge partial support from the PASCAL Network of Excellence (EU #506778), DFG grants JA 379 /132 and MU 987/21. We thank Alexander Zien, K.R. Müller, B. Schölkopf, D. Weigel and M.K. Warmuth for great discussions and C.S. Ong for proof reading the manuscript. G.R. would like to acknowledge a visiting appointment with National ICT Australia during the preparation of this work.
N.B. The appendix contains details regarding the data generation. Additional information about this work can be found at http://www.fml.tuebingen.mpg.de/raetsch/projects/mkl_splice.
References

Zien A, Rätsch G, Mika S, Schölkopf B, Lengauer T, Müller KR: Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites.
Biolnformatics 2000, 16(9):799807. Publisher Full Text

Jaakkola T, Diekhans M, Haussler D: discriminative framework for detecting remote protein homologies.
J Comput Biol 2000, 7(1–2):95114. PubMed Abstract  Publisher Full Text

Zhang X, Heller K, Hefter I, Leslie C, Chasin L: Sequence information for the splicing of human premRNA identified by support vector machine classification.

Lanckriet G, Bie TD, Cristianini N, Jordan M, Noble W: A statistical framework for genomic data fusion.
Bioinformatics 2004, 20:26262635. PubMed Abstract  Publisher Full Text

Delcher A, Harmon D, Kasif S, White O, Salzberg S: Improved microbial gene identification with GLIMMER.
Nucleic Acids Research 1999, 27(23):46364641. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kuang R, Ie E, Wang K, Wang K, Siddiqi M, Freund Y, Leslie C: Profilebased string kernels for remote homology detection and motif extraction.
Computational Systems Bioinformatics Conference 2004 2004, 146154.

Bach FR, Lanckriet GRG, Jordan MI: Multiple kernel learning, conic duality, and the SMO algorithm.
Twentyfirst international conference on Machine learning 2004., 69
ACM Press

Sonnenburg S, Rätsch G, Schäfer C: Learning Interpretable.
RECOMB LNBI 3500, SpringerVerlag Berlin Heidelberg 2005, for Biological Sequence Classification.:389407.

Chapelle O, Vapnik V, Bousquet O, Mukherjee S: Choosing Multiple Parameters for Support Vector Machines.

Ong C, Smola A, Williamson R: Learning the Kernel with Hyperkernels.

Müller KR, Mika S, Rätsch G, Tsuda K, Schölkopf B: An Introduction to KernelBased Learning Algorithms.
IEEE Transactions on Neural Networks 2001, 12(2):181201. PubMed Abstract  Publisher Full Text

Sonnenburg S, Rätsch G, Schölkopf B: Large Scale Genomic Sequence SVM Classifiers.
Proceedings of the International Conference on Machine Learning, ICML 2005.

Accurate Splice Site Prediction for Caenorhabditis Elegans, MIT Press. MIT Press series on Computational Molecular Biology. 2003, 277298.

Leslie C, Eskin E, Noble W: The Spectrum Kernel: A String Kernel for SVM protein Classification.
Proceedings of the Pacific Symposium on Biocomputing, Kaua'i, Hawaii 2002, 564575.

Vishwanathan S, Smola A: Fast Kernels for String and Tree Matching.
Kernel Methods in Computational Biology, MIT Press series on Computational Molecular Biology, MIT Press 2003, 113130.

Comm ACM 1960, 3(9):490499. Publisher Full Text

Rätsch G, Sonnenburg S, Schölkopf B: RASE: Recognition of Alternatively Spliced Exons in C. elegans.
Bioinformatics 2005, 21:i369i377. PubMed Abstract  Publisher Full Text

Joachims T: Making LargeScale SVM Learning Practical. In Advances in Kernel Methods – Support Vector Learning. Edited by Schölkopf B, Burges C, Smola A. Cambridge, MA: MIT Press; 1999:169184. PubMed Abstract  Publisher Full Text

Engel Y, Mannor S, Meir R: Sparse Online Greedy Support Vector Regression.

Cristianini N, ShaweTaylor J: An Introduction to Support Vector Machines. Cambridge, UK: Cambridge University Press; 2000.

Schölkopf B, Smola AJ: Learning with Kernels. Cambridge, MA: MIT Press; 2002.

Vapnik V: The nature of statistical learning theory. New York: Springer Verlag; 1995.

Rätsch G: Robust Boosting via Convex Optimization. PhD thesis. University of Potsdam, Computer Science Dept., AugustBebelStr. 89, 14482 Potsdam, Germany; 2001.

Hettich R, Kortanek K: SemiInfinite Programming: Theory, Methods and Applications.

Bennett K, Demiriz A, ShaweTaylor J: A Column Generation Algorithm for Boosting. In Proceedings, 17th ICML. Edited by Langley P. San Francisco: Morgan Kaufmann; 2000:6572.

Rätsch G, Demiriz A, Bennett K: Sparse Regression Ensembles in Infinite and Finite Hypothesis Spaces.
Machine Learning 2002, 48(1–3):193221.
Special Issue on New Methods for Model Selection and Model Combination. Also NeuroCOLT2 Technical Report NCTR2000085.

Meir R, Rätsch G: An Introduction to Boosting and Leveraging. In Proc. of the first Machine Learning Summer School in Canberra, LNCS. Edited by Mendelson S, Smola A. Springer; 2003:in press.

Rätsch G, Warmuth MK: Efficient Margin Maximization with Boosting.
Journal of Machine Learning Research 2005, 6(Dec):21312152.

Vapnik V: Estimation of Dependences Based on Empirical Data. Berlin: SpringerVerlag; 1982.

Platt J: Fast Training of Support Vector Machines using Sequential Minimal Optimization. In Advances in Kernel Methods – Support Vector Learning. Edited by Schölkopf B, Burges C, Smola A. Cambridge, MA: MIT Press; 1999:185208. PubMed Abstract  Publisher Full Text

Mood A, Graybill F, Boes D: Introduction to the Theory of Statistics. third edition. McGrawHill; 1974.

Testing Statistical Hypotheses. Springer, New York, second edition edition. 1997.

Harris TW, et al.: WormBase: a multispecies resource for nematode biology and genomics.
Nucl Acids Res 2004., 32
Database issue:D4117

Boguski M, Tolstoshev TLC: dbESTDatabase for "Expressed Sequence Tags".
Nat Genet 1993, 4(4):3323. PubMed Abstract  Publisher Full Text

Wheeler DL, et al.: Database Resources of the National Center for Biotechnology.
Nucl Acids Res 2003, 31:3833. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kent W: BLATthe BLASTlike alignment tool.
Genome Res 2002, 12(4):65664. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bennett KP, Momma M, Embrechts MJ: MARK: a boosting algorithm for heterogeneous kernel models.

Sonnenburg S, Rätsch G, Schäfer S, Schölkopf B: Large Scale Multiple Kernel Learning.
Journal of Machine Learning Research 2006.
Accepted