
This article is part of the supplement: NIPS workshop on New Problems and Methods in Computational Biology


Learning Interpretable SVMs for Biological Sequence Classification

Gunnar Rätsch1*, Sören Sonnenburg2 and Christin Schäfer2

Author Affiliations

1 Friedrich Miescher Laboratory, Max Planck Society, Spemannstr. 39, Tübingen, Germany

2 Fraunhofer Institute FIRST, Kekuléstr. 7, 12489 Berlin, Germany


BMC Bioinformatics 2006, 7(Suppl 1):S9  doi:10.1186/1471-2105-7-S1-S9



Published: 20 March 2006

© 2006 The Author(s); licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Support Vector Machines (SVMs) – using a variety of string kernels – have been successfully applied to biological sequence classification problems. While SVMs achieve high classification accuracy, they lack interpretability. In many applications it does not suffice that an algorithm merely detects a biological signal in the sequence; it should also provide means to interpret its solution in order to gain biological insight.

Results

We propose novel and efficient algorithms for solving the so-called Support Vector Multiple Kernel Learning problem. The developed techniques can be used to understand the obtained support vector decision function in order to extract biologically relevant knowledge about the sequence analysis problem at hand. We apply the proposed methods to the task of acceptor splice site prediction and to the problem of recognizing alternatively spliced exons. Our algorithms compute sparse weightings of substring locations, highlighting which parts of the sequence are important for discrimination.

Conclusion

The proposed method is able to deal with thousands of examples while combining hundreds of kernels within reasonable time, and reliably identifies a few statistically significant positions.

1 Background

Kernel based methods such as Support Vector Machines (SVMs) have proven to be powerful for sequence analysis problems frequently appearing in computational biology (e.g. [1-4]). They employ a so-called kernel function k(s_i, s_j) which intuitively computes the similarity between two sequences s_i and s_j. The result of SVM learning is an α-weighted linear combination of kernel elements plus a bias b (see Section 4.1 for more details):

f(s) = sign( ∑_{i=1}^N α_i y_i k(s_i, s) + b ),     (1)

where the s_i's are the N labeled training sequences (y_i ∈ {± 1}). One of the problems of kernel methods compared to probabilistic methods (such as position weight matrices or interpolated Markov models [5]) is that the resulting decision function (1) is hard to interpret and, hence, difficult to use for extracting relevant biological knowledge (see also [3,6]). We approach this problem by considering convex combinations of M kernels, i.e.

k(s_i, s_j) = ∑_{k=1}^M β_k k_k(s_i, s_j)     (2)

with βk ≥ 0 and ∑kβk = 1, where each kernel kk uses only a distinct set of features of the sequence. For appropriately designed sub-kernels kk, the optimized combination coefficients can then be used to understand which features of the sequence are of importance for discrimination. This is an important property missing in current kernel based algorithms.

In this work we consider the problem of finding the optimal convex combination of kernels (i.e. determining the optimal β's in (2)). This problem is known as the Multiple Kernel Learning (MKL) problem [4,7,8] (see also [9,10,38] for related approaches). Sequence analysis problems usually come with large numbers of examples, and one may wish to combine many kernels representing many possibly important features. Unfortunately, the algorithms proposed for Multiple Kernel Learning so far are not capable of solving the optimization problem for realistic problem sizes (e.g. ≥ 10,000 examples) within reasonable time. Even recently proposed decomposition algorithms for this problem, such as the one proposed in [7], are not efficient enough, since they suffer for instance from the inability to keep all kernel matrices (K_j ∈ ℝ^{N×N}, j = 1, ..., M) in memory. (Note that kernel caching can become ineffective if the number of combined kernels is large.) We consider the reformulation of the MKL problem into a semi-infinite linear program (SILP), which can be iteratively approximated quite efficiently. In each iteration one only needs to solve the classical SVM problem (with one of the efficient and publicly available SVM implementations; cf. references in [11] and also [12,39] for a further speedup in the case of string kernels) and then to update the kernel convex combination weights β. Separating the SVM optimization from the optimization of the kernel coefficients can thus lead to significant improvements for large scale problems with general kernels (cf. Section 4 for details).

We illustrate the usefulness of the proposed algorithm in combination with a recently proposed string kernel on DNA sequences – the so-called weighted degree (WD) kernel [13]. Its main idea is to count the (exact) co-occurrences of k-mers at corresponding positions l of two compared DNA sequences of equal length L (e.g. a window around some signal on the DNA). The kernel can be written as a linear combination of d parts with coefficients β_k (k = 1, ..., d):

k(s_i, s_j) = ∑_{k=1}^d β_k ∑_{l=1}^{L-k+1} I( u_{k,l}(s_i) = u_{k,l}(s_j) ),     (3)

where L is the length of the sequences s, d is the maximal oligomer length considered, and u_{k,l}(s) is the oligomer of length k at position l of sequence s. Moreover, I(true) := 1 and I(false) := 0.
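For illustration, the WD kernel (3) can be prototyped in a few lines of Python (a naive O(dL) sketch, not the optimized implementation used here; the default weighting β_k = 2(d − k + 1)/(d(d + 1)) shown in the usage example is the common WD choice and an assumption of the sketch):

    def wd_kernel(s1, s2, d, beta):
        # Weighted degree kernel (3): beta[k-1]-weighted count of exact
        # k-mer matches at each position, for k = 1, ..., d.
        assert len(s1) == len(s2)
        L = len(s1)
        return sum(beta[k - 1] * sum(s1[l:l + k] == s2[l:l + k]
                                     for l in range(L - k + 1))
                   for k in range(1, d + 1))

    d = 3
    beta = [2.0 * (d - k + 1) / (d * (d + 1)) for k in range(1, d + 1)]
    print(wd_kernel("GATTACA", "GATTTCA", d, beta))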

One question is how the weights βk for the various k-mers in (3) should be chosen. So far, only heuristic settings in combination with expensive model-selection methods have been used (e.g. [13]). The MKL approach offers a clean and efficient way to find the optimal weights β: We define d kernels

k_k(s_i, s_j) = ∑_{l=1}^{L-k+1} I( u_{k,l}(s_i) = u_{k,l}(s_j) )     (4)

and then optimize the convex combination of these kernels (with coefficients β) using the MKL algorithm (cf. (3)). The optimal weights β indicate which oligomer lengths are important for the classification problem at hand (see results in Section 2.2).

Additionally, it is interesting to introduce an importance weighting over the positions in the subsequence. Hence, we define a separate kernel for each position and each oligomer length, i.e.

k_{k,l}(s_i, s_j) = I( u_{k,l}(s_i) = u_{k,l}(s_j) ),

and optimize the weightings of the combined kernel, which may be written as

k(s_i, s_j) = ∑_{k=1}^d ∑_{l=1}^{L-k+1} β_{k,l} I( u_{k,l}(s_i) = u_{k,l}(s_j) )     (5)

The simpler case would be to consider only one kernel per position in the sequence: k(s_i, s_j) = ∑_{l=1}^L β_l k_l(s_i, s_j) with

k_l(s_i, s_j) = ∑_{k=1}^d γ_k I( u_{k,l}(s_i) = u_{k,l}(s_j) ),     (6)

where γ is the default weighting as used in [13].

Obviously, if one is able to obtain an accurate classification with a sparse weighting β_{k,l}, then one can quite easily interpret the resulting decision function. For instance, for signal detection problems (such as splice site detection), one would expect a few important positions with long oligomers near the site and some additional positions within the exon capturing the nucleotide composition (short oligomers; cf. Sections 2.4 and 2.5).

While the proposed MKL algorithms are applicable to arbitrary kernels, we particularly consider the case of string kernels and show how their properties can be exploited in order to significantly speed up the computations. We extend previous work by [8,14,15] and employ tries [16] during training and testing. In Section 4 we develop a method that avoids kernel caching, and we thereby obtain very memory-efficient and fast algorithms (which also speed up standard SVM training).

By bootstrapping and applying a combinatorial argument, we derive a statistical test that discovers the most important kernel weights. Using this test, we elucidate, on simulated pseudo-DNA sequences with two hidden 7-mers, which k-mers in the sequence were used for the SVM decision. Additionally, we apply our method to the problem of splice site classification (C. elegans acceptor sites) and to the problem of recognizing alternatively spliced exons [17].

2 Results and Discussion

The main goal of this work is to provide an explanation of the SVM decision rule, for instance by identifying sequence positions that are important for discrimination. As a first test we apply our method to a toy problem where everything is known and we can directly validate the findings of our algorithm against the underlying truth. As a next step, we show that our MKL algorithm performs as well as or slightly better than the standard SVM and leads to SVM classification functions that are computationally more efficient. In the remaining part we show how the weights can be used to obtain a deeper understanding of how the SVM classifies sequences and match it with knowledge about the underlying biological process.

2.1 Multiple Kernel Learning Detects Motifs in Toy Data Set

As a proof of concept, we test our method on a toy data set with two hidden 7-mers (at positions 10 & 30) at four different noise levels (we used different numbers of random positions in the 7-mers that were replaced with random nucleotides; for a detailed description of the data see Appendix A.1). We use the kernel as defined in (5) with one sub-kernel per position and oligomer length. We consider sequences of length L = 50 and oligomers up to length d = 7, leading to M = 350 sub-kernels. For every noise level, we train on 100 bootstrap replicates and learn the 350 WD kernel parameters in each run. On the resulting 100 weightings we perform the reliability test (cf. Section 4.3). The results are shown in Figure 1 (columns correspond to different noise levels – increasing from left to right). Each figure shows a kernel weighting β, where columns correspond to weights used at a certain sequence position and rows to the k-mer length used at that position. The plots in the first row show the weights that are detected to be important at a significance level of α = 0.05 in bright (yellow) color. The likelihood for every weight to be detected by the test, and thus to reject the null hypothesis H_0, is illustrated in the plots in the second row (cf. Section 4.3 for details). Bright colors mean that it is more likely to reject H_0.

Figure 1. In this "figure matrix", columns correspond to the noise level, i.e. different numbers of nucleotides randomly substituted in the motif of the toy data set (cf. Appendix A.1). Each sub-figure shows a matrix with each element corresponding to one kernel weight: columns correspond to weights used at a certain sequence position (1–50) and rows to the oligomer length used at that position (1–7). The first row of the figure matrix shows the kernel weights that are significant, while the second row depicts the likelihood of every weight to be rejected under H_0.

As long as the noise level does not exceed 2/7, matches of length 3 and 4 seem sufficient to distinguish sequences containing motifs from the rest. However, only the 3-mer is detected with the test procedure. When more nucleotides in the motifs are replaced with noise, more weights are determined to be of importance. This becomes especially obvious in column 3, where 4 out of 7 nucleotides within each motif were randomly replaced, but an average ROC score of 99.6% is still achieved. In the last column the ROC score drops to 83% (not shown), but only weights in the correct ranges 10–16 and 30–36 are found to be significant.

2.2 Optimization of WD Kernel Weights Speeds up Computations and Improves Accuracy

We compare the standard SVM with the WD kernel (default weighting as in [13]) and kernel caching (SVM-light implementation [18]) against our MKL-SVM algorithm with the WD kernel (optimized weighting) using tries (cf. Section 4). We applied both algorithms to the C. elegans acceptor splice data set, using 100,000 sequences for training, 100,000 examples for validation and 60,000 examples to test the classifiers' performance (cf. Appendix A.2). In this data set each sequence is a window of 141 nucleotides (nt) centered around an AG dimer, together with the corresponding label: +1 for true acceptor splice sites and -1 for decoys (cf. [13] and Appendix A.2 for more details). Using this setup we perform 5-fold cross-validation over the maximal oligomer length d ∈ {10,12,15,17,20} (cf. (3)) and the SVM regularization constant C ∈ {0.5, 2, 5, 10}. A detailed comparison of the WD kernel approach with other state-of-the-art methods is provided in [13] and goes beyond the scope of this work.

On the validation set we find that for the SVM using the standard WD kernel (default weighting), d = 20 and C = 0.5 give the best classification performance (ROC score 99.66%), while the MKL-SVM using the WD kernel (optimized weighting) gives the best results for d = 12 and C = 1 (ROC score also 99.66%). Figure 2 shows the WD kernel weights computed by the MKL-SVM approach. It suggests that 12-mers and 6-mers are of high importance, and 1–4-mers are important as well. On the test data set the resulting SVM classifier with the standard WD kernel performs as well as on the validation data set (ROC score again 99.66%), while the classifier obtained by MKL-SVMs with optimized WD kernel weights achieves a 99.67% ROC score. Astonishingly, training the MKL-SVM (i.e. with weight optimization and tries) was 1.5 times faster than training the original SVM (with kernel caching). Also, the resulting classifier provided by the new algorithm is considerably faster than the one obtained by the classical SVM, since many β-weights are zero (see also [19]).

It should be noted that the weighting obtained in this experiment is only partially useful for interpretation. In the case of splice site detection, it is unlikely that k-mers of length 12 play the most important role; oligomers of length up to six are more likely to be important. We believe that the large weight for the longest oligomer is an artifact of combining kernels with quite different properties. (The 12th kernel leads to the most diagonally dominant kernel matrix, which we believe is the reason for its large weight. This problem can be partially alleviated by including the identity matrix in the convex combination. However, as ℓ2-norm soft margin SVMs can be implemented by adding a constant to the diagonal of the kernel [20,21], this effectively leads to an additional ℓ2-norm penalization.) In the following example we consider one weight per position. In this case the combined kernels are more similar to each other and we expect more interpretable results.

Figure 2. Optimized WD kernel weights.

2.3 Optimal Positional Importance Weighting is Related to Positional Weight Matrices

An interesting relation of the learned weightings to the relative entropy between Positional Weight Matrices (PWMs) can be shown with the following experiment: We train an SVM with a WD kernel that consists of 60 first-order sub-kernels (i.e. only single-nucleotide matches are considered) on acceptor splice sites from C. elegans (100,000 sequences for training, 160,000 sequences for validation). The characteristic acceptor splice site AG dimer is at positions 31 & 32. We extracted the sequences from a window (-30, +28) around the dimer. The learned weights β_l are shown in Figure 3 (left). For comparison we computed the PWMs (Markov chains of zero-th order) for the positive and the negative class separately (denoted by p^+ and p^-). Additionally, we computed the relative entropy Δ_j between the two probability estimates p^+ and p^- at each position j by Δ_j = ∑_{σ∈{A,C,G,T}} p^+_{σ,j} log( p^+_{σ,j} / p^-_{σ,j} ), leading to Figure 3 (right). The shape of both plots is quite similar, i.e. both methods consider upstream information, as well as a position directly after the splice site, to be highly important. As a major difference, the WD weights in the exon remain at a high level. Note that both methods use only zero-th order information. Nevertheless, the classification accuracy is already quite high: on the separate validation set the SVM achieves a ROC score of 99.07% and the Positional Weight Matrices a ROC score of 98.83%.
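For illustration, the PWM baseline and Δ_j can be sketched as follows (the pseudo-count of one and the function names are assumptions of the sketch):

    import numpy as np

    def pwm(seqs, alphabet="ACGT"):
        # Column-wise nucleotide frequencies (zero-th order Markov chain),
        # with a pseudo-count of 1 to avoid zero probabilities.
        counts = np.ones((len(alphabet), len(seqs[0])))
        for s in seqs:
            for j, c in enumerate(s):
                counts[alphabet.index(c), j] += 1
        return counts / counts.sum(axis=0)

    def relative_entropy(p_pos, p_neg):
        # Delta_j = sum_sigma p+_{sigma,j} * log(p+_{sigma,j} / p-_{sigma,j})
        return (p_pos * np.log(p_pos / p_neg)).sum(axis=0)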

Figure 3. (left) Value of the learned weightings of an SVM with a WD kernel of 60 first-order sub-kernels; (right) relative entropy between the Positional Weight Matrices for the positive and the negative class, both trained for acceptor splice site detection.

2.4 Positional WD Kernel Weights Help in Understanding Splice Site Classification

Note that Markov chains become intractable and less accurate for high orders, which, on the other hand, seem necessary for achieving high accuracies in many sequence analysis tasks. SVMs, however, are efficient and accurate even for large oligomer lengths. We therefore expect that MKL-SVMs can also in this case provide useful insights into which positions carry the discriminative information.

In order to illustrate this idea we perform another experiment: We considered the larger region from -50 nt to +60 nt around the splice site and used the WD kernel with d = 15. We defined a kernel for every position that only accounts for substrings starting at the corresponding position (up to length 15). To obtain a smoother weighting and to reduce the computing time we only used ⌈111/2⌉ = 56 weights (combining every two positions into one weight). Figure 4 shows the average weighting computed over ten bootstrap runs trained on about 65,000 examples. Several regions of interest can be identified: a) the region -50 nt to -40 nt, which corresponds to the donor splice site of the previous exon (many introns in C. elegans are very short, often only 50 nt), b) the region -25 nt to -15 nt, which coincides with the location of the branch point, c) the intronic region closest to the splice site, which carries the greatest weight (-8 nt to -1 nt; the weights for the AG dimer are zero, since it appears in splice sites and decoys), and d) the exonic region (0 nt to +50 nt). Slightly surprising are the high weights in the exonic region, which we suspect only model triplet frequencies. The decay of the weights from +15 nt to +45 nt might be explained by the fact that not all exons are actually long enough. Furthermore, since the sequence in our case ends at +60 nt, the decay after +45 nt is an edge effect, as longer substrings cannot be matched.

Figure 4. Optimized WD kernel weights considering subsequences starting at different positions (one weight per two positions).

2.5 Finding Motifs for Splice Site Detection

We again consider the classification of acceptor splice sites against non-acceptor splice sites (with centered AG dimer) from C. elegans (cf. Appendix A.2 for details on the generation of the data sets). We trained our Multiple Kernel Learning algorithm (C = 2) on 5,000 randomly chosen sequences of length 111 nt with a maximal oligomer length of d = 10. This leads to M = 1110 kernels in the convex combination. Figure 5 shows the results obtained for this experiment (organized similarly to Figure 1). We can observe (cf. Figure 5b&c) that the optimized kernel coefficients are biologically plausible: longer significant oligomers are found close to the splice site position, oligomers of length 3 and 4 are mainly used in the exonic region (modeling triplet usage), and short oligomers appear near the branch site. Note, however, that one should use more of the available examples for training in order to extract more meaningful results (adapting 1110 kernel weights may have led to overfitting). In some preliminary tests using more training data we observed that longer oligomers and also more positions in the exonic and intronic regions become important for discrimination.

Figure 5. Figure a) shows the average (over 10 runs) of the weights per position (one weight for two positions) and d) the averaged weights per oligomer length (uniform position weighting). Figure b) displays the position and oligomer length combinations that were found to be significantly used (40 bootstrap runs). Figure c) shows the likelihood for rejecting H_0. In all runs we used 5,000 training examples.

Note that the weight matrix would be the outer product of the position weight vector (cf. Figure 5a) and the oligomer-length weight vector (cf. Figure 5d) if position and oligomer length were independent. This is clearly not the case: it seems very important (according to the weight for oligomer length 5) to consider longer oligomers for discrimination in the central region (see also Figure 2), while it is only necessary and useful to consider monomers and dimers in other parts of the sequence.

2.6 Understanding the Recognition of Alternatively Spliced Exons

In this section we consider the problem of recognizing one major form of alternative splicing, namely the exclusion of exons from the transcript. It has been shown that alternatively spliced exons have certain properties that distinguish them from constitutively spliced exons (cf. [17] and references therein). In [17] we developed a method that only uses information available to the splicing machinery, i.e. the DNA sequence itself, and accurately distinguishes between alternatively and constitutively spliced exons (50% true positive rate at a 1% false positive rate; see http://www.fml.tuebingen.mpg.de/raetsch/projects/RASE for more details). Using our MKL method we have identified regions near the 5' and 3' ends of the considered exons that carry most of the discriminative information. We show that these regions contain many hexamers that are present significantly more often than average in constitutively spliced exons.

In order to recognize alternatively spliced exons we consider the 5' and 3' end of the exons separately and use an extended version of the WD kernel (exhibiting an improved positional invariance, cf. [17]) on a 201 nt window centered around the exon start and end together with additional kernels capturing information about the length of the exon and the flanking introns [17].

To interpret the SVM classifier's result we employ Multiple Kernel Learning to determine the weights of the two WD kernels around the acceptor (5') and donor (3') site. In Figure 6 the learned weighting is shown (weights for other sub-kernels not shown). A higher weight at a certain position in the sequence corresponds to an increased importance of substrings starting at this location. Given this weighting, we can identify five regions which seem particularly important for discrimination: a-b) within the upstream intron the regions -70 nt to -40 nt and -30 nt to 0 nt (relative to the end of the intron), c) the exon positions +30 nt to +70 nt (relative to the beginning of the exon) and d) -90 nt to -30 nt (relative to the end of the exon), and finally e) the downstream intron positions 0 nt to +70 nt (relative to the beginning of the intron).

Figure 6. We use Multiple Kernel Learning to determine weights for the WD kernel. Shown is the learned weighting for the WD kernel at the acceptor and at the donor site. From areas of higher weight (upstream intron: regions -70 nt to -40 nt and -30 nt to 0 nt; exon: +30 nt to +70 nt and -90 nt to -30 nt; downstream intron: 0 nt to +70 nt) overrepresented hexamers have been extracted and are shown in Table 1.

To illustrate that these regions represent distinct discriminative features for the problem at hand, we counted the occurrences of all hexamers in the positive and negative examples. Using the frequency p- of occurrence of a hexamer in the negative examples as background model, we computed how likely it is to observe the frequency p+ in the positive sequences (E-value; using the binomial distribution). In Table 1 we display for each of the five regions the six top-ranked hexamers by E-value. In region a) the motif CTAACC frequently appears in various variations, while region b) is rich in C's and T's. Particularly interesting are the motifs AGTGAG and CAGCAG, which appear significantly only near the exon start and exon end, respectively. The downstream intron contains many G's and T's. (Members of the CELF gene family, for instance, bind to GT-rich regions; A. Zahler, personal communication.) A more complete list of the over-represented hexamers can be found on the supplementary web site http://www.fml.tuebingen.mpg.de/raetsch/projects/RASE.

Table 1. Shown are the top six ranked hexamers (by E-value) extracted for the upstream intron, the exon in between, and the downstream intron. The first column in the upper part shows the most important hexamers in the intron for the region -70 nt to -40 nt relative to the end of the intron; the lower part states 6-mers contained in the last 30 nt of the upstream intron. Similarly, the second column displays hexamers in the exon from +30 nt to +70 nt (upper half, relative to the exon start) and -90 nt to -30 nt (lower part, relative to the exon end), and the last column 6-mers in the downstream intron from 0 nt to +70 nt.

3 Conclusion

In this work we have developed a novel Multiple Kernel Learning algorithm that is applicable to large-scale sequence analysis problems and additionally assists in understanding how decisions are made. Using a novel reformulation of the MKL problem, we were able to reuse available SVM implementations, which, in combination with tries, led us to a very efficient MKL algorithm suitable for large-scale sequence analysis. In experiments on toy, splice-site detection and alternative exon recognition problems we have illustrated the usefulness of the Multiple Kernel Learning approach. The optimized kernel convex combination gives valuable hints about which positions in the sequences carry discriminative oligomers, and of which length. This solves, to a certain extent, one of the major problems with Support Vector Machines: the decisions now become interpretable. On the toy data set we re-discovered hidden sequence motifs even in the presence of a large amount of noise. In the first experiments on the acceptor splice site detection problem we discovered patterns in the optimized weightings that are biologically plausible. For the recognition of alternatively spliced exons we identified several regions near the 5' and 3' ends of the exons that display distinctive patterns. Extending our computational evaluation and considering other signal detection problems remains future work.

4 Methods

4.1 Support Vector Machines

We use Support Vector Machines [22], which are extensively studied in the literature (e.g. [11,20,21]). Their classification function can be written as in (1). The α_i's are the Lagrange multipliers and b is the usual bias, both of which result from SVM training. The kernel k is the key ingredient for learning with SVMs: it implicitly defines the feature space and the mapping Φ via

k(s_i, s_j) = ⟨ Φ(s_i), Φ(s_j) ⟩.

In the case of the aforementioned WD kernel, Φ maps into a feature space spanned by all possible k-mers of length up to d at each sequence position (of dimension D ≈ 4^{d+1} L). For a given sequence s, a dimension of Φ(s) is 1 if the sequence contains a certain substring at a certain position. The dot product between two mapped examples then counts the co-occurrences of substrings at all positions.

For a given set of training examples (s_i, y_i) (i = 1, ..., N), the SVM solution is obtained by solving the following optimization problem, which maximizes the soft margin between both classes [23]:

min_{w,b,ξ} (1/2) ‖w‖² + C ∑_{i=1}^N ξ_i
s.t. y_i ( ⟨w, Φ(s_i)⟩ + b ) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i = 1, ..., N,     (7)

where the parameter C determines the trade-off between the size of the margin and the margin errors ξi. The dual optimization problem is as follows:

max_α ∑_{i=1}^N α_i − (1/2) ∑_{i,j=1}^N α_i α_j y_i y_j k(s_i, s_j)
s.t. 0 ≤ α_i ≤ C for all i = 1, ..., N and ∑_{i=1}^N α_i y_i = 0.     (8)

Note that there exists a large variety of software packages that can efficiently solve the above optimization problem even for more than one hundred thousand examples (cf. references in [11] and also [12] for further speedups when string kernels are used).
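For illustration, a precomputed kernel matrix can be passed to such an off-the-shelf solver as sketched below (using scikit-learn's SVC rather than the SVM-light implementation used here; the function name is ours):

    import numpy as np
    from sklearn.svm import SVC

    def train_svm(K, y, C=1.0):
        # K: precomputed N x N kernel matrix (e.g. a WD kernel),
        # y: labels in {-1, +1}. Returns the quantities appearing in (1).
        svm = SVC(C=C, kernel="precomputed").fit(K, y)
        alpha_y = svm.dual_coef_.ravel()   # alpha_i * y_i for support vectors
        return alpha_y, svm.support_, svm.intercept_[0]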

4.2 The Multiple Kernel Learning Optimization Problem

4.2.1 Idea

In the Multiple Kernel Learning (MKL) problem one is given N data points (s_i, y_i) (y_i ∈ {± 1}), where s_i is subdivided into M components s_i = (s_{i,1}, ..., s_{i,M}) with s_{i,j} ∈ ℝ^{k_j}, and k_j is the dimensionality of the j-th component. Then one solves the following convex optimization problem [7], which is equivalent to the linear SVM for M = 1:

min (1/2) ( ∑_{j=1}^M d_j ‖w_j‖₂ )² + C ∑_{i=1}^N ξ_i
w.r.t. w_j ∈ ℝ^{k_j}, b ∈ ℝ, ξ ∈ ℝ^N
s.t. y_i ( ∑_{j=1}^M ⟨w_j, s_{i,j}⟩ + b ) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i = 1, ..., N,     (9)

where d_j is a prior weighting of the kernels (in [7], the d_j were chosen such that the combined kernel has trace one). For simplicity, we assume d_j = 1 for the rest of the paper and that the normalization, if necessary, is done within the mappings. Note that the ℓ1-norm of β is constrained to one, while the ℓ2-norm of w_j is penalized in each block j separately. The idea is that ℓ1-norm constrained or penalized variables tend to have sparse optimal solutions, while ℓ2-norm penalized variables do not [24]. Thus the above optimization problem offers the possibility to find sparse solutions at the block level, with non-sparse solutions within the blocks.

4.2.2 Reformulation as a Semi-Infinite Linear Program

The above optimization problem can also be formulated in terms of support vector kernels [7]: each block j corresponds to a separate kernel (K_j)_{r,s} = k_j(s_{r,j}, s_{s,j}) computing the dot product in the feature space of the j-th component. In [7] it has been shown that the following optimization problem is equivalent to (9):

min γ
w.r.t. γ ∈ ℝ, α ∈ ℝ^N
s.t. 0 ≤ α ≤ C1, ∑_{i=1}^N α_i y_i = 0 and
(1/2) ∑_{i,j=1}^N α_i α_j y_i y_j (K_k)_{i,j} − ∑_{i=1}^N α_i ≤ γ for k = 1, ..., M.     (10)

In order to solve (10), one may solve the following saddle point problem (Lagrangian):

L := γ + ∑_{k=1}^M β_k ( (1/2) ∑_{i,j=1}^N α_i α_j y_i y_j (K_k)_{i,j} − ∑_{i=1}^N α_i − γ ),     (11)

minimized w.r.t. α ∈ ℝ^N, γ ∈ ℝ (subject to 0 ≤ α ≤ C1 and ∑_i α_i y_i = 0) and maximized w.r.t. β ∈ ℝ^M with β ≥ 0. Setting the derivative w.r.t. γ to zero, one obtains the constraint ∑_{k=1}^M β_k = 1 and (11) simplifies to:

L = S(α) − ∑_{i=1}^N α_i, where S(α) := (1/2) ∑_{i,j=1}^N α_i α_j y_i y_j ∑_{k=1}^M β_k (K_k)_{i,j}.     (12)

Assume α* is the optimal solution; then θ* := S(α*) − ∑_i α_i is minimal and, hence, S(α) − ∑_i α_i ≥ θ* for all α (subject to the above constraints). Hence, finding a saddle point of (12) is equivalent to solving the following semi-infinite linear program:

max θ
w.r.t. β ∈ ℝ^M, θ ∈ ℝ
s.t. β_k ≥ 0 for all k, ∑_{k=1}^M β_k = 1 and
∑_{k=1}^M β_k S_k(α) − ∑_{i=1}^N α_i ≥ θ for all α ∈ ℝ^N with 0 ≤ α ≤ C1 and ∑_{i=1}^N α_i y_i = 0,
where S_k(α) := (1/2) ∑_{i,j=1}^N α_i α_j y_i y_j (K_k)_{i,j}.     (13)

4.2.3 A Column Generation Method

Note that there are infinitely many constraints (one for every vector α). Typically, algorithms for solving semi-infinite problems work by iteratively finding violated constraints, i.e. α vectors, for intermediate solutions (β, θ); then one adds the new constraint (corresponding to the new α) and resolves for β and θ [25]. The pseudo-code is outlined in Algorithm 1. Note, however, that there are no known convergence rates for such algorithms [25], although they often converge to the optimal solution in a small number of iterations [26,27]. (It has been shown that, for solving semi-infinite problems like (13) with a method related to boosting (e.g. [28]), one needs at most O(log(M)/ε̂²) iterations, where ε̂ is the unnormalized constraint violation and the constants may depend on the kernels and the number of examples N [24,29]. At least for not too small values of ε̂, this technique produces good approximate solutions reasonably fast. See [8] for more details.)

Fortunately, finding the constraint that is most violated corresponds to solving the SVM optimization problem for a fixed weighting of the kernels:

α^t = argmin_α (1/2) ∑_{i,j=1}^N α_i α_j y_i y_j (K)_{i,j} − ∑_{i=1}^N α_i
s.t. 0 ≤ α ≤ C1 and ∑_{i=1}^N α_i y_i = 0,     (14)

where K = ∑_j β_j K_j. Given the many efficient SVM optimizers available, the problem of finding the most violated constraint can be solved efficiently, too.

Finally, one needs some convergence criterion. Note that the problem is solved when all constraints are satisfied while the β's and θ are optimal. Hence, it is a natural choice to use the normalized maximal constraint violation as a convergence criterion. In our case this would be:

ε_t := | 1 − ( S(α^t) − ∑_i α_i^t ) / θ^t |,     (15)

where (β^t, θ^t) is the optimal solution at iteration t − 1 and α^t corresponds to the newly found maximally violating constraint (i.e. the SVM solution for weighting β^t; cf. Algorithm 1 for details). We usually only approximate the optimal solution and stop the optimization as soon as ε_t ≤ ε, where ε was set to 10^-4 or 10^-3 in our experiments.
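For illustration, the complete column-generation loop can be prototyped as follows (a sketch only, assuming small precomputed kernel matrices; the use of scikit-learn's SVC and SciPy's linprog, as well as all function names, are assumptions of the sketch and differ from the SVM-light/CPLEX setup used in our experiments):

    import numpy as np
    from scipy.optimize import linprog
    from sklearn.svm import SVC

    def mkl_silp(Ks, y, C=1.0, eps=1e-3, max_iter=50):
        # Column generation for the SILP (13): alternate an SVM solve
        # (most violated constraint, (14)) with an LP update of beta.
        # Ks: list of M kernel matrices; y: numpy array of labels in {-1, +1}.
        M, N = len(Ks), len(y)
        beta = np.full(M, 1.0 / M)
        theta, rows = -np.inf, []      # one row of S_k values per constraint
        for _ in range(max_iter):
            K = sum(b * Kk for b, Kk in zip(beta, Ks))
            svm = SVC(C=C, kernel="precomputed").fit(K, y)
            alpha = np.zeros(N)
            alpha[svm.support_] = svm.dual_coef_.ravel() * y[svm.support_]
            ay = alpha * y
            S = np.array([0.5 * ay @ Kk @ ay - alpha.sum() for Kk in Ks])
            if np.isfinite(theta) and abs(1 - beta @ S / theta) <= eps:
                break                  # convergence criterion (15)
            rows.append(S)
            # maximize theta s.t. beta >= 0, sum(beta) = 1, beta @ S_r >= theta
            res = linprog(c=np.r_[np.zeros(M), -1.0],
                          A_ub=np.array([np.r_[-r, 1.0] for r in rows]),
                          b_ub=np.zeros(len(rows)),
                          A_eq=[np.r_[np.ones(M), 0.0]], b_eq=[1.0],
                          bounds=[(0, None)] * M + [(None, None)])
            beta, theta = res.x[:M], res.x[M]
        return beta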

4.2.4 A chunking algorithm for simultaneous optimization of α and β

Usually it is infeasible to use standard optimization tools (e.g. MINOS, CPLEX, LOQO) for solving the SVM training problems on data sets containing more than a few thousand examples. So-called decomposition techniques overcome this limitation by exploiting the special structure of the SVM problem. The key idea of decomposition is to freeze all but a small number of optimization variables (working set) and to solve a sequence of constant-size problems (subproblems of (8)).

The general ideas of Chunking and the Sequential Minimal Optimization (SMO) algorithm were proposed in [30,31] and are implemented in many SVM software packages. Here we propose an extension of the Chunking algorithm that optimizes the kernel weights β and the example weights α at the same time. The algorithm is motivated by an insufficiency of the column-generation algorithm described in the previous section: if the β's are not yet optimal, then optimizing the α's to optimality is unnecessary and therefore inefficient. It would be considerably faster if, for any newly obtained α in the Chunking iterations, we could efficiently recompute the optimal β and then continue optimizing the α's using the new kernel weighting.

Intermediate Recomputation of β. Recomputing β involves solving a linear program, and the problem grows with each additional α-induced constraint. Hence, after many iterations solving the LP may become infeasible. Fortunately, two facts make it still possible: (1) only a small number of the added constraints are active, and one may remove an old inactive constraint for each newly added one – this prevents the LP from growing arbitrarily; and (2) Simplex-based LP optimizers such as CPLEX offer a so-called hot-start feature, which allows one to efficiently recompute the new solution if, for instance, only a few additional constraints are added. The SVM-light optimizer, which we are going to modify, internally needs the outputs f̂_j = ∑_i α_i y_i k(s_i, s_j) for all training examples in order to select the next variables for optimization [18]. However, if one changes the kernel weights, then the stored f̂_j values become invalid and need to be recomputed. In order to avoid a full recomputation, one additionally stores an M × N matrix f_{k,j} = ∑_i α_i y_i k_k(s_i, s_j), i.e. the outputs for each kernel separately. If the β's change, then f̂_j can be quite efficiently recomputed via f̂_j = ∑_k β_k f_{k,j}.
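In matrix form this update is a single product (a one-line sketch; f denotes the hypothetical M × N array of per-kernel outputs from above):

    import numpy as np

    def recompute_outputs(f, beta):
        # f[k, j] = sum_i alpha_i y_i k_k(s_i, s_j); after a beta update,
        # fhat_j = sum_k beta_k f[k, j] is one matrix-vector product.
        return beta @ f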

Faster α Optimization using Tries. Finally, in each iteration the Chunking optimizer may change a subset of the α's. In order to update f̂_j and f_{k,j}, one needs to compute full rows j of each kernel for every changed α_j. Usually one uses kernel caching to reduce the computational effort of this operation; in our case, however, this is not efficient enough, since the effectiveness of the kernel cache degrades drastically when many kernels are combined. Fortunately, for the WD kernel there is a way to avoid this problem by using so-called tries (cf. [16]; similarly proposed by [14] and others). While we cannot improve a single kernel evaluation (which is already O(L)), it turns out to be possible to drastically speed up the computation of a linear combination of kernels, i.e.

g(s) = ∑_{i∈I} α_i k(s_i, s),

where I is the index set. The idea is to create a trie for each position l = 1, ..., L of the sequence. We propose to attach weights to the internal nodes and leaves of the trie, allowing an efficient storage of weights for k-mers (1 ≤ k ≤ d). Now we may add all k-mers (k = 1, ..., d) of s_i (i ∈ I) starting at position l to the trie associated with position l (using weight α_i β_k; operations per position: O(d|I|)). Once created, the l-th trie can be traversed in order to look up which k-mers in a test sequence (starting at position l) have a non-zero contribution to g(s):

Following the path defined by the k-mer u, one adds all weights along the way and stops when no child exists (see Figure 7). Note that we can now compute g in O(Ld) operations (compared to O(|I|Ld) in the original formulation). Empirically we observed that the proposed Chunking algorithm is often 3–5 times faster than the column-generation algorithm proposed in the last section, while achieving the same accuracy. In the experiments in Section 2 we only used the Chunking algorithm with a chunk size of Q = 41.
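A dictionary-based sketch of the per-position tries (node layout and function names are illustrative assumptions, not the actual implementation):

    def add_kmers(trie, seq, pos, d, weight):
        # Insert the k-mers (k = 1, ..., d) of seq starting at pos,
        # adding weight(k) (e.g. alpha_i * beta_k) at the depth-k node.
        node = trie
        for k in range(1, d + 1):
            if pos + k > len(seq):
                break
            node = node.setdefault(seq[pos + k - 1], {"w": 0.0})
            node["w"] += weight(k)

    def lookup(trie, seq, pos, d):
        # Sum the weights along the path of the k-mer starting at pos;
        # stop as soon as the path leaves the trie.
        node, total = trie, 0.0
        for k in range(1, d + 1):
            if pos + k > len(seq) or seq[pos + k - 1] not in node:
                break
            node = node[seq[pos + k - 1]]
            total += node["w"]
        return total

    # g(s) is then the sum of lookup(tries[l], s, l, d) over all positions l.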

The pseudo-code of the algorithm which takes the discussion of this section into account is displayed in Algorithm 2.

Figure 7. Three sequences AAA, AGA, GAA being added to the trie. The plot displays the resulting weights at the nodes.

4.3 Estimating the Reliability of a Weighting

Finally we want to assess the reliability of the learned weights β. For this purpose we generate T bootstrap samples and rerun the whole procedure resulting in T weightings βt.

To test the importance of a weight β_{k,i} (and therefore of the corresponding kernel for that position and oligomer length) we apply the following method: We define Bernoulli variables Z^t_{k,i} ∈ {0,1}, k = 1, ..., d, i = 1, ..., L, t = 1, ..., T by

Z^t_{k,i} = I( β^t_{k,i} > 0 ).

The sum Z_{k,i} := ∑_{t=1}^T Z^t_{k,i} has a binomial distribution Bin(T, p_0), with p_0 unknown. We estimate p_0 by p̂_0, i.e. the empirical probability of observing Z^t_{k,i} = 1 over all k, i, t. We test whether Z_{k,i} is as large as could be expected under Bin(T, p̂_0) or larger, i.e. the null hypothesis is H_0: p ≤ c* (vs. H_1: p > c*). Here c* is defined as c* := p̂_0 + 2 Std_{k,i,t}(Z^t_{k,i}) and can be interpreted as an upper bound of a confidence interval for p_0. This choice is adaptive to the noise level of the data and hence to the (non-)sparsity of the weightings β^t. The hypotheses are tested with a Maximum-Likelihood test at level α = 0.05; that is, c** is the minimal value for which the following inequality holds:

∑_{z=c**}^{T} \binom{T}{z} (c*)^z (1 − c*)^{T−z} ≤ α.

For further details on the test see [32] or [33]. This test is carried out for every pair (k, i), k = 1, ..., d, i = 1, ..., L. (We assume independence between the weights within a single β, and hence that the test problem is the same for every β_{k,i}.) If H_0 can be rejected, the kernel at position i for oligomer length k is important for the detection and thus should contain biologically interesting knowledge about the problem at hand.
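A compact sketch of the test (assuming, as above, that Z^t_{k,i} indicates a non-zero weight; the use of SciPy's binomial survival function for the tail probability is an implementation choice of the sketch):

    import numpy as np
    from scipy.stats import binom

    def significant_weights(betas, level=0.05):
        # betas: (T, d, L) array with one weighting per bootstrap run.
        # Returns a (d, L) boolean mask where H0: p <= c* is rejected.
        T = betas.shape[0]
        Zt = betas > 0                  # Bernoulli indicators Z^t_{k,i}
        Z = Zt.sum(axis=0)              # Z_{k,i} ~ Bin(T, p0) under H0
        c_star = min(1.0, Zt.mean() + 2 * Zt.std())
        # reject where the upper tail P(X >= Z_{k,i}) under Bin(T, c*)
        # falls below the significance level
        return binom.sf(Z - 1, T, c_star) <= level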

Authors' contributions

GR proposed and implemented the SILP formulation of the MKL problem, prepared data sets, drafted the manuscript and helped carry out the experiments. SS invented the Weighted Degree kernel, analyzed several weighting schemes, reformulated it as an MKL problem, helped implement the MKL algorithms and carried out most of the experiments. CS developed the statistical significance test and critically revised the article.

A Data Generation

A.1 Toy Data

We generated 11,000 sequences of length 50, where the symbols of the alphabet {A, C, G, T} follow a uniform distribution. We chose 1,000 of these sequences to be positive examples and hid two motifs of length seven: at positions 10 and 30 we planted the motifs GATTACA and AGTAGTG, respectively. The remaining 10,000 examples were used as negatives; positive examples thus make up ≈ 9% of the data. In the positive examples, we then randomly replaced s ∈ {0, 2, 4, 5} symbols in each motif, leading to four different data sets, which were randomly permuted and split such that the first 1,000 examples became training examples and the remaining 10,000 validation examples.
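A sketch of this generator (0-based motif positions and all names are assumptions of the sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    ALPHABET = np.array(list("ACGT"))

    def make_toy(noise, n_pos=1000, n_neg=10000, L=50):
        # Uniform random sequences; plant GATTACA at 10 and AGTAGTG at 30
        # in the positives, then corrupt `noise` positions of each motif.
        seqs = rng.choice(ALPHABET, size=(n_pos + n_neg, L))
        for s in seqs[:n_pos]:
            s[10:17], s[30:37] = list("GATTACA"), list("AGTAGTG")
            for start in (10, 30):
                idx = start + rng.choice(7, size=noise, replace=False)
                s[idx] = rng.choice(ALPHABET, size=noise)
        y = np.r_[np.ones(n_pos), -np.ones(n_neg)]
        return ["".join(s) for s in seqs], y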

A.2 Splice Site Sequences

We collected all known C. elegans ESTs from Wormbase [34] (release WS118; 236,868 sequences), dbEST [35] (as of February 22, 2004; 231,096 sequences) and UniGene [36] (as of October 15, 2003; 91,480 sequences). Using blat [37] we aligned them against the genomic DNA (release WS118). We refined the alignment by correcting typical sequencing errors, for instance by removing minor insertions and deletions. If an intron did not exhibit the GT/AG or GC/AG dimers at its 5' and 3' ends, respectively, we tried to achieve this by shifting the boundaries by up to 2 nucleotides. For each sequence we determined the longest open reading frame (ORF) and only used the part of each sequence within the ORF. In a next step we merged agreeing alignments, leading to 135,239 unique EST-based sequences. We repeated the above with all known cDNAs from Wormbase (release WS118; 4,848 sequences) and UniGene (as of October 15, 2003; 1,231 sequences), which led to 4,979 unique sequences. We removed all EST matches fully contained in the cDNA matches, leaving 109,693 EST-based sequences.

We clustered the sequences in order to obtain independent training, validation and test sets. In the beginning each of the above EST and cDNA sequences was in a separate cluster. We iteratively joined clusters if any two sequences from distinct clusters a) match to the genome at most 100 nt apart (this includes many forms of alternative splicing) or b) have more than 20% sequence overlap (at 90% identity, determined by using blat). We obtained 17,763 clusters with a total of 114,672 sequences. There are 3,857 clusters that contain at least one cDNA. Finally, we removed all clusters that showed alternative splicing.

Since the resulting data set is still too large, we used only sequences from a randomly chosen 20% of the clusters with cDNA and 30% of the clusters without cDNA to generate true acceptor splice site sequences (15,507 of them). Each sequence is 398 nt long and has the AG dimer at position 200. Negative examples were generated from any occurring AG within the ORF of the sequence (246,914 of them were found). We used a random subset of 60,000 examples for testing, 100,000 examples for parameter tuning and up to 100,000 examples for training (unless stated otherwise).

Algorithms

Algorithm 1 The column generation algorithm employs a linear programming solver to iteratively solve the semi-infinite linear optimization problem (13). The accuracy parameter ε is given to the algorithm.

S^0 = 1, θ^1 = 0, β_k^1 = 1/M for k = 1, ..., M
for t = 1, 2, ... do
    obtain SVM's α^t with kernel k^t(s_i, s_j) := ∑_{k=1}^M β_k^t k_k(s_i, s_j)
    for k = 1, ..., M do
        S_k^t = (1/2) ∑_{i,j} α_i^t α_j^t y_i y_j k_k(s_i, s_j) − ∑_i α_i^t
    end for
    S^t = ∑_{k=1}^M β_k^t S_k^t
    if |1 − S^t/θ^t| ≤ ε then break
    (β^{t+1}, θ^{t+1}) = argmax θ
        w.r.t. β ∈ ℝ^M, θ ∈ ℝ with β_k ≥ 0, ∑_{k=1}^M β_k = 1 and
        ∑_{k=1}^M β_k S_k^r ≥ θ for r = 1, ..., t
end for

Algorithm 2 Outline of the Chunking algorithm (extension to SVM-light) that optimizes α and the kernel weighting β simultaneously. The accuracy parameter ε and the subproblem size Q are assumed to be given to the algorithm. For simplicity we omit the removal of inactive constraints. Also note that from one iteration to the next the LP only differs by one additional constraint. This can usually be exploited to save computing time for solving the LP.

$f_{k,i} = 0$, $\hat{f}_i = 0$, $\alpha_i = 0$, $\beta_k^1 = \frac{1}{M}$ for $k = 1, \ldots, M$ and $i = 1, \ldots, N$

for $t = 1, 2, \ldots$ do

    Check optimality conditions and stop if optimal

    select $Q$ suboptimal variables $i_1, \ldots, i_Q$ based on $\hat{f}$ and $\alpha$

    $\alpha^{\mathrm{old}} = \alpha$

    solve (8) with respect to the selected variables and update $\alpha$

    create trie structures to prepare efficient computation of $g_k(s) = \sum_{q=1}^{Q} (\alpha_{i_q} - \alpha_{i_q}^{\mathrm{old}})\, y_{i_q} k_k(s_{i_q}, s)$

    $f_{k,i} = f_{k,i} + g_k(s_i)$ for all $k = 1, \ldots, M$ and $i = 1, \ldots, N$

    for $k = 1, \ldots, M$ do

        $S_k^t = \frac{1}{2} \sum_{i=1}^{N} f_{k,i}\, \alpha_i y_i - \sum_{i=1}^{N} \alpha_i$

    end for

    $S^t = \sum_{k=1}^{M} \beta_k^t S_k^t$

    if $\left|1 - \frac{S^t}{\theta^t}\right| \geq \epsilon$ then

        $(\beta^{t+1}, \theta^{t+1}) = \operatorname{argmax}\ \theta$

            w.r.t. $\beta \in \mathbb{R}^M$, $\theta \in \mathbb{R}$ with $\beta_k \geq 0$ and $\sum_{k=1}^{M} \beta_k = 1$

            s.t. $\sum_{k=1}^{M} \beta_k S_k^r \geq \theta$ for $r = 1, \ldots, t$

    else

        $\theta^{t+1} = \theta^t$

    end if

    $\hat{f}_i = \sum_{k=1}^{M} \beta_k^{t+1} f_{k,i}$ for all $i = 1, \ldots, N$

end for
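The payoff of Algorithm 2 over the wrapper is that after each chunking step only the Q selected α's change, so the per-kernel outputs $f_{k,i}$ can be updated incrementally rather than recomputed from scratch. The sketch below shows this bookkeeping under the simplifying assumption of dense precomputed kernel matrices; the paper instead evaluates the update functions $g_k$ via trie structures [16], which is what makes this step feasible for string kernels.

    # Illustrative sketch of the incremental update f_{k,i} += g_k(s_i):
    # only the Q entries of alpha listed in `idx` changed, so each of the
    # M output vectors is corrected in O(Q * N) time.
    def update_outputs(f, Ks, y, alpha, alpha_old, idx):
        # f[k, i] = sum_j alpha_j y_j k_k(s_j, s_i), kept as an (M, N) array
        for k, Kk in enumerate(Ks):
            for q in idx:
                delta = (alpha[q] - alpha_old[q]) * y[q]
                if delta != 0.0:
                    f[k] += delta * Kk[q]  # contribution of variable q to g_k
        return f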

Acknowledgements

The authors gratefully acknowledge partial support from the PASCAL Network of Excellence (EU #506778) and DFG grants JA 379/13-2 and MU 987/2-1. We thank Alexander Zien, K.-R. Müller, B. Schölkopf, D. Weigel and M.K. Warmuth for great discussions and C.-S. Ong for proofreading the manuscript. G.R. would like to acknowledge a visiting appointment with National ICT Australia during the preparation of this work.

N.B. The appendix contains details regarding the data generation. Additional information about this work can be found at http://www.fml.tuebingen.mpg.de/raetsch/projects/mkl_splice.

References

  1. Zien A, Rätsch G, Mika S, Schölkopf B, Lengauer T, Müller KR: Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites. Bioinformatics 2000, 16(9):799-807.

  2. Jaakkola T, Diekhans M, Haussler D: A discriminative framework for detecting remote protein homologies. J Comput Biol 2000, 7(1-2):95-114.

  3. Zhang X, Heller K, Hefter I, Leslie C, Chasin L: Sequence information for the splicing of human pre-mRNA identified by support vector machine classification. Genome Res 2003, 13(12):2637-2650.

  4. Lanckriet G, Bie TD, Cristianini N, Jordan M, Noble W: A statistical framework for genomic data fusion. Bioinformatics 2004, 20:2626-2635.

  5. Delcher A, Harmon D, Kasif S, White O, Salzberg S: Improved microbial gene identification with GLIMMER. Nucleic Acids Research 1999, 27(23):4636-4641.

  6. Kuang R, Ie E, Wang K, Wang K, Siddiqi M, Freund Y, Leslie C: Profile-based string kernels for remote homology detection and motif extraction. Computational Systems Bioinformatics Conference 2004, 146-154.

  7. Bach FR, Lanckriet GRG, Jordan MI: Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the Twenty-first International Conference on Machine Learning. ACM Press; 2004:69.

  8. Sonnenburg S, Rätsch G, Schäfer C: Learning Interpretable SVMs for Biological Sequence Classification. RECOMB 2005, LNBI 3500. Springer-Verlag Berlin Heidelberg; 2005:389-407.

  9. Chapelle O, Vapnik V, Bousquet O, Mukherjee S: Choosing Multiple Parameters for Support Vector Machines. Machine Learning 2002, 46(1-3):131-159.

  10. Ong C, Smola A, Williamson R: Learning the Kernel with Hyperkernels. Journal of Machine Learning Research 2005, 6:1043-1071.

  11. Müller KR, Mika S, Rätsch G, Tsuda K, Schölkopf B: An Introduction to Kernel-Based Learning Algorithms. IEEE Transactions on Neural Networks 2001, 12(2):181-201.

  12. Sonnenburg S, Rätsch G, Schölkopf B: Large Scale Genomic Sequence SVM Classifiers. In Proceedings of the International Conference on Machine Learning, ICML 2005.

  13. Rätsch G, Sonnenburg S: Accurate Splice Site Prediction for Caenorhabditis elegans. MIT Press series on Computational Molecular Biology. MIT Press; 2003:277-298.

  14. Leslie C, Eskin E, Noble W: The Spectrum Kernel: A String Kernel for SVM Protein Classification. Proceedings of the Pacific Symposium on Biocomputing, Kaua'i, Hawaii 2002, 564-575.

  15. Vishwanathan S, Smola A: Fast Kernels for String and Tree Matching. In Kernel Methods in Computational Biology, MIT Press series on Computational Molecular Biology. MIT Press; 2003:113-130.

  16. Fredkin E: Trie Memory. Comm ACM 1960, 3(9):490-499.

  17. Rätsch G, Sonnenburg S, Schölkopf B: RASE: Recognition of Alternatively Spliced Exons in C. elegans. Bioinformatics 2005, 21:i369-i377.

  18. Joachims T: Making Large-Scale SVM Learning Practical. In Advances in Kernel Methods - Support Vector Learning. Edited by Schölkopf B, Burges C, Smola A. Cambridge, MA: MIT Press; 1999:169-184.

  19. Engel Y, Mannor S, Meir R: Sparse Online Greedy Support Vector Regression. ECML 2002, 84-96.

  20. Cristianini N, Shawe-Taylor J: An Introduction to Support Vector Machines. Cambridge, UK: Cambridge University Press; 2000.

  21. Schölkopf B, Smola AJ: Learning with Kernels. Cambridge, MA: MIT Press; 2002.

  22. Cortes C, Vapnik V: Support Vector Networks. Machine Learning 1995, 20:273-297.

  23. Vapnik V: The Nature of Statistical Learning Theory. New York: Springer Verlag; 1995.

  24. Rätsch G: Robust Boosting via Convex Optimization. PhD thesis. University of Potsdam, Computer Science Dept., August-Bebel-Str. 89, 14482 Potsdam, Germany; 2001.

  25. Hettich R, Kortanek K: Semi-Infinite Programming: Theory, Methods and Applications. SIAM Review 1993, 35(3):380-429.

  26. Bennett K, Demiriz A, Shawe-Taylor J: A Column Generation Algorithm for Boosting. In Proceedings, 17th ICML. Edited by Langley P. San Francisco: Morgan Kaufmann; 2000:65-72.

  27. Rätsch G, Demiriz A, Bennett K: Sparse Regression Ensembles in Infinite and Finite Hypothesis Spaces. Machine Learning 2002, 48(1-3):193-221. Special Issue on New Methods for Model Selection and Model Combination; also NeuroCOLT2 Technical Report NC-TR-2000-085.

  28. Meir R, Rätsch G: An Introduction to Boosting and Leveraging. In Proc. of the First Machine Learning Summer School in Canberra, LNCS. Edited by Mendelson S, Smola A. Springer; 2003, in press.

  29. Rätsch G, Warmuth MK: Efficient Margin Maximization with Boosting. Journal of Machine Learning Research 2005, 6(Dec):2131-2152.

  30. Vapnik V: Estimation of Dependences Based on Empirical Data. Berlin: Springer-Verlag; 1982.

  31. Platt J: Fast Training of Support Vector Machines using Sequential Minimal Optimization. In Advances in Kernel Methods - Support Vector Learning. Edited by Schölkopf B, Burges C, Smola A. Cambridge, MA: MIT Press; 1999:185-208.

  32. Mood A, Graybill F, Boes D: Introduction to the Theory of Statistics. Third edition. McGraw-Hill; 1974.

  33. Lehmann E: Testing Statistical Hypotheses. Second edition. New York: Springer; 1997.

  34. Harris TW, et al.: WormBase: a multi-species resource for nematode biology and genomics. Nucl Acids Res 2004, 32(Database issue):D411-D417.

  35. Boguski M, Lowe T, Tolstoshev C: dbEST - database for "expressed sequence tags". Nat Genet 1993, 4(4):332-333.

  36. Wheeler DL, et al.: Database Resources of the National Center for Biotechnology. Nucl Acids Res 2003, 31(1):28-33.

  37. Kent W: BLAT - the BLAST-like alignment tool. Genome Res 2002, 12(4):656-664.

  38. Bennett KP, Momma M, Embrechts MJ: MARK: a boosting algorithm for heterogeneous kernel models. KDD 2002, 24-31.

  39. Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B: Large Scale Multiple Kernel Learning. Journal of Machine Learning Research 2006, accepted.