<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>1471-2105-8-S10-S7</ui>
	<ji>1471-2105</ji>
	<fm>
		<dochead>Proceedings</dochead>
		<bibl>
			<title>
				<p>Accurate splice site prediction using support vector machines</p>
			</title>
			<aug>
				<au id="A1" ce="yes">
					<snm>Sonnenburg</snm>
					<fnm>S&#246;ren</fnm>
					<insr iid="I1"/>
					<email>Soeren.Sonnenburg@first.fraunhofer.de</email>
				</au>
				<au id="A2" ce="yes">
					<snm>Schweikert</snm>
					<fnm>Gabriele</fnm>
					<insr iid="I2"/>
					<insr iid="I3"/>
					<insr iid="I4"/>
					<email>Gabriele.Schweikert@tuebingen.mpg.de</email>
				</au>
				<au id="A3" ce="yes">
					<snm>Philips</snm>
					<fnm>Petra</fnm>
					<insr iid="I2"/>
					<email>Petra.Philips@tuebingen.mpg.de</email>
				</au>
				<au id="A4">
					<snm>Behr</snm>
					<fnm>Jonas</fnm>
					<insr iid="I2"/>
					<email>Jonas.Behr@tuebingen.mpg.de</email>
				</au>
				<au id="A5" ca="yes">
					<snm>R&#228;tsch</snm>
					<fnm>Gunnar</fnm>
					<insr iid="I2"/>
					<email>Gunnar.Raetsch@tuebingen.mpg.de</email>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>Fraunhofer Institute FIRST, Kekul&#233;str. 7, 12489 Berlin, Germany</p>
				</ins>
				<ins id="I2">
					<p>Friedrich Miescher Laboratory of the Max Planck Society, Spemannstr. 39, 72076 T&#252;bingen, Germany</p>
				</ins>
				<ins id="I3">
					<p>Max Planck Institute for Biological Cybernetics, Spemannstr. 38, 72076 T&#252;bingen, Germany</p>
				</ins>
				<ins id="I4">
					<p>Max Planck Institute for Developmental Biology, Spemannstr. 35, 72076 T&#252;bingen, Germany</p>
				</ins>
			</insg>
			<source>BMC Bioinformatics</source>
			<supplement>
				<title>
					<p>Neural Information Processing Systems (NIPS) workshop on New Problems and Methods in Computational Biology</p>
				</title>
				<editor>Gal Chechik, Christina Leslie, William Stafford Noble, Gunnar R&#228;tsch, Quaid Morris and Koji Tsuda</editor>
				<note>Proceedings</note>
			</supplement>
			<conference>
				<title>
					<p>NIPS workshop on New Problems and Methods in Computational Biology</p>
				</title>
				<location>Whistler, Canada</location>
				<date-range>8 December 2006</date-range>
				<url>http://www.mlcb.org</url>
			</conference>
			<issn>1471-2105</issn>
			<pubdate>2007</pubdate>
			<volume>8</volume>
			<issue>Suppl 10</issue>
			<fpage>S7</fpage>
			<url>http://www.biomedcentral.com/1471-2105/8/S10/S7</url>
			<xrefbib>
				<pubidlist>
					<pubid idtype="pmpid">18269701</pubid>
					<pubid idtype="doi">10.1186/1471-2105-8-S10-S7</pubid>
				</pubidlist>
			</xrefbib>
		</bibl>
		<history>
			<pub>
				<date>
					<day>21</day>
					<month>12</month>
					<year>2007</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2007</year>
			<collab>Sonnenburg et al; licensee BioMed Central Ltd.</collab>
			<note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
		</cpyrt>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<sec>
					<st>
						<p>Background</p>
					</st>
					<p>For splice site recognition, one has to solve two classification problems: discriminating true from decoy splice sites for both acceptor and donor sites. Gene finding systems typically rely on Markov Chains to solve these tasks.</p>
				</sec>
				<sec>
					<st>
						<p>Results</p>
					</st>
					<p>In this work we consider Support Vector Machines for splice site recognition. We employ the so-called <it>weighted degree </it>kernel, which turns out to be well suited for this task, as we illustrate in several experiments where we compare its prediction accuracy with that of recently proposed systems. We apply our method to the <it>genome-wide </it>recognition of splice sites in <it>Caenorhabditis elegans</it>, <it>Drosophila melanogaster</it>, <it>Arabidopsis thaliana</it>, <it>Danio rerio</it>, and <it>Homo sapiens</it>. Our performance estimates indicate that splice sites can be recognized very accurately in these genomes and that our method outperforms many other methods including <it>Markov Chains</it>, <it>GeneSplicer </it>and <it>SpliceMachine</it>. We provide genome-wide predictions of splice sites and a stand-alone prediction tool ready to be incorporated in a gene finder.</p>
				</sec>
				<sec>
					<st>
						<p>Availability</p>
					</st>
					<p>Data, splits, additional information on the model selection, the whole genome predictions, as well as the stand-alone prediction tool are available for download at <url>http://www.fml.mpg.de/raetsch/projects/splice</url>.</p>
				</sec>
			</sec>
		</abs>
	</fm>
	<bdy>
		<sec>
			<st>
				<p>Introduction</p>
			</st>
			<p>With the generation of whole genome sequences, important insight into gene functions and genetic variation has been gained over the last decades. As novel sequencing technologies are rapidly evolving, the way will be paved for cost-efficient, high-throughput whole genome sequencing, which is going to provide the community with massive amounts of sequences. It is self-evident that the handling of this wealth of data will require efficient and accurate computational methods for sequence analysis. Among the various tools in computational genetic research, gene prediction remains one of the most prominent tasks, as recent competitions have further emphasised (e.g. <abbrgrp>
					<abbr bid="B1">1</abbr>
					<abbr bid="B2">2</abbr>
				</abbrgrp>). Accurate gene prediction is of prime importance for the creation and improvement of annotations of recently sequenced genomes <abbrgrp>
					<abbr bid="B3">3</abbr>
					<abbr bid="B4">4</abbr>
				</abbrgrp>. In the light of new data related to natural variation (e.g. <abbrgrp>
					<abbr bid="B5">5</abbr>
					<abbr bid="B6">6</abbr>
					<abbr bid="B7">7</abbr>
				</abbrgrp>), accurate computational gene finding gains increasing importance since it helps to understand the effects of polymorphisms on the gene products.</p>
			<p>
				<it>Ab initio </it>gene prediction from sequence is a highly sophisticated procedure as it mimics &#8211; in its result &#8211; the labour of several complex cellular machineries at a time: identification of the beginning and the end of a gene, as is accomplished by RNA polymerases; splicing of the nascent RNA, in the cell performed by the spliceosome; and eventually the detection of an open reading frame, as does the ribosome. The success of a gene prediction method therefore relies on the accuracy of each of these components. In this paper we will focus on the improvement of signal sensors for the detection of splice sites, as this sub-problem is a core element of any gene finder. A comprehensive understanding of splice sites is not only a prerequisite for splice form prediction but can also be of great value in localizing genes <abbrgrp>
					<abbr bid="B8">8</abbr>
					<abbr bid="B9">9</abbr>
					<abbr bid="B10">10</abbr>
					<abbr bid="B11">11</abbr>
					<abbr bid="B12">12</abbr>
				</abbrgrp>.</p>
			<p>In eukaryotic genes, splice sites mark the boundaries between exons and introns. The latter are excised from precursor mRNAs in a post-processing step after transcription. Both the donor sites at the exon-intron junctions and the acceptor sites at the intron-exon boundaries have quite strong consensus sequences which can, however, vary significantly from one organism to another. The vast majority of all splice sites are so-called <it>canonical splice sites</it>, which are characterised by the presence of the dimers GT and AG for donor and acceptor sites, respectively. The occurrence of the dimer alone is not sufficient to identify a splice site; indeed, it occurs very frequently at non-splice-site positions. For example, in human DNA, which is &#8776;6&#183;10<sup>9 </sup>nucleotides in size, GT can be found about 400 million times (summed over both strands). For a crude estimate of, say, 2&#183;10<sup>4 </sup>genes with 20 exons each, only 0.1% of the consensus sites are true splice sites. We therefore face two extremely unbalanced classification tasks, namely the discrimination between true donor sites and decoy positions with the consensus dimer GT or GC (the only non-canonical splice site that we will consider) and the discrimination between true acceptor sites and decoy positions with the consensus dimer AG.</p>
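			<p>The crude estimate above can be retraced in a few lines. The following is a minimal sketch that only reuses the figures quoted in the text; the variable names are hypothetical, and the assumption of roughly one donor site per exon is an additional simplification made purely for illustration.</p>

```python
# Back-of-the-envelope class ratio for human donor sites, using only the
# crude figures quoted in the text.
gt_occurrences = 400e6   # GT dimers, summed over both strands (from the text)
genes = 2e4              # crude number of genes (from the text)
exons_per_gene = 20      # crude number of exons per gene (from the text)

# Assumption for illustration: roughly one donor site per exon.
true_donor_sites = genes * exons_per_gene
fraction_true = true_donor_sites / gt_occurrences
gt_per_true_site = gt_occurrences / true_donor_sites

print(f"true donor sites: {true_donor_sites:.0f}")
print(f"fraction of GT dimers that are true sites: {fraction_true:.2%}")
print(f"GT dimers per true site: {gt_per_true_site:.0f}")
```

			<p>Under these assumptions, roughly one GT dimer in a thousand is a true donor site.</p>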
			<sec>
				<st>
					<p>Relation to previous work</p>
				</st>
				<p>Although present-day splice site detectors (e.g. based on Support Vector Machines, neural networks, hidden Markov models) are reported to perform at a fairly good level <abbrgrp>
						<abbr bid="B9">9</abbr>
						<abbr bid="B13">13</abbr>
						<abbr bid="B14">14</abbr>
						<abbr bid="B15">15</abbr>
					</abbrgrp>, several of the reported performance numbers should be interpreted with caution, for a number of reasons. First, these results are based on <it>small </it>and potentially biased data sets. Now that many genomes have been fully sequenced, these results need to be re-evaluated. Second, issues in generating negative examples (decoys) were, if recognized, often not sufficiently documented. The choice of data sets, in particular the decoys, can make a tremendous difference in the measured performance. Third, often only the single site prediction of acceptor and donor sites is considered, whereas the higher goal is to use the splice site predictor within a gene finder. It is uncertain how well the predictors perform in this setting. Keeping these issues in mind, we provide unbiased <it>genome-wide </it>splice site predictions, which enable further evaluation in gene finders.</p>
				<p>In this paper, we will apply Support Vector Machines (SVMs) to the recognition of splice sites. SVMs are known to be excellent algorithms for solving classification tasks <abbrgrp>
						<abbr bid="B16">16</abbr>
						<abbr bid="B17">17</abbr>
						<abbr bid="B18">18</abbr>
						<abbr bid="B19">19</abbr>
					</abbrgrp>, and have also been successfully applied to several bioinformatics problems <abbrgrp>
						<abbr bid="B3">3</abbr>
						<abbr bid="B20">20</abbr>
						<abbr bid="B21">21</abbr>
						<abbr bid="B22">22</abbr>
						<abbr bid="B23">23</abbr>
					</abbrgrp> including splice site detection, cf. e.g. <abbrgrp>
						<abbr bid="B24">24</abbr>
						<abbr bid="B25">25</abbr>
						<abbr bid="B26">26</abbr>
						<abbr bid="B27">27</abbr>
						<abbr bid="B28">28</abbr>
						<abbr bid="B29">29</abbr>
						<abbr bid="B30">30</abbr>
						<abbr bid="B31">31</abbr>
						<abbr bid="B32">32</abbr>
					</abbrgrp>. Our work builds upon our previous work: In <abbrgrp>
						<abbr bid="B24">24</abbr>
						<abbr bid="B25">25</abbr>
					</abbrgrp> we demonstrated that SVMs using kernels from probabilistic hidden Markov models (cf. <abbrgrp>
						<abbr bid="B20">20</abbr>
						<abbr bid="B23">23</abbr>
					</abbrgrp>) outperform hidden Markov models <it>alone</it>. As this approach did not scale to many training examples, we performed a comparison of different <it>faster </it>methods for splice site recognition <abbrgrp>
						<abbr bid="B28">28</abbr>
					</abbrgrp>, where we considered Markov models and SVMs with different kernels: the so-called <it>locality improved kernel</it>, originally proposed for recognition of translation initiation sites <abbrgrp>
						<abbr bid="B21">21</abbr>
					</abbrgrp>; the <it>SVM-pairwise kernel</it>, using alignment scores <abbrgrp>
						<abbr bid="B33">33</abbr>
					</abbrgrp>; the <it>TOP kernel</it>, making use of a probabilistic model (cf. <abbrgrp>
						<abbr bid="B20">20</abbr>
						<abbr bid="B23">23</abbr>
					</abbrgrp>); the standard <it>polynomial kernel </it>
					<abbrgrp>
						<abbr bid="B16">16</abbr>
					</abbrgrp>; and the so-called <it>weighted degree kernel </it>
					<abbrgrp>
						<abbr bid="B28">28</abbr>
						<abbr bid="B34">34</abbr>
					</abbrgrp>. A predictor based on the latter kernel has been successfully used in combination with other information for predicting the splice form of a gene, while outperforming other HMM based approaches <abbrgrp>
						<abbr bid="B3">3</abbr>
					</abbrgrp>. This indicates that the improved accuracy of splice site recognition indeed leads to a higher accuracy in <it>ab initio </it>transcript prediction.</p>
				<p>Other groups also reported successful SVM based splice site detectors. In <abbrgrp>
						<abbr bid="B27">27</abbr>
					</abbrgrp> it was proposed to use linear SVMs on binary features computed from di-nucleotides, an approach which also outperformed previous Markov models. Achieving even higher accuracy, the authors of SpliceMachine <abbrgrp>
						<abbr bid="B29">29</abbr>
					</abbrgrp> not only used positional information (one- to trimers) around the splice site, but also explicitly modelled compositional context using tri- to hexamers. To the best of our knowledge, this approach is the current state of the art, outperforming previous SVM based approaches as well as GeneSplicer <abbrgrp>
						<abbr bid="B12">12</abbr>
					</abbrgrp> and GeneSplicerESE <abbrgrp>
						<abbr bid="B35">35</abbr>
					</abbrgrp>. In <abbrgrp>
						<abbr bid="B31">31</abbr>
					</abbrgrp> linear SVMs were used on positional features that were extracted from empirical estimates of unconditional positional probabilities. Note that this approach is similar to our TOP kernel method on zeroth-order Markov chains <abbrgrp>
						<abbr bid="B28">28</abbr>
					</abbrgrp>. Recently, <abbrgrp>
						<abbr bid="B32">32</abbr>
					</abbrgrp> reported improved accuracies for splice site prediction also by using SVMs. The method employed in <abbrgrp>
						<abbr bid="B32">32</abbr>
					</abbrgrp> is very similar to a kernel initially proposed in <abbrgrp>
						<abbr bid="B21">21</abbr>
					</abbrgrp> (<it>Salzberg kernel</it>). The idea of this kernel is to use empirical estimates of conditional positional probabilities of the nucleotides around splice sites (estimated by Markov models of first order) which are then used as input for classification by an SVM.</p>
				<p>Many other methods have been proposed for splice site recognition, for instance multilayer neural networks with Markovian probabilities as inputs <abbrgrp>
						<abbr bid="B15">15</abbr>
					</abbrgrp>. The authors train three Markov models on three segments of the input sequence: the upstream, signal and downstream segments. Although they outperform <abbrgrp>
						<abbr bid="B32">32</abbr>
					</abbrgrp> on small datasets, the authors themselves write that the training of the neural networks is especially slow when the numbers of true and decoy examples are imbalanced and that they have to downsample the number of negatives for training even on small and short sequence sets. Therefore, their method does not seem suitable for large-scale genome-wide computations. Finally, <abbrgrp>
						<abbr bid="B36">36</abbr>
					</abbrgrp> proposed a method based on Bayesian Networks which models statistical dependencies between nucleotide positions.</p>
				<p>In this work we will compare a few of our previously considered methods against these approaches and show that the engineering of the kernel, the careful choice of features and a sound model selection procedure are important for obtaining accurate predictions of splice sites.</p>
				<p>Our previous comparison in <abbrgrp>
						<abbr bid="B28">28</abbr>
					</abbrgrp> was performed on a relatively small data set derived from the <it>C. elegans </it>genome. Also, the data sets considered in <abbrgrp>
						<abbr bid="B32">32</abbr>
					</abbrgrp> are relatively small (around 300,000 examples, whereas more than 50,000,000 examples are nowadays readily available). In this study we therefore reevaluate the previous results on much larger data sets derived from the genomes of five model organisms, namely <it>Caenorhabditis elegans </it>("worm"), <it>Arabidopsis thaliana </it>("cress"), <it>Drosophila melanogaster </it>("fly"), <it>Danio rerio </it>("fish"), and <it>Homo sapiens </it>("human"). Building on our recent work on large scale kernel learning <abbrgrp>
						<abbr bid="B37">37</abbr>
						<abbr bid="B38">38</abbr>
						<abbr bid="B39">39</abbr>
						<abbr bid="B40">40</abbr>
					</abbrgrp>, we are now able to train and evaluate Support Vector Machines on data sets as large as is necessary for analyzing the whole human genome. In particular, we are able to show that increasing the number of training examples indeed helps to obtain a significantly improved performance, and thus will help to improve existing annotation (see, e.g. <abbrgrp>
						<abbr bid="B3">3</abbr>
					</abbrgrp>). We train and evaluate SVMs on newly generated data sets using nested cross-validation and provide genome-wide splice site predictions for any occurring GT, GC and AG dimers, which will be furnished with posterior probability estimates for being true splice sites. We will show that the methods in some cases exhibit dramatic performance differences for the different data sets.</p>
			</sec>
			<sec>
				<st>
					<p>Organization of the paper</p>
				</st>
				<p>The paper is organized as follows: In the next section we present the outcomes of (a) the comparison with the methods proposed in <abbrgrp>
						<abbr bid="B12">12</abbr>
						<abbr bid="B29">29</abbr>
						<abbr bid="B32">32</abbr>
						<abbr bid="B36">36</abbr>
					</abbrgrp>, (b) an assessment of which window length should be used for classification and, finally, (c) a comparison of the large scale methods on the genome-wide data sets for the five considered genomes. After discussing our results, we also address the question of the interpretability of SVMs. Finally, in the Methods section, we describe the generation of our data sets, the details of cross-validation and model selection, the different kernels, and the visualization methods that we used in this study.</p>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Results and discussions</p>
			</st>
			<p>In this section we discuss experimental results we obtained with our methods for acceptor and donor splice site predictions for the five considered organisms.</p>
			<p>Throughout the paper we measure our prediction accuracy in terms of area under the Receiver Operating Characteristic curve (auROC) <abbrgrp>
					<abbr bid="B41">41</abbr>
					<abbr bid="B42">42</abbr>
				</abbrgrp> and area under the Precision Recall Curve (auPRC) (e.g., <abbrgrp>
					<abbr bid="B43">43</abbr>
				</abbrgrp>). (We do not report the classification accuracy, as often more than 99% of the examples are negatively labeled. Thus, the simplest classifier, predicting -1 for all examples, already achieves 99% accuracy, rendering this measure meaningless.) Note that for unbalanced data sets the auROC can also be rather meaningless, since this measure is independent of class ratios and large auROC values may not necessarily indicate a good detection performance. The auPRC is a better measure of performance if the class distribution is very unbalanced. However, it does depend on the class priors on the test set and hence is affected by sub-sampling the decoys, as happened with the data sets used in previous studies (NN269 in <abbrgrp>
					<abbr bid="B32">32</abbr>
				</abbrgrp> contains about 4 times more decoy than true sites, DGSplicer in <abbrgrp>
					<abbr bid="B32">32</abbr>
					<abbr bid="B36">36</abbr>
				</abbrgrp> about 140 times more; in contrast, in the genome scenario the ratio is one to 300 &#8211; 1000). In order to compare the results among the different data sets with different class sizes, we therefore also provide the auROC score which is not affected by sub-sampling.</p>
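			<p>The different behaviour of the two measures under sub-sampling can be illustrated with a small simulation. The sketch below is not the evaluation code used in this study: the Gaussian score distributions, the class sizes, the function names <it>au_roc</it> and <it>au_prc</it>, and the use of average precision as an approximation of the auPRC are all assumptions made for illustration.</p>

```python
import random

def au_roc(scores, labels):
    """Area under the ROC curve via the rank-sum statistic (assumes no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    # Sum of the 1-based ranks of the true sites in ascending score order.
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def au_prc(scores, labels):
    """Average precision, used here as an approximation of the auPRC."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at the rank of this true site
    return total / hits

rng = random.Random(0)
# Hypothetical classifier outputs: true sites score higher on average than decoys.
pos = [rng.gauss(1.5, 1.0) for _ in range(200)]
neg = [rng.gauss(0.0, 1.0) for _ in range(60000)]  # about 1:300, genome-like ratio
neg_sub = rng.sample(neg, 800)                      # decoys sub-sampled to 1:4

def evaluate(decoys):
    scores = pos + decoys
    labels = [1] * len(pos) + [-1] * len(decoys)
    return au_roc(scores, labels), au_prc(scores, labels)

roc_full, prc_full = evaluate(neg)
roc_sub, prc_sub = evaluate(neg_sub)
print(f"all decoys:         auROC={roc_full:.3f}  auPRC={prc_full:.3f}")
print(f"sub-sampled decoys: auROC={roc_sub:.3f}  auPRC={prc_sub:.3f}")
```

			<p>In such a simulation the auROC should stay essentially unchanged when decoys are removed, whereas the auPRC increases markedly, which is why auPRC values obtained on sub-sampled data sets are not directly comparable.</p>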
			<sec>
				<st>
					<p>Pilot studies on small datasets</p>
				</st>
				<sec>
					<st>
						<p>Performance on the NN269 and DGSplicer data sets</p>
					</st>
					<p>For the comparison of our SVM classifiers to the approaches proposed in <abbrgrp>
							<abbr bid="B32">32</abbr>
							<abbr bid="B36">36</abbr>
						</abbrgrp>, we first measure the performance of our methods on the four tasks used in <abbrgrp>
							<abbr bid="B32">32</abbr>
						</abbrgrp> (see Methods for details). The approach in <abbrgrp>
							<abbr bid="B32">32</abbr>
						</abbrgrp> is outperformed by a neural network approach proposed in <abbrgrp>
							<abbr bid="B15">15</abbr>
						</abbrgrp>. However, we do not compare our methods to the latter method, since it already reaches its computational limits for the small datasets with only a few thousand short sequences (cf. <abbrgrp>
							<abbr bid="B15">15</abbr>
						</abbrgrp>, page 138) and hence is not suitable for large-scale genome-wide computations. On each task we trained SVMs with the <it>weighted degree kernel </it>(WD) <abbrgrp>
							<abbr bid="B28">28</abbr>
						</abbrgrp>, and the <it>weighted degree kernel with shifts </it>(WDS) <abbrgrp>
							<abbr bid="B34">34</abbr>
						</abbrgrp>. On the NN269 Acceptor and Donor sets we additionally trained an SVM using the <it>locality improved kernel </it>(LIK) <abbrgrp>
							<abbr bid="B21">21</abbr>
						</abbrgrp>; as it gives the weakest prediction performance and is computationally most expensive we exclude this model from the following investigations. As a benchmark method we also train higher order Markov Chains (MCs) (e.g. <abbrgrp>
							<abbr bid="B44">44</abbr>
						</abbrgrp>) of "linear" structure and predict with the posterior log-odds ratio (cf. Methods section). Note that Position Specific Scoring Matrices (PSSM) are recovered as the special case of zeroth-order MCs. A summary of our results showing the auROC and auPRC scores is displayed in Table <tblr tid="T1">1</tblr>.</p>
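					<p>To make the benchmark method concrete, the following minimal sketch trains Markov chains on true and decoy sequences and scores a candidate by the log-odds ratio. It is an illustration under stated assumptions, not the implementation used in this study: it assumes a position-specific (inhomogeneous) chain, the function names, the pseudocount smoothing and the toy sequences are hypothetical, and the constant class-prior term of the posterior log-odds is omitted. With <it>order=0</it> the model reduces to a PSSM.</p>

```python
import math

ALPHABET = "ACGT"

def train_mc(seqs, order, pseudocount=1.0):
    """Fit a position-specific Markov chain and return a log-likelihood function."""
    length = len(seqs[0])
    # counts[i][context][symbol]: how often `symbol` follows `context` at position i.
    counts = [{} for _ in range(length)]
    for s in seqs:
        for i in range(length):
            context = s[max(0, i - order):i]
            table = counts[i].setdefault(context, {})
            table[s[i]] = table.get(s[i], 0.0) + 1.0

    def log_prob(s):
        total = 0.0
        for i in range(length):
            context = s[max(0, i - order):i]
            table = counts[i].get(context, {})
            # Pseudocounts keep unseen symbols and contexts at non-zero probability.
            denom = sum(table.values()) + pseudocount * len(ALPHABET)
            total += math.log((table.get(s[i], 0.0) + pseudocount) / denom)
        return total

    return log_prob

def log_odds_scorer(true_seqs, decoy_seqs, order):
    """Score = log P(s | true) - log P(s | decoy); the class-prior term is omitted."""
    lp_true = train_mc(true_seqs, order)
    lp_decoy = train_mc(decoy_seqs, order)
    return lambda s: lp_true(s) - lp_decoy(s)

# Hypothetical toy windows with the GT dimer at positions 3 and 4 (0-based).
true_sites = ["AAGGTAAGT", "CAGGTAAGT", "AAGGTGAGT", "CAGGTGAGA"]
decoys = ["TTCGTCCTA", "GCTGTTTCC", "ATCGTCTTG", "CCTGTACAT"]

pssm = log_odds_scorer(true_sites, decoys, order=0)  # zeroth order: a PSSM
mc1 = log_odds_scorer(true_sites, decoys, order=1)   # first-order chain
for s in ("AAGGTAAGT", "TTCGTCCTA"):
    print(s, round(pssm(s), 2), round(mc1(s), 2))
```

					<p>A positive score favours the true-site model and a negative score the decoy model; raising <it>order</it> lets each position depend on more preceding nucleotides, at the cost of more parameters to estimate.</p>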
					<tbl id="T1">
						<title>
							<p>Table 1</p>
						</title>
						<caption>
							<p>Performance evaluation (auROC and auPRC scores) of six different methods on the NN269 and DGSplicer Acceptor and Donor test sets. MC denotes prediction with a Markov Chain, EBN the method proposed in [36], and MC-SVM the SVM based method described in [32] (similar to [21]).</p>
						</caption>
						<tblbdy cols="7">
							<r>
								<c>
									<p/>
								</c>
								<c ca="center">
									<p>MC</p>
								</c>
								<c ca="center">
									<p>EBN</p>
								</c>
								<c ca="center">
									<p>MC-SVM</p>
								</c>
								<c ca="center">
									<p>LIK</p>
								</c>
								<c ca="center">
									<p>WD</p>
								</c>
								<c ca="center">
									<p>WDS</p>
								</c>
							</r>
							<r>
								<c cspan="7">
									<hr/>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>
										<b>NN269</b>
									</p>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<b>Acceptor</b>
									</p>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<it>auROC</it>
									</p>
								</c>
								<c ca="center">
									<p>96.78</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>96.74<sup>&#8224;</sup>
									</p>
								</c>
								<c ca="center">
									<p>98.19</p>
								</c>
								<c ca="center">
									<p>98.16</p>
								</c>
								<c ca="center">
									<p>98.65</p>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<it>auPRC</it>
									</p>
								</c>
								<c ca="center">
									<p>88.41</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>88.33<sup>&#8224;</sup>
									</p>
								</c>
								<c ca="center">
									<p>92.48</p>
								</c>
								<c ca="center">
									<p>92.53</p>
								</c>
								<c ca="center">
									<p>94.36</p>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<b>Donor</b>
									</p>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<it>auROC</it>
									</p>
								</c>
								<c ca="center">
									<p>98.18</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>97.64<sup>&#8224;</sup>
									</p>
								</c>
								<c ca="center">
									<p>98.04</p>
								</c>
								<c ca="center">
									<p>98.50</p>
								</c>
								<c ca="center">
									<p>98.13</p>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<it>auPRC</it>
									</p>
								</c>
								<c ca="center">
									<p>92.42</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>89.57<sup>&#8224;</sup>
									</p>
								</c>
								<c ca="center">
									<p>92.65</p>
								</c>
								<c ca="center">
									<p>92.86</p>
								</c>
								<c ca="center">
									<p>92.47</p>
								</c>
							</r>
							<r>
								<c cspan="7">
									<hr/>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>
										<b>DGSplicer</b>
									</p>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<b>Acceptor</b>
									</p>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<it>auROC</it>
									</p>
								</c>
								<c ca="center">
									<p>97.23</p>
								</c>
								<c ca="center">
									<p>95.91*</p>
								</c>
								<c ca="center">
									<p>95.35*</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>97.50</p>
								</c>
								<c ca="center">
									<p>97.28</p>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<it>auPRC</it>
									</p>
								</c>
								<c ca="center">
									<p>30.59</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>32.08</p>
								</c>
								<c ca="center">
									<p>28.58</p>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<b>Donor</b>
									</p>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<it>auROC</it>
									</p>
								</c>
								<c ca="center">
									<p>98.34</p>
								</c>
								<c ca="center">
									<p>96.88*</p>
								</c>
								<c ca="center">
									<p>95.08*</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>97.84</p>
								</c>
								<c ca="center">
									<p>97.47</p>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<it>auPRC</it>
									</p>
								</c>
								<c ca="center">
									<p>41.72</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>39.72</p>
								</c>
								<c ca="center">
									<p>35.59</p>
								</c>
							</r>
						</tblbdy>
						<tblfn>
							<p>The remaining methods are based on SVMs using the locality improved kernel (LIK) [21], weighted degree kernel (WD) [28] and weighted degree kernel with shifts (WDS) [34]. The values marked with an asterisk were estimated from the figures provided in [32]. The values marked with &#8224; are from personal communication with the authors of [32].</p>
						</tblfn>
					</tbl>
					<p>We first note that the simple MCs already perform fairly well in comparison to the SVM methods. Surprisingly, we find that the MC-SVM proposed in <abbrgrp>
							<abbr bid="B32">32</abbr>
						</abbrgrp> performs worse than the MCs. (We have reevaluated the results in <abbrgrp>
							<abbr bid="B32">32</abbr>
						</abbrgrp> with the code provided by the authors and found that the stated false positive rate of their method is wrong by a factor of 10. We have contacted the authors for clarification and they published an erratum <abbrgrp>
							<abbr bid="B45">45</abbr>
						</abbrgrp>. The results for MC-SVMs given in Table <tblr tid="T1">1</tblr> are based on the corrected performance measurement.) As anticipated, for the two acceptor recognition tasks, EBN and MCs are outperformed by all kernel models, which all perform at a similar level. However, we were intrigued to observe that for the DGSplicer Donor recognition task, the MC based predictions outperform the kernel methods. For NN269 Donor recognition their performance is similar to the performance of the kernel methods.</p>
					<p>There are at least two possible explanations for the strong performance of the MCs. First, the DGSplicer data set has been derived from the genome annotation, which in turn might have been obtained using an MC based gene finder. Hence, the test set may contain false predictions that are more easily reproduced by an MC. Second, the window size for the DGSplicer Donor recognition task is very short and has been tuned in <abbrgrp>
							<abbr bid="B36">36</abbr>
						</abbrgrp> to maximize the performance of their method (EBN) and might be suboptimal for SVMs. We investigated these hypotheses with two experiments:</p>
					<p>&#8226; In the first experiment, we shortened the length of the sequences in DGSplicer Acceptor from 36 to 18 (with the consensus AG at positions 8 and 9). We retrained the MC and WD models, doing a full model selection on the shortened training data. We observe that on the shortened data set the prediction performance drops drastically for both MC and WD (by 60% relative, to 12.9% and 9% auPRC, respectively) and that, indeed, the MC outperforms the WD method.</p>
					<p>&#8226; In a second experiment, we started with a subset of our new data sets generated from the genomes of worm and human which only uses EST or cDNA confirmed splice sites (see Methods section). In the training data we used the same number of true and decoy donor sites as in the DGSplicer data set. For the test data we used the original class ratios (in order to allow a direct comparison to the following experiments; cf. Table <tblr tid="T2">2</tblr>). Training and testing sequences were shortened from 218 nt in steps of 10 nt down to 18 nt (the same length as in the DGSplicer donor data set). We then trained and tested MCs and WD-SVMs for the sets of sequences of different lengths. Figure <figr fid="F1">1</figr> shows the resulting values for the auPRC on the test data for different sequence lengths. For the short sequences, the prediction accuracies of MCs and SVMs are close for both organisms. For human donor sequences of length 18, MCs indeed outperform SVMs. With increasing sequence length, however, the auPRC of SVMs rapidly improves while it degrades for MCs. Recall that the short sequence length in the DGSplicer data was tuned through model selection for EBN, and thus the performance of the EBN method will degrade for longer sequences <abbrgrp>
							<abbr bid="B36">36</abbr>
						</abbrgrp>, so that we can safely infer that our methods would also outperform EBN for longer training sequences.</p>
					<tbl id="T2">
						<title>
							<p>Table 2</p>
						</title>
						<caption>
							<p>Characteristics of the genome-wide data sets containing true and decoy acceptor and donor splice sites for our five model organisms.</p>
						</caption>
						<tblbdy cols="11">
							<r>
								<c>
									<p/>
								</c>
								<c cspan="2" ca="center">
									<p>
										<b>Worm</b>
									</p>
								</c>
								<c cspan="2" ca="center">
									<p>
										<b>Fly</b>
									</p>
								</c>
								<c cspan="2" ca="center">
									<p>
										<b>Cress</b>
									</p>
								</c>
								<c cspan="2" ca="center">
									<p>
										<b>Fish</b>
									</p>
								</c>
								<c cspan="2" ca="center">
									<p>
										<b>Human</b>
									</p>
								</c>
							</r>
							<r>
								<c>
									<p/>
								</c>
								<c ca="left">
									<p>Acceptor</p>
								</c>
								<c ca="left">
									<p>Donor</p>
								</c>
								<c ca="left">
									<p>Acceptor</p>
								</c>
								<c ca="left">
									<p>Donor</p>
								</c>
								<c ca="left">
									<p>Acceptor</p>
								</c>
								<c ca="left">
									<p>Donor</p>
								</c>
								<c ca="left">
									<p>Acceptor</p>
								</c>
								<c ca="left">
									<p>Donor</p>
								</c>
								<c ca="left">
									<p>Acceptor</p>
								</c>
								<c ca="left">
									<p>Donor</p>
								</c>
							</r>
							<r>
								<c cspan="11">
									<hr/>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Training total</p>
								</c>
								<c ca="left">
									<p>1,105,886</p>
								</c>
								<c ca="left">
									<p>1,744,733</p>
								</c>
								<c ca="left">
									<p>1,289,427</p>
								</c>
								<c ca="left">
									<p>2,484,854</p>
								</c>
								<c ca="left">
									<p>1,340,260</p>
								</c>
								<c ca="left">
									<p>2,033,863</p>
								</c>
								<c ca="left">
									<p>3,541,087</p>
								</c>
								<c ca="left">
									<p>6,017,854</p>
								</c>
								<c ca="left">
									<p>6,635,123</p>
								</c>
								<c ca="left">
									<p>9,262,241</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Fraction positives</p>
								</c>
								<c ca="left">
									<p>3.6%</p>
								</c>
								<c ca="left">
									<p>2.3%</p>
								</c>
								<c ca="left">
									<p>1.4%</p>
								</c>
								<c ca="left">
									<p>0.7%</p>
								</c>
								<c ca="left">
									<p>3.6%</p>
								</c>
								<c ca="left">
									<p>2.3%</p>
								</c>
								<c ca="left">
									<p>2.4%</p>
								</c>
								<c ca="left">
									<p>1.5%</p>
								</c>
								<c ca="left">
									<p>1.5%</p>
								</c>
								<c ca="left">
									<p>1.1%</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Evaluation total</p>
								</c>
								<c ca="left">
									<p>371,897</p>
								</c>
								<c ca="left">
									<p>588,088</p>
								</c>
								<c ca="left">
									<p>425,287</p>
								</c>
								<c ca="left">
									<p>820,172</p>
								</c>
								<c ca="left">
									<p>448,924</p>
								</c>
								<c ca="left">
									<p>680,998</p>
								</c>
								<c ca="left">
									<p>3,892,454</p>
								</c>
								<c ca="left">
									<p>10,820,985</p>
								</c>
								<c ca="left">
									<p>10,820,985</p>
								</c>
								<c ca="left">
									<p>15,201,348</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Fraction positives</p>
								</c>
								<c ca="left">
									<p>3.6%</p>
								</c>
								<c ca="left">
									<p>2.3%</p>
								</c>
								<c ca="left">
									<p>1.4%</p>
								</c>
								<c ca="left">
									<p>0.7%</p>
								</c>
								<c ca="left">
									<p>3.6%</p>
								</c>
								<c ca="left">
									<p>2.3%</p>
								</c>
								<c ca="left">
									<p>0.7%</p>
								</c>
								<c ca="left">
									<p>0.4%</p>
								</c>
								<c ca="left">
									<p>0.3%</p>
								</c>
								<c ca="left">
									<p>0.2%</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Testing total</p>
								</c>
								<c ca="left">
									<p>364,967</p>
								</c>
								<c ca="left">
									<p>578,621</p>
								</c>
								<c ca="left">
									<p>441,686</p>
								</c>
								<c ca="left">
									<p>851,539</p>
								</c>
								<c ca="left">
									<p>445,585</p>
								</c>
								<c ca="left">
									<p>673,732</p>
								</c>
								<c ca="left">
									<p>3,998,521</p>
								</c>
								<c ca="left">
									<p>11,011,875</p>
								</c>
								<c ca="left">
									<p>11,011,875</p>
								</c>
								<c ca="left">
									<p>15,369,748</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Fraction positives</p>
								</c>
								<c ca="left">
									<p>3.6%</p>
								</c>
								<c ca="left">
									<p>2.3%</p>
								</c>
								<c ca="left">
									<p>1.4%</p>
								</c>
								<c ca="left">
									<p>0.7%</p>
								</c>
								<c ca="left">
									<p>3.5%</p>
								</c>
								<c ca="left">
									<p>2.3%</p>
								</c>
								<c ca="left">
									<p>0.7%</p>
								</c>
								<c ca="left">
									<p>0.4%</p>
								</c>
								<c ca="left">
									<p>0.3%</p>
								</c>
								<c ca="left">
									<p>0.2%</p>
								</c>
							</r>
						</tblbdy>
						<tblfn>
							<p>The sequence length in all sets is 141 nt; for acceptor splice sequences the consensus dimer AG is at position 61, and for donor sequences the consensus dimer GT/GC is at position 81. The negative examples in the training sets of fish and human were sub-sampled by a factor of three and five, respectively.</p>
						</tblfn>
					</tbl>
					<fig id="F1">
						<title>
							<p>Figure 1</p>
						</title>
						<caption>
							<p>Comparison of classification performance of the weighted degree kernel based SVM classifier (WD) with the Markov chain based classifier (MC) on a subset of our <it>C. elegans Donor </it>and <it>Human Donor </it>data sets for sequences of varying length</p>
						</caption>
						<text>
							<p>Comparison of classification performance of the weighted degree kernel based SVM classifier (WD) with the Markov chain based classifier (MC) on a subset of our <it>C. elegans Donor </it>and <it>Human Donor </it>data sets for sequences of varying length. For each length, we performed a full model selection on the training data in order to choose the best model. The performance on the test sets, measured through area under the Precision Recall Curve (auPRC), is displayed in percent.</p>
						</text>
						<graphic file="1471-2105-8-S10-S7-1"/>
					</fig>
					<p>The results do not support our first hypothesis that the test data sets are enriched with MC predictions. However, the results confirm our second hypothesis that the poor performance of the kernel methods on the NN269 and DGSplicer donor tasks is due to the shortness of sequences. We also conclude that discriminative information between true and decoy donor sequences lies not only in the close vicinity of the splice site but also further away (see also the illustrations using <it>k</it>-mer scoring matrices below). Therefore, the careful choice of features is crucial for building accurate splice site detectors and if an appropriate window size is chosen, the WD kernel based SVM classifiers easily outperform previously proposed methods.</p>
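					<p>The window-shortening protocol of the second experiment can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function names are hypothetical, and trimming an equal number of nucleotides from both ends is an assumption, since the text only states that sequences were shortened from 218 nt to 18 nt in steps of 10 nt.</p>

```python
def window_lengths(longest=218, shortest=18, step=10):
    """Sequence lengths from `longest` down to `shortest`, inclusive."""
    return list(range(longest, shortest - 1, -step))


def crop_centered(seq, length):
    """Trim the same number of symbols from both ends of `seq`.

    Symmetric trimming is an assumption made for this sketch; it keeps
    the centre of the window (and hence the splice site) in place.
    """
    excess = len(seq) - length
    if 0 > excess or excess % 2 != 0:
        raise ValueError("length must not exceed len(seq) and must differ by an even amount")
    left = excess // 2
    return seq[left:left + length]


if __name__ == "__main__":
    # Toy 218 nt sequence with a recognisable 18 nt core in the middle.
    core = "CCCCCCCCGTAAAAAAAA"
    toy = "A" * 100 + core + "T" * 100
    for n in window_lengths():
        print(n, len(crop_centered(toy, n)))
```

					<p>For each length in the resulting list, one would retrain and re-evaluate both classifiers on the cropped training and test sequences.</p>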
				</sec>
				<sec>
					<st>
						<p>Comparison with SpliceMachine for cress and human</p>
					</st>
					<p>In this section we compare SpliceMachine <abbrgrp>
							<abbr bid="B29">29</abbr>
						</abbrgrp> with the WD kernel based SVMs. SpliceMachine <abbrgrp>
							<abbr bid="B46">46</abbr>
						</abbrgrp> is the current state-of-the-art splice site detector. It is based on a linear SVM and outperforms the freely available GeneSplicer <abbrgrp>
							<abbr bid="B47">47</abbr>
							<abbr bid="B12">12</abbr>
						</abbrgrp> by a large margin <abbrgrp>
							<abbr bid="B29">29</abbr>
						</abbrgrp>. We therefore perform an extended comparison of our methods to SpliceMachine on subsets of the genome-wide datasets (cf. the results and methods sections). One fifth (cress) and one twenty-fifth (human) of the data set were used each for training and for independent testing. We downloaded the SpliceMachine feature extractor <abbrgrp>
							<abbr bid="B48">48</abbr>
						</abbrgrp> to generate train and test data sets. Similar to the WD kernel, SpliceMachine makes use of positional information around the splice site. As it explicitly extracts these features it is however limited to a low order context (small <it>d</it>). In addition, SpliceMachine explicitly models coding-frame specific compositional context using tri- to hexamers. Note that this compositional information is also available to a gene finding system for which we are targeting our splicing detector. Therefore, in order to avoid redundancy, compositional information should ideally not be used to detect the <it>splicing signal</it>. Nevertheless, for comparative evaluation of the potential of our method, we augment our WD kernel based methods with 6 spectrum kernels <abbrgrp>
							<abbr bid="B49">49</abbr>
						</abbrgrp> (orders 3, 4 and 5, each up- and downstream of the splice site) and use the same large window sizes that were found to be optimal in <abbrgrp>
							<abbr bid="B29">29</abbr>
						</abbrgrp>. These are [-85, +86] for cress acceptor and [-87, +124] for cress donor sites, and [-105, +146] for human acceptor and [-107, +104] for human donor sites. For the WD kernel based SVMs, we fixed the model parameters <it>C </it>= 1 and <it>d </it>= 22, and for WDS we additionally fixed the shift parameter <it>&#963; </it>= 0.5. For SpliceMachine we performed an extensive model selection over <it>C </it>&#8712; {10<sup>0</sup>, 10<sup>-1</sup>, 10<sup>-2</sup>, 10<sup>-3</sup>, 5&#183;10<sup>-4</sup>, 10<sup>-4</sup>, 10<sup>-5</sup>, 10<sup>-6</sup>, 10<sup>-7</sup>, 10<sup>-8</sup>} and found <it>C </it>= 10<sup>-3 </sup>to be consistently optimal. Using these parameter settings we trained SVMs a) on the SpliceMachine features (SM), b) using the WD kernel (WD), c) using the WD kernel augmented by the 6 spectrum kernels (WDSP), d) using the WDS kernel (WDS) and e) using the WDS and spectrum kernels (WDSSP). Table <tblr tid="T3">3</tblr> shows the area under the ROC and precision recall curves obtained in this comparison. Note that SpliceMachine always outperforms the WD kernel, but is in most cases inferior to the WDS kernel. Furthermore, complementing the WD and WDS kernels with spectrum kernels (methods WDSP and WDSSP) always improves the auPRC over the corresponding kernel alone, and WDSSP exceeds the auPRC of SpliceMachine in all four tasks. As this work is targeted at producing a splicing signal detector to be used in a gene finder, we will omit compositional information in the following genome-wide evaluations. To be fair, one can note that a WDS kernel using a very large shift is able to capture compositional information, and the same holds to some extent for the WD kernel when it has seen many training examples. It is therefore impossible to draw strong conclusions on whether window size and (ab)use of compositional features will prove beneficial when the splice site predictor is used as a module in a gene finder, which we hope is enabled by our work providing genome-wide predictions.</p>
					<tbl id="T3">
						<title>
							<p>Table 3</p>
						</title>
						<caption>
							<p>Performance evaluation (auROC and auPRC scores) of five different methods on a subset of the genome-wide cress and human datasets.</p>
						</caption>
						<tblbdy cols="6">
							<r>
								<c>
									<p/>
								</c>
								<c ca="center">
									<p>SM</p>
								</c>
								<c ca="center">
									<p>WD</p>
								</c>
								<c ca="center">
									<p>WDSP</p>
								</c>
								<c ca="center">
									<p>WDS</p>
								</c>
								<c ca="center">
									<p>WDSSP</p>
								</c>
							</r>
							<r>
								<c cspan="6">
									<hr/>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>
										<b>Cress</b>
									</p>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<b>Acceptor</b>
									</p>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<it>auROC</it>
									</p>
								</c>
								<c ca="center">
									<p>99.41</p>
								</c>
								<c ca="center">
									<p>98.97</p>
								</c>
								<c ca="center">
									<p>99.36</p>
								</c>
								<c ca="center">
									<p>99.43</p>
								</c>
								<c ca="center">
									<p>99.43</p>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<it>auPRC</it>
									</p>
								</c>
								<c ca="center">
									<p>91.76</p>
								</c>
								<c ca="center">
									<p>84.24</p>
								</c>
								<c ca="center">
									<p>90.64</p>
								</c>
								<c ca="center">
									<p>92.01</p>
								</c>
								<c ca="center">
									<p>92.09</p>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<b>Donor</b>
									</p>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<it>auROC</it>
									</p>
								</c>
								<c ca="center">
									<p>99.59</p>
								</c>
								<c ca="center">
									<p>99.38</p>
								</c>
								<c ca="center">
									<p>99.58</p>
								</c>
								<c ca="center">
									<p>99.61</p>
								</c>
								<c ca="center">
									<p>99.61</p>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<it>auPRC</it>
									</p>
								</c>
								<c ca="center">
									<p>93.34</p>
								</c>
								<c ca="center">
									<p>88.62</p>
								</c>
								<c ca="center">
									<p>93.42</p>
								</c>
								<c ca="center">
									<p>93.68</p>
								</c>
								<c ca="center">
									<p>93.87</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>
										<b>Human</b>
									</p>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<b>Acceptor</b>
									</p>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<it>auROC</it>
									</p>
								</c>
								<c ca="center">
									<p>97.72</p>
								</c>
								<c ca="center">
									<p>97.34</p>
								</c>
								<c ca="center">
									<p>97.71</p>
								</c>
								<c ca="center">
									<p>97.73</p>
								</c>
								<c ca="center">
									<p>97.82</p>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<it>auPRC</it>
									</p>
								</c>
								<c ca="center">
									<p>50.39</p>
								</c>
								<c ca="center">
									<p>42.77</p>
								</c>
								<c ca="center">
									<p>50.48</p>
								</c>
								<c ca="center">
									<p>51.78</p>
								</c>
								<c ca="center">
									<p>54.12</p>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<b>Donor</b>
									</p>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<it>auROC</it>
									</p>
								</c>
								<c ca="center">
									<p>98.44</p>
								</c>
								<c ca="center">
									<p>98.36</p>
								</c>
								<c ca="center">
									<p>98.36</p>
								</c>
								<c ca="center">
									<p>98.51</p>
								</c>
								<c ca="center">
									<p>98.37</p>
								</c>
							</r>
							<r>
								<c indent="1" ca="left">
									<p>
										<it>auPRC</it>
									</p>
								</c>
								<c ca="center">
									<p>53.29</p>
								</c>
								<c ca="center">
									<p>46.53</p>
								</c>
								<c ca="center">
									<p>54.06</p>
								</c>
								<c ca="center">
									<p>53.08</p>
								</c>
								<c ca="center">
									<p>54.69</p>
								</c>
							</r>
						</tblbdy>
						<tblfn>
							<p>The methods compared are SpliceMachine (SM), the weighted degree kernel (WD), the weighted degree kernel complemented with six spectrum kernels (WDSP), the weighted degree kernel with shifts (WDS), and the weighted degree kernel with shifts complemented by six spectrum kernels (WDSSP).</p>
						</tblfn>
					</tbl>
				</sec>
				<sec>
					<st>
						<p>Performance for varying data size</p>
					</st>
					<p>Figure <figr fid="F2">2</figr> shows the prediction performance in terms of the auROC and auPRC of MCs and of SVMs using the WD kernel on the human acceptor and donor splice data that we generated for this work (see the methods section) for varying training set sizes. For training we use up to 80% of all examples and the remaining examples for testing. MCs and SVMs were trained on sets of size varying between 1000 and 8.5 million examples. Here we sub-sampled the negative examples by a factor of five. We observe that the performance steadily increases when using more data for training. For SVMs, over a wide range, the auPRC increases by about 5% (absolute) when the amount of data is multiplied by a factor of 2.7. In the last step, when increasing from 3.3 million to 8.5 million examples, the gain is slightly smaller (3.2&#8211;3.5%), indicating the start of a plateau. Similarly, MCs improve with growing training set sizes. As MCs are computationally much less demanding, we performed a full model selection over the model order and pseudo counts for each training set size. For the WD-SVM the parameters were fixed to the ones found optimal in the results section. Nevertheless, MCs consistently performed worse than WD-SVMs. We may conclude that one should train using all available data to obtain the best results. If this is infeasible, we suggest sub-sampling only the negative examples in the training set until training becomes computationally tractable. The class distribution in the test set, however, should never be changed unless this is explicitly taken into account in the evaluation.</p>
					<fig id="F2">
						<title>
							<p>Figure 2</p>
						</title>
						<caption>
							<p>Comparison of the classification performance of the weighted degree kernel based SVM classifier (SVM) with the Markov chain based classifier (MC) for different training set sizes</p>
						</caption>
						<text>
							<p>Comparison of the classification performance of the weighted degree kernel based SVM classifier (SVM) with the Markov chain based classifier (MC) for different training set sizes. The area under the Precision Recall Curve (auPRC; left) and the area under the Receiver Operating Characteristic curve (auROC; middle) are displayed in percent. On the right, the CPU time in seconds needed to train the models is shown.</p>
						</text>
						<graphic file="1471-2105-8-S10-S7-2"/>
					</fig>
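					<p>The recommendation to sub-sample only the negative training examples can be illustrated with a short sketch. The function name, the (sequence, label) tuple layout and the fixed random seed are assumptions made for this example and do not reflect the original implementation.</p>

```python
import random


def subsample_negatives(examples, factor, seed=0):
    """Keep every positive example and 1/factor of the negatives.

    `examples` is a list of (sequence, label) pairs with label +1 for
    true splice sites and -1 for decoys (a hypothetical layout).
    """
    rng = random.Random(seed)
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] != 1]
    keep = len(negatives) // factor
    # Sample without replacement so that no decoy is duplicated.
    return positives + rng.sample(negatives, keep)


if __name__ == "__main__":
    toy = [("ACGT", 1)] * 10 + [("TTTT", -1)] * 100
    train = subsample_negatives(toy, factor=5)
    print(len(train))  # 10 positives plus 20 sampled negatives
```

					<p>Only the training set would be passed through such a function; the test set keeps its original class ratio.</p>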
				</sec>
			</sec>
			<sec>
				<st>
					<p>Results on genome-wide data sets</p>
				</st>
				<p>Based on our preliminary studies, we now proceeded to design and train the genome-wide predictors. We first generated new <it>genome-wide </it>data sets for our five model organisms: worm, fly, cress, fish, and human. As our large-scale learning methods allow us to use millions of training examples, we included all available EST information from the commonly used databases. Since the reliability of the true and decoy splice sequences is crucial for a successful training and tuning, these data sets were produced with particular care; the details can be found in the Methods section. We arrived at training data sets of considerable size containing sequences of sufficient length (see Table <tblr tid="T2">2</tblr>). For fish and human the training datasets were sub-sampled to include only 1/3 and 1/5 of the negative examples, leading to a maximal training set size of 9 million sequences for human donor sites.</p>
				<p>For a subsequent use in a gene finder system we aimed at producing unbiased predictions for <it>all </it>candidate splice sites, i.e. for all occurrences of the GT/GC and AG consensus dimer. For a proper model selection and in order to obtain unbiased predictions on the <it>whole </it>genome we employed nested five-fold cross-validation. We additionally estimated posterior probabilities in order to obtain interpretable and comparable scores for the outputs of the different SVM classifiers (see Methods for details). The results summarized in Table <tblr tid="T4">4</tblr> are averaged values with standard deviation over the five different test partitions.</p>
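				<p>The partitioning behind a nested five-fold cross-validation can be sketched as below. This is an illustration under stated assumptions: in each round one partition is held out for testing, one is used for model selection and the remaining three for training, and examples are assigned to partitions in an interleaved fashion. The actual assignment of examples to partitions is described in the Methods section and may differ; the function name is hypothetical.</p>

```python
def nested_cv_splits(n_examples, n_folds=5):
    """Yield (train, validation, test) index lists, one triple per fold.

    Assumed scheme: fold i is the test partition, fold i+1 (cyclically)
    is the validation partition, and the remaining folds form the
    training set. Every example is tested exactly once.
    """
    folds = [list(range(i, n_examples, n_folds)) for i in range(n_folds)]
    for test_idx in range(n_folds):
        val_idx = (test_idx + 1) % n_folds
        train = [
            j
            for k in range(n_folds)
            if k not in (test_idx, val_idx)
            for j in folds[k]
        ]
        yield train, folds[val_idx], folds[test_idx]


if __name__ == "__main__":
    for train, val, test in nested_cv_splits(20):
        print(len(train), len(val), len(test))
```

				<p>Because each example appears in exactly one test partition, predictions collected over the five rounds cover the whole data set without any example being scored by a model that was trained or tuned on it.</p>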
				<tbl id="T4">
					<title>
						<p>Table 4</p>
					</title>
					<caption>
						<p>Performance evaluation on the genome-wide data sets for worm, fly, cress, fish, and human.</p>
					</caption>
					<tblbdy cols="11">
						<r>
							<c>
								<p/>
							</c>
							<c cspan="2" ca="center">
								<p>
									<b>Worm</b>
								</p>
							</c>
							<c cspan="2" ca="center">
								<p>
									<b>Fly</b>
								</p>
							</c>
							<c cspan="2" ca="center">
								<p>
									<b>Cress</b>
								</p>
							</c>
							<c cspan="2" ca="center">
								<p>
									<b>Fish</b>
								</p>
							</c>
							<c cspan="2" ca="center">
								<p>
									<b>Human</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="right">
								<p>
									<b>Acc</b>
								</p>
							</c>
							<c ca="right">
								<p>
									<b>Don</b>
								</p>
							</c>
							<c ca="right">
								<p>
									<b>Acc</b>
								</p>
							</c>
							<c ca="right">
								<p>
									<b>Don</b>
								</p>
							</c>
							<c ca="right">
								<p>
									<b>Acc</b>
								</p>
							</c>
							<c ca="right">
								<p>
									<b>Don</b>
								</p>
							</c>
							<c ca="right">
								<p>
									<b>Acc</b>
								</p>
							</c>
							<c ca="right">
								<p>
									<b>Don</b>
								</p>
							</c>
							<c ca="right">
								<p>
									<b>Acc</b>
								</p>
							</c>
							<c ca="right">
								<p>
									<b>Don</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="11">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="right">
								<p>
									<b>MC</b>
								</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="right">
								<p>auROC(%)</p>
							</c>
							<c ca="right">
								<p>99.62 &#177; 0.03</p>
							</c>
							<c ca="right">
								<p>99.55 &#177; 0.02</p>
							</c>
							<c ca="right">
								<p>98.78 &#177; 0.10</p>
							</c>
							<c ca="right">
								<p>99.12 &#177; 0.05</p>
							</c>
							<c ca="right">
								<p>99.12 &#177; 0.03</p>
							</c>
							<c ca="right">
								<p>99.44 &#177; 0.02</p>
							</c>
							<c ca="right">
								<p>98.98 &#177; 0.03</p>
							</c>
							<c ca="right">
								<p>99.19 &#177; 0.05</p>
							</c>
							<c ca="right">
								<p>96.03 &#177; 0.09</p>
							</c>
							<c ca="right">
								<p>97.78 &#177; 0.05</p>
							</c>
						</r>
						<r>
							<c ca="right">
								<p>auPRC(%)</p>
							</c>
							<c ca="right">
								<p>92.09 &#177; 0.28</p>
							</c>
							<c ca="right">
								<p>89.98 &#177; 0.20</p>
							</c>
							<c ca="right">
								<p>80.27 &#177; 0.76</p>
							</c>
							<c ca="right">
								<p>78.47 &#177; 0.63</p>
							</c>
							<c ca="right">
								<p>87.43 &#177; 0.28</p>
							</c>
							<c ca="right">
								<p>88.23 &#177; 0.34</p>
							</c>
							<c ca="right">
								<p>63.59 &#177; 0.72</p>
							</c>
							<c ca="right">
								<p>62.91 &#177; 0.57</p>
							</c>
							<c ca="right">
								<p>16.20 &#177; 0.22</p>
							</c>
							<c ca="right">
								<p>24.98 &#177; 0.30</p>
							</c>
						</r>
						<r>
							<c cspan="11">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="right">
								<p>
									<b>WD</b>
								</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="right">
								<p>auROC(%)</p>
							</c>
							<c ca="right">
								<p>99.77 &#177; 0.02</p>
							</c>
							<c ca="right">
								<p>99.82 &#177; 0.01</p>
							</c>
							<c ca="right">
								<p>99.02 &#177; 0.09</p>
							</c>
							<c ca="right">
								<p>99.49 &#177; 0.05</p>
							</c>
							<c ca="right">
								<p>99.37 &#177; 0.02</p>
							</c>
							<c ca="right">
								<p>99.66 &#177; 0.02</p>
							</c>
							<c ca="right">
								<p>99.36 &#177; 0.04</p>
							</c>
							<c ca="right">
								<p>99.60 &#177; 0.04</p>
							</c>
							<c ca="right">
								<p>97.76 &#177; 0.06</p>
							</c>
							<c ca="right">
								<p>98.59 &#177; 0.05</p>
							</c>
						</r>
						<r>
							<c ca="right">
								<p>auPRC(%)</p>
							</c>
							<c ca="right">
								<p>95.20 &#177; 0.29</p>
							</c>
							<c ca="right">
								<p>95.34 &#177; 0.10</p>
							</c>
							<c ca="right">
								<p>84.80 &#177; 0.35</p>
							</c>
							<c ca="right">
								<p>86.42 &#177; 0.60</p>
							</c>
							<c ca="right">
								<p>91.06 &#177; 0.15</p>
							</c>
							<c ca="right">
								<p>92.21 &#177; 0.17</p>
							</c>
							<c ca="right">
								<p>85.33 &#177; 0.38</p>
							</c>
							<c ca="right">
								<p>85.80 &#177; 0.46</p>
							</c>
							<c ca="right">
								<p>52.07 &#177; 0.33</p>
							</c>
							<c ca="right">
								<p>54.62 &#177; 0.54</p>
							</c>
						</r>
						<r>
							<c cspan="11">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="right">
								<p>
									<b>WDS</b>
								</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="right">
								<p>auROC(%)</p>
							</c>
							<c ca="right">
								<p>99.80 &#177; 0.02</p>
							</c>
							<c ca="right">
								<p>99.82 &#177; 0.01</p>
							</c>
							<c ca="right">
								<p>99.12 &#177; 0.09</p>
							</c>
							<c ca="right">
								<p>99.51 &#177; 0.05</p>
							</c>
							<c ca="right">
								<p>99.43 &#177; 0.02</p>
							</c>
							<c ca="right">
								<p>99.68 &#177; 0.02</p>
							</c>
							<c ca="right">
								<p>99.38 &#177; 0.04</p>
							</c>
							<c ca="right">
								<p>99.61 &#177; 0.04</p>
							</c>
							<c ca="right">
								<p>97.86 &#177; 0.05</p>
							</c>
							<c ca="right">
								<p>98.63 &#177; 0.05</p>
							</c>
						</r>
						<r>
							<c ca="right">
								<p>auPRC(%)</p>
							</c>
							<c ca="right">
								<p>95.89 &#177; 0.26</p>
							</c>
							<c ca="right">
								<p>95.34 &#177; 0.10</p>
							</c>
							<c ca="right">
								<p>86.67 &#177; 0.35</p>
							</c>
							<c ca="right">
								<p>87.47 &#177; 0.54</p>
							</c>
							<c ca="right">
								<p>92.16 &#177; 0.17</p>
							</c>
							<c ca="right">
								<p>92.88 &#177; 0.15</p>
							</c>
							<c ca="right">
								<p>86.58 &#177; 0.33</p>
							</c>
							<c ca="right">
								<p>86.94 &#177; 0.44</p>
							</c>
							<c ca="right">
								<p>54.42 &#177; 0.38</p>
							</c>
							<c ca="right">
								<p>56.54 &#177; 0.57</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>Displayed are auROC and auPRC scores for acceptor and donor recognition tasks as achieved by the MC method and two support vector machine approaches, one with the weighted degree kernel (WD) and one with the weighted degree kernel with shifts (WDS).</p>
					</tblfn>
				</tbl>
				<p>Confirming our evaluations in the pilot studies, kernel methods outperform the MC methods in all ten classification tasks. Figure <figr fid="F3">3</figr> displays the precision recall curves for all five organisms comparatively, and Table <tblr tid="T4">4</tblr> lists the corresponding auPRC scores. For worm, fly and cress the improvement in performance of the SVM over the MC lies in a similar range of 4&#8211;10% (absolute), both for donor and for acceptor tasks. However, for fish and especially for human the performance gain is considerably higher. For human, MCs only achieve 16% and 25% auPRC scores, whereas WDS reaches 54% and 57% for acceptor and donor recognition, respectively. The severe decrease in performance from worm to human for all classification methods in the auPRC score can partially be explained by the different fractions of positive examples observed in the test set. However, a weaker decline can also be observed in the auROC scores (also Table <tblr tid="T4">4</tblr>), which are independent of the class skew (e.g. for acceptor sites from 99.6% on worm to 96.0% on human for MC, and from 99.8% to 97.9% for WDS). The classification task on the human genome seems to be a considerably more difficult problem than the same one on the worm genome. We may speculate that this can be partially explained by a higher incidence of alternative splicing in the human genome. These sites usually exhibit weaker consensus sequences and are therefore more difficult to detect. Additionally, they often lead to mislabeled examples in the training and testing sets. Finally, it might also be due to the protocol used for aligning the sequences, which may generate more false splice sites in human than in other organisms. This hypothesis is supported by the fact that the performance significantly increases if one only considers cDNA confirmed genes (data not shown).</p>
				<fig id="F3">
					<title>
						<p>Figure 3</p>
					</title>
					<caption>
						<p>Precision Recall Curve for the three methods MC, WD, WDS estimated on the genome-wide data sets for worm, fly, cress, fish, and human in a nested cross-validation scheme</p>
					</caption>
					<text>
						<p>Precision Recall Curve for the three methods MC, WD, WDS estimated on the genome-wide data sets for worm, fly, cress, fish, and human in a nested cross-validation scheme. In contrast to the ROC, the random guess in this plot corresponds to a horizontal line that depends on the fraction of positive examples in the test set (e.g. 2% and 3% in the case of the worm acceptor and donor data sets, respectively).</p>
					</text>
					<graphic file="1471-2105-8-S10-S7-3"/>
				</fig>
			</sec>
			<sec>
				<st>
					<p>Analysis of the learning result</p>
				</st>
				<p>One of the problems with kernel methods compared to probabilistic methods, such as Position Specific Scoring Matrices <abbrgrp>
						<abbr bid="B50">50</abbr>
					</abbrgrp> or Interpolated Markov Models <abbrgrp>
						<abbr bid="B11">11</abbr>
					</abbrgrp>, is that the resulting decision function is hard to interpret and, hence, difficult to use in order to extract relevant biological knowledge from it (see also <abbrgrp>
						<abbr bid="B51">51</abbr>
						<abbr bid="B52">52</abbr>
						<abbr bid="B53">53</abbr>
					</abbrgrp>). Here, we propose to use <it>k</it>-mer scoring matrices <abbrgrp>
						<abbr bid="B3">3</abbr>
						<abbr bid="B54">54</abbr>
					</abbrgrp> to visualize the contribution of all (<it>k</it>-mer, sequence position) pairs to the final decision function of the SVM with WD-Kernel (cf. Methods section). We obtain a graphical representation from which it is possible to judge where in the sequence which substring lengths are of importance.</p>
				<p>We plotted the <it>k</it>-mer scoring matrices corresponding to our trained models for the organisms comparatively in Figure <figr fid="F4">4</figr>, which shows the relative importance of substrings of a certain length for each position in the classified sequences. We can make a few interesting observations: For worm, fly, and potentially also cress there is a rather strong signal about 40&#8211;60 nt downstream of the donor and 40&#8211;60 nt upstream of the acceptor splice sites. These two signals are related to each other, since introns in these organisms are often only 50 nt long. Additionally, we find the region 20&#8211;30 nt upstream of the acceptor splice site of importance, which is very likely related to the branch point. In human it is typically located 20&#8211;50 nt upstream and exhibits the consensus CU(A/G)A(C/U), which matches the lengths of important <it>k</it>-mers in that region for human <abbrgrp>
						<abbr bid="B55">55</abbr>
					</abbrgrp>. In worms, the branch point consensus seems shorter (3&#8211;4 nt) &#8211; confirming previous reports that the branch point is much weaker in worms. In fly and cress the branch point seems rather long (5&#8211;6 nt) and important for recognition of the splice site. Finally, note that the exon sequence carries a lot of discriminative information. The <it>k</it>-mers of most importance are of length three, relating to the coding potential of exons. Additionally, the periodicity observed for instance in cress is due to the reading frame. On the supplementary website we also provide a list of most discriminative <it>k</it>-mers for the two splice site recognition tasks.</p>
				<fig id="F4">
					<title>
						<p>Figure 4</p>
					</title>
					<caption>
						<p>
							<it>k</it>-mer scoring matrices comparatively for worm, fly, cress, fish, and human</p>
					</caption>
					<text>
						<p>
							<it>k</it>-mer scoring matrices comparatively for worm, fly, cress, fish, and human. They depict the maximal position-wise contribution of all <it>k</it>-mers up to order 8 to the decision of the trained kernel classifiers, transformed into percentile values (cf. the section on interpreting the SVM classifier). Red values are highest contributions, blue lowest. Position 1 denotes the splice site and the start of the consensus dimer.</p>
					</text>
					<graphic file="1471-2105-8-S10-S7-4"/>
				</fig>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Conclusion</p>
			</st>
			<p>In this work we have evaluated several approaches for the recognition of splice sites in worm, fly, cress, fish, and human. In a first step we compared MCs, a Bayesian method (EBN) and SVM based methods using several kernels on existing data sets generated from the human genome. We considered the kernel used in <abbrgrp>
					<abbr bid="B32">32</abbr>
				</abbrgrp> based on MCs, the locality improved kernel <abbrgrp>
					<abbr bid="B21">21</abbr>
				</abbrgrp> and two variants of the weighted degree kernel <abbrgrp>
					<abbr bid="B28">28</abbr>
					<abbr bid="B34">34</abbr>
				</abbrgrp>. We found that these existing data sets have limitations in that the sequences used for training and evaluation turn out to be too short for optimal discrimination performance. For SVMs we showed that they are able to exploit &#8211; albeit presumably weak &#8211; features as far as 80 nt away from the splice sites. In a comparison to SpliceMachine we could show that our approach performs favorably when complemented with compositional information. Using the protocol proposed in <abbrgrp>
					<abbr bid="B3">3</abbr>
				</abbrgrp>, we therefore generated new data sets for the five organisms. These data sets contain sufficiently long sequences and for human as many as 9 million training examples. Based on our previous work on large scale kernel learning <abbrgrp>
					<abbr bid="B40">40</abbr>
				</abbrgrp>, we were also able to train SVM classifiers on these rather large data sets. Moreover, we illustrated that the large amount of training data is indeed beneficial for significantly improving the SVM prediction performance, while MCs do not improve significantly when given many more training examples. We therefore encourage using as many examples for training as feasible to obtain the best generalization results.</p>
			<p>For worm, fly and cress we were able to improve the performance by 4%&#8211;10% (absolute) compared to MCs. The biggest difference between the methods is observed for the most difficult task: acceptor and donor recognition on human DNA. The MCs reach only 16% and 25% auPRC, while SVMs achieve 54% and 57%, respectively. The drastic differences between organisms in the prediction performance scores can be understood as a consequence of the smaller fraction of positive examples and a higher incidence of alternative splicing in the human genome compared to the other genomes. For further comparative studies we provide and discuss <it>k</it>-mer scoring matrices elucidating which features are important for discrimination.</p>
			<p>In order to facilitate the use of our classifiers for other studies, we provide whole genome predictions for the five organisms. Additionally, we offer an open-source stand-alone prediction tool allowing, for instance, the integration in other gene finder systems. The predictions, data sets and the stand-alone prediction tool are available for download on the supplementary website <url>http://www.fml.mpg.de/raetsch/projects/splice</url>.</p>
		</sec>
		<sec>
			<st>
				<p>Methods</p>
			</st>
			<sec>
				<st>
					<p>Data sets</p>
				</st>
				<sec>
					<st>
						<p>NN269 and DGSplicer data sets</p>
					</st>
					<p>For the pilot study we use the NN269 and the DGSplicer data sets originating from <abbrgrp>
							<abbr bid="B9">9</abbr>
						</abbrgrp> and <abbrgrp>
							<abbr bid="B32">32</abbr>
						</abbrgrp>, respectively. The data originates from <abbrgrp>
							<abbr bid="B56">56</abbr>
						</abbrgrp> and the training and test splits can be downloaded from <abbrgrp>
							<abbr bid="B46">46</abbr>
						</abbrgrp>. The data sets only include sequences with the canonical splice site dimers AG and GT. We use the same split for training and test sets as used in <abbrgrp>
							<abbr bid="B32">32</abbr>
						</abbrgrp>. A description of the properties of the data set is given in Table <tblr tid="T5">5</tblr>.</p>
					<tbl id="T5">
						<title>
							<p>Table 5</p>
						</title>
						<caption>
							<p>Characteristics of the NN269 and DGSplicer data sets containing true and decoy acceptor and donor splice sites derived from the human genome.</p>
						</caption>
						<tblbdy cols="5">
							<r>
								<c>
									<p/>
								</c>
								<c cspan="2" ca="center">
									<p>
										<b>NN269</b>
									</p>
								</c>
								<c cspan="2" ca="center">
									<p>
										<b>DGSplicer</b>
									</p>
								</c>
							</r>
							<r>
								<c>
									<p/>
								</c>
								<c ca="left">
									<p>
										<b>Acceptor</b>
									</p>
								</c>
								<c ca="left">
									<p>
										<b>Donor</b>
									</p>
								</c>
								<c ca="left">
									<p>
										<b>Acceptor</b>
									</p>
								</c>
								<c ca="left">
									<p>
										<b>Donor</b>
									</p>
								</c>
							</r>
							<r>
								<c cspan="5">
									<hr/>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>
										<b>Sequence length</b>
									</p>
								</c>
								<c ca="left">
									<p>90</p>
								</c>
								<c ca="left">
									<p>15</p>
								</c>
								<c ca="left">
									<p>36</p>
								</c>
								<c ca="left">
									<p>18</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>
										<b>Consensus positions</b>
									</p>
								</c>
								<c ca="left">
									<p>AG at 69</p>
								</c>
								<c ca="left">
									<p>GT at 8</p>
								</c>
								<c ca="left">
									<p>AG at 26</p>
								</c>
								<c ca="left">
									<p>GT at 10</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>
										<b>Train total</b>
									</p>
								</c>
								<c ca="left">
									<p>5788</p>
								</c>
								<c ca="left">
									<p>5256</p>
								</c>
								<c ca="left">
									<p>322156</p>
								</c>
								<c ca="left">
									<p>228268</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>
										<b>Fraction positives</b>
									</p>
								</c>
								<c ca="left">
									<p>19.3%</p>
								</c>
								<c ca="left">
									<p>21.2%</p>
								</c>
								<c ca="left">
									<p>0.6%</p>
								</c>
								<c ca="left">
									<p>0.8%</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>
										<b>Test total</b>
									</p>
								</c>
								<c ca="left">
									<p>1087</p>
								</c>
								<c ca="left">
									<p>990</p>
								</c>
								<c ca="left">
									<p>80539</p>
								</c>
								<c ca="left">
									<p>57067</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>
										<b>Fraction positives</b>
									</p>
								</c>
								<c ca="left">
									<p>19.4%</p>
								</c>
								<c ca="left">
									<p>21.0%</p>
								</c>
								<c ca="left">
									<p>0.6%</p>
								</c>
								<c ca="left">
									<p>0.8%</p>
								</c>
							</r>
						</tblbdy>
					</tbl>
				</sec>
				<sec>
					<st>
						<p>Worm, fly, cress, fish, and human</p>
					</st>
					<p>We collected all known ESTs from dbEST <abbrgrp>
							<abbr bid="B57">57</abbr>
						</abbrgrp> (as of February 28, 2007; 346,064 sequences for worm, 514,613 sequences for fly, 1,276,130 sequences for cress, 1,168,572 sequences for fish, and 7,915,689 sequences for human). We additionally used EST and cDNA sequences available from wormbase <abbrgrp>
							<abbr bid="B58">58</abbr>
						</abbrgrp> for worm, (file confirmed_genes.WS170) <abbrgrp>
							<abbr bid="B59">59</abbr>
						</abbrgrp> for fly, (files na_EST.dros and na_dbEST.same.dmel) <abbrgrp>
							<abbr bid="B60">60</abbr>
						</abbrgrp> for cress, (files cDNA_flanking_050524.txt and cDNA_full_reading_050524.txt) <abbrgrp>
							<abbr bid="B61">61</abbr>
						</abbrgrp> for fish, (file Danio_rerio.ZFISH6.43.cdna.known.?? and <abbrgrp>
							<abbr bid="B62">62</abbr>
						</abbrgrp> for fish and human (file dr_mgc_mrna.fasta for fish and hs_mgc_mrna.fasta for human). Using <it>blat </it>
						<abbrgrp>
							<abbr bid="B63">63</abbr>
						</abbrgrp> we aligned ESTs and cDNA sequences against the genomic DNA (releases WS170, dm5, ath1, zv6, and hg18, respectively). If the sequence could not be unambiguously matched, we only considered the best hit. The alignment was used to confirm exons and introns. We refined the alignment by correcting typical sequencing errors, for instance by removing minor insertions and deletions. If an intron did not exhibit the consensus GT/AG or GC/AG at the 5' and 3' ends, we tried to achieve this by shifting the boundaries up to two base pairs (bp). If this still did not lead to the consensus, then we split the sequence into two parts and considered each subsequence separately. Then, we merged alignments if they did not disagree and if they shared at least one complete exon or intron.</p>
					<p>In a next step, we clustered the alignments: In the beginning, each of the above EST and cDNA alignments was in a separate cluster. We iteratively joined clusters if any two sequences from distinct clusters matched the same genomic location (this includes many forms of alternative splicing).</p>
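					<p>The iterative joining of clusters can be sketched with a union-find structure. The following Python code is a minimal illustration only: it assumes that matching the same genomic location can be approximated by overlapping half-open intervals on the same chromosome, and the function name <it>cluster_alignments</it> as well as the tuple layout are hypothetical and do not reflect the actual implementation.</p>
					<p><![CDATA[
```python
from collections import defaultdict


def cluster_alignments(alignments):
    """Group alignments into clusters of overlapping genomic intervals.

    alignments: list of (chrom, start, end) tuples, half-open coordinates
    (an assumed, simplified representation of an EST/cDNA alignment).
    Returns a sorted list of clusters, each a sorted list of input indices.
    """
    parent = list(range(len(alignments)))

    def find(i):
        # Walk up to the cluster representative, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Sort by chromosome and start so that overlaps are found in one sweep.
    order = sorted(range(len(alignments)), key=lambda i: alignments[i])
    current = None  # (chrom, furthest end seen so far, index of first member)
    for i in order:
        chrom, start, end = alignments[i]
        if current is not None and current[0] == chrom and start < current[1]:
            # Overlaps the running cluster: join and extend its right boundary.
            parent[find(i)] = find(current[2])
            current = (chrom, max(current[1], end), current[2])
        else:
            # No overlap: this alignment starts a new cluster.
            current = (chrom, end, i)

    clusters = defaultdict(list)
    for i in range(len(alignments)):
        clusters[find(i)].append(i)
    return sorted(sorted(members) for members in clusters.values())
```
]]></p>
					<p>In this sketch, alignments that overlap only transitively (via a third alignment) end up in the same cluster, which mirrors the iterative joining described above.</p>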
					<p>From the clustered alignments we obtained a compact splicing graph representation <abbrgrp>
							<abbr bid="B64">64</abbr>
						</abbrgrp>, which can be easily used to generate a list of positions of true acceptor and donor splice sites. Within the boundaries of the alignments (we cut out 10 nt at both ends of the alignments to exclude potentially undetected splice sites), we identified all positions exhibiting the AG, GT or GC dimer and which were not in the list of confirmed splice sites. The lists of true and decoy splice site positions were used to extract the disjoint training, validation and test sets consisting of sequences in a window around these positions. Additionally, we divided the whole genome into regions, which are disjoint contiguous sequences containing at least two complete genes; if an adjacent gene is less than 250 base pairs away, we merge the adjacent genes into the region. Genes in the same region are also assigned to the same cross-validation split. The splitting was implemented by defining a linkage graph over the regions and by using single linkage clustering. The splits were defined by randomly assigning clusters of regions to the split.</p>
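					<p>The selection of decoy positions described above can be illustrated by the following Python sketch. It is a simplification under stated assumptions: positions are 0-based dimer start positions on the forward strand of a single aligned region, and the function name <it>decoy_positions</it> and its parameters are hypothetical.</p>
					<p><![CDATA[
```python
def decoy_positions(sequence, dimers, confirmed, margin=10):
    """Return start positions of the given dimers that are not confirmed sites.

    sequence:  genomic sequence covered by the alignments (string)
    dimers:    set of dimers to look for, e.g. {"AG"} or {"GT", "GC"}
    confirmed: iterable of 0-based dimer start positions of true splice sites
    margin:    number of nucleotides ignored at both ends of the region
    """
    confirmed = set(confirmed)
    decoys = []
    # The dimer must lie completely inside [margin, len(sequence) - margin).
    for pos in range(margin, len(sequence) - margin - 1):
        if sequence[pos:pos + 2] in dimers and pos not in confirmed:
            decoys.append(pos)
    return decoys
```
]]></p>
					<p>Calling the function once with the acceptor dimer AG and once with the donor dimers GT and GC yields the two decoy lists; the confirmed positions would come from the splicing graph.</p>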
				</sec>
			</sec>
			<sec>
				<st>
					<p>Model selection and evaluation</p>
				</st>
				<p>To be able to apply SVMs, we have to find the optimal soft margin parameter <it>C </it>
					<abbrgrp>
						<abbr bid="B18">18</abbr>
					</abbrgrp> and the kernel parameters. These are: For the LI-kernel, the degree <it>d </it>and window size <it>l</it>; for the WD kernel, the degree <it>d</it>; and for the WDS kernel, the degree <it>d </it>and the shift parameter <it>&#963; </it>(see the section on SVMs and kernels for details). For MCs we have to determine the order <it>d </it>of the Markov chain and the pseudocounts for the models of positive and the negative examples (see the posterior log-odds section). In order to tune these parameters we perform the cross-validation procedures described below.</p>
				<sec>
					<st>
						<p>NN269 and DGSplicer</p>
					</st>
					<p>The training and model selection of our methods for each of the four tasks was done separately by partial 10-fold cross-validation on the training data. For this, the training sets for each task are divided into 10 equally sized data splits, each containing the same number of splice sequences and the same proportion of true versus decoy sequences. For each parameter combination, we use only 3 out of the 10 folds; that is, we train 3 times, each time using 9 of the 10 training data splits and evaluating on the remaining one. Since the data is highly unbalanced, we choose the model with the highest average auPRC score on the three evaluation sets. This best model is then trained on the complete training data set. The final evaluation is done on the corresponding independent test sets (same as in <abbrgrp>
							<abbr bid="B32">32</abbr>
						</abbrgrp>). The supplementary website includes tables with all parameter combinations used in model selection for each task and the chosen parameters.</p>
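					<p>A minimal Python sketch of this partial cross-validation scheme is given below. The callback <it>train_and_score</it> is a hypothetical placeholder that is assumed to train a model on the given training indices and return a score (for instance the auPRC) on the held-out indices; the helper names and the random seed are likewise illustrative assumptions.</p>
					<p><![CDATA[
```python
import random


def stratified_folds(labels, n_folds=10, seed=0):
    """Split example indices into n_folds folds with near-equal class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(members)
        # Deal the shuffled examples of this class round-robin over the folds.
        for rank, index in enumerate(members):
            folds[rank % n_folds].append(index)
    return folds


def partial_cv_select(labels, param_grid, train_and_score, n_folds=10, n_used=3):
    """Return (best parameters, mean score) using only n_used of the n_folds folds."""
    folds = stratified_folds(labels, n_folds)
    best_params, best_score = None, float("-inf")
    for params in param_grid:
        scores = []
        for k in range(n_used):
            held_out = folds[k]
            # Train on the other n_folds - 1 folds, evaluate on fold k.
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            scores.append(train_and_score(params, train, held_out))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```
]]></p>
					<p>Evaluating only 3 of the 10 folds reduces the number of training runs per parameter combination from 10 to 3, while each run still uses 90% of the training data.</p>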
				</sec>
				<sec>
					<st>
						<p>Worm, fly, cress, fish, and human</p>
					</st>
					<p>The training and model selection of our methods for the five organisms on the acceptor and donor recognition tasks was done separately by 5-fold cross-validation. The optimal parameters were chosen by selecting the combination that maximized the auPRC score. This model selection method was nested within 5-fold cross-validation for final evaluation of the performance. The reported auROC and auPRC are averaged scores over the five cross-validation splits. The supplementary website includes tables with all considered parameter combinations and the chosen parameters for each task. All splits were based on the clusters derived from EST and cDNA alignments, such that different splits come from random draws of the genome.</p>
				</sec>
				<sec>
					<st>
						<p>Performance measures</p>
					</st>
					<p>The sensitivity is defined as the fraction of correctly classified positive examples among the total number of positive examples, i.e. it equals the true positive rate <it>TPR </it>= <it>TP</it>/(<it>TP </it>+ <it>FN</it>). Analogously, the fraction <it>FPR </it>= <it>FP</it>/(<it>TN </it>+ <it>FP</it>) of negative examples wrongly classified as positive is called the false positive rate. Plotting <it>TPR </it>against <it>FPR </it>results in the Receiver Operator Characteristic Curve (ROC) <abbrgrp>
							<abbr bid="B41">41</abbr>
							<abbr bid="B42">42</abbr>
						</abbrgrp>. Plotting the positive predictive value <it>PPV </it>= <it>TP</it>/(<it>FP </it>+ <it>TP</it>), i.e. the fraction of correct positive predictions among all positively predicted examples, against the <it>TPR</it>, one obtains the Precision Recall Curve (PRC) (see e.g., <abbrgrp>
							<abbr bid="B43">43</abbr>
						</abbrgrp>). The areas under the ROC and PRC are denoted by auROC and auPRC, respectively.</p>
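					<p>The definitions above can be made concrete with the following Python sketch, which computes one (FPR, TPR, PPV) triple per threshold and integrates both curves. Two assumptions are made that are not specified in the text: areas are obtained by trapezoidal integration (which is only an approximation for the PRC), and the PRC is extended to recall 0 with the precision of the highest-scoring threshold. All function names are illustrative, labels are +1 or -1, and both classes must be present.</p>
					<p><![CDATA[
```python
def curve_points(scores, labels):
    """Return (FPR, TPR, PPV) triples, one per distinct threshold, highest first."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = []
    tp = fp = 0
    for index, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example sharing this score.
        last_of_tie = index + 1 == len(ranked) or ranked[index + 1][0] != score
        if last_of_tie:
            # FPR = FP/(TN+FP), TPR = TP/(TP+FN), PPV = TP/(FP+TP)
            points.append((fp / n_neg, tp / n_pos, tp / (tp + fp)))
    return points


def area(xs, ys):
    """Trapezoidal area under the piecewise linear curve through (xs, ys)."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


def au_roc(scores, labels):
    """Area under the ROC curve (TPR plotted against FPR)."""
    points = curve_points(scores, labels)
    fpr = [0.0] + [p[0] for p in points]
    tpr = [0.0] + [p[1] for p in points]
    return area(fpr, tpr)


def au_prc(scores, labels):
    """Area under the precision recall curve (PPV plotted against TPR)."""
    points = curve_points(scores, labels)
    recall = [0.0] + [p[1] for p in points]
    precision = [points[0][2]] + [p[2] for p in points]
    return area(recall, precision)
```
]]></p>
					<p>Because the PPV depends on the ratio of positive to negative examples while TPR and FPR do not, the auPRC computed this way changes with the class skew of the test set, whereas the auROC does not.</p>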
				</sec>
				<sec>
					<st>
						<p>Estimation of posterior probabilities</p>
					</st>
					<p>In order to provide an interpretable and comparable confidence score of the SVM predictions, we estimated the conditional likelihood <it>P</it>(<it>y </it>= 1|<it>f</it>(<b>
							<it>x</it>
						</b>)) of the true label <it>y </it>being positive for a given SVM output value <it>f</it>(<b>
							<it>x</it>
						</b>). To do this, we applied a piecewise linear function which was determined on the validation set (the same used for the classifier model selection). We used the <it>N </it>= 50 quantiles taken on the SVM output values as supporting points <it>&#966;</it>
						<sub>
							<it>i</it>
						</sub>, <it>i </it>= 1,...,<it>N</it>. For convenience, denote <it>&#966;</it>
						<sub>0 </sub>= -&#8734;. For each point <it>&#966;</it>
						<sub>
							<it>i </it>
						</sub>the corresponding <inline-formula>
							<m:math name="1471-2105-8-S10-S7-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:msub>
											<m:mover accent="true">
												<m:mi>&#960;</m:mi>
												<m:mo>^</m:mo>
											</m:mover>
											<m:mi>i</m:mi>
										</m:msub>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFapaCgaqcamaaBaaaleaacqWGPbqAaeqaaaaa@3007@</m:annotation>
								</m:semantics>
							</m:math>
						</inline-formula>-value, which represents the empirical probability of being a true positive, was computed as <inline-formula>
							<m:math name="1471-2105-8-S10-S7-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:msub>
											<m:mover accent="true">
												<m:mi>&#960;</m:mi>
												<m:mo>^</m:mo>
											</m:mover>
											<m:mi>i</m:mi>
										</m:msub>
										<m:mo>=</m:mo>
										<m:mfrac>
											<m:mrow>
												<m:msubsup>
													<m:mi>n</m:mi>
													<m:mi>i</m:mi>
													<m:mrow>
														<m:mi>T</m:mi>
														<m:mi>P</m:mi>
													</m:mrow>
												</m:msubsup>
											</m:mrow>
											<m:mrow>
												<m:msub>
													<m:mi>n</m:mi>
													<m:mi>i</m:mi>
												</m:msub>
											</m:mrow>
										</m:mfrac>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFapaCgaqcamaaBaaaleaacqWGPbqAaeqaaOGaeyypa0ZaaSaaaeaacqWGUbGBdaqhaaWcbaGaemyAaKgabaGaemivaqLaemiuaafaaaGcbaGaemOBa42aaSbaaSqaaiabdMgaPbqabaaaaaaa@3964@</m:annotation>
								</m:semantics>
							</m:math>
						</inline-formula>, where <it>n</it>
						<sub>
							<it>i </it>
						</sub>(<it>i </it>= 1,...,<it>N</it>) is the number of examples with output values <it>&#966;</it>
						<sub>
							<it>i</it>-1 </sub>&#8804; <it>f</it>(<b>
							<it>x</it>
						</b>) &lt;<it>&#966;</it>
						<sub>
							<it>i </it>
						</sub>and <inline-formula>
							<m:math name="1471-2105-8-S10-S7-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:msubsup>
											<m:mi>n</m:mi>
											<m:mi>i</m:mi>
											<m:mrow>
												<m:mi>T</m:mi>
												<m:mi>P</m:mi>
											</m:mrow>
										</m:msubsup>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGUbGBdaqhaaWcbaGaemyAaKgabaGaemivaqLaemiuaafaaaaa@31F3@</m:annotation>
								</m:semantics>
							</m:math>
						</inline-formula> is the number of true splice sites in the same output range. Additionally, we determined the empirical cumulative probability as follows <inline-formula>
							<m:math name="1471-2105-8-S10-S7-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:msubsup>
											<m:mover accent="true">
												<m:mi>&#960;</m:mi>
												<m:mo>^</m:mo>
											</m:mover>
											<m:mi>i</m:mi>
											<m:mi>c</m:mi>
										</m:msubsup>
										<m:mo>=</m:mo>
										<m:mrow>
											<m:mo>(</m:mo>
											<m:mrow>
												<m:mstyle displaystyle="true">
													<m:msubsup>
														<m:mo>&#8721;</m:mo>
														<m:mrow>
															<m:mi>j</m:mi>
															<m:mo>=</m:mo>
															<m:mi>i</m:mi>
														</m:mrow>
														<m:mi>N</m:mi>
													</m:msubsup>
													<m:mrow>
														<m:msubsup>
															<m:mi>n</m:mi>
															<m:mi>j</m:mi>
															<m:mrow>
																<m:mi>T</m:mi>
																<m:mi>P</m:mi>
															</m:mrow>
														</m:msubsup>
													</m:mrow>
												</m:mstyle>
											</m:mrow>
											<m:mo>)</m:mo>
										</m:mrow>
										<m:mo>/</m:mo>
										<m:mrow>
											<m:mo>(</m:mo>
											<m:mrow>
												<m:mstyle displaystyle="true">
													<m:msubsup>
														<m:mo>&#8721;</m:mo>
														<m:mrow>
															<m:mi>j</m:mi>
															<m:mo>=</m:mo>
															<m:mi>i</m:mi>
														</m:mrow>
														<m:mi>N</m:mi>
													</m:msubsup>
													<m:mrow>
														<m:msub>
															<m:mi>n</m:mi>
															<m:mi>j</m:mi>
														</m:msub>
													</m:mrow>
												</m:mstyle>
											</m:mrow>
											<m:mo>)</m:mo>
										</m:mrow>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFapaCgaqcamaaDaaaleaacqWGPbqAaeaacqWGJbWyaaGccqGH9aqpdaqadaqaamaaqadabaGaemOBa42aa0baaSqaaiabdQgaQbqaaiabdsfaujabdcfaqbaaaeaacqWGQbGAcqGH9aqpcqWGPbqAaeaacqWGobGta0GaeyyeIuoaaOGaayjkaiaawMcaaiabc+caVmaabmaabaWaaabmaeaacqWGUbGBdaWgaaWcbaGaemOAaOgabeaaaeaacqWGQbGAcqGH9aqpcqWGPbqAaeaacqWGobGta0GaeyyeIuoaaOGaayjkaiaawMcaaaaa@4C5E@</m:annotation>
								</m:semantics>
							</m:math>
						</inline-formula>. In order to obtain a smooth and strictly monotonically increasing probability estimate, we solve the following quadratic optimization problem:</p>
					<p>
						<display-formula>
							<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-8-S10-S7-i5">
								<m:semantics>
									<m:mrow>
										<m:mtable columnalign="left">
											<m:mtr columnalign="left">
												<m:mtd columnalign="left">
													<m:mrow>
														<m:munder>
															<m:mrow>
																<m:mi>min</m:mi><m:mo>&#8289;</m:mo>
															</m:mrow>
															<m:mrow>
																<m:mi>&#960;</m:mi><m:mo>,</m:mo><m:msup>
																	<m:mi>&#960;</m:mi>
																	<m:mi>c</m:mi>
																</m:msup>
																<m:mo>&#8712;</m:mo><m:msubsup>
																	<m:mi>&#8477;</m:mi>
																	<m:mo>+</m:mo>
																	<m:mi>N</m:mi>
																</m:msubsup>
															</m:mrow>
														</m:munder>
													</m:mrow>
												</m:mtd>
												<m:mtd columnalign="left">
													<m:mrow>
														<m:mstyle displaystyle="true">
															<m:munderover>
																<m:mo>&#8721;</m:mo>
																<m:mrow>
																	<m:mi>i</m:mi><m:mo>=</m:mo><m:mn>1</m:mn>
																</m:mrow>
																<m:mi>N</m:mi>
															</m:munderover>
															<m:mrow>
																<m:mo stretchy="false">(</m:mo><m:msub>
																	<m:mi>s</m:mi>
																	<m:mi>i</m:mi>
																</m:msub>
																<m:mo stretchy="false">(</m:mo><m:mi>&#960;</m:mi><m:mo stretchy="false">)</m:mo><m:mo>+</m:mo><m:msub>
																	<m:mi>t</m:mi>
																	<m:mi>i</m:mi>
																</m:msub>
																<m:mo stretchy="false">(</m:mo><m:msup>
																	<m:mi>&#960;</m:mi>
																	<m:mi>c</m:mi>
																</m:msup>
																<m:mo stretchy="false">)</m:mo><m:mo stretchy="false">)</m:mo>
															</m:mrow>
														</m:mstyle>
													</m:mrow>
												</m:mtd>
											</m:mtr>
											<m:mtr columnalign="left">
												<m:mtd columnalign="left">
													<m:mrow>
														<m:mtext>s</m:mtext><m:mo>.</m:mo><m:mtext>t</m:mtext><m:mo>.</m:mo>
													</m:mrow>
												</m:mtd>
												<m:mtd columnalign="left">
													<m:mrow>
														<m:msub>
															<m:mi>&#960;</m:mi>
															<m:mi>i</m:mi>
														</m:msub>
														<m:mo>&#8804;</m:mo><m:msubsup>
															<m:mi>&#960;</m:mi>
															<m:mi>i</m:mi>
															<m:mi>c</m:mi>
														</m:msubsup>
														<m:mo>,</m:mo><m:mtext>&#160;for&#160;all&#160;</m:mtext><m:mi>i</m:mi><m:mo>=</m:mo><m:mn>1</m:mn><m:mo>,</m:mo><m:mn>...</m:mn><m:mo>,</m:mo><m:mi>N</m:mi>
													</m:mrow>
												</m:mtd>
											</m:mtr>
											<m:mtr columnalign="left">
												<m:mtd columnalign="left">
													<m:mrow/>
												</m:mtd>
												<m:mtd columnalign="left">
													<m:mrow>
														<m:msub>
															<m:mi>&#960;</m:mi>
															<m:mi>i</m:mi>
														</m:msub>
														<m:mo>&#8804;</m:mo><m:msub>
															<m:mi>&#960;</m:mi>
															<m:mrow>
																<m:mi>i</m:mi><m:mo>+</m:mo><m:mn>1</m:mn>
															</m:mrow>
														</m:msub>
														<m:mo>&#8722;</m:mo><m:mi>&#949;</m:mi><m:mo>,</m:mo><m:mtext>&#160;for&#160;all&#160;</m:mtext><m:mi>i</m:mi><m:mo>=</m:mo><m:mn>1</m:mn><m:mo>,</m:mo><m:mn>...</m:mn><m:mo>,</m:mo><m:mi>N</m:mi><m:mo>&#8722;</m:mo><m:mn>1</m:mn>
													</m:mrow>
												</m:mtd>
											</m:mtr>
											<m:mtr columnalign="left">
												<m:mtd columnalign="left">
													<m:mrow/>
												</m:mtd>
												<m:mtd columnalign="left">
													<m:mrow>
														<m:msubsup>
															<m:mi>&#960;</m:mi>
															<m:mi>i</m:mi>
															<m:mi>c</m:mi>
														</m:msubsup>
														<m:mo>&#8804;</m:mo><m:msubsup>
															<m:mi>&#960;</m:mi>
															<m:mrow>
																<m:mi>i</m:mi><m:mo>+</m:mo><m:mn>1</m:mn>
															</m:mrow>
															<m:mi>c</m:mi>
														</m:msubsup>
														<m:mo>&#8722;</m:mo><m:mi>&#949;</m:mi><m:mtext>,&#160;for&#160;all&#160;</m:mtext><m:mi>i</m:mi><m:mo>=</m:mo><m:mn>1</m:mn><m:mo>,</m:mo><m:mn>...</m:mn><m:mo>,</m:mo><m:mi>N</m:mi><m:mo>&#8722;</m:mo><m:mn>1</m:mn><m:mo>,</m:mo>
													</m:mrow>
												</m:mtd>
											</m:mtr>
										</m:mtable>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqaaeabcaaaaeaadaWfqaqaaiGbc2gaTjabcMgaPjabc6gaUbWcbaacciGae8hWdaNaeiilaWIae8hWda3aaWbaaWqabeaacqWGJbWyaaWccqGHiiIZtuuDJXwAK1uy0HMmaeHbfv3ySLgzG0uy0HgiuD3BaGabaiab+1risnaaDaaameaacqGHRaWkaeaacqWGobGtaaaaleqaaaGcbaWaaabCaeaacqGGOaakcqWGZbWCdaWgaaWcbaGaemyAaKgabeaakiabcIcaOGGadiab9b8aWjabcMcaPiabgUcaRiabdsha0naaBaaaleaacqWGPbqAaeqaaOGaeiikaGIae0hWda3aaWbaaSqabeaacqWGJbWyaaGccqGGPaqkcqGGPaqkaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabd6eaobqdcqGHris5aaGcbaGaee4CamNaeiOla4IaeeiDaqNaeiOla4cabaGae8hWda3aaSbaaSqaaiabdMgaPbqabaGccqGHKjYOcqWFapaCdaqhaaWcbaGaemyAaKgabaGaem4yamgaaOGaeiilaWIaeeiiaaIaeeOzayMaee4Ba8MaeeOCaiNaeeiiaaIaeeyyaeMaeeiBaWMaeeiBaWMaeeiiaaIaemyAaKMaeyypa0JaeGymaeJaeiilaWIaeiOla4IaeiOla4IaeiOla4IaeiilaWIaemOta4eabaaabaGae8hWda3aaSbaaSqaaiabdMgaPbqabaGccqGHKjYOcqWFapaCdaWgaaWcbaGaemyAaKMaey4kaSIaeGymaedabeaakiabgkHiTiab=v7aLjabcYcaSiabbccaGiabbAgaMjabb+gaVjabbkhaYjabbccaGiabbggaHjabbYgaSjabbYgaSjabbccaGiabdMgaPjabg2da9iabigdaXiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabd6eaojabgkHiTiabigdaXaqaaaqaaiab=b8aWnaaDaaaleaacqWGPbqAaeaacqWGJbWyaaGccqGHKjYOcqWFapaCdaqhaaWcbaGaemyAaKMaey4kaSIaeGymaedabaGaem4yamgaaOGaeyOeI0Iae8xTduMaeeilaWIaeeiiaaIaeeOzayMaee4Ba8MaeeOCaiNaeeiiaaIaeeyyaeMaeeiBaWMaeeiBaWMaeeiiaaIaemyAaKMaeyypa0JaeGymaeJaeiilaWIaeiOla4IaeiOla4IaeiOla4IaeiilaWIaemOta4KaeyOeI0IaeGymaeJaeiilaWcaaaaa@C8C6@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>where <it>&#949; </it>= 10<sup>-4 </sup>is a small constant ensuring that the functions are <it>strictly </it>monotonically increasing and <inline-formula>
							<m:math name="1471-2105-8-S10-S7-i6" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:msub>
											<m:mi>s</m:mi>
											<m:mi>i</m:mi>
										</m:msub>
										<m:mo stretchy="false">(</m:mo>
										<m:mi>&#960;</m:mi>
										<m:mo stretchy="false">)</m:mo>
										<m:mo>=</m:mo>
										<m:mfrac>
											<m:mrow>
												<m:msub>
													<m:mi>n</m:mi>
													<m:mi>i</m:mi>
												</m:msub>
											</m:mrow>
											<m:mrow>
												<m:mstyle displaystyle="true">
													<m:msubsup>
														<m:mo>&#8721;</m:mo>
														<m:mrow>
															<m:mi>j</m:mi>
															<m:mo>=</m:mo>
															<m:mn>1</m:mn>
														</m:mrow>
														<m:mi>N</m:mi>
													</m:msubsup>
													<m:mrow>
														<m:msub>
															<m:mi>n</m:mi>
															<m:mi>j</m:mi>
														</m:msub>
													</m:mrow>
												</m:mstyle>
											</m:mrow>
										</m:mfrac>
										<m:msup>
											<m:mrow>
												<m:mo stretchy="false">(</m:mo>
												<m:msub>
													<m:mi>&#960;</m:mi>
													<m:mi>i</m:mi>
												</m:msub>
												<m:mo>&#8722;</m:mo>
												<m:msub>
													<m:mover accent="true">
														<m:mi>&#960;</m:mi>
														<m:mo>^</m:mo>
													</m:mover>
													<m:mi>i</m:mi>
												</m:msub>
												<m:mo stretchy="false">)</m:mo>
											</m:mrow>
											<m:mn>2</m:mn>
										</m:msup>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGZbWCdaWgaaWcbaGaemyAaKgabeaakiabcIcaOGGadiab=b8aWjabcMcaPiabg2da9maalaaabaGaemOBa42aaSbaaSqaaiabdMgaPbqabaaakeaadaaeWaqaaiabd6gaUnaaBaaaleaacqWGQbGAaeqaaaqaaiabdQgaQjabg2da9iabigdaXaqaaiabd6eaobqdcqGHris5aaaakiabcIcaOGGaciab+b8aWnaaBaaaleaacqWGPbqAaeqaaOGaeyOeI0Iaf4hWdaNbaKaadaWgaaWcbaGaemyAaKgabeaakiabcMcaPmaaCaaaleqabaGaeGOmaidaaaaa@4B00@</m:annotation>
								</m:semantics>
							</m:math>
						</inline-formula> and <inline-formula>
							<m:math name="1471-2105-8-S10-S7-i7" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:msub>
											<m:mi>t</m:mi>
											<m:mi>i</m:mi>
										</m:msub>
										<m:mo stretchy="false">(</m:mo>
										<m:msup>
											<m:mi>&#960;</m:mi>
											<m:mi>c</m:mi>
										</m:msup>
										<m:mo stretchy="false">)</m:mo>
										<m:mo>=</m:mo>
										<m:mfrac>
											<m:mrow>
												<m:mstyle displaystyle="true">
													<m:msubsup>
														<m:mo>&#8721;</m:mo>
														<m:mrow>
															<m:mi>j</m:mi>
															<m:mo>=</m:mo>
															<m:mi>i</m:mi>
														</m:mrow>
														<m:mi>N</m:mi>
													</m:msubsup>
													<m:mrow>
														<m:msub>
															<m:mi>n</m:mi>
															<m:mi>j</m:mi>
														</m:msub>
													</m:mrow>
												</m:mstyle>
											</m:mrow>
											<m:mrow>
												<m:mstyle displaystyle="true">
													<m:msubsup>
														<m:mo>&#8721;</m:mo>
														<m:mrow>
															<m:mi>j</m:mi>
															<m:mo>=</m:mo>
															<m:mn>1</m:mn>
														</m:mrow>
														<m:mi>N</m:mi>
													</m:msubsup>
													<m:mrow>
														<m:msub>
															<m:mi>n</m:mi>
															<m:mi>j</m:mi>
														</m:msub>
													</m:mrow>
												</m:mstyle>
											</m:mrow>
										</m:mfrac>
										<m:msup>
											<m:mrow>
												<m:mo stretchy="false">(</m:mo>
												<m:msubsup>
													<m:mi>&#960;</m:mi>
													<m:mi>i</m:mi>
													<m:mi>c</m:mi>
												</m:msubsup>
												<m:mo>&#8722;</m:mo>
												<m:msubsup>
													<m:mover accent="true">
														<m:mi>&#960;</m:mi>
														<m:mo>^</m:mo>
													</m:mover>
													<m:mi>i</m:mi>
													<m:mi>c</m:mi>
												</m:msubsup>
												<m:mo stretchy="false">)</m:mo>
											</m:mrow>
											<m:mn>2</m:mn>
										</m:msup>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWG0baDdaWgaaWcbaGaemyAaKgabeaakiabcIcaOGGadiab=b8aWnaaCaaaleqabaGaem4yamgaaOGaeiykaKIaeyypa0ZaaSaaaeaadaaeWaqaaiabd6gaUnaaBaaaleaacqWGQbGAaeqaaaqaaiabdQgaQjabg2da9iabdMgaPbqaaiabd6eaobqdcqGHris5aaGcbaWaaabmaeaacqWGUbGBdaWgaaWcbaGaemOAaOgabeaaaeaacqWGQbGAcqGH9aqpcqaIXaqmaeaacqWGobGta0GaeyyeIuoaaaGccqGGOaakiiGacqGFapaCdaqhaaWcbaGaemyAaKgabaGaem4yamgaaOGaeyOeI0Iaf4hWdaNbaKaadaqhaaWcbaGaemyAaKgabaGaem4yamgaaOGaeiykaKYaaWbaaSqabeaacqaIYaGmaaaaaa@5604@</m:annotation>
								</m:semantics>
							</m:math>
						</inline-formula>, ensuring that large differences between the final and empirical estimates in ranges with many outputs are penalized more strongly. Using the newly computed values <it>&#960;</it>
						<sub>1</sub>,...,<it>&#960;</it>
						<sub>
							<it>N</it>
						</sub>, we can compute for any output value <it>f</it>(<b>
							<it>x</it>
						</b>) the corresponding posterior probability estimate <it>P</it>(<it>y </it>= 1|<it>f</it>(<b>
							<it>x</it>
						</b>)) by linear interpolation</p>
					<p>
						<display-formula>
							<m:math name="1471-2105-8-S10-S7-i8" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:mi>P</m:mi>
										<m:mo stretchy="false">(</m:mo>
										<m:mi>y</m:mi>
										<m:mo>=</m:mo>
										<m:mn>1</m:mn>
										<m:mo>|</m:mo>
										<m:mi>f</m:mi>
										<m:mo stretchy="false">(</m:mo>
										<m:mi>x</m:mi>
										<m:mo stretchy="false">)</m:mo>
										<m:mo stretchy="false">)</m:mo>
										<m:mo>=</m:mo>
										<m:mrow>
											<m:mo>{</m:mo>
											<m:mrow>
												<m:mtable columnalign="left">
													<m:mtr columnalign="left">
														<m:mtd columnalign="left">
															<m:mrow>
																<m:msub>
																	<m:mi>&#960;</m:mi>
																	<m:mn>1</m:mn>
																</m:msub>
															</m:mrow>
														</m:mtd>
														<m:mtd columnalign="left">
															<m:mrow>
																<m:mtext>for&#160;</m:mtext>
																<m:mi>f</m:mi>
																<m:mo stretchy="false">(</m:mo>
																<m:mi>x</m:mi>
																<m:mo stretchy="false">)</m:mo>
																<m:mo>&lt;</m:mo>
																<m:msub>
																	<m:mi>&#966;</m:mi>
																	<m:mn>1</m:mn>
																</m:msub>
															</m:mrow>
														</m:mtd>
													</m:mtr>
													<m:mtr columnalign="left">
														<m:mtd columnalign="left">
															<m:mrow>
																<m:mi>r</m:mi>
																<m:mo stretchy="false">(</m:mo>
																<m:msub>
																	<m:mi>&#966;</m:mi>
																	<m:mi>i</m:mi>
																</m:msub>
																<m:mo>,</m:mo>
																<m:msub>
																	<m:mi>&#966;</m:mi>
																	<m:mrow>
																		<m:mi>i</m:mi>
																		<m:mo>+</m:mo>
																		<m:mn>1</m:mn>
																	</m:mrow>
																</m:msub>
																<m:mo stretchy="false">)</m:mo>
															</m:mrow>
														</m:mtd>
														<m:mtd columnalign="left">
															<m:mrow>
																<m:mtext>for&#160;</m:mtext>
																<m:msub>
																	<m:mi>&#966;</m:mi>
																	<m:mi>i</m:mi>
																</m:msub>
																<m:mo>&#8804;</m:mo>
																<m:mi>f</m:mi>
																<m:mo stretchy="false">(</m:mo>
																<m:mi>x</m:mi>
																<m:mo stretchy="false">)</m:mo>
																<m:mo>&lt;</m:mo>
																<m:msub>
																	<m:mi>&#966;</m:mi>
																	<m:mrow>
																		<m:mi>i</m:mi>
																		<m:mo>+</m:mo>
																		<m:mn>1</m:mn>
																	</m:mrow>
																</m:msub>
																<m:mo>,</m:mo>
															</m:mrow>
														</m:mtd>
													</m:mtr>
													<m:mtr columnalign="left">
														<m:mtd columnalign="left">
															<m:mrow>
																<m:msub>
																	<m:mi>&#960;</m:mi>
																	<m:mi>N</m:mi>
																</m:msub>
															</m:mrow>
														</m:mtd>
														<m:mtd columnalign="left">
															<m:mrow>
																<m:mtext>for&#160;</m:mtext>
																<m:mi>f</m:mi>
																<m:mo stretchy="false">(</m:mo>
																<m:mi>x</m:mi>
																<m:mo stretchy="false">)</m:mo>
																<m:mo>&#8805;</m:mo>
																<m:msub>
																	<m:mi>&#966;</m:mi>
																	<m:mi>N</m:mi>
																</m:msub>
															</m:mrow>
														</m:mtd>
													</m:mtr>
												</m:mtable>
											</m:mrow>
										</m:mrow>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaucqGGOaakcqWG5bqEcqGH9aqpcqaIXaqmcqGG8baFcqWGMbGzcqGGOaakieWacqWF4baEcqGGPaqkcqGGPaqkcqGH9aqpdaGabeqaauaabaqadiaaaeaaiiGacqGFapaCdaWgaaWcbaGaeGymaedabeaaaOqaaiabbAgaMjabb+gaVjabbkhaYjabbccaGiabdAgaMjabcIcaOiab=Hha4jabcMcaPiabgYda8iab+z8aMnaaBaaaleaacqaIXaqmaeqaaaGcbaGaemOCaiNaeiikaGIae4NXdy2aaSbaaSqaaiabdMgaPbqabaGccqGGSaalcqGFgpGzdaWgaaWcbaGaemyAaKMaey4kaSIaeGymaedabeaakiabcMcaPaqaaiabbAgaMjabb+gaVjabbkhaYjabbccaGiab+z8aMnaaBaaaleaacqWGPbqAaeqaaOGaeyizImQaemOzayMaeiikaGIae8hEaGNaeiykaKIaeyipaWJae4NXdy2aaSbaaSqaaiabdMgaPjabgUcaRiabigdaXaqabaGccqGGSaalaeaacqGFapaCdaWgaaWcbaGaemOta4eabeaaaOqaaiabbAgaMjabb+gaVjabbkhaYjabbccaGiabdAgaMjabcIcaOiab=Hha4jabcMcaPiabgwMiZkab+z8aMnaaBaaaleaacqWGobGtaeqaaaaaaOGaay5Eaaaaaa@7E5D@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>where <inline-formula>
							<m:math name="1471-2105-8-S10-S7-i9" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:mi>r</m:mi>
										<m:mo stretchy="false">(</m:mo>
										<m:msub>
											<m:mi>&#966;</m:mi>
											<m:mi>i</m:mi>
										</m:msub>
										<m:mo>,</m:mo>
										<m:msub>
											<m:mi>&#966;</m:mi>
											<m:mrow>
												<m:mi>i</m:mi>
												<m:mo>+</m:mo>
												<m:mn>1</m:mn>
											</m:mrow>
										</m:msub>
										<m:mo stretchy="false">)</m:mo>
										<m:mo>=</m:mo>
										<m:mfrac>
											<m:mrow>
												<m:msub>
													<m:mi>&#960;</m:mi>
													<m:mrow>
														<m:mi>i</m:mi>
														<m:mo>+</m:mo>
														<m:mn>1</m:mn>
													</m:mrow>
												</m:msub>
												<m:mo stretchy="false">(</m:mo>
												<m:mi>f</m:mi>
												<m:mo stretchy="false">(</m:mo>
												<m:mi>x</m:mi>
												<m:mo stretchy="false">)</m:mo>
												<m:mo>&#8722;</m:mo>
												<m:msub>
													<m:mi>&#966;</m:mi>
													<m:mi>i</m:mi>
												</m:msub>
												<m:mo stretchy="false">)</m:mo>
												<m:mo>+</m:mo>
												<m:msub>
													<m:mi>&#960;</m:mi>
													<m:mi>i</m:mi>
												</m:msub>
												<m:mo stretchy="false">(</m:mo>
												<m:msub>
													<m:mi>&#966;</m:mi>
													<m:mrow>
														<m:mi>i</m:mi>
														<m:mo>+</m:mo>
														<m:mn>1</m:mn>
													</m:mrow>
												</m:msub>
												<m:mo>&#8722;</m:mo>
												<m:mi>f</m:mi>
												<m:mo stretchy="false">(</m:mo>
												<m:mi>x</m:mi>
												<m:mo stretchy="false">)</m:mo>
												<m:mo stretchy="false">)</m:mo>
											</m:mrow>
											<m:mrow>
												<m:msub>
													<m:mi>&#966;</m:mi>
													<m:mrow>
														<m:mi>i</m:mi>
														<m:mo>+</m:mo>
														<m:mn>1</m:mn>
													</m:mrow>
												</m:msub>
												<m:mo>&#8722;</m:mo>
												<m:msub>
													<m:mi>&#966;</m:mi>
													<m:mi>i</m:mi>
												</m:msub>
											</m:mrow>
										</m:mfrac>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGYbGCcqGGOaakiiGacqWFgpGzdaWgaaWcbaGaemyAaKgabeaakiabcYcaSiab=z8aMnaaBaaaleaacqWGPbqAcqGHRaWkcqaIXaqmaeqaaOGaeiykaKIaeyypa0ZaaSaaaeaacqWFapaCdaWgaaWcbaGaemyAaKMaey4kaSIaeGymaedabeaakiabcIcaOiabdAgaMjabcIcaOGqadiab+Hha4jabcMcaPiabgkHiTiab=z8aMnaaBaaaleaacqWGPbqAaeqaaOGaeiykaKIaey4kaSIae8hWda3aaSbaaSqaaiabdMgaPbqabaGccqGGOaakcqWFgpGzdaWgaaWcbaGaemyAaKMaey4kaSIaeGymaedabeaakiabgkHiTiabdAgaMjabcIcaOiab+Hha4jabcMcaPiabcMcaPaqaaiab=z8aMnaaBaaaleaacqWGPbqAcqGHRaWkcqaIXaqmaeqaaOGaeyOeI0Iae8NXdy2aaSbaaSqaaiabdMgaPbqabaaaaaaa@634A@</m:annotation>
								</m:semantics>
							</m:math>
						</inline-formula>. The cumulative posterior probability <it>P</it>
						<sup>
							<it>c</it>
						</sup>(<it>y </it>= 1|<it>f</it>(<b>
							<it>x</it>
						</b>)) is computed analogously. The above estimation procedure was performed separately for every classifier.</p>
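					<p>The interpolation rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name <it>interpolate_posterior</it> and the list-based representation of &#966;<sub>1</sub>,...,&#966;<sub><it>N</it></sub> and &#960;<sub>1</sub>,...,&#960;<sub><it>N</it></sub> are chosen for the example only, and the &#966;<sub><it>i</it></sub> are assumed to be strictly increasing.</p>
					<p>
```python
from bisect import bisect_right
from typing import Sequence


def interpolate_posterior(f_x: float, phi: Sequence[float], pi: Sequence[float]) -> float:
    """Piecewise-linear estimate of P(y = 1 | f(x)).

    phi: strictly increasing output values phi_1..phi_N (assumed for this sketch).
    pi:  calibrated posterior estimates pi_1..pi_N belonging to those values.
    """
    if len(phi) != len(pi) or not phi:
        raise ValueError("phi and pi must be non-empty and of equal length")
    # Outside the calibrated range the estimate is constant:
    # pi_N for f(x) >= phi_N and pi_1 for f(x) below phi_1.
    if f_x >= phi[-1]:
        return pi[-1]
    if phi[0] > f_x:
        return pi[0]
    # Locate the index i such that f(x) lies in [phi[i], phi[i + 1]).
    i = bisect_right(phi, f_x) - 1
    width = phi[i + 1] - phi[i]
    # r(phi_i, phi_{i+1}) as defined in the text.
    return (pi[i + 1] * (f_x - phi[i]) + pi[i] * (phi[i + 1] - f_x)) / width
```
					</p>
					<p>For example, with &#966; = (&#8722;1, 0, 2) and &#960; = (0.1, 0.5, 0.9), an output <it>f</it>(<b><it>x</it></b>) = 1 lies halfway between &#966;<sub>2</sub> and &#966;<sub>3</sub> and is mapped to (0.9&#183;1 + 0.5&#183;1)/2 = 0.7, while outputs below &#8722;1 or of at least 2 are mapped to 0.1 and 0.9, respectively.</p>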
				</sec>
			</sec>
			<sec>
				<st>
					<p>Identifying splice sites</p>
				</st>
				<p>Machine learning binary classification methods aim at estimating a classification function <it>f </it>: <inline-formula>
						<m:math name="1471-2105-8-S10-S7-i10" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mi mathvariant="script">X</m:mi>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqWFxepwaaa@384E@</m:annotation>
							</m:semantics>
						</m:math>
					</inline-formula> &#8594; {&#177;1} using labeled training data from <inline-formula>
						<m:math name="1471-2105-8-S10-S7-i10" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mi mathvariant="script">X</m:mi>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqWFxepwaaa@384E@</m:annotation>
							</m:semantics>
						</m:math>
					</inline-formula> &#215; {&#177;1} such that <it>f </it>will correctly classify unseen examples. In our case, the input space <inline-formula>
						<m:math name="1471-2105-8-S10-S7-i10" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mi mathvariant="script">X</m:mi>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqWFxepwaaa@384E@</m:annotation>
							</m:semantics>
						</m:math>
					</inline-formula> will contain simple representations of sequences of length <it>N</it>, {<it>A</it>, <it>C</it>, <it>G</it>, <it>T</it>}<sup>
						<it>N</it>
					</sup>, while +1 and &#8722;1 correspond to true splice sites and decoy sites, respectively. As classifiers, we use the posterior log-odds of a simple probabilistic model and SVMs with different kernels, as discussed below.</p>
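				<p>The setting described above can be illustrated with a short Python sketch of what a labeled example looks like. The names <it>is_labeled_example</it> and <it>training_data</it> as well as the toy sequences are hypothetical and are not taken from the data sets used in this work.</p>
				<p>
```python
ALPHABET = frozenset("ACGT")
LABELS = (+1, -1)  # +1: true splice site, -1: decoy site


def is_labeled_example(x: str, y: int, n: int) -> bool:
    """Return True if (x, y) lies in {A, C, G, T}^n x {+1, -1}."""
    return len(x) == n and set(x).issubset(ALPHABET) and y in LABELS


# Hypothetical toy training set with sequence windows of length N = 6.
training_data = [("AGGTAA", +1), ("CCGTTT", -1), ("TTGTGA", -1)]
all_valid = all(is_labeled_example(x, y, 6) for x, y in training_data)
```
				</p>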
				<sec>
					<st>
						<p>Posterior log-odds</p>
					</st>
					<p>The posterior log-odds of a probabilistic model with parameters <b>
							<it>&#952; </it>
						</b>are defined by</p>
					<p>
						<display-formula>
							<m:math name="1471-2105-8-S10-S7-i11" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:mtable columnalign="left">
											<m:mtr columnalign="left">
												<m:mtd columnalign="left">
													<m:mrow>
														<m:mi>g</m:mi>
														<m:mo stretchy="false">(</m:mo>
														<m:mi>x</m:mi>
														<m:mo stretchy="false">)</m:mo>
													</m:mrow>
												</m:mtd>
												<m:mtd columnalign="left">
													<m:mrow>
														<m:mo>:</m:mo>
														<m:mo>=</m:mo>
													</m:mrow>
												</m:mtd>
												<m:mtd columnalign="left">
													<m:mrow>
														<m:mi>log</m:mi>
														<m:mo>&#8289;</m:mo>
														<m:mo stretchy="false">(</m:mo>
														<m:mi>P</m:mi>
														<m:mo stretchy="false">(</m:mo>
														<m:mi>y</m:mi>
														<m:mo>=</m:mo>
														<m:mo>+</m:mo>
														<m:mn>1</m:mn>
														<m:mo>|</m:mo>
														<m:mi>x</m:mi>
														<m:mo>,</m:mo>
														<m:mi>&#952;</m:mi>
														<m:mo stretchy="false">)</m:mo>
														<m:mo stretchy="false">)</m:mo>
														<m:mo>&#8722;</m:mo>
														<m:mi>log</m:mi>
														<m:mo>&#8289;</m:mo>
														<m:mo stretchy="false">(</m:mo>
														<m:mi>P</m:mi>
														<m:mo stretchy="false">(</m:mo>
														<m:mi>y</m:mi>
														<m:mo>=</m:mo>
														<m:mo>&#8722;</m:mo>
														<m:mn>1</m:mn>
														<m:mo>|</m:mo>
														<m:mi>x</m:mi>
														<m:mo>,</m:mo>
														<m:mi>&#952;</m:mi>
														<m:mo stretchy="false">)</m:mo>
														<m:mo stretchy="false">)</m:mo>
													</m:mrow>
												</m:mtd>
											</m:mtr>
											<m:mtr columnalign="left">
												<m:mtd columnalign="left">
													<m:mrow/>
												</m:mtd>
												<m:mtd columnalign="left">
													<m:mo>=</m:mo>
												</m:mtd>
												<m:mtd columnalign="left">
													<m:mrow>
														<m:mi>log</m:mi>
														<m:mo>&#8289;</m:mo>
														<m:mo stretchy="false">(</m:mo>
														<m:mi>P</m:mi>
														<m:mo stretchy="false">(</m:mo>
														<m:mi>x</m:mi>
														<m:mo>|</m:mo>
														<m:msup>
															<m:mi>&#952;</m:mi>
															<m:mo>+</m:mo>
														</m:msup>
														<m:mo stretchy="false">)</m:mo>
														<m:mo stretchy="false">)</m:mo>
														<m:mo>&#8722;</m:mo>
														<m:mi>log</m:mi>
														<m:mo>&#8289;</m:mo>
														<m:mo stretchy="false">(</m:mo>
														<m:mi>P</m:mi>
														<m:mo stretchy="false">(</m:mo>
														<m:mi>x</m:mi>
														<m:mo>|</m:mo>
														<m:msup>
															<m:mi>&#952;</m:mi>
															<m:mo>&#8722;</m:mo>
														</m:msup>
														<m:mo stretchy="false">)</m:mo>
														<m:mo stretchy="false">)</m:mo>
														<m:mo>+</m:mo>
														<m:mi>b</m:mi>
														<m:mo>,</m:mo>
													</m:mrow>
												</m:mtd>
											</m:mtr>
										</m:mtable>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqaaeGadaaabaGaem4zaCMaeiikaGccbmGae8hEaGNaeiykaKcabaGaeiOoaOJaeyypa0dabaGagiiBaWMaei4Ba8Maei4zaCMaeiikaGIaemiuaaLaeiikaGIaemyEaKNaeyypa0Jaey4kaSIaeGymaeJaeiiFaWNae8hEaGNaeiilaWcccmGae4hUdeNaeiykaKIaeiykaKIaeyOeI0IagiiBaWMaei4Ba8Maei4zaCMaeiikaGIaemiuaaLaeiikaGIaemyEaKNaeyypa0JaeyOeI0IaeGymaeJaeiiFaWNae8hEaGNaeiilaWIae4hUdeNaeiykaKIaeiykaKcabaaabaGaeyypa0dabaGagiiBaWMaei4Ba8Maei4zaCMaeiikaGIaemiuaaLaeiikaGIae8hEaGNaeiiFaWNae4hUde3aaWbaaSqabeaacqGHRaWkaaGccqGGPaqkcqGGPaqkcqGHsislcyGGSbaBcqGGVbWBcqGGNbWzcqGGOaakcqWGqbaucqGGOaakcqWF4baEcqGG8baFcqGF4oqCdaahaaWcbeqaaiabgkHiTaaakiabcMcaPiabcMcaPiabgUcaRiabdkgaIjabcYcaSaaaaaa@7B34@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>where <it>b </it>is a bias term. We use <it>f</it>(<b>
							<it>x</it>
						</b>) = sign(<it>g</it>(<b>
							<it>x</it>
						</b>)) for classification and model the class-conditional distributions by Markov chains of order <it>d</it>
					</p>
					<p>
						<display-formula>
							<m:math name="1471-2105-8-S10-S7-i12" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:mtable columnalign="left">
											<m:mtr columnalign="left">
												<m:mtd columnalign="left">
													<m:mrow>
														<m:mi>P</m:mi>
														<m:mo stretchy="false">(</m:mo>
														<m:mi>x</m:mi>
														<m:mo>|</m:mo>
														<m:msup>
															<m:mi>&#952;</m:mi>
															<m:mo>&#177;</m:mo>
														</m:msup>
														<m:mo stretchy="false">)</m:mo>
													</m:mrow>
												</m:mtd>
												<m:mtd columnalign="left">
													<m:mo>=</m:mo>
												</m:mtd>
												<m:mtd columnalign="left">
													<m:mrow>
														<m:mi>P</m:mi>
														<m:mo stretchy="false">(</m:mo>
														<m:msub>
															<m:mi>x</m:mi>
															<m:mn>1</m:mn>
														</m:msub>
														<m:mo>,</m:mo>
														<m:mn>...</m:mn>
														<m:mo>,</m:mo>
														<m:msub>
															<m:mi>x</m:mi>
															<m:mi>N</m:mi>
														</m:msub>
														<m:mo>|</m:mo>
														<m:msup>
															<m:mi>&#952;</m:mi>
															<m:mo>&#177;</m:mo>
														</m:msup>
														<m:mo stretchy="false">)</m:mo>
													</m:mrow>
												</m:mtd>
											</m:mtr>
											<m:mtr columnalign="left">
												<m:mtd columnalign="left">
													<m:mrow/>
												</m:mtd>
												<m:mtd columnalign="left">
													<m:mo>=</m:mo>
												</m:mtd>
												<m:mtd columnalign="left">
													<m:mrow>
														<m:mi>P</m:mi>
														<m:mo stretchy="false">(</m:mo>
														<m:msub>
															<m:mi>x</m:mi>
															<m:mn>1</m:mn>
														</m:msub>
														<m:mo>,</m:mo>
														<m:mn>...</m:mn>
														<m:mo>,</m:mo>
														<m:msub>
															<m:mi>x</m:mi>
															<m:mi>d</m:mi>
														</m:msub>
														<m:mo>|</m:mo>
														<m:msup>
															<m:mi>&#952;</m:mi>
															<m:mo>&#177;</m:mo>
														</m:msup>
														<m:mo stretchy="false">)</m:mo>
														<m:mstyle displaystyle="true">
															<m:munderover>
																<m:mo>&#8719;</m:mo>
																<m:mrow>
																	<m:mi>i</m:mi>
																	<m:mo>=</m:mo>
																	<m:mi>d</m:mi>
																	<m:mo>+</m:mo>
																	<m:mn>1</m:mn>
																</m:mrow>
																<m:mi>N</m:mi>
															</m:munderover>
															<m:mrow>
																<m:mi>P</m:mi>
																<m:mo stretchy="false">(</m:mo>
																<m:msub>
																	<m:mi>x</m:mi>
																	<m:mi>i</m:mi>
																</m:msub>
																<m:mo>|</m:mo>
																<m:msub>
																	<m:mi>x</m:mi>
																	<m:mrow>
																		<m:mi>i</m:mi>
																		<m:mo>&#8722;</m:mo>
																		<m:mn>1</m:mn>
																	</m:mrow>
																</m:msub>
																<m:mo>,</m:mo>
																<m:mn>...</m:mn>
																<m:mo>,</m:mo>
																<m:msub>
																	<m:mi>x</m:mi>
																	<m:mrow>
																		<m:mi>i</m:mi>
																		<m:mo>&#8722;</m:mo>
																		<m:mi>d</m:mi>
																	</m:mrow>
																</m:msub>
																<m:mo>,</m:mo>
																<m:msup>
																	<m:mi>&#952;</m:mi>
																	<m:mo>&#177;</m:mo>
																</m:msup>
																<m:mo stretchy="false">)</m:mo>
															</m:mrow>
														</m:mstyle>
													</m:mrow>
												</m:mtd>
											</m:mtr>
										</m:mtable>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqaaeGadaaabaGaemiuaaLaeiikaGccbmGae8hEaGNaeiiFaWhccmGae4hUde3aaWbaaSqabeaacqGHXcqSaaGccqGGPaqkaeaacqGH9aqpaeaacqWGqbaucqGGOaakcqWG4baEdaWgaaWcbaGaeGymaedabeaakiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabdIha4naaBaaaleaacqWGobGtaeqaaOGaeiiFaWNae4hUde3aaWbaaSqabeaacqGHXcqSaaGccqGGPaqkaeaaaeaacqGH9aqpaeaacqWGqbaucqGGOaakcqWG4baEdaWgaaWcbaGaeGymaedabeaakiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabdIha4naaBaaaleaacqWGKbazaeqaaOGaeiiFaWNae4hUde3aaWbaaSqabeaacqGHXcqSaaGccqGGPaqkdaqeWbqaaiabdcfaqjabcIcaOiabdIha4naaBaaaleaacqWGPbqAaeqaaOGaeiiFaWNaemiEaG3aaSbaaSqaaiabdMgaPjabgkHiTiabigdaXaqabaGccqGGSaalcqGGUaGlcqGGUaGlcqGGUaGlcqGGSaalcqWG4baEdaWgaaWcbaGaemyAaKMaeyOeI0IaemizaqgabeaakiabcYcaSiab+H7aXnaaCaaaleqabaGaeyySaelaaaqaaiabdMgaPjabg2da9iabdsgaKjabgUcaRiabigdaXaqaaiabd6eaobqdcqGHpis1aOGaeiykaKcaaaaa@8056@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>as described, for instance, in <abbrgrp>
							<abbr bid="B44">44</abbr>
						</abbrgrp>. Each factor in this product has to be estimated during model training, i.e., one counts how often each symbol appears at each position in the training data, conditioned on every possible <it>x</it>
						<sub>
							<it>i</it>-1</sub>,...,<it>x</it>
						<sub>
							<it>i</it>-<it>d</it>
						</sub>. Then for given model parameters <b>
							<it>&#952; </it>
						</b>we have</p>
					<p>
						<display-formula>
							<m:math name="1471-2105-8-S10-S7-i13" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:mi>P</m:mi>
										<m:mo stretchy="false">(</m:mo>
										<m:mi>x</m:mi>
										<m:mo>|</m:mo>
										<m:msup>
											<m:mi>&#952;</m:mi>
											<m:mo>&#177;</m:mo>
										</m:msup>
										<m:mo stretchy="false">)</m:mo>
										<m:mo>=</m:mo>
										<m:msubsup>
											<m:mi>&#952;</m:mi>
											<m:mn>0</m:mn>
											<m:mo>&#177;</m:mo>
										</m:msubsup>
										<m:mo stretchy="false">(</m:mo>
										<m:msub>
											<m:mi>x</m:mi>
											<m:mn>1</m:mn>
										</m:msub>
										<m:mo>,</m:mo>
										<m:mn>...</m:mn>
										<m:mo>,</m:mo>
										<m:msub>
											<m:mi>x</m:mi>
											<m:mi>d</m:mi>
										</m:msub>
										<m:mo stretchy="false">)</m:mo>
										<m:mstyle displaystyle="true">
											<m:munderover>
												<m:mo>&#8719;</m:mo>
												<m:mrow>
													<m:mi>i</m:mi>
													<m:mo>=</m:mo>
													<m:mi>d</m:mi>
													<m:mo>+</m:mo>
													<m:mn>1</m:mn>
												</m:mrow>
												<m:mi>N</m:mi>
											</m:munderover>
											<m:mrow>
												<m:msubsup>
													<m:mi>&#952;</m:mi>
													<m:mi>i</m:mi>
													<m:mo>&#177;</m:mo>
												</m:msubsup>
												<m:mo stretchy="false">(</m:mo>
												<m:msub>
													<m:mi>x</m:mi>
													<m:mi>i</m:mi>
												</m:msub>
												<m:mo>,</m:mo>
												<m:mn>...</m:mn>
												<m:mo>,</m:mo>
												<m:msub>
													<m:mi>x</m:mi>
													<m:mrow>
														<m:mi>i</m:mi>
														<m:mo>&#8722;</m:mo>
														<m:mi>d</m:mi>
													</m:mrow>
												</m:msub>
												<m:mo stretchy="false">)</m:mo>
											</m:mrow>
										</m:mstyle>
										<m:mo>,</m:mo>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaucqGGOaakieWacqWF4baEcqGG8baFiiWacqGF4oqCdaahaaWcbeqaaiabgglaXcaakiabcMcaPiabg2da9GGaciab9H7aXnaaDaaaleaacqaIWaamaeaacqGHXcqSaaGccqGGOaakcqWG4baEdaWgaaWcbaGaeGymaedabeaakiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiabdIha4naaBaaaleaacqWGKbazaeqaaOGaeiykaKYaaebCaeaacqqF4oqCdaqhaaWcbaGaemyAaKgabaGaeyySaelaaOGaeiikaGIaemiEaG3aaSbaaSqaaiabdMgaPbqabaGccqGGSaalcqGGUaGlcqGGUaGlcqGGUaGlcqGGSaalcqWG4baEdaWgaaWcbaGaemyAaKMaeyOeI0IaemizaqgabeaakiabcMcaPaWcbaGaemyAaKMaeyypa0JaemizaqMaey4kaSIaeGymaedabaGaemOta4eaniabg+GivdGccqGGSaalaaa@655E@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>where <inline-formula>
							<m:math name="1471-2105-8-S10-S7-i14" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:msubsup>
											<m:mi>&#952;</m:mi>
											<m:mn>0</m:mn>
											<m:mo>&#177;</m:mo>
										</m:msubsup>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWF4oqCdaqhaaWcbaGaeGimaadabaGaeyySaelaaaaa@3172@</m:annotation>
								</m:semantics>
							</m:math>
						</inline-formula> is an estimate for <it>P</it>(<it>x</it>
						<sub>1</sub>,...,<it>x</it>
						<sub>
							<it>d</it>
						</sub>) and <it>&#952;</it>
						<sub>
							<it>i</it>
						</sub>
						<sup>&#177;</sup>(<it>x</it>
						<sub>
							<it>i</it>
						</sub>,...,<it>x</it>
						<sub>
							<it>i</it>-<it>d</it>
						</sub>) is an estimate for <it>P</it>(<it>x</it>
						<sub>
							<it>i</it>
						</sub>
						<it>|x</it>
						<sub>
							<it>i</it>-1</sub>,...,<it>x</it>
						<sub>
							<it>i</it>-<it>d</it>
						</sub>). As the alphabet has four letters, each model has (<it>N </it>- <it>d </it>+ 1)&#183;4<sup>
							<it>d</it>+1 </sup>parameters and the maximum likelihood estimate is given by:</p>
					<p>
						<display-formula>
							<m:math name="1471-2105-8-S10-S7-i15" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:mtable columnalign="left">
											<m:mtr columnalign="left">
												<m:mtd columnalign="left">
													<m:mrow>
														<m:msub>
															<m:mi>&#952;</m:mi>
															<m:mn>0</m:mn>
														</m:msub>
														<m:mo stretchy="false">(</m:mo>
														<m:msub>
															<m:mi>s</m:mi>
															<m:mn>1</m:mn>
														</m:msub>
														<m:mo>,</m:mo>
														<m:mn>...</m:mn>
														<m:mo>,</m:mo>
														<m:msub>
															<m:mi>s</m:mi>
															<m:mi>d</m:mi>
														</m:msub>
														<m:mo stretchy="false">)</m:mo>
														<m:mo>=</m:mo>
													</m:mrow>
												</m:mtd>
											</m:mtr>
											<m:mtr columnalign="left">
												<m:mtd columnalign="left">
													<m:mrow>
														<m:mtext/>
														<m:mfrac>
															<m:mn>1</m:mn>
															<m:mrow>
																<m:mi>m</m:mi>
																<m:mo>+</m:mo>
																<m:mi>&#960;</m:mi>
															</m:mrow>
														</m:mfrac>
														<m:mrow>
															<m:mo>(</m:mo>
															<m:mrow>
																<m:mstyle displaystyle="true">
																	<m:munderover>
																		<m:mo>&#8721;</m:mo>
																		<m:mrow>
																			<m:mi>k</m:mi>
																			<m:mo>=</m:mo>
																			<m:mn>1</m:mn>
																		</m:mrow>
																		<m:mi>m</m:mi>
																	</m:munderover>
																	<m:mrow>
																		<m:mi>I</m:mi>
																		<m:mo stretchy="false">(</m:mo>
																		<m:msub>
																			<m:mi>s</m:mi>
																			<m:mn>1</m:mn>
																		</m:msub>
																		<m:mo>=</m:mo>
																		<m:msubsup>
																			<m:mi>x</m:mi>
																			<m:mn>1</m:mn>
																			<m:mi>k</m:mi>
																		</m:msubsup>
																		<m:mo>&#8743;</m:mo>
																		<m:mo>&#8943;</m:mo>
																		<m:mo>&#8743;</m:mo>
																		<m:msub>
																			<m:mi>s</m:mi>
																			<m:mi>d</m:mi>
																		</m:msub>
																		<m:mo>=</m:mo>
																		<m:msubsup>
																			<m:mi>x</m:mi>
																			<m:mi>d</m:mi>
																			<m:mi>k</m:mi>
																		</m:msubsup>
																		<m:mo stretchy="false">)</m:mo>
																		<m:mo>+</m:mo>
																		<m:mi>&#960;</m:mi>
																	</m:mrow>
																</m:mstyle>
															</m:mrow>
															<m:mo>)</m:mo>
														</m:mrow>
													</m:mrow>
												</m:mtd>
											</m:mtr>
											<m:mtr columnalign="left">
												<m:mtd columnalign="left">
													<m:mrow>
														<m:msub>
															<m:mi>&#952;</m:mi>
															<m:mi>i</m:mi>
														</m:msub>
														<m:mo stretchy="false">(</m:mo>
														<m:msub>
															<m:mi>s</m:mi>
															<m:mi>i</m:mi>
														</m:msub>
														<m:mo>,</m:mo>
														<m:mn>...</m:mn>
														<m:mo>,</m:mo>
														<m:msub>
															<m:mi>s</m:mi>
															<m:mrow>
																<m:mi>i</m:mi>
																<m:mo>&#8722;</m:mo>
																<m:mi>d</m:mi>
															</m:mrow>
														</m:msub>
														<m:mo stretchy="false">)</m:mo>
														<m:mo>=</m:mo>
													</m:mrow>
												</m:mtd>
											</m:mtr>
											<m:mtr columnalign="left">
												<m:mtd columnalign="left">
													<m:mrow>
														<m:mtext/>
														<m:mfrac>
															<m:mrow>
																<m:mstyle displaystyle="true">
																	<m:msubsup>
																		<m:mo>&#8721;</m:mo>
																		<m:mrow>
																			<m:mi>k</m:mi>
																			<m:mo>=</m:mo>
																			<m:mn>1</m:mn>
																		</m:mrow>
																		<m:mi>m</m:mi>
																	</m:msubsup>
																	<m:mrow>
																		<m:mi>I</m:mi>
																		<m:mo stretchy="false">(</m:mo>
																		<m:msub>
																			<m:mi>s</m:mi>
																			<m:mi>i</m:mi>
																		</m:msub>
																		<m:mo>=</m:mo>
																		<m:msubsup>
																			<m:mi>x</m:mi>
																			<m:mi>i</m:mi>
																			<m:mi>k</m:mi>
																		</m:msubsup>
																		<m:mo>&#8743;</m:mo>
																		<m:mo>&#8943;</m:mo>
																		<m:mo>&#8743;</m:mo>
																		<m:msub>
																			<m:mi>s</m:mi>
																			<m:mrow>
																				<m:mi>i</m:mi>
																				<m:mo>&#8722;</m:mo>
																				<m:mi>d</m:mi>
																			</m:mrow>
																		</m:msub>
																		<m:mo>=</m:mo>
																		<m:msubsup>
																			<m:mi>x</m:mi>
																			<m:mrow>
																				<m:mi>i</m:mi>
																				<m:mo>&#8722;</m:mo>
																				<m:mi>d</m:mi>
																			</m:mrow>
																			<m:mi>k</m:mi>
																		</m:msubsup>
																		<m:mo stretchy="false">)</m:mo>
																		<m:mo>+</m:mo>
																		<m:mi>&#960;</m:mi>
																	</m:mrow>
																</m:mstyle>
															</m:mrow>
															<m:mrow>
																<m:mstyle displaystyle="true">
																	<m:msubsup>
																		<m:mo>&#8721;</m:mo>
																		<m:mrow>
																			<m:mi>k</m:mi>
																			<m:mo>=</m:mo>
																			<m:mn>1</m:mn>
																		</m:mrow>
																		<m:mi>m</m:mi>
																	</m:msubsup>
																	<m:mrow>
																		<m:mi>I</m:mi>
																		<m:mo stretchy="false">(</m:mo>
																		<m:msub>
																			<m:mi>s</m:mi>
																			<m:mrow>
																				<m:mi>i</m:mi>
																				<m:mo>&#8722;</m:mo>
																				<m:mn>1</m:mn>
																			</m:mrow>
																		</m:msub>
																		<m:mo>=</m:mo>
																		<m:msubsup>
																			<m:mi>x</m:mi>
																			<m:mrow>
																				<m:mi>i</m:mi>
																				<m:mo>&#8722;</m:mo>
																				<m:mn>1</m:mn>
																			</m:mrow>
																			<m:mi>k</m:mi>
																		</m:msubsup>
																		<m:mo>&#8743;</m:mo>
																		<m:mo>&#8943;</m:mo>
																		<m:mo>&#8743;</m:mo>
																		<m:msub>
																			<m:mi>s</m:mi>
																			<m:mrow>
																				<m:mi>i</m:mi>
																				<m:mo>&#8722;</m:mo>
																				<m:mi>d</m:mi>
																			</m:mrow>
																		</m:msub>
																		<m:mo>=</m:mo>
																		<m:msubsup>
																			<m:mi>x</m:mi>
																			<m:mrow>
																				<m:mi>i</m:mi>
																				<m:mo>&#8722;</m:mo>
																				<m:mi>d</m:mi>
																			</m:mrow>
																			<m:mi>k</m:mi>
																		</m:msubsup>
																		<m:mo stretchy="false">)</m:mo>
																		<m:mo>+</m:mo>
																		<m:mn>4</m:mn>
																		<m:mi>&#960;</m:mi>
																	</m:mrow>
																</m:mstyle>
															</m:mrow>
														</m:mfrac>
													</m:mrow>
												</m:mtd>
											</m:mtr>
										</m:mtable>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqaaeabbaaaaeaaiiGacqWF4oqCdaWgaaWcbaGaeGimaadabeaakiabcIcaOiabdohaZnaaBaaaleaacqaIXaqmaeqaaOGaeiilaWIaeiOla4IaeiOla4IaeiOla4IaeiilaWIaem4Cam3aaSbaaSqaaiabdsgaKbqabaGccqGGPaqkcqGH9aqpaeaacaWLjaWaaSaaaeaacqaIXaqmaeaacqWGTbqBcqGHRaWkcqWFapaCaaWaaeWaaeaadaaeWbqaaGqabiab+LeajjabcIcaOiabdohaZnaaBaaaleaacqaIXaqmaeqaaOGaeyypa0JaemiEaG3aa0baaSqaaiabigdaXaqaaiabdUgaRbaakiabgEIizlabl+UimjabgEIizlabdohaZnaaBaaaleaacqWGKbazaeqaaOGaeyypa0JaemiEaG3aa0baaSqaaiabdsgaKbqaaiabdUgaRbaakiabcMcaPiabgUcaRiab=b8aWbWcbaGaem4AaSMaeyypa0JaeGymaedabaGaemyBa0ganiabggHiLdaakiaawIcacaGLPaaaaeaacqWF4oqCdaWgaaWcbaGaemyAaKgabeaakiabcIcaOiabdohaZnaaBaaaleaacqWGPbqAaeqaaOGaeiilaWIaeiOla4IaeiOla4IaeiOla4IaeiilaWIaem4Cam3aaSbaaSqaaiabdMgaPjabgkHiTiabdsgaKbqabaGccqGGPaqkcqGH9aqpaeaacaWLjaWaaSaaaeaadaaeWaqaaiab+LeajjabcIcaOiabdohaZnaaBaaaleaacqWGPbqAaeqaaOGaeyypa0JaemiEaG3aa0baaSqaaiabdMgaPbqaaiabdUgaRbaakiabgEIizlabl+UimjabgEIizlabdohaZnaaBaaaleaacqWGPbqAcqGHsislcqWGKbazaeqaaOGaeyypa0JaemiEaG3aa0baaSqaaiabdMgaPjabgkHiTiabdsgaKbqaaiabdUgaRbaakiabcMcaPiabgUcaRiab=b8aWbWcbaGaem4AaSMaeyypa0JaeGymaedabaGaemyBa0ganiabggHiLdaakeaadaaeWaqaaiab+LeajjabcIcaOiabdohaZnaaBaaaleaacqWGPbqAaeqaaOGaeyypa0JaemiEaG3aa0baaSqaaiabdMgaPjabgkHiTiabigdaXaqaaiabdUgaRbaakiabgEIizlabl+UimjabgEIizlabdohaZnaaBaaaleaacqWGPbqAcqGHsislcqWGKbazaeqaaOGaeyypa0JaemiEaG3aa0baaSqaaiabdMgaPjabgkHiTiabdsgaKbqaaiabdUgaRbaakiabcMcaPiabgUcaRiabisda0iab=b8aWbWcbaGaem4AaSMaeyypa0JaeGymaedabaGaemyBa0ganiabggHiLdaaaaaaaaa@C8DB@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>where <b>I</b>(&#183;) is the indicator function, <it>k </it>indexes the <it>m </it>observed sequences, and <it>&#960; </it>is the commonly used pseudocount (a model parameter, cf. <abbrgrp>
							<abbr bid="B44">44</abbr>
						</abbrgrp>), which is also tuned within the model selection procedure (cf. the model selection and evaluation section).</p>
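					<p>The smoothed estimate of the conditional probabilities above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this study: the function name estimate_theta_i, the 0-based position index and the restriction to contexts that actually occur in the data are assumptions made for the example.</p>

```python
from collections import Counter

ALPHABET = "ACGT"


def estimate_theta_i(seqs, i, d, pi):
    """Estimate theta_i(s_i, ..., s_{i-d}) from aligned sequences.

    Positions are 0-based, so i must satisfy i >= d. Returns a dict
    mapping (context, symbol) to the smoothed conditional probability,
    where context is the d symbols preceding position i. Only contexts
    observed in seqs get an entry (an assumption of this sketch).
    """
    if d > i:
        raise ValueError("position i must be at least d")
    joint = Counter()    # counts of (context, symbol) pairs
    context = Counter()  # counts of the context alone
    for x in seqs:
        ctx = x[i - d:i]
        joint[(ctx, x[i])] += 1
        context[ctx] += 1
    theta = {}
    for ctx in context:
        for s in ALPHABET:
            # Pseudocount pi in the numerator, 4 * pi in the denominator.
            theta[(ctx, s)] = (joint[(ctx, s)] + pi) / (context[ctx] + 4 * pi)
    return theta
```

					<p>For the toy sequences ACGT, ACGA, ACTT and TCGT with <it>i </it>= 2 (0-based), <it>d </it>= 1 and <it>&#960; </it>= 1, the context C is followed by G in three of the four sequences, so the sketch returns (3 + 1)/(4 + 4) = 0.5 for G after C.</p>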
				</sec>
				<sec>
					<st>
						<p>SVM and kernels for splice site detection</p>
					</st>
					<p>As the second method we use SVMs. The resulting classification function can be written as</p>
					<p>
						<display-formula>
							<m:math name="1471-2105-8-S10-S7-i16" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:mi>f</m:mi>
										<m:mo stretchy="false">(</m:mo>
										<m:mi>x</m:mi>
										<m:mo stretchy="false">)</m:mo>
										<m:mo>=</m:mo>
										<m:mtext>sign</m:mtext>
										<m:mrow>
											<m:mo>(</m:mo>
											<m:mrow>
												<m:mstyle displaystyle="true">
													<m:munderover>
														<m:mo>&#8721;</m:mo>
														<m:mrow>
															<m:mi>i</m:mi>
															<m:mo>=</m:mo>
															<m:mn>1</m:mn>
														</m:mrow>
														<m:mi>m</m:mi>
													</m:munderover>
													<m:mrow>
														<m:msub>
															<m:mi>y</m:mi>
															<m:mi>i</m:mi>
														</m:msub>
														<m:msub>
															<m:mi>&#945;</m:mi>
															<m:mi>i</m:mi>
														</m:msub>
														<m:mi mathvariant="script">K</m:mi>
														<m:mo stretchy="false">(</m:mo>
														<m:msub>
															<m:mi>x</m:mi>
															<m:mi>i</m:mi>
														</m:msub>
														<m:mo>,</m:mo>
														<m:mi>x</m:mi>
														<m:mo stretchy="false">)</m:mo>
														<m:mo>+</m:mo>
														<m:mi>b</m:mi>
													</m:mrow>
												</m:mstyle>
											</m:mrow>
											<m:mo>)</m:mo>
										</m:mrow>
										<m:mo>,</m:mo>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGMbGzcqGGOaakieWacqWF4baEcqGGPaqkcqGH9aqpcqqGZbWCcqqGPbqAcqqGNbWzcqqGUbGBdaqadaqaamaaqahabaGaemyEaK3aaSbaaSqaaiabdMgaPbqabaacciGccqGFXoqydaWgaaWcbaGaemyAaKgabeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaakiab9Pq8ljabcIcaOiab=Hha4naaBaaaleaacqWGPbqAaeqaaOGaeiilaWIae8hEaGNaeiykaKIaey4kaSIaemOyaigaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGTbqBa0GaeyyeIuoaaOGaayjkaiaawMcaaiabcYcaSaaa@5C25@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>where <it>y</it>
						<sub>
							<it>i </it>
						</sub>&#8712; {-1, +1} (<it>i </it>= 1,...,<it>m</it>) is the label of example <b>
							<it>x</it>
						</b>
						<sub>
							<it>i</it>
						</sub>. The <it>&#945;</it>
						<sub>
							<it>i</it>
						</sub>'s are Lagrange multipliers and <it>b </it>is the usual bias, all of which are obtained by SVM training <abbrgrp>
							<abbr bid="B16">16</abbr>
						</abbrgrp>. The kernel <inline-formula>
							<m:math name="1471-2105-8-S10-S7-i17" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mi mathvariant="script">K</m:mi>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqWFke=saaa@3834@</m:annotation>
								</m:semantics>
							</m:math>
						</inline-formula> is the <it>key ingredient </it>for learning with SVMs.</p>
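					<p>To make the role of the kernel concrete, the classification function can be evaluated as in the following Python sketch. The function names svm_decision and match_kernel, the toy kernel (a plain count of agreeing positions) and the convention of returning +1 for a score of exactly zero are assumptions made for illustration; the multipliers and the bias are taken as given rather than trained.</p>

```python
def svm_decision(x, support, labels, alphas, b, kernel):
    """Return sign(sum_i y_i * alpha_i * K(x_i, x) + b) as +1 or -1."""
    score = b
    for x_i, y_i, a_i in zip(support, labels, alphas):
        score += y_i * a_i * kernel(x_i, x)
    # A score of exactly zero is mapped to +1 (assumption of this sketch).
    return 1 if score >= 0 else -1


def match_kernel(u, v):
    """Toy kernel: number of positions at which two sequences agree."""
    return sum(1 for a, c in zip(u, v) if a == c)
```

					<p>With the two training examples AAGT (label +1) and CCGT (label -1), both multipliers set to 0.5 and <it>b </it>= 0, the sequence AAGT receives the score 0.5&#183;4 - 0.5&#183;2 = 1 and is therefore classified as +1.</p>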
					<p>In the following paragraphs we describe the kernels used in this study, all of which are functions defined on sequences. Throughout, <b>
							<it>x </it>
						</b>= <it>x</it>
						<sub>1</sub>
						<it>x</it>
						<sub>2</sub>...<it>x</it>
						<sub>
							<it>N </it>
						</sub>denotes a sequence of length <it>N</it>.</p>
					<p>
						<b>The locality improved (LI) kernel </b>has proven useful in the context of translation initiation site (TIS) recognition <abbrgrp>
							<abbr bid="B21">21</abbr>
						</abbrgrp>. Similar to the <it>polynomial kernel </it>of degree <it>d </it>for discrete input data, this kernel considers correlations of matches up to order <it>d</it>. In contrast to polynomial kernels, however, the LI kernel only considers local subsequence correlations within a small window of length 2<it>l </it>+ 1 around a sequence position:</p>
					<p>
						<display-formula>
							<m:math name="1471-2105-8-S10-S7-i18" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:msub>
											<m:mrow>
												<m:mtext>win</m:mtext>
											</m:mrow>
											<m:mi>p</m:mi>
										</m:msub>
										<m:mo stretchy="false">(</m:mo>
										<m:mi>x</m:mi>
										<m:mo>,</m:mo>
										<m:msup>
											<m:mi>x</m:mi>
											<m:mo>&#8242;</m:mo>
										</m:msup>
										<m:mo stretchy="false">)</m:mo>
										<m:mo>=</m:mo>
										<m:msup>
											<m:mrow>
												<m:mrow>
													<m:mo>(</m:mo>
													<m:mrow>
														<m:mfrac>
															<m:mn>1</m:mn>
															<m:mrow>
																<m:mn>2</m:mn>
																<m:mi>l</m:mi>
																<m:mo>+</m:mo>
																<m:mn>1</m:mn>
															</m:mrow>
														</m:mfrac>
														<m:mstyle displaystyle="true">
															<m:munderover>
																<m:mo>&#8721;</m:mo>
																<m:mrow>
																	<m:mi>j</m:mi>
																	<m:mo>=</m:mo>
																	<m:mo>&#8722;</m:mo>
																	<m:mi>l</m:mi>
																</m:mrow>
																<m:mrow>
																	<m:mo>+</m:mo>
																	<m:mi>l</m:mi>
																</m:mrow>
															</m:munderover>
															<m:mrow>
																<m:mi>I</m:mi>
																<m:mo stretchy="false">(</m:mo>
																<m:msub>
																	<m:mi>x</m:mi>
																	<m:mrow>
																		<m:mi>p</m:mi>
																		<m:mo>+</m:mo>
																		<m:mi>j</m:mi>
																	</m:mrow>
																</m:msub>
																<m:mo>=</m:mo>
																<m:msub>
																	<m:msup>
																		<m:mi>x</m:mi>
																		<m:mo>&#8242;</m:mo>
																	</m:msup>
																	<m:mrow>
																		<m:mi>p</m:mi>
																		<m:mo>+</m:mo>
																		<m:mi>j</m:mi>
																	</m:mrow>
																</m:msub>
																<m:mo stretchy="false">)</m:mo>
															</m:mrow>
														</m:mstyle>
													</m:mrow>
													<m:mo>)</m:mo>
												</m:mrow>
											</m:mrow>
											<m:mi>d</m:mi>
										</m:msup>
										<m:mo>,</m:mo>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqqG3bWDcqqGPbqAcqqGUbGBdaWgaaWcbaGaemiCaahabeaakiabcIcaOGqadiab=Hha4jabcYcaSiqb=Hha4zaafaGaeiykaKIaeyypa0ZaaeWaaeaadaWcaaqaaiabigdaXaqaaiabikdaYiabdYgaSjabgUcaRiabigdaXaaadaaeWbqaaGqabiab+LeajjabcIcaOiabdIha4naaBaaaleaacqWGWbaCcqGHRaWkcqWGQbGAaeqaaOGaeyypa0JafmiEaGNbauaadaWgaaWcbaGaemiCaaNaey4kaSIaemOAaOgabeaakiabcMcaPaWcbaGaemOAaOMaeyypa0JaeyOeI0IaemiBaWgabaGaey4kaSIaemiBaWganiabggHiLdaakiaawIcacaGLPaaadaahaaWcbeqaaiabdsgaKbaakiabcYcaSaaa@59FE@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>where <it>p </it>= <it>l </it>+ 1,...,<it>N </it>- <it>l</it>. These window scores are then summed over the length of the sequence using a weighting <it>w</it>
						<sub>
							<it>p </it>
						</sub>which decreases linearly towards both ends of the sequence, i.e. <inline-formula>
							<m:math name="1471-2105-8-S10-S7-i19" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:msub>
											<m:mi>w</m:mi>
											<m:mi>p</m:mi>
										</m:msub>
										<m:mo>=</m:mo>
										<m:mrow>
											<m:mo>{</m:mo>
											<m:mrow>
												<m:mtable>
													<m:mtr>
														<m:mtd>
															<m:mrow>
																<m:mi>p</m:mi>
																<m:mo>&#8722;</m:mo>
																<m:mi>l</m:mi>
															</m:mrow>
														</m:mtd>
														<m:mtd>
															<m:mrow>
																<m:mi>p</m:mi>
																<m:mo>&#8804;</m:mo>
																<m:mi>N</m:mi>
																<m:mo>/</m:mo>
																<m:mn>2</m:mn>
															</m:mrow>
														</m:mtd>
													</m:mtr>
													<m:mtr>
														<m:mtd>
															<m:mrow>
																<m:mi>N</m:mi>
																<m:mo>&#8722;</m:mo>
																<m:mi>p</m:mi>
																<m:mo>&#8722;</m:mo>
																<m:mi>l</m:mi>
																<m:mo>+</m:mo>
																<m:mn>1</m:mn>
															</m:mrow>
														</m:mtd>
														<m:mtd>
															<m:mrow>
																<m:mi>p</m:mi>
																<m:mo>></m:mo>
																<m:mi>N</m:mi>
																<m:mo>/</m:mo>
																<m:mn>2</m:mn>
															</m:mrow>
														</m:mtd>
													</m:mtr>
												</m:mtable>
											</m:mrow>
										</m:mrow>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWG3bWDdaWgaaWcbaGaemiCaahabeaakiabg2da9maaceqabaqbaeqabiGaaaqaaiabdchaWjabgkHiTiabdYgaSbqaaiabdchaWjabgsMiJkabd6eaojabc+caViabikdaYaqaaiabd6eaojabgkHiTiabdchaWjabgkHiTiabdYgaSjabgUcaRiabigdaXaqaaiabdchaWjabg6da+iabd6eaojabc+caViabikdaYaaaaiaawUhaaaaa@48CE@</m:annotation>
								</m:semantics>
							</m:math>
						</inline-formula>. Then we have the following kernel:</p>
					<p>
						<display-formula>
							<m:math name="1471-2105-8-S10-S7-i20" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:mi mathvariant="script">K</m:mi>
										<m:mo stretchy="false">(</m:mo>
										<m:mi>x</m:mi>
										<m:mo>,</m:mo>
										<m:msup>
											<m:mi>x</m:mi>
											<m:mo>&#8242;</m:mo>
										</m:msup>
										<m:mo stretchy="false">)</m:mo>
										<m:mo>=</m:mo>
										<m:mstyle displaystyle="true">
											<m:munderover>
												<m:mo>&#8721;</m:mo>
												<m:mrow>
													<m:mi>p</m:mi>
													<m:mo>=</m:mo>
													<m:mi>l</m:mi>
													<m:mo>+</m:mo>
													<m:mn>1</m:mn>
												</m:mrow>
												<m:mrow>
													<m:mi>N</m:mi>
													<m:mo>&#8722;</m:mo>
													<m:mi>l</m:mi>
												</m:mrow>
											</m:munderover>
											<m:mrow>
												<m:msub>
													<m:mi>w</m:mi>
													<m:mi>p</m:mi>
												</m:msub>
												<m:msub>
													<m:mrow>
														<m:mtext>win</m:mtext>
													</m:mrow>
													<m:mi>p</m:mi>
												</m:msub>
												<m:mo stretchy="false">(</m:mo>
												<m:mi>x</m:mi>
												<m:mo>,</m:mo>
												<m:msup>
													<m:mi>x</m:mi>
													<m:mo>&#8242;</m:mo>
												</m:msup>
												<m:mo stretchy="false">)</m:mo>
											</m:mrow>
										</m:mstyle>
										<m:mo>.</m:mo>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqWFke=scqGGOaakieWacqGF4baEcqGGSaalcuGF4baEgaqbaiabcMcaPiabg2da9maaqahabaGaem4DaC3aaSbaaSqaaiabdchaWbqabaGccqqG3bWDcqqGPbqAcqqGUbGBdaWgaaWcbaGaemiCaahabeaakiabcIcaOiab+Hha4jabcYcaSiqb+Hha4zaafaGaeiykaKcaleaacqWGWbaCcqGH9aqpcqWGSbaBcqGHRaWkcqaIXaqmaeaacqWGobGtcqGHsislcqWGSbaBa0GaeyyeIuoakiabc6caUaaa@597D@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>The weighting allows one to emphasize regions of the sequence which are believed to be of higher importance; in our case this is the center, which is the location of the splice site. (Note that the definition of the LI kernel in <abbrgrp>
							<abbr bid="B21">21</abbr>
						</abbrgrp> is slightly different from ours. Previously the weighting was inside the window and was not very effective. Moreover, the version of the kernel presented here can be computed 2<it>l </it>+ 1 times faster than the original one.)</p>
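					<p>The window score, the position weighting and the final sum can be combined as in the following Python sketch. It is a direct, unoptimised transcription of the three formulas above; the function name li_kernel and the argument order are assumptions, and it does not use the faster computation mentioned in the text.</p>

```python
def li_kernel(x, y, l, d):
    """Locality improved kernel with window half-width l and degree d.

    Positions p are 1-based as in the text; p - 1 is the 0-based index.
    """
    n = len(x)
    if len(y) != n:
        raise ValueError("sequences must have equal length")
    total = 0.0
    for p in range(l + 1, n - l + 1):
        # Fraction of matching positions in the window centred at p.
        matches = sum(
            1 for j in range(-l, l + 1) if x[p - 1 + j] == y[p - 1 + j]
        )
        win = (matches / (2 * l + 1)) ** d
        # Weight w_p: p - l in the first half, N - p - l + 1 in the second.
        w = n - p - l + 1 if 2 * p > n else p - l
        total += w * win
    return total
```

					<p>For example, with <it>l </it>= 1 and <it>d </it>= 2, comparing ACGTA with itself gives weights 1, 2, 1 and window scores of 1, hence a kernel value of 4. Comparing ACGTA with ACCTA gives a window score of (2/3)<sup>2 </sup>at each of the three positions and a kernel value of 16/9.</p>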
					<p>
						<b>The weighted degree (WD) kernel </b>
						<abbrgrp>
							<abbr bid="B28">28</abbr>
						</abbrgrp> uses a similar approach by counting matching subsequences <b>
							<it>u</it>
						</b>
						<sub>
							<it>&#948;</it>,<it>l</it>
						</sub>(<b>
							<it>x</it>
						</b>) and <b>
							<it>u</it>
						</b>
						<sub>
							<it>&#948;</it>,<it>l</it>
						</sub>(<b>
							<it>x'</it>
						</b>) between two sequences <b>
							<it>x </it>
						</b>and <b>
							<it>x'</it>
						</b>, with <b>
							<it>u</it>
						</b>
						<sub>
							<it>&#948;</it>,<it>l</it>
						</sub>(<b>
							<it>x</it>
						</b>) = <it>x</it>
						<sub>
							<it>l</it>
						</sub>
						<it>x</it>
						<sub>
							<it>l</it>+1</sub>...<it>x</it>
						<sub>
							<it>l</it>+<it>&#948;</it>-1 </sub>for all <it>l </it>and 1 &#8804; <it>&#948; </it>&#8804; <it>d</it>. Here, <it>&#948; </it>denotes the order (length of the subsequence) to be compared. The WD kernel is defined as</p>
					<p>
						<display-formula>
							<m:math name="1471-2105-8-S10-S7-i21" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:mi mathvariant="script">K</m:mi>
										<m:mo stretchy="false">(</m:mo>
										<m:mi>x</m:mi>
										<m:mo>,</m:mo>
										<m:msup>
											<m:mi>x</m:mi>
											<m:mo>&#8242;</m:mo>
										</m:msup>
										<m:mo stretchy="false">)</m:mo>
										<m:mo>=</m:mo>
										<m:mstyle displaystyle="true">
											<m:munderover>
												<m:mo>&#8721;</m:mo>
												<m:mrow>
													<m:mi>&#948;</m:mi>
													<m:mo>=</m:mo>
													<m:mn>1</m:mn>
												</m:mrow>
												<m:mi>d</m:mi>
											</m:munderover>
											<m:mrow>
												<m:msub>
													<m:mi>w</m:mi>
													<m:mi>&#948;</m:mi>
												</m:msub>
											</m:mrow>
										</m:mstyle>
										<m:mstyle displaystyle="true">
											<m:munderover>
												<m:mo>&#8721;</m:mo>
												<m:mrow>
													<m:mi>l</m:mi>
													<m:mo>=</m:mo>
													<m:mn>1</m:mn>
												</m:mrow>
												<m:mrow>
													<m:mi>N</m:mi>
													<m:mo>&#8722;</m:mo>
													<m:mi>&#948;</m:mi>
													<m:mo>+</m:mo>
													<m:mn>1</m:mn>
												</m:mrow>
											</m:munderover>
											<m:mrow>
												<m:mi>I</m:mi>
												<m:mo stretchy="false">(</m:mo>
												<m:msub>
													<m:mi>u</m:mi>
													<m:mrow>
														<m:mi>&#948;</m:mi>
														<m:mo>,</m:mo>
														<m:mi>l</m:mi>
													</m:mrow>
												</m:msub>
												<m:mo stretchy="false">(</m:mo>
												<m:mi>x</m:mi>
												<m:mo stretchy="false">)</m:mo>
												<m:mo>=</m:mo>
												<m:msub>
													<m:mi>u</m:mi>
													<m:mrow>
														<m:mi>&#948;</m:mi>
														<m:mo>,</m:mo>
														<m:mi>l</m:mi>
													</m:mrow>
												</m:msub>
												<m:mo stretchy="false">(</m:mo>
												<m:msup>
													<m:mi>x</m:mi>
													<m:mo>&#8242;</m:mo>
												</m:msup>
												<m:mo stretchy="false">)</m:mo>
												<m:mo stretchy="false">)</m:mo>
											</m:mrow>
										</m:mstyle>
										<m:mo>,</m:mo>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqWFke=scqGGOaakieWacqGF4baEcqGGSaalcuGF4baEgaqbaiabcMcaPiabg2da9maaqahabaGaem4DaC3aaSbaaSqaaGGaciab9r7aKbqabaaabaGae0hTdqMaeyypa0JaeGymaedabaGaemizaqganiabggHiLdGcdaaeWbqaaGqabiab8LeajjabcIcaOiab+vha1naaBaaaleaacqqF0oazcqGGSaalcqWGSbaBaeqaaOGaeiikaGIae4hEaGNaeiykaKIaeyypa0Jae4xDau3aaSbaaSqaaiab9r7aKjabcYcaSiabdYgaSbqabaGccqGGOaakcuGF4baEgaqbaiabcMcaPiabcMcaPaWcbaGaemiBaWMaeyypa0JaeGymaedabaGaemOta4KaeyOeI0Iae0hTdqMaey4kaSIaeGymaedaniabggHiLdGccqGGSaalaaa@6A78@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>where we choose the weighting to be <it>w</it>
						<sub>
							<it>&#948; </it>
						</sub>= <it>d </it>- <it>&#948; </it>+ 1. This kernel emphasizes position-dependent information, and the weighting decreases the influence of higher-order matches, which would otherwise contribute more because all of their subsequences also match. It can be computed very efficiently without extracting and enumerating all subsequences of the sequences <abbrgrp>
							<abbr bid="B40">40</abbr>
						</abbrgrp>. Note that this kernel is similar to the spectrum kernel as proposed by <abbrgrp>
							<abbr bid="B49">49</abbr>
						</abbrgrp>, with the main difference that the weighted degree kernel uses position-specific information.</p>
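					<p>The definition of the WD kernel translates directly into the following Python sketch. It is a naive implementation that compares all subsequences explicitly, not the efficient computation referred to above; the function name wd_kernel and the 0-based start positions are assumptions made for illustration.</p>

```python
def wd_kernel(x, y, d):
    """Weighted degree kernel with weights w_delta = d - delta + 1."""
    n = len(x)
    if len(y) != n:
        raise ValueError("sequences must have equal length")
    total = 0
    for delta in range(1, d + 1):
        w = d - delta + 1
        # Count start positions at which the two delta-mers agree.
        matches = sum(
            1 for l in range(n - delta + 1) if x[l:l + delta] == y[l:l + delta]
        )
        total += w * matches
    return total
```

					<p>For <it>d </it>= 2, comparing ACGT with itself yields 2&#183;4 + 1&#183;3 = 11, whereas comparing ACGT with ACCT yields 2&#183;3 + 1&#183;1 = 7, because the single mismatch removes one order-1 match and two order-2 matches.</p>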
					<p>
						<b>The weighted degree kernel with shifts (WDS) </b>
						<abbrgrp>
							<abbr bid="B34">34</abbr>
						</abbrgrp> is defined as</p>
					<p>
						<display-formula>
							<m:math name="1471-2105-8-S10-S7-i22" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:mtable>
											<m:mtr>
												<m:mtd>
													<m:mrow>
														<m:mi mathvariant="script">K</m:mi>
														<m:mo stretchy="false">(</m:mo>
														<m:msub>
															<m:mi>x</m:mi>
															<m:mi>i</m:mi>
														</m:msub>
														<m:mo>,</m:mo>
														<m:msub>
															<m:mi>x</m:mi>
															<m:mi>j</m:mi>
														</m:msub>
														<m:mo stretchy="false">)</m:mo>
														<m:mo>=</m:mo>
														<m:mstyle displaystyle="true">
															<m:munderover>
																<m:mo>&#8721;</m:mo>
																<m:mrow>
																	<m:mi>&#948;</m:mi>
																	<m:mo>=</m:mo>
																	<m:mn>1</m:mn>
																</m:mrow>
																<m:mi>d</m:mi>
															</m:munderover>
															<m:mrow>
																<m:msub>
																	<m:mi>w</m:mi>
																	<m:mi>&#948;</m:mi>
																</m:msub>
															</m:mrow>
														</m:mstyle>
														<m:mstyle displaystyle="true">
															<m:munderover>
																<m:mo>&#8721;</m:mo>
																<m:mrow>
																	<m:mi>l</m:mi>
																	<m:mo>=</m:mo>
																	<m:mn>1</m:mn>
																</m:mrow>
																<m:mrow>
																	<m:mi>N</m:mi>
																	<m:mo>&#8722;</m:mo>
																	<m:mi>&#948;</m:mi>
																	<m:mo>+</m:mo>
																	<m:mn>1</m:mn>
																</m:mrow>
															</m:munderover>
															<m:mrow>
																<m:mstyle displaystyle="true">
																	<m:munderover>
																		<m:mo>&#8721;</m:mo>
																		<m:mrow>
																			<m:mtable>
																				<m:mtr>
																					<m:mtd>
																						<m:mrow>
																							<m:mi>s</m:mi>
																							<m:mo>=</m:mo>
																							<m:mn>0</m:mn>
																						</m:mrow>
																					</m:mtd>
																				</m:mtr>
																				<m:mtr>
																					<m:mtd>
																						<m:mrow>
																							<m:mi>s</m:mi>
																							<m:mo>+</m:mo>
																							<m:mi>l</m:mi>
																							<m:mo>&#8804;</m:mo>
																							<m:mi>N</m:mi>
																						</m:mrow>
																					</m:mtd>
																				</m:mtr>
																			</m:mtable>
																		</m:mrow>
																		<m:mrow>
																			<m:mi>S</m:mi>
																			<m:mo stretchy="false">(</m:mo>
																			<m:mi>l</m:mi>
																			<m:mo stretchy="false">)</m:mo>
																		</m:mrow>
																	</m:munderover>
																	<m:mrow>
																		<m:msub>
																			<m:mi>&#948;</m:mi>
																			<m:mi>s</m:mi>
																		</m:msub>
																		<m:msub>
																			<m:mi>&#956;</m:mi>
																			<m:mrow>
																				<m:mi>&#948;</m:mi>
																				<m:mo>,</m:mo>
																				<m:mi>l</m:mi>
																				<m:mo>,</m:mo>
																				<m:mi>s</m:mi>
																				<m:mo>,</m:mo>
																				<m:msub>
																					<m:mi>x</m:mi>
																					<m:mi>i</m:mi>
																				</m:msub>
																				<m:mo>,</m:mo>
																				<m:msub>
																					<m:mi>x</m:mi>
																					<m:mi>j</m:mi>
																				</m:msub>
																			</m:mrow>
																		</m:msub>
																	</m:mrow>
																</m:mstyle>
																<m:mo>,</m:mo>
															</m:mrow>
														</m:mstyle>
													</m:mrow>
												</m:mtd>
											</m:mtr>
											<m:mtr>
												<m:mtd>
													<m:mrow>
														<m:msub>
															<m:mi>&#956;</m:mi>
															<m:mrow>
																<m:mi>&#948;</m:mi>
																<m:mo>,</m:mo>
																<m:mi>l</m:mi>
																<m:mo>,</m:mo>
																<m:mi>s</m:mi>
																<m:mo>,</m:mo>
																<m:msub>
																	<m:mi>x</m:mi>
																	<m:mi>i</m:mi>
																</m:msub>
																<m:mo>,</m:mo>
																<m:msub>
																	<m:mi>x</m:mi>
																	<m:mi>j</m:mi>
																</m:msub>
															</m:mrow>
														</m:msub>
														<m:mo>=</m:mo>
														<m:mi>I</m:mi>
														<m:mo stretchy="false">(</m:mo>
														<m:msub>
															<m:mi>u</m:mi>
															<m:mrow>
																<m:mi>&#948;</m:mi>
																<m:mo>,</m:mo>
																<m:mi>l</m:mi>
																<m:mo>+</m:mo>
																<m:mi>s</m:mi>
															</m:mrow>
														</m:msub>
														<m:mo stretchy="false">(</m:mo>
														<m:msub>
															<m:mi>x</m:mi>
															<m:mi>i</m:mi>
														</m:msub>
														<m:mo stretchy="false">)</m:mo>
														<m:mo>=</m:mo>
														<m:msub>
															<m:mi>u</m:mi>
															<m:mrow>
																<m:mi>&#948;</m:mi>
																<m:mo>,</m:mo>
																<m:mi>l</m:mi>
															</m:mrow>
														</m:msub>
														<m:mo stretchy="false">(</m:mo>
														<m:msub>
															<m:mi>x</m:mi>
															<m:mi>j</m:mi>
														</m:msub>
														<m:mo stretchy="false">)</m:mo>
														<m:mo stretchy="false">)</m:mo>
														<m:mo>+</m:mo>
														<m:mi>I</m:mi>
														<m:mo stretchy="false">(</m:mo>
														<m:msub>
															<m:mi>u</m:mi>
															<m:mrow>
																<m:mi>&#948;</m:mi>
																<m:mo>,</m:mo>
																<m:mi>l</m:mi>
															</m:mrow>
														</m:msub>
														<m:mo stretchy="false">(</m:mo>
														<m:msub>
															<m:mi>x</m:mi>
															<m:mi>i</m:mi>
														</m:msub>
														<m:mo stretchy="false">)</m:mo>
														<m:mo>=</m:mo>
														<m:msub>
															<m:mi>u</m:mi>
															<m:mrow>
																<m:mi>&#948;</m:mi>
																<m:mo>,</m:mo>
																<m:mi>l</m:mi>
																<m:mo>+</m:mo>
																<m:mi>s</m:mi>
															</m:mrow>
														</m:msub>
														<m:mo stretchy="false">(</m:mo>
														<m:msub>
															<m:mi>x</m:mi>
															<m:mi>j</m:mi>
														</m:msub>
														<m:mo stretchy="false">)</m:mo>
														<m:mo stretchy="false">)</m:mo>
														<m:mo>,</m:mo>
													</m:mrow>
												</m:mtd>
											</m:mtr>
										</m:mtable>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqadeGabaaabaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8NcXVKaeiikaGccbmGae4hEaG3aaSbaaSqaaiabdMgaPbqabaGccqGGSaalcqGF4baEdaWgaaWcbaGaemOAaOgabeaakiabcMcaPiabg2da9maaqahabaGaem4DaC3aaSbaaSqaaGGaciab9r7aKbqabaaabaGae0hTdqMaeyypa0JaeGymaedabaGaemizaqganiabggHiLdGcdaaeWbqaamaaqahabaGae0hTdq2aaSbaaSqaaiabdohaZbqabaGccqqF8oqBdaWgaaWcbaGae0hTdqMaeiilaWIaemiBaWMaeiilaWIaem4CamNaeiilaWIae4hEaG3aaSbaaWqaaiabdMgaPbqabaWccqGGSaalcqGF4baEdaWgaaadbaGaemOAaOgabeaaaSqabaaabaqbaeqabiqaaaqaaiabdohaZjabg2da9iabicdaWaqaaiabdohaZjabgUcaRiabdYgaSjabgsMiJkabd6eaobaaaeaacqWGtbWucqGGOaakcqWGSbaBcqGGPaqka0GaeyyeIuoakiabcYcaSaWcbaGaemiBaWMaeyypa0JaeGymaedabaGaemOta4KaeyOeI0Iae0hTdqMaey4kaSIaeGymaedaniabggHiLdaakeaacqqF8oqBdaWgaaWcbaGae0hTdqMaeiilaWIaemiBaWMaeiilaWIaem4CamNaeiilaWIae4hEaG3aaSbaaWqaaiabdMgaPbqabaWccqGGSaalcqGF4baEdaWgaaadbaGaemOAaOgabeaaaSqabaGccqGH9aqpieqacqaFjbqscqGGOaakcqGF1bqDdaWgaaWcbaGae0hTdqMaeiilaWIaemiBaWMaey4kaSIaem4CamhabeaakiabcIcaOiab+Hha4naaBaaaleaacqWGPbqAaeqaaOGaeiykaKIaeyypa0Jae4xDau3aaSbaaSqaaiab9r7aKjabcYcaSiabdYgaSbqabaGccqGGOaakcqGF4baEdaWgaaWcbaGaemOAaOgabeaakiabcMcaPiabcMcaPiabgUcaRiab8LeajjabcIcaOiab+vha1naaBaaaleaacqaH0oazcqGGSaalcqWGSbaBaeqaaOGaeiikaGIae4hEaG3aaSbaaSqaaiabdMgaPbqabaGccqGGPaqkcqGH9aqpcqGF1bqDdaWgaaWcbaGae0hTdqMaeiilaWIaemiBaWMaey4kaSIaem4CamhabeaakiabcIcaOiab+Hha4naaBaaaleaacqWGQbGAaeqaaOGaeiykaKIaeiykaKIaeiilaWcaaaaa@C3E8@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>where <b>
							<it>w</it>
						</b>
						<sub>
							<it>&#948; </it>
						</sub>is as before, <it>&#948;</it>
						<sub>
							<it>s </it>
						</sub>= 1/(2(<it>s </it>+ 1)) is the weight assigned to shifts (in either direction) of extent <it>s</it>, and <it>S</it>(<it>l</it>) determines the shift range at position <it>l</it>. Here, we choose <it>S</it>(<it>l</it>) = <it>&#963;</it>|<it>l </it>- <it>l</it>
						<sub>
							<it>c</it>
						</sub>|, where <it>l</it>
						<sub>
							<it>c </it>
						</sub>is the position of the splice site. An efficient implementation of this kernel that allows large-scale computations is described in <abbrgrp>
							<abbr bid="B40">40</abbr>
						</abbrgrp>.</p>
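					<p>The WDS kernel defined above can be illustrated by a direct, unoptimized Python sketch. It is not the efficient implementation referred to in the text. The function names are hypothetical, positions are 0-based, a shifted <it>k</it>-mer is assumed to have to fit entirely inside the sequence, and the default weighting <it>w</it><sub><it>&#948;</it></sub> = 2(<it>d </it>- <it>&#948; </it>+ 1)/(<it>d</it>(<it>d </it>+ 1)) is an assumption that should be replaced by the weighting defined earlier in the article.</p>
					<p>
```python
def default_weights(d):
    """Assumed weighting w_delta = 2(d - delta + 1) / (d(d + 1)); substitute the article's w_delta."""
    return [2.0 * (d - delta + 1) / (d * (d + 1)) for delta in range(1, d + 1)]


def wds_kernel(x, y, d, sigma, l_c, weights=None):
    """Weighted degree kernel with shifts for two equal-length strings (0-based positions)."""
    n = len(x)
    if len(y) != n:
        raise ValueError("sequences must have equal length")
    if weights is None:
        weights = default_weights(d)
    total = 0.0
    for delta in range(1, d + 1):              # k-mer length
        for l in range(n - delta + 1):         # start position of the unshifted k-mer
            max_shift = int(sigma * abs(l - l_c))  # S(l) = sigma * |l - l_c|
            u_x = x[l:l + delta]
            u_y = y[l:l + delta]
            for s in range(max_shift + 1):
                if l + s + delta > n:          # assumption: shifted k-mer must fit in the sequence
                    break
                delta_s = 1.0 / (2.0 * (s + 1))    # weight of a shift of extent s
                # mu counts matches with the shift applied to either sequence
                mu = int(x[l + s:l + s + delta] == u_y) + int(u_x == y[l + s:l + s + delta])
                total += weights[delta - 1] * delta_s * mu
    return total
```
					</p>
					<p>With <it>&#963; </it>= 0 only the term <it>s </it>= 0 remains and the sketch reduces to the WD kernel without shifts, since the factor 1/2 cancels the two identical indicator terms.</p>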
					<p>For both the WD and WDS kernels we use the following normalization:</p>
					<p>
						<display-formula>
							<m:math name="1471-2105-8-S10-S7-i23" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:mover accent="true">
											<m:mi mathvariant="script">K</m:mi>
											<m:mo>&#732;</m:mo>
										</m:mover>
										<m:mo stretchy="false">(</m:mo>
										<m:mi>x</m:mi>
										<m:mo>,</m:mo>
										<m:msup>
											<m:mi>x</m:mi>
											<m:mo>&#8242;</m:mo>
										</m:msup>
										<m:mo stretchy="false">)</m:mo>
										<m:mo>=</m:mo>
										<m:mfrac>
											<m:mrow>
												<m:mi mathvariant="script">K</m:mi>
												<m:mo stretchy="false">(</m:mo>
												<m:mi>x</m:mi>
												<m:mo>,</m:mo>
												<m:msup>
													<m:mi>x</m:mi>
													<m:mo>&#8242;</m:mo>
												</m:msup>
												<m:mo stretchy="false">)</m:mo>
											</m:mrow>
											<m:mrow>
												<m:msqrt>
													<m:mrow>
														<m:mi mathvariant="script">K</m:mi>
														<m:mo stretchy="false">(</m:mo>
														<m:mi>x</m:mi>
														<m:mo>,</m:mo>
														<m:mi>x</m:mi>
														<m:mo stretchy="false">)</m:mo>
														<m:mi mathvariant="script">K</m:mi>
														<m:mo stretchy="false">(</m:mo>
														<m:msup>
															<m:mi>x</m:mi>
															<m:mo>&#8242;</m:mo>
														</m:msup>
														<m:mo>,</m:mo>
														<m:msup>
															<m:mi>x</m:mi>
															<m:mo>&#8242;</m:mo>
														</m:msup>
														<m:mo stretchy="false">)</m:mo>
													</m:mrow>
												</m:msqrt>
											</m:mrow>
										</m:mfrac>
										<m:mo>.</m:mo>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacuWFke=sgaacaiabcIcaOGqadiab+Hha4jabcYcaSiqb+Hha4zaafaGaeiykaKIaeyypa0ZaaSaaaeaacqWFke=scqGGOaakcqGF4baEcqGGSaalcuGF4baEgaqbaiabcMcaPaqaamaakaaabaGae8NcXVKaeiikaGIae4hEaGNaeiilaWIae4hEaGNaeiykaKIae8NcXVKaeiikaGIaf4hEaGNbauaacqGGSaalcuGF4baEgaqbaiabcMcaPaWcbeaaaaGccqGGUaGlaaa@55E7@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
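					<p>As an illustration, the normalization can be wrapped around any kernel function. The following Python sketch uses hypothetical names, and its toy kernel (the number of agreeing positions) merely stands in for the WD or WDS kernel.</p>
					<p>
```python
import math


def normalized(kernel):
    """Return K~(x, y) = K(x, y) / sqrt(K(x, x) * K(y, y)) for a given kernel function."""
    def k_tilde(x, y):
        return kernel(x, y) / math.sqrt(kernel(x, x) * kernel(y, y))
    return k_tilde


def match_kernel(x, y):
    """Toy stand-in kernel: number of positions at which two equal-length strings agree."""
    return float(sum(1 for a, b in zip(x, y) if a == b))


k_tilde = normalized(match_kernel)
print(k_tilde("ACGT", "ACGA"))  # 3 / sqrt(4 * 4) = 0.75
```
					</p>
					<p>After normalization every sequence has unit self-similarity, so sequences are compared on a common scale. The sketch assumes that the self-similarity of each sequence is strictly positive.</p>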
					<p>Training and evaluation of the SVMs and the MCs were performed using our <it>SHOGUN </it>machine learning toolbox (cf. <url>http://www.shogun-toolbox.org</url>) <abbrgrp>
							<abbr bid="B38">38</abbr>
						</abbrgrp>, which provides efficient implementations of the aforementioned kernels.</p>
				</sec>
				<sec>
					<st>
						<p>Interpreting the SVM classifier</p>
					</st>
					<p>Kernel methods aim directly at the classification task, which is to <it>discriminate </it>between the true and decoy classes by learning a decision function that separates the classes in an associated feature space. In contrast, <it>generative methods </it>such as position weight matrices or Markov models are statistical models that represent the data under specific assumptions on its statistical structure; hence it is relatively straightforward to interpret their results. Although kernel methods outperform generative models in many cases, especially when the true statistical structure is more intricate than the assumed one, one of the main criticisms of kernel methods is the difficulty of directly interpreting their decision function in a way that yields biologically relevant insight. However, by taking advantage of our specific kernels and their sparse representation, we can efficiently use the decision function of our SVMs to understand which <it>k</it>-mers at which positions contribute most to discriminating between true and decoy splice sites.</p>
					<p>To see how this is possible, recall that, for SVMs, the resulting classifier can be written as a dot product between an <b>
							<it>&#945;</it>
						</b>-weighted linear combination of support vectors mapped into the feature space (which is often only implicitly defined via the kernel function) and the mapped input &#934;(<b><it>x</it></b>) <abbrgrp>
							<abbr bid="B18">18</abbr>
						</abbrgrp>:</p>
					<p>
						<display-formula>
							<m:math name="1471-2105-8-S10-S7-i24" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:mi>f</m:mi>
										<m:mo stretchy="false">(</m:mo>
										<m:mi>x</m:mi>
										<m:mo stretchy="false">)</m:mo>
										<m:mo>=</m:mo>
										<m:munder>
											<m:munder>
												<m:mrow>
													<m:mstyle displaystyle="true">
														<m:munderover>
															<m:mo>&#8721;</m:mo>
															<m:mrow>
																<m:mi>i</m:mi>
																<m:mo>=</m:mo>
																<m:mn>1</m:mn>
															</m:mrow>
															<m:mi>m</m:mi>
														</m:munderover>
														<m:mrow>
															<m:msub>
																<m:mi>&#945;</m:mi>
																<m:mi>i</m:mi>
															</m:msub>
															<m:msub>
																<m:mi>y</m:mi>
																<m:mi>i</m:mi>
															</m:msub>
															<m:mi>&#934;</m:mi>
															<m:mo stretchy="false">(</m:mo>
															<m:msub>
																<m:mi>x</m:mi>
																<m:mi>i</m:mi>
															</m:msub>
															<m:mo stretchy="false">)</m:mo>
														</m:mrow>
													</m:mstyle>
												</m:mrow>
												<m:mo stretchy="true">&#65080;</m:mo>
											</m:munder>
											<m:mi>w</m:mi>
										</m:munder>
										<m:mo>&#8901;</m:mo>
										<m:mi>&#934;</m:mi>
										<m:mo stretchy="false">(</m:mo>
										<m:mi>x</m:mi>
										<m:mo stretchy="false">)</m:mo>
										<m:mo>=</m:mo>
										<m:mstyle displaystyle="true">
											<m:munderover>
												<m:mo>&#8721;</m:mo>
												<m:mrow>
													<m:mi>i</m:mi>
													<m:mo>=</m:mo>
													<m:mn>1</m:mn>
												</m:mrow>
												<m:mi>m</m:mi>
											</m:munderover>
											<m:mrow>
												<m:msub>
													<m:mi>&#945;</m:mi>
													<m:mi>i</m:mi>
												</m:msub>
												<m:msub>
													<m:mi>y</m:mi>
													<m:mi>i</m:mi>
												</m:msub>
												<m:mi mathvariant="script">K</m:mi>
												<m:mo stretchy="false">(</m:mo>
												<m:msub>
													<m:mi>x</m:mi>
													<m:mi>i</m:mi>
												</m:msub>
												<m:mo>,</m:mo>
												<m:mi>x</m:mi>
												<m:mo stretchy="false">)</m:mo>
											</m:mrow>
										</m:mstyle>
										<m:mo>.</m:mo>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGMbGzcqGGOaakieWacqWF4baEcqGGPaqkcqGH9aqpdaagaaqaamaaqahabaacciGae4xSde2aaSbaaSqaaiabdMgaPbqabaGccqWG5bqEdaWgaaWcbaGaemyAaKgabeaakiabfA6agjabcIcaOiab=Hha4naaBaaaleaacqWGPbqAaeqaaOGaeiykaKcaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGTbqBa0GaeyyeIuoaaSqaaiab=Dha3bGccaGL44pacqGHflY1cqqHMoGrcqGGOaakcqWF4baEcqGGPaqkcqGH9aqpdaaeWbqaaiab+f7aHnaaBaaaleaacqWGPbqAaeqaaOGaemyEaK3aaSbaaSqaaiabdMgaPbqabaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGccqqFke=scqGGOaakcqWF4baEdaWgaaWcbaGaemyAaKgabeaakiabcYcaSiab=Hha4jabcMcaPaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaemyBa0ganiabggHiLdGccqGGUaGlaaa@71B5@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
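					<p>The equivalence of the two forms of the decision function can be checked numerically when the feature map is explicit. The following Python sketch does so for a spectrum kernel, whose feature map counts <it>k</it>-mers. All function names and the toy values of the coefficients are hypothetical, and no bias term is included, in line with the formula above.</p>
					<p>
```python
from collections import Counter


def phi(x, k):
    """Explicit spectrum feature map: counts of all k-mers occurring in x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))


def spectrum_kernel(x, y, k):
    """K(x, y) as the dot product of the two k-mer count vectors."""
    px, py = phi(x, k), phi(y, k)
    return float(sum(c * py[u] for u, c in px.items()))


def sparse_w(support, alphas, labels, k):
    """w = sum_i alpha_i * y_i * Phi(x_i), stored sparsely as a dict keyed by k-mer."""
    w = {}
    for x_i, a_i, y_i in zip(support, alphas, labels):
        for u, c in phi(x_i, k).items():
            w[u] = w.get(u, 0.0) + a_i * y_i * c
    return w


def f_primal(w, x, k):
    """f(x) = w . Phi(x), touching only the k-mers present in x."""
    return sum(w.get(u, 0.0) * c for u, c in phi(x, k).items())


def f_dual(support, alphas, labels, x, k):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x)."""
    return sum(a * y * spectrum_kernel(s, x, k)
               for s, a, y in zip(support, alphas, labels))
```
					</p>
					<p>Once the sparse weight vector has been built, evaluating the primal form costs time proportional to the length of the test sequence, independent of the number of support vectors.</p>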
					<p>In the case of sparse feature spaces, as with string kernels, one can represent <b>
							<it>w </it>
						</b>in a sparse form and then efficiently compute dot products between <b>
							<it>w </it>
						</b>and &#934;(<b>
							<it>x</it>
						</b>) in order to speed up SVM training or testing <abbrgrp>
							<abbr bid="B40">40</abbr>
						</abbrgrp>. This sparse representation has the additional benefit of providing a means to interpret the SVM classifier. For <it>k</it>-mer based string kernels such as the spectrum kernel, each dimension <it>w</it>
						<sub>
							<b>u </b>
						</sub>in <b>
							<it>w </it>
						</b>represents the weight assigned to the <it>k</it>-mer <b>u</b>. From the learned weighting one can thus easily identify the <it>k</it>-mers with the highest absolute weight, or those whose absolute weight exceeds a given threshold <it>&#964;</it>: {<b>
							<it>u </it>
						</b>| |<it>w</it>
						<sub>
							<it>u</it>
						</sub>| ><it>&#964;</it>}. Note that the total number of <it>k</it>-mers appearing in the support vectors is bounded by <it>dN</it>
						<sub>
							<it>s</it>
						</sub>
						<it>L</it>, where <it>L </it>is the maximum length of the sequences, <inline-formula>
							<m:math name="1471-2105-8-S10-S7-i25" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:mi>L</m:mi>
										<m:mo>=</m:mo>
										<m:msub>
											<m:mrow>
												<m:mi>max</m:mi>
												<m:mo>&#8289;</m:mo>
											</m:mrow>
											<m:mrow>
												<m:mi>i</m:mi>
												<m:mo>=</m:mo>
												<m:mn>1</m:mn>
												<m:mo>,</m:mo>
												<m:mn>...</m:mn>
												<m:mo>,</m:mo>
												<m:mi>m</m:mi>
											</m:mrow>
										</m:msub>
										<m:msub>
											<m:mi>l</m:mi>
											<m:mrow>
												<m:msub>
													<m:mi>x</m:mi>
													<m:mi>i</m:mi>
												</m:msub>
											</m:mrow>
										</m:msub>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGmbatcqGH9aqpcyGGTbqBcqGGHbqycqGG4baEdaWgaaWcbaGaemyAaKMaeyypa0JaeGymaeJaeiilaWIaeiOla4IaeiOla4IaeiOla4IaeiilaWIaemyBa0gabeaakiabdYgaSnaaBaaaleaaieWacqWF4baEdaWgaaadbaGaemyAaKgabeaaaSqabaaaaa@40F0@</m:annotation>
								</m:semantics>
							</m:math>
						</inline-formula>. This approach also works for the WD kernel (with and without shifts). Here a weight is assigned to each <it>k</it>-mer with 1 &#8804; <it>k </it>&#8804; <it>d </it>at each position in the sequence. This allows us to generate the <it>k</it>-<it>mer importance matrices</it>, displayed in Figure <figr fid="F4">4</figr>, associated with our splice classifiers <abbrgrp>
							<abbr bid="B54">54</abbr>
						</abbrgrp>. They display the weight that the SVM assigns to each <it>k</it>-mer at each position in the input sequence, i.e., given an SVM classifier trained with a WD kernel of degree <it>d</it>, we extract the <it>k</it>-mer weightings for 1 &#8804; <it>k </it>&#8804; <inline-formula>
							<m:math name="1471-2105-8-S10-S7-i26" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mover accent="true">
										<m:mi>d</m:mi>
										<m:mo>&#732;</m:mo>
									</m:mover>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGKbazgaacaaaa@2E0C@</m:annotation>
								</m:semantics>
							</m:math>
						</inline-formula> starting at position <it>p </it>= 1,...,<it>N</it>, where we used <it>d </it>as selected in model selection and <inline-formula>
							<m:math name="1471-2105-8-S10-S7-i26" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mover accent="true">
										<m:mi>d</m:mi>
										<m:mo>&#732;</m:mo>
									</m:mover>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGKbazgaacaaaa@2E0C@</m:annotation>
								</m:semantics>
							</m:math>
						</inline-formula> = 1,...,8. This leads to a weighting for <inline-formula>
							<m:math name="1471-2105-8-S10-S7-i26" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mover accent="true">
										<m:mi>d</m:mi>
										<m:mo>&#732;</m:mo>
									</m:mover>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGKbazgaacaaaa@2E0C@</m:annotation>
								</m:semantics>
							</m:math>
						</inline-formula>-mers <b>u </b>for each position in the sequence: <it>W</it>
						<sub>
							<b>u</b>,<it>p</it>
						</sub>, which may be summarized by <inline-formula>
							<m:math name="1471-2105-8-S10-S7-i27" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mrow>
										<m:msub>
											<m:mi>S</m:mi>
											<m:mrow>
												<m:mover accent="true">
													<m:mi>d</m:mi>
													<m:mo>&#732;</m:mo>
												</m:mover>
												<m:mo>,</m:mo>
												<m:mi>p</m:mi>
											</m:mrow>
										</m:msub>
									</m:mrow>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGtbWudaWgaaWcbaGafmizaqMbaGaacqGGSaalcqWGWbaCaeqaaaaa@31B0@</m:annotation>
								</m:semantics>
							</m:math>
						</inline-formula> = max<sub>
							<b>u</b>
						</sub>(<it>W</it>
						<sub>
							<b>u</b>,<it>p</it>
						</sub>). We compute this quantity for <inline-formula>
							<m:math name="1471-2105-8-S10-S7-i26" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mover accent="true">
										<m:mi>d</m:mi>
										<m:mo>&#732;</m:mo>
									</m:mover>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGKbazgaacaaaa@2E0C@</m:annotation>
								</m:semantics>
							</m:math>
						</inline-formula> = 1,...,8 leading to the two 8 &#215; 141 matrices, which are transformed into percentile values and then displayed color-coded in Figure <figr fid="F4">4</figr>. Note that the above computation can be done efficiently using string index data structures implemented in <it>SHOGUN </it>and described in detail in <abbrgrp>
							<abbr bid="B40">40</abbr>
						</abbrgrp>.</p>
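					<p>The construction of a <it>k</it>-mer importance matrix can be sketched in Python under simplifying assumptions. This is not the <it>SHOGUN </it>implementation. The function names are hypothetical, positions are 0-based, and the weight of a <it>k</it>-mer at a position is taken to be the sum of the signed coefficients of the support vectors that contain it there. The degree weighting of the WD kernel and the contributions of overlapping shorter or longer <it>k</it>-mers are ignored. The maximum is taken over the <it>k</it>-mers observed in the support vectors, and positions at which no <it>k</it>-mer fits are padded with zero.</p>
					<p>
```python
from collections import defaultdict


def positional_weights(support, alphas, labels, k):
    """W[p][u] = sum_i alpha_i * y_i * [x_i[p:p+k] == u] for equal-length support vectors."""
    n = len(support[0])
    W = [defaultdict(float) for _ in range(n - k + 1)]
    for x_i, a_i, y_i in zip(support, alphas, labels):
        for p in range(n - k + 1):
            W[p][x_i[p:p + k]] += a_i * y_i
    return W


def importance_matrix(support, alphas, labels, max_k):
    """Row k-1 holds S[k, p] = max_u W[p][u] over observed k-mers, padded with zeros."""
    n = len(support[0])
    S = []
    for k in range(1, max_k + 1):
        W = positional_weights(support, alphas, labels, k)
        row = [max(W_p.values()) for W_p in W]
        row += [0.0] * (n - len(row))  # positions where no k-mer of this length fits
        S.append(row)
    return S


def kmers_above(W, tau):
    """All (position, k-mer) pairs whose absolute weight exceeds the threshold tau."""
    return sorted((p, u) for p, W_p in enumerate(W)
                  for u, v in W_p.items() if abs(v) > tau)
```
					</p>
					<p>Calling the hypothetical <it>importance_matrix</it> with a maximal <it>k</it>-mer length of 8 on sequences of length 141 would yield one 8 &#215; 141 matrix per classifier. The percentile transformation and color coding used for the figure are not part of this sketch.</p>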
				</sec>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Competing interests</p>
			</st>
			<p>The authors declare that they have no competing interests.</p>
		</sec>
		<sec>
			<st>
				<p>Authors' contributions</p>
			</st>
			<p>SS provided code for large-scale kernel learning and helped carry out the experiments. PP performed most experiments in the pilot study and drafted the manuscript. GS and JB performed the experiments on the genome-wide data sets and helped generate the data. GR conceived the experiments, generated the data sets and helped perform the experiments. All authors contributed to writing and critically revising the manuscript.</p>
		</sec>
	</bdy>
	<bm>
		<ack>
			<sec>
				<st>
					<p>Acknowledgements</p>
				</st>
				<p>We gratefully acknowledge helpful discussions with Anja Neuber, Alexander Zien, Georg Zeller, Andrei Lupas, Detlef Weigel, Alan Zahler, Koji Tsuda, Christina Leslie and Eleazar Eskin. Additionally, we thank Alexander Zien for helping with the implementation of the <it>k</it>-mer importance matrices and Cheng Soon Ong for implementing the generation of splice graphs from aligned sequences. Finally, we would like to thank Michiel Van Bel from Ghent University for his help in getting SpliceMachine to work.</p>
				<p>This article has been published as part of <it>BMC Bioinformatics </it>Volume 8 Supplement 10, 2007: Neural Information Processing Systems (NIPS) workshop on New Problems and Methods in Computational Biology. The full contents of the supplement are available online at <url>http://www.biomedcentral.com/1471-2105/8?issue=S10</url>.</p>
			</sec>
		</ack>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>Performance assessment of promoter predictions on ENCODE regions in the EGASP experiment</p>
				</title>
				<aug>
					<au>
						<snm>Bajic</snm>
						<fnm>V</fnm>
					</au>
					<au>
						<snm>Brent</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Brown</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Frankish</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Harrow</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Ohler</snm>
						<fnm>U</fnm>
					</au>
					<au>
						<snm>Solovyev</snm>
						<fnm>V</fnm>
					</au>
					<au>
						<snm>Tan</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Genome Biology</source>
				<pubdate>2006</pubdate>
				<volume>7</volume>
				<issue>Suppl 1</issue>
				<fpage>S3</fpage>
				<lpage/>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1810552</pubid>
						<pubid idtype="pmpid" link="fulltext">16925837</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B2">
				<title>
					<p>nGASP Gene prediction challenge</p>
				</title>
				<aug>
					<au>
						<snm>Stein</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Blasiar</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Coghlan</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Fiedler</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>McKay</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Flicek</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<pubdate>2007</pubdate>
				<url>http://www.wormbase.org/wiki/index.php/NGASP</url>
			</bibl>
			<bibl id="B3">
				<title>
					<p>Improving the C. elegans genome annotation using machine learning</p>
				</title>
				<aug>
					<au>
						<snm>R&#228;tsch</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Sonnenburg</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Srinivasan</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Witte</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>M&#252;ller</snm>
						<fnm>KR</fnm>
					</au>
					<au>
						<snm>Sommer</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Sch&#246;lkopf</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>PLoS Computational Biology</source>
				<pubdate>2007</pubdate>
				<volume>3</volume>
				<issue>2</issue>
				<fpage>e20</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1808025</pubid>
						<pubid idtype="pmpid" link="fulltext">17319737</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B4">
				<title>
					<p>Global Discriminative Learning for Higher-Accuracy Computational Gene Prediction</p>
				</title>
				<aug>
					<au>
						<snm>Bernal</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Crammer</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Hatzigeorgiou</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Pereira</snm>
						<fnm>F</fnm>
					</au>
				</aug>
				<source>PLoS Computational Biology</source>
				<pubdate>2007</pubdate>
				<volume>3</volume>
				<issue>3</issue>
				<fpage>e54</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1828702</pubid>
						<pubid idtype="pmpid" link="fulltext">17367206</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B5">
				<title>
					<p>Whole-Genome Patterns of Common DNA Variation in Three Human Populations</p>
				</title>
				<aug>
					<au>
						<snm>Hinds</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Stuve</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Nilsen</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Halperin</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Eskin</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Ballinger</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Frazer</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Cox</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Science</source>
				<pubdate>2005</pubdate>
				<volume>307</volume>
				<issue>5712</issue>
				<fpage>1072</fpage>
				<lpage>1079</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15718463</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B6">
				<title>
					<p>A haplotype map of the human genome</p>
				</title>
				<aug>
					<au>
						<cnm>International HapMap Consortium</cnm>
					</au>
				</aug>
				<source>Nature</source>
				<pubdate>2005</pubdate>
				<volume>437</volume>
				<fpage>1299</fpage>
				<lpage>1320</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1880871</pubid>
						<pubid idtype="pmpid" link="fulltext">16255080</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B7">
				<title>
					<p>Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana</p>
				</title>
				<aug>
					<au>
						<snm>Clark</snm>
						<fnm>RM</fnm>
					</au>
					<au>
						<snm>Schweikert</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Toomajian</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Ossowski</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Zeller</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Shinn</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Warthmann</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Hu</snm>
						<fnm>TT</fnm>
					</au>
					<au>
						<snm>Fu</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Hinds</snm>
						<fnm>DA</fnm>
					</au>
					<au>
						<snm>Chen</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Frazer</snm>
						<fnm>KA</fnm>
					</au>
					<au>
						<snm>Huson</snm>
						<fnm>DH</fnm>
					</au>
					<au>
						<snm>Sch&#246;lkopf</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Nordborg</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>R&#228;tsch</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Ecker</snm>
						<fnm>JR</fnm>
					</au>
					<au>
						<snm>Weigel</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Science</source>
				<pubdate>2007</pubdate>
				<volume>317</volume>
				<issue>5836</issue>
				<fpage>338</fpage>
				<lpage>342</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">17641193</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B8">
				<title>
					<p>Prediction of complete gene structures in human genomic DNA</p>
				</title>
				<aug>
					<au>
						<snm>Burge</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Karlin</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Journal of Molecular Biology</source>
				<pubdate>1997</pubdate>
				<volume>268</volume>
				<fpage>78</fpage>
				<lpage>94</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">9149143</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B9">
				<title>
					<p>Improved splice site detection in Genie</p>
				</title>
				<aug>
					<au>
						<snm>Reese</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Eeckman</snm>
						<fnm>FH</fnm>
					</au>
					<au>
						<snm>Kulp</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Haussler</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Journal of Computational Biology</source>
				<pubdate>1997</pubdate>
				<volume>4</volume>
				<fpage>311</fpage>
				<lpage>323</lpage>
				<xrefbib>
					<pubid idtype="pmpid">9278062</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B10">
				<title>
					<p>A decision tree system for finding genes in DNA</p>
				</title>
				<aug>
					<au>
						<snm>Salzberg</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Delcher</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Fasman</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Henderson</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Journal of Computational Biology</source>
				<pubdate>1998</pubdate>
				<volume>5</volume>
				<issue>4</issue>
				<fpage>667</fpage>
				<lpage>680</lpage>
				<xrefbib>
					<pubid idtype="pmpid">10072083</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B11">
				<title>
					<p>Improved microbial gene identification with GLIMMER</p>
				</title>
				<aug>
					<au>
						<snm>Delcher</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Harmon</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Kasif</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>White</snm>
						<fnm>O</fnm>
					</au>
					<au>
						<snm>Salzberg</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Research</source>
				<pubdate>1999</pubdate>
				<volume>27</volume>
				<issue>23</issue>
				<fpage>4636</fpage>
				<lpage>4641</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">148753</pubid>
						<pubid idtype="pmpid" link="fulltext">10556321</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B12">
				<title>
					<p>GeneSplicer: a new computational method for splice site prediction</p>
				</title>
				<aug>
					<au>
						<snm>Pertea</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Lin</snm>
						<fnm>X</fnm>
					</au>
					<au>
						<snm>Salzberg</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Research</source>
				<pubdate>2001</pubdate>
				<volume>29</volume>
				<issue>5</issue>
				<fpage>1185</fpage>
				<lpage>1190</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">29713</pubid>
						<pubid idtype="pmpid" link="fulltext">11222768</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B13">
				<title>
					<p>Recognition of splice junctions on DNA sequences by BRAIN learning algorithm</p>
				</title>
				<aug>
					<au>
						<snm>Rampone</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>1998</pubdate>
				<volume>14</volume>
				<issue>8</issue>
				<fpage>676</fpage>
				<lpage>684</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">9789093</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B14">
				<title>
					<p>Modeling splice sites with Bayes networks</p>
				</title>
				<aug>
					<au>
						<snm>Cai</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Delcher</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Kao</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Kasif</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2000</pubdate>
				<volume>16</volume>
				<issue>2</issue>
				<fpage>152</fpage>
				<lpage>158</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">10842737</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B15">
				<title>
					<p>Markov Encoding for Detecting Signals in Genomic Sequences</p>
				</title>
				<aug>
					<au>
						<snm>Rajapakse</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Ho</snm>
						<fnm>L</fnm>
					</au>
				</aug>
				<source>IEEE ACM Transactions on Computational Biology and Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>2</volume>
				<issue>2</issue>
				<fpage>131</fpage>
				<lpage>142</lpage>
			</bibl>
			<bibl id="B16">
				<aug>
					<au>
						<snm>Vapnik</snm>
						<fnm>VN</fnm>
					</au>
				</aug>
				<source>The Nature of Statistical Learning Theory</source>
				<publisher>New York, Springer Verlag</publisher>
				<pubdate>1995</pubdate>
			</bibl>
			<bibl id="B17">
				<title>
					<p>An Introduction to Kernel-Based Learning Algorithms</p>
				</title>
				<aug>
					<au>
						<snm>M&#252;ller</snm>
						<fnm>KR</fnm>
					</au>
					<au>
						<snm>Mika</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>R&#228;tsch</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Tsuda</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Sch&#246;lkopf</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>IEEE Transactions on Neural Networks</source>
				<pubdate>2001</pubdate>
				<volume>12</volume>
				<issue>2</issue>
				<fpage>181</fpage>
				<lpage>201</lpage>
			</bibl>
			<bibl id="B18">
				<aug>
					<au>
						<snm>Sch&#246;lkopf</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Smola</snm>
						<fnm>AJ</fnm>
					</au>
				</aug>
				<source>Learning with Kernels</source>
				<publisher>Cambridge, MA, MIT Press</publisher>
				<pubdate>2002</pubdate>
			</bibl>
			<bibl id="B19">
				<title>
					<p>What is a Support Vector Machine?</p>
				</title>
				<aug>
					<au>
						<snm>Noble</snm>
						<fnm>WS</fnm>
					</au>
				</aug>
				<source>Nature Biotechnology</source>
				<pubdate>2006</pubdate>
				<volume>24</volume>
				<issue>12</issue>
				<fpage>1565</fpage>
				<lpage>1567</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">17160063</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B20">
				<title>
					<p>Exploiting Generative Models in Discriminative Classifiers</p>
				</title>
				<aug>
					<au>
						<snm>Jaakkola</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Haussler</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Advances in Neural Information Processing Systems</source>
				<publisher>Cambridge, MA, MIT Press</publisher>
				<editor>Kearns M, Solla S, Cohn D</editor>
				<pubdate>1999</pubdate>
				<volume>11</volume>
				<fpage>487</fpage>
				<lpage>493</lpage>
			</bibl>
			<bibl id="B21">
				<title>
					<p>Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites</p>
				</title>
				<aug>
					<au>
						<snm>Zien</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>R&#228;tsch</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Mika</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Sch&#246;lkopf</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Lengauer</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>M&#252;ller</snm>
						<fnm>KR</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2000</pubdate>
				<volume>16</volume>
				<issue>9</issue>
				<fpage>799</fpage>
				<lpage>807</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">11108702</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B22">
				<title>
					<p>Knowledge-based analysis of microarray gene expression data using support vector machines</p>
				</title>
				<aug>
					<au>
						<snm>Brown</snm>
						<fnm>MPS</fnm>
					</au>
					<au>
						<snm>Grundy</snm>
						<fnm>WN</fnm>
					</au>
					<au>
						<snm>Lin</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Cristianini</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Sugnet</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Furey</snm>
						<fnm>TS</fnm>
					</au>
					<au>
						<snm>Ares</snm>
						<fnm>M Jr</fnm>
					</au>
					<au>
						<snm>Haussler</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>PNAS</source>
				<pubdate>2000</pubdate>
				<volume>97</volume>
				<fpage>262</fpage>
				<lpage>267</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">26651</pubid>
						<pubid idtype="pmpid" link="fulltext">10618406</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B23">
				<title>
					<p>A New Discriminative Kernel from Probabilistic Models</p>
				</title>
				<aug>
					<au>
						<snm>Tsuda</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Kawanabe</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>R&#228;tsch</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Sonnenburg</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>M&#252;ller</snm>
						<fnm>KR</fnm>
					</au>
				</aug>
				<source>Advances in Neural Information Processing Systems</source>
				<editor>Dietterich T, Becker S, Ghahramani Z</editor>
				<pubdate>2002</pubdate>
				<volume>14</volume>
				<fpage>977</fpage>
			</bibl>
			<bibl id="B24">
				<title>
					<p>New Methods for Splice-Site Recognition</p>
				</title>
				<aug>
					<au>
						<snm>Sonnenburg</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>R&#228;tsch</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Jagota</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>M&#252;ller</snm>
						<fnm>KR</fnm>
					</au>
				</aug>
				<source>Proc ICANN'02</source>
				<pubdate>2002</pubdate>
			</bibl>
			<bibl id="B25">
				<title>
					<p>New Methods for Splice Site Recognition</p>
				</title>
				<aug>
					<au>
						<snm>Sonnenburg</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Master's thesis</source>
				<publisher>Humboldt University</publisher>
				<pubdate>2002</pubdate>
				<note>[Supervised by K.-R. M&#252;ller, H.-D. Burkhard and G. R&#228;tsch]</note>
			</bibl>
			<bibl id="B26">
				<title>
					<p>Human Splice Site Identifications with Multiclass Support Vector Machines and Bagging</p>
				</title>
				<aug>
					<au>
						<snm>Lorena</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>de Carvalho</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Artificial Neural Networks and Neural Information Processing &#8211; ICANN/ICONIP 2003</source>
				<pubdate>2003</pubdate>
				<volume>2714</volume>
			</bibl>
			<bibl id="B27">
				<title>
					<p>Detection of the Splicing Sites with Kernel Method Approaches Dealing with Nucleotide Doublets</p>
				</title>
				<aug>
					<au>
						<snm>Yamamura</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Gotoh</snm>
						<fnm>O</fnm>
					</au>
				</aug>
				<source>Genome Informatics</source>
				<pubdate>2003</pubdate>
				<volume>14</volume>
				<fpage>426</fpage>
				<lpage>427</lpage>
			</bibl>
			<bibl id="B28">
				<title>
					<p>Accurate Splice Site Detection for <it>Caenorhabditis elegans</it>
					</p>
				</title>
				<aug>
					<au>
						<snm>R&#228;tsch</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Sonnenburg</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Kernel Methods in Computational Biology</source>
				<publisher>MIT Press</publisher>
				<editor>Sch&#246;lkopf B, Tsuda K, Vert JP</editor>
				<pubdate>2004</pubdate>
			</bibl>
			<bibl id="B29">
				<title>
					<p>SpliceMachine: predicting splice sites from high-dimensional local context representations</p>
				</title>
				<aug>
					<au>
						<snm>Degroeve</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Saeys</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>De Baets</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Rouz&#233;</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Van de Peer</snm>
						<fnm>Y</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>21</volume>
				<issue>8</issue>
				<fpage>1332</fpage>
				<lpage>1338</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15564294</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B30">
				<title>
					<p>An approach of encoding for prediction of splice sites using SVM</p>
				</title>
				<aug>
					<au>
						<snm>Huang</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Li</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Chen</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Wu</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Biochimie</source>
				<pubdate>2006</pubdate>
				<volume>88</volume>
				<fpage>923</fpage>
				<lpage>929</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">16626852</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B31">
				<title>
					<p>Splice site prediction using support vector machines with a Bayes kernel</p>
				</title>
				<aug>
					<au>
						<snm>Zhang</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Chu</snm>
						<fnm>CH</fnm>
					</au>
					<au>
						<snm>Chen</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Zha</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Ji</snm>
						<fnm>X</fnm>
					</au>
				</aug>
				<source>Expert Systems with Applications</source>
				<pubdate>2006</pubdate>
				<volume>30</volume>
				<fpage>73</fpage>
				<lpage>81</lpage>
			</bibl>
			<bibl id="B32">
				<title>
					<p>Splice site identification using probabilistic parameters and SVM classification</p>
				</title>
				<aug>
					<au>
						<snm>Baten</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Chang</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Halgamuge</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Li</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>BMC Bioinformatics</source>
				<pubdate>2006</pubdate>
				<volume>7</volume>
				<issue>Suppl 5</issue>
				<fpage>S15</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1764471</pubid>
						<pubid idtype="pmpid" link="fulltext">17254299</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B33">
				<title>
					<p>Combining pairwise sequence similarity and support vector machines for remote protein homology detection</p>
				</title>
				<aug>
					<au>
						<snm>Liao</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Noble</snm>
						<fnm>WS</fnm>
					</au>
				</aug>
				<source>Proceedings of the Sixth Annual International Conference on Computational Molecular Biology (RECOMB)</source>
				<publisher>New York: ACM Press</publisher>
				<editor>Myers G, Hannenhalli S, Sankoff D, Istrail S, Pevzner P, Waterman M</editor>
				<pubdate>2002</pubdate>
				<fpage>225</fpage>
				<lpage>232</lpage>
			</bibl>
			<bibl id="B34">
				<title>
					<p>RASE: Recognition of Alternatively Spliced Exons in <it>C. elegans</it>
					</p>
				</title>
				<aug>
					<au>
						<snm>R&#228;tsch</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Sonnenburg</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Sch&#246;lkopf</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>21</volume>
				<issue>Suppl 1</issue>
				<fpage>i369</fpage>
				<lpage>i377</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15961480</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B35">
				<title>
					<p>A computational survey of candidate exonic splicing enhancer motifs in the model plant <it>Arabidopsis thaliana</it></p>
				</title>
				<aug>
					<au>
						<snm>Pertea</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Mount</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Salzberg</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>BMC Bioinformatics</source>
				<pubdate>2007</pubdate>
				<volume>8</volume>
				<fpage>159</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1892810</pubid>
						<pubid idtype="pmpid" link="fulltext">17517127</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B36">
				<title>
					<p>Prediction of splice sites with dependency graphs and their expanded Bayesian networks</p>
				</title>
				<aug>
					<au>
						<snm>Chen</snm>
						<fnm>TM</fnm>
					</au>
					<au>
						<snm>Lu</snm>
						<fnm>CC</fnm>
					</au>
					<au>
						<snm>Li</snm>
						<fnm>WH</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>21</volume>
				<issue>4</issue>
				<fpage>471</fpage>
				<lpage>482</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15374869</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B37">
				<title>
					<p>ARTS: Accurate Recognition of Transcription Starts in Human</p>
				</title>
				<aug>
					<au>
						<snm>Sonnenburg</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Zien</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>R&#228;tsch</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2006</pubdate>
				<volume>22</volume>
				<issue>14</issue>
				<fpage>e472</fpage>
				<lpage>e480</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">16873509</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B38">
				<title>
					<p>Large Scale Multiple Kernel Learning</p>
				</title>
				<aug>
					<au>
						<snm>Sonnenburg</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>R&#228;tsch</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Sch&#228;fer</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Sch&#246;lkopf</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Journal of Machine Learning Research</source>
				<pubdate>2006</pubdate>
				<volume>7</volume>
				<fpage>1531</fpage>
				<lpage>1565</lpage>
				<note>[Special Topic on Machine Learning and Optimization]</note>
			</bibl>
			<bibl id="B39">
				<title>
					<p>Learning Interpretable SVMs for Biological Sequence Classification</p>
				</title>
				<aug>
					<au>
						<snm>R&#228;tsch</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Sonnenburg</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Sch&#228;fer</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>BMC Bioinformatics</source>
				<pubdate>2006</pubdate>
				<volume>7</volume>
				<issue>Suppl 1</issue>
				<fpage>S9</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1810320</pubid>
						<pubid idtype="pmpid" link="fulltext">16723012</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B40">
				<title>
					<p>Large Scale Learning with String Kernels</p>
				</title>
				<aug>
					<au>
						<snm>Sonnenburg</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>R&#228;tsch</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Rieck</snm>
						<fnm>K</fnm>
					</au>
				</aug>
				<source>Large Scale Kernel Machines</source>
				<publisher>MIT Press</publisher>
				<editor>Bottou L, Chapelle O, DeCoste D, Weston J</editor>
				<pubdate>2007</pubdate>
			</bibl>
			<bibl id="B41">
				<title>
					<p>Basic principles of ROC analysis</p>
				</title>
				<aug>
					<au>
						<snm>Metz</snm>
						<fnm>CE</fnm>
					</au>
				</aug>
				<source>Seminars in Nuclear Medicine</source>
				<pubdate>1978</pubdate>
				<volume>8</volume>
				<issue>4</issue>
			</bibl>
			<bibl id="B42">
				<title>
					<p>ROC graphs: Notes and practical considerations for data mining researchers</p>
				</title>
				<aug>
					<au>
						<snm>Fawcett</snm>
						<fnm>T</fnm>
					</au>
				</aug>
				<source>Technical report hpl-2003-4</source>
				<publisher>HP Laboratories, Palo Alto, CA, USA</publisher>
				<pubdate>2003</pubdate>
			</bibl>
			<bibl id="B43">
				<title>
					<p>The relationship between Precision-Recall and ROC curves</p>
				</title>
				<aug>
					<au>
						<snm>Davis</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Goadrich</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>ICML</source>
				<pubdate>2006</pubdate>
				<fpage>233</fpage>
				<lpage>240</lpage>
			</bibl>
			<bibl id="B44">
				<aug>
					<au>
						<snm>Durbin</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Eddy</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Krogh</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Mitchison</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<source>Biological Sequence Analysis &#8211; Probabilistic Models of Proteins and Nucleic Acids</source>
				<publisher>Cambridge, UK, Cambridge University Press</publisher>
				<pubdate>1998</pubdate>
			</bibl>
			<bibl id="B45">
				<title>
					<p>Correction notes to BMC Bioinformatics 2006, 7(Suppl 5):S15</p>
				</title>
				<pubdate>2006</pubdate>
				<url>http://www.mame.mu.oz.au/bioinformatics/splicesite</url>
			</bibl>
			<bibl id="B46">
				<title>
					<p>SpliceMachine</p>
				</title>
				<url>http://bioinformatics.psb.ugent.be/webtools/splicemachine/</url>
			</bibl>
			<bibl id="B47">
				<title>
					<p>GeneSplicer</p>
				</title>
				<url>http://www.cbcb.umd.edu/software/GeneSplicer/</url>
			</bibl>
			<bibl id="B48">
				<title>
					<p>SpliceMachine feature extractor</p>
				</title>
				<url>http://bioinformatics.psb.ugent.be/supplementary_data/svgro/splicemachine/downloads/splice_machine_sept_2004.zip</url>
			</bibl>
			<bibl id="B49">
				<title>
					<p>The spectrum kernel: A string kernel for SVM protein classification</p>
				</title>
				<aug>
					<au>
						<snm>Leslie</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Eskin</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Noble</snm>
						<fnm>WS</fnm>
					</au>
				</aug>
				<source>PSB</source>
				<publisher>River Edge, NJ, World Scientific</publisher>
				<editor>Altman R, Dunker A, Hunter L, Lauderdale K, Klein T</editor>
				<pubdate>2002</pubdate>
				<fpage>564</fpage>
				<lpage>575</lpage>
			</bibl>
			<bibl id="B50">
				<title>
					<p>Profile analysis: Detection of distantly related proteins</p>
				</title>
				<aug>
					<au>
						<snm>Gribskov</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>McLachlan</snm>
						<fnm>AD</fnm>
					</au>
					<au>
						<snm>Eisenberg</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci U S A</source>
				<pubdate>1987</pubdate>
				<volume>84</volume>
				<fpage>4355</fpage>
				<lpage>4358</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">305087</pubid>
						<pubid idtype="pmpid" link="fulltext">3474607</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B51">
				<title>
					<p>Profile-based string kernels for remote homology detection and motif extraction</p>
				</title>
				<aug>
					<au>
						<snm>Kuang</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Ie</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Wang</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Wang</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Siddiqi</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Freund</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Leslie</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>Computational Systems Bioinformatics Conference 2004</source>
				<pubdate>2004</pubdate>
				<fpage>146</fpage>
				<lpage>154</lpage>
			</bibl>
			<bibl id="B52">
				<title>
					<p>Sequence information for the splicing of human pre-mRNA identified by support vector machine classification</p>
				</title>
				<aug>
					<au>
						<snm>Zhang</snm>
						<fnm>XHF</fnm>
					</au>
					<au>
						<snm>Heller</snm>
						<fnm>KA</fnm>
					</au>
					<au>
						<snm>Hefter</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Leslie</snm>
						<fnm>CS</fnm>
					</au>
					<au>
						<snm>Chasin</snm>
						<fnm>LA</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2003</pubdate>
				<volume>13</volume>
				<issue>12</issue>
				<fpage>2637</fpage>
				<lpage>2650</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">403805</pubid>
						<pubid idtype="pmpid" link="fulltext">14656968</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B53">
				<title>
					<p>Dichotomous splicing signals in exon flanks</p>
				</title>
				<aug>
					<au>
						<snm>Zhang</snm>
						<fnm>X</fnm>
					</au>
					<au>
						<snm>Leslie</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Chasin</snm>
						<fnm>L</fnm>
					</au>
				</aug>
				<source>Genome Research</source>
				<pubdate>2005</pubdate>
				<volume>15</volume>
				<issue>6</issue>
				<fpage>768</fpage>
				<lpage>779</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1142467</pubid>
						<pubid idtype="pmpid" link="fulltext">15930489</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B54">
				<title>
					<p>POIMS: Positional Oligomer Importance Matrices</p>
				</title>
				<aug>
					<au>
						<snm>Sonnenburg</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Zien</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Philips</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>R&#228;tsch</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<pubdate>2007</pubdate>
				<note>[In preparation]</note>
			</bibl>
			<bibl id="B55">
				<aug>
					<au>
						<snm>Lewin</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Genes VII</source>
				<publisher>Oxford University Press, New York</publisher>
				<pubdate>2000</pubdate>
			</bibl>
			<bibl id="B56">
				<title>
					<p>Fruit fly genome sequence</p>
				</title>
				<url>http://www.fruitfly.org/sequence/human-datasets.html</url>
			</bibl>
			<bibl id="B57">
				<title>
					<p>dbEST-Database for "Expressed Sequence Tags"</p>
				</title>
				<aug>
					<au>
						<snm>Boguski</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Lowe</snm>
						<fnm>TM</fnm>
					</au>
					<au>
						<snm>Tolstoshev</snm>
						<fnm>CM</fnm>
					</au>
				</aug>
				<source>Nature Genetics</source>
				<pubdate>1993</pubdate>
				<volume>4</volume>
				<issue>4</issue>
				<fpage>332</fpage>
				<lpage>333</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">8401577</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B58">
				<title>
					<p>Wormbase</p>
				</title>
				<url>http://www.wormbase.org</url>
			</bibl>
			<bibl id="B59">
				<title>
					<p>Fruit fly expressed sequence tags</p>
				</title>
				<url>http://www.fruitfly.org/EST/index.shtml</url>
			</bibl>
			<bibl id="B60">
				<title>
					<p>Riken cress sequence</p>
				</title>
				<url>http://rarge.psc.riken.jp/archives/rafl/sequence/</url>
			</bibl>
			<bibl id="B61">
				<title>
					<p>Ensembl</p>
				</title>
				<url>http://www.ensembl.org</url>
			</bibl>
			<bibl id="B62">
				<title>
					<p>Mammalian Gene Collection</p>
				</title>
				<url>http://mgc.nci.nih.gov</url>
			</bibl>
			<bibl id="B63">
				<title>
					<p>BLAT-the BLAST-like alignment tool</p>
				</title>
				<aug>
					<au>
						<snm>Kent</snm>
						<fnm>W</fnm>
					</au>
				</aug>
				<source>Genome Research</source>
				<pubdate>2002</pubdate>
				<volume>12</volume>
				<issue>4</issue>
				<fpage>656</fpage>
				<lpage>664</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">187518</pubid>
						<pubid idtype="pmpid" link="fulltext">11932250</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B64">
				<title>
					<p>Prediction of Alternative Splicing in Eukaryotes</p>
				</title>
				<aug>
					<au>
						<snm>Ong</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>R&#228;tsch</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<note>[In preparation]</note>
			</bibl>
		</refgrp>
	</bm>
</art>

