<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>1471-2105-8-S4-S6</ui>
	<ji>1471-2105</ji>
	<fm>
		<dochead>Proceedings</dochead>
		<bibl>
			<title>
				<p>Gene function prediction based on genomic context clustering and discriminative learning: an application to bacteriophages</p>
			</title>
			<aug>
				<au id="A1">
					<snm>Li</snm>
					<fnm>Jason</fnm>
					<insr iid="I1"/>
					<email>lij@mame.mu.oz.au</email>
				</au>
				<au id="A2">
					<snm>Halgamuge</snm>
					<mi>K</mi>
					<fnm>Saman</fnm>
					<insr iid="I1"/>
					<email>saman@unimelb.edu.au</email>
				</au>
				<au id="A3">
					<snm>Kells</snm>
					<mi>I</mi>
					<fnm>Christopher</fnm>
					<insr iid="I1"/>
					<email>c.kells@ugrad.unimelb.edu.au</email>
				</au>
				<au id="A4" ca="yes">
					<snm>Tang</snm>
					<fnm>Sen-Lin</fnm>
					<insr iid="I2"/>
					<email>sltang@gate.sinica.edu.tw</email>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>Dynamic Systems &amp; Control Group, DoMME, University of Melbourne, Melbourne, Australia</p>
				</ins>
				<ins id="I2">
					<p>Research Center for Biodiversity, Academia Sinica, Taipei, Taiwan</p>
				</ins>
			</insg>
			<source>BMC Bioinformatics</source>
			<supplement>
				<title>
					<p>The Second Automated Function Prediction Meeting</p>
				</title>
				<editor>Ana PC Rodrigues, Barry J Grant, Adam Godzik and Iddo Friedberg</editor>
				<note>Proceedings</note>
				<url>http://www.biomedcentral.com/content/pdf/1471-2105-8-S4-info.pdf</url>
			</supplement>
			<conference>
				<title>
					<p>The Second Automated Function Prediction Meeting</p>
				</title>
				<location>La Jolla, CA, USA</location>
				<date-range>30 August &#8211; 1 September 2006</date-range>
				<url>http://BioFunctionPrediction.org/AFP/afp06</url>
			</conference>
			<issn>1471-2105</issn>
			<pubdate>2007</pubdate>
			<volume>8</volume>
			<issue>Suppl 4</issue>
			<fpage>S6</fpage>
			<url>http://www.biomedcentral.com/1471-2105/8/S4/S6</url>
			<xrefbib>
				<pubidlist><pubid idtype="pmpid">17570149</pubid><pubid idtype="doi">10.1186/1471-2105-8-S4-S6</pubid>
				</pubidlist></xrefbib>
		</bibl>
		<history>
			<pub>
				<date>
					<day>22</day>
					<month>5</month>
					<year>2007</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2007</year>
			<collab>Li et al; licensee BioMed Central Ltd.</collab>
			<note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
		</cpyrt>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<sec>
					<st>
						<p>Background</p>
					</st>
					<p>Existing methods for whole-genome comparisons require prior knowledge of related species and provide little automation in the function prediction process. Bacteriophage genomes are one example of genomes that cannot be easily analyzed by these methods. This work addresses these shortcomings and aims to provide an automated system for gene function prediction.</p>
				</sec>
				<sec>
					<st>
						<p>Results</p>
					</st>
					<p>We have developed a novel system called SynFPS to perform gene function prediction over completed genomes. The prediction system is initialized by clustering a large collection of weakly related genomes into groups based on their resemblance in gene distribution. From each individual group, data are then extracted and used to train a Support Vector Machine that makes gene function predictions. Experiments were conducted with 9 different gene functions over 296 bacteriophage genomes. Cross validation results gave an average prediction accuracy of ~80%, which is comparable to other genomic-context based prediction methods. Functional predictions are also made on 3 uncharacterized genes and 12 genes that cannot be identified by sequence alignment. The software is publicly available at <url>http://www.synteny.net/</url>.</p>
				</sec>
				<sec>
					<st>
						<p>Conclusion</p>
					</st>
					<p>The proposed system employs genomic context to predict gene function and detect gene correspondence in whole-genome comparisons. Although our experimental focus is on bacteriophages, the method may be extended to other microbial genomes as they share a number of similar characteristics with phage genomes such as gene order conservation.</p>
				</sec>
			</sec>
		</abs>
	</fm>
	<bdy>
		<sec>
			<st>
				<p>Background</p>
			</st>
			<p>The increasing number of completely sequenced genomes has enabled gene function predictions by means of whole genome comparison. Existing methods such as SynBrowse <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, Vista <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>, LAGAN <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>, PipMaker <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> and Ensembl SyntenyView <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> provide visualization of conserved regions between two or more genome sequences for comparative analysis. Such visualization facilitates the prediction of gene function based on comparison of genomic context information such as co-occurrence of genes <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp> and conservation of gene order <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>.</p>
			<p>However, these methods have two major limitations. First, they rely on sequence alignment to identify corresponding genes or regions between genomes <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp>. Consequently, they cannot automatically detect homologous or functionally similar genes that share no sequence similarity, so those genes must be predicted manually. Second, these methods require the genomes being compared to be closely related. This hinders the automatic analysis of a large collection of weakly related genomes and makes it impossible to inspect a genome for which related species have not been identified.</p>
			<p>Bacteriophage genomes are one example that suffers from the above limitations. Firstly, sequence alignment based methods are not fully reliable in detecting functionally similar genes within phages. This is because homologous phage genes have often diverged beyond the recognition of sequence similarity <abbrgrp><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>. A key argument to explain such divergence is that the genes have a very distant common ancestry <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. Secondly, restricting the comparison to a few related phages and ignoring the remainder can hinder the genomic analysis of the target phage. The reason is that the global phage relationships are not clearly defined phylogenetically due to extensive horizontal gene transfer (HGT) <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B16">16</abbr></abbrgrp>, implying that relatedness between phages often cannot be established. Consequently, it is desirable to have an objective measure to automatically identify closely related genomes based on the genetic data, as opposed to depending on the user to define a set of "related species".</p>
			<p>This work addresses the shortcomings of the existing methods and aims to provide a highly automated gene function prediction system based on whole-genome comparison. The system, named SynFPS, contains two automated learning units with distinct roles: a clustering technique that utilizes gene-to-gene distances to identify closely related genomes and a Support Vector Machine (SVM) for discriminative classification on gene functions. The algorithm of SynFPS and the results of function prediction on phage genes will be presented in the remainder of this paper.</p>
		</sec>
		<sec>
			<st>
				<p>Results and discussion</p>
			</st>
			<sec>
				<st>
					<p>Evaluation of prediction results by leave-one-out cross validation</p>
				</st>
				<p>We have attempted to perform predictions over nine common phage genes using SynFPS. These are major head, major tail, tape measure, prohead protease, integrase, terminase, portal, holin and lysin genes. They were selected because they occur regularly in phage genomes: they encode necessary functions not provided by their hosts, including structural and assembly genes, as well as lysis genes <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. These genes were searched for in the annotation database using the regular expression patterns defined in Table <tblr tid="T1">1</tblr>. The search results were then manually curated to remove ambiguous entries.</p>
				<tbl id="T1">
					<title>
						<p>Table 1</p>
					</title>
					<caption>
						<p>Regular expression patterns used for the nine selected genes.</p>
					</caption>
					<tblbdy cols="2">
						<r>
							<c ca="left">
								<p>Gene</p>
							</c>
							<c ca="left">
								<p>Search pattern</p>
							</c>
						</r>
						<r>
							<c cspan="2">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Major head</p>
							</c>
							<c ca="left">
								<p>(?&lt;!minor)\b(head|capsid)\b</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Major tail</p>
							</c>
							<c ca="left">
								<p>(?&lt;!minor)\btail\b</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Terminase (large subunit)</p>
							</c>
							<c ca="left">
								<p>terminase|\bterL\b</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Holin</p>
							</c>
							<c ca="left">
								<p>\bholin\b</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Lysin</p>
							</c>
							<c ca="left">
								<p>\blysin\b</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Tape measure</p>
							</c>
							<c ca="left">
								<p>\btape\b|minor tail</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Integrase</p>
							</c>
							<c ca="left">
								<p>integrase</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Portal protein</p>
							</c>
							<c ca="left">
								<p>\bportal\b</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Prohead protease</p>
							</c>
							<c ca="left">
								<p>prohead AND protease<sup>&#8224;</sup></p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>&#8224; Not a direct regular expression; "Prohead" and "protease" were searched separately and the results were combined using the AND operation provided by SynFPS.</p>
						<p/>
						<p>These patterns were matched against the CDS annotations of the phages retrieved from GenBank. Note that the search results were then refined via manual inspection. (?&lt;!x) &#8211; not immediately preceded by x (negative lookbehind); \w &#8211; alphanumeric character; \b &#8211; word boundary; | &#8211; 'or'; * &#8211; zero or more of the preceding character.</p>
					</tblfn>
				</tbl>
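				<p>As an illustration of how such patterns can be applied to annotation text, the following is a minimal Python sketch and not the SynFPS implementation. It uses a subset of the Table 1 patterns (the two lookbehind patterns for major head and major tail are omitted). The function name label_annotation, the case-insensitive matching and the per-annotation handling of the "prohead AND protease" combination are assumptions made for this example.</p>
				<p>
```python
import re

# Subset of the Table 1 patterns. Case-insensitive matching is an
# assumption of this sketch; the source does not state it.
PATTERNS = {
    "Terminase (large subunit)": r"terminase|\bterL\b",
    "Holin": r"\bholin\b",
    "Lysin": r"\blysin\b",
    "Tape measure": r"\btape\b|minor tail",
    "Integrase": r"integrase",
    "Portal protein": r"\bportal\b",
}


def label_annotation(text):
    """Return the sorted gene classes whose pattern matches a CDS annotation."""
    hits = [name for name, pattern in PATTERNS.items()
            if re.search(pattern, text, re.IGNORECASE)]
    # "prohead AND protease": the two words are searched separately and the
    # results are combined (assumed here to apply within one annotation).
    if (re.search("prohead", text, re.IGNORECASE)
            and re.search("protease", text, re.IGNORECASE)):
        hits.append("Prohead protease")
    return sorted(hits)


print(label_annotation("putative holin protein"))
print(label_annotation("prohead protease ClpP"))
print(label_annotation("endolysin"))
```
				</p>
				<p>Because \blysin\b requires word boundaries on both sides, an annotation such as "endolysin" is not matched by the lysin pattern in this sketch, which is one reason why manual refinement of the search results remains useful.</p>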
				<p>Table <tblr tid="T2">2</tblr> shows the percentage of genes that would be detected if sequence alignment (BLAST) alone were used. The K-Means clustering result based on these genes can be found in the Supplementary Material (see Additional file <supplr sid="S1">1</supplr>).</p>
				<tbl id="T2">
					<title>
						<p>Table 2</p>
					</title>
					<caption>
						<p>Percentage of genes detected using sequence alignment.</p>
					</caption>
					<tblbdy cols="20">
						<r>
							<c ca="left">
								<p>Reference Genome</p>
							</c>
							<c>
								<p/>
							</c>
							<c cspan="2" ca="center">
								<p>Terminase</p>
							</c>
							<c cspan="2" ca="center">
								<p>Portal</p>
							</c>
							<c cspan="2" ca="center">
								<p>Head</p>
							</c>
							<c cspan="2" ca="center">
								<p>Tail</p>
							</c>
							<c cspan="2" ca="center">
								<p>Tape measure</p>
							</c>
							<c cspan="2" ca="center">
								<p>Prohead protease</p>
							</c>
							<c cspan="2" ca="center">
								<p>Lysin</p>
							</c>
							<c cspan="2" ca="center">
								<p>Holin</p>
							</c>
							<c cspan="2" ca="center">
								<p>Integrase</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c cspan="18">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>E-value cutoff</p>
							</c>
							<c ca="center">
								<p>0.01</p>
							</c>
							<c ca="center">
								<p>0.1</p>
							</c>
							<c ca="center">
								<p>0.01</p>
							</c>
							<c ca="center">
								<p>0.1</p>
							</c>
							<c ca="center">
								<p>0.01</p>
							</c>
							<c ca="center">
								<p>0.1</p>
							</c>
							<c ca="center">
								<p>0.01</p>
							</c>
							<c ca="center">
								<p>0.1</p>
							</c>
							<c ca="center">
								<p>0.01</p>
							</c>
							<c ca="center">
								<p>0.1</p>
							</c>
							<c ca="center">
								<p>0.01</p>
							</c>
							<c ca="center">
								<p>0.1</p>
							</c>
							<c ca="center">
								<p>0.01</p>
							</c>
							<c ca="center">
								<p>0.1</p>
							</c>
							<c ca="center">
								<p>0.01</p>
							</c>
							<c ca="center">
								<p>0.1</p>
							</c>
							<c ca="center">
								<p>0.01</p>
							</c>
							<c ca="center">
								<p>0.1</p>
							</c>
						</r>
						<r>
							<c cspan="20">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Bacteriophage bIL285</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>
									<b>31</b>
								</p>
							</c>
							<c ca="center">
								<p>37</p>
							</c>
							<c ca="center">
								<p>33</p>
							</c>
							<c ca="center">
								<p>50</p>
							</c>
							<c ca="center">
								<p>-</p>
							</c>
							<c ca="center">
								<p>-</p>
							</c>
							<c ca="center">
								<p>4</p>
							</c>
							<c ca="center">
								<p>19</p>
							</c>
							<c ca="center">
								<p>-</p>
							</c>
							<c ca="center">
								<p>-</p>
							</c>
							<c ca="center">
								<p>46</p>
							</c>
							<c ca="center">
								<p>49</p>
							</c>
							<c ca="center">
								<p>16</p>
							</c>
							<c ca="center">
								<p>18</p>
							</c>
							<c ca="center">
								<p>11</p>
							</c>
							<c ca="center">
								<p>23</p>
							</c>
							<c ca="center">
								<p>57</p>
							</c>
							<c ca="center">
								<p>64</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p><it>Lactococcus </it>phage TP901-1</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>8</p>
							</c>
							<c ca="center">
								<p>22</p>
							</c>
							<c ca="center">
								<p>13</p>
							</c>
							<c ca="center">
								<p>19</p>
							</c>
							<c ca="center">
								<p>12</p>
							</c>
							<c ca="center">
								<p>
									<b>27</b>
								</p>
							</c>
							<c ca="center">
								<p>7</p>
							</c>
							<c ca="center">
								<p>22</p>
							</c>
							<c ca="center">
								<p>
									<b>96</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>98</b>
								</p>
							</c>
							<c ca="center">
								<p>-</p>
							</c>
							<c ca="center">
								<p>-</p>
							</c>
							<c ca="center">
								<p>30</p>
							</c>
							<c ca="center">
								<p>48</p>
							</c>
							<c ca="center">
								<p>13</p>
							</c>
							<c ca="center">
								<p>13</p>
							</c>
							<c ca="center">
								<p>3</p>
							</c>
							<c ca="center">
								<p>15</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p><it>Enterobacteria </it>phage HK97</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>29</p>
							</c>
							<c ca="center">
								<p>40</p>
							</c>
							<c ca="center">
								<p>35</p>
							</c>
							<c ca="center">
								<p>43</p>
							</c>
							<c ca="center">
								<p>4</p>
							</c>
							<c ca="center">
								<p>26</p>
							</c>
							<c ca="center">
								<p>
									<b>12</b>
								</p>
							</c>
							<c ca="center">
								<p>19</p>
							</c>
							<c ca="center">
								<p>83</p>
							</c>
							<c ca="center">
								<p>96</p>
							</c>
							<c ca="center">
								<p>
									<b>54</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>64</b>
								</p>
							</c>
							<c ca="center">
								<p>0</p>
							</c>
							<c ca="center">
								<p>4</p>
							</c>
							<c ca="center">
								<p>5</p>
							</c>
							<c ca="center">
								<p>24</p>
							</c>
							<c ca="center">
								<p>54</p>
							</c>
							<c ca="center">
								<p>
									<b>73</b>
								</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Bacteriophage phi LC3</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>19</p>
							</c>
							<c ca="center">
								<p>
									<b>42</b>
								</p>
							</c>
							<c ca="center">
								<p>9</p>
							</c>
							<c ca="center">
								<p>25</p>
							</c>
							<c ca="center">
								<p>
									<b>14</b>
								</p>
							</c>
							<c ca="center">
								<p>24</p>
							</c>
							<c ca="center">
								<p>4</p>
							</c>
							<c ca="center">
								<p>
									<b>25</b>
								</p>
							</c>
							<c ca="center">
								<p>63</p>
							</c>
							<c ca="center">
								<p>83</p>
							</c>
							<c ca="center">
								<p>-</p>
							</c>
							<c ca="center">
								<p>-</p>
							</c>
							<c ca="center">
								<p>
									<b>36</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>48</b>
								</p>
							</c>
							<c ca="center">
								<p>13</p>
							</c>
							<c ca="center">
								<p>14</p>
							</c>
							<c ca="center">
								<p>58</p>
							</c>
							<c ca="center">
								<p>65</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p><it>Staphylococcus aureus </it>phage phi 13</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>12</p>
							</c>
							<c ca="center">
								<p>25</p>
							</c>
							<c ca="center">
								<p>
									<b>40</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>56</b>
								</p>
							</c>
							<c ca="center">
								<p>9</p>
							</c>
							<c ca="center">
								<p>19</p>
							</c>
							<c ca="center">
								<p>-</p>
							</c>
							<c ca="center">
								<p>-</p>
							</c>
							<c ca="center">
								<p>77</p>
							</c>
							<c ca="center">
								<p>94</p>
							</c>
							<c ca="center">
								<p>36</p>
							</c>
							<c ca="center">
								<p>38</p>
							</c>
							<c ca="center">
								<p>-</p>
							</c>
							<c ca="center">
								<p>-</p>
							</c>
							<c ca="center">
								<p>
									<b>13</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>26</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>62</b>
								</p>
							</c>
							<c ca="center">
								<p>71</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>The percentages are calculated by dividing the number of significantly similar sequences by the total number of sequences found by the regular expression search. Sequence similarity is determined by BLAST (bl2seq) [33] using BLOSUM45 with the indicated E-value cutoffs. Each sequence is "blasted" against its corresponding gene in the reference genome. The best cases are highlighted in bold.</p>
					</tblfn>
				</tbl>
				<suppl id="S1">
					<title>
						<p>Additional file 1</p>
					</title>
					<text>
						<p>Supplementary material &#8211; the list of all phages and clustering result</p>
					</text>
					<file name="1471-2105-8-S4-S6-S1.doc">
						<p>Click here for file</p>
					</file>
				</suppl>
				<p>We performed leave-one-out (LOO) cross validation to evaluate the prediction performance for these genes. For each gene function, we ran the cross validation in each cluster individually over a discrete range of values of the kernel parameter <it>&#963; </it>of the Gaussian RBF kernel <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. The <it>&#963; </it>value that gave the best accuracy was chosen and used for all subsequent predictions for that function. The prediction accuracies shown in Table <tblr tid="T3">3</tblr> are the averages of the cross validation results across all clusters.</p>
				<tbl id="T3">
					<title>
						<p>Table 3</p>
					</title>
					<caption>
						<p>Prediction settings and results for the nine gene functions.</p>
					</caption>
					<tblbdy cols="10">
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>Terminase</p>
							</c>
							<c ca="center">
								<p>Portal</p>
							</c>
							<c ca="center">
								<p>Head</p>
							</c>
							<c ca="center">
								<p>Tail</p>
							</c>
							<c ca="center">
								<p>Tape measure</p>
							</c>
							<c ca="center">
								<p>Prohead protease</p>
							</c>
							<c ca="center">
								<p>Lysin</p>
							</c>
							<c ca="center">
								<p>Holin</p>
							</c>
							<c ca="center">
								<p>Integrase</p>
							</c>
						</r>
						<r>
							<c cspan="10">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="center">
								<p># positive samples</p>
							</c>
							<c ca="center">
								<p>93</p>
							</c>
							<c ca="center">
								<p>83</p>
							</c>
							<c ca="center">
								<p>26</p>
							</c>
							<c ca="center">
								<p>26</p>
							</c>
							<c ca="center">
								<p>21</p>
							</c>
							<c ca="center">
								<p>11</p>
							</c>
							<c ca="center">
								<p>25</p>
							</c>
							<c ca="center">
								<p>69</p>
							</c>
							<c ca="center">
								<p>67</p>
							</c>
						</r>
						<r>
							<c ca="center">
								<p># negative samples</p>
							</c>
							<c ca="center">
								<p>308</p>
							</c>
							<c ca="center">
								<p>195</p>
							</c>
							<c ca="center">
								<p>107</p>
							</c>
							<c ca="center">
								<p>133</p>
							</c>
							<c ca="center">
								<p>82</p>
							</c>
							<c ca="center">
								<p>28</p>
							</c>
							<c ca="center">
								<p>45</p>
							</c>
							<c ca="center">
								<p>213</p>
							</c>
							<c ca="center">
								<p>102</p>
							</c>
						</r>
						<r>
							<c ca="center">
								<p># clusters</p>
							</c>
							<c ca="center">
								<p>17</p>
							</c>
							<c ca="center">
								<p>15</p>
							</c>
							<c ca="center">
								<p>7</p>
							</c>
							<c ca="center">
								<p>6</p>
							</c>
							<c ca="center">
								<p>7</p>
							</c>
							<c ca="center">
								<p>4</p>
							</c>
							<c ca="center">
								<p>6</p>
							</c>
							<c ca="center">
								<p>16</p>
							</c>
							<c ca="center">
								<p>12</p>
							</c>
						</r>
						<r>
							<c ca="center">
								<p>
									<b>Prediction Accuracy at <it>t </it>= 0.1(%)</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>86.9</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>85.89</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>67.87</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>83.33</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>75.68</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>66.67</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>100</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>79.5</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>82.18</b>
								</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>The total numbers (#) of positive training samples, negative training samples and clusters involved with each gene class are shown. Accuracy values were computed using leave-one-out cross validation. K-Means adaptive threshold <it>t </it>= 0.1. Gaussian RBF (GRBF) kernel <it>&#963; </it>= 2 for Head and Tail; <it>&#963; </it>= 11.3 for all other cases.</p>
					</tblfn>
				</tbl>
				<p><it>K</it>-fold cross validation may also be used to evaluate the prediction performance, and accuracies are expected to be lower with a smaller <it>K </it>value. For instance, the prediction accuracy for Terminase is 79.8% for <it>K </it>= 4 and 62.3% for <it>K </it>= 2. However, LOO is better suited to our overall purpose: one primary objective of the cross validation is to find a near-optimal <it>&#963; </it>value for each gene class for use in future predictions. Since most clusters contain only a very small proportion of genomes that require genuine prediction, they are best simulated by LOO, where only one genome is taken out for prediction testing at a time.</p>
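				<p>The selection of <it>&#963; </it>by LOO cross validation can be sketched as follows. This is an illustration under stated assumptions rather than the SynFPS code: the classifier is a simple kernel-weighted vote used as a stand-in for the SVM, the kernel is assumed to take the common form exp(-d^2/(2 sigma^2)), and the one-dimensional toy samples and all function names are hypothetical. The grid values 2.0 and 11.3 are taken from Table 3; 0.5 is added only for the example.</p>
				<p>
```python
import math


def rbf(x, y, sigma):
    """Gaussian RBF kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def predict(train, x, sigma):
    """Stand-in classifier (not an SVM): sign of the kernel-weighted label sum."""
    score = sum(label * rbf(xi, x, sigma) for xi, label in train)
    return 1 if score > 0 else -1


def loo_accuracy(samples, sigma):
    """Hold out each sample once, predict it from the rest, count the hits."""
    correct = 0
    for i, (x, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        correct += predict(train, x, sigma) == label
    return correct / len(samples)


def best_sigma(samples, grid):
    """Return the first sigma in the grid with the highest LOO accuracy."""
    return max(grid, key=lambda sigma: loo_accuracy(samples, sigma))


# Hypothetical toy data: (feature vector, label), label +1 or -1.
samples = [((0.0,), 1), ((0.2,), 1), ((0.4,), 1),
           ((3.0,), -1), ((3.2,), -1), ((3.4,), -1), ((3.6,), -1)]
grid = [0.5, 2.0, 11.3]

for sigma in grid:
    print(sigma, loo_accuracy(samples, sigma))
print("selected sigma:", best_sigma(samples, grid))
```
				</p>
				<p>In this sketch, ties between <it>&#963; </it>values are broken in favour of the first value in the grid; the source does not state how SynFPS resolves ties.</p>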
				<p>The prediction accuracies average ~80%. The 100% prediction accuracy for lysin can be explained by the strong context relationship between lysin and holin. Since a lysin is always accompanied by a holin immediately beside it <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>, SynFPS can easily identify the lysin gene if it already knows the position of the holin. However, the converse is not true: the identification of holin genes may not depend upon the presence of lysin. Consequently, the prediction accuracy for holin is not as high.</p>
				<p>These prediction accuracies reflect the sensitivity of the system (true positives/(true positives + false negatives)). The specificity of the system (true negatives/(true negatives + false positives)), on the other hand, is always larger than the sensitivity because of two system features. Firstly, we allow only a single positive prediction for each genome (see Methods). Thus, the number of false negatives is always the same as the number of false positives, implying that the specificities always scale together with the sensitivities. Secondly, the number of negative training samples (hence true negatives) is always larger than the number of positive training samples (hence true positives), and consequently Specificity &gt; Sensitivity. One reason for using LOO cross validation accuracies to evaluate the system is the lack of a benchmark for our problem. However, it may be noteworthy that other genomic-context based methods for the prediction of functional elements have similar reported accuracies, ranging from 72% to 80% <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>.</p>
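				<p>The relationship between sensitivity and specificity described above can be checked with a short calculation. The sketch below assumes, as stated in the text, that every wrong prediction produces exactly one false negative and one false positive. The sample counts are the Terminase totals from Table 3, whereas the number of errors (12) is a hypothetical value chosen for illustration, and the function names are not part of SynFPS.</p>
				<p>
```python
def sensitivity(tp, fn):
    """True positives / (true positives + false negatives)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negatives / (true negatives + false positives)."""
    return tn / (tn + fp)


def rates_single_prediction(n_pos, n_neg, n_errors):
    """Sensitivity and specificity when each error costs one FN and one FP."""
    tp = n_pos - n_errors
    fn = n_errors
    fp = n_errors          # single positive prediction per genome: FP = FN
    tn = n_neg - n_errors
    return sensitivity(tp, fn), specificity(tn, fp)


# 93 positive and 308 negative samples (Terminase, Table 3);
# 12 errors is a hypothetical figure used only for this example.
sens, spec = rates_single_prediction(93, 308, 12)
print(round(sens, 3), round(spec, 3))
```
				</p>
				<p>With equal numbers of false negatives and false positives and more negative than positive samples, the specificity exceeds the sensitivity whenever at least one error occurs; when no errors occur, both rates equal 1.</p>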
			</sec>
			<sec>
				<st>
					<p>Trade-off between prediction coverage and prediction accuracy</p>
				</st>
				<p>We have examined the effect of the K-Means adaptive threshold <it>t </it>on the prediction accuracies. The value of <it>t </it>&#8712; (0,1] implicitly specifies the maximum tolerable distance between any two genomes within a cluster. As a result, as <it>t </it>&#8594; 0, there are as many clusters as genomes, and as <it>t </it>&#8594; 1, there is only one cluster. Neither case provides useful information for prediction. Since there is no analytical method for finding a good value of <it>t</it>, we have run SynFPS over a range of values from <it>t </it>= 0.05 to <it>t </it>= 0.3. Values outside this range generate either too many or too few clusters (average number of genomes per cluster &lt; 2 or number of clusters &lt; 3, respectively). Different <it>t </it>values lead to different numbers of genomes being covered by the automated prediction (the prediction coverage). Genomes within the "coverage" are those for which SynFPS has made a classification decision; the remaining genomes are discarded or ignored by SynFPS. Examples of genomes outside the coverage are:</p>
				<p>&#8226; genomes not containing the gene being predicted (discarded during cross validation only)</p>
				<p>&#8226; genomes that are in a cluster on their own</p>
				<p>&#8226; within a cluster, if there are fewer than two genomes that contain the gene being predicted, then all the genomes are discarded</p>
				<p>&#8226; genomes with genomic context different to the consensus of the group may be discarded</p>
				<p>Figure <figr fid="F3">3</figr> shows the plot of prediction accuracy versus prediction coverage. The highest coverage values for all gene functions are about 20&#8211;25%, achieved by using a <it>t </it>value of ~0.1. The results indicate that we can obtain a higher accuracy by lowering the coverage. However, the ultimate purpose of the system is to make genuine predictions for genomes in which the genes being predicted have not yet been identified. Lowering the coverage can cause many of these genomes to be ignored. Consequently, one must find a balance between accuracy and coverage according to the intended task.</p>
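				<p>The first three exclusion rules listed above can be expressed as a simple filter over the clustering result, as in the following sketch for the cross validation setting. The data layout (clusters as lists of genome name and gene presence pairs), the function names and the definition of coverage as covered genomes divided by all genomes are assumptions made for this example. The fourth rule, concerning genomes whose context differs from the group consensus, is not modelled because its criterion is not specified here.</p>
				<p>
```python
def covered_genomes(clusters):
    """Return the genomes that remain in coverage during cross validation.

    clusters: list of clusters, each a list of (genome_name, has_gene) pairs.
    """
    covered = []
    for cluster in clusters:
        if len(cluster) == 1:
            continue  # genome in a cluster on its own
        with_gene = [name for name, has_gene in cluster if has_gene]
        if 2 > len(with_gene):
            continue  # fewer than two genomes contain the gene: discard all
        # Genomes without the gene are discarded during cross validation.
        covered.extend(with_gene)
    return covered


def coverage(clusters):
    """Fraction of all genomes that remain in coverage (assumed definition)."""
    total = sum(len(cluster) for cluster in clusters)
    return len(covered_genomes(clusters)) / total


# Hypothetical clustering of six genomes g1..g6.
clusters = [
    [("g1", True), ("g2", True), ("g3", False)],
    [("g4", True)],
    [("g5", True), ("g6", False)],
]
print(covered_genomes(clusters), coverage(clusters))
```
				</p>
				<p>A smaller threshold <it>t </it>produces more and smaller clusters, so more genomes fall under the first two rules in this sketch, which is consistent with the lower coverage observed at small <it>t </it>values.</p>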
			</sec>
			<sec>
				<st>
					<p>Functions predicted for 3 uncharacterised genes and 12 sequence-dissimilar genes</p>
				</st>
				<p>Using the maximum coverage and the <it>&#963; </it>values optimized by LOO cross validation, we have generated predictions for genomes in which certain gene functions had not already been detected. SynFPS identifies which genes within those genomes correspond to the functions of interest. The prediction outcomes are listed in Table <tblr tid="T4">4</tblr>.</p>
				<tbl id="T4">
					<title>
						<p>Table 4</p>
					</title>
					<caption>
						<p>Gene function prediction results for bacteriophage genomes.</p>
					</caption>
					<tblbdy cols="5">
						<r>
							<c ca="left">
								<p>
									<b>Gene (Phage abbrev.<sup>&#8224;</sup>: CDS location)</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>Existing function annotation</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>Predicted function</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>Supporting phages (phage abbrev.<sup>&#8224;</sup>)</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>SS</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="5">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>69: 4704..5324</p>
							</c>
							<c ca="left">
								<p>
									<it>Uncharacterised</it>
								</p>
							</c>
							<c ca="left">
								<p>Prohead protease</p>
							</c>
							<c ca="left">
								<p>PVL</p>
							</c>
							<c ca="left">
								<p>N</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>phi-105: 7918..8520</p>
							</c>
							<c ca="left">
								<p>
									<it>Uncharacterised</it>
								</p>
							</c>
							<c ca="left">
								<p>Major tail protein</p>
							</c>
							<c ca="left">
								<p>Cherry, Gamma, 3A, 47</p>
							</c>
							<c ca="left">
								<p>Y</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Tuc2009: 23727..24224</p>
							</c>
							<c ca="left">
								<p>
									<it>Uncharacterised</it>
								</p>
							</c>
							<c ca="left">
								<p>Major tail protein</p>
							</c>
							<c ca="left">
								<p>bIL285, bIL286, bIL309, ul36, phiSLT</p>
							</c>
							<c ca="left">
								<p>Y</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>A118: 4590..5159</p>
							</c>
							<c ca="left">
								<p>determines size and shape of viral capsid, putative scaffolding protein</p>
							</c>
							<c ca="left">
								<p>Prohead protease</p>
							</c>
							<c ca="left">
								<p>PVL</p>
							</c>
							<c ca="left">
								<p>N</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>71: 4149..4748</p>
							</c>
							<c ca="left">
								<p>Phage minor structural protein, GP20</p>
							</c>
							<c ca="left">
								<p>Prohead protease</p>
							</c>
							<c ca="left">
								<p>PVL</p>
							</c>
							<c ca="left">
								<p>N</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>phi ETA: 21172..21768</p>
							</c>
							<c ca="left">
								<p>minor capsid protein</p>
							</c>
							<c ca="left">
								<p>Prohead protease</p>
							</c>
							<c ca="left">
								<p>phi 13</p>
							</c>
							<c ca="left">
								<p>N</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>phi 11: 21115..21750</p>
							</c>
							<c ca="left">
								<p>phi Mu50B-like protein</p>
							</c>
							<c ca="left">
								<p>Prohead protease</p>
							</c>
							<c ca="left">
								<p>phi 13</p>
							</c>
							<c ca="left">
								<p>N</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>P22: 38551..38991</p>
							</c>
							<c ca="left">
								<p>lysozyme, endolysin_autolysin</p>
							</c>
							<c ca="left">
								<p>Lysin</p>
							</c>
							<c ca="left">
								<p>V</p>
							</c>
							<c ca="left">
								<p>N</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Sf6: 3975..4859</p>
							</c>
							<c ca="left">
								<p>putative scaffolding protein</p>
							</c>
							<c ca="left">
								<p>Prohead protease</p>
							</c>
							<c ca="left">
								<p>ST64B, V</p>
							</c>
							<c ca="left">
								<p>N</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>HK620: 23655..24539</p>
							</c>
							<c ca="left">
								<p>scaffold protein</p>
							</c>
							<c ca="left">
								<p>Prohead protease</p>
							</c>
							<c ca="left">
								<p>P27</p>
							</c>
							<c ca="left">
								<p>N</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>sk1: 8582..11581</p>
							</c>
							<c ca="left">
								<p>Mu-like prophage protein, phage-related protein [function unknown]</p>
							</c>
							<c ca="left">
								<p>Tape measure</p>
							</c>
							<c ca="left">
								<p>bIL170</p>
							</c>
							<c ca="left">
								<p>Y</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>77: 19572..21026</p>
							</c>
							<c ca="left">
								<p>CHAP domain, Ami_3, SH3 domain</p>
							</c>
							<c ca="left">
								<p>Lysin</p>
							</c>
							<c ca="left">
								<p>phi-105</p>
							</c>
							<c ca="left">
								<p>N</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>77: 3291..4028</p>
							</c>
							<c ca="left">
								<p>Clp protease</p>
							</c>
							<c ca="left">
								<p>Prohead protease</p>
							</c>
							<c ca="left">
								<p>Cherry, Gamma, phi-105</p>
							</c>
							<c ca="left">
								<p>N</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>phiSLT: 20002..20775</p>
							</c>
							<c ca="left">
								<p>protease, clp protease</p>
							</c>
							<c ca="left">
								<p>Prohead protease</p>
							</c>
							<c ca="left">
								<p>bIL285, bIL309, phiPV83</p>
							</c>
							<c ca="left">
								<p>N</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>phiSLT: 38923..40377</p>
							</c>
							<c ca="left">
								<p>amidase, CHAP, Ami_3, SH3b</p>
							</c>
							<c ca="left">
								<p>Lysin</p>
							</c>
							<c ca="left">
								<p>bIL285, bIL286, bIL309, ul36, 315.5, 315.6</p>
							</c>
							<c ca="left">
								<p>Y</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>bIL286: 21258..21965</p>
							</c>
							<c ca="left">
								<p>protease, clp protease</p>
							</c>
							<c ca="left">
								<p>Prohead protease</p>
							</c>
							<c ca="left">
								<p>bIL285, bIL309, phiPV83</p>
							</c>
							<c ca="left">
								<p>N</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p><b>&#8224; </b>Full names of the phages are as follows with abbreviations in bold: <it>Staphylococcus aureus </it>bacteriophage <b>PVL</b>, Bacteriophage <b>69</b>, Bacteriophage <b>A118</b>, Bacteriophage <b>71</b>, Bacteriophage <b>phi ETA</b>, <it>Staphylococcus aureus </it>phage <b>phi 11</b>, <it>Staphylococcus aureus </it>phage <b>phi 13</b>, Enterobacteria phage <b>P22</b>, Enterobacteria phage <b>Sf6</b>, <it>Salmonella typhimurium </it>bacteriophage <b>ST64B</b>, <it>Shigella flexneri </it>bacteriophage <b>V</b>, Bacteriophage <b>HK620</b>, Bacteriophage <b>P27</b>, Bacteriophage <b>sk1</b>, Bacteriophage <b>bIL170</b>, <it>Bacillus anthracis </it>phage <b>Cherry</b>, <it>Bacillus anthracis </it>phage <b>Gamma</b>, Bacteriophage <b>3A</b>, Bacteriophage <b>47</b>, Bacteriophage <b>phi-105</b>, Bacteriophage <b>77</b>, Bacteriophage <b>bIL285</b>, Bacteriophage <b>bIL286</b>, Bacteriophage <b>bIL309</b>, Bacteriophage <b>Tuc2009</b>, <it>Lactococcus </it>phage <b>ul36</b>, <it>Staphylococcus aureus </it>prophage <b>phiPV83</b>, <it>Staphylococcus aureus </it>temperate phage <b>phiSLT</b>, <it>Streptococcus pyogenes </it>phage <b>315.5</b>, <it>Streptococcus pyogenes </it>phage <b>315.6</b>.</p>
						<p/>
						<p>This is a subset of the predictions generated by SynFPS. SS refers to Sequence Similarity: N indicates that there is no significant sequence similarity between the target gene (first column) and any of the corresponding genes in the supporting phages (fourth column) within the same cluster; Y indicates that at least one of the corresponding genes shows significant similarity. BLAST-P with Blosum45 was used to test for similarity significance.</p>
					</tblfn>
				</tbl>
				<p>Three genes for which we predicted functions have no existing functional annotation in the database (marked <it>uncharacterised </it>in Table <tblr tid="T4">4</tblr>). Four genes in Table <tblr tid="T4">4</tblr> (those marked Y in the SS column) exhibit sequence similarity to their reference genes, suggesting that their predicted functions are supported by both sequence similarity and the genomic context information embedded in our system, such as gene order conservation and positional coupling. For the other genes, which show no sequence similarity (a total of 12 in Table <tblr tid="T4">4</tblr>), the predicted functions are supported only by the genomic context. It is noteworthy that sequence alignment based methods would have failed to find correspondences for these genes. Where annotations do exist in the database, the other prediction results complement them and therefore support the validity of our approach.</p>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Conclusion</p>
			</st>
			<p>We presented a novel genomic-context based method capable of predicting gene functions from a large collection of genomes. An adaptive K-Means clustering is used to distinguish groups of related genomes based on the conservation of gene order and the conservation of gene-to-gene distances. The clustering results serve as a platform from which the SVM extracts training data to perform classification based predictions. Nine common gene functions of bacteriophages were tested, and the LOO cross-validated prediction accuracy averages 80%. Functional predictions were also made for 3 uncharacterized genes and 12 genes that cannot be identified by sequence alignment.</p>
			<p>Although our experimental focus is on bacteriophages, the method may be extended to other microbial genomes. For example, conserved gene order <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp> and conserved gene-to-gene distances (positional coupling) <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr></abbrgrp> have been observed in bacterial genomes. These properties satisfy the underlying assumptions of our approach and suggest potential application of the method.</p>
		</sec>
		<sec>
			<st>
				<p>Methods</p>
			</st>
			<sec>
				<st>
					<p>Strategy overview &#8211; SynFPS</p>
				</st>
				<p>We present a novel method called Synteny-based Function Prediction System (SynFPS) capable of predicting gene functions among completed genomes based on the conservation of gene order (synteny) and the conservation of gene-to-gene distance. An overview of SynFPS is shown in Figure <figr fid="F1">1</figr>. The genome annotation database as shown in the figure defines the scope of analysis for the system. In our work, it consists of 296 phage genomes retrieved from GenBank (see Additional file <supplr sid="S1">1</supplr>).</p>
				<fig id="F1">
					<title>
						<p>Figure 1</p>
					</title>
					<caption>
						<p>Structure of the Synteny-based Function Prediction System (SynFPS)</p>
					</caption>
					<text>
						<p><b>Structure of the Synteny-based Function Prediction System (SynFPS)</b>. The dotted line represents the system boundary, outside of which lie the system inputs and outputs. A set of gene functions (A), specified in the form of regular expressions, is matched against the genome database (B) via the text processing unit (D); the result may then be refined (C). A clustering system (E) based on the synteny scores of the matching genes brings together genomes that show conservation of gene order and position (G). This information is used to generate a set of positive and negative data (genes) to train the classification system (F), which produces function prediction results (H).</p>
					</text>
					<graphic file="1471-2105-8-S4-S6-1"/>
				</fig>
				<p>SynFPS runs on Windows and is publicly available. It was developed in C# and requires the free Microsoft .NET Framework 2.0 to run. Bioperl 1.4 <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> is needed for data retrieval from public databases. Workstations with a single CPU of ~3.0 GHz and 1 GB of RAM are sufficient for reasonable performance over a collection of ~300 phages.</p>
			</sec>
			<sec>
				<st>
					<p>Identification of functionally similar genes using regular expressions</p>
				</st>
				<p>The system begins by identifying in the database a collection of genes that correspond to a set of user-specified gene functions. Instead of using sequence similarity as in many other methods <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B12">12</abbr></abbrgrp>, SynFPS identifies functionally similar genes using regular expressions <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. For example, to search for genes that encode the major head proteins of phages, one possible regular expression pattern is "(?&lt;!minor)\b(head|capsid) protein". This pattern is intended to include genes annotated with "head protein" or "capsid protein", except those prefixed by the term "minor". Regular expressions are used to cope with annotation discrepancies among coding sequences in databases that lack a controlled vocabulary. The regular expression syntax used in SynFPS follows the syntax defined for the .NET Framework <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>.</p>
				<p>Once a regular expression pattern is given, the system searches the annotation data of all the genomes that have been supplied to the program. By default, it identifies the coding sequence (CDS) regions in each genome and then tries to match the pattern against their annotated features such as "product", "function" and "note". The set of annotated features to be searched can be customised by the user. The search results can be displayed visually, with the genomes and matching genes illustrated. The display is interactive: annotations can be viewed, and search results can be modified by manually adding or removing genes.</p>
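				<p>The annotation search described above can be sketched in Python as follows. This is a minimal illustration and not the SynFPS implementation, which is written in C#. The record layout, the function name "find_matching_genes" and the example annotations are hypothetical. The lookbehind in this sketch includes a trailing space, "(?&lt;!minor )", and matching is case-insensitive; both are assumptions made so that an annotation such as "minor head protein" is rejected under the semantics of the Python "re" module.</p>
				<p><![CDATA[
```python
import re

# Assumed pattern: the lookbehind carries a trailing space so that
# "minor head protein" is rejected; IGNORECASE is also an assumption.
MAJOR_HEAD = re.compile(r"(?<!minor )\b(head|capsid) protein", re.IGNORECASE)

# Feature keys searched by default, as named in the text.
DEFAULT_FEATURES = ("product", "function", "note")


def find_matching_genes(cds_records, pattern=MAJOR_HEAD, features=DEFAULT_FEATURES):
    """Return the CDS records whose selected annotation features match."""
    hits = []
    for record in cds_records:
        texts = (record.get(key, "") for key in features)
        if any(pattern.search(text) for text in texts):
            hits.append(record)
    return hits


# Hypothetical annotation records, for illustration only.
example_cds = [
    {"location": "100..900", "product": "major capsid protein"},
    {"location": "950..1500", "product": "minor head protein"},
    {"location": "1600..2400", "note": "putative head protein"},
    {"location": "2500..3100", "product": "tail fibre"},
]

matched = [cds["location"] for cds in find_matching_genes(example_cds)]
print(matched)  # ['100..900', '1600..2400']
```
]]></p>
				<p>Passing a different tuple as "features" mimics the customisable set of annotated features; the records that match would then form the initial gene collection that a user may refine by hand.</p>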
				<p>Although genome annotation processes are often assisted by sequence alignment, many annotations are prepared manually by biologists who conducted research on the genomes. Therefore, the set of sequences found by annotation search may include functionally similar genes that show no sequence similarity. In the results section, we provide an assessment of sequence alignment in relation to regular expression search.</p>
			</sec>
			<sec>
				<st>
					<p>K-Means clustering to identify similar genomic context</p>
				</st>
				<p>The annotation search process leads to a mapping of genes across the genomes. This mapping provides the necessary information for a context based clustering. Let <it>G </it>= {<it>g</it><sub>1</sub>, <it>g</it><sub>2</sub>,..., <it>g</it><sub><it>n</it></sub>} be the set of all gene functions where <it>g </it>is a symbol representing a function and <it>n </it>is the total number of functional classes identified. Let <it>m </it>be the number of genomes in the database. We define <it>X</it><sub><it>k </it></sub>&#8838; <it>G</it>, <it>k </it>= 1,2,..., <it>m </it>to be the set of genes detected in genome <it>k </it>and <it>C</it><sub><it>kl </it></sub>= <it>C</it>(<it>X</it><sub><it>k</it></sub>, <it>X</it><sub><it>l</it></sub>) = <it>X</it><sub><it>k </it></sub>&#8745; <it>X</it><sub><it>l </it></sub>to be the common set of genes between genomes <it>k </it>and <it>l</it>. The genomic-context distance between two genomes <it>k </it>and <it>l </it>is defined as:</p>
				<p>
					<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-8-S4-S6-i1">
						<m:semantics>
							<m:mrow>
								<m:msub>
									<m:mi>D</m:mi>
									<m:mrow>
										<m:mi>k</m:mi>
										<m:mi>l</m:mi>
									</m:mrow>
								</m:msub>
								<m:mo>=</m:mo>
								<m:mfrac>
									<m:mrow>
										<m:mstyle displaystyle="true">
											<m:munder>
												<m:mo>&#8721;</m:mo>
												<m:mrow>
													<m:msub>
														<m:mi>g</m:mi>
														<m:mi>i</m:mi>
													</m:msub>
													<m:mo>,</m:mo>
													<m:msub>
														<m:mi>g</m:mi>
														<m:mi>j</m:mi>
													</m:msub>
													<m:mo>&#8712;</m:mo>
													<m:msub>
														<m:mi>C</m:mi>
														<m:mrow>
															<m:mi>k</m:mi>
															<m:mi>l</m:mi>
														</m:mrow>
													</m:msub>
													<m:mo>;</m:mo>
													<m:mi>i</m:mi>
													<m:mo>&lt;</m:mo>
													<m:mi>j</m:mi>
												</m:mrow>
											</m:munder>
											<m:mrow>
												<m:mrow>
													<m:mo>[</m:mo>
													<m:mrow>
														<m:msub>
															<m:mi>d</m:mi>
															<m:mi>k</m:mi>
														</m:msub>
														<m:mo stretchy="false">(</m:mo>
														<m:msub>
															<m:mi>g</m:mi>
															<m:mi>i</m:mi>
														</m:msub>
														<m:mo>,</m:mo>
														<m:msub>
															<m:mi>g</m:mi>
															<m:mi>j</m:mi>
														</m:msub>
														<m:mo stretchy="false">)</m:mo>
														<m:mo>&#8722;</m:mo>
														<m:msub>
															<m:mi>d</m:mi>
															<m:mi>l</m:mi>
														</m:msub>
														<m:mrow>
															<m:mo>(</m:mo>
															<m:mrow>
																<m:msub>
																	<m:mi>g</m:mi>
																	<m:mi>i</m:mi>
																</m:msub>
																<m:mo>,</m:mo>
																<m:msub>
																	<m:mi>g</m:mi>
																	<m:mi>j</m:mi>
																</m:msub>
															</m:mrow>
															<m:mo>)</m:mo>
														</m:mrow>
													</m:mrow>
													<m:mo>]</m:mo>
												</m:mrow>
											</m:mrow>
										</m:mstyle>
									</m:mrow>
									<m:mrow>
										<m:mrow>
											<m:mo>|</m:mo>
											<m:mrow>
												<m:msub>
													<m:mi>C</m:mi>
													<m:mrow>
														<m:mi>k</m:mi>
														<m:mi>l</m:mi>
													</m:mrow>
												</m:msub>
											</m:mrow>
											<m:mo>|</m:mo>
										</m:mrow>
									</m:mrow>
								</m:mfrac>
								<m:mo>+</m:mo>
								<m:mi>p</m:mi>
								<m:mrow>
									<m:mo>(</m:mo>
									<m:mrow>
										<m:mrow>
											<m:mo>|</m:mo>
											<m:mrow>
												<m:msub>
													<m:mi>X</m:mi>
													<m:mi>k</m:mi>
												</m:msub>
												<m:mo>&#8746;</m:mo>
												<m:msub>
													<m:mi>X</m:mi>
													<m:mi>l</m:mi>
												</m:msub>
											</m:mrow>
											<m:mo>|</m:mo>
										</m:mrow>
										<m:mo>&#8722;</m:mo>
										<m:mrow>
											<m:mo>|</m:mo>
											<m:mrow>
												<m:msub>
													<m:mi>C</m:mi>
													<m:mrow>
														<m:mi>k</m:mi>
														<m:mi>l</m:mi>
													</m:mrow>
												</m:msub>
											</m:mrow>
											<m:mo>|</m:mo>
										</m:mrow>
									</m:mrow>
									<m:mo>)</m:mo>
								</m:mrow>
								<m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
								<m:mrow>
									<m:mo>(</m:mo>
									<m:mn>1</m:mn>
									<m:mo>)</m:mo>
								</m:mrow>
							</m:mrow>
							<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegyvzYrwyUfgarqqtubsr4rNCHbGeaGqiA8vkIkVAFgIELiFeLkFeLk=iY=Hhbbf9v8qqaqFr0xc9pk0xbba9q8WqFfeaY=biLkVcLq=JHqVepeea0=as0db9vqpepesP0xe9Fve9Fve9GapdbaqaaeGacaGaaiaabeqaamqadiabaaGcbaGaemiraq0aaSbaaSqaaiabdUgaRjabdYgaSbqabaGccqGH9aqpdaWcaaqaamaaqafabaWaamWaaeaacqWGKbazdaWgaaWcbaGaem4AaSgabeaakiabcIcaOiabdEgaNnaaBaaaleaacqWGPbqAaeqaaOGaeiilaWIaem4zaC2aaSbaaSqaaiabdQgaQbqabaGccqGGPaqkcqGHsislcqWGKbazdaWgaaWcbaGaemiBaWgabeaakmaabmaabaGaem4zaC2aaSbaaSqaaiabdMgaPbqabaGccqGGSaalcqWGNbWzdaWgaaWcbaGaemOAaOgabeaaaOGaayjkaiaawMcaaaGaay5waiaaw2faaaWcbaGaem4zaC2aaSbaaWqaaiabdMgaPbqabaWccqGGSaalcqWGNbWzdaWgaaadbaGaemOAaOgabeaaliabgIGiolabdoeadnaaBaaameaacqWGRbWAcqWGSbaBaeqaaSGaei4oaSJaemyAaKMaeyipaWJaemOAaOgabeqdcqGHris5aaGcbaWaaqWaaeaacqWGdbWqdaWgaaWcbaGaem4AaSMaemiBaWgabeaaaOGaay5bSlaawIa7aaaacqGHRaWkcqWGWbaCdaqadaqaamaaemaabaGaemiwaG1aaSbaaSqaaiabdUgaRbqabaGccqWIQisvcqWGybawdaWgaaWcbaGaemiBaWgabeaaaOGaay5bSlaawIa7aiabgkHiTmaaemaabaGaem4qam0aaSbaaSqaaiabdUgaRjabdYgaSbqabaaakiaawEa7caGLiWoaaiaawIcacaGLPaaacaWLjaGaaCzcamaabmaabaGaeGymaedacaGLOaGaayzkaaaaaa@8F3A@</m:annotation>
						</m:semantics>
					</m:math>
				</p>
				<p>where <it>d</it><sub><it>k</it></sub>(<it>g</it><sub><it>i</it></sub>, <it>g</it><sub><it>j</it></sub>) = location of <it>g</it><sub><it>j </it></sub>- location of <it>g</it><sub><it>i </it></sub>in genome <it>k</it>, |<it>s</it>| denotes the size of a set <it>s </it>and <it>p </it>is a parameter that penalizes genomes that do not share the same set of genes. The summation term reflects the conservation of gene order as well as the conservation of gene-to-gene distances between the two genomes. The second term reflects gene co-occurrence.</p>
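				<p>As a worked illustration of Equation (1), the following Python sketch computes the genomic-context distance for two genomes given as dictionaries that map a function symbol to a gene location. The function name "context_distance", the dictionary layout and the example locations are hypothetical. The bracketed difference is kept signed, as written in Equation (1); the index order of the functions is represented by sorted symbol order, and the handling of genome pairs with no common genes (first term taken as zero) is an assumption, since the text does not cover that case.</p>
				<p><![CDATA[
```python
from itertools import combinations


def context_distance(loc_k, loc_l, p=1.0):
    """Equation (1): loc_k and loc_l map function symbols to gene locations."""
    common = sorted(set(loc_k) & set(loc_l))      # C_kl, in a fixed symbol order
    union_size = len(set(loc_k) | set(loc_l))     # |X_k U X_l|
    penalty = p * (union_size - len(common))      # co-occurrence term
    if not common:
        # Assumption: with no shared genes the first term is taken as zero.
        return penalty
    total = 0.0
    for g_i, g_j in combinations(common, 2):      # every pair with i before j
        d_k = loc_k[g_j] - loc_k[g_i]
        d_l = loc_l[g_j] - loc_l[g_i]
        total += d_k - d_l                        # signed, as in Equation (1)
    return total / len(common) + penalty


# Hypothetical gene locations (in base pairs) for three genomes.
genome_a = {"A": 1000, "B": 3000, "C": 7000}
genome_b = {"A": 500, "B": 2500, "D": 9000}
genome_c = {"A": 0, "B": 1500}

print(context_distance(genome_a, genome_b))  # 2.0
print(context_distance(genome_a, genome_c))  # 251.0
```
]]></p>
				<p>In the first call the A-to-B spacing is identical in both genomes, so only the co-occurrence penalty for the two unshared functions (C and D) contributes. In the second call the spacing differs by 500 bp across two common genes and one function is unshared.</p>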
				<p>We represent each genome <it>k </it>by a vector of distance values: <it>F</it><sub><it>k </it></sub>= [<it>D</it><sub><it>k</it>1</sub>, <it>D</it><sub><it>k</it>2</sub>,...,<it>D</it><sub><it>km</it></sub>] and then we perform K-Means clustering over the set <it>S </it>= {<it>F</it><sub><it>k </it></sub>| <it>k </it>= 1,..., <it>m</it>}. We implemented an adaptive technique such that the number of clusters grows incrementally until the size of the largest cluster is smaller than a specified threshold. The threshold <it>t </it>&#8712; (0,1] describes the fractional size of the Euclidean space spanned by <it>S</it>. Each resulting cluster contains genomes with high resemblance in gene distribution. Alternative adaptive clustering methods include dynamic self-organizing maps <abbrgrp><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr></abbrgrp>.</p>
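				<p>The adaptive growth of the number of clusters can be sketched as below. This is an illustration under stated assumptions rather than the SynFPS implementation: the "size" of a cluster is taken to be its diameter (largest pairwise Euclidean distance) divided by the diameter of the whole set <it>S</it>, the K-Means routine is plain Lloyd iteration with a deterministic farthest-point initialisation, and all function names and example vectors are hypothetical.</p>
				<p><![CDATA[
```python
import math
from itertools import combinations


def diameter(points):
    """Largest pairwise Euclidean distance within a group of vectors."""
    return max((math.dist(a, b) for a, b in combinations(points, 2)), default=0.0)


def kmeans(points, k, iterations=50):
    """Plain Lloyd's K-Means with a deterministic farthest-point start."""
    centres = [points[0]]
    while len(centres) != k:
        centres.append(max(points, key=lambda q: min(math.dist(q, c) for c in centres)))
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for q in points:
            nearest = min(range(k), key=lambda i: math.dist(q, centres[i]))
            clusters[nearest].append(q)
        updated = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else centre
            for group, centre in zip(clusters, centres)
        ]
        if updated == centres:
            break
        centres = updated
    return [group for group in clusters if group]


def adaptive_kmeans(points, t):
    """Add clusters until the widest one is below the fraction t of the data span."""
    span = diameter(points)
    clusters = [list(points)] if points else []
    for k in range(2, len(points) + 1):
        if span == 0 or max(diameter(g) for g in clusters) / span < t:
            break
        clusters = kmeans(points, k)
    return clusters


# Hypothetical two-dimensional stand-ins for the distance vectors F_k.
vectors = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 20)]
result = adaptive_kmeans(vectors, t=0.2)
print(len(result))  # 3
```
]]></p>
				<p>Smaller values of <it>t </it>force more, tighter clusters under this reading of the threshold, which is in line with the reported trade-off between coverage and accuracy.</p>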
			</sec>
			<sec>
				<st>
					<p>Support Vector Machines for function prediction</p>
				</st>
				<p>The clusters of genomes are analysed separately and individually in the last stage of the system. For each cluster, we use the information of the previously identified genes to predict the functions of other genes that exhibit similar context. This is achieved by extracting a set of genes from the cluster and converting them into positive and negative training data for a discriminative classification. Positive data are formed by the group of genes previously identified by the system during the regular expression match plus any manually added genes, with each gene function representing one class. Negative data comprise the genes that are neighbours of the positive genes. The size of the neighbourhood is determined by the statistics of the gene locations in that particular cluster. We use a 99% confidence interval on the gene locations of each class to determine the range in which neighbour genes are to be included. This interval also determines the set of candidate genes on which function predictions are performed (see Figure <figr fid="F2">2</figr>). The discriminative classification is carried out by a Support Vector Machine (SVM) <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>, which has been reported to give superior results in a variety of biological applications <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr><abbr bid="B31">31</abbr></abbrgrp>. For each gene function, the SVM produces a binary result on each candidate gene indicating whether or not the gene belongs to that function class. Since the number of gene functions is specified by the user and is not likely to cover every possible function, only a subset of the candidate genes &#8211; those with positive results &#8211; will eventually be assigned predicted functions.</p>
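				<p>One way to read the 99% interval on gene locations is sketched below. The text does not specify how the interval is computed, so the normal approximation (mean plus or minus z standard deviations of the observed locations) is an assumption, as are the function names and the example data.</p>
				<p><![CDATA[
```python
from statistics import NormalDist, mean, stdev


def candidate_window(locations, confidence=0.99):
    """Return (low, high) bounds around the observed locations of one gene class."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # about 2.576 for 99%
    centre = mean(locations)
    spread = stdev(locations) if len(locations) > 1 else 0.0
    return centre - z * spread, centre + z * spread


def genes_in_window(genes, window):
    """Select genes, given as (name, location), that fall inside the window."""
    low, high = window
    return [name for name, location in genes if low <= location <= high]


# Hypothetical locations of class A genes in the genomes of one cluster.
class_a_locations = [4000, 4200, 4100, 3900]
window = candidate_window(class_a_locations)

# Hypothetical genes of a genome in which class A has not been identified.
unlabelled = [("orf1", 2500), ("orf2", 3800), ("orf3", 4300), ("orf4", 6000)]
candidates = genes_in_window(unlabelled, window)
print(candidates)  # ['orf2', 'orf3']
```
]]></p>
				<p>In this reading, the same window would serve two purposes: genes of the training genomes that fall inside it but are not positives become negative examples, and genes of the target genome that fall inside it become prediction candidates.</p>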
				<fig id="F2">
					<title>
						<p>Figure 2</p>
					</title>
					<caption>
						<p>An illustration of a cluster containing four genomes</p>
					</caption>
					<text>
						<p><b>An illustration of a cluster containing four genomes</b>. Performing function prediction over gene class "A" consists of two steps: i) perform Leave-One-Out cross validation over the first three genomes and hence adapt to the optimal kernel parameters, ii) find A in the bottom genome within the confidence interval. Since the distances between A and B genes are the most conserved, class B will act as the reference genes for computing relative positions for class A genes for use as one of the training features.</p>
					</text>
					<graphic file="1471-2105-8-S4-S6-2"/>
				</fig>
				<fig id="F3">
					<title>
						<p>Figure 3</p>
					</title>
					<caption>
						<p>A plot of cross-validated prediction accuracy versus prediction coverage of the genomes in the database (296)</p>
					</caption>
					<text>
						<p><b>A plot of cross-validated prediction accuracy versus prediction coverage of the genomes in the database (296)</b>. Prediction coverage indicates the percentage of genomes that have been included in the leave-one-out cross validations using SynFPS. The maximum coverage of each gene function is limited by the number of its occurrences detected in the database. The coverage is varied by using different adaptive thresholds for the K-Means clustering.</p>
					</text>
					<graphic file="1471-2105-8-S4-S6-3"/>
				</fig>
				<p>To enhance prediction accuracy, we force a unique positive prediction in every genome within a cluster. This is based on the assumption that all pairs of genomes within a cluster have a one-to-one mapping of genes (gene correspondence). The decision values generated by the SVM indicate the relative positiveness of each candidate gene. Consequently, the gene with the strongest decision value is chosen as the positive prediction.</p>
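				<p>The selection rule can be illustrated with a few lines of Python. The decision values are assumed to have been produced by an already trained SVM; the dictionary layout, the function name and the numbers are hypothetical.</p>
				<p><![CDATA[
```python
def pick_unique_positive(decision_values):
    """For each genome, keep only the candidate with the largest decision value.

    decision_values maps a genome name to {candidate gene: SVM decision value}.
    """
    return {
        genome: max(candidates, key=candidates.get)
        for genome, candidates in decision_values.items()
        if candidates
    }


# Hypothetical decision values for the candidate genes of two genomes.
scores = {
    "genome_1": {"orf2": -0.4, "orf3": 1.3, "orf5": 0.2},
    "genome_2": {"orf7": -0.9, "orf8": -0.1},
}
predictions = pick_unique_positive(scores)
print(predictions)  # {'genome_1': 'orf3', 'genome_2': 'orf8'}
```
]]></p>
				<p>Note that in the second genome the chosen gene has a negative decision value: because one prediction is forced per genome, the least negative candidate is still selected in this sketch.</p>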
				<p>In order to apply SVM, each gene is converted into a numeric vector capturing the following features: composition, normalized van der Waals volume, hydrophobicity, polarity <abbrgrp><abbr bid="B30">30</abbr><abbr bid="B32">32</abbr></abbrgrp>, pairwise similarity scores against other genes in the database <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>, relative position and gene size. To compute the "relative position", the system first finds the gene class which has the most conserved distance to the gene under current prediction. For example, as demonstrated in Figure <figr fid="F2">2</figr>, if we are making predictions over class A, then class B will be chosen as the reference for computing the relative positions because the distances between class B genes and class A genes are the most conserved. The relative position of a gene in class A is then computed as the distance between itself and the class B gene in the corresponding genome.</p>
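				<p>The choice of a reference class and the resulting "relative position" feature can be sketched as follows. "Most conserved" is interpreted here as the smallest variance of the inter-gene distance across the genomes of a cluster, and the relative position is kept as a signed difference; both are assumptions, and the function names and locations are hypothetical.</p>
				<p><![CDATA[
```python
from statistics import pvariance


def distance_variance(cluster, target, cls):
    """Variance of the target-to-cls distance over genomes carrying both genes."""
    distances = [g[target] - g[cls] for g in cluster if target in g and cls in g]
    return pvariance(distances) if len(distances) >= 2 else float("inf")


def choose_reference_class(cluster, target):
    """Pick the class whose distance to the target class varies least."""
    others = sorted(set().union(*cluster) - {target})
    return min(others, key=lambda cls: distance_variance(cluster, target, cls))


def relative_position(genome, target, reference):
    """Signed distance from the reference gene to the target gene."""
    return genome[target] - genome[reference]


# Hypothetical gene locations for three genomes in one cluster.
cluster = [
    {"A": 5000, "B": 6200, "C": 1000},
    {"A": 5100, "B": 6300, "C": 2500},
    {"A": 4800, "B": 6010, "C": 900},
]
reference = choose_reference_class(cluster, "A")
print(reference)                                      # B
print(relative_position(cluster[0], "A", reference))  # -1200
```
]]></p>
				<p>Here the A-to-B spacing is nearly constant across the three genomes while the A-to-C spacing is not, so class B is chosen as the reference, mirroring the situation illustrated in Figure <figr fid="F2">2</figr>.</p>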
				<p>The pairwise similarity scores have been observed to improve classification accuracies. These scores represent the distance between a gene and every other gene in the database <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. However, it should be emphasized that while these sequence similarity scores strengthen the feature vectors, the system does not rely on significant sequence similarity to detect gene correspondence.</p>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Availability and requirements</p>
			</st>
			<p><b>Project name</b>: SynFPS</p>
			<p><b>Project website</b>: <url>http://www.synteny.net/</url></p>
			<p><b>Operating system</b>: Microsoft Windows family</p>
			<p><b>Other requirements</b>: Microsoft .NET Framework 2.0 (free), Bioperl 1.4 (optional)</p>
			<p><b>Any restrictions to use by non-academics</b>: None</p>
		</sec>
		<sec>
			<st>
				<p>Abbreviations</p>
			</st>
			<p><b>CDS </b>Coding Sequence; <b>HGT </b>Horizontal gene transfer; <b>LOO </b>Leave-one-out; <b>SVM </b>Support Vector Machine; <b>SynFPS </b>Synteny-based Function Prediction System</p>
		</sec>
		<sec>
			<st>
				<p>Competing interests</p>
			</st>
			<p>The authors declare that they have no competing interests.</p>
		</sec>
		<sec>
			<st>
				<p>Authors' contributions</p>
			</st>
			<p>JL conceived of the study, designed the software and drafted the manuscript. SKH supervised the work and participated in results evaluation. ST conceived of the clustering design and gave expertise in bacteriophage analysis. CIK participated in the SVM predictions. All authors participated in preparing the manuscript, and read and approved the final version.</p>
		</sec>
	</bdy>
	<bm>
		<ack>
			<sec>
				<st>
					<p>Acknowledgements</p>
				</st>
				<p>We thank Bill Chang and Arthur Hsu for their advice on this work and Zhi Feng Zhu for his assistance in software implementation.</p>
				<p>This article has been published as part of <it>BMC Bioinformatics </it>Volume 8, Supplement 4, 2007: The Second Automated Function Prediction Meeting. The full contents of the supplement are available online at <url>http://www.biomedcentral.com/1471-2105/8?issue=S4</url>.</p>
			</sec>
		</ack>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>SynBrowse: a synteny browser for comparative sequence analysis</p>
				</title>
				<aug>
					<au>
						<snm>Pan</snm>
						<fnm>X</fnm>
					</au>
					<au>
						<snm>Stein</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Brendel</snm>
						<fnm>V</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>21</volume>
				<issue>17</issue>
				<fpage>3461</fpage>
				<lpage>3468</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/bti555</pubid>
						<pubid idtype="pmpid" link="fulltext">15994196</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B2">
				<title>
					<p>VISTA: computational tools for comparative genomics</p>
				</title>
				<aug>
					<au>
						<snm>Frazer</snm>
						<fnm>KA</fnm>
					</au>
					<au>
						<snm>Pachter</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Poliakov</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Rubin</snm>
						<fnm>EM</fnm>
					</au>
					<au>
						<snm>Dubchak</snm>
						<fnm>I</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2004</pubdate>
				<issue>32 Web Server</issue>
				<fpage>W273</fpage>
				<lpage>279</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">441596</pubid>
						<pubid idtype="pmpid" link="fulltext">15215394</pubid>
						<pubid idtype="doi">10.1093/nar/gkh458</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B3">
				<title>
					<p>LAGAN and Multi-LAGAN: Efficient Tools for Large-Scale Multiple Alignment of Genomic DNA</p>
				</title>
				<aug>
					<au>
						<snm>Brudno</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Do</snm>
						<fnm>CB</fnm>
					</au>
					<au>
						<snm>Cooper</snm>
						<fnm>GM</fnm>
					</au>
					<au>
						<snm>Kim</snm>
						<fnm>MF</fnm>
					</au>
					<au>
						<snm>Davydov</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Program</snm>
						<fnm>NCS</fnm>
					</au>
					<au>
						<snm>Green</snm>
						<fnm>ED</fnm>
					</au>
					<au>
						<snm>Sidow</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Batzoglou</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2003</pubdate>
				<volume>13</volume>
				<issue>4</issue>
				<fpage>721</fpage>
				<lpage>731</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">430158</pubid>
						<pubid idtype="pmpid" link="fulltext">12654723</pubid>
						<pubid idtype="doi">10.1101/gr.926603</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B4">
				<title>
					<p>PipMaker &#8211; a web server for aligning two genomic DNA sequences</p>
				</title>
				<aug>
					<au>
						<snm>Schwartz</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Zhang</snm>
						<fnm>Z</fnm>
					</au>
					<au>
						<snm>Frazer</snm>
						<fnm>KA</fnm>
					</au>
					<au>
						<snm>Smit</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Riemer</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Bouck</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Gibbs</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Hardison</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Miller</snm>
						<fnm>W</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2000</pubdate>
				<volume>10</volume>
				<issue>4</issue>
				<fpage>577</fpage>
				<lpage>586</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">310868</pubid>
						<pubid idtype="pmpid" link="fulltext">10779500</pubid>
						<pubid idtype="doi">10.1101/gr.10.4.577</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B5">
				<title>
					<p>Ensembl 2002: accommodating comparative genomics</p>
				</title>
				<aug>
					<au>
						<snm>Clamp</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Andrews</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Barker</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Bevan</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Cameron</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Chen</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Clark</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Cox</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Cuff</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Curwen</snm>
						<fnm>V</fnm>
					</au>
					<etal/>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2003</pubdate>
				<volume>31</volume>
				<issue>1</issue>
				<fpage>38</fpage>
				<lpage>42</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">165530</pubid>
						<pubid idtype="pmpid" link="fulltext">12519943</pubid>
						<pubid idtype="doi">10.1093/nar/gkg083</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B6">
				<title>
					<p>Function prediction and protein networks</p>
				</title>
				<aug>
					<au>
						<snm>Huynen</snm>
						<fnm>MA</fnm>
					</au>
					<au>
						<snm>Snel</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>von Mering</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Bork</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Curr Opin Cell Biol</source>
				<pubdate>2003</pubdate>
				<volume>15</volume>
				<issue>2</issue>
				<fpage>191</fpage>
				<lpage>198</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/S0955-0674(03)00009-7</pubid>
						<pubid idtype="pmpid" link="fulltext">12648675</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B7">
				<title>
					<p>STRING: known and predicted protein-protein associations, integrated and transferred across organisms</p>
				</title>
				<aug>
					<au>
						<snm>von Mering</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Jensen</snm>
						<fnm>LJ</fnm>
					</au>
					<au>
						<snm>Snel</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Hooper</snm>
						<fnm>SD</fnm>
					</au>
					<au>
						<snm>Krupp</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Foglierini</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Jouffre</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Huynen</snm>
						<fnm>MA</fnm>
					</au>
					<au>
						<snm>Bork</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2005</pubdate>
				<volume>33</volume>
				<issue>Database issue</issue>
				<fpage>D433</fpage>
				<lpage>D437</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">539959</pubid>
						<pubid idtype="pmpid" link="fulltext">15608232</pubid>
						<pubid idtype="doi">10.1093/nar/gki005</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B8">
				<title>
					<p>Evolution of gene order conservation in prokaryotes</p>
				</title>
				<aug>
					<au>
						<snm>Tamames</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Genome Biol</source>
				<pubdate>2001</pubdate>
				<volume>2</volume>
				<issue>6</issue>
				<fpage>RESEARCH0020</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">33396</pubid>
						<pubid idtype="pmpid" link="fulltext">11423009</pubid>
						<pubid idtype="doi">10.1186/gb-2001-2-6-research0020</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B9">
				<title>
					<p>Identifying functional links between genes using conserved chromosomal proximity</p>
				</title>
				<aug>
					<au>
						<snm>Yanai</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Mellor</snm>
						<fnm>JC</fnm>
					</au>
					<au>
						<snm>DeLisi</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>Trends Genet</source>
				<pubdate>2002</pubdate>
				<volume>18</volume>
				<issue>4</issue>
				<fpage>176</fpage>
				<lpage>179</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/S0168-9525(01)02621-X</pubid>
						<pubid idtype="pmpid" link="fulltext">11932011</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B10">
				<title>
					<p>AVID: A Global Alignment Program</p>
				</title>
				<aug>
					<au>
						<snm>Bray</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Dubchak</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Pachter</snm>
						<fnm>L</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2003</pubdate>
				<volume>13</volume>
				<fpage>97</fpage>
				<lpage>102</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">430967</pubid>
						<pubid idtype="pmpid" link="fulltext">12529311</pubid>
						<pubid idtype="doi">10.1101/gr.789803</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B11">
				<title>
					<p>Glocal alignment: finding rearrangements during alignment</p>
				</title>
				<aug>
					<au>
						<snm>Brudno</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Malde</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Poliakov</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Do</snm>
						<fnm>CB</fnm>
					</au>
					<au>
						<snm>Couronne</snm>
						<fnm>O</fnm>
					</au>
					<au>
						<snm>Dubchak</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Batzoglou</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19</volume>
				<issue>Suppl 1</issue>
				<fpage>i54</fpage>
				<lpage>i62</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/btg1005</pubid>
						<pubid idtype="pmpid" link="fulltext">12855437</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B12">
				<title>
					<p>SHOT: a web server for the construction of genome phylogenies</p>
				</title>
				<aug>
					<au>
						<snm>Korbel</snm>
						<fnm>JO</fnm>
					</au>
					<au>
						<snm>Snel</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Huynen</snm>
						<fnm>MA</fnm>
					</au>
					<au>
						<snm>Bork</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Trends Genet</source>
				<pubdate>2002</pubdate>
				<volume>18</volume>
				<issue>3</issue>
				<fpage>158</fpage>
				<lpage>162</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/S0168-9525(01)02597-5</pubid>
						<pubid idtype="pmpid" link="fulltext">11858840</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B13">
				<title>
					<p>Phage Genomics: Small Is Beautiful</p>
				</title>
				<aug>
					<au>
						<snm>Brussow</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Hendrix</snm>
						<fnm>RW</fnm>
					</au>
				</aug>
				<source>Cell</source>
				<pubdate>2002</pubdate>
				<volume>108</volume>
				<fpage>13</fpage>
				<lpage>16</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/S0092-8674(01)00637-7</pubid>
						<pubid idtype="pmpid" link="fulltext">11792317</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B14">
				<title>
					<p>Bacteriophage genomics</p>
				</title>
				<aug>
					<au>
						<snm>Hendrix</snm>
						<fnm>RW</fnm>
					</au>
				</aug>
				<source>Curr Opin Microbiol</source>
				<pubdate>2003</pubdate>
				<volume>6</volume>
				<issue>5</issue>
				<fpage>506</fpage>
				<lpage>511</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/j.mib.2003.09.004</pubid>
						<pubid idtype="pmpid" link="fulltext">14572544</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B15">
				<title>
					<p>Coat protein fold and maturation transition of bacteriophage P22 seen at subnanometer resolutions</p>
				</title>
				<aug>
					<au>
						<snm>Jiang</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Li</snm>
						<fnm>Z</fnm>
					</au>
					<au>
						<snm>Zhang</snm>
						<fnm>Z</fnm>
					</au>
					<au>
						<snm>Baker</snm>
						<fnm>ML</fnm>
					</au>
					<au>
						<snm>Prevelige</snm>
						<fnm>PE</fnm>
						<suf>Jr</suf>
					</au>
					<au>
						<snm>Chiu</snm>
						<fnm>W</fnm>
					</au>
				</aug>
				<source>Nat Struct Biol</source>
				<pubdate>2003</pubdate>
				<volume>10</volume>
				<issue>2</issue>
				<fpage>131</fpage>
				<lpage>135</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1038/nsb891</pubid>
						<pubid idtype="pmpid" link="fulltext">12536205</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B16">
				<title>
					<p>Exploring the mycobacteriophage metaproteome: phage genomics as an educational platform</p>
				</title>
				<aug>
					<au>
						<snm>Hatfull</snm>
						<fnm>GF</fnm>
					</au>
					<au>
						<snm>Pedulla</snm>
						<fnm>ML</fnm>
					</au>
					<au>
						<snm>Jacobs-Sera</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Cichon</snm>
						<fnm>PM</fnm>
					</au>
					<au>
						<snm>Foley</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Ford</snm>
						<fnm>ME</fnm>
					</au>
					<au>
						<snm>Gonda</snm>
						<fnm>RM</fnm>
					</au>
					<au>
						<snm>Houtz</snm>
						<fnm>JM</fnm>
					</au>
					<au>
						<snm>Hryckowian</snm>
						<fnm>AJ</fnm>
					</au>
					<au>
						<snm>Kelchner</snm>
						<fnm>VA</fnm>
					</au>
					<etal/>
				</aug>
				<source>PLoS Genet</source>
				<pubdate>2006</pubdate>
				<volume>2</volume>
				<issue>6</issue>
				<fpage>e92</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1475703</pubid>
						<pubid idtype="pmpid" link="fulltext">16789831</pubid>
						<pubid idtype="doi">10.1371/journal.pgen.0020092</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B17">
				<title>
					<p>An introduction to support vector machines and other kernel-based learning methods</p>
				</title>
				<aug>
					<au>
						<snm>Cristianini</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Shawe-Taylor</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<publisher>Cambridge, England: Cambridge University Press</publisher>
				<pubdate>2000</pubdate>
			</bibl>
			<bibl id="B18">
				<title>
					<p>Holins: the protein clocks of bacteriophage infections</p>
				</title>
				<aug>
					<au>
						<snm>Wang</snm>
						<fnm>IN</fnm>
					</au>
					<au>
						<snm>Smith</snm>
						<fnm>DL</fnm>
					</au>
					<au>
						<snm>Young</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>Annu Rev Microbiol</source>
				<pubdate>2000</pubdate>
				<volume>54</volume>
				<fpage>799</fpage>
				<lpage>825</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1146/annurev.micro.54.1.799</pubid>
						<pubid idtype="pmpid" link="fulltext">11018145</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B19">
				<title>
					<p>Bringing gene order into bacterial shape</p>
				</title>
				<aug>
					<au>
						<snm>Tamames</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Gonzalez-Moreno</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Mingorance</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Valencia</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Vicente</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Trends Genet</source>
				<pubdate>2001</pubdate>
				<volume>17</volume>
				<issue>3</issue>
				<fpage>124</fpage>
				<lpage>126</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/S0168-9525(00)02212-5</pubid>
						<pubid idtype="pmpid" link="fulltext">11226588</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B20">
				<title>
					<p>Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context</p>
				</title>
				<aug>
					<au>
						<snm>Wolf</snm>
						<fnm>YI</fnm>
					</au>
					<au>
						<snm>Rogozin</snm>
						<fnm>IB</fnm>
					</au>
					<au>
						<snm>Kondrashov</snm>
						<fnm>AS</fnm>
					</au>
					<au>
						<snm>Koonin</snm>
						<fnm>EV</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2001</pubdate>
				<volume>11</volume>
				<issue>3</issue>
				<fpage>356</fpage>
				<lpage>372</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1101/gr.GR-1619R</pubid>
						<pubid idtype="pmpid" link="fulltext">11230160</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B21">
				<title>
					<p>Automatic detection of conserved gene clusters in multiple genomes by graph comparison and P-quasi grouping</p>
				</title>
				<aug>
					<au>
						<snm>Fujibuchi</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Ogata</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Matsuda</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Kanehisa</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2000</pubdate>
				<volume>28</volume>
				<issue>20</issue>
				<fpage>4029</fpage>
				<lpage>4036</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">110780</pubid>
						<pubid idtype="pmpid" link="fulltext">11024184</pubid>
						<pubid idtype="doi">10.1093/nar/28.20.4029</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B22">
				<title>
					<p>The KEGG databases at GenomeNet</p>
				</title>
				<aug>
					<au>
						<snm>Kanehisa</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Goto</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Kawashima</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Nakaya</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2002</pubdate>
				<volume>30</volume>
				<issue>1</issue>
				<fpage>42</fpage>
				<lpage>46</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">99091</pubid>
						<pubid idtype="pmpid" link="fulltext">11752249</pubid>
						<pubid idtype="doi">10.1093/nar/30.1.42</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B23">
				<title>
					<p>The Bioperl toolkit: Perl modules for the life sciences</p>
				</title>
				<aug>
					<au>
						<snm>Stajich</snm>
						<fnm>JE</fnm>
					</au>
					<au>
						<snm>Block</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Boulez</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Brenner</snm>
						<fnm>SE</fnm>
					</au>
					<au>
						<snm>Chervitz</snm>
						<fnm>SA</fnm>
					</au>
					<au>
						<snm>Dagdigian</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Fuellen</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Gilbert</snm>
						<fnm>JG</fnm>
					</au>
					<au>
						<snm>Korf</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Lapp</snm>
						<fnm>H</fnm>
					</au>
					<etal/>
				</aug>
				<source>Genome Res</source>
				<pubdate>2002</pubdate>
				<volume>12</volume>
				<issue>10</issue>
				<fpage>1611</fpage>
				<lpage>1618</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">187536</pubid>
						<pubid idtype="pmpid" link="fulltext">12368254</pubid>
						<pubid idtype="doi">10.1101/gr.361602</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B24">
				<title>
					<p>Chapter 1: Regular languages</p>
				</title>
				<aug>
					<au>
						<snm>Sipser</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Introduction to the theory of computation</source>
				<publisher>Boston: Thomson Course Technology</publisher>
				<edition>2</edition>
				<pubdate>2006</pubdate>
				<fpage>31</fpage>
				<lpage>90</lpage>
			</bibl>
			<bibl id="B25">
				<aug>
					<au>
						<cnm>Microsoft</cnm>
					</au>
				</aug>
				<source>Regular Expression Language Elements</source>
				<publisher>MSDN Library: .NET Framework General Reference, Microsoft Corporation</publisher>
				<pubdate>2006</pubdate>
			</bibl>
			<bibl id="B26">
				<title>
					<p>Enhancement of topology preservation and hierarchical dynamic self-organising maps for data visualisation</p>
				</title>
				<aug>
					<au>
						<snm>Hsu</snm>
						<fnm>AL</fnm>
					</au>
					<au>
						<snm>Halgamuge</snm>
						<fnm>SK</fnm>
					</au>
				</aug>
				<source>Int J Approx Reason</source>
				<pubdate>2003</pubdate>
				<volume>32</volume>
				<issue>2&#8211;3</issue>
				<fpage>259</fpage>
				<lpage>279</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1016/S0888-613X(02)00086-5</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B27">
				<title>
					<p>An unsupervised hierarchical dynamic self-organizing approach to cancer class discovery and marker gene identification in microarray data</p>
				</title>
				<aug>
					<au>
						<snm>Hsu</snm>
						<fnm>AL</fnm>
					</au>
					<au>
						<snm>Tang</snm>
						<fnm>SL</fnm>
					</au>
					<au>
						<snm>Halgamuge</snm>
						<fnm>SK</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19</volume>
				<issue>16</issue>
				<fpage>2131</fpage>
				<lpage>2140</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/btg296</pubid>
						<pubid idtype="pmpid" link="fulltext">14594719</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B28">
				<title>
					<p>Improvements to Platt's SMO Algorithm for SVM Classifier Design</p>
				</title>
				<aug>
					<au>
						<snm>Keerthi</snm>
						<fnm>SS</fnm>
					</au>
					<au>
						<snm>Shevade</snm>
						<fnm>SK</fnm>
					</au>
					<au>
						<snm>Bhattacharyya</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Murthy</snm>
						<fnm>KRK</fnm>
					</au>
				</aug>
				<source>Neural Comput</source>
				<pubdate>2001</pubdate>
				<volume>13</volume>
				<issue>3</issue>
				<fpage>637</fpage>
				<lpage>649</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1162/089976601300014493</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B29">
				<title>
					<p>Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships</p>
				</title>
				<aug>
					<au>
						<snm>Liao</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Noble</snm>
						<fnm>WS</fnm>
					</au>
				</aug>
				<source>J Comput Biol</source>
				<pubdate>2003</pubdate>
				<volume>10</volume>
				<issue>6</issue>
				<fpage>857</fpage>
				<lpage>868</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1089/106652703322756113</pubid>
						<pubid idtype="pmpid" link="fulltext">14980014</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B30">
				<title>
					<p>Enzyme family classification by support vector machines</p>
				</title>
				<aug>
					<au>
						<snm>Cai</snm>
						<fnm>CZ</fnm>
					</au>
					<au>
						<snm>Han</snm>
						<fnm>LY</fnm>
					</au>
					<au>
						<snm>Ji</snm>
						<fnm>ZL</fnm>
					</au>
					<au>
						<snm>Chen</snm>
						<fnm>YZ</fnm>
					</au>
				</aug>
				<source>Proteins</source>
				<pubdate>2004</pubdate>
				<volume>55</volume>
				<issue>1</issue>
				<fpage>66</fpage>
				<lpage>76</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1002/prot.20045</pubid>
						<pubid idtype="pmpid" link="fulltext">14997540</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B31">
				<title>
					<p>Splice site identification using probabilistic parameters and SVM classification</p>
				</title>
				<aug>
					<au>
						<snm>Baten</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Chang</snm>
						<fnm>BCH</fnm>
					</au>
					<au>
						<snm>Halgamuge</snm>
						<fnm>SK</fnm>
					</au>
					<au>
						<snm>Li</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>BMC Bioinformatics</source>
				<pubdate>2006</pubdate>
				<volume>7</volume>
				<issue>Suppl 5</issue>
				<fpage>S15</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1764471</pubid>
						<pubid idtype="pmpid" link="fulltext">17254299</pubid>
						<pubid idtype="doi">10.1186/1471-2105-7-S5-S15</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B32">
				<title>
					<p>Prediction of protein folding class using global description of amino acid sequence</p>
				</title>
				<aug>
					<au>
						<snm>Dubchak</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Muchnik</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Holbrook</snm>
						<fnm>SR</fnm>
					</au>
					<au>
						<snm>Kim</snm>
						<fnm>SH</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>1995</pubdate>
				<volume>92</volume>
				<issue>19</issue>
				<fpage>8700</fpage>
				<lpage>8704</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">41034</pubid>
						<pubid idtype="pmpid" link="fulltext">7568000</pubid>
						<pubid idtype="doi">10.1073/pnas.92.19.8700</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B33">
				<title>
					<p>BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences</p>
				</title>
				<aug>
					<au>
						<snm>Tatusova</snm>
						<fnm>TA</fnm>
					</au>
					<au>
						<snm>Madden</snm>
						<fnm>TL</fnm>
					</au>
				</aug>
				<source>FEMS Microbiol Lett</source>
				<pubdate>1999</pubdate>
				<volume>174</volume>
				<issue>2</issue>
				<fpage>247</fpage>
				<lpage>250</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1111/j.1574-6968.1999.tb13575.x</pubid>
						<pubid idtype="pmpid" link="fulltext">10339815</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
		</refgrp>
	</bm>
</art>
