<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>gb-2006-7-4-r29</ui>
	<ji>GBJ</ji>
	<fm>
		<dochead>Software</dochead>
		<bibl>
			<title>
				<p>Reference based annotation with GeneMapper</p>
			</title>
			<aug>
				<au id="A1" ca="yes">
					<snm>Chatterji</snm>
					<fnm>Sourav</fnm>
					<insr iid="I1"/>
					<email>souravc@eecs.berkeley.edu</email>
				</au>
				<au id="A2">
					<snm>Pachter</snm>
					<fnm>Lior</fnm>
					<insr iid="I2"/>
					<email>lpachter@math.berkeley.edu</email>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>Department of Computer Science, University of California at Berkeley, Berkeley, CA, 94720, USA</p>
				</ins>
				<ins id="I2">
					<p>Department of Mathematics, University of California at Berkeley, Berkeley, CA 94720, USA</p>
				</ins>
			</insg>
			<source>Genome Biology</source>
			<issn>1465-6906</issn>
			<pubdate>2006</pubdate>
			<volume>7</volume>
			<issue>4</issue>
			<fpage>R29</fpage>
			<url>http://genomebiology.com/2006/7/4/R29</url>
			<xrefbib>
				<pubidlist><pubid idtype="pmpid">16600017</pubid><pubid idtype="doi">10.1186/gb-2006-7-4-r29</pubid>
				</pubidlist></xrefbib>
		</bibl>
		<history>
			<rec>
				<date>
					<day>24</day>
					<month>11</month>
					<year>2005</year>
				</date>
			</rec>
			<revrec>
				<date>
					<day>3</day>
					<month>2</month>
					<year>2006</year>
				</date>
			</revrec>
			<acc>
				<date>
					<day>3</day>
					<month>3</month>
					<year>2006</year>
				</date>
			</acc>
			<pub>
				<date>
					<day>5</day>
					<month>4</month>
					<year>2006</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2006</year>
			<collab>Chatterji and Pachter; licensee BioMed Central Ltd.</collab>
			<note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
		</cpyrt>
		<shorttitle>
			<p>Reference-based annotation</p>
		</shorttitle>
		<shortabs>
			<p>GeneMapper, a new program for transferring annotations from a well-annotated reference genome to other genomes, is described.</p>
		</shortabs>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<p>We introduce GeneMapper, a program for transferring annotations from a well annotated genome to other genomes. Drawing on high quality curated annotations, GeneMapper enables rapid and accurate annotation of newly sequenced genomes and is suitable for both finished and draft genomes. GeneMapper uses a profile based approach for mapping genes into multiple species, improving upon the standard pairwise approach. GeneMapper is freely available for academic use.</p>
			</sec>
		</abs>
	</fm>
	<meta>
		<classifications>
			<classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
			<classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
		</classifications>
	</meta>
	<bdy>
		<sec>
			<st>
				<p>Rationale</p>
			</st>
			<p>With large scale sequencing of vertebrate, fly, and worm genomes now underway, it is imperative to develop methods that produce high quality annotations of these newly sequenced genomes. Lack of genome wide, full length cDNA sequences for these species will make it virtually impossible to annotate these genomes completely using cDNA based methods such as Aceview <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. An alternative approach is to transfer reference annotation from a well annotated genome (such as human or <it>Drosophila melanogaster</it>) to other (possibly draft) genomes. We call this 'reference based annotation'. In fact, annotation systems such as ENSEMBL <abbrgrp><abbr bid="B2">2</abbr></abbrgrp> already incorporate reference based annotation as part of their gene prediction pipelines.</p>
			<p>The rationale behind the reference based approach is that substantial resources have been invested in annotating genomes of model organisms, and it is unreasonable to expect similar efforts to be expended for the myriad of genomes that are now being sequenced. The status of current annotation projects for various insect and chordate genomes is shown in Table <tblr tid="T1">1</tblr>. In the case of vertebrate genomes, the human genome provides an excellent source of reference annotations suitable for transfer. In addition to having extensive numbers of cDNA sequences and a fairly complete RefSeq gene annotation, the human genome annotation also includes a manual annotation component. By contrast, the other vertebrate genomes have insufficient cDNA sequence. In fact, many genome projects lack sufficient resources to run some of the existing <it>ab initio </it>gene prediction programs. The reference based annotation tool we have developed, called GeneMapper, can be used in such cases to transfer human annotations. GeneMapper provides a comprehensive annotation that, as we show, is surprisingly accurate. A similar argument can be made for other clades. For example, <it>D. melanogaster </it>is an extensively studied model organism, and there is a well curated FlyBase database <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> of supporting annotations. GeneMapper has been used to provide high quality annotations of the newly sequenced fruitfly genomes by transferring the FlyBase annotations.</p>
			<tbl id="T1" hint_layout="double">
				<title>
					<p>Table 1</p>
				</title>
				<caption>
					<p>Annotation status of vertebrate and fly genomes</p>
				</caption>
				<tblbdy cols="6">
					<r>
						<c ca="left">
							<p>Organism</p>
						</c>
						<c ca="center">
							<p>EST sequences</p>
						</c>
						<c ca="center">
							<p>Genbank mRNA</p>
						</c>
						<c ca="center">
							<p>RefSeq genes</p>
						</c>
						<c ca="center">
							<p>Manual annotations</p>
						</c>
						<c ca="center">
							<p><it>Ab initio </it>tracks</p>
						</c>
					</r>
					<r>
						<c cspan="6">
							<hr/>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>Homo sapiens</it>
							</p>
						</c>
						<c ca="center">
							<p>6,134,812</p>
						</c>
						<c ca="center">
							<p>207,905</p>
						</c>
						<c ca="center">
							<p>24,293</p>
						</c>
						<c ca="center">
							<p>22,421</p>
						</c>
						<c ca="center">
							<p>5</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>Pan troglodytes</it>
							</p>
						</c>
						<c ca="center">
							<p>4,983</p>
						</c>
						<c ca="center">
							<p>947</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>3</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>Macaca mulatta</it>
							</p>
						</c>
						<c ca="center">
							<p>52,754</p>
						</c>
						<c ca="center">
							<p>1,766</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>Canis familiaris</it>
							</p>
						</c>
						<c ca="center">
							<p>349,306</p>
						</c>
						<c ca="center">
							<p>1,666</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>45</p>
						</c>
						<c ca="center">
							<p>2</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>Bos taurus</it>
							</p>
						</c>
						<c ca="center">
							<p>702,434</p>
						</c>
						<c ca="center">
							<p>8,046</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>2</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>Mus musculus</it>
							</p>
						</c>
						<c ca="center">
							<p>4,686,082</p>
						</c>
						<c ca="center">
							<p>241,865</p>
						</c>
						<c ca="center">
							<p>18,757</p>
						</c>
						<c ca="center">
							<p>5,501</p>
						</c>
						<c ca="center">
							<p>3</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>Rattus norvegicus</it>
							</p>
						</c>
						<c ca="center">
							<p>701,072</p>
						</c>
						<c ca="center">
							<p>23,017</p>
						</c>
						<c ca="center">
							<p>9,012</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>5</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>Oryctolagus cuniculus</it>
							</p>
						</c>
						<c ca="center">
							<p>28,046</p>
						</c>
						<c ca="center">
							<p>2,669</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>Dasypus novemcinctus</it>
							</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>Loxodonta africana</it>
							</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>4</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>Monodelphis domestica</it>
							</p>
						</c>
						<c ca="center">
							<p>50</p>
						</c>
						<c ca="center">
							<p>363</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>1</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>Gallus gallus</it>
							</p>
						</c>
						<c ca="center">
							<p>578,445</p>
						</c>
						<c ca="center">
							<p>29,743</p>
						</c>
						<c ca="center">
							<p>3,848</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>4</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>Xenopus tropicalis</it>
							</p>
						</c>
						<c ca="center">
							<p>1,038,272</p>
						</c>
						<c ca="center">
							<p>10,712</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>1</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>Danio rerio</it>
							</p>
						</c>
						<c ca="center">
							<p>673,076</p>
						</c>
						<c ca="center">
							<p>25,094</p>
						</c>
						<c ca="center">
							<p>10,689</p>
						</c>
						<c ca="center">
							<p>3,546</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>Tetraodon nigroviridis</it>
							</p>
						</c>
						<c ca="center">
							<p>99</p>
						</c>
						<c ca="center">
							<p>107,945</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>2</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>Takifugu rubripes</it>
							</p>
						</c>
						<c ca="center">
							<p>25,850</p>
						</c>
						<c ca="center">
							<p>978</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>1</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>Drosophila melanogaster</it>
							</p>
						</c>
						<c ca="center">
							<p>383,407</p>
						</c>
						<c ca="center">
							<p>19,931</p>
						</c>
						<c ca="center">
							<p>19,697</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>4</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>D. simulans</it>
							</p>
						</c>
						<c ca="center">
							<p>5,013</p>
						</c>
						<c ca="center">
							<p>80</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>2</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>D. yakuba</it>
							</p>
						</c>
						<c ca="center">
							<p>11,015</p>
						</c>
						<c ca="center">
							<p>808</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>2</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>D. erecta</it>
							</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>6</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>1</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>D. ananassae</it>
							</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>11</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>1</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>D. pseudoobscura</it>
							</p>
						</c>
						<c ca="center">
							<p>35,042</p>
						</c>
						<c ca="center">
							<p>40</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>4</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>D. virilis</it>
							</p>
						</c>
						<c ca="center">
							<p>663</p>
						</c>
						<c ca="center">
							<p>41</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>1</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>D. mojavensis</it>
							</p>
						</c>
						<c ca="center">
							<p>361</p>
						</c>
						<c ca="center">
							<p>2</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>1</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<it>D. grimshawi</it>
							</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>None</p>
						</c>
						<c ca="center">
							<p>1</p>
						</c>
					</r>
				</tblbdy>
				<tblfn>
					<p>The Table summarizes the annotation status of vertebrate and fly genomes as of October 2005. The numbers of expressed sequence tag (EST) sequences were obtained from the NCBI dbEST database [38]. The number of manually annotated genes was obtained from the VEGA annotation project site [39]. Other numbers were obtained from the UCSC genome browser database [30].</p>
				</tblfn>
			</tbl>
			<p>Existing computational gene finding methods can be broadly classified into two main categories: <it>ab initio </it>methods and evidence based methods. <it>Ab initio </it>gene finding methods such as GENSCAN <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> and GENIE <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> predict the gene structure from first principles without using external evidence. Comparative <it>ab initio </it>gene finding methods such as SLAM <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, Twinscan <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>, and SGP-2 <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> use conservation of gene structure among related species, for example human and mouse, to derive more accurate predictions. They exploit the fact that coding exons are functional and therefore are more likely to be conserved than noncoding sequence. More recently, methods such as Shadower <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr></abbrgrp>, GIBBS <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp>, EXONIPHY <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>, and NSCAN <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> use conservation information among multiple species to make gene predictions.</p>
			<p>Evidence based gene finding methods are considerably more accurate than <it>ab initio </it>methods because they rely on information that is not intrinsic to the genome to improve prediction. Such information, called external evidence, can be in the form of cDNA or protein sequences from other species. Use of such information frequently requires alignment programs. In the case of cDNA, in order to make use of the evidence, programs such as Aceview <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, ecGene <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, GMAP <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, and BLAT <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> align cDNA with genomic sequence. These methods need to account for the fact that expressed sequence tags can have a relatively high error rate (up to 3%). However, they have not been developed to project cDNA evidence onto distantly related species. For example, they are not designed to align human cDNA with the mouse genome.</p>
			<p>Another class of evidence based methods makes use of alignments of protein sequences with genomic sequences, and forms an important component of pipelines such as ENSEMBL. Such programs include DPS <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>, Procrustes <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>, GeneWise <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>, and GenomeScan <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. To some extent, these programs are designed to work with proteins from related species. Although they work quite well with highly conserved proteins, they are not as accurate for diverged protein sequences. Hybrid methods such as JIGSAW <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> and ExonHunter <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> combine both cDNA and protein evidence probabilistically while making gene predictions.</p>
			<p>GeneMapper has been influenced by and is in the same category of gene finding methods as Projector <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. Projector uses gene annotations from a reference species as evidence to predict the gene structure in a target sequence. In analogy to cDNA based methods, Projector aligns mRNA from a reference gene to a target sequence, but it exploits additional information about splice sites. This is accomplished by using a pair hidden Markov model to transfer annotations from the reference species to the target sequence.</p>
			<p>GeneMapper uses a bottom up approach to predict gene structure. First, each reference exon is aligned to a target genome and these alignments are then joined to build a gene structure. Because exons are much shorter than introns, this approach can afford to use dynamic programming with a fairly sophisticated codon evolution model to provide detailed alignment of exons. GeneMapper also uses a novel mapping process that exploits the phylogeny of the reference and target species to obtain more precise annotations. If a gene is to be mapped from a reference species to multiple target species, then GeneMapper makes use of characteristic properties extracted from all of the available orthologous genes in the family. In other words, the program works with profiles of orthologous genes, which are not unlike protein profiles. The gene profile is built up progressively as the gene is mapped into successive target species. Therefore, the profile becomes more complete as the gene is mapped into additional target species. The profile is especially useful in mapping genes to evolutionarily distant species that may have diverged considerably from the reference species. The rationale behind the profile based approach is that information from all orthologous sequences results in a more comprehensive representation of the gene than is possible with a single sequence.</p>
			<p>GeneMapper was tested on a set of orthologous human and mouse genes. Results were compared with GeneWise and Projector annotations. We show that GeneMapper outperforms both GeneWise and Projector, and also establish that the addition of multiple sequences from chimpanzee, rat, and chicken further improves performance through the use of gene profiles.</p>
		</sec>
		<sec>
			<st>
				<p>Results</p>
			</st>
			<p>GeneMapper was implemented in the computer programming language C and tested on a standard Linux machine. The running time of GeneMapper on a single gene is given by the following equation:</p>
			<p>
				<graphic file="gb-2006-7-4-r29-i1.gif"/>
			</p>
			<p>where N<sub>e</sub> is the number of exons in the gene and l<sub>i</sub> is the length of the <it>i</it>th exon. A loose upper bound on this running time is O(L<sup>2</sup>), where L is the length of coding sequence in the gene. However, the running time is expected to be appreciably smaller than quadratic for multiple exon genes. GeneMapper can be downloaded from the GeneMapper website <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>.</p>
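			<p>The relation between a sum over exons and the loose O(L<sup>2</sup>) bound can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the cost of aligning one exon grows with the square of its length; the exact expression is the one in the equation above and is not reproduced here. The exon lengths and the function names are hypothetical.</p>

```python
def per_exon_cost(exon_lengths):
    """Sum of squared exon lengths (assumed quadratic cost per exon)."""
    return sum(length * length for length in exon_lengths)


def loose_bound(exon_lengths):
    """Square of the total coding length L, the loose upper bound."""
    total = sum(exon_lengths)
    return total * total


# Hypothetical exon lengths (in nucleotides) for a five exon gene.
exons = [120, 95, 180, 150, 210]

cost = per_exon_cost(exons)
bound = loose_bound(exons)

# For a multiple exon gene the per-exon sum is well below L squared.
print(cost, bound, round(cost / bound, 3))
```

			<p>Under this assumption the two quantities coincide only for a single exon gene; the more evenly the coding sequence is split across exons, the further the sum falls below the bound.</p>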
			<p>Two tests were conducted to evaluate the performance of GeneMapper. In the first test, GeneMapper was compared with GeneWise and Projector, two commonly used reference based programs. For the second test, a data set of orthologous genes from the human, chimpanzee, mouse, rat, and chicken genomes was created. This data set was then used to test the hypothesis that adding more species improves the performance of GeneMapper. The tests are described in detail in the following two sections. Finally, GeneMapper was used to annotate ENCODE <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> regions by transferring human GENCODE <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> annotations to other species. We believe that this data set will be an important resource for studying the evolution of genes in vertebrate genomes.</p>
			<sec>
				<st>
					<p>Performance</p>
				</st>
				<p>GeneMapper was compared with Projector and GeneWise on the Projector data set <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. This data set consists of 491 orthologous genes that are reciprocal best matches between mRNA supported human and mouse ENSEMBL genes. The set can be divided into two subsets. The first subset contains 465 genes for which the number of exons is the same in the human and mouse orthologs. The second subset has 26 genes in which the human and mouse orthologs have different numbers of exons, in some cases resulting from exon fusion and splitting events. Some of the genes in this subset were not true orthologs and the data set was refined manually to remove any such errors. The refined data are in Additional data file 1.</p>
				<p>To compare the performance of the programs, the human annotations were used to predict the gene structure in the orthologous mouse sequences. GeneWise and Projector predictions were taken from the Projector paper <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. The eval package <abbrgrp><abbr bid="B28">28</abbr></abbrgrp> was then used to calculate the nucleotide, exon, and gene level sensitivities and specificities of the programs. For more details about these metrics, the reader is referred to the report by Burset and Guigo <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. The performances of the three programs are compared in Table <tblr tid="T2">2</tblr>. The exon level sensitivity and specificity of GeneMapper are 97.15% and 98.19%, respectively, and the error rate is less than half that in the other programs. The gene level sensitivity and specificity are improved by more than 20 percentage points compared to GeneWise and Projector. We believe that the primary reason for GeneMapper's accuracy is the use of a proper exon model for the alignment and mapping of exons. The results clearly indicate that GeneMapper represents a significant improvement over existing programs and will be a useful tool for accurately transferring annotations from reference genomes to the newly sequenced genomes.</p>
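				<p>As a reminder of how the exon level metrics are defined in the gene prediction literature, the following minimal Python sketch counts an exon as correct only when both of its boundaries match the annotation; sensitivity is the fraction of annotated exons that are predicted correctly, and specificity is the fraction of predicted exons that are correct. This is not the eval package; the function name and the coordinates are hypothetical.</p>

```python
def exon_sn_sp(true_exons, predicted_exons):
    """Return (sensitivity, specificity) at the exon level.

    Exons are (start, end) pairs; a prediction is correct only if
    both coordinates match an annotated exon exactly.
    """
    true_set = set(true_exons)
    pred_set = set(predicted_exons)
    correct = len(true_set.intersection(pred_set))
    sn = correct / len(true_set) if true_set else 0.0
    sp = correct / len(pred_set) if pred_set else 0.0
    return sn, sp


# Hypothetical annotation and prediction for one gene.
true_exons = [(100, 250), (400, 520), (700, 910), (1200, 1350)]
predicted = [(100, 250), (400, 520), (700, 900), (1200, 1350), (1500, 1600)]

sn, sp = exon_sn_sp(true_exons, predicted)
print(sn, sp)  # 3 of 4 annotated exons found; 3 of 5 predictions correct
```

				<p>When the number of predicted items equals the number of annotated items, the two denominators are the same and sensitivity equals specificity.</p>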
				<tbl id="T2" hint_layout="double">
					<title>
						<p>Table 2</p>
					</title>
					<caption>
						<p>Performance of reference based programs</p>
					</caption>
					<tblbdy cols="7">
						<r>
							<c ca="left">
								<p>Program</p>
							</c>
							<c cspan="2" ca="center">
								<p>Nucleotide</p>
							</c>
							<c cspan="2" ca="center">
								<p>Exon</p>
							</c>
							<c cspan="2" ca="center">
								<p>Gene<sup>a</sup></p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c cspan="6">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>Sensitivity</p>
							</c>
							<c ca="center">
								<p>Specificity</p>
							</c>
							<c ca="center">
								<p>Sensitivity</p>
							</c>
							<c ca="center">
								<p>Specificity</p>
							</c>
							<c ca="center">
								<p>Sensitivity</p>
							</c>
							<c ca="center">
								<p>Specificity</p>
							</c>
						</r>
						<r>
							<c cspan="7">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>GeneWise</p>
							</c>
							<c ca="center">
								<p>99.86</p>
							</c>
							<c ca="center">
								<p>99.91</p>
							</c>
							<c ca="center">
								<p>92.8</p>
							</c>
							<c ca="center">
								<p>93.4</p>
							</c>
							<c ca="center">
								<p>61.3</p>
							</c>
							<c ca="center">
								<p>60.8</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Projector</p>
							</c>
							<c ca="center">
								<p>99.78</p>
							</c>
							<c ca="center">
								<p>99.70</p>
							</c>
							<c ca="center">
								<p>94.2</p>
							</c>
							<c ca="center">
								<p>90.5</p>
							</c>
							<c ca="center">
								<p>59.9</p>
							</c>
							<c ca="center">
								<p>59.5</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>GeneMapper</p>
							</c>
							<c ca="center">
								<p>99.88</p>
							</c>
							<c ca="center">
								<p>99.94</p>
							</c>
							<c ca="center">
								<p>97.2</p>
							</c>
							<c ca="center">
								<p>97.8</p>
							</c>
							<c ca="center">
								<p>81.7</p>
							</c>
							<c ca="center">
								<p>81.7</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>The Table summarizes the performance of GeneWise, Projector and GeneMapper on the Projector data set consisting of 491 orthologous human and mouse genes. The human annotation was used to predict the gene structure in the mouse sequence. Performance is reported in terms of nucleotide, exon, and gene level sensitivities and specificities. <sup>a</sup>GeneMapper predicts exactly one gene per reference annotation, and the number of predicted genes is equal to the number of genes in true or gold standard annotation. Consequently, gene sensitivity is equal to gene specificity for GeneMapper.</p>
					</tblfn>
				</tbl>
			</sec>
			<sec>
				<st>
					<p>Using additional species to improve performance</p>
				</st>
				<p>The second test used a data set of orthologous human, chimpanzee, mouse, rat, and chicken genes to measure the improvement in accuracy of GeneMapper with the addition of multiple species. RefSeq annotations of human, mouse, and chicken genomes were downloaded from the University of California Santa Cruz (UCSC) genome browser database <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. The gene set was refined to remove annotations with common errors such as the absence of start or stop codons. BLAT <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> was then used to find mutually best hits among the proteomes. The pair-wise hits were further joined together to obtain orthologous triplets of human, mouse, and chicken genes. The human and mouse orthologs were then mapped into the chimpanzee and rat genomes, respectively, resulting in a set of orthologs from all five species. The data set obtained by this process consisted of 895 potential orthologous segments from the five vertebrate genomes, and is provided in Additional data file 2. We should note here that this standard method of obtaining orthologs by reciprocal best hits cannot distinguish between paralogs. However, the accuracy of reference based programs such as GeneMapper is not affected as long as the potential orthologs are sufficiently conserved.</p>
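				<p>The reciprocal best hit step and the joining of pair-wise hits into triplets can be sketched in Python as follows. The sketch assumes that the best hit of each protein in the other proteome has already been computed (for example from BLAT output) and that triplets are joined on the human gene; the text does not specify the joining rule, so this is an assumption. All gene identifiers and function names are hypothetical.</p>

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Keep pairs (a, b) where a's best hit is b and b's best hit is a."""
    return {
        a: b
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    }


def join_triplets(human_mouse, human_chicken):
    """Join two pair-wise ortholog maps on the shared human gene."""
    return sorted(
        (h, m, human_chicken[h])
        for h, m in human_mouse.items()
        if h in human_chicken
    )


# Hypothetical best-hit tables between proteomes.
human_to_mouse = {"hA": "mA", "hB": "mB", "hC": "mB"}
mouse_to_human = {"mA": "hA", "mB": "hB"}
human_to_chicken = {"hA": "cA", "hB": "cX"}
chicken_to_human = {"cA": "hA", "cX": "hD"}

hm = reciprocal_best_hits(human_to_mouse, mouse_to_human)
hc = reciprocal_best_hits(human_to_chicken, chicken_to_human)
triplets = join_triplets(hm, hc)
print(triplets)
```

				<p>In this toy input, hC is dropped because its best mouse hit prefers hB, and hB is dropped from the triplets because its chicken hit is not mutual.</p>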
				<p>To assess the performance of pair-wise GeneMapper, human annotations were used to predict the gene structure in the orthologous chicken sequences. For the multiple species version of GeneMapper, additional orthologous sequences from chimpanzee, mouse, and rat were utilized. The profiles were initialized with the human genes, and were then used to predict gene structures incrementally in the chimpanzee, mouse, and rat genomes. As gene structures were predicted in each new species, they were added to the profiles. Finally, the profiles were used to predict the gene structures in the chicken sequence. The performance of the pair-wise and multiple species versions of GeneMapper on the chicken genome is summarized in Table <tblr tid="T3">3</tblr>. The Table demonstrates that multiple species GeneMapper represents an improvement over pair-wise GeneMapper. We point out below that most of the errors in the predictions are caused by factors that cannot be corrected computationally. Consequently, it is quite significant that multiple species GeneMapper is able to correct 18 wrong exon predictions of pair-wise GeneMapper with just three additional species. We therefore believe that, with the addition of more species, multiple species GeneMapper will come close to the limit of computational reference based methods.</p>
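				<p>The incremental use of profiles described above can be illustrated with a toy Python sketch. It is not GeneMapper's profile model or scoring scheme, which are not reproduced here: the sketch assumes gap-free sequences of equal length, a profile made of per-column residue counts, and a score equal to the average fraction of profile members that agree with a candidate. All sequences, candidates, and function names are hypothetical.</p>

```python
from collections import Counter


def new_profile(reference_seq):
    """Start a profile with one residue counter per reference column."""
    return [Counter(ch) for ch in reference_seq]


def add_to_profile(profile, seq):
    """Add a gap-free sequence of matching length to the profile."""
    if len(seq) != len(profile):
        raise ValueError("sequence length must match profile length")
    for column, ch in zip(profile, seq):
        column[ch] += 1


def profile_score(profile, seq):
    """Average per-column fraction of profile members agreeing with seq."""
    agree = sum(col[ch] / sum(col.values()) for col, ch in zip(profile, seq))
    return agree / len(profile)


def best_candidate(profile, candidates):
    """Pick the candidate with the highest profile score."""
    return max(candidates, key=lambda s: profile_score(profile, s))


human = "MKTLLV"
candidates = {
    "chimpanzee": ["MKTLLV", "MKSLIV"],
    "mouse": ["MRTLLV", "MKSAIV"],
    "rat": ["MRTLIV", "QRSAIV"],
}
chicken_candidates = ["MRSLIV", "MKTAAA"]

# Initialize with human, then grow the profile species by species.
profile = new_profile(human)
chosen = {}
for species in ("chimpanzee", "mouse", "rat"):
    pick = best_candidate(profile, candidates[species])
    chosen[species] = pick
    add_to_profile(profile, pick)

# Compare human-only scoring with profile scoring on the chicken candidates.
pairwise_scores = [profile_score(new_profile(human), s) for s in chicken_candidates]
profile_scores = [profile_score(profile, s) for s in chicken_candidates]
chicken_pick = best_candidate(profile, chicken_candidates)
print(chosen, pairwise_scores, profile_scores, chicken_pick)
```

				<p>In this toy input the two chicken candidates are indistinguishable when scored against the human sequence alone, whereas the profile built from four species separates them.</p>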
				<tbl id="T3" hint_layout="double">
					<title>
						<p>Table 3</p>
					</title>
					<caption>
						<p>Comparison of pairwise and multiple species GeneMapper</p>
					</caption>
					<tblbdy cols="7">
						<r>
							<c ca="left">
								<p>Program</p>
							</c>
							<c cspan="2" ca="center">
								<p>Nucleotide</p>
							</c>
							<c cspan="2" ca="center">
								<p>Exon</p>
							</c>
							<c cspan="2" ca="center">
								<p>Gene</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c cspan="6">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>Sensitivity</p>
							</c>
							<c ca="center">
								<p>Specificity</p>
							</c>
							<c ca="center">
								<p>Sensitivity</p>
							</c>
							<c ca="center">
								<p>Specificity</p>
							</c>
							<c ca="center">
								<p>Sensitivity</p>
							</c>
							<c ca="center">
								<p>Specificity</p>
							</c>
						</r>
						<r>
							<c cspan="7">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Pair-wise GeneMapper</p>
							</c>
							<c ca="center">
								<p>99.95</p>
							</c>
							<c ca="center">
								<p>99.93</p>
							</c>
							<c ca="center">
								<p>91.3</p>
							</c>
							<c ca="center">
								<p>95.1</p>
							</c>
							<c ca="center">
								<p>52.2</p>
							</c>
							<c ca="center">
								<p>52.2</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Multiple species GeneMapper</p>
							</c>
							<c ca="center">
								<p>99.95</p>
							</c>
							<c ca="center">
								<p>99.93</p>
							</c>
							<c ca="center">
								<p>91.5</p>
							</c>
							<c ca="center">
								<p>95.2</p>
							</c>
							<c ca="center">
								<p>52.6</p>
							</c>
							<c ca="center">
								<p>52.6</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>The Table summarizes the effect of additional species on the performance of GeneMapper. To test pair-wise GeneMapper, only the human annotation was used to predict the gene structure in the chicken sequence. To test the profile based approach, additional orthologous sequences from the chimpanzee, mouse, and rat genomes were used to create a profile for each gene. The profiles were then employed to predict genes in the chicken sequences. The Table compares the accuracy in predicting the gene structure in the chicken sequences.</p>
					</tblfn>
				</tbl>
			</sec>
			<sec>
				<st>
					<p>ENCODE annotations</p>
				</st>
				<p>The goal of the ENCODE project <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> is to study functional elements by rigorously analyzing a portion (about 1%) of the human genome. Forty-four regions across the human genome were chosen for investigation and orthologous regions in other vertebrate genomes were sequenced for comparative analysis. GeneMapper was used to annotate the ENCODE regions by transferring human GENCODE <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> annotations to other species. We provide these annotations as a resource for studying the evolution of genes (Additional data file 3).</p>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Discussion</p>
			</st>
			<p>We have shown that GeneMapper can transfer reference annotations with remarkably high accuracy and that it is a substantial improvement over existing programs. This suggests that reference based gene finding is a feasible approach for accurately annotating the large number of genomes that are now being sequenced.</p>
			<p>It is important to note that the concept of transferring annotations is not a new one, and methods such as DPS, Procrustes, GeneWise, Genomescan, and Projector have been designed to perform exactly the same task. GeneWise and Procrustes align proteins with genomic sequences from target species. The principal disadvantage of the protein alignment approach is that it does not utilize information about exon/intron boundaries and therefore does not perform very well on less conserved genes. On the other hand, methods such as Projector and GeneMapper utilize the exon/intron structure of the gene and thus are more accurate in identifying splice sites. However, it should be noted that GeneMapper and Projector are not suitable for mapping genes from very distant species, in which the exon/intron structure of the gene might not remain conserved. For example, if one wants to find the homolog of a novel fruitfly gene in the human genome, it is probably best to use methods such as Procrustes and GeneWise.</p>
			<p>Both GeneMapper and Projector use the exon/intron structure of the gene to predict the ortholog of a reference gene in a related species, but they take different approaches to the prediction problem. Projector uses the Viterbi algorithm for a pair hidden Markov model to predict the gene structure. Because the running time of the Viterbi algorithm for pair hidden Markov models is quadratic, Projector uses a heuristic to decrease the search space. In contrast, GeneMapper uses a bottom up algorithm that first maps each exon and then joins the exon predictions together to obtain the gene structure. Because exons are much shorter than introns, a more sophisticated model can be used for exon alignment. The optimal alignment is still obtained using dynamic programming, albeit a more complex one. We believe that the use of our exon alignment model makes GeneMapper more accurate than Projector. Furthermore, unlike Projector, GeneMapper models sequencing errors and frameshifts, and we believe that this makes GeneMapper more suitable for draft genomes.</p>
			<p>When a gene must be mapped into multiple species, GeneMapper uses profiles to derive a more complete characterization of the gene and thus make more precise predictions. This is because a profile of orthologous genes can help us to obtain much more information about the gene family than a single reference gene. We showed that the use of additional species and the application of the profile based approach outperforms the pair-wise approach. The use of profiles is particularly appropriate for annotating the newly sequenced vertebrate, insect, and worm genomes because the profile can exploit information from all related genomes while making gene predictions.</p>
			<sec>
				<st>
					<p>Potential sources of error</p>
				</st>
				<p>Even though GeneMapper is remarkably accurate and has an error rate of less than 3% in transferring exons from human genes to orthologous mouse sequences, we investigated the sources of these errors to gain more insight into the GeneMapper algorithm. Most errors can be classified into the categories explained below.</p>
				<p>Exons that have diverged considerably between the reference and the target genes are unable to pass the statistical significance tests of ExonAligner. This is because a choice was made to report only highly reliable predictions at the cost of missing a few true exons.</p>
				<p>As described in the Methods section (below), GeneMapper's procedure for detecting exon splitting is comparatively crude and depends on accurate alignment of the reference exon with the orthologous target sequence (which contains an inserted intron). The presence of the inserted intron makes it difficult to align these regions accurately, especially if it is a long intron. Such wrongly aligned exons are partially predicted and this problem can probably be solved by employing a more sophisticated alignment model that allows inserted introns.</p>
				<p>The GeneMapper algorithm is unable to account for certain assembly and sequencing errors. For example, we found many cases of duplicated chicken exons, most probably due to errors in the assembly. In such cases there is no way to distinguish between the duplicate exons, and the prediction is made randomly among the duplicates. GeneMapper also constrains the predicted exons to have splice sites at their ends. Therefore, we are unable to deal with sequencing errors at splice sites.</p>
				<p>Differential splicing in the reference and target species can also cause errors in GeneMapper predictions. For example, if an exon is transcribed in the reference species but its ortholog is not transcribed in the target species, then GeneMapper predicts a wrong exon in the target species. However, it is not clear whether this is a wrong prediction, considering that this exon might be part of an alternate transcript in the target species. In fact, whether alternative spliced forms are conserved among related species such as human and mouse is an open question, and we believe that GeneMapper predictions could be an appropriate starting point for any experiment that seeks to address this issue.</p>
				<p>An analysis of these errors will facilitate future improvements in GeneMapper. For example, we intend to work on statistical significance tests that discriminate better between true and false exon predictions. Future enhancements of GeneMapper will also include improved handling of exon splitting. At present, GeneMapper only transfers the coding sequence of a reference gene to a target sequence. We intend to modify GeneMapper to map 5' and 3' untranslated regions. This will also help in mapping short initial/terminal coding exons, which are more divergent than internal exons.</p>
				<p>Although, as we point out, there is still room for improvement, we believe that multiple species GeneMapper comes close to the limit of gene prediction accuracy that is possible with computational reference based gene finding.</p>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Methods</p>
			</st>
			<sec>
				<st>
					<p>ExonAligner</p>
				</st>
				<p>GeneMapper is a bottom up algorithm that first predicts the ortholog of each reference exon in the target sequence and then combines the exon predictions to determine the gene structure. Therefore, the most critical step in the algorithm is to predict the ortholog of each reference exon by aligning it with the target sequence. A module called ExonAligner was developed to carry out this step in GeneMapper. ExonAligner takes as input two sequences, the annotated exon from the reference species and a target sequence containing its ortholog. A fairly intricate dynamic programming model is then used to align the reference exon with the target sequence.</p>
				<p>ExonAligner uses a version of the Smith-Waterman algorithm to find the best alignment of the reference exon with a subsequence of the target sequence. In this version of the standard dynamic programming algorithm, as shown in Figure <figr fid="F1">1a</figr>, overhanging ends are penalized in the reference exon but not in the target sequence. In addition, the matched subsequence is constrained to have splice sites at its boundaries. The splice sites are scored using StrataSplice <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> to improve splice site detection.</p>
				<fig id="F1">
					<title>
						<p>Figure 1</p>
					</title>
					<caption>
						<p>The ExonAligner algorithm</p>
					</caption>
					<text>
						<p>The ExonAligner algorithm. <b>(a) </b>Representation of constrained dynamic programming used by ExonAligner. It aligns the reference exon with a subsequence of the target sequence. This subsequence is additionally constrained to have splice sites at its ends, which are represented by green blobs in the cartoon. <b>(b) </b>The dynamic programming matrix used by ExonAligner. Only the edges into the top right node are shown. The solid edges represent matches/mismatches and gaps in codon space. The dotted edges represent translation frame disrupting events such as frameshifts.</p>
					</text>
					<graphic file="gb-2006-7-4-r29-1"/>
				</fig>
				<p>ExonAligner uses a special dynamic programming matrix to model the evolution of codons and to allow for sequencing errors and frameshifts. The dynamic programming matrix is shown in Figure <figr fid="F1">1b</figr>. There are two types of edges in the matrix, with solid edges representing transitions in codon space and dotted edges representing events that cause disruptions in the translation frame. The solid edges model insertions, deletions and pairing of codons, and cover three nucleotides in the X and/or Y coordinates. On the other hand, the dotted edges cover one nucleotide in the X or Y direction. They model events such as sequencing errors and frameshifts, which cause disruptions in the translation frame. Because these events are very rare, a large penalty is charged for traversing these edges.</p>
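				<p>The structure of this dynamic programming matrix can be illustrated with a short Python sketch. It is not the ExonAligner implementation: the scoring function is a toy stand-in for a COD matrix, the penalty values and the names <it>codon_score</it> and <it>align_exon</it> are hypothetical, and the splice site constraint is omitted. The sketch only shows how codon moves (three nucleotides) and frame disrupting moves (one nucleotide) coexist in one recursion, with overhangs free in the target but not in the reference exon.</p>
				<p><![CDATA[
```python
NEG = float("-inf")


def codon_score(a, b):
    """Toy stand-in for a COD matrix entry: +1 per matching base, -1 per mismatch."""
    return sum(1 if x == y else -1 for x, y in zip(a, b))


def align_exon(ref, target, codon_gap=-4, frameshift=-10):
    """Best score for aligning all of ref to some substring of target.

    Returns (score, end), where end is the target position at which the
    best alignment stops. Codon moves advance three bases in ref and/or
    target; frameshift moves advance one base in a single sequence and
    carry a large penalty.
    """
    n, m = len(ref), len(target)
    # score[i][j]: best score of aligning ref[:i] to a target substring ending at j
    score = [[NEG] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        score[0][j] = 0.0  # the alignment may start anywhere in the target
    for i in range(1, n + 1):
        for j in range(m + 1):
            best = NEG
            if i >= 3 and j >= 3:  # codon paired with codon
                pair = codon_score(ref[i - 3:i], target[j - 3:j])
                best = max(best, score[i - 3][j - 3] + pair)
            if i >= 3:  # codon deleted in the target
                best = max(best, score[i - 3][j] + codon_gap)
            if j >= 3:  # codon inserted in the target
                best = max(best, score[i][j - 3] + codon_gap)
            best = max(best, score[i - 1][j] + frameshift)  # single base, ref only
            if j >= 1:  # single base, target only
                best = max(best, score[i][j - 1] + frameshift)
            score[i][j] = best
    # The alignment may also end anywhere in the target.
    end = max(range(m + 1), key=lambda j: score[n][j])
    return score[n][end], end
```
]]></p>
				<p>With these toy scores, a single inserted nucleotide in a short exon is often cheaper to absorb as mismatches than as a frameshift, which mirrors the intent that frame disrupting edges are used only rarely.</p>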
				<p>ExonAligner models the evolution of codons by using 64 &#215; 64 COD matrices. COD matrices are very similar to PAM and BLOSUM matrices <abbrgrp><abbr bid="B32">32</abbr><abbr bid="B33">33</abbr></abbrgrp>, which define distances between amino acids. The COD matrices are learned from whole genome alignments. In the case of vertebrates, the COD matrices are extrapolated from human and chimpanzee whole genome alignments. The whole genome alignment of the human and chimpanzee genomes was obtained from the UCSC genome browser database <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. The alignments of human genes with the chimpanzee genome were extracted from these data. The gene alignments were then used to learn parameters for evolution of codons between human and chimpanzee genomes. The human/chimpanzee parameters were extrapolated to obtain parameters for other species.</p>
				<p>The ExonAligner algorithm predicts the reference exon's putative ortholog in the target species. The putative ortholog is used as a prediction by GeneMapper only if its alignment with the reference exon passes a test of statistical significance. The testing of statistical significance of alignments is a well studied problem. The reader is referred to the book by Durbin and coworkers <abbrgrp><abbr bid="B34">34</abbr></abbrgrp> for an overview. ExonAligner uses the Bayesian likelihood ratio test as its core test. In this test, the calculated score is the ratio of the likelihood of the alignment in the match model to its likelihood in the random model. Because the score is dependent upon length, short exons may fail to pass the ratio test. Therefore, ExonAligner also allows highly conserved short exons to pass the test of statistical significance.</p>
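				<p>A likelihood ratio test of this kind, together with an exception for short conserved exons, might look like the sketch below. The match model, the threshold, the length cutoff and the identity cutoff are all assumptions chosen for illustration; the source does not give the values used by ExonAligner, and the function names are hypothetical.</p>
				<p><![CDATA[
```python
import math


def log_odds(ref_codons, tgt_codons, p_match=0.9, p_random=1.0 / 64):
    """Log likelihood ratio of an ungapped codon alignment.

    Toy match model: aligned codons are identical with probability p_match,
    and the remaining probability is spread evenly over the other 63 codons.
    Toy random model: every codon is equally likely.
    """
    total = 0.0
    for a, b in zip(ref_codons, tgt_codons):
        p_model = p_match if a == b else (1.0 - p_match) / 63
        total += math.log(p_model / p_random)
    return total


def is_significant(ref_codons, tgt_codons, threshold=20.0,
                   short_len=5, short_identity=1.0):
    """Accept an alignment of non-empty codon lists.

    The alignment passes if its log-odds score reaches the threshold, or if
    the exon is short (at most short_len codons) and highly conserved.
    """
    if log_odds(ref_codons, tgt_codons) >= threshold:
        return True
    matches = sum(a == b for a, b in zip(ref_codons, tgt_codons))
    identity = matches / len(ref_codons)
    return short_len >= len(ref_codons) and identity >= short_identity
```
]]></p>
				<p>Because the score is a sum over aligned codons, a perfectly conserved three codon exon scores below the threshold in this sketch and is accepted only through the short exon rule.</p>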
			</sec>
			<sec>
				<st>
					<p>The pair-wise GeneMapper algorithm</p>
				</st>
				<p>In this section we describe the pair-wise version of GeneMapper, which maps gene annotations from a reference species to a single target species. The GeneMapper pipeline consists of three stages, shown in Figure <figr fid="F2">2</figr>. In the first stage only the most conserved exons are mapped to the target sequence. At the end of this stage, an approximate outline of the gene in the target sequence is obtained, as shown in Figure <figr fid="F2">2a</figr>. In the second stage this outline is used to predict the orthologs of exons that are unmapped in the first stage. The exons mapped in the first stage narrow down the possible locations of neighboring unmapped exons and thus help in mapping them with more confidence. For example, in Figure <figr fid="F2">2b</figr> the search for the third exon in the target sequence can be narrowed down between the second and fourth exons (which were mapped in the first stage of the algorithm). In the first two stages, it is assumed that there are equal numbers of exons in orthologous genes of the reference and target species. However, studies <abbrgrp><abbr bid="B35">35</abbr></abbrgrp> have shown that this is not entirely true. In the case of human and mouse, for instance, about 15% of orthologous genes do not have the same number of exons. Therefore, GeneMapper searches for exon splitting and exon fusion events in the third stage. We now describe in detail each stage of the pipeline.</p>
				<fig id="F2">
					<title>
						<p>Figure 2</p>
					</title>
					<caption>
						<p>The three stages of the GeneMapper pipeline</p>
					</caption>
					<text>
						<p>The three stages of the GeneMapper pipeline. <b>(a) </b>The first stage, in which only the most conserved exons are mapped. <b>(b) </b>The second stage, in which the algorithm uses exons mapped in the first stage as signposts to map the remaining unmapped exons. In this example, the possible locations of the second and third exons are narrowed down because they must be between the first and fourth exons. <b>(c) </b>The last stage, in which the algorithm searches for cases of exon splitting and exon fusion.</p>
					</text>
					<graphic file="gb-2006-7-4-r29-2"/>
				</fig>
				<p>In the first stage of the GeneMapper algorithm, only the highly conserved exons are mapped. GeneMapper initially searches for the approximate locations of the ortholog of each exon in the target sequence by using translated BLAST. If any significant hits are found for an exon, then the best hit is extended to derive an approximate location of the exon's ortholog in the target sequence. The ExonAligner algorithm is then used to predict the exact ortholog of the exon. The alignment of the predicted ortholog with the reference exon is checked for statistical significance using a combination of tests (described above). These tests are made quite stringent so that only the most conserved exons may pass them. This choice is made by design because we are able to obtain an outline of the gene structure in the target sequence that can be utilized to map less conserved exons more confidently in the next stage of the algorithm.</p>
				<p>In the second stage of GeneMapper, linearity of transcription is used to map exons that are missed in the first stage of the algorithm (specifically, already mapped exons are used to find the approximate locations of unmapped exons). The details of the use of extrapolation to pinpoint the location of unmapped exons are shown in Figure <figr fid="F3">3</figr>. Once the possible location of an unmapped exon has been narrowed down, translated BLAST and ExonAligner are used to map the exon in the target sequence by a procedure that is similar to the first stage of the algorithm. However, the statistical significance tests are made less stringent in the second stage. This is because the position of the exon was narrowed down using already predicted exons, and this makes us more confident about the accuracy of the prediction.</p>
				<fig id="F3">
					<title>
						<p>Figure 3</p>
					</title>
					<caption>
						<p>Extrapolation in GeneMapper</p>
					</caption>
					<text>
						<p>Extrapolation in GeneMapper. Use of extrapolation to pinpoint the location of unmapped exons in the second stage of the GeneMapper pipeline. The blue sequence shows the possible location of the unmapped exon in the target sequence, and we assume that the gene is on the same strand in both species. <b>(a) </b>If an unmapped exon has mapped exons both upstream and downstream of it, then the unmapped exon should be mapped between the orthologs of its nearest mapped upstream and downstream exons. <b>(b) </b>If only the exons upstream of an unmapped exon are mapped, then the unmapped exon should be mapped downstream of the ortholog of its closest mapped exon. <b>(c) </b>If only the exons downstream of an unmapped exon are mapped, then the unmapped exon should be mapped upstream of the ortholog of its closest mapped exon.</p>
					</text>
					<graphic file="gb-2006-7-4-r29-3"/>
				</fig>
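				<p>The three extrapolation cases of Figure 3 reduce to a small interval computation, sketched here in Python. The function name, the coordinate convention (half-open target intervals) and the use of the full target sequence as the outer bound are assumptions made for the illustration.</p>
				<p><![CDATA[
```python
def search_window(index, mapped, target_length):
    """Return (start, end) of the target interval to search for exon `index`.

    `mapped` maps reference exon indices to the (start, end) coordinates of
    their predicted orthologs in the target. The gene is assumed to lie on
    the same strand in both species.
    """
    upstream = [i for i in mapped if index > i]
    downstream = [i for i in mapped if i > index]
    # Case (a) uses both bounds; cases (b) and (c) fall back to a sequence end.
    start = mapped[max(upstream)][1] if upstream else 0
    end = mapped[min(downstream)][0] if downstream else target_length
    return start, end
```
]]></p>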
				<p>In the third and final stage of GeneMapper, the algorithm searches for exon fusion and exon splitting events. For detecting exon fusion, we exploit the fact that introns must be of a minimum length to maintain the intron splicing reaction. Thus, if two adjacent exon predictions in the target sequence are closer than the minimum intron length, then they must have fused during evolution. This rule is very effective in detecting most cases of exon fusion in the Projector data set. On the other hand, the rule for detecting exon splitting is comparatively crude and is dependent on having an accurate alignment of the reference exon with the predicted ortholog. The alignment is searched for gaps of length greater than the minimum intron length and having splice sites at their ends. Such gaps are best explained by exon splitting events. The rules for detecting exon splitting are preliminary and improvements are planned in future versions of GeneMapper.</p>
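				<p>Both third stage rules can be written down compactly. The sketch below is illustrative only: the minimum intron length of 50 nucleotides is an assumed value that the source does not state, splice sites are approximated by the canonical GT and AG dinucleotides rather than scored, and the function names are hypothetical.</p>
				<p><![CDATA[
```python
MIN_INTRON = 50  # assumed value for illustration; the source does not state one


def fused_pairs(exons, min_intron=MIN_INTRON):
    """Indices i where predicted exons i and i+1 are closer than min_intron.

    `exons` is a list of (start, end) target coordinates sorted along the gene.
    """
    return [i for i in range(len(exons) - 1)
            if min_intron > exons[i + 1][0] - exons[i][1]]


def split_candidates(gaps, target, min_intron=MIN_INTRON):
    """Alignment gaps (start, end) in the target that look like inserted introns.

    A gap qualifies if it is longer than min_intron and is bounded by the
    canonical GT donor and AG acceptor dinucleotides.
    """
    hits = []
    for start, end in gaps:
        segment = target[start:end]
        if (len(segment) > min_intron
                and segment.startswith("GT") and segment.endswith("AG")):
            hits.append((start, end))
    return hits
```
]]></p>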
			</sec>
			<sec>
				<st>
					<p>Multiple species GeneMapper</p>
				</st>
				<p>Several studies <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B14">14</abbr><abbr bid="B36">36</abbr><abbr bid="B37">37</abbr></abbrgrp> have shown that increasing the number of species helps in improving the performance of comparative <it>ab initio </it>gene finding programs. It therefore appears intuitive that increasing the number of species (and thus increasing the amount of available data) should enhance the accuracy of evidence based gene finding methods. The multiple species version of the GeneMapper algorithm makes use of two key ideas to improve upon the pair-wise algorithm. First, a profile of the gene is built and updated each time we map the gene into a new target species. The gene profiles are very similar to protein profiles, which are used extensively in protein informatics. The profiles help us to map genes more accurately into species that are evolutionarily distant from the reference species. Second, there is a specific order in which a gene is mapped from the reference species into the multiple target species, and this order is designed to take full advantage of the profile.</p>
				<p>Gene profiles are alignments of one or more orthologous genes that are used to search for new orthologs. As shown in Figure <figr fid="F4">4</figr>, gene profiles work in codon space and each column in the profile contains orthologous codons. As with standard profiles, a gene profile can include gaps of length 3 that cover a codon. For example, the fifth column in the figure has codon gaps in the mouse and rat sequences. In addition, a gene profile can contain noncodon gaps that cover one nucleotide. These gaps account for rare translation disrupting events such as frameshifts and sequencing errors and are not shown in the Figure.</p>
				<fig id="F4">
					<title>
						<p>Figure 4</p>
					</title>
					<caption>
						<p>A gene profile</p>
					</caption>
					<text>
						<p>A gene profile. A portion of the gene profile of the <it>Neurod4 </it>gene orthologs in human, chimpanzee, mouse, and rat. Each column in the profile contains orthologous codons and is used to obtain the residue scoring matrix for dynamic programming. Columns with conserved codons are shown in bold, whereas columns with synonymous substitutions are italicized.</p>
					</text>
					<graphic file="gb-2006-7-4-r29-4"/>
				</fig>
				<p>ExonAligner is modified to align gene profiles with sequences. As with pair-wise ExonAligner, COD matrices are used to model the evolution of codons. To evaluate the residue scoring matrix for the profile, ExonAligner calculates the COD matrices defining the distances between the codons in the target species and each species in the profile. The COD matrices are then used to derive the pair-wise residue scoring matrix for each species. The residue scoring matrix for the whole profile is the sum of the pair-wise scores. We illustrate the procedure by calculating the residue scoring matrix for species s at the third column in Figure <figr fid="F4">4</figr>. We first calculate the pair-wise COD matrices between species s and human, chimpanzee, mouse and rat, and call them COD<sub>sh</sub>, COD<sub>sc</sub>, COD<sub>sm</sub> and COD<sub>sr</sub>, respectively. The score for codon c is the sum of the pair-wise scores:</p>
				<p>COD<sub>sh</sub>(c, GGA) + COD<sub>sc</sub>(c, GGA) + COD<sub>sm</sub>(c, GGT) + COD<sub>sr</sub>(c, GGA)</p>
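				<p>The sum above can be expressed as a short function. In this sketch the per-species scoring functions are toy stand-ins for real COD matrices, the names <it>profile_score</it> and <it>toy_cod</it> are hypothetical, and skipping gapped profile entries is an assumption, because the source does not say how gaps in a column contribute to the score.</p>
				<p><![CDATA[
```python
def toy_cod(candidate, profile_codon):
    """Toy stand-in for a COD matrix entry: number of identical bases."""
    return sum(x == y for x, y in zip(candidate, profile_codon))


def profile_score(candidate, column, cod_matrices):
    """Sum the pair-wise scores of a candidate codon against one profile column.

    `column` maps species name to its codon (None for a gap); `cod_matrices`
    maps species name to a function scoring (candidate, profile codon).
    """
    return sum(cod_matrices[species](candidate, codon)
               for species, codon in column.items() if codon is not None)


# Third column of the example profile: human, chimpanzee, mouse, rat.
column = {"human": "GGA", "chimpanzee": "GGA", "mouse": "GGT", "rat": "GGA"}
matrices = {species: toy_cod for species in column}
```
]]></p>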
				<p>ExonAligner uses two evolutionary models to take into account the variations in mutability of codons. The first model represents codons that are under negative selection and have a low mutation rate. The second model represents codons that are not under any selection pressure and therefore have a high rate of mutability. A simple heuristic is employed to determine the model for a particular site. The first model is used if all of the mutations in the site are synonymous; otherwise, the second model is used. In addition, the program uses position sensitive gap scores, whereby sites represented by the second model have a lower gap penalty.</p>
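				<p>The model selection heuristic amounts to checking whether every codon in a profile column encodes the same amino acid. The sketch below uses the standard genetic code; the model labels, the gap penalty values and the function names are hypothetical, and columns are assumed to contain no gaps.</p>
				<p><![CDATA[
```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def site_model(column_codons):
    """Return 'conserved' if all codons in the column are synonymous, else 'variable'."""
    amino_acids = {CODON_TABLE[c] for c in column_codons}
    return "conserved" if len(amino_acids) == 1 else "variable"


def gap_penalty(model, conserved_penalty=-12, variable_penalty=-6):
    """Position sensitive gap score: variable sites get the milder penalty."""
    return conserved_penalty if model == "conserved" else variable_penalty
```
]]></p>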
				<p>The mapping of the gene into each target species takes place in three stages, in exactly the same manner as for pair-wise GeneMapper (see above). The sequence in which the target species are mapped is ordered by the evolutionary distance from the reference species; specifically, the gene is first mapped to the target species closest to the reference species, then to the next closest species, and so on. This particular order is used because it is comparatively easier to map genes to a species that is evolutionarily close to the reference species than to a species that is more distant. Each time an orthologous gene is predicted in a target species, it is added to the profile. The updated profile is a more complete representation of the statistical properties of the gene family and therefore helps us to derive a more accurate prediction of the ortholog in the next species.</p>
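				<p>The ordering and profile update described above can be summarized as a simple loop. This is a schematic sketch under stated assumptions: <it>map_gene</it> stands for the three stage mapping procedure and is supplied by the caller, the distance values are hypothetical inputs, and the profile is represented as a plain list of genes rather than a codon alignment.</p>
				<p><![CDATA[
```python
def map_to_all_species(reference_gene, targets, distance, map_gene):
    """Map a gene into each target in order of increasing distance.

    `targets` maps species name to target sequence, `distance` maps species
    name to its evolutionary distance from the reference, and `map_gene` is
    a caller-supplied function (profile, sequence) that returns a predicted
    ortholog or None.
    """
    profile = [reference_gene]
    predictions = {}
    for species in sorted(targets, key=lambda s: distance[s]):
        predicted = map_gene(profile, targets[species])
        predictions[species] = predicted
        if predicted is not None:
            profile.append(predicted)  # later species see the enlarged profile
    return predictions
```
]]></p>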
			</sec>
		</sec>
		<sec>
			<st>
				<p>Additional data files</p>
			</st>
			<p>The following additional data are included with the online version of this article: a gzipped tar file containing the data set of orthologous genes in human and mouse that was used to compare GeneMapper with Projector and GeneWise (Additional data file <supplr sid="S1">1</supplr>); a gzipped tar file containing the data set of orthologous genes in five vertebrates (human, chimpanzee, mouse, rat and chicken) that was used to compare pair-wise and multiple species GeneMapper (Additional data file <supplr sid="S2">2</supplr>); and a gzipped tar file containing GeneMapper annotations of the ENCODE regions (Additional data file <supplr sid="S3">3</supplr>).</p>
			<suppl id="S1">
				<title>
					<p>Additional data file 1</p>
				</title>
				<caption>
					<p>A gzipped tar file containing the data set of orthologous genes in human and mouse that was used to compare GeneMapper with Projector and GeneWise</p>
				</caption>
				<text>
					<p>A gzipped tar file containing the data set of orthologous genes in human and mouse that was used to compare GeneMapper with Projector and GeneWise</p>
				</text>
				<file name="gb-2006-7-4-r29-S1.tgz">
					<p>Click here for file</p>
				</file>
			</suppl>
			<suppl id="S2">
				<title>
					<p>Additional data file 2</p>
				</title>
				<caption>
					<p>A gzipped tar file containing the data set of orthologous genes in five vertebrates (human, chimpanzee, mouse, rat and chicken) that was used to compare pair-wise and multiple species GeneMapper</p>
				</caption>
				<text>
					<p>A gzipped tar file containing the data set of orthologous genes in five vertebrates (human, chimpanzee, mouse, rat and chicken) that was used to compare pair-wise and multiple species GeneMapper</p>
				</text>
				<file name="gb-2006-7-4-r29-S2.tgz">
					<p>Click here for file</p>
				</file>
			</suppl>
			<suppl id="S3">
				<title>
					<p>Additional data file 3</p>
				</title>
				<caption>
					<p>A gzipped tar file containing GeneMapper annotations of the ENCODE regions</p>
				</caption>
				<text>
					<p>A gzipped tar file containing GeneMapper annotations of the ENCODE regions</p>
				</text>
				<file name="gb-2006-7-4-r29-S3.tgz">
					<p>Click here for file</p>
				</file>
			</suppl>
		</sec>
	</bdy>
	<bm>
		<ack>
			<sec>
				<st>
					<p>Acknowledgements</p>
				</st>
				<p>We thank Colin Dewey and Narayanan Manikandan for their helpful suggestions and comments. The work was partially funded by NIH grants R01:HG02632-1 and U01:HG003150-01.</p>
			</sec>
		</ack>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>The Aceview genes</p>
				</title>
				<url>http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/</url>
			</bibl>
			<bibl id="B2">
				<title>
					<p>An overview of Ensembl.</p>
				</title>
				<aug>
					<au>
						<snm>Birney</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Andrews</snm>
						<fnm>TD</fnm>
					</au>
					<au>
						<snm>Bevan</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Caccamo</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Chen</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Clarke</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Coates</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Cuff</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Curwen</snm>
						<fnm>V</fnm>
					</au>
					<au>
						<snm>Cutts</snm>
						<fnm>T</fnm>
					</au>
					<etal/>
				</aug>
				<source>Genome Res</source>
				<pubdate>2004</pubdate>
				<volume>14</volume>
				<fpage>925</fpage>
				<lpage>928</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">479121</pubid>
						<pubid idtype="pmpid" link="fulltext">15078858</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B3">
				<title>
					<p>FlyBase: genes and gene models.</p>
				</title>
				<aug>
					<au>
						<snm>Drysdale</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Crosby</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Gelbart</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Campbell</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Emmert</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Matthews</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Russo</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Schroeder</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Smutniak</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Zhang</snm>
						<fnm>P</fnm>
					</au>
					<etal/>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2005</pubdate>
				<volume>33</volume>
				<issue>Database</issue>
				<fpage>D390</fpage>
				<lpage>D395</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">540000</pubid>
						<pubid idtype="pmpid" link="fulltext">15608223</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B4">
				<title>
					<p>Prediction of complete gene structures in human genomic DNA.</p>
				</title>
				<aug>
					<au>
						<snm>Burge</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Karlin</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>J Mol Biol</source>
				<pubdate>1997</pubdate>
				<volume>268</volume>
				<fpage>78</fpage>
				<lpage>94</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">9149143</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B5">
				<title>
					<p>A generalized hidden Markov model for the recognition of human genes in DNA.</p>
				</title>
				<aug>
					<au>
						<snm>Kulp</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Haussler</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Reese</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Eeckman</snm>
						<fnm>F</fnm>
					</au>
				</aug>
				<source>Proc Int Conf Intell Syst Mol Biol</source>
				<pubdate>1996</pubdate>
				<volume>4</volume>
				<fpage>134</fpage>
				<lpage>142</lpage>
				<xrefbib>
					<pubid idtype="pmpid">8877513</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B6">
				<title>
					<p>SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model.</p>
				</title>
				<aug>
					<au>
						<snm>Alexandersson</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Cawley</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Pachter</snm>
						<fnm>L</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2003</pubdate>
				<volume>13</volume>
				<fpage>496</fpage>
				<lpage>502</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">430255</pubid>
						<pubid idtype="pmpid" link="fulltext">12618381</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B7">
				<title>
					<p>Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map.</p>
				</title>
				<aug>
					<au>
						<snm>Flicek</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Keibler</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Hu</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Korf</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Brent</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2003</pubdate>
				<volume>13</volume>
				<fpage>46</fpage>
				<lpage>54</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">430948</pubid>
						<pubid idtype="pmpid" link="fulltext">12529305</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B8">
				<title>
					<p>Comparative gene prediction in human and mouse.</p>
				</title>
				<aug>
					<au>
						<snm>Parra</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Agarwal</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Abril</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Wiehe</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Fickett</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Guig&#243;</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2003</pubdate>
				<volume>13</volume>
				<fpage>108</fpage>
				<lpage>117</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">430976</pubid>
						<pubid idtype="pmpid" link="fulltext">12529313</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B9">
				<title>
					<p>Phylogenetic shadowing of primate sequences to find functional regions of the human genome.</p>
				</title>
				<aug>
					<au>
						<snm>Boffelli</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>McAuliffe</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Ovcharenko</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Lewis</snm>
						<fnm>KD</fnm>
					</au>
					<au>
						<snm>Ovcharenko</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Pachter</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Rubin</snm>
						<fnm>E</fnm>
					</au>
				</aug>
				<source>Science</source>
				<pubdate>2003</pubdate>
				<volume>299</volume>
				<fpage>1391</fpage>
				<lpage>1394</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12610304</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B10">
				<title>
					<p>Multiple-sequence functional annotation and the generalized hidden Markov phylogeny.</p>
				</title>
				<aug>
					<au>
						<snm>McAuliffe</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Pachter</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Jordan</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2004</pubdate>
				<volume>20</volume>
				<fpage>1850</fpage>
				<lpage>1860</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">14988105</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B11">
				<title>
					<p>Multiple organism gene finding by collapsed Gibbs sampling.</p>
				</title>
				<aug>
					<au>
						<snm>Chatterji</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Pachter</snm>
						<fnm>L</fnm>
					</au>
				</aug>
				<source>RECOMB '04: Proceedings of the Eighth Annual International Conference on Computational Molecular Biology</source>
				<publisher>San Diego, CA, USA. New York, NY: ACM Press</publisher>
				<pubdate>2004</pubdate>
				<volume>8</volume>
				<fpage>187</fpage>
				<lpage>193</lpage>
				<note>March 27-31 2004</note>
			</bibl>
			<bibl id="B12">
				<title>
					<p>Large multiple organism gene finding by collapsed Gibbs sampling.</p>
				</title>
				<aug>
					<au>
						<snm>Chatterji</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Pachter</snm>
						<fnm>L</fnm>
					</au>
				</aug>
				<source>J Comput Biol</source>
				<pubdate>2005</pubdate>
				<volume>12</volume>
				<fpage>599</fpage>
				<lpage>608</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">16108706</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B13">
				<title>
					<p>Computational identification of evolutionarily conserved exons.</p>
				</title>
				<aug>
					<au>
						<snm>Siepel</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Haussler</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>RECOMB '04: Proceedings of the Eighth Annual International Conference on Computational Molecular Biology</source>
				<publisher>San Diego, CA, USA. New York, NY: ACM Press</publisher>
				<pubdate>2004</pubdate>
				<volume>8</volume>
				<fpage>177</fpage>
				<lpage>186</lpage>
				<note>March 27-31 2004</note>
			</bibl>
			<bibl id="B14">
				<title>
					<p>Using multiple alignments to improve gene prediction.</p>
				</title>
				<aug>
					<au>
						<snm>Gross</snm>
						<fnm>SS</fnm>
					</au>
					<au>
						<snm>Brent</snm>
						<fnm>MR</fnm>
					</au>
				</aug>
				<source>RECOMB '05: Proceedings of the Ninth Annual International Conference on Computational Molecular Biology</source>
				<publisher>Cambridge, MA, USA</publisher>
				<pubdate>2005</pubdate>
				<fpage>374</fpage>
				<lpage>388</lpage>
				<note>May 14-16 2005</note>
			</bibl>
			<bibl id="B15">
				<title>
					<p>ECgene: genome-based EST clustering and gene modeling for alternative splicing.</p>
				</title>
				<aug>
					<au>
						<snm>Kim</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Shin</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Lee</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2005</pubdate>
				<volume>15</volume>
				<fpage>566</fpage>
				<lpage>576</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1074371</pubid>
						<pubid idtype="pmpid" link="fulltext">15805497</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B16">
				<title>
					<p>GMAP: a genomic mapping and alignment program for mRNA and EST sequences.</p>
				</title>
				<aug>
					<au>
						<snm>Wu</snm>
						<fnm>TD</fnm>
					</au>
					<au>
						<snm>Watanabe</snm>
						<fnm>CK</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>21</volume>
				<fpage>1859</fpage>
				<lpage>1875</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15728110</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B17">
				<title>
					<p>BLAT-the BLAST-like alignment tool.</p>
				</title>
				<aug>
					<au>
						<snm>Kent</snm>
						<fnm>W</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2002</pubdate>
				<volume>12</volume>
				<fpage>656</fpage>
				<lpage>664</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">187518</pubid>
						<pubid idtype="pmpid" link="fulltext">11932250</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B18">
				<title>
					<p>Fast comparison of a DNA sequence with a protein sequence database.</p>
				</title>
				<aug>
					<au>
						<snm>Huang</snm>
						<fnm>X</fnm>
					</au>
				</aug>
				<source>Microb Comp Genomics</source>
				<pubdate>1996</pubdate>
				<volume>1</volume>
				<fpage>281</fpage>
				<lpage>291</lpage>
				<xrefbib>
					<pubid idtype="pmpid">9689213</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B19">
				<title>
					<p>Gene recognition via spliced sequence alignment.</p>
				</title>
				<aug>
					<au>
						<snm>Gelfand</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Mironov</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Pevzner</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>1996</pubdate>
				<volume>93</volume>
				<fpage>9061</fpage>
				<lpage>9066</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">38595</pubid>
						<pubid idtype="pmpid" link="fulltext">8799154</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B20">
				<title>
					<p>GeneWise and Genomewise.</p>
				</title>
				<aug>
					<au>
						<snm>Birney</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Clamp</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Durbin</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2004</pubdate>
				<volume>14</volume>
				<fpage>988</fpage>
				<lpage>995</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">479130</pubid>
						<pubid idtype="pmpid" link="fulltext">15123596</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B21">
				<title>
					<p>Computational inference of homologous gene structures in the human genome.</p>
				</title>
				<aug>
					<au>
						<snm>Yeh</snm>
						<fnm>RF</fnm>
					</au>
					<au>
						<snm>Lim</snm>
						<fnm>LP</fnm>
					</au>
					<au>
						<snm>Burge</snm>
						<fnm>CB</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2001</pubdate>
				<volume>11</volume>
				<fpage>803</fpage>
				<lpage>816</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">311055</pubid>
						<pubid idtype="pmpid" link="fulltext">11337476</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B22">
				<title>
					<p>JIGSAW: integration of multiple sources of evidence for gene prediction.</p>
				</title>
				<aug>
					<au>
						<snm>Allen</snm>
						<fnm>JE</fnm>
					</au>
					<au>
						<snm>Salzberg</snm>
						<fnm>SL</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>21</volume>
				<fpage>3596</fpage>
				<lpage>3603</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">16076884</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B23">
				<title>
					<p>ExonHunter: a comprehensive approach to gene finding.</p>
				</title>
				<aug>
					<au>
						<snm>Brejova</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Brown</snm>
						<fnm>DG</fnm>
					</au>
					<au>
						<snm>Li</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Vinar</snm>
						<fnm>T</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>21</volume>
				<issue>Suppl 1</issue>
				<fpage>i57</fpage>
				<lpage>i65</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15961499</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B24">
				<title>
					<p>Gene structure conservation aids similarity based gene prediction.</p>
				</title>
				<aug>
					<au>
						<snm>Meyer</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Durbin</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2004</pubdate>
				<volume>32</volume>
				<fpage>776</fpage>
				<lpage>783</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">373336</pubid>
						<pubid idtype="pmpid" link="fulltext">14764925</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B25">
				<title>
					<p>GeneMapper Supplementary Webpage</p>
				</title>
				<url>http://bio.math.berkeley.edu/genemapper/suppl.html</url>
			</bibl>
			<bibl id="B26">
				<title>
					<p>The ENCODE (ENCyclopedia Of DNA Elements) Project.</p>
				</title>
				<aug>
					<au>
						<snm>Feingold</snm>
						<fnm>EA</fnm>
					</au>
					<au>
						<snm>Good</snm>
						<fnm>PJ</fnm>
					</au>
					<au>
						<snm>Guyer</snm>
						<fnm>MS</fnm>
					</au>
					<au>
						<snm>Kamholz</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Liefer</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Wetterstrand</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Collins</snm>
						<fnm>FS</fnm>
					</au>
				</aug>
				<source>Science</source>
				<pubdate>2004</pubdate>
				<volume>306</volume>
				<fpage>636</fpage>
				<lpage>640</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15499007</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B27">
				<title>
					<p>The GENCODE Project: encyclopedia of genes and gene variants</p>
				</title>
				<url>http://genome.imim.es/gencode/</url>
			</bibl>
			<bibl id="B28">
				<title>
					<p>Eval: a software package for analysis of genome annotations.</p>
				</title>
				<aug>
					<au>
						<snm>Keibler</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Brent</snm>
						<fnm>MR</fnm>
					</au>
				</aug>
				<source>BMC Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>4</volume>
				<fpage>50</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">270064</pubid>
						<pubid idtype="pmpid" link="fulltext">14565849</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B29">
				<title>
					<p>Evaluation of gene structure prediction programs.</p>
				</title>
				<aug>
					<au>
						<snm>Burset</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Guig&#243;</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>Genomics</source>
				<pubdate>1996</pubdate>
				<volume>34</volume>
				<fpage>353</fpage>
				<lpage>367</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">8786136</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B30">
				<title>
					<p>The UCSC Genome Browser Database.</p>
				</title>
				<aug>
					<au>
						<snm>Karolchik</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Baertsch</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Diekhans</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Furey</snm>
						<fnm>TS</fnm>
					</au>
					<au>
						<snm>Hinrichs</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Lu</snm>
						<fnm>YT</fnm>
					</au>
					<au>
						<snm>Roskin</snm>
						<fnm>KM</fnm>
					</au>
					<au>
						<snm>Schwartz</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Sugnet</snm>
						<fnm>CW</fnm>
					</au>
					<au>
						<snm>Thomas</snm>
						<fnm>DJ</fnm>
					</au>
					<etal/>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2003</pubdate>
				<volume>31</volume>
				<fpage>51</fpage>
				<lpage>54</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">165576</pubid>
						<pubid idtype="pmpid" link="fulltext">12519945</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B31">
				<title>
					<p>StrataSplice-A human splice site predictor</p>
				</title>
				<url>http://www.sanger.ac.uk/Software/analysis/stratasplice</url>
			</bibl>
			<bibl id="B32">
				<title>
					<p>A model of evolutionary change in proteins.</p>
				</title>
				<aug>
					<au>
						<snm>Dayhoff</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Schwartz</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Orcutt</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Atlas of Protein Sequence and Structure</source>
				<publisher>Washington DC: National Biomedical Research Foundation</publisher>
				<pubdate>1978</pubdate>
				<volume>5</volume>
				<fpage>345</fpage>
				<lpage>352</lpage>
			</bibl>
			<bibl id="B33">
				<title>
					<p>Amino acid substitution matrices from protein blocks.</p>
				</title>
				<aug>
					<au>
						<snm>Henikoff</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Henikoff</snm>
						<fnm>JG</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>1992</pubdate>
				<volume>89</volume>
				<fpage>10915</fpage>
				<lpage>10919</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">50453</pubid>
						<pubid idtype="pmpid" link="fulltext">1438297</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B34">
				<aug>
					<au>
						<snm>Durbin</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Eddy</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Krogh</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Mitchison</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<source>Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids</source>
				<publisher>Cambridge: Cambridge University Press</publisher>
				<pubdate>1998</pubdate>
			</bibl>
			<bibl id="B35">
				<title>
					<p>Initial sequencing and comparative analysis of the mouse genome.</p>
				</title>
				<aug>
					<au>
						<snm>Waterston</snm>
						<fnm>RH</fnm>
					</au>
					<au>
						<snm>Lindblad-Toh</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Birney</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Rogers</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Abril</snm>
						<fnm>JF</fnm>
					</au>
					<au>
						<snm>Agarwal</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Agarwala</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Ainscough</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Alexandersson</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>An</snm>
						<fnm>P</fnm>
					</au>
					<etal/>
				</aug>
				<source>Nature</source>
				<pubdate>2002</pubdate>
				<volume>420</volume>
				<fpage>520</fpage>
				<lpage>562</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12466850</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B36">
				<title>
					<p>Active conservation of noncoding sequences revealed by three-way species comparisons.</p>
				</title>
				<aug>
					<au>
						<snm>Dubchak</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Brudno</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Loots</snm>
						<fnm>GG</fnm>
					</au>
					<au>
						<snm>Pachter</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Mayor</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Rubin</snm>
						<fnm>EM</fnm>
					</au>
					<au>
						<snm>Frazer</snm>
						<fnm>KA</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2000</pubdate>
				<volume>10</volume>
				<fpage>1304</fpage>
				<lpage>1306</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">310906</pubid>
						<pubid idtype="pmpid" link="fulltext">10984448</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B37">
				<title>
					<p>Accurate identification of novel human genes through simultaneous gene prediction in human, mouse, and rat.</p>
				</title>
				<aug>
					<au>
						<snm>Dewey</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Wu</snm>
						<fnm>JQ</fnm>
					</au>
					<au>
						<snm>Cawley</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Alexandersson</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Gibbs</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Pachter</snm>
						<fnm>L</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2004</pubdate>
				<volume>14</volume>
				<fpage>661</fpage>
				<lpage>664</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">383310</pubid>
						<pubid idtype="pmpid" link="fulltext">15060007</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B38">
				<title>
					<p>dbEST: database for 'expressed sequence tags'.</p>
				</title>
				<aug>
					<au>
						<snm>Boguski</snm>
						<fnm>MS</fnm>
					</au>
					<au>
						<snm>Lowe</snm>
						<fnm>TM</fnm>
					</au>
					<au>
						<snm>Tolstoshev</snm>
						<fnm>CM</fnm>
					</au>
				</aug>
				<source>Nat Genet</source>
				<pubdate>1993</pubdate>
				<volume>4</volume>
				<fpage>332</fpage>
				<lpage>333</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">8401577</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B39">
				<title>
					<p>The Vertebrate Genome Annotation (Vega) database.</p>
				</title>
				<aug>
					<au>
						<snm>Ashurst</snm>
						<fnm>JL</fnm>
					</au>
					<au>
						<snm>Chen</snm>
						<fnm>CK</fnm>
					</au>
					<au>
						<snm>Gilbert</snm>
						<fnm>JGR</fnm>
					</au>
					<au>
						<snm>Jekosch</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Keenan</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Meidl</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Searle</snm>
						<fnm>SM</fnm>
					</au>
					<au>
						<snm>Stalker</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Storey</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Trevanion</snm>
						<fnm>S</fnm>
					</au>
					<etal/>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2005</pubdate>
				<volume>33</volume>
				<issue>Database</issue>
				<fpage>D459</fpage>
				<lpage>D465</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">540089</pubid>
						<pubid idtype="pmpid" link="fulltext">15608237</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
		</refgrp>
	</bm>
</art>
