<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>gb-2004-5-6-r43</ui>
	<ji>GBJ</ji>
	<fm>
		<dochead>Software</dochead>
		<bibl>
			<title>
				<p>TXTGate: profiling gene groups with text-based information</p>
			</title>
			<aug>
				<au id="A1">
					<snm>Glenisson</snm>
					<fnm>Patrick</fnm>
					<insr iid="I1"/>
				</au>
				<au id="A2" ca="yes">
					<snm>Coessens</snm>
					<fnm>Bert</fnm>
					<insr iid="I1"/>
					<email>bert.coessens@esat.kuleuven.ac.be</email>
				</au>
				<au id="A3">
					<snm>Van Vooren</snm>
					<fnm>Steven</fnm>
					<insr iid="I1"/>
				</au>
				<au id="A4">
					<snm>Mathys</snm>
					<fnm>Janick</fnm>
					<insr iid="I1"/>
				</au>
				<au id="A5">
					<snm>Moreau</snm>
					<fnm>Yves</fnm>
					<insr iid="I1"/>
					<insr iid="I2"/>
				</au>
				<au id="A6">
					<snm>De Moor</snm>
					<fnm>Bart</fnm>
					<insr iid="I1"/>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>Departement Elektrotechniek (ESAT), Faculteit Toegepaste Wetenschappen, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Heverlee (Leuven), Belgium</p>
				</ins>
				<ins id="I2">
					<p>Current address: Center for Biological Sequence Analysis, BioCentrum, Danish Technical University, Kemitorvet, DK-2800 Lyngby, Denmark</p>
				</ins>
			</insg>
			<source>Genome Biology</source>
			<issn>1465-6906</issn>
			<pubdate>2004</pubdate>
			<volume>5</volume>
			<issue>6</issue>
			<fpage>R43</fpage>
			<url>http://genomebiology.com/2004/5/6/R43</url>
			<xrefbib>
				<pubidlist><pubid idtype="pmpid">15186494</pubid><pubid idtype="doi">10.1186/gb-2004-5-6-r43</pubid>
				</pubidlist></xrefbib>
		</bibl>
		<history>
			<rec>
				<date>
					<day>24</day>
					<month>11</month>
					<year>2003</year>
				</date>
			</rec>
			<revrec>
				<date>
					<day>3</day>
					<month>2</month>
					<year>2004</year>
				</date>
			</revrec>
			<acc>
				<date>
					<day>27</day>
					<month>4</month>
					<year>2004</year>
				</date>
			</acc>
			<pub>
				<date>
					<day>28</day>
					<month>5</month>
					<year>2004</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2004</year>
			<collab>Glenisson et al.; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.</collab>
		</cpyrt>
		<shortabs>
			<p>This study implemented a framework called TXTGate that combines literature indices of selected public biological resources in a flexible text-mining system designed towards the analysis of groups of genes.</p>
		</shortabs>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<p>We implemented a framework called TXTGate that combines literature indices of selected public biological resources in a flexible text-mining system designed towards the analysis of groups of genes. By means of tailored vocabularies, term- as well as gene-centric views are offered on selected textual fields and MEDLINE abstracts used in LocusLink and the <it>Saccharomyces </it>Genome Database. Subclustering and links to external resources allow for in-depth analysis of the resulting term profiles.</p>
			</sec>
		</abs>
	</fm>
	<meta>
		<classifications>
			<classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
			<classification type="BMC" subtype="man_spc_id" id="30010013">Methods</classification>
		</classifications>
	</meta>
	<bdy>
		<sec>
			<st>
				<p>Rationale</p>
			</st>
			<p>Recent advances in high-throughput methods such as microarrays enable systematic testing of the functions of multiple genes, their interrelatedness and the controlled circumstances in which ensuing observations hold. As a result, scientific discoveries and hypotheses are stacking up, all primarily reported in the form of free text. However, because it is hard to extract information from large amounts of raw text, various specialized databases have been implemented to provide a complementary resource for designing, performing or analyzing large-scale experiments.</p>
			<p>Until now, the fact that there is little difference between retrieving an abstract from MEDLINE and downloading an entry from a biological database has been largely overlooked <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. The fading of the boundaries between text from a scientific article and a curated annotation of a gene entry in a database is readily illustrated by the GeneRIF feature in LocusLink <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>, where snippets of a relevant article pertaining to a gene's function are manually extracted and directly pasted as an attribute in the database. The broadening of biologists' scope of investigation, along with the growing amount of information, results in an increasing need to move from single gene or keyword-based queries to more refined schemes that allow comprehensive views of text-oriented databases.</p>
			<p>As gene-expression studies typically output a list of dozens or hundreds of genes that are co-expressed, a researcher is faced with the assignment of biological meaning to such lists. Several text-mining approaches have been developed to this end. Masys <it>et al. </it><abbrgrp><abbr bid="B3">3</abbr></abbrgrp> link groups of genes with relevant MEDLINE abstracts through the PubMed engine. Each cluster is characterized by a pool of keywords derived from both the Medical Subject Headings (MeSH) and the Unified Medical Language System (UMLS) ontology. Jenssen <it>et al. </it><abbrgrp><abbr bid="B4">4</abbr></abbrgrp> set up a pioneering online system to link co-expression information from a microarray experiment with the cocitation network they constructed. This literature network covers co-occurrence information of gene identifiers in more than 10 million MEDLINE abstracts. Their system characterizes co-expressed genes using the MeSH keywords attached to the abstracts about those genes. Shatkay <it>et al. </it><abbrgrp><abbr bid="B5">5</abbr></abbrgrp> link abstracts to genes in a probabilistic scheme that uses the EM algorithm to estimate the parameters of the word distributions underlying a 'theme'. Genes are identified as similar when their corresponding gene-by-documents representations are close. Chaussabel and Sher <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> and Glenisson <it>et al. </it><abbrgrp><abbr bid="B7">7</abbr></abbrgrp> provide a proof of principle on how clustering of genes encoded in a keyword-based representation can further discern relevant subpatterns. Finally, Raychaudhuri <it>et al. </it><abbrgrp><abbr bid="B8">8</abbr></abbrgrp> developed a method called neighborhood divergence, to quantify the functional coherence of a group of genes using a database that links genes to documents. 
The score is successfully applied to both gold-standard and expression data, but has the slight drawback that it does not give information on the actual function. Their method is indeed geared to the identification of biologically coherent groups, rather than their interpretation.</p>
			<p>Our system is built taking into account three main considerations, in an attempt to improve the quality and interpretability of term profiles. First, the construction of a sound linkage between genes and MEDLINE abstracts is often problem-dependent and constitutes a research track on its own that requires advanced document-classification strategies as, for example, proposed by Leonard <it>et al. </it><abbrgrp><abbr bid="B9">9</abbr></abbrgrp> or Raychaudhuri <it>et al. </it><abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. Despite some shortcomings, therefore, curated gene-literature references are helpful resources to exploit. Second, the information contained within curated gene references is sometimes diverse and can range from sequence to disease. In addition, the research questions that scientists are addressing when they scrutinize gene groups from high-throughput assays are similarly diverse. Therefore, considering all the terms occurring in a large set of documents (that is, a general vocabulary) might be detrimental to the extraction of terms that are relevant to the question at hand. The construction of separate vocabularies according to gene name, disease and function seems a logical choice to provide increased insight. Third, as mentioned previously, annotations offered by curated gene databases are often in semi-structured form and encompass keywords, sentences or paragraphs. To facilitate integration of such annotations with existing knowledge, controlled vocabularies that describe conceptual properties are of great value when constructing interoperable and computer-parsable systems. 
A number of structured vocabularies have already arisen (Gene Ontology (GO) <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, MeSH <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>, eVOC <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>) and, slowly but surely, certain standards are systematically being adopted to store and represent biological information <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>.</p>
			<p>Armed with these insights, we developed TXTGate <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, a platform that offers multiple 'views' on vast amounts of gene-based free-text information available in selected curated database entries and scientific publications. TXTGate enables detailed functional analysis of interesting gene groups by displaying key terms extracted from the associated literature and by offering options to link out to other resources or to subcluster the genes on the basis of text. This way, we address on the one hand the need for easy means to validate gene clusters arising from, for instance, microarray experiments, and on the other hand the problem of using scientific literature in the form of free text as a source of functional information about genes. The strength of TXTGate is its use of tailored vocabularies to visualize only the information most relevant to the query at hand. TXTGate is implemented as a web application and is available for academic use <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>.</p>
		</sec>
		<sec>
			<st>
				<p>Related software</p>
			</st>
			<p>This work extends the general ideas of textual profiling and clustering presented in Blaschke <it>et al. </it><abbrgrp><abbr bid="B16">16</abbr></abbrgrp> and Chaussabel and Sher <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, where the utility of literature indices for profiling gene groups in yeast and humans is proven. TXTGate implements the vector-space model for gene profiling <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> and provides indices for MEDLINE abstracts and selected functional annotations from two public databases. Various engineered domain-specific vocabularies (term- as well as gene-centric) act as perspectives to the literature and the tool provides direct links to external resources. In what follows, we compare TXTGate to other reported biological text-mining software.</p>
			<p>MedMiner <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr></abbrgrp> retrieves relevant abstracts by formulating expanded queries to PubMed. It uses entries from the GeneCards database <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> to fish for additional relevant keywords to expand a query. The resulting filtered abstracts are summarized in keywords and sentences, and feedback loops are provided. Nevertheless, the system is directed at querying terms and specific gene-drug or gene-gene relationships, rather than at scrutinizing gene clusters. MedMOLE <abbrgrp><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr></abbrgrp> is also a system to query MEDLINE more intelligently and detects Human Genome Organization (HUGO) names in abstracts via a natural language processing (NLP)-based gene-name extractor. The retrieved abstracts can be clustered, and top keywords are presented. However, the application scales less well, is not effective at profiling groups of genes, and the summaries provide much less detail than MedMiner and TXTGate. GEISHA <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B22">22</abbr></abbrgrp> is a tool for profiling gene clusters with an emphasis on summarization within a shallow parsing framework. This system was implemented for <it>Escherichia coli </it>but is no longer updated. PubGene <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B23">23</abbr></abbrgrp> is a database containing gene co-occurrence and cocitation networks of human genes derived from the full MEDLINE database. For a given set of genes it reports the literature network they reside in, together with their high-scoring MeSH terms. As not all relevant information can be captured by gene symbols or MeSH terms, the functionalities offered by TXTGate provide complementary views to interpret groups of genes. 
Although our colinkage feature (being a weaker form of co-occurrence that spans only the set of 73,152 MEDLINE abstracts used in LocusLink) is less elaborate than the possibilities offered by PubGene, we will show its utility and added value through its integration in the broader TXTGate framework. MedGene <abbrgrp><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr></abbrgrp> and G2D <abbrgrp><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr></abbrgrp> are specialized databases that, in contrast to TXTGate, are geared towards ranking genes by disease. They accept user-defined queries scrutinizing gene-disease, disease-disease or gene-gene relationships extracted from the literature. Finally, MeKE <abbrgrp><abbr bid="B28">28</abbr><abbr bid="B29">29</abbr></abbrgrp> is an application listing gene functions extracted by an ontology-based NLP system. Its current setup is directed more towards a functional knowledge base than towards comprehensively profiling information coming from groups of genes, as offered by our software.</p>
		</sec>
		<sec>
			<st>
				<p>Application overview</p>
			</st>
			<p>A conceptual overview of the system is shown in Figure <figr fid="F1">1</figr>. Various literature indices were created based on selected annotation fields and linked MEDLINE information, both present in the curated repositories LocusLink and the <it>Saccharomyces </it>Genome Database (SGD). Several tailored vocabularies derived from public resources (GO, MeSH, Online Mendelian Inheritance in Man (OMIM), eVOC and HUGO) act as a perspective on the textual information. A user queries any of these indices by providing a group of genes of interest; the result is a summary keyword profile that can be used for further query building against a variety of other databases. Currently, TXTGate smoothly accommodates queries of around 200 genes. Alternatively, the group can be subclustered on the basis of the selected textual information to discern substructures not apparent in the original summary profile. The operations that can be carried out are described below.</p>
			<fig id="F1">
				<title>
					<p>Figure 1</p>
				</title>
				<caption>
					<p>Conceptual overview of TXTGate</p>
				</caption>
				<text>
					<p>Conceptual overview of TXTGate. We indexed two different sources of textual information about genes (LocusLink and SGD) using different domain vocabularies (offline process). These indices are used online for textual gene profiling and clustering of interesting gene groups. TXTGate's link-out feature to external databases makes it possible to investigate the profiles in more detail.</p>
				</text>
				<graphic file="gb-2004-5-6-r43-1"/>
			</fig>
			<sec>
				<st>
					<p>Combining multiple, linked documents into a single gene profile</p>
				</st>
				<p>When a given gene has several curated MEDLINE references associated with it, we combine these abstracts into an indexed gene entry by taking the mean profile. This operation is part of the offline process.</p>
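				<p>As an illustration, the averaging step can be sketched in a few lines of Python. This is a minimal sketch and not the TXTGate indexing code (which is written in C++); the sparse dictionary representation, the function name <it>mean_profile</it> and the example weights are assumptions made for this example.</p>

```python
from typing import Dict, List

# Assumption: a text profile is stored sparsely as {term: weight}.
Profile = Dict[str, float]


def mean_profile(profiles: List[Profile]) -> Profile:
    """Average a list of sparse term-weight vectors component-wise.

    A term that is absent from a profile counts as weight 0.0.
    """
    if not profiles:
        return {}
    totals: Profile = {}
    for profile in profiles:
        for term, weight in profile.items():
            totals[term] = totals.get(term, 0.0) + weight
    n = len(profiles)
    return {term: total / n for term, total in totals.items()}


# Hypothetical profiles of two abstracts linked to the same gene.
abstracts = [
    {"glycerol": 0.8, "kinas": 0.2},
    {"glycerol": 0.4, "stress": 0.6},
]
gene_entry = mean_profile(abstracts)
```

				<p>In this sketch a term missing from one abstract still contributes a zero to the mean, so terms shared by many of a gene's abstracts end up with the largest weights.</p>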
			</sec>
			<sec>
				<st>
					<p>Combining multiple gene profiles into a group profile</p>
				</st>
				<p>To summarize a cluster of genes and explore the most interesting terms they share, we compute the mean and variance of the terms over the group. Although simple, these statistics already reveal information on interesting terms characterizing the gene group. This is performed online.</p>
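				<p>The group statistics can be sketched as follows. This is a hedged illustration only: the function names, the use of the population variance and the ranking of terms by mean weight are assumptions, as the text does not specify these details.</p>

```python
from statistics import mean, pvariance
from typing import Dict, List, Tuple

Profile = Dict[str, float]  # assumption: sparse {term: weight} vectors


def group_statistics(gene_profiles: List[Profile]) -> Dict[str, Tuple[float, float]]:
    """Return {term: (mean, variance)} over all gene profiles in a group."""
    terms = sorted({term for profile in gene_profiles for term in profile})
    stats = {}
    for term in terms:
        weights = [profile.get(term, 0.0) for profile in gene_profiles]
        # Assumption: population variance; the text does not say which
        # variance estimator is used.
        stats[term] = (mean(weights), pvariance(weights))
    return stats


def top_terms(stats: Dict[str, Tuple[float, float]], n: int = 5) -> List[str]:
    """Rank terms by mean weight, highest first (an illustrative choice)."""
    ranked = sorted(stats.items(), key=lambda item: item[1][0], reverse=True)
    return [term for term, _ in ranked[:n]]


# Hypothetical two-gene group.
group = [{"pyruv": 0.6, "ethanol": 0.2}, {"pyruv": 0.4}]
stats = group_statistics(group)
```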
			</sec>
			<sec>
				<st>
					<p>Subclustering gene profiles</p>
				</st>
				<p>We offer the possibility online of subclustering a group of up to 200 genes by means of hierarchical clustering. Ward's method was chosen because of its deterministic nature and the computational advantage of using the same solution when consecutively considering different numbers of clusters <it>k</it>. By varying the threshold at which to cut the tree, we can obtain an arbitrary number of clusters.</p>
				<p>Text profiling, clustering and the supporting web interface are implemented as a Java web application that communicates with a MySQL database via Java Remote Method Invocation <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. The literature indices are generated using custom-developed indexing software written in C++. Code is available on request.</p>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Program development</p>
			</st>
			<sec>
				<st>
					<p>Indexing</p>
				</st>
				<p>The indices are built using the vector-space model <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>, where a textual entity is represented by a vector (or text profile) of which each component corresponds to a single (multi-word) term from the entire set of terms (the vocabulary) being used. For each component a value denotes the importance of a given term, represented by a weight. Indexing a document <graphic file="gb-2004-5-6-r43-i1.gif"/> is performed by the calculation of these weights:</p>
				<p>
					<graphic file="gb-2004-5-6-r43-i2.gif"/>
				</p>
				<p>Each <it>w</it><sub><it>i,j </it></sub>in the vector of document <it>i </it>is a weight for term <it>j </it>from the vocabulary of size <it>N</it>. This representation is often referred to as 'bag-of-words'. All textual information is stemmed using the Porter stemmer <abbrgrp><abbr bid="B32">32</abbr></abbrgrp> (stemming is the automated conflation of related words, usually by reducing the words to a common root form) and indexed with a normalized inverse document frequency (IDF) weighting scheme, a reasonable choice for modeling pieces of text comprising up to 200 terms, as observed in database annotations and MEDLINE abstracts. With <it>D </it>the number of documents in the collection and <it>D</it><sub><it>t </it></sub>the number of documents containing term <it>t</it>, IDF is defined as</p>
				<p>
					<graphic file="gb-2004-5-6-r43-i3.gif"/>
				</p>
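				<p>A minimal Python sketch of such a weighting scheme is given below. It assumes the common form IDF = log(<it>D</it>/<it>D</it><sub><it>t</it></sub>), a weight equal to the IDF for every vocabulary term present in a document, and normalization of each vector to unit length; these details, like the function names, are assumptions for illustration and may differ from the actual TXTGate indexer.</p>

```python
import math
from typing import Dict, List, Set


def idf_weights(documents: List[Set[str]]) -> Dict[str, float]:
    """Compute log(D / D_t) for every term t seen in the collection.

    Assumption: documents are already stemmed and restricted to the
    vocabulary, and are represented as sets of terms.
    """
    d = len(documents)
    doc_freq: Dict[str, int] = {}
    for doc in documents:
        for term in doc:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(d / d_t) for term, d_t in doc_freq.items()}


def index_document(doc: Set[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Build a unit-length vector whose non-zero weights are IDF values."""
    raw = {term: idf[term] for term in doc if term in idf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return raw
    return {term: w / norm for term, w in raw.items()}


# Hypothetical three-document collection.
docs = [{"gene", "kinas"}, {"gene", "glycerol"}, {"gene"}]
idf = idf_weights(docs)
profile = index_document(docs[0], idf)
```

				<p>In this toy collection the term 'gene' occurs in every document and therefore receives a weight of zero, which is the intended effect of IDF weighting.</p>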
				<p>We downloaded the entire LocusLink (as of 8 April, 2003) and SGD (15 January, 2003) databases, and identified and indexed subsets of fields (such as GO annotations and functional summaries) that were most relevant in the present context. Although indexing these database entries could have been performed on all fields at once, we deemed a preservation of selected parts of LocusLink's and SGD's logical field structure more appropriate for functional gene profiling. We indexed not only the textual annotations but also the 73,152 MEDLINE abstracts referred to in all entries of LocusLink, as well as the 24,909 abstracts linked to from SGD. Gene-specific indices were created by taking the average over all indices of MEDLINE abstracts annotated to a certain gene in LocusLink and SGD. The resulting indices are used in TXTGate as a basis for literature profiling and further query building of genes of interest. Table <tblr tid="T1">1</tblr> gives an overview of the indexed resources of textual information and connects them to the domain vocabularies used.</p>
				<tbl id="T1" hint_layout="single">
					<title>
						<p>Table 1</p>
					</title>
					<caption>
						<p>Overview of the indexed resources of textual information in TXTGate</p>
					</caption>
					<tblbdy cols="3">
						<r>
							<c ca="left">
								<p>Resource</p>
							</c>
							<c ca="left">
								<p>Information fields</p>
							</c>
							<c ca="left">
								<p>Domain vocabularies used</p>
							</c>
						</r>
						<r>
							<c cspan="3">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>LocusLink</p>
							</c>
							<c ca="left">
								<p>Linked MEDLINE abstracts</p>
							</c>
							<c ca="left">
								<p>GO, MeSH, eVOC, OMIM, HUGO gene symbols</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>GeneRIF annotations</p>
							</c>
							<c ca="left">
								<p>GO</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>Functional summaries</p>
							</c>
							<c ca="left">
								<p>GO</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>GO annotations</p>
							</c>
							<c ca="left">
								<p>GO</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>SGD</p>
							</c>
							<c ca="left">
								<p>Linked MEDLINE abstracts</p>
							</c>
							<c ca="left">
								<p>GO-pruned, SGD gene symbols</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>GO annotations</p>
							</c>
							<c ca="left">
								<p>GO-pruned</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>In the second column we specify which fields of the resource were used. The third column lists the domain vocabularies with which the information was indexed.</p>
					</tblfn>
				</tbl>
			</sec>
			<sec>
				<st>
					<p>Construction of domain vocabularies</p>
				</st>
				<p>We constructed five different term-centric domain vocabularies that provide different views on the gene-specific information we indexed. All vocabulary sources underwent parsing and pruning operations to obtain stemmed words and phrases, eliminating stop words (such as 'then', 'as', 'of', 'gene') from a handcrafted list. We again applied the Porter stemmer <abbrgrp><abbr bid="B32">32</abbr></abbrgrp> to avoid information loss due to morphological and inflexional endings. Although stemming is not always desirable, for relatively small documents it has proved advantageous. Where applicable we derived phrases directly from the vocabulary source.</p>
				<p>A first vocabulary was derived from the GO <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> and comprises 17,965 terms. GO is a dynamic controlled hierarchy of multi-word terms with a wide coverage of life-science literature, and genetics in particular. We considered it an ideal source from which to extract a highly relevant and relatively noise-free domain vocabulary. We retained all composite GO terms shorter than five tokens as phrases. Longer terms containing brackets or commas were split to increase their detection. For the yeast indices, we pruned the vocabulary, retaining only those terms occurring at least twice and in fewer than 20% of all MEDLINE abstracts referred to in SGD <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>, obtaining a new vocabulary of 3,867 terms.</p>
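				<p>The pruning rule for the yeast vocabulary can be sketched as below. The sketch interprets 'occurring at least twice' as occurring in at least two abstracts, which is an assumption; the function name and the toy corpus are likewise hypothetical.</p>

```python
from typing import List, Set


def prune_vocabulary(
    vocabulary: List[str],
    documents: List[Set[str]],
    min_docs: int = 2,
    max_fraction: float = 0.2,
) -> List[str]:
    """Keep terms found in at least min_docs documents and in a share of
    the collection strictly below max_fraction.

    Assumption: 'occurring at least twice' is read as a document
    frequency of two or more.
    """
    n_docs = len(documents)
    kept = []
    for term in vocabulary:
        df = sum(1 for doc in documents if term in doc)
        if df >= min_docs and max_fraction > df / n_docs:
            kept.append(term)
    return kept


# Hypothetical corpus of 12 stemmed abstracts.
docs = [{"cell", "sporul"}, {"cell", "sporul"}, {"cell", "rare"}] + [{"cell"}] * 9
pruned = prune_vocabulary(["cell", "sporul", "rare"], docs)
```

				<p>Here 'cell' is dropped because it occurs in every abstract, and 'rare' because it occurs only once; only 'sporul' survives.</p>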
				<p>Two other domain vocabularies are rather similar in scope but differ in size. One is based on the MeSH <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>, the National Library of Medicine's controlled vocabulary thesaurus, and counts 27,930 terms. The other is based on OMIM's Morbid Map <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. This is a cytogenetic map location of all disease genes present in OMIM and their associated diseases. We extracted all disease terms to construct a 2,969-term vocabulary. A fifth domain vocabulary was drawn from eVOC <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>, a thesaurus consisting of four orthogonal controlled vocabularies encompassing the domain of human gene-expression data. It includes terms related to anatomical system, cell type, pathology, and developmental stage.</p>
				<p>In addition to these term-centric domain vocabularies we constructed two gene-centric vocabularies with the screening of co-occurring and colinked genes in mind. 'Co-occurrence' denotes the simultaneous presence of gene names within a single abstract, as described by Jenssen <it>et al. </it><abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. We define 'colinkage' here as a weaker form of co-occurrence screening for the simultaneous presence of gene names in the pool of abstracts that is linked to a given group of genes.</p>
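				<p>The difference between the two notions can be made concrete with a short sketch. Co-occurrence counts pairs of gene symbols within one abstract, whereas colinkage, as read here, counts gene symbols anywhere in the pool of abstracts linked to a gene group. The function names and the exact-match tokenization are assumptions; synonym handling and case normalization are deliberately ignored.</p>

```python
from collections import Counter
from itertools import combinations
from typing import List, Set


def cooccurrence_counts(abstracts: List[List[str]], gene_symbols: Set[str]) -> Counter:
    """Count pairs of gene symbols that appear together in one abstract."""
    counts: Counter = Counter()
    for tokens in abstracts:
        present = sorted(gene_symbols.intersection(tokens))
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


def colinked_symbols(pooled_abstracts: List[List[str]], gene_symbols: Set[str]) -> Counter:
    """Count, per gene symbol, the abstracts in the pool that mention it.

    Assumption: the pool holds all abstracts linked to the genes of the
    query group, so symbols need not share a single abstract.
    """
    counts: Counter = Counter()
    for tokens in pooled_abstracts:
        for symbol in gene_symbols.intersection(tokens):
            counts[symbol] += 1
    return counts


# Hypothetical tokenized abstracts linked to a small gene group.
pool = [["TP53", "binds", "MDM2"], ["BRCA1", "repair"], ["TP53", "BRCA1"]]
symbols = {"TP53", "MDM2", "BRCA1"}
pairs = cooccurrence_counts(pool, symbols)
linked = colinked_symbols(pool, symbols)
```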
				<p>From the HUGO database <abbrgrp><abbr bid="B35">35</abbr></abbrgrp> we derived a vocabulary consisting of all uniquely defined human gene symbols and their synonyms. In total, this vocabulary consists of 26,511 gene symbols. The second vocabulary consists of all uniquely defined yeast gene symbols found in SGD and contains 11,319 terms. As these official gene symbols are frequently requested and used by scientists, journals and databases, we assume they constitute a good first approximation to detect gene occurrence in MEDLINE abstracts. The domain vocabularies we adopted are listed in Table <tblr tid="T2">2</tblr>.</p>
				<tbl id="T2" hint_layout="single">
					<title>
						<p>Table 2</p>
					</title>
					<caption>
						<p>Overview of the domain vocabularies in TXTGate</p>
					</caption>
					<tblbdy cols="2">
						<r>
							<c ca="left">
								<p>Domain vocabulary</p>
							</c>
							<c ca="left">
								<p>Number of terms</p>
							</c>
						</r>
						<r>
							<c cspan="2">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Term-centric</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>GO</p>
							</c>
							<c ca="left">
								<p>17,965</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>GO-pruned (yeast)</p>
							</c>
							<c ca="left">
								<p>3,867</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>MeSH</p>
							</c>
							<c ca="left">
								<p>27,930</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>OMIM</p>
							</c>
							<c ca="left">
								<p>2,969</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>eVOC</p>
							</c>
							<c ca="left">
								<p>1,553</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Gene-centric</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>HUGO gene symbols (human)</p>
							</c>
							<c ca="left">
								<p>26,511</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>SGD gene symbols (yeast)</p>
							</c>
							<c ca="left">
								<p>11,319</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>The vocabularies are named after the resource they stem from.</p>
					</tblfn>
				</tbl>
			</sec>
			<sec>
				<st>
					<p>Online clustering</p>
				</st>
				<p>Online clustering uses our own Java implementation of Ward's method for hierarchical clustering <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>. Ward's method outperforms single, average or complete linkage. The dissimilarity measure used is the cosine distance between two vector representations <graphic file="gb-2004-5-6-r43-i1.gif"/> and <graphic file="gb-2004-5-6-r43-i4.gif"/>. The distance between a newly formed cluster (<it>r</it>, <it>s</it>) (obtained by linking two existing vectors or clusters) with (<it>n</it><sub><it>r </it></sub>+ <it>n</it><sub><it>s</it></sub>) elements and an existing cluster (<it>t</it>) with <it>n</it><sub><it>t </it></sub>elements is given by</p>
				<p><it>d</it>[(<it>t</it>), (<it>r</it>, <it>s</it>)] = <it>&#945;</it><sub><it>r</it></sub><it>d</it>[(<it>t</it>), (<it>r</it>)] + <it>&#945;</it><sub><it>s</it></sub><it>d</it>[(<it>t</it>), (<it>s</it>)] + <it>&#946; </it><it>d</it>[(<it>r</it>), (<it>s</it>)]</p>
				<p>with <graphic file="gb-2004-5-6-r43-i5.gif"/>.</p>
				<p>Given the preferred number of clusters <it>k</it>, the linkage tree is cut at the appropriate level to yield <it>k </it>clusters.</p>
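				<p>The procedure can be sketched in Python as follows. This is not the Java implementation used in TXTGate. The sketch assumes the standard Ward coefficients of the Lance-Williams recurrence, with alpha_r = (n_r + n_t)/(n_r + n_s + n_t), alpha_s = (n_s + n_t)/(n_r + n_s + n_t) and beta = -n_t/(n_r + n_s + n_t), applied to cosine distances, and it stops merging once <it>k </it>clusters remain, which is equivalent to cutting the tree at that level. All function names are hypothetical.</p>

```python
import math
from itertools import combinations
from typing import Dict, List

Profile = Dict[str, float]  # assumption: sparse {term: weight} vectors


def cosine_distance(u: Profile, v: Profile) -> float:
    """Return 1 - cosine similarity; empty vectors get distance 1.0."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def ward_clusters(profiles: List[Profile], k: int) -> List[List[int]]:
    """Agglomerate until k clusters remain; return lists of profile indices."""
    k = max(k, 1)
    clusters = {i: [i] for i in range(len(profiles))}
    dist = {
        frozenset(pair): cosine_distance(profiles[pair[0]], profiles[pair[1]])
        for pair in combinations(clusters, 2)
    }
    next_id = len(profiles)
    while len(clusters) > k:
        # Merge the closest pair of clusters (r, s).
        r, s = sorted(min(dist, key=dist.get))
        d_rs = dist.pop(frozenset((r, s)))
        n_r, n_s = len(clusters[r]), len(clusters[s])
        merged = clusters.pop(r) + clusters.pop(s)
        # Lance-Williams update with the standard Ward coefficients.
        for t, members in clusters.items():
            n_t = len(members)
            total = n_r + n_s + n_t
            d_tr = dist.pop(frozenset((t, r)))
            d_ts = dist.pop(frozenset((t, s)))
            dist[frozenset((t, next_id))] = (
                (n_r + n_t) * d_tr + (n_s + n_t) * d_ts - n_t * d_rs
            ) / total
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical gene profiles: two glycerol-like, two pyruvate-like.
genes = [
    {"glycerol": 1.0},
    {"glycerol": 0.9, "stress": 0.1},
    {"pyruv": 1.0},
    {"pyruv": 0.8, "ethanol": 0.2},
]
subclusters = ward_clusters(genes, 2)
```

				<p>Because the merge order does not depend on <it>k</it>, the same sequence of merges serves every choice of <it>k</it>; the sketch simply stops earlier or later.</p>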
			</sec>
			<sec>
				<st>
					<p>Cluster coherence</p>
				</st>
				<p>As a measure of textual coherence, <it>C</it><sub><it>G</it></sub>, we calculate the median distance in term space from the profile of the group <it>G </it>of size <it>n</it><sub><it>G </it></sub>to the individual profiles, <it>g</it><sub><it>i</it></sub>, of all genes in that group:</p>
				<p>
					<graphic file="gb-2004-5-6-r43-i6.gif"/>
				</p>
				<p>We assess its significance by computing a background distribution from random gene clusters of different sizes.</p>
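				<p>A hedged sketch of the score and of a randomization test is given below. It takes the group profile to be the mean of the gene profiles and uses the cosine distance, both of which are assumptions, and it estimates significance as an add-one empirical fraction of random groups that are at least as coherent as the observed group. How the <it>p</it>-values are actually derived from the background distribution is not stated here, so the empirical estimate is only a simple stand-in.</p>

```python
import math
import random
from statistics import median
from typing import Dict, List

Profile = Dict[str, float]  # assumption: sparse {term: weight} vectors


def mean_profile(profiles: List[Profile]) -> Profile:
    """Component-wise mean of sparse vectors (absent terms count as 0)."""
    totals: Profile = {}
    for profile in profiles:
        for term, weight in profile.items():
            totals[term] = totals.get(term, 0.0) + weight
    return {term: total / len(profiles) for term, total in totals.items()}


def cosine_distance(u: Profile, v: Profile) -> float:
    """Return 1 - cosine similarity; empty vectors get distance 1.0."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def coherence(group: List[Profile]) -> float:
    """Median distance from the group profile to each gene profile."""
    centre = mean_profile(group)
    return median([cosine_distance(centre, gene) for gene in group])


def empirical_p_value(
    group: List[Profile],
    background: List[Profile],
    n_random: int = 100,
    seed: int = 0,
) -> float:
    """Share of random groups of equal size that are at least as coherent.

    Assumption: add-one estimator, so the result is never exactly zero.
    """
    rng = random.Random(seed)
    observed = coherence(group)
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(background, len(group))
        if observed >= coherence(sample):
            hits += 1
    return (hits + 1) / (n_random + 1)


# Hypothetical data: a tight group against a background of unrelated genes.
tight_group = [{"glycerol": 1.0}, {"glycerol": 1.0}, {"glycerol": 1.0}]
background = [{"term" + str(i): 1.0} for i in range(20)]
p_value = empirical_p_value(tight_group, background)
```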
				<p>To demonstrate how Equation (1) scores groups of functionally related genes, we show its performance on 10 cell-cycle groups of Spellman <it>et al. </it><abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. These involve 126 genes in total, which are identified manually as well as by expression analysis. As can be seen in Table <tblr tid="T3">3</tblr>, all but the sporulation group display <it>p</it>-values below the one-sided 0.025 threshold (that is, a gene group <it>G </it>is considered coherent if <it>C</it><sub><it>G </it></sub>is smaller than expected by chance). A more detailed analysis, which falls outside the scope of this manuscript, can be found in <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>. This result corroborates the ability of Equation (1), and more importantly of the vector-space model that underlies TXTGate, to represent biologically relevant functional information. It provides a quantitative foundation that supports the underlying methodology of TXTGate.</p>
				<tbl id="T3" hint_layout="single">
					<title>
						<p>Table 3</p>
					</title>
					<caption>
						<p>Significance of coherence score <it>C</it><sub><it>G</it></sub></p>
					</caption>
					<tblbdy cols="3">
						<r>
							<c ca="left">
								<p>Gene groups</p>
							</c>
							<c ca="left">
								<p>Size</p>
							</c>
							<c ca="left">
								<p>Coherence score (<it>p</it>-value)</p>
							</c>
						</r>
						<r>
							<c cspan="3">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Cell-cycle control</p>
							</c>
							<c ca="left">
								<p>19</p>
							</c>
							<c ca="left">
								<p>1.01E-167</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>DNA repair</p>
							</c>
							<c ca="left">
								<p>3</p>
							</c>
							<c ca="left">
								<p>3.91E-61</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Fatty acids/lipids</p>
							</c>
							<c ca="left">
								<p>25</p>
							</c>
							<c ca="left">
								<p>4.28E-08</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Glycosylation</p>
							</c>
							<c ca="left">
								<p>7</p>
							</c>
							<c ca="left">
								<p>6.29E-06</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Methionine</p>
							</c>
							<c ca="left">
								<p>5</p>
							</c>
							<c ca="left">
								<p>9.88E-28</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Mitotic exit</p>
							</c>
							<c ca="left">
								<p>9</p>
							</c>
							<c ca="left">
								<p>1.50E-82</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Nutrition</p>
							</c>
							<c ca="left">
								<p>19</p>
							</c>
							<c ca="left">
								<p>1.76E-18</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Pseudohyphae</p>
							</c>
							<c ca="left">
								<p>10</p>
							</c>
							<c ca="left">
								<p>2.79E-05</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Secretion</p>
							</c>
							<c ca="left">
								<p>13</p>
							</c>
							<c ca="left">
								<p>1.11E-06</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Sporulation</p>
							</c>
							<c ca="left">
								<p>16</p>
							</c>
							<c ca="left">
								<p>1.11E-01</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>The significance is calculated with respect to 100-fold randomization for 10 cell-cycle-related functional groups selected from Figure 7 in Spellman <it>et al. </it><abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. All groups are functionally coherent according to our score, except for the sporulation group.</p>
					</tblfn>
				</tbl>
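				<p>The randomization test behind Table 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used by TXTGate: Equation (1) is replaced by a hypothetical stand-in score (the mean cosine distance of each gene's term vector to the group centroid), and the <it>p</it>-value is taken from a normal approximation to the scores of 100 random groups of the same size. The function names and the toy data are invented for the example.</p>

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def coherence(vectors):
    """Hypothetical stand-in for Equation (1): mean distance to the centroid."""
    n = len(vectors)
    centroid = [sum(col) / n for col in zip(*vectors)]
    return sum(cosine_distance(v, centroid) for v in vectors) / n


def randomization_p_value(group, background, n_random=100, seed=0):
    """One-sided p-value that the group's score is smaller than chance."""
    rng = random.Random(seed)
    observed = coherence(group)
    # Score n_random random gene groups of the same size as the tested group
    scores = [coherence(rng.sample(background, len(group)))
              for _ in range(n_random)]
    mean = sum(scores) / n_random
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n_random - 1))
    if sd == 0:
        raise ValueError("random scores have zero spread")
    z = (observed - mean) / sd
    # Assumption: normal approximation; lower tail Phi(z) = erfc(-z / sqrt(2)) / 2
    return 0.5 * math.erfc(-z / math.sqrt(2))


# Toy data: 200 random background profiles over 20 terms
data_rng = random.Random(1)
background = [[data_rng.random() for _ in range(20)] for _ in range(200)]
# A tight group: small perturbations of one shared profile
base = [data_rng.random() for _ in range(20)]
tight = [[x + 0.01 * data_rng.random() for x in base] for _ in range(10)]

p = randomization_p_value(tight, background)
print(f"p = {p:.3g}, coherent: {0.025 > p}")
```

				<p>A parametric tail estimate of this kind is one way to obtain <it>p</it>-values far smaller than 1/100 from only 100 randomizations; a purely empirical count of random groups scoring below the observed value could not resolve them.</p>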
			</sec>
		</sec>
		<sec>
			<st>
				<p>TXTGate summarizes and identifies subclusters</p>
			</st>
			<p>TXTGate allows online subclustering and profiling of gene groups via terms extracted from MEDLINE. Below we describe two examples.</p>
			<sec>
				<st>
					<p>Yeast data</p>
				</st>
				<p>We took the reference data set from Eisen <it>et al. </it><abbrgrp><abbr bid="B39">39</abbr></abbrgrp> and used TXTGate to conduct a textual analysis similar to that of Blaschke <it>et al. </it><abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. In Table <tblr tid="T4">4</tblr> we show the text profiles of cluster <it>E </it>from Eisen <it>et al. </it>obtained by subclustering with <it>k </it>= 2. Although several of the text-mining settings in Blaschke <it>et al. </it>differ from ours (because of differences in the MEDLINE corpus, the textual analysis methodology, and the clustering algorithm used), a comparison of the term profiles in both analyses shows that TXTGate also identifies <it>E1 </it>as being related to glycerol, whereas <it>E2 </it>is more related to pyruvate metabolism and ethanol fermentation (for more details, see Blaschke <it>et al. </it><abbrgrp><abbr bid="B16">16</abbr></abbrgrp>). Detailed text profiles for each of the clusters {<it>B</it>, <it>C</it>, <it>D</it>, <it>E</it>, <it>F</it>, <it>G</it>, <it>H</it>, <it>J</it>, and <it>K</it>} in Eisen <it>et al. </it>are given in Additional data file 1.</p>
				<tbl id="T4" hint_layout="double">
					<title>
						<p>Table 4</p>
					</title>
					<caption>
						<p>TXTGate profiling of cluster E from Eisen <it>et al. </it><abbrgrp><abbr bid="B39">39</abbr></abbrgrp></p>
					</caption>
					<tblbdy cols="4">
						<r>
							<c cspan="2" ca="left">
								<p>Gene symbol</p>
							</c>
							<c ca="left">
								<p>Cluster terms in Blaschke <it>et al. </it><abbrgrp><abbr bid="B16">16</abbr></abbrgrp></p>
							</c>
							<c ca="left">
								<p>Terms from TXTGate</p>
							</c>
						</r>
						<r>
							<c cspan="4">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Subcluster E1</p>
							</c>
							<c ca="left">
								<p>
									<it>TPT1 FBA1</it>
								</p>
							</c>
							<c ca="left">
								<p>glyceraldehyde-3-phosphate*</p>
							</c>
							<c ca="left">
								<p>glyceraldehyd_3_phosphat_dehydrogenas</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>GPM1 TKL1</it>
								</p>
							</c>
							<c ca="left">
								<p>glyceraldehyde-3-phosphate dehydrogenase*</p>
							</c>
							<c ca="left">
								<p>glycolyt</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>PGK1 CDC19</it>
								</p>
							</c>
							<c ca="left">
								<p>phosphoglycerate kinase*</p>
							</c>
							<c ca="left">
								<p>glucos</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p><it>TDH3 HXK2</it></p>
							</c>
							<c ca="left">
								<p>phosphoglycerate*</p>
							</c>
							<c ca="left">
								<p>enzym</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>TDH2 TYE7</it>
								</p>
							</c>
							<c ca="left">
								<p>mutase*</p>
							</c>
							<c ca="left">
								<p>glycolysi</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>ENO2 PFK1</it>
								</p>
							</c>
							<c ca="left">
								<p>dehydrogenase</p>
							</c>
							<c ca="left">
								<p>carbon</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>TDH1 ACS2</it>
								</p>
							</c>
							<c ca="left">
								<p>enolase</p>
							</c>
							<c ca="left">
								<p>pyruv_kinas</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>glycerol-3-phosphate dehydrogenase</p>
							</c>
							<c ca="left">
								<p>ethanol</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>osmotic stress</p>
							</c>
							<c ca="left">
								<p>phosphoglycer_kinas</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>phospoglycerate</p>
							</c>
							<c ca="left">
								<p>growth</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Subcluster E2</p>
							</c>
							<c ca="left">
								<p>
									<it>PDC5 PDC1</it>
								</p>
							</c>
							<c ca="left">
								<p>alcohol*</p>
							</c>
							<c ca="left">
								<p>pyruv_decarboxylas</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>PDC6</it>
								</p>
							</c>
							<c ca="left">
								<p>transketolase*</p>
							</c>
							<c ca="left">
								<p>pyruv</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>catabolite repression</p>
							</c>
							<c ca="left">
								<p>glucos</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>decarboxylase</p>
							</c>
							<c ca="left">
								<p>enzym</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>ethanol</p>
							</c>
							<c ca="left">
								<p>alcohol</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>glucose</p>
							</c>
							<c ca="left">
								<p>decarboxyl</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>glucose repression</p>
							</c>
							<c ca="left">
								<p>ethanol</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>hexokinases</p>
							</c>
							<c ca="left">
								<p>ferment</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>pyruvate</p>
							</c>
							<c ca="left">
								<p>thiamin</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>pyruvate decarboxylase</p>
							</c>
							<c ca="left">
								<p>decarboxylas</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>Profiling is by subclustering (<it>k </it>= 2). High-scoring terms are shown for each of the subclusters E1 and E2. We also show the terms (excluding gene names) resulting from a similar analysis conducted by Blaschke <it>et al. </it><abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. *Terms that were labeled specific to a subcluster by Blaschke <it>et al</it>. Although several of their settings differ from ours (because of differences in the MEDLINE corpus, the textual analysis and the clustering algorithm used), a comparison of the term profiles in both analyses shows that TXTGate also identifies E1 as related to glycerol, whereas E2 is more related to pyruvate metabolism and ethanol fermentation. Complete data can be found in Additional data file 1.</p>
					</tblfn>
				</tbl>
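				<p>The subclustering and profiling step can be illustrated with a small sketch. It assumes a spherical <it>k</it>-means (cosine similarity on unit-length term vectors) with <it>k </it>= 2 and fixed seeds; the clustering algorithm actually used by TXTGate is not specified in this section, and the term weights below are made up for illustration. Only the gene symbols and stemmed terms are taken from Table 4.</p>

```python
import math

# Made-up term weights for six genes of cluster E (illustration only)
profiles = {
    "PGK1": {"glycolysi": 0.9, "glucos": 0.4, "enzym": 0.2},
    "TDH3": {"glycolysi": 0.8, "glucos": 0.3, "enzym": 0.3},
    "ENO2": {"glycolysi": 0.7, "enzym": 0.4},
    "PDC1": {"pyruv_decarboxylas": 0.9, "ethanol": 0.5, "glucos": 0.2},
    "PDC5": {"pyruv_decarboxylas": 0.8, "ethanol": 0.4},
    "PDC6": {"pyruv_decarboxylas": 0.7, "ethanol": 0.3, "enzym": 0.1},
}
terms = sorted({t for p in profiles.values() for t in p})


def unit(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def kmeans(vectors, seeds, n_iter=20):
    """Spherical k-means: assign by cosine similarity, then re-average."""
    centroids = [vectors[i] for i in seeds]
    labels = []
    for _ in range(n_iter):
        labels = [
            max(range(len(centroids)),
                key=lambda c: sum(a * b for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = unit([sum(col) / len(members)
                                     for col in zip(*members)])
    return labels, centroids


genes = list(profiles)
vectors = [unit([profiles[g].get(t, 0.0) for t in terms]) for g in genes]
# Deterministic seeds (first and last gene) keep the example reproducible
labels, centroids = kmeans(vectors, seeds=[0, len(genes) - 1])

# The profile of a subcluster is its centroid; report the top-weighted terms
top_terms = []
for c, centroid in enumerate(centroids):
    members = [g for g, lab in zip(genes, labels) if lab == c]
    ranked = sorted(zip(terms, centroid), key=lambda tw: tw[1], reverse=True)
    top_terms.append([t for t, _ in ranked[:3]])
    print(f"subcluster {c + 1}:", members, top_terms[-1])
```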
			</sec>
			<sec>
				<st>
					<p>Human data</p>
				</st>
				<p>To assess the quality of the indexed MEDLINE abstracts used in LocusLink, we compared the output from TXTGate with the results presented in Chaussabel and Sher <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, where the authors describe, among other experiments, the profiling and clustering of nearly 200 genes involved in the 'common transcriptional program' induced in human macrophages upon bacterial infection. We retrieved the MEDLINE textual profiles of all genes in the clusters and compared TXTGate's best-scoring terms to the cluster terms in Chaussabel and Sher <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. The results for the first four (non-overlapping) clusters (clusters <it>a</it>, <it>b</it>, <it>c </it>and <it>d</it>) can be found in Table <tblr tid="T5">5</tblr>. The terms 'adipose', 'metastasis' and 'NM' did not show up in the profiles from TXTGate because they are not contained in the GO domain vocabulary. For cluster <it>e </it>no common terms were found. Running TXTGate using the OMIM vocabulary, however, we were able to uncover exactly those disease-associated terms that Chaussabel and Sher <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> retrieved by manually investigating genes from this cluster in the OMIM database. In Table <tblr tid="T6">6</tblr> we highlight these terms in bold. As the set of diseases related to these genes is heterogeneous, the relevant terms display a high variance rather than a high mean, which is one reason for also including a variance profile. Moreover, the fact that we retrieve those disease terms only by means of the OMIM vocabulary indicates that the use of a variety of vocabularies in TXTGate leads to improved insights, a point discussed further in the next section. We note that all other cluster terms have a comparable equivalent in the TXTGate profiles; the complete analysis is given in Additional data file 2.</p>
				<tbl id="T5" hint_layout="double">
					<title>
						<p>Table 5</p>
					</title>
					<caption>
						<p>TXTGate profiling of clusters a, b, c, and d from Chaussabel and Sher <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> (GO vocabulary)</p>
					</caption>
					<tblbdy cols="5">
						<r>
							<c cspan="2" ca="left">
								<p>Gene symbol</p>
							</c>
							<c cspan="2" ca="left">
								<p>Cluster terms in <abbrgrp><abbr bid="B6">6</abbr></abbrgrp></p>
							</c>
							<c ca="left">
								<p>Terms from TXTGate</p>
							</c>
						</r>
						<r>
							<c cspan="5">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Cluster a</p>
							</c>
							<c ca="left">
								<p>
									<it>LPL</it>
								</p>
							</c>
							<c ca="left">
								<p>Lipoprotein</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>lipoprotein</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>CD36L1</it>
								</p>
							</c>
							<c ca="left">
								<p>Density</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>lipas</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>LDLR</it>
								</p>
							</c>
							<c ca="left">
								<p>Cholesterol</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>ldl</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>Lipid</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>ldl_receptor</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>Adipose</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>cholesterol</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>hdl</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>scaveng_receptor</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>high_densiti_lipoprotein</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>low_densiti_lipoprotein_receptor</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>low_densiti_lipoprotein</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Cluster b</p>
							</c>
							<c ca="left">
								<p>
									<it>UPA</it>
								</p>
							</c>
							<c ca="left">
								<p>Invasive</p>
							</c>
							<c ca="left">
								<p>Collagenase</p>
							</c>
							<c ca="left">
								<p>
									<b>metalloproteinas</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>PLAUR</it>
								</p>
							</c>
							<c ca="left">
								<p>Invasion</p>
							</c>
							<c ca="left">
								<p>Collagen</p>
							</c>
							<c ca="left">
								<p>
									<b>matrix</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>SERPIN</it>
								</p>
							</c>
							<c ca="left">
								<p>Metastasis</p>
							</c>
							<c ca="left">
								<p>Matrix</p>
							</c>
							<c ca="left">
								<p>metalloendopeptidas</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>MMP1</it>
								</p>
							</c>
							<c ca="left">
								<p>UPAR</p>
							</c>
							<c ca="left">
								<p>MMP</p>
							</c>
							<c ca="left">
								<p>
									<b>collagenas</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>MMP10</it>
								</p>
							</c>
							<c ca="left">
								<p>UPA</p>
							</c>
							<c ca="left">
								<p>Metalloproteinase</p>
							</c>
							<c ca="left">
								<p>extracellular_matrix</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>MMP14</it>
								</p>
							</c>
							<c ca="left">
								<p>Plasminogen</p>
							</c>
							<c ca="left">
								<p>Molecule-1</p>
							</c>
							<c ca="left">
								<p>alpha</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>SPARC</it>
								</p>
							</c>
							<c ca="left">
								<p>Urokinase-type</p>
							</c>
							<c ca="left">
								<p>Adhesion</p>
							</c>
							<c ca="left">
								<p>
									<b>upar</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>Urokinase</p>
							</c>
							<c ca="left">
								<p>Vascular</p>
							</c>
							<c ca="left">
								<p>
									<b>plasminogen_activ</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>Plasmin</p>
							</c>
							<c ca="left">
								<p>Endothelial</p>
							</c>
							<c ca="left">
								<p>interstiti</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>Activator</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>invasion</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Cluster c</p>
							</c>
							<c ca="left">
								<p>
									<it>AMPD3</it>
								</p>
							</c>
							<c ca="left">
								<p>Adenosine</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>purinerg</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>ADA</it>
								</p>
							</c>
							<c ca="left">
								<p>A2A</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>adenosin</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>ADORA2A</it>
								</p>
							</c>
							<c ca="left">
								<p>A1</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>deaminas</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>ADORA3</it>
								</p>
							</c>
							<c ca="left">
								<p>Antagonist</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>p2</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>P2RX</it>
								</p>
							</c>
							<c ca="left">
								<p>Agonist</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>p2x</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>P2RX1</it>
								</p>
							</c>
							<c ca="left">
								<p>NM</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>p1</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>P2RX7</it>
								</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>agonist</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>receptor</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>adenosin_receptor</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>ada</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Cluster d</p>
							</c>
							<c ca="left">
								<p>
									<it>IP10</it>
								</p>
							</c>
							<c ca="left">
								<p>Interferon</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>tumor_necrosi_factor</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>MIP1A</it>
								</p>
							</c>
							<c ca="left">
								<p>IFN-alpha</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>cytokin</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>MIP1B</it>
								</p>
							</c>
							<c ca="left">
								<p>IFN</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>induc</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>IL8</it>
								</p>
							</c>
							<c ca="left">
								<p>Interferon-gamma</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>interferon</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>STAT4</it>
								</p>
							</c>
							<c ca="left">
								<p>IFN-gamma</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>inflammatori</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>IL12B</it>
								</p>
							</c>
							<c ca="left">
								<p>Inducible</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>antigen</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>TNFRSF9</it>
								</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>lymphocyt_activ</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>TNFSF9</it>
								</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>stimul</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>SLAM</it>
								</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>chemokin</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>TNFRSF5</it>
								</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>monocyt</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<it>CD83</it>
								</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>Corresponding terms in Chaussabel and Sher <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> and TXTGate are in bold. TXTGate's profiles are comparably informative. Complete data can be found in Additional data file 2.</p>
					</tblfn>
				</tbl>
				<tbl id="T6" hint_layout="single">
					<title>
						<p>Table 6</p>
					</title>
					<caption>
						<p>Comparison of the terms in cluster e found by Chaussabel and Sher <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> with those found by TXTGate (OMIM vocabulary)</p>
					</caption>
					<tblbdy cols="3">
						<r>
							<c ca="left">
								<p>Gene symbol</p>
							</c>
							<c ca="left">
								<p>Cluster terms in Chaussabel and Sher <abbrgrp><abbr bid="B6">6</abbr></abbrgrp></p>
							</c>
							<c ca="left">
								<p>Terms from TXTGate</p>
							</c>
						</r>
						<r>
							<c cspan="3">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Cluster e</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>CKB</it>
								</p>
							</c>
							<c ca="left">
								<p>Population</p>
							</c>
							<c ca="left">
								<p>deaminas</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>AMPD3</it>
								</p>
							</c>
							<c ca="left">
								<p>Frequency</p>
							</c>
							<c ca="left">
								<p>
									<b>lipoprotein_lipas</b>
								</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>ADA</it>
								</p>
							</c>
							<c ca="left">
								<p>Allele</p>
							</c>
							<c ca="left">
								<p>creatin</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>ADORA2A</it>
								</p>
							</c>
							<c ca="left">
								<p>Unrelated</p>
							</c>
							<c ca="left">
								<p>lipoprotein</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>ADORA3</it>
								</p>
							</c>
							<c ca="left">
								<p>Families</p>
							</c>
							<c ca="left">
								<p>
									<b>krabb</b>
								</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>P2RX</it>
								</p>
							</c>
							<c ca="left">
								<p>Recessive</p>
							</c>
							<c ca="left">
								<p>
									<b>epidermolysi_bullosa</b>
								</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>P2RX1</it>
								</p>
							</c>
							<c ca="left">
								<p>Autosomal</p>
							</c>
							<c ca="left">
								<p>
									<b>alagil</b>
								</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>P2RX7</it>
								</p>
							</c>
							<c ca="left">
								<p>Disorder</p>
							</c>
							<c ca="left">
								<p>bear</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>GEM</it>
								</p>
							</c>
							<c ca="left">
								<p>Severe</p>
							</c>
							<c ca="left">
								<p>leukodystrophi</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>ARHH</it>
								</p>
							</c>
							<c ca="left">
								<p>Patient</p>
							</c>
							<c ca="left">
								<p>receptor</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>LPL</it>
								</p>
							</c>
							<c ca="left">
								<p>Deficiency</p>
							</c>
							<c ca="left">
								<p>down</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>CD36L1</it>
								</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>corneal_dystrophi</b>
								</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>LDLR</it>
								</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>deaf</b>
								</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>BF</it>
								</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>hdl</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>GALC</it>
								</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>nucleosid</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>LAMB3</it>
								</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>retinoblastoma</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>GJB2</it>
								</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>junction</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>TGFBI</it>
								</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>adhesion</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>JAG1</it>
								</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>congenit_heart_defect</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>
									<it>DSCR1</it>
								</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>hear_loss</b>
								</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>The diversity of the diseases the member genes are related to makes the relevant terms display high variance, rather than high mean. The terms that were also found by Chaussabel and Sher <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> after manual investigation are marked in bold. Complete data can be found in Additional data file 2.</p>
					</tblfn>
				</tbl>
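				<p>The difference between a mean profile and a variance profile can be made concrete with a toy calculation. The weights below are invented for illustration: a term shared by all genes ('receptor') ranks first by mean, whereas terms carried by a single gene ('krabb', 'deaf') rank first by variance. The sketch assumes that each profile is computed per term across the genes of the group.</p>

```python
from statistics import mean, pvariance

# Made-up weights: 'receptor' is shared, disease terms are gene-specific
weights = {
    "GALC": {"receptor": 0.3, "krabb": 0.9, "deaf": 0.0},
    "GJB2": {"receptor": 0.3, "krabb": 0.0, "deaf": 0.9},
    "LDLR": {"receptor": 0.4, "krabb": 0.0, "deaf": 0.0},
    "ADORA3": {"receptor": 0.4, "krabb": 0.0, "deaf": 0.0},
}
terms = ["receptor", "krabb", "deaf"]

# Per-term mean and (population) variance across the genes of the group
mean_profile = {t: mean([w[t] for w in weights.values()]) for t in terms}
var_profile = {t: pvariance([w[t] for w in weights.values()]) for t in terms}


def ranked(profile):
    """Return the terms of a profile sorted by decreasing score."""
    return sorted(profile, key=profile.get, reverse=True)


print("by mean:    ", ranked(mean_profile))
print("by variance:", ranked(var_profile))
```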
			</sec>
		</sec>
		<sec>
			<st>
				<p>Textual information through the eyes of different vocabularies</p>
			</st>
			<p>Another major feature of TXTGate is its ability to present textual information (most importantly MEDLINE abstracts) from different perspectives. This is implemented by offering indices built on GO-, OMIM-, MeSH-, eVOC-, and gene nomenclature-based domain vocabularies. Each configuration is meant to expose a different view of the literature. TXTGate mirrors the dual approach adopted by the external databases it links to, which separate keyword and gene-symbol queries. This, in part, motivated our strategy to construct both term- and gene-centric vocabularies.</p>
			<p>To compare our term-based vocabularies we profiled a group of genes involved in colon and colorectal cancer extracted from the OMIM Morbid Map database (see Additional data file 3). Table <tblr tid="T7">7</tblr> shows the top 10 terms for each of the retrieved profiles. As can be seen, there is little difference between the MeSH and OMIM profiles, whose terms are mainly medical- and disease-related ('colorect_cancer', 'colon_cancer', 'colorect_neoplasm', 'hereditari'), whereas the GO profile focuses more on the metabolic functions of genes ('mismatch_repair', 'dna_repair', 'tumor_suppressor', 'kinas') and the eVOC profile contains terms more related to cell type and development ('growth', 'cell', 'carcinoma', 'metabol', 'fibroblast'). TXTGate's link-out feature allows a more in-depth analysis of the retrieved terms. Top-ranking terms can be sent to PubMed to retrieve relevant publications. Because all MEDLINE entries are tagged with MeSH keywords, using terms from the MeSH vocabulary ensures a successful query. When using the GO-derived vocabulary, terms can be mapped back directly to the GO tree with AmiGO <abbrgrp><abbr bid="B40">40</abbr></abbrgrp> to investigate the term's neighborhood. Other databases available for querying include LocusLink and OMIM.</p>
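			<p>How a domain vocabulary shapes the resulting profile can be sketched as follows. The example assumes that the text has already been stemmed and that phrases are matched as contiguous tokens joined by an underscore, as the term spellings in this article suggest; the two tiny vocabularies and the token stream are hypothetical, and the function name is invented for the example.</p>

```python
from collections import Counter

# Already-stemmed tokens of a hypothetical abstract
tokens = ("hereditari colorect cancer link to dna mismatch repair "
          "gene mutat").split()

# Tiny hypothetical vocabularies; real ones hold thousands of terms
vocabularies = {
    "GO": {"mismatch_repair", "dna_repair", "mismatch"},
    "OMIM": {"colorect_cancer", "colorect", "hereditari"},
}


def index(tokens, vocabulary, max_len=2):
    """Count every vocabulary term (single token or phrase) found in tokens."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            term = "_".join(tokens[i:i + n])
            if term in vocabulary:
                counts[term] += 1
    return counts


# The same text yields a different view under each vocabulary
for name, vocabulary in vocabularies.items():
    print(name, dict(index(tokens, vocabulary)))
```

			<p>In this toy case 'dna_repair' is in the GO vocabulary but is not counted, because 'dna' and 'repair' are not adjacent in the token stream; terms outside a vocabulary are simply invisible to the corresponding index.</p>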
			<tbl id="T7" hint_layout="single">
				<title>
					<p>Table 7</p>
				</title>
				<caption>
					<p>Various perspectives on textual information in TXTGate</p>
				</caption>
				<tblbdy cols="4">
					<r>
						<c ca="left">
							<p>GO</p>
						</c>
						<c ca="left">
							<p>OMIM</p>
						</c>
						<c ca="left">
							<p>MeSH</p>
						</c>
						<c ca="left">
							<p>eVOC</p>
						</c>
					</r>
					<r>
						<c cspan="4">
							<hr/>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>mismatch_repair</p>
						</c>
						<c ca="left">
							<p>colorect</p>
						</c>
						<c ca="left">
							<p>colorect_neoplasm</p>
						</c>
						<c ca="left">
							<p>colorect</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>tumor</p>
						</c>
						<c ca="left">
							<p>colorect_cancer</p>
						</c>
						<c ca="left">
							<p>mismatch</p>
						</c>
						<c ca="left">
							<p>tumour</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>dna_repair</p>
						</c>
						<c ca="left">
							<p>tumor</p>
						</c>
						<c ca="left">
							<p>cancer</p>
						</c>
						<c ca="left">
							<p>malign_tumour</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>mismatch</p>
						</c>
						<c ca="left">
							<p>kinas</p>
						</c>
						<c ca="left">
							<p>colorect</p>
						</c>
						<c ca="left">
							<p>colon</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>pair</p>
						</c>
						<c ca="left">
							<p>colon</p>
						</c>
						<c ca="left">
							<p>mutat</p>
						</c>
						<c ca="left">
							<p>growth</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>tumor_suppressor</p>
						</c>
						<c ca="left">
							<p>hereditari</p>
						</c>
						<c ca="left">
							<p>repair</p>
						</c>
						<c ca="left">
							<p>cell</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>apc</p>
						</c>
						<c ca="left">
							<p>cancer</p>
						</c>
						<c ca="left">
							<p>dna_repair</p>
						</c>
						<c ca="left">
							<p>carcinoma</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>kinas</p>
						</c>
						<c ca="left">
							<p>colon_cancer</p>
						</c>
						<c ca="left">
							<p>colon</p>
						</c>
						<c ca="left">
							<p>metabol</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>somat</p>
						</c>
						<c ca="left">
							<p>associ</p>
						</c>
						<c ca="left">
							<p>neoplasm_protein</p>
						</c>
						<c ca="left">
							<p>fibroblast</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>ra</p>
						</c>
						<c ca="left">
							<p>on</p>
						</c>
						<c ca="left">
							<p>tumor</p>
						</c>
						<c ca="left">
							<p>chain</p>
						</c>
					</r>
				</tblbdy>
				<tblfn>
					<p>Here we show how term-centric vocabularies based on GO, OMIM, MeSH and eVOC profile a group of genes involved in colon and colorectal cancer.</p>
				</tblfn>
			</tbl>
			<p>We used the same colon cancer case to test the ability of our human gene symbol vocabulary to screen for co-linkage of genes. We constructed two different index tables, one with and one without alternative gene symbols; the former was constructed by mapping all synonymous symbols to the primary gene symbol. The first table has the disadvantage that it cannot disambiguate alternative gene symbols that are mapped to different primary gene symbols; the second does not take synonyms into account, as only true occurrences of a symbol were counted. As a consequence, frequently used symbols are ranked highly even though they are not the official gene symbols. Examples of this are p21 and dra, whose primary symbols are CDKN1A and SLC26A3, respectively. The top-25 gene symbols using the first index table are given in Table <tblr tid="T8">8</tblr>. Most of the retrieved gene names are also in the query list. We used TXTGate's link-out feature to investigate the role of the genes that were not in the input list by sending them as a query to LocusLink and GeneCards. In this way we were able to determine their function and their relation to colon and colorectal cancer, as can be seen in Table <tblr tid="T8">8</tblr>.</p>
			<tbl id="T8" hint_layout="double">
				<title>
					<p>Table 8</p>
				</title>
				<caption>
					<p>Co-linkage analysis of genes with gene-centric vocabularies</p>
				</caption>
				<tblbdy cols="2">
					<r>
						<c ca="left">
							<p>Gene name</p>
						</c>
						<c ca="left">
							<p>Description</p>
						</c>
					</r>
					<r>
						<c cspan="2">
							<hr/>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>hnpcc</p>
						</c>
						<c ca="left">
							<p>Hereditary nonpolyposis colon cancer</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>apc</p>
						</c>
						<c ca="left">
							<p>Adenomatous polyposis coli protein</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>p53</p>
						</c>
						<c ca="left">
							<p>Cellular tumor antigen P53 (tumor suppressor P53)</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>mlh1</p>
						</c>
						<c ca="left">
							<p>DNA mismatch repair protein MLH1 (mutL protein homolog 1)</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<b>muts</b>
							</p>
						</c>
						<c ca="left">
							<p>E. coli mismatch repair gene mutS</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<b>p21</b>
							</p>
						</c>
						<c ca="left">
							<p>Cyclin-dependent kinase inhibitor 1A</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>msh2</p>
						</c>
						<c ca="left">
							<p>DNA mismatch repair protein MSH2 (mutS protein homolog 2)</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>bax</p>
						</c>
						<c ca="left">
							<p>BAX protein, cytoplasmic isoform delta</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<b>wnt</b>
							</p>
						</c>
						<c ca="left">
							<p>Wingless-type MMTV integration site family members</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>pms2</p>
						</c>
						<c ca="left">
							<p>DNA mismatch repair protein PMS2</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>src</p>
						</c>
						<c ca="left">
							<p>Proto-oncogene tyrosine protein kinase SRC</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>dcc</p>
						</c>
						<c ca="left">
							<p>Tumor suppressor protein DCC precursor (colorectal cancer suppressor)</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>mcc</p>
						</c>
						<c ca="left">
							<p>Colorectal mutant cancer protein MCC</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>braf</p>
						</c>
						<c ca="left">
							<p>Proto-oncogene serine/threonine protein kinase B-RAF</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>fgfr3</p>
						</c>
						<c ca="left">
							<p>Fibroblast growth factor receptor 3 precursor</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>hcc</p>
						</c>
						<c ca="left">
							<p>Hepatocellular carcinoma</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>dra</p>
						</c>
						<c ca="left">
							<p>Chloride anion exchanger DRA</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>axin2</p>
						</c>
						<c ca="left">
							<p>AXIS inhibition protein 2</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>pms1</p>
						</c>
						<c ca="left">
							<p>DNA mismatch repair protein PMS1</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<b>abl</b>
							</p>
						</c>
						<c ca="left">
							<p>Abelson murine leukemia viral oncogene homolog 1</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>bub1</p>
						</c>
						<c ca="left">
							<p>Mitotic checkpoint serine/threonine protein kinase BUB1</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<b>ptp</b>
							</p>
						</c>
						<c ca="left">
							<p>Protein tyrosine phosphatase family</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>bcl10</p>
						</c>
						<c ca="left">
							<p>B cell lymphoma/leukemia 10</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<b>ptp_pest</b>
							</p>
						</c>
						<c ca="left">
							<p>Protein tyrosine phosphatase family with C-terminal PEST-motif</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<b>prlts</b>
							</p>
						</c>
						<c ca="left">
							<p>PDGF-receptor beta-like tumor suppressor</p>
						</c>
					</r>
				</tblbdy>
				<tblfn>
					<p>This table shows the top-25 co-linked gene symbols in the pool of abstracts of the colon and colorectal cancer case. Genes that were not in the query list are indicated in bold.</p>
				</tblfn>
			</tbl>
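			<p>The difference between the two index tables can be illustrated with a minimal sketch. It is not the TXTGate implementation: the function name count_symbols, the alias table and the whitespace tokenization are assumptions made for illustration, and only the two alias pairs named in the text (p21 and dra) are taken from the article.</p>

```python
from collections import Counter

# Hypothetical alias table (assumption): alias -> list of primary symbols.
# An alias listed under several primary symbols would be credited to all of
# them, which is the disambiguation problem described in the text.
ALIASES = {
    "p21": ["cdkn1a"],
    "dra": ["slc26a3"],
}


def count_symbols(abstracts, vocabulary, aliases=None):
    """Count the number of abstracts in which each gene symbol occurs.

    With an alias table, alternative symbols are mapped to their primary
    symbol(s); without one, only literal occurrences are counted.
    """
    counts = Counter()
    for text in abstracts:
        seen = set()
        # Naive whitespace tokenization, for illustration only.
        for token in set(text.lower().split()):
            if aliases is not None and token in aliases:
                seen.update(aliases[token])
            elif token in vocabulary:
                seen.add(token)
        counts.update(seen)
    return counts
```

			<p>Calling count_symbols with the alias table corresponds to the index with alternative gene symbols, and calling it without corresponds to the index in which only true occurrences of a symbol are counted.</p>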
		</sec>
		<sec>
			<st>
				<p>Application of TXTGate to a real-life research problem</p>
			</st>
			<p>In the framework of an ongoing collaboration with a medical research group, our system was deployed to tackle a current research issue <abbrgrp><abbr bid="B41">41</abbr><abbr bid="B42">42</abbr></abbrgrp>. We analyzed 350 genes that were upregulated in a mouse model for human benign tumors of the salivary glands and evaluated the results in a biological context. We had a medical researcher write a summary of pathological and genetic observations, reflecting relevant literature and expert knowledge. From this we derived a list of important terms. This list was cross-referenced with textual profiles retrieved from TXTGate using different domain vocabularies (see Additional data file 4). As pathology and developmental issues were the focus of the summary in this case, the eVOC domain vocabulary proved most appropriate, as can be seen from the occurrence of terms such as 'fibroblast', 'embryo', 'tumor', 'teratoma' and so on (see Table <tblr tid="T9">9</tblr>). We can conclude that the choice of domain vocabulary depends on the experimental context and focus of the investigation. This supports our strategic choice of offering different domain vocabularies.</p>
			<tbl id="T9" hint_layout="single">
				<title>
					<p>Table 9</p>
				</title>
				<caption>
					<p>Textual profile of a gene group from a mouse model for human benign tumors of the salivary glands</p>
				</caption>
				<tblbdy cols="2">
					<r>
						<c ca="left">
							<p>Terms sorted by mean</p>
						</c>
						<c ca="left">
							<p>Terms sorted by variance</p>
						</c>
					</r>
					<r>
						<c cspan="2">
							<hr/>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>organ</p>
						</c>
						<c ca="left">
							<p>organ</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>intern</p>
						</c>
						<c ca="left">
							<p>intern</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>normal</p>
						</c>
						<c ca="left">
							<p>growth</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>red</p>
						</c>
						<c ca="left">
							<p>development</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>male</p>
						</c>
						<c ca="left">
							<p>fibroblast</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>femal</p>
						</c>
						<c ca="left">
							<p>tumour</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>visual</p>
						</c>
						<c ca="left">
							<p>red</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>capillari</p>
						</c>
						<c ca="left">
							<p>nucleu</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>system</p>
						</c>
						<c ca="left">
							<p>normal</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>optic</p>
						</c>
						<c ca="left">
							<p>embryo</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>retina</p>
						</c>
						<c ca="left">
							<p>tera</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>viral</p>
						</c>
						<c ca="left">
							<p>depend</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>bacteri</p>
						</c>
						<c ca="left">
							<p>stem_cell</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>adult</p>
						</c>
						<c ca="left">
							<p>kidnei</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>chain</p>
						</c>
						<c ca="left">
							<p>epithelium</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>cell</p>
						</c>
						<c ca="left">
							<p>visual</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>growth</p>
						</c>
						<c ca="left">
							<p>multipl</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>tissu</p>
						</c>
						<c ca="left">
							<p>skin</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>development</p>
						</c>
						<c ca="left">
							<p>muscl_cell</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>metabol</p>
						</c>
						<c ca="left">
							<p>system</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>embryo</p>
						</c>
						<c ca="left">
							<p>capillari</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>fibroblast</p>
						</c>
						<c ca="left">
							<p>mammari</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>tumour</p>
						</c>
						<c ca="left">
							<p>type_ii</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>depend</p>
						</c>
						<c ca="left">
							<p>bacteri</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>genet</p>
						</c>
						<c ca="left">
							<p>male</p>
						</c>
					</r>
				</tblbdy>
				<tblfn>
					<p>This table shows the 25 top-ranking terms (for both mean and variance) of the textual profile of a group of 350 genes that were upregulated in a mouse model for human benign tumors of the salivary glands processed with the eVOC domain vocabulary.</p>
				</tblfn>
			</tbl>
			<p>As a measure of textual coherence <it>C</it><sub><it>G</it></sub>, we calculated the median distance in vocabulary space from the profile of the group <it>G</it> to the individual profiles <it>g</it><sub><it>i</it></sub> of all genes in that group:</p>
			<p>
				<graphic file="gb-2004-5-6-r43-i7.gif"/>
			</p>
			<p>As background we generated 5,000 random gene clusters, both of the same size as the query cluster and of random sizes (see Figure <figr fid="F2">2</figr>), and calculated their coherence as in Equation (2). From these we derived two background distributions modeling the information content of random clusters. This allows the calculation of a <it>p</it>-value for a cluster of genes, expressing the probability that the observed textual coherence occurs by chance. The cluster profile of the 350 upregulated mouse genes was significant against both the background for random cluster size (<it>p</it>-value 1.8 &#215; 10<sup>-3</sup>) and the background for cluster size 350 (<it>p</it>-value &lt; 10<sup>-8</sup>).</p>
			<fig id="F2">
				<title>
					<p>Figure 2</p>
				</title>
				<caption>
					<p>Background distributions for cluster incoherence</p>
				</caption>
				<text>
					<p>Background distributions for cluster incoherence. Cluster incoherence is defined as the median distance in vector space between the mean cluster profile and all individual gene profiles. Probability density functions (pdf) are shown for random clusters of size 350 (blue curve) and random clusters of random size (blue bars). For randomly sized clusters, the cumulative distribution function (cdf) is also shown (red curve).</p>
				</text>
				<graphic file="gb-2004-5-6-r43-2"/>
			</fig>
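			<p>The coherence measure and its background comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: cosine distance is assumed as the distance in vocabulary space, the function names are hypothetical, and the p-value is a plain empirical fraction with an add-one correction. The reported p-values below 1/5,000 presumably rely on a fitted background density, which this sketch does not attempt.</p>

```python
import math
import random
import statistics


def cosine_distance(u, v):
    """Cosine distance between two term-weight vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def mean_profile(profiles):
    """Component-wise mean of the gene profiles in a cluster."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def incoherence(profiles):
    """Median distance from the mean cluster profile to each gene profile."""
    centre = mean_profile(profiles)
    return statistics.median(cosine_distance(centre, p) for p in profiles)


def empirical_p_value(cluster, all_profiles, n_random=5000, seed=0):
    """Fraction of random same-size clusters at least as coherent as `cluster`."""
    rng = random.Random(seed)
    observed = incoherence(cluster)
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(all_profiles, len(cluster))
        if observed >= incoherence(sample):
            hits += 1
    # Add-one correction (a common convention, not taken from the article).
    return (hits + 1) / (n_random + 1)
```

			<p>Here a lower incoherence means a more coherent cluster, so the p-value counts random clusters whose incoherence does not exceed the observed value. For the background of random cluster sizes, the sample size would be drawn at random instead of being fixed to the size of the query cluster.</p>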
		</sec>
		<sec>
			<st>
				<p>Discussion</p>
			</st>
			<p>We have described a framework for advanced textual profiling of groups of genes. TXTGate is implemented as a web application designed to efficiently process queries of up to 200 genes, although this is not a strict limit. We believe that the application scales well enough to be of use in, for example, microarray cluster validation.</p>
			<p>Supported by the work of Stephens <it>et al. </it><abbrgrp><abbr bid="B43">43</abbr></abbrgrp> and more recently that of Chiang and Yu <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>, we aimed to overcome the limitations of a single, more general, text index by offering different views. Nevertheless, some vocabularies could still be optimized to improve the information content of the profiles. For example, some general or non-informative terms (for example, 'ii', 'protein', 'alpha') still score highly because of our stemming and phrase-detection methods.</p>
			<p>Finally, although the citations in LocusLink and SGD constitute good sources for retrieving relevant gene-related MEDLINE abstracts, weighting the information according to the context and eliminating poorly informative or contaminating annotations (such as sequence-related articles) still need to be addressed in future versions of the software. Document-classification strategies as in Leonard <it>et al. </it><abbrgrp><abbr bid="B9">9</abbr></abbrgrp> or Raychaudhuri <it>et al. </it><abbrgrp><abbr bid="B10">10</abbr></abbrgrp> could be adopted to this end.</p>
			<p>As with GO annotations, transfer of literature references according to homology can be used to characterize poorly annotated genes <abbrgrp><abbr bid="B44">44</abbr><abbr bid="B45">45</abbr></abbrgrp>. At this stage, the application allows for the study of homologs within all organisms contained in LocusLink, provided the user inputs the corresponding LocusLink identifiers. This type of operation will be increasingly supported with future additions of literature indices from other organisms and databases.</p>
			<p>In conclusion, TXTGate's approach to summarizing database annotations and literature via specific vocabularies, along with its options to perform further analysis via clustering or query building, makes it a flexible gateway to explore text-based information comprehensively.</p>
		</sec>
		<sec>
			<st>
				<p>Additional data files</p>
			</st>
			<p>The following additional data are available with the online version of this article: the MEDLINE-based text profiles of yeast expression clusters from Eisen <it>et al. </it><abbrgrp><abbr bid="B39">39</abbr></abbrgrp> (Additional data file <supplr sid="s1">1</supplr>); the MEDLINE-based profiles for the data in Chaussabel and Sher <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> (Additional data file <supplr sid="s2">2</supplr>); details on the colon and colorectal cancer test case (Additional data file <supplr sid="s3">3</supplr>); and the expert summary and textual profiles of the 350 upregulated mouse genes for different domain vocabularies (Additional data file <supplr sid="s4">4</supplr>).</p>
			<suppl id="s1">
				<title>
					<p>Additional data file 1</p>
				</title>
				<caption>
					<p>The MEDLINE-based text profiles of yeast expression clusters from Eisen <it>et al. </it><abbrgrp><abbr bid="B39">39</abbr></abbrgrp></p>
				</caption>
				<text>
					<p>The MEDLINE-based text profiles of yeast expression clusters from Eisen <it>et al. </it><abbrgrp><abbr bid="B39">39</abbr></abbrgrp></p>
				</text>
				<file name="gb-2004-5-6-r43-s1.pdf">
					<p>Click here for additional data file</p>
				</file>
			</suppl>
			<suppl id="s2">
				<title>
					<p>Additional data file 2</p>
				</title>
				<caption>
					<p>The MEDLINE-based profiles for the data in Chaussabel and Sher <abbrgrp><abbr bid="B6">6</abbr></abbrgrp></p>
				</caption>
				<text>
					<p>The MEDLINE-based profiles for the data in Chaussabel and Sher <abbrgrp><abbr bid="B6">6</abbr></abbrgrp></p>
				</text>
				<file name="gb-2004-5-6-r43-s2.pdf">
					<p>Click here for additional data file</p>
				</file>
			</suppl>
			<suppl id="s3">
				<title>
					<p>Additional data file 3</p>
				</title>
				<caption>
					<p>Details on the colon and colorectal cancer test case</p>
				</caption>
				<text>
					<p>Details on the colon and colorectal cancer test case</p>
				</text>
				<file name="gb-2004-5-6-r43-s3.pdf">
					<p>Click here for additional data file</p>
				</file>
			</suppl>
			<suppl id="s4">
				<title>
					<p>Additional data file 4</p>
				</title>
				<caption>
					<p>The expert summary and textual profiles of the 350 upregulated mouse genes for different domain vocabularies</p>
				</caption>
				<text>
					<p>The expert summary and textual profiles of the 350 upregulated mouse genes for different domain vocabularies</p>
				</text>
				<file name="gb-2004-5-6-r43-s4.pdf">
					<p>Click here for additional data file</p>
				</file>
			</suppl>
		</sec>
	</bdy>
	<bm>
		<ack>
			<sec>
				<st>
					<p>Acknowledgements</p>
				</st>
				<p>This research was supported by grants from the Research Council K.U. Leuven (GOA-Mefisto-666, GOA-Ambiorics, IDO), the Fonds voor Wetenschappelijk Onderzoek - Vlaanderen (G.0115.01, G.0240.99, G.0407.02, G.0413.03, G.0388.03, G.0229.03, G.0241.04), the Instituut voor de aanmoediging van Innovatie door Wetenschap en Technologie Vlaanderen (STWW-Genprom, GBOU-McKnow, GBOU-SQUAD, GBOU-ANA), the Belgian Federal Science Policy Office (IUAP V-22), and the European Union (FP5 CAGE, ERNSI, FP6 NoE Biopattern, NoE E-tumours). We acknowledge Peter Antal for starting up this research direction.</p>
			</sec>
		</ack>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>Blurring the boundaries between scientific papers and biological databases.</p>
				</title>
				<aug>
					<au>
						<snm>Gerstein</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Junker</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Nature Online</source>
				<url>http://www.nature.com/nature/debates/e-access/articles/gernstein.html</url>
			</bibl>
			<bibl id="B2">
				<title>
					<p>RefSeq and LocusLink: NCBI gene-centered resources.</p>
				</title>
				<aug>
					<au>
						<snm>Pruitt</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Maglott</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2001</pubdate>
				<volume>29</volume>
				<fpage>137</fpage>
				<lpage>140</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/nar/29.1.137</pubid>
						<pubid idtype="pmpid" link="fulltext">11125071</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B3">
				<title>
					<p>Use of keyword hierarchies to interpret gene expression.</p>
				</title>
				<aug>
					<au>
						<snm>Masys</snm>
						<fnm>DR</fnm>
					</au>
					<au>
						<snm>Welsh</snm>
						<fnm>JB</fnm>
					</au>
					<au>
						<snm>Fink</snm>
						<fnm>JL</fnm>
					</au>
					<au>
						<snm>Gribskov</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Klacansky</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Corbeil</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2001</pubdate>
				<volume>17</volume>
				<fpage>319</fpage>
				<lpage>326</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/17.4.319</pubid>
						<pubid idtype="pmpid" link="fulltext">11301300</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B4">
				<title>
					<p>A literature network of human genes for high-throughput analysis of gene expression.</p>
				</title>
				<aug>
					<au>
						<snm>Jenssen</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Laegreid</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Komorowski</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Hovig</snm>
						<fnm>E</fnm>
					</au>
				</aug>
				<source>Nat Genet</source>
				<pubdate>2001</pubdate>
				<volume>28</volume>
				<fpage>21</fpage>
				<lpage>28</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1038/88213</pubid>
						<pubid idtype="pmpid" link="fulltext">11326270</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B5">
				<title>
					<p>Information retrieval meets gene analysis.</p>
				</title>
				<aug>
					<au>
						<snm>Shatkay</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Edwards</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Boguski</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>IEEE Intell Syst (Special Issue on Intelligent Systems in Biology)</source>
				<pubdate>2002</pubdate>
				<volume>17</volume>
				<fpage>45</fpage>
				<lpage>53</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1109/5254.999219</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B6">
				<title>
					<p>Mining microarray expression data by literature profiling.</p>
				</title>
				<aug>
					<au>
						<snm>Chaussabel</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Sher</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Genome Biol</source>
				<pubdate>2002</pubdate>
				<volume>3</volume>
				<fpage>research0055.1</fpage>
				<lpage>0055.16</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmpid" link="fulltext">12372143</pubid>
						<pubid idtype="doi">10.1186/gb-2002-3-10-research0055</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B7">
				<title>
					<p>Evaluation of the vector space representation in text-based gene clustering.</p>
				</title>
				<aug>
					<au>
						<snm>Glenisson</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Antal</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Mathys</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Moreau</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>De Moor</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Pac Symp Biocomput</source>
				<pubdate>2003</pubdate>
				<fpage>391</fpage>
				<lpage>402</lpage>
				<xrefbib>
					<pubid idtype="pmpid">12603044</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B8">
				<title>
					<p>Using text analysis to identify functionally coherent gene groups.</p>
				</title>
				<aug>
					<au>
						<snm>Raychaudhuri</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Schutze</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Altman</snm>
						<fnm>RB</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2002</pubdate>
				<volume>12</volume>
				<fpage>1582</fpage>
				<lpage>1590</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1101/gr.116402</pubid>
						<pubid idtype="pmpid" link="fulltext">12368251</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B9">
				<title>
					<p>Finding relevant references to genes and proteins in Medline using a Bayesian approach.</p>
				</title>
				<aug>
					<au>
						<snm>Leonard</snm>
						<fnm>JE</fnm>
					</au>
					<au>
						<snm>Colombe</snm>
						<fnm>JB</fnm>
					</au>
					<au>
						<snm>Levy</snm>
						<fnm>JL</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2002</pubdate>
				<volume>18</volume>
				<fpage>1515</fpage>
				<lpage>1522</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/18.11.1515</pubid>
						<pubid idtype="pmpid" link="fulltext">12424124</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B10">
				<title>
					<p>Associating genes with Gene Ontology codes using a maximum entropy analysis of biomedical literature.</p>
				</title>
				<aug>
					<au>
						<snm>Raychaudhuri</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Chang</snm>
						<fnm>JT</fnm>
					</au>
					<au>
						<snm>Sutphin</snm>
						<fnm>PD</fnm>
					</au>
					<au>
						<snm>Altman</snm>
						<fnm>RB</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>2002</pubdate>
				<volume>12</volume>
				<fpage>203</fpage>
				<lpage>214</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1101/gr.199701</pubid>
						<pubid idtype="pmpid" link="fulltext">11779846</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B11">
				<title>
					<p>Gene Ontology Consortium</p>
				</title>
				<url>http://www.geneontology.org</url>
			</bibl>
			<bibl id="B12">
				<title>
					<p>Medical Subject Headings</p>
				</title>
				<url>http://www.nlm.nih.gov/mesh/meshhome.html</url>
			</bibl>
			<bibl id="B13">
				<title>
					<p>eVOC: a controlled vocabulary for unifying gene expression data.</p>
				</title>
				<aug>
					<au>
						<snm>Kelso</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Visagie</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Theiler</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Christoffels</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Bardien</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Smedley</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Otgaar</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Greyling</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Jongeneel</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>McCarthy</snm>
						<fnm>M</fnm>
					</au>
					<etal/>
				</aug>
				<source>Genome Res</source>
				<pubdate>2003</pubdate>
				<volume>13</volume>
				<fpage>1222</fpage>
				<lpage>1230</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1101/gr.985203</pubid>
						<pubid idtype="pmpid" link="fulltext">12799354</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B14">
				<title>
					<p>Gene Ontology Annotation</p>
				</title>
				<url>http://www.ebi.ac.uk/GOA</url>
			</bibl>
			<bibl id="B15">
				<title>
					<p>TXTGate Portal</p>
				</title>
				<url>http://www.esat.kuleuven.ac.be/txtgate</url>
			</bibl>
			<bibl id="B16">
				<title>
					<p>Mining functional information associated with expression arrays.</p>
				</title>
				<aug>
					<au>
						<snm>Blaschke</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Oliveros</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Valencia</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Funct Integr Genomics</source>
				<pubdate>2001</pubdate>
				<volume>1</volume>
				<fpage>256</fpage>
				<lpage>268</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1007/s101420000036</pubid>
						<pubid idtype="pmpid" link="fulltext">11793245</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B17">
				<title>
					<p>MedMiner: an internet text-mining tool for biomedical information, with application to gene expression profiling.</p>
				</title>
				<aug>
					<au>
						<snm>Tanabe</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Scherf</snm>
						<fnm>U</fnm>
					</au>
					<au>
						<snm>Smith</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Lee</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Hunter</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Weinstein</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>BioTechniques</source>
				<pubdate>1999</pubdate>
				<volume>27</volume>
				<fpage>1210</fpage>
				<lpage>1217</lpage>
				<xrefbib>
					<pubid idtype="pmpid">10631500</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B18">
				<title>
					<p>MedMiner</p>
				</title>
				<url>http://discover.nci.nih.gov/textmining</url>
			</bibl>
			<bibl id="B19">
				<title>
					<p>GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support.</p>
				</title>
				<aug>
					<au>
						<snm>Rebhan</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Chalifa-Caspi</snm>
						<fnm>V</fnm>
					</au>
					<au>
						<snm>Prilusky</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Lancet</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>1998</pubdate>
				<volume>14</volume>
				<fpage>656</fpage>
				<lpage>664</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/14.8.656</pubid>
						<pubid idtype="pmpid" link="fulltext">9789091</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B20">
				<title>
					<p>MedMOLE: mining literature to extract biological knowledge by microarray data.</p>
				</title>
				<aug>
					<au>
						<snm>Calogero</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Iazzetti</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Motta</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Pedrazzi</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Rago</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Rossi</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Turra</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>In Proc Virtual Conf Genomics Bioinformatics</source>
				<pubdate>2002</pubdate>
				<volume>2</volume>
				<fpage>9</fpage>
				<lpage>14</lpage>
			</bibl>
			<bibl id="B21">
				<title>
					<p>MedMOLE at CINECA</p>
				</title>
				<url>http://www.cineca.it/HPSystems/Chimica/medmole</url>
			</bibl>
			<bibl id="B22">
				<title>
					<p>DNA Array Analysis with GEISHA</p>
				</title>
				<url>http://www.pdg.cnb.uam.es/blaschke/cgi-bin/geisha</url>
			</bibl>
			<bibl id="B23">
				<title>
					<p>PubGene Gene Database and Tools</p>
				</title>
				<url>http://www.pubgene.org</url>
			</bibl>
			<bibl id="B24">
				<title>
					<p>Analysis of genomic and proteomic data using advanced literature mining.</p>
				</title>
				<aug>
					<au>
						<snm>Hu</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Hines</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Weng</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Zuo</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Rivera</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Richardson</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>LaBaer</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>J Proteome Res</source>
				<pubdate>2003</pubdate>
				<volume>2</volume>
				<fpage>405</fpage>
				<lpage>412</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1021/pr0340227</pubid>
						<pubid idtype="pmpid">12938930</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B25">
				<title>
					<p>MedGene Database</p>
				</title>
				<url>http://hipseq.med.harvard.edu/MEDGENE</url>
			</bibl>
			<bibl id="B26">
				<title>
					<p>Association of genes to genetically inherited diseases using data mining.</p>
				</title>
				<aug>
					<au>
						<snm>Perez-Iratxeta</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Bork</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Andrade</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Nat Genet</source>
				<pubdate>2002</pubdate>
				<volume>31</volume>
				<fpage>316</fpage>
				<lpage>319</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12006977</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B27">
				<title>
					<p>G2D Candidate Genes to Inherited Diseases</p>
				</title>
				<url>http://www.bork.embl-heidelberg.de/g2d</url>
			</bibl>
			<bibl id="B28">
				<title>
					<p>MeKE: discovering the functions of gene products from biomedical literature via sentence alignment.</p>
				</title>
				<aug>
					<au>
						<snm>Chiang</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Yu</snm>
						<fnm>H</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19</volume>
				<fpage>1417</fpage>
				<lpage>1422</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/btg160</pubid>
						<pubid idtype="pmpid" link="fulltext">12874055</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B29">
				<title>
					<p>MeKE (Medical Knowledge Explorer)</p>
				</title>
				<url>http://ismp.csie.ncku.edu.tw/~yuhc/meke</url>
			</bibl>
			<bibl id="B30">
				<title>
					<p>Java Remote Method Invocation (Java RMI)</p>
				</title>
				<url>http://java.sun.com/products/jdk/rmi</url>
			</bibl>
			<bibl id="B31">
				<aug>
					<au>
						<snm>Baeza-Yates</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Ribeiro-Neto</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Modern Information Retrieval</source>
				<publisher>Reading, MA: Addison-Wesley/ACM Press</publisher>
				<pubdate>1999</pubdate>
			</bibl>
			<bibl id="B32">
				<title>
					<p>An algorithm for suffix stripping.</p>
				</title>
				<aug>
					<au>
						<snm>Porter</snm>
						<fnm>MF</fnm>
					</au>
				</aug>
				<source>Program</source>
				<pubdate>1980</pubdate>
				<volume>14</volume>
				<fpage>130</fpage>
				<lpage>137</lpage>
			</bibl>
			<bibl id="B33">
				<title>
					<p><it>Saccharomyces </it>Genome Database</p>
				</title>
				<url>http://www.yeastgenome.org</url>
			</bibl>
			<bibl id="B34">
				<title>
					<p>OMIM - Online Mendelian Inheritance in Man</p>
				</title>
				<url>http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM</url>
			</bibl>
			<bibl id="B35">
				<title>
					<p>HUGO Gene Nomenclature Committee (HGNC)</p>
				</title>
				<url>http://www.gene.ucl.ac.uk/nomenclature</url>
			</bibl>
			<bibl id="B36">
				<aug>
					<au>
						<snm>Jain</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Dubes</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>Algorithms for Clustering Data</source>
				<publisher>Upper Saddle River, NJ: Prentice Hall</publisher>
				<pubdate>1988</pubdate>
			</bibl>
			<bibl id="B37">
				<title>
					<p>Comprehensive identification of cell cycle-regulated genes of the yeast <it>Saccharomyces cerevisiae </it>by microarray hybridization.</p>
				</title>
				<aug>
					<au>
						<snm>Spellman</snm>
						<fnm>PT</fnm>
					</au>
					<au>
						<snm>Sherlock</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Zhang</snm>
						<fnm>MQ</fnm>
					</au>
					<au>
						<snm>Iyer</snm>
						<fnm>VR</fnm>
					</au>
					<au>
						<snm>Anders</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Eisen</snm>
						<fnm>MB</fnm>
					</au>
					<au>
						<snm>Brown</snm>
						<fnm>PO</fnm>
					</au>
					<au>
						<snm>Botstein</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Futcher</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Mol Biol Cell</source>
				<pubdate>1998</pubdate>
				<volume>9</volume>
				<fpage>3273</fpage>
				<lpage>3297</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">9843569</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B38">
				<title>
					<p>Scoring and summarizing gene groups from text using the vector space model.</p>
				</title>
				<aug>
					<au>
						<snm>Glenisson</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Mathys</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Moreau</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>De Moor</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Technical Report 03-97, ESAT-SISTA</source>
				<publisher>Leuven, Belgium: K.U.Leuven</publisher>
				<pubdate>2003</pubdate>
				<url>ftp://ftp.esat.kuleuven.ac.be/pub/SISTA/glenisson/reports/genomebiol/TR03-97.pdf</url>
			</bibl>
			<bibl id="B39">
				<title>
					<p>Cluster analysis and display of genome-wide expression patterns.</p>
				</title>
				<aug>
					<au>
						<snm>Eisen</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Spellman</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Brown</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Botstein</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>1998</pubdate>
				<volume>95</volume>
				<fpage>14863</fpage>
				<lpage>14868</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1073/pnas.95.25.14863</pubid>
						<pubid idtype="pmpid" link="fulltext">9843981</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B40">
				<title>
					<p>AmiGO Gene Ontology browser</p>
				</title>
				<url>http://www.godatabase.org</url>
			</bibl>
			<bibl id="B41">
				<title>
					<p>Promoter swapping between the genes for a novel zinc finger protein and beta-catenin in pleiomorphic adenomas with t(3;8)(p21;q12) translocations.</p>
				</title>
				<aug>
					<au>
						<snm>Kas</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Voz</snm>
						<fnm>ML</fnm>
					</au>
					<au>
						<snm>Roijer</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Astrom</snm>
						<fnm>AK</fnm>
					</au>
					<au>
						<snm>Meyen</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Stenman</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Van de Ven</snm>
						<fnm>WJ</fnm>
					</au>
				</aug>
				<source>Nat Genet</source>
				<pubdate>1997</pubdate>
				<volume>15</volume>
				<fpage>170</fpage>
				<lpage>174</lpage>
				<xrefbib>
					<pubid idtype="pmpid">9020842</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B42">
				<title>
					<p>Microarray screening for target genes of the proto-oncogene PLAG1.</p>
				</title>
				<aug>
					<au>
						<snm>Voz</snm>
						<fnm>ML</fnm>
					</au>
					<au>
						<snm>Mathys</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Hensen</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Pendeville</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Van Valckenborgh</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Van Huffel</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Chavez</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Van Damme</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>De Moor</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Moreau</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Van de Ven</snm>
						<fnm>WJ</fnm>
					</au>
				</aug>
				<source>Oncogene</source>
				<pubdate>2004</pubdate>
				<volume>23</volume>
				<fpage>179</fpage>
				<lpage>191</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1038/sj.onc.1207013</pubid>
						<pubid idtype="pmpid" link="fulltext">14712223</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B43">
				<title>
					<p>Detecting gene relations from Medline abstracts.</p>
				</title>
				<aug>
					<au>
						<snm>Stephens</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Palakal</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Mukhopadhyay</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Raje</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Mostafa</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Pac Symp Biocomput</source>
				<pubdate>2001</pubdate>
				<fpage>483</fpage>
				<lpage>495</lpage>
				<xrefbib>
					<pubid idtype="pmpid">11262966</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B44">
				<title>
					<p>Gene ontology: tool for the unification of biology.</p>
				</title>
				<aug>
					<au>
						<snm>Ashburner</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Ball</snm>
						<fnm>CA</fnm>
					</au>
					<au>
						<snm>Blake</snm>
						<fnm>JA</fnm>
					</au>
					<au>
						<snm>Botstein</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Butler</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Cherry</snm>
						<fnm>JM</fnm>
					</au>
					<au>
						<snm>Davis</snm>
						<fnm>AP</fnm>
					</au>
					<au>
						<snm>Dolinski</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Dwight</snm>
						<fnm>SS</fnm>
					</au>
					<au>
						<snm>Eppig</snm>
						<fnm>JT</fnm>
					</au>
					<etal/>
				</aug>
				<source>Nat Genet</source>
				<pubdate>2000</pubdate>
				<volume>25</volume>
				<fpage>25</fpage>
				<lpage>29</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1038/75556</pubid>
						<pubid idtype="pmpid" link="fulltext">10802651</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B45">
				<title>
					<p>The computational analysis of scientific literature to define and recognize gene expression clusters.</p>
				</title>
				<aug>
					<au>
						<snm>Raychaudhuri</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Chang</snm>
						<fnm>JT</fnm>
					</au>
					<au>
						<snm>Imam</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Altman</snm>
						<fnm>RB</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2003</pubdate>
				<volume>31</volume>
				<fpage>4553</fpage>
				<lpage>4560</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/nar/gkg636</pubid>
						<pubid idtype="pmpid" link="fulltext">12888516</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
		</refgrp>
	</bm>
</art>
