<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>1471-2105-9-S3-S11</ui>
	<ji>1471-2105</ji>
	<fm>
		<dochead>Proceedings</dochead>
		<bibl>
			<title>
				<p>Identification of transcription factor contexts in literature using machine learning approaches</p>
			</title>
			<aug>
				<au id="A1">
					<snm>Yang</snm>
					<fnm>Hui</fnm>
					<insr iid="I1"/>
					<email>Hui.Yang@manchester.ac.uk</email>
				</au>
				<au id="A2" ca="yes">
					<snm>Nenadic</snm>
					<fnm>Goran</fnm>
					<insr iid="I1"/>
					<email>G.Nenadic@manchester.ac.uk</email>
				</au>
				<au id="A3">
					<snm>Keane</snm>
					<mi>A</mi>
					<fnm>John</fnm>
					<insr iid="I1"/>
					<email>John.Keane@manchester.ac.uk</email>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>School of Computer Science, University of Manchester, Manchester, UK</p>
				</ins>
			</insg>
			<source>BMC Bioinformatics</source>
			<supplement>
				<title>
					<p>Proceedings of the Second International Symposium on Languages in Biology and Medicine (LBM) 2007</p>
				</title>
				<editor>Christopher JO Baker and Su Jian</editor>
				<note>Proceedings</note>
				<url>http://www.biomedcentral.com/1471-2105-9-S3-info.pdf</url>
			</supplement>
			<conference>
				<title>
					<p>The Second International Symposium on Languages in Biology and Medicine (LBM) 2007</p>
				</title>
				<location>Singapore</location>
				<date-range>6-7 December 2007</date-range>
				<url>http://lbm2007.biopathway.org/</url>
			</conference>
			<issn>1471-2105</issn>
			<pubdate>2008</pubdate>
			<volume>9</volume>
			<issue>Suppl 3</issue>
			<fpage>S11</fpage>
			<url>http://www.biomedcentral.com/1471-2105/9/S3/S11</url>
			<xrefbib>
				<pubidlist><pubid idtype="pmpid">18426546</pubid><pubid idtype="doi">10.1186/1471-2105-9-S3-S11</pubid>
				</pubidlist></xrefbib>
		</bibl>
		<history>
			<pub>
				<date>
					<day>11</day>
					<month>04</month>
					<year>2008</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2008</year>
			<collab>Yang et al.; licensee BioMed Central Ltd.</collab>
			<note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
		</cpyrt>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<sec>
					<st>
						<p>Background</p>
					</st>
					<p>Availability of information about transcription factors (TFs) is crucial for genome biology, as TFs play a central role in the regulation of gene expression. While manual literature curation is expensive and labour intensive, the development of semi-automated text mining support is hindered by the unavailability of training data. There have been no studies on how existing data sources (e.g. TF-related data from the MeSH thesaurus and GO ontology) or potentially noisy example data (e.g. protein-protein interaction, PPI) could be used to provide training data for the identification of TF-contexts in literature.</p>
				</sec>
				<sec>
					<st>
						<p>Results</p>
					</st>
					<p>In this paper we describe a text-classification system designed to automatically recognise contexts related to transcription factors in literature. A learning model is based on a set of biological features (e.g. protein and gene names, interaction words, other biological terms) that are deemed relevant for the task. We have exploited background knowledge from existing biological resources (MeSH and GO) to engineer such features. Weak and noisy training datasets have been collected from descriptions of TF-related concepts in MeSH and GO, PPI data and data representing non-protein-function descriptions. Three machine-learning methods are investigated, along with a vote-based merging of individual approaches and/or different training datasets. The system achieved highly encouraging results, with most classifiers achieving an F-measure above 90%.</p>
				</sec>
				<sec>
					<st>
						<p>Conclusions</p>
					</st>
					<p>The experimental results have shown that the proposed model can be used for identification of TF-related contexts (i.e. sentences) with high accuracy, with a significantly reduced set of features compared to the traditional bag-of-words approach. The results of considering existing PPI data suggest that TF and PPI contexts are less similar than we had expected. We have also shown that existing knowledge sources are useful both for feature engineering and for obtaining noisy positive training data.</p>
				</sec>
			</sec>
		</abs>
	</fm>
	<bdy>
		<sec>
			<st>
				<p>Background</p>
			</st>
			<p>Over the past decade, text mining techniques have been used to support the (semi-)automatic extraction of information from biomedical literature. A number of systems have been designed to capture information on general biological molecular interactions <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp> or interactions focused on a particular organism of interest (such as Homo sapiens <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, Drosophila melanogaster <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, and Saccharomyces cerevisiae <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>). In addition, specific types of molecular interactions have been targeted (e.g. inhibition relationships between biological entities <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>, or enzyme and metabolic pathways <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>). Several evaluation challenges and exercises have been organised to assess the development in the field, in particular for protein-protein interactions (PPI) (e.g. BioCreative <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, LLL05 Challenge <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, etc.).</p>
			<p>A topic that has been of particular interest in biomedicine is the investigation of gene regulatory networks, in which transcription factors play a crucial role. A transcription factor (TF) is a protein that regulates binding of RNA polymerase and initiation of transcription <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. TFs are regulators of gene expression and influence almost all biological processes in an organism. Existing TF databases (such as TRANSFAC <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>, FlyBase <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>, ORegAnno <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>) are largely based on manual literature curation. Despite their importance for genome biology, curation of these databases is far from satisfactory for many organisms, partly due to the difficulty of locating information on transcription regulation in an ever-increasing volume of relevant literature.</p>
			<p>In this paper we investigate the automatic extraction of TF-related contexts (at the sentence level) to support curation of transcription factors from biomedical literature. To the best of our knowledge, our work is one of the first attempts to apply text-mining techniques to the task <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. As opposed to PPI contexts (representing interactions between proteins), our aim is to locate a specific type of interaction related to gene regulation by TFs. More precisely, we focus on a specific <it>role</it> of certain biological entities: our targets are contexts that mention special proteins (i.e. transcription factors) that regulate gene expression. The following is a typical example of a TF-related context:</p>
			<p indent="3">
				<it>&#8230; Reconstituted in vitro transcription reactions and deoxyribonuclease I footprinting assays confirmed the ability of TRF1 to bind preferentially and direct transcription of the tudor gene from an alternate promoter&#8230;</it>
			</p>
			<p>Several actors and events (e.g. proteins, DNA, transcriptions, DNA binding) can typically be found in such contexts (see Table <tblr tid="T1">1</tblr>). One of the most important features of TFs is transcription regulation, where transcription factors <it>interact</it> with other regulatory <it>proteins</it> to either increase or decrease the transcription of specific genes. Thus, transcription regulation contexts could be regarded as a type of PPI context, and in this paper we further investigate the degree of similarity between them.</p>
			<tbl id="T1">
				<title>
					<p>Table 1</p>
				</title>
				<caption>
					<p>Examples of typical actors and events in transcription factor contexts</p>
				</caption>
				<tblbdy cols="2">
					<r>
						<c ca="left">
							<p>
								<b>Actor, event</b>
							</p>
						</c>
						<c ca="left">
							<p>
								<b>Examples</b>
							</p>
						</c>
					</r>
					<r>
						<c cspan="2">
							<hr/>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<b>
									<it>DNA binding</it>
								</b>
							</p>
						</c>
						<c ca="left">
							<p>
								<it>DNA binding; DNA binding protein; DNA binding region; DNA binding property; DNA binding affinity; DNA binding specificity</it>
							</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<b>
									<it>Transcription</it>
								</b>
							</p>
						</c>
						<c ca="left">
							<p>
								<it>transcription; transcriptional regulator; gene transcription; transcription repression; transcription reaction; transcription activity</it>
							</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<b>
									<it>Protein actor</it>
								</b>
							</p>
						</c>
						<c ca="left">
							<p>
								<it>transcription factor; protein factor; transcription repressor; transcriptional activator; transcriptional mediator; heterodimer</it>
							</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>
								<b>
									<it>DNA actor</it>
								</b>
							</p>
						</c>
						<c ca="left">
							<p>
								<it>enhancer; promoter; reporter</it>
							</p>
						</c>
					</r>
				</tblbdy>
			</tbl>
			<p>We focus on machine learning (ML) approaches and discuss the creation of suitable training datasets that can support the task. More specifically, we present a series of investigations and experiments that aim to clarify the following issues:</p>
			<p>&#8226; <b>Training data:</b> can we use existing knowledge bases (e.g. the MeSH thesaurus <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> and GO ontology <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>) to create a collection of noisy but useful positive data? Would it be feasible to use PPI contexts to support TF-curation?</p>
			<p>&#8226; <b>Features:</b> is a small set of biological features (e.g. gene and protein names, TF-specific terms, interaction verbs, etc.), which are believed to be representative of transcription factors, enough to identify TF-related sentences?</p>
			<p>&#8226; <b>Machine learning:</b> which techniques are effective for the extraction of TF-related contexts?</p>
			<p>In the following section we present the methods and resources that have been used in our investigations. After presenting the experiments and results, we compare our approach to related work in the domain and give some conclusions and directions for future work.</p>
		</sec>
		<sec>
			<st>
				<p>Methods</p>
			</st>
			<p>We approached the problem of extracting TF-related contexts as a binary classification task: given a sentence, we aim to classify it as TF-related (positive) or not (negative). We consider three major components: selection of relevant features to support classification, obtaining training data to build classifiers, and selection of ML approaches to be employed for context recognition. We have analysed two types of features: in the <it>generic</it> model (GM), we follow the standard bag-of-words approach; in the <it>biological</it> model (BM), we consider only features that reflect the biological profile of the task. Three different machine learning algorithms are applied to train TF classifiers based on the two learning models. The overall approach is presented in Fig. <figr fid="F1">1</figr>.</p>
			<fig id="F1">
				<title>
					<p>Figure 1</p>
				</title>
				<caption>
					<p>Overall architecture of the approach</p>
				</caption>
				<text>
					<p>Overall architecture of the approach.</p>
				</text>
				<graphic file="1471-2105-9-S3-S11-1"/>
			</fig>
			<sec>
				<st>
					<p>Feature engineering</p>
				</st>
				<p>In the generic, word-based model (GM), standard word lemmatisation (using GeniaTagger <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>) is employed along with a feature selection procedure to reduce the feature space. We used Pearson's chi-square (&#967;<sup>2</sup>) test <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> to rank the words in descending order of their ability to distinguish between the classes. The threshold &#964; on the chi-square statistic used for feature selection is calculated using the following equation:</p>
				<p>
					<display-formula>
						<m:math name="1471-2105-9-S3-S11-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:mi>&#964;</m:mi>
									<m:mo>=</m:mo>
									<m:mfrac>
										<m:mrow>
											<m:mstyle displaystyle="true">
												<m:msub>
													<m:mo>&#8721;</m:mo>
													<m:mi>w</m:mi>
												</m:msub>
												<m:mrow>
													<m:msup>
														<m:mrow>
															<m:mrow>
																<m:mo>(</m:mo>
																<m:mrow>
																	<m:msub>
																		<m:mi>f</m:mi>
																		<m:mrow>
																			<m:mi>o</m:mi>
																			<m:mi>w</m:mi>
																		</m:mrow>
																	</m:msub>
																	<m:mo>&#8722;</m:mo>
																	<m:msub>
																		<m:mi>f</m:mi>
																		<m:mrow>
																			<m:mi>e</m:mi>
																			<m:mi>w</m:mi>
																		</m:mrow>
																	</m:msub>
																</m:mrow>
																<m:mo>)</m:mo>
															</m:mrow>
														</m:mrow>
														<m:mn>2</m:mn>
													</m:msup>
													<m:mo>/</m:mo>
													<m:msub>
														<m:mi>f</m:mi>
														<m:mrow>
															<m:mi>e</m:mi>
															<m:mi>w</m:mi>
														</m:mrow>
													</m:msub>
												</m:mrow>
											</m:mstyle>
										</m:mrow>
										<m:mrow>
											<m:mo>&#8741;</m:mo>
											<m:mi>w</m:mi>
											<m:mo>&#8741;</m:mo>
										</m:mrow>
									</m:mfrac>
								</m:mrow>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>where <it>f<sub>ow</sub></it> denotes the observed frequency of the word <it>w</it>, <it>f<sub>ew</sub></it> is its expected frequency, and &#8214;<it>w</it>&#8214; is the total number of words in the collection. A sentence vector is built from the words present in the sentence whose chi-square statistic is above the threshold &#964;.</p>
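				<p>The thresholding step can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments, and it rests on assumptions that the text does not spell out: the observed frequency is taken to be the count of a word in the positive class, the expected frequency is the count the word would have there if it were independent of the class, and the denominator is read as the number of distinct words. All function names are hypothetical.</p>
				<p><![CDATA[
```python
from collections import Counter


def chi_square_scores(pos_sentences, neg_sentences):
    """Per-word score (f_ow - f_ew)^2 / f_ew.

    Assumption: f_ow is the count of w in the positive class and f_ew is the
    count expected there if w were distributed independently of the class.
    Sentences are lists of already lemmatised tokens.
    """
    pos_counts = Counter(w for s in pos_sentences for w in s)
    neg_counts = Counter(w for s in neg_sentences for w in s)
    n_pos = sum(pos_counts.values())
    n_all = n_pos + sum(neg_counts.values())
    if n_pos == 0:
        raise ValueError("the positive class must contain at least one token")
    scores = {}
    for w in set(pos_counts) | set(neg_counts):
        f_ow = pos_counts[w]
        f_ew = (pos_counts[w] + neg_counts[w]) * n_pos / n_all
        scores[w] = (f_ow - f_ew) ** 2 / f_ew
    return scores


def select_features(scores):
    """Return tau (mean score over distinct words) and the words above it."""
    tau = sum(scores.values()) / len(scores)
    selected = {w for w, s in scores.items() if s > tau}
    return tau, selected


def sentence_vector(sentence, selected):
    """Keep only the selected words that occur in the sentence."""
    return sorted(set(sentence) & selected)
```
]]></p>
				<p>Under these assumptions, words that are evenly spread across both classes receive a score of zero and fall below the threshold, while words concentrated in one class are retained.</p>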
				<p>In the biological model (BM), the following features are identified in candidate sentences: gene/protein names, interaction words, TF-related MeSH and GO terms, and other biological words. Our rationale was simple: target sentences generally describe <it>interactions</it> between TFs (proteins) and target genes, and thus we expect gene/protein names to be important features, as are interaction words <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. Protein/gene names are recognised by combining the outputs from two publicly available gene name taggers, ABNER <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> and LingPipe <abbrgrp><abbr bid="B28">28</abbr></abbrgrp> (the integrated results achieved an average F-measure of 78.6%). A thesaurus of interaction words has been collected from the TF and PPI data (described below). All morphological and derivational variants (e.g. <it>regulate</it>, <it>regulation</it>, <it>regulatory</it>) have been included, resulting in 391 potential interaction-word form features.</p>
				<p>MeSH and GO terms related to transcription regulation are also considered as potentially important features and have been collected from these two resources, resulting in 247 MeSH terms (subheading &#8216;<it>Transcription Factor</it>&#8217; and its descendants) and 223 TF-related GO terms (based on the TF-related term list curated by <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>, which has been extended by all their descendants). Moreover, we have constructed a dictionary of biologically relevant words by tokenising all the terms contained in the MeSH thesaurus and the GO ontology (not only TF-related terms). After removing stop-words (using the SMART system's stopword list of 524 common words <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>) and discarding words with fewer than 3 characters, the dictionary contains around 50,000 words, which have been used as potential features in the BM model.</p>
				<p>A feature vector for a given sentence in the BM model contains the following features. First, for each word from the biological dictionary that is present in the sentence, a feature is added (<b><it>biological-word</it></b> features), as well as for each interaction word that occurs in the sentence but is not contained in the biological word dictionary (<b><it>interaction-word</it></b> features). Then, the following binary features are generated:</p>
				<p>- <b><it>has-protein</it></b> &#8211; flagged if the sentence contains at least one protein/gene name;</p>
				<p>- <b><it>has-two-proteins</it></b> &#8211; flagged if the sentence contains at least two unique protein/gene names;</p>
				<p>- <b><it>has-interaction-word</it></b> &#8211; flagged if the sentence contains at least one interaction word;</p>
				<p>- <b><it>has-two-interaction-words</it></b> &#8211; flagged if the sentence contains at least two unique interaction words;</p>
				<p>- <b><it>has-MeSH-TF-term</it></b> &#8211; flagged if the sentence contains at least one MeSH TF term;</p>
				<p>- <b><it>has-two-MeSH-TF-terms</it></b> &#8211; flagged if the sentence contains at least two unique MeSH TF terms;</p>
				<p>- <b><it>has-GO-TF-term</it></b> &#8211; flagged if the sentence contains at least one GO TF term;</p>
				<p>- <b><it>has-two-GO-TF-terms</it></b> &#8211; flagged if the sentence contains at least two unique GO TF terms.</p>
				<p>These feature vectors are used in three different machine learning algorithms (Naive Bayes (NB), Support Vector Machine (SVM), and Maximum Entropy (ME)) to learn the classifiers.</p>
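				<p>The construction of a BM feature vector described above can be sketched in Python as follows. This is an illustrative sketch only, not the original implementation: the function name, the feature-name prefixes, the representation of the lexicons as sets of lower-cased strings, and the use of simple substring matching for multi-word MeSH and GO terms are all assumptions made for the example. The list of protein names stands in for the output of a gene/protein name tagger.</p>
				<p><![CDATA[
```python
def bm_features(tokens, protein_names, bio_dictionary, interaction_words,
                mesh_tf_terms, go_tf_terms):
    """Build a sparse BM-style feature dict for one tokenised sentence.

    Assumptions: every lexicon is a set of lower-cased strings, and
    protein_names is the (hypothetical) tagger output for this sentence.
    """
    words = {t.lower() for t in tokens}
    text = " ".join(t.lower() for t in tokens)
    features = {}

    # biological-word features: dictionary words present in the sentence
    for w in words & bio_dictionary:
        features["bio:" + w] = 1

    # interaction-word features: only those not already in the dictionary
    found_interactions = words & interaction_words
    for w in found_interactions - bio_dictionary:
        features["int:" + w] = 1

    # counts of unique items behind the eight binary flags
    counts = {
        "protein": len({p.lower() for p in protein_names}),
        "interaction-word": len(found_interactions),
        "MeSH-TF-term": sum(1 for t in mesh_tf_terms if t in text),
        "GO-TF-term": sum(1 for t in go_tf_terms if t in text),
    }
    for name, n in counts.items():
        features["has-" + name] = int(n >= 1)
        features["has-two-" + name + "s"] = int(n >= 2)
    return features
```
]]></p>
				<p>Substring matching is a deliberate simplification: it would, for instance, also match a term inside a longer word, so a real system would need proper term recognition.</p>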
			</sec>
			<sec>
				<st>
					<p>Building training and testing datasets</p>
				</st>
				<p>Building a training set for the extraction of TF sentences proved to be the most difficult and time-consuming step. The only suitable and publicly available source is the FlyTF database (the Drosophila Transcription Factor database <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>). This is a manually curated database that contains transcription information based on FlyBase/GO annotation data and the DBD Transcription Factor Database <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>. Some of the records in the database are supported by &#8220;traceable author statements&#8221;, including sentences from the literature. We have extracted 491 sentences from the database, which did not seem sufficient for a larger-scale investigation on retrieving TF-related sentences. We have therefore considered additional sources to support the task by obtaining noisy and weak positive and negative training data.</p>
				<sec>
					<st>
						<p>Non-Protein-Function Description (NonPF) data</p>
					</st>
					<p>We used sentences from the Prodisen corpus <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>, which has been constructed for functional descriptions of genes and proteins, as negative data. A total of 1700 sentences marked as &#8220;not gene function description&#8221; were randomly selected from the corpus for training and testing.</p>
				</sec>
				<sec>
					<st>
						<p>MeSH and GO TF-related descriptions</p>
					</st>
					<p>As mentioned above (cf. the Feature engineering section), both the MeSH and GO databases contain TF-related concepts. MeSH terms located under the subheading &#8216;<it>Transcription Factor</it>&#8217; describe various types of transcription factor concepts, which are classified according to either the structure of their DNA-binding domains or their regulatory function. In addition, GO annotation information is usually used as a main source for curation and exploration in transcription factor databases such as FlyTF and TFDB <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. We have therefore collected <it>definitions</it> of TF-related terms from the MeSH and GO databases to create a <it>noisy</it> positive set of TF-related sentences. In addition to sentences in definitions, synonym lists are treated as TF-related sentences. Together with FlyTF data, we have collected around 1700 positive sentences (referred to as <it>TF data</it>).</p>
					<p>The suitability of TF-related definitions from the MeSH thesaurus and GO database as positive data has been tested on the existing FlyTF data. We performed a separate experiment (details are given in the Experiments and results section) in which only the MeSH and GO TF data was used for training, while the FlyTF data was used as the test data to evaluate the performance. Generally, the precision was well above 90% with the average recall around 70%, which supported our assumption that this data can be used for learning.</p>
				</sec>
				<sec>
					<st>
						<p>PPI data</p>
					</st>
					<p>There has been extensive work on PPI-focused text mining systems, and several resources are available for them (see related work discussed later). We consider PPI data because of a potential functional similarity between transcriptional regulation (where transcription factors interact with other regulatory proteins to either increase or decrease the transcription of specific genes) and generic PPI contexts. The aim was to investigate the possibility of using PPI data as training data for TF classification. Our rationale was the following: if PPI and TF contexts are indeed similar, then it would be difficult to differentiate between the two, and a (good) TF-classifier would generally achieve a lower precision on a dataset that contains both TF and PPI examples. On the other hand, if these two context types are generally different, then it would be easier to construct a classifier that performs well on TF and PPI data. We have tested this hypothesis by using PPI data as <it>negative</it> data and comparing the results to those obtained by using real negative data (NonPF). If PPI data can indeed be used as negative examples, then we would expect at least the same precision as for the NonPF (negative) data. To generalise the concept of PPI, the data has been collected from various sources including the datasets for the LLL05 Challenge <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, the BioCreAtIvE-PPI Corpus compiled by J. Hakenberg <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>, PICorpus <abbrgrp><abbr bid="B35">35</abbr></abbrgrp> and the GeneRIF HIV Interaction Corpus <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>.</p>
					<p>To summarise the data preparation step, the data used for TF-sentence classification is organised into three different sets of contexts, namely TF data (including FlyTF, MeSH and GO TF-related sentences, used as positive examples), non-protein-function-description (NonPF) data and protein-protein interaction (PPI) data. The NonPF and PPI datasets are treated separately as <it>negative</it> and <it>noisy negative</it> data, respectively, to constitute two experimental settings: TF&amp;NonPF and TF&amp;PPI. The three data collections have been prepared at the sentence level, and they all have a similar number of sentences (around 1700 each).</p>
				</sec>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Experiments and results</p>
			</st>
			<p>The detailed statistics for the datasets used in the experiments are given in Table <tblr tid="T2">2</tblr>. Table <tblr tid="T3">3</tblr> presents the details of the features generated for each of the datasets after the feature selection process (using chi-square statistics).</p>
			<tbl id="T2">
				<title>
					<p>Table 2</p>
				</title>
				<caption>
					<p>Statistics for the datasets used in the experiments</p>
				</caption>
				<tblbdy cols="9">
					<r>
						<c>
							<p/>
						</c>
						<c cspan="3" ca="center">
							<p>
								<b>TF data (positive data)</b>
							</p>
						</c>
						<c cspan="4" ca="center">
							<p>
								<b>PPI data (noisy negative data)</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>NonPF data (negative data)</b>
							</p>
						</c>
					</r>
					<r>
						<c>
							<p/>
						</c>
						<c cspan="8">
							<hr/>
						</c>
					</r>
					<r>
						<c>
							<p/>
						</c>
						<c ca="center">
							<p>
								<b>FlyTF</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>MeSH</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>GO</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>LLL</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>BioCreAtIvE</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>PICorpus</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>GeneRIF HIV</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>Prodisen</b>
							</p>
						</c>
					</r>
					<r>
						<c cspan="9">
							<hr/>
						</c>
					</r>
					<r>
						<c ca="center">
							<p>
								<b># sentences per resource</b>
							</p>
						</c>
						<c ca="center">
							<p>491</p>
						</c>
						<c ca="center">
							<p>712</p>
						</c>
						<c ca="center">
							<p>477</p>
						</c>
						<c ca="center">
							<p>77</p>
						</c>
						<c ca="center">
							<p>283</p>
						</c>
						<c ca="center">
							<p>127</p>
						</c>
						<c ca="center">
							<p>1200</p>
						</c>
						<c ca="center">
							<p>1700</p>
						</c>
					</r>
					<r>
						<c ca="center">
							<p>
								<b>total # sentences</b>
							</p>
						</c>
						<c cspan="3" ca="center">
							<p>1680</p>
						</c>
						<c cspan="4" ca="center">
							<p>1687</p>
						</c>
						<c ca="center">
							<p>1700</p>
						</c>
					</r>
				</tblbdy>
			</tbl>
			<tbl id="T3">
				<title>
					<p>Table 3</p>
				</title>
				<caption>
					<p>Feature statistics for different datasets (GM = generic model; BM = biological model). Note that the feature list used in the BM model is longer than that of the GM model due to the additional binary biological features (<it>has-protein</it>, <it>has-two-proteins</it>, etc.).</p>
				</caption>
				<tblbdy cols="5">
					<r>
						<c>
							<p/>
						</c>
						<c>
							<p/>
						</c>
						<c ca="center">
							<p>
								<b>TF data</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>PPI Data</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>NonPF Data</b>
							</p>
						</c>
					</r>
					<r>
						<c cspan="5">
							<hr/>
						</c>
					</r>
					<r>
						<c ca="center">
							<p>
								<b>total # features</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>GM</b>
							</p>
						</c>
						<c ca="center">
							<p>1327</p>
						</c>
						<c ca="center">
							<p>1188</p>
						</c>
						<c ca="center">
							<p>1780</p>
						</c>
					</r>
					<r>
						<c>
							<p/>
						</c>
						<c ca="center">
							<p>
								<b>BM</b>
							</p>
						</c>
						<c ca="center">
							<p>803</p>
						</c>
						<c ca="center">
							<p>760</p>
						</c>
						<c ca="center">
							<p>1306</p>
						</c>
					</r>
					<r>
						<c ca="center">
							<p>
								<b># features per sentence</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>GM</b>
							</p>
						</c>
						<c ca="center">
							<p>9.70</p>
						</c>
						<c ca="center">
							<p>14.44</p>
						</c>
						<c ca="center">
							<p>11.43</p>
						</c>
					</r>
					<r>
						<c>
							<p/>
						</c>
						<c ca="center">
							<p>
								<b>BM</b>
							</p>
						</c>
						<c ca="center">
							<p>12.87</p>
						</c>
						<c ca="center">
							<p>17.73</p>
						</c>
						<c ca="center">
							<p>9.78</p>
						</c>
					</r>
				</tblbdy>
			</tbl>
			<p>Before presenting the results of the identification of TF-related sentences, we first report our findings on the usefulness of TF-related data collected from the MeSH and GO databases as positive data for the task. We also present an analysis of the similarities between TF and PPI data. In all experiments, the performance has been evaluated using 5-fold cross-validation (train on 80% and test on 20%, repeated 5 times on a different 20% each time), by using precision (<it>P</it>), recall (<it>R</it>) and F-measure (<it>F</it>) metrics defined as follows:</p>
			<p>
				<display-formula>
					<m:math name="1471-2105-9-S3-S11-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
						<m:semantics>
							<m:mrow>
								<m:mi>R</m:mi>
								<m:mtext>&#8201;</m:mtext>
								<m:mo>=</m:mo>
								<m:mtext>&#8201;</m:mtext>
								<m:mfrac>
									<m:mrow>
										<m:mi>T</m:mi>
										<m:mi>P</m:mi>
									</m:mrow>
									<m:mrow>
										<m:mi>T</m:mi>
										<m:mi>P</m:mi>
										<m:mtext>&#8201;</m:mtext>
										<m:mo>+</m:mo>
										<m:mtext>&#8201;</m:mtext>
										<m:mi>F</m:mi>
										<m:mi>N</m:mi>
									</m:mrow>
								</m:mfrac>
								<m:mspace width="2em"/>
								<m:mi>P</m:mi>
								<m:mo>=</m:mo>
								<m:mfrac>
									<m:mrow>
										<m:mi>T</m:mi>
										<m:mi>P</m:mi>
									</m:mrow>
									<m:mrow>
										<m:mi>T</m:mi>
										<m:mi>P</m:mi>
										<m:mo>+</m:mo>
										<m:mi>F</m:mi>
										<m:mi>P</m:mi>
									</m:mrow>
								</m:mfrac>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mtext>&#8202;</m:mtext>
								<m:mi>F</m:mi>
								<m:mo>&#8722;</m:mo>
								<m:mi>m</m:mi>
								<m:mi>e</m:mi>
								<m:mi>a</m:mi>
								<m:mi>s</m:mi>
								<m:mi>u</m:mi>
								<m:mi>r</m:mi>
								<m:mi>e</m:mi>
								<m:mo>=</m:mo>
								<m:mfrac>
									<m:mrow>
										<m:mn>2</m:mn>
										<m:mi>P</m:mi>
										<m:mi>R</m:mi>
									</m:mrow>
									<m:mrow>
										<m:mi>P</m:mi>
										<m:mo>+</m:mo>
										<m:mi>R</m:mi>
									</m:mrow>
								</m:mfrac>
							</m:mrow>
							<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaqcLbxacqWGsbGucaaMe8Uaeyypa0JaaGjbVRWaaSaaaeaajugCbiabdsfaujabdcfaqbGcbaqcLbxacqWGubavcqWGqbaucaaMe8Uaey4kaSIaaGjbVlabdAeagjabd6eaobaakiaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlabdcfaqjabg2da9maalaaabaGaemivaqLaemiuaafabaGaemivaqLaemiuaaLaey4kaSIaemOrayKaemiuaafaaiaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlabdAeagjabgkHiTiabd2gaTjabdwgaLjabdggaHjabdohaZjabdwha1jabdkhaYjabdwgaLjabg2da9maalaaabaGaeGOmaiJaemiuaaLaemOuaifabaGaemiuaaLaey4kaSIaemOuaifaaaaa@A6CB@</m:annotation>
						</m:semantics>
					</m:math>
				</display-formula>
			</p>
			<p>where <it>TP</it> (true positive) is the number of correctly recognised TF sentences, <it>FN</it> (false negative) is the number of TF sentences not identified by the system, and <it>FP</it> (false positive) is the number of non-TF sentences incorrectly classified as TF sentences. For most experiments we compare the results obtained from the two learning models (generic and biological) and three ML approaches (SVM, NB, ME). The SVM classifier was built with the TinySVM package <abbrgrp><abbr bid="B37">37</abbr></abbrgrp> using the polynomial kernel, and the NB and ME classifiers were implemented with MALLET <abbrgrp><abbr bid="B38">38</abbr></abbrgrp> with the default parameters.</p>
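			<p>As an illustration only, the three measures can be computed from raw counts as in the following minimal Python sketch. The function names and the example counts are our own assumptions and are not part of the experimental software; the final check simply recomputes one F-measure from the precision and recall values reported for the SVM classifier (GM, MeSH+GO and NonPF) in Table 4.</p>
			<p>
```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall, F = 2PR / (P + R)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f(tp, fp, fn):
    """Return (P, R, F) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical counts for a single test fold (illustrative, not from the paper).
p, r, f = precision_recall_f(tp=90, fp=10, fn=30)
print(f"P={p:.4f} R={r:.4f} F={f:.4f}")

# Consistency check against one published cell: SVM, GM, MeSH+GO and NonPF in Table 4.
table4_f = f_measure(0.9328, 0.7352)
print(f"F from Table 4 P and R: {table4_f:.4f}")
```
			</p>
			<p>Returning zero when a denominator is empty is a convention chosen for this sketch; other conventions (for example, leaving the value undefined) are equally possible.</p>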
			<sec>
				<st>
					<p>Suitability of MeSH and GO TF-related data as positive examples</p>
				</st>
				<p>As described earlier, we hypothesised that the descriptions of TF-related terms from the MeSH and GO databases could be used for detecting TF-related sentences. To verify this hypothesis, we used this data as the <it>noisy positive</it> examples for learning (with NonPF and PPI as negative examples) and the FlyTF data (real positive examples) as the only positive data for testing. Table <tblr tid="T4">4</tblr> shows the performance of the three machine-learning classifiers.</p>
				<tbl id="T4">
					<title>
						<p>Table 4</p>
					</title>
					<caption>
						<p>Performance of the three machine-learning classifiers on the FlyTF test data using only MeSH and GO TF data as positive training data (GM = generic model; BM = biological model)</p>
					</caption>
					<tblbdy cols="11">
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c cspan="3" ca="center">
								<p>
									<b>SVM</b>
								</p>
							</c>
							<c cspan="3" ca="center">
								<p>
									<b>NB</b>
								</p>
							</c>
							<c cspan="3" ca="center">
								<p>
									<b>ME</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c cspan="9">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>
									<b>P</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>R</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>F</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>P</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>R</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>F</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>P</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>R</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>F</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="11">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="center">
								<p>
									<b>MeSH+GO &amp; NonPF</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>GM</b>
								</p>
							</c>
							<c ca="center">
								<p>.9328</p>
							</c>
							<c ca="center">
								<p>.7352</p>
							</c>
							<c ca="center">
								<p>.8223</p>
							</c>
							<c ca="center">
								<p>.9477</p>
							</c>
							<c ca="center">
								<p>.8859</p>
							</c>
							<c ca="center">
								<p>.9158</p>
							</c>
							<c ca="center">
								<p>.9595</p>
							</c>
							<c ca="center">
								<p>.7230</p>
							</c>
							<c ca="center">
								<p>.8246</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>
									<b>BM</b>
								</p>
							</c>
							<c ca="center">
								<p>.9542</p>
							</c>
							<c ca="center">
								<p>.7210</p>
							</c>
							<c ca="center">
								<p>.8213</p>
							</c>
							<c ca="center">
								<p>.9595</p>
							</c>
							<c ca="center">
								<p>.8676</p>
							</c>
							<c ca="center">
								<p>.9112</p>
							</c>
							<c ca="center">
								<p>.9802</p>
							</c>
							<c ca="center">
								<p>.7047</p>
							</c>
							<c ca="center">
								<p>.8199</p>
							</c>
						</r>
						<r>
							<c ca="center">
								<p>
									<b>MeSH+GO &amp; PPI</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>GM</b>
								</p>
							</c>
							<c ca="center">
								<p>1.000</p>
							</c>
							<c ca="center">
								<p>.6986</p>
							</c>
							<c ca="center">
								<p>.8225</p>
							</c>
							<c ca="center">
								<p>1.000</p>
							</c>
							<c ca="center">
								<p>.6354</p>
							</c>
							<c ca="center">
								<p>.7771</p>
							</c>
							<c ca="center">
								<p>.9972</p>
							</c>
							<c ca="center">
								<p>.7210</p>
							</c>
							<c ca="center">
								<p>.8369</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>
									<b>BM</b>
								</p>
							</c>
							<c ca="center">
								<p>.9810</p>
							</c>
							<c ca="center">
								<p>.6314</p>
							</c>
							<c ca="center">
								<p>.7683</p>
							</c>
							<c ca="center">
								<p>1.000</p>
							</c>
							<c ca="center">
								<p>.5234</p>
							</c>
							<c ca="center">
								<p>.6872</p>
							</c>
							<c ca="center">
								<p>.9816</p>
							</c>
							<c ca="center">
								<p>.6517</p>
							</c>
							<c ca="center">
								<p>.7834</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
				<p>The results show that precision was well above 90% on both datasets (MeSH+GO&amp;NonPF, MeSH+GO&amp;PPI), suggesting that the TF-related term definitions from the MeSH thesaurus and GO database, despite being noisy positive data, are suitable for capturing features for TF-sentence classification. The lower recall (52-89% in Table <tblr tid="T4">4</tblr>) indicates that this data, although accurate, does not cover all the ways in which TF sentences are phrased. To assess how much (real positive) data from FlyTF could help recall, we conducted a set of experiments in which 80% of the FlyTF data was added to the MeSH+GO (positive) training data and the remaining 20% was left for testing (using 5-fold cross-validation). Table <tblr tid="T5">5</tblr> shows the effect of adding the FlyTF data to the training data: recall increased substantially (by roughly 7-31 percentage points) and F-measure rose accordingly, with a limited drop in precision that occurred only for the TF&amp;NonPF data.</p>
				<tbl id="T5">
					<title>
						<p>Table 5</p>
					</title>
					<caption>
						<p>Performance of the three machine-learning classifiers on the FlyTF test data using both MeSH and GO TF data and part of the FlyTF data as positive training data (GM = generic model; BM = biological model)</p>
					</caption>
					<tblbdy cols="11">
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c cspan="3" ca="center">
								<p>
									<b>SVM</b>
								</p>
							</c>
							<c cspan="3" ca="center">
								<p>
									<b>NB</b>
								</p>
							</c>
							<c cspan="3" ca="center">
								<p>
									<b>ME</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c cspan="9">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>
									<b>P</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>R</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>F</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>P</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>R</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>F</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>P</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>R</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>F</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="11">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="center">
								<p>
									<b>TF &amp; NonPF</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>GM</b>
								</p>
							</c>
							<c ca="center">
								<p>.9271</p>
							</c>
							<c ca="center">
								<p>.8910</p>
							</c>
							<c ca="center">
								<p>.9087</p>
							</c>
							<c ca="center">
								<p>.9308</p>
							</c>
							<c ca="center">
								<p>.9592</p>
							</c>
							<c ca="center">
								<p>.9447</p>
							</c>
							<c ca="center">
								<p>.9588</p>
							</c>
							<c ca="center">
								<p>.8533</p>
							</c>
							<c ca="center">
								<p>.9029</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>
									<b>BM</b>
								</p>
							</c>
							<c ca="center">
								<p>.9455</p>
							</c>
							<c ca="center">
								<p>.8925</p>
							</c>
							<c ca="center">
								<p>.9182</p>
							</c>
							<c ca="center">
								<p>.9527</p>
							</c>
							<c ca="center">
								<p>.9450</p>
							</c>
							<c ca="center">
								<p>.9488</p>
							</c>
							<c ca="center">
								<p>.9770</p>
							</c>
							<c ca="center">
								<p>.8655</p>
							</c>
							<c ca="center">
								<p>.9109</p>
							</c>
						</r>
						<r>
							<c ca="center">
								<p>
									<b>TF &amp; PPI</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>GM</b>
								</p>
							</c>
							<c ca="center">
								<p>1.000</p>
							</c>
							<c ca="center">
								<p>.9183</p>
							</c>
							<c ca="center">
								<p>.9574</p>
							</c>
							<c ca="center">
								<p>1.000</p>
							</c>
							<c ca="center">
								<p>.8879</p>
							</c>
							<c ca="center">
								<p>.9406</p>
							</c>
							<c ca="center">
								<p>1.000</p>
							</c>
							<c ca="center">
								<p>.9124</p>
							</c>
							<c ca="center">
								<p>.9541</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>
									<b>BM</b>
								</p>
							</c>
							<c ca="center">
								<p>.9936</p>
							</c>
							<c ca="center">
								<p>.8926</p>
							</c>
							<c ca="center">
								<p>.9403</p>
							</c>
							<c ca="center">
								<p>1.000</p>
							</c>
							<c ca="center">
								<p>.8370</p>
							</c>
							<c ca="center">
								<p>.9112</p>
							</c>
							<c ca="center">
								<p>.9885</p>
							</c>
							<c ca="center">
								<p>.8818</p>
							</c>
							<c ca="center">
								<p>.9321</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
			</sec>
			<sec>
				<st>
					<p>Similarities between TF and PPI contexts</p>
				</st>
				<p>The last point made above was a surprise: when the PPI data was used as negative examples for training, precision was generally higher than when the NonPF data was used (see Tables <tblr tid="T4">4</tblr> and <tblr tid="T5">5</tblr>). This suggests that the PPI data discriminates TF contexts better than the NonPF dataset does. The high precision (for each of the three classifiers) suggests that TF and PPI contexts are not as similar as expected, implying that PPI data could provide promising <it>noisy negative</it> data for learning TF classifiers. To examine this further, we calculated feature distribution differences between the TF and PPI datasets, and also between the TF and NonPF data, using the Average Kullback-Leibler (AKL) divergence <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. For two datasets <it>q</it> and <it>p</it>, the AKL divergence is calculated as:</p>
				<p>
					<display-formula>
						<m:math name="1471-2105-9-S3-S11-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:mi>A</m:mi>
									<m:mi>K</m:mi>
									<m:mi>L</m:mi>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:mi>q</m:mi>
											<m:mo>,</m:mo>
											<m:mi>p</m:mi>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mo>=</m:mo>
									<m:mfrac>
										<m:mn>1</m:mn>
										<m:mn>2</m:mn>
									</m:mfrac>
									<m:mstyle displaystyle="true">
										<m:msub>
											<m:mo>&#8721;</m:mo>
											<m:mi>x</m:mi>
										</m:msub>
										<m:mrow>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mi>q</m:mi>
													<m:mrow>
														<m:mo>(</m:mo>
														<m:mi>x</m:mi>
														<m:mo>)</m:mo>
													</m:mrow>
													<m:mi>log</m:mi>
													<m:mo>&#8289;</m:mo>
													<m:mfrac>
														<m:mrow>
															<m:mi>q</m:mi>
															<m:mrow>
																<m:mo>(</m:mo>
																<m:mi>x</m:mi>
																<m:mo>)</m:mo>
															</m:mrow>
														</m:mrow>
														<m:mrow>
															<m:mi>p</m:mi>
															<m:mrow>
																<m:mo>(</m:mo>
																<m:mi>x</m:mi>
																<m:mo>)</m:mo>
															</m:mrow>
														</m:mrow>
													</m:mfrac>
													<m:mo>+</m:mo>
													<m:mi>p</m:mi>
													<m:mrow>
														<m:mo>(</m:mo>
														<m:mi>x</m:mi>
														<m:mo>)</m:mo>
													</m:mrow>
													<m:mi>log</m:mi>
													<m:mo>&#8289;</m:mo>
													<m:mfrac>
														<m:mrow>
															<m:mi>p</m:mi>
															<m:mrow>
																<m:mo>(</m:mo>
																<m:mi>x</m:mi>
																<m:mo>)</m:mo>
															</m:mrow>
														</m:mrow>
														<m:mrow>
															<m:mi>q</m:mi>
															<m:mrow>
																<m:mo>(</m:mo>
																<m:mi>x</m:mi>
																<m:mo>)</m:mo>
															</m:mrow>
														</m:mrow>
													</m:mfrac>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
										</m:mrow>
									</m:mstyle>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaqcLbuacqWGbbqqcqWGlbWscqWGmbatkmaabmaabaGaemyCaeNaeiilaWIaemiCaahacaGLOaGaayzkaaGaeyypa0ZaaSaaaeaacqaIXaqmaeaacqaIYaGmaaWaaabeaeaadaqadaqaaiabdghaXnaabmaabaGaemiEaGhacaGLOaGaayzkaaGagiiBaWMaei4Ba8Maei4zaC2aaSaaaeaacqWGXbqCdaqadaqaaiabdIha4bGaayjkaiaawMcaaaqaaiabdchaWnaabmaabaGaemiEaGhacaGLOaGaayzkaaaaaiabgUcaRiabdchaWnaabmaabaGaemiEaGhacaGLOaGaayzkaaGagiiBaWMaei4Ba8Maei4zaC2aaSaaaeaacqWGWbaCdaqadaqaaiabdIha4bGaayjkaiaawMcaaaqaaiabdghaXnaabmaabaGaemiEaGhacaGLOaGaayzkaaaaaaGaayjkaiaawMcaaaWcbaGaemiEaGhabeqdcqGHris5aaaa@64E2@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>Here, <it>q</it>(<it>x</it>) and <it>p</it>(<it>x</it>) are the occurrence probabilities of feature <it>x</it> in datasets <it>q</it> and <it>p</it>, respectively. In our case, feature probabilities are calculated from the chi-square statistic of each feature in the collection. The divergence results for the TF/PPI and TF/NonPF datasets with the two feature models (GM and BM) are presented in Fig. <figr fid="F2">2</figr> for various numbers of top-ranked features selected from the datasets. Overall, the divergence between the TF and PPI data was much larger than that between the TF and NonPF data. This partly explains why precision on the TF&amp;PPI dataset was generally higher than on TF&amp;NonPF.</p>
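				<p>The AKL formula can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the code used in the experiments: we assume that the chi-square scores are turned into probabilities by normalising them to sum to one, that both datasets share the same feature set with strictly positive probabilities, and that the natural logarithm is used. The feature names and scores below are invented for the example.</p>
				<p>
```python
import math


def normalise(scores):
    """Scale non-negative feature scores so that they sum to one."""
    total = sum(scores.values())
    return {feature: value / total for feature, value in scores.items()}


def akl_divergence(q, p):
    """AKL(q, p) = 0.5 * sum_x [q(x) log(q(x)/p(x)) + p(x) log(p(x)/q(x))]."""
    if set(q) != set(p):
        raise ValueError("q and p must be defined over the same features")
    return 0.5 * sum(
        q[x] * math.log(q[x] / p[x]) + p[x] * math.log(p[x] / q[x])
        for x in q
    )


# Hypothetical chi-square scores for three features (illustrative only).
tf_scores = {"transcription": 6.0, "bind": 3.0, "promoter": 1.0}
ppi_scores = {"transcription": 1.0, "bind": 6.0, "promoter": 3.0}

tf_probs = normalise(tf_scores)
ppi_probs = normalise(ppi_scores)
print(round(akl_divergence(tf_probs, ppi_probs), 4))
```
				</p>
				<p>The measure is symmetric in <it>q</it> and <it>p</it> and equals zero when the two distributions are identical, so larger values indicate feature distributions that are easier to tell apart.</p>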
				<fig id="F2">
					<title>
						<p>Figure 2</p>
					</title>
					<caption>
						<p>The average KL divergence of feature distributions between (1) TF and PPI, and (2) TF and NonPF datasets for the GM and BM models, when the top ranked features are considered (TF&amp; PPI_GM = feature distribution in TF vs. feature distribution in PPI in GM model, etc.)</p>
					</caption>
					<text>
						<p>The average KL divergence of feature distributions between (1) TF and PPI, and (2) TF and NonPF datasets for the GM and BM models, when the top ranked features are considered (TF&amp; PPI_GM = feature distribution in TF vs. feature distribution in PPI in GM model, etc.)</p>
					</text>
					<graphic file="1471-2105-9-S3-S11-2"/>
				</fig>
				<p>Despite the high precision in discriminating between generic PPI and TF contexts, some PPI sentences are also TF contexts. Table <tblr tid="T6">6</tblr> presents &#8220;confusion&#8221; examples: PPI sentences classified as TF contexts, and TF sentences marked as non-TF (i.e. PPI) contexts. For example, sentences containing &#8216;<it>transcription</it>&#8217; are usually <it>correctly</it> identified as (positive) TF contexts, whereas some TF sentences that lack strongly discriminative TF features are wrongly recognised as PPI examples. Still, the results for the TF&amp;PPI dataset were encouraging, and we decided to conduct further experiments with the PPI data used as (noisy) negative data (in addition to the NonPF data).</p>
				<tbl id="T6">
					<title>
						<p>Table 6</p>
					</title>
					<caption>
						<p>Examples of confused contexts in the TF &amp; PPI dataset</p>
					</caption>
					<tblbdy cols="3">
						<r>
							<c ca="center">
								<p>
									<b>Correct</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>Predicted</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>Example</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="3">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="center">
								<p>PPI</p>
							</c>
							<c ca="center">
								<p>TF</p>
							</c>
							<c ca="left">
								<p><it>Transcription Factor IIH (TFIIH) and p300 act cooperatively to enhance Vpr effects on glucocorticoid receptor transactivation</it>.</p>
							</c>
						</r>
						<r>
							<c ca="center">
								<p>PPI</p>
							</c>
							<c ca="center">
								<p>TF</p>
							</c>
							<c ca="left">
								<p><it>These studies show that VES induces growth inhibition of BT-20 cells through a mechanism that involves cyclin A-negative regulation of E2F-mediated transcription</it>.</p>
							</c>
						</r>
						<r>
							<c ca="center">
								<p>PPI</p>
							</c>
							<c ca="center">
								<p>TF</p>
							</c>
							<c ca="left">
								<p><it>Adenovirus E1A protein represses activation by Vpr by competing for binding to p300, suggesting that p300 is required for activation of HIV transcription by Vpr</it>.</p>
							</c>
						</r>
						<r>
							<c ca="center">
								<p>TF</p>
							</c>
							<c ca="center">
								<p>PPI</p>
							</c>
							<c ca="left">
								<p><it>It plays a role in HOMEOSTASIS of GLUCOSE and controls expression of GLUT2 PROTEIN</it>.</p>
							</c>
						</r>
						<r>
							<c ca="center">
								<p>TF</p>
							</c>
							<c ca="center">
								<p>PPI</p>
							</c>
							<c ca="left">
								<p><it>Mutations in hepatocyte nuclear factor 1-beta are associated with renal CYSTS and MATURITY-ONSET DIABETES MELLITUS type 5</it>.</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
			</sec>
			<sec>
				<st>
					<p>Performance comparisons for the TF-sentence classification task</p>
				</st>
				<p>After the preliminary experiments, the two datasets (TF&amp;NonPF and TF&amp;PPI) were used to train three machine-learning classifiers (SVM, NB, ME), using 5-fold cross-validation. Table <tblr tid="T7">7</tblr> and figures <figr fid="F3">3</figr> and <figr fid="F4">4</figr> present the results, while a detailed discussion is given below.</p>
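				<p>For readers unfamiliar with the protocol, the 5-fold partitioning can be sketched as follows. This is a simplified Python illustration with a hypothetical helper name; it assigns items to folds in a fixed round-robin order, whereas the actual partitioning used in the experiments (for example, whether the data was shuffled or stratified) is not specified here.</p>
				<p>
```python
def k_fold_splits(n_items, n_folds=5):
    """Yield (train, test) index lists; every item is tested exactly once."""
    indices = list(range(n_items))
    for fold in range(n_folds):
        test = indices[fold::n_folds]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


# Ten items split into five folds: 80% for training, 20% for testing each time.
splits = list(k_fold_splits(n_items=10))
for train, test in splits:
    print(len(train), len(test), test)
```
				</p>
				<p>Each classifier would be trained on the training indices of a fold and scored on the held-out indices, with the per-fold scores then combined.</p>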
				<fig id="F3">
					<title>
						<p>Figure 3</p>
					</title>
					<caption>
						<p>The F-measures of the three machine learning approaches on the TF&amp;NonPF dataset (GM = generic model; BM = biological model)</p>
					</caption>
					<text>
						<p>The F-measures of the three machine learning approaches on the TF&amp;NonPF dataset (GM = generic model; BM = biological model)</p>
					</text>
					<graphic file="1471-2105-9-S3-S11-3"/>
				</fig>
				<fig id="F4">
					<title>
						<p>Figure 4</p>
					</title>
					<caption>
						<p>The F-measures of the three machine learning approaches on the TF&amp;PPI dataset (GM = generic model; BM = biological model)</p>
					</caption>
					<text>
						<p>The F-measures of the three machine learning approaches on the TF&amp;PPI dataset (GM = generic model; BM = biological model)</p>
					</text>
					<graphic file="1471-2105-9-S3-S11-4"/>
				</fig>
				<tbl id="T7">
					<title>
						<p>Table 7</p>
					</title>
					<caption>
						<p>Performance of the three machine-learning classifiers on the TF &amp; NonPF and TF &amp; PPI datasets using 5-fold cross-validation (GM = generic model; BM = biological model)</p>
					</caption>
					<tblbdy cols="11">
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c cspan="3" ca="center">
								<p>
									<b>SVM</b>
								</p>
							</c>
							<c cspan="3" ca="center">
								<p>
									<b>NB</b>
								</p>
							</c>
							<c cspan="3" ca="center">
								<p>
									<b>ME</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c cspan="9">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>
									<b>P</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>R</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>F</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>P</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>R</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>F</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>P</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>R</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>F</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="11">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="center">
								<p>
									<b>TF &amp; NonPF</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>GM</b>
								</p>
							</c>
							<c ca="center">
								<p>.9342</p>
							</c>
							<c ca="center">
								<p>.9104</p>
							</c>
							<c ca="center">
								<p>.9222</p>
							</c>
							<c ca="center">
								<p>.9413</p>
							</c>
							<c ca="center">
								<p>.9744</p>
							</c>
							<c ca="center">
								<p>.9576</p>
							</c>
							<c ca="center">
								<p>.9638</p>
							</c>
							<c ca="center">
								<p>.9042</p>
							</c>
							<c ca="center">
								<p>.9330</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>
									<b>BM</b>
								</p>
							</c>
							<c ca="center">
								<p>.9421</p>
							</c>
							<c ca="center">
								<p>.9343</p>
							</c>
							<c ca="center">
								<p>.9380</p>
							</c>
							<c ca="center">
								<p>.9434</p>
							</c>
							<c ca="center">
								<p>.9726</p>
							</c>
							<c ca="center">
								<p>.9578</p>
							</c>
							<c ca="center">
								<p>.9591</p>
							</c>
							<c ca="center">
								<p>.9351</p>
							</c>
							<c ca="center">
								<p>.9470</p>
							</c>
						</r>
						<r>
							<c ca="center">
								<p>
									<b>TF &amp; PPI</b>
								</p>
							</c>
							<c ca="center">
								<p>
									<b>GM</b>
								</p>
							</c>
							<c ca="center">
								<p>.8938</p>
							</c>
							<c ca="center">
								<p>.9463</p>
							</c>
							<c ca="center">
								<p>.9193</p>
							</c>
							<c ca="center">
								<p>.8767</p>
							</c>
							<c ca="center">
								<p>.9268</p>
							</c>
							<c ca="center">
								<p>.9010</p>
							</c>
							<c ca="center">
								<p>.8685</p>
							</c>
							<c ca="center">
								<p>.9554</p>
							</c>
							<c ca="center">
								<p>.9099</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>
									<b>BM</b>
								</p>
							</c>
							<c ca="center">
								<p>.9092</p>
							</c>
							<c ca="center">
								<p>.9367</p>
							</c>
							<c ca="center">
								<p>.9227</p>
							</c>
							<c ca="center">
								<p>.8892</p>
							</c>
							<c ca="center">
								<p>.9268</p>
							</c>
							<c ca="center">
								<p>.9076</p>
							</c>
							<c ca="center">
								<p>.8974</p>
							</c>
							<c ca="center">
								<p>.9524</p>
							</c>
							<c ca="center">
								<p>.9241</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
				<sec>
					<st>
						<p>Comparison of the feature models (GM vs. BM)</p>
					</st>
					<p>The biological model consistently outperformed the generic model on both the TF&amp;NonPF and TF&amp;PPI datasets. The experimental results show that the performance of individual classifiers improved by up to 2.5%, and that this was achieved with fewer features (recall Table <tblr tid="T3">3</tblr>: the BM feature sets were almost one third the size of the GM feature sets). Although the biological model requires additional pre-processing for feature extraction (e.g. gene name identification), this is a common step in text mining pipelines and is beneficial for other tasks as well. Overall, the results suggest that biological features (gene/protein names, interaction words, MeSH/GO TF terms) are somewhat more useful than non-biological features for TF-sentence identification. Still, in some cases the BM model improved the F-measure by less than 1% compared to the GM model. One explanation for such a modest improvement is a potential overlap between BM and GM features. We examined the top 350 features (ranked by the chi-square statistic) from the GM and BM models used in the TF&amp;NonPF dataset, and found that only 9.4% of the GM features (33 features) did not appear in the BM feature list. This implies that the best features for classification are indeed biological words, which have been selected by both models.</p>
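					<p>The overlap analysis amounts to comparing two ranked feature lists. The following Python sketch shows one way to do this; the helper names and the synthetic feature lists are assumptions made for illustration, sized so that 33 of the 350 GM features are absent from the BM list, as reported above.</p>
					<p>
```python
def top_features(scores, k):
    """Return the k features with the highest scores (e.g. chi-square values)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def missing_from(gm_top, bm_top):
    """Count GM features absent from the BM list and give their share in percent."""
    bm_set = set(bm_top)
    missing = [feature for feature in gm_top if feature not in bm_set]
    return len(missing), 100.0 * len(missing) / len(gm_top)


# Ranking a toy score table (hypothetical scores).
print(top_features({"transcription": 9.1, "bind": 4.2, "cell": 0.3}, k=2))

# Synthetic feature lists sized to mirror the reported figures (33 of 350).
gm_top = [f"shared_{i}" for i in range(317)] + [f"gm_only_{i}" for i in range(33)]
bm_top = [f"shared_{i}" for i in range(317)] + [f"bm_only_{i}" for i in range(33)]

count, percent = missing_from(gm_top, bm_top)
print(count, round(percent, 1))
```
					</p>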
				</sec>
				<sec>
					<st>
						<p>Using more negative data for training</p>
					</st>
					<p>To ensure unbiased learning of the classifiers, in the first set of experiments (Table <tblr tid="T7">7</tblr>) we deliberately constructed the training datasets with balanced numbers of positive and negative examples. However, in a real-world setting, sentences that are not relevant to TFs are likely to be far more frequent than relevant ones. To investigate the impact of an unbalanced but more realistic training dataset containing more negative cases, we performed another set of experiments in which an additional 1200 PPI sentences and 1000 NonPF sentences were added to the corresponding (negative) training data, and examined the performance of the classifiers on the unchanged test data. The results presented in Table <tblr tid="T8">8</tblr> show only a marginal improvement over the balanced-training scenario (slightly higher precision, with a small drop in recall).</p>
					<tbl id="T8">
						<title>
							<p>Table 8</p>
						</title>
						<caption>
							<p>Performance of the three machine-learning classifiers on the TF &amp; NonPF and TF &amp; PPI datasets with additional negative examples for training using 5-fold cross-validation (GM = generic model; BM = biological model)</p>
						</caption>
						<tblbdy cols="11">
							<r>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c cspan="3" ca="center">
									<p>
										<b>SVM</b>
									</p>
								</c>
								<c cspan="3" ca="center">
									<p>
										<b>NB</b>
									</p>
								</c>
								<c cspan="3" ca="center">
									<p>
										<b>ME</b>
									</p>
								</c>
							</r>
							<r>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c cspan="9">
									<hr/>
								</c>
							</r>
							<r>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c ca="center">
									<p>
										<b>P</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>R</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>F</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>P</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>R</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>F</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>P</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>R</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>F</b>
									</p>
								</c>
							</r>
							<r>
								<c cspan="11">
									<hr/>
								</c>
							</r>
							<r>
								<c ca="center">
									<p>
										<b>TF &amp; NonPF</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>GM</b>
									</p>
								</c>
								<c ca="center">
									<p>.9592</p>
								</c>
								<c ca="center">
									<p>.8967</p>
								</c>
								<c ca="center">
									<p>.9269</p>
								</c>
								<c ca="center">
									<p>.9472</p>
								</c>
								<c ca="center">
									<p>.9708</p>
								</c>
								<c ca="center">
									<p>.9588</p>
								</c>
								<c ca="center">
									<p>.9700</p>
								</c>
								<c ca="center">
									<p>.8863</p>
								</c>
								<c ca="center">
									<p>.9263</p>
								</c>
							</r>
							<r>
								<c>
									<p/>
								</c>
								<c ca="center">
									<p>
										<b>BM</b>
									</p>
								</c>
								<c ca="center">
									<p>.9602</p>
								</c>
								<c ca="center">
									<p>.9242</p>
								</c>
								<c ca="center">
									<p>.9418</p>
								</c>
								<c ca="center">
									<p>.9371</p>
								</c>
								<c ca="center">
									<p>.9661</p>
								</c>
								<c ca="center">
									<p>.9513</p>
								</c>
								<c ca="center">
									<p>.9609</p>
								</c>
								<c ca="center">
									<p>.9208</p>
								</c>
								<c ca="center">
									<p>.9404</p>
								</c>
							</r>
							<r>
								<c ca="center">
									<p>
										<b>TF &amp; PPI</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>GM</b>
									</p>
								</c>
								<c ca="center">
									<p>.8959</p>
								</c>
								<c ca="center">
									<p>.9469</p>
								</c>
								<c ca="center">
									<p>.9207</p>
								</c>
								<c ca="center">
									<p>.8743</p>
								</c>
								<c ca="center">
									<p>.9149</p>
								</c>
								<c ca="center">
									<p>.8941</p>
								</c>
								<c ca="center">
									<p>.8760</p>
								</c>
								<c ca="center">
									<p>.9542</p>
								</c>
								<c ca="center">
									<p>.9134</p>
								</c>
							</r>
							<r>
								<c>
									<p/>
								</c>
								<c ca="center">
									<p>
										<b>BM</b>
									</p>
								</c>
								<c ca="center">
									<p>.9103</p>
								</c>
								<c ca="center">
									<p>.9379</p>
								</c>
								<c ca="center">
									<p>.9239</p>
								</c>
								<c ca="center">
									<p>.8891</p>
								</c>
								<c ca="center">
									<p>.9119</p>
								</c>
								<c ca="center">
									<p>.9004</p>
								</c>
								<c ca="center">
									<p>.9058</p>
								</c>
								<c ca="center">
									<p>.9506</p>
								</c>
								<c ca="center">
									<p>.9277</p>
								</c>
							</r>
						</tblbdy>
					</tbl>
				</sec>
				<sec>
					<st>
						<p>Comparison of ML approaches</p>
					</st>
					<p>Tables <tblr tid="T7">7</tblr> and <tblr tid="T8">8</tblr> show that the three ML approaches obtained high precision (generally over 90%), suggesting that TF contexts contain distinctive features which provide strong discriminating power. Still, the performance of the classifiers was not consistent across the two datasets. The NB classifier outperformed the other two classifiers on the TF&amp;NonPF dataset with an F-measure of over 95% on average, but it performed worse on the TF&amp;PPI dataset (its F-measure dropped below 91%). The SVM classifier was the best on the TF&amp;PPI dataset, but it did not work as well on the TF&amp;NonPF dataset, especially for the generic model. The inconsistent performance of the NB and SVM classifiers (the ME classifier was more stable) can be partially explained by the differences between the feature distributions in the two datasets (see Fig. <figr fid="F2">2</figr> for the AKL divergence).</p>
				</sec>
			</sec>
			<sec>
				<st>
					<p>Merging results from different classifiers</p>
				</st>
				<p>The inconsistent results obtained by different classifiers prompted us to analyse the results obtained by combining their outputs. We investigated vote-based merging in two stages: first, the outputs from the three different classifiers trained on the <it>same dataset</it> are combined according to different voting strategies (Stage I); then, the results integrated from the <it>distinct training datasets</it> (TF&amp;NonPF, TF&amp;PPI) are merged to form the final classification results (Stage II).</p>
				<sec>
					<st>
						<p>Stage I: merging results from the classifiers trained on the same dataset</p>
					</st>
					<p>We experimented with the biological model only. Three voting approaches were applied: <it>unanimous</it> (i.e. all classifiers vote positive), <it>any</it> (i.e. at least one classifier votes positive) or <it>majority</it> (at least 2 out of 3 votes). Table <tblr tid="T9">9</tblr> shows the performance of Stage I. It is no surprise that the unanimous voting strategy improved precision, while voting based on a positive outcome from any classifier improved overall recall. However, the best merged F-measure was achieved by the majority voting method, with a marginally worse performance compared to the best single classifier. It is reasonable to expect majority voting to have a slightly lower F-measure, as it builds its results only from the judgment shared by the majority of the classifiers.</p>
					<tbl id="T9">
						<title><p>Table 9</p></title>
						<caption><p>Stage I performance, after the result merging from the three different classifiers learned on the same dataset (using the biological model), along with the best performance in each column before and after Stage I highlighted</p></caption>
						<tblbdy cols="8">
							<r><c><p/></c><c><p/></c><c cspan="3" ca="center"><p><b>TF &amp; NonPF data</b></p></c><c cspan="3" ca="center"><p><b>TF &amp; PPI data</b></p></c></r>
							<r><c><p/></c><c><p/></c><c cspan="6"><hr/></c></r>
							<r><c><p/></c><c><p/></c><c ca="center"><p><b>P</b></p></c><c ca="center"><p><b>R</b></p></c><c ca="center"><p><b>F</b></p></c><c ca="center"><p><b>P</b></p></c><c ca="center"><p><b>R</b></p></c><c ca="center"><p><b>F</b></p></c></r>
							<r><c cspan="8"><hr/></c></r>
							<r><c ca="center"><p><b>before Stage I</b></p></c><c ca="center"><p><b>SVM</b></p></c><c ca="center"><p>.9381</p></c><c ca="center"><p>.9420</p></c><c><p/></c><c ca="center"><p><b>.9039</b></p></c><c ca="center"><p>.9488</p></c><c ca="center"><p><b>.9258</b></p></c></r>
							<r><c><p/></c><c ca="center"><p><b>NB</b></p></c><c ca="center"><p>.9315</p></c><c ca="center"><p><b>.9720</b></p></c><c ca="center"><p><b>.9514</b></p></c><c ca="center"><p>.8787</p></c><c ca="center"><p>.9143</p></c><c ca="center"><p>.8961</p></c></r>
							<r><c><p/></c><c ca="center"><p><b>ME</b></p></c><c ca="center"><p>.9327</p></c><c ca="center"><p>.9474</p></c><c><p/></c><c ca="center"><p>.8895</p></c><c ca="center"><p><b>.9536</b></p></c><c ca="center"><p>.9204</p></c></r>
							<r><c ca="center"><p><b>after Stage I</b></p></c><c ca="center"><p><b>2/3 majority</b></p></c><c ca="center"><p>.9576</p></c><c ca="center"><p>.9427</p></c><c><p/></c><c ca="center"><p>.8966</p></c><c ca="center"><p>.9515</p></c><c ca="center"><p><b>.9236</b></p></c></r>
							<r><c><p/></c><c ca="center"><p><b>unanimous</b></p></c><c ca="center"><p><b>.9748</b></p></c><c ca="center"><p>.9242</p></c><c><p/></c><c ca="center"><p><b>.9279</b></p></c><c ca="center"><p>.8836</p></c><c ca="center"><p>.9052</p></c></r>
							<r><c><p/></c><c ca="center"><p><b>any</b></p></c><c ca="center"><p>.9141</p></c><c ca="center"><p><b>.9785</b></p></c><c ca="center"><p>.9452</p></c><c ca="center"><p>.8598</p></c><c ca="center"><p><b>.9815</b></p></c><c ca="center"><p>.9166</p></c></r>
						</tblbdy>
					</tbl>
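					<p>The three Stage I voting strategies can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, and the gold labels and classifier outputs are invented toy data (1 = TF-related sentence, 0 = other).</p>

```python
def merge_stage_one(predictions, strategy):
    """Merge per-sentence labels from several classifiers by voting.

    predictions: one list of 0/1 labels per classifier, all of equal length.
    strategy: "unanimous", "any" or "majority".
    """
    merged = []
    for votes in zip(*predictions):
        positives = sum(votes)
        if strategy == "unanimous":
            merged.append(int(positives == len(votes)))
        elif strategy == "any":
            merged.append(int(positives >= 1))
        elif strategy == "majority":
            # strictly more than half, i.e. at least 2 of 3 classifiers
            merged.append(int(2 * positives > len(votes)))
        else:
            raise ValueError("unknown strategy: " + strategy)
    return merged


def precision_recall_f(gold, predicted):
    """Precision, recall and F-measure for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if p and not g)
    fn = sum(1 for g, p in pairs if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


# Invented toy data: gold labels and three hypothetical classifier outputs.
gold = [1, 1, 1, 1, 0, 0, 0, 0]
svm = [1, 1, 0, 1, 0, 0, 1, 0]
nb = [1, 1, 1, 0, 1, 0, 0, 0]
me = [1, 0, 1, 1, 0, 0, 1, 0]

for strategy in ("unanimous", "majority", "any"):
    merged = merge_stage_one([svm, nb, me], strategy)
    p, r, f = precision_recall_f(gold, merged)
    print(f"{strategy:10s} P={p:.4f} R={r:.4f} F={f:.4f}")
```

					<p>Because every sentence accepted by unanimous voting is also accepted by majority voting, and every sentence accepted by majority voting is also accepted by any voting, recall can only stay the same or increase along that sequence.</p>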
				</sec>
				<sec>
					<st>
						<p>Stage II: merging results from the classifiers trained on different datasets</p>
					</st>
					<p>Classifiers learned on different datasets may rely on different classification features. By merging the results from different datasets, we investigated potential complementarities. Two types of result filtering were considered: <it>unanimous voting</it> and <it>any voting</it>. Note that the two sets of results merged at Stage II are always obtained using the same voting strategy at Stage I.</p>
					<p>The final merged results are reported in Table <tblr tid="T10">10</tblr>. The best precision, recall, and F-measure obtained in Stage II were higher than those produced at Stage I and by the individual classifiers. The two best F-measure values, with the most balanced precision and recall, were obtained by combining 2/3 majority voting (Stage I) with any voting (Stage II), and by combining any voting (Stage I) with unanimous voting (Stage II). The former method achieved an F-measure of 97.69% with a high recall (99.46%), while the F-measure of the latter reached 97.93% with a &#8216;perfect&#8217; precision (100%). These results support our hypothesis of a complementary relation between the results obtained from the TF&amp;PPI and TF&amp;NonPF data sources. This means that result merging could be an effective approach to improving performance by exploiting the different contributions of the two training datasets.</p>
					<tbl id="T10">
						<title>
							<p>Table 10</p>
						</title>
						<caption>
							<p>Stage II performance, after combining the results from the two datasets (TF &amp; NonPF and TF &amp; PPI); the best combination results are highlighted</p>
						</caption>
						<tblbdy cols="11">
							<r>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c cspan="9" ca="center">
									<p>
										<b>Stage I</b>
									</p>
								</c>
							</r>
							<r>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c cspan="9">
									<hr/>
								</c>
							</r>
							<r>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c cspan="3" ca="center">
									<p>
										<b>2/3 majority</b>
									</p>
								</c>
								<c cspan="3" ca="center">
									<p>
										<b>unanimous</b>
									</p>
								</c>
								<c cspan="3" ca="center">
									<p>
										<b>any</b>
									</p>
								</c>
							</r>
							<r>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c cspan="9">
									<hr/>
								</c>
							</r>
							<r>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c ca="center">
									<p>
										<b>P</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>R</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>F</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>P</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>R</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>F</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>P</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>R</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>F</b>
									</p>
								</c>
							</r>
							<r>
								<c cspan="11">
									<hr/>
								</c>
							</r>
							<r>
								<c ca="center">
									<p>
										<b>Stage</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>unanimous</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>1.000</b>
									</p>
								</c>
								<c ca="center">
									<p>.9036</p>
								</c>
								<c ca="center">
									<p>.9493</p>
								</c>
								<c ca="center">
									<p>
										<b>1.000</b>
									</p>
								</c>
								<c ca="center">
									<p>.8339</p>
								</c>
								<c ca="center">
									<p>.9094</p>
								</c>
								<c ca="center">
									<p>
										<b>1.000</b>
									</p>
								</c>
								<c ca="center">
									<p>.9595</p>
								</c>
								<c ca="center">
									<p>
										<b>.9793</b>
									</p>
								</c>
							</r>
							<r>
								<c ca="center">
									<p>
										<b>II</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>any</b>
									</p>
								</c>
								<c ca="center">
									<p>.9598</p>
								</c>
								<c ca="center">
									<p>
										<b>.9946</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>.9769</b>
									</p>
								</c>
								<c ca="center">
									<p>.9748</p>
								</c>
								<c ca="center">
									<p>
										<b>.9685</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>.9716</b>
									</p>
								</c>
								<c ca="center">
									<p>.9115</p>
								</c>
								<c ca="center">
									<p>
										<b>.9994</b>
									</p>
								</c>
								<c ca="center">
									<p>.9534</p>
								</c>
							</r>
						</tblbdy>
					</tbl>
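					<p>Stage II can be sketched in the same spirit. The code below is a hypothetical illustration: the function names and the toy label lists are assumptions, while the precision and recall values used in the final check are the two best combinations reported in Table 10. The F-measure is assumed to be the usual harmonic mean of precision and recall.</p>

```python
def merge_stage_two(nonpf_labels, ppi_labels, strategy):
    """Merge Stage I outputs of the TF/NonPF and TF/PPI models per sentence.

    Both inputs are lists of 0/1 labels of equal length.
    strategy: "unanimous" (both models agree on positive) or "any".
    """
    if strategy == "unanimous":
        return [int(a and b) for a, b in zip(nonpf_labels, ppi_labels)]
    if strategy == "any":
        return [int(a or b) for a, b in zip(nonpf_labels, ppi_labels)]
    raise ValueError("unknown strategy: " + strategy)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented Stage I outputs for five sentences.
nonpf = [1, 1, 0, 1, 0]
ppi = [1, 0, 1, 1, 0]
print(merge_stage_two(nonpf, ppi, "unanimous"))  # [1, 0, 0, 1, 0]
print(merge_stage_two(nonpf, ppi, "any"))        # [1, 1, 1, 1, 0]

# Consistency check against the two best rows of Table 10.
print(round(f_measure(0.9598, 0.9946), 4))  # 0.9769 (majority, then any)
print(round(f_measure(1.0, 0.9595), 4))     # 0.9793 (any, then unanimous)
```

					<p>The two F-measure values recomputed from the reported precision and recall agree with the figures quoted in the text (97.69% and 97.93%).</p>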
				</sec>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Related work</p>
			</st>
			<p>Approaches to the extraction of protein-protein interactions and other biological relationships from biomedical text vary widely. Previous research efforts have generally focused on either statistical methods (e.g. co-occurrence of biological entities like protein names or word frequency information <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B40">40</abbr><abbr bid="B41">41</abbr></abbrgrp>), or linguistic approaches including shallow and deep parsing, applying simple pattern- or rule-based matching <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B42">42</abbr></abbrgrp> or complex template- or frame-based processing <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B43">43</abbr><abbr bid="B44">44</abbr><abbr bid="B45">45</abbr></abbrgrp>. In addition, a number of research projects have relied on machine learning. For example, Donaldson and colleagues <abbrgrp><abbr bid="B46">46</abbr></abbrgrp> built a prototype system to populate a knowledge base with PPI data recognised by an SVM classifier. Jansen and associates <abbrgrp><abbr bid="B47">47</abbr></abbrgrp> reported on a Bayesian network to predict PPI in yeast. Sugiyama and colleagues <abbrgrp><abbr bid="B48">48</abbr></abbrgrp> investigated several machine learning techniques, such as the K-nearest neighbour rule, decision trees, neural networks, and SVM, to verify the effectiveness of ML approaches in detecting PPI.</p>
			<p>Similarly to other ML approaches, we have employed different machine learning methods (naive Bayes, SVM, and Maximum Entropy) to discover contexts describing transcription factors. However, our system differs from the related work in the following aspects:</p>
			<p>(1) Our approach is focused on a specific <it>role</it> of certain biological entities: our targets are special proteins (i.e. transcription factors) that regulate gene expression. Due to the particular role that TFs have in gene regulation, the objective of our system is to detect relevant text contexts related to this specific biological <it>function</it> and <it>role</it>.</p>
			<p>(2) We rely on background knowledge collected from weak and noisy evidence that is available in existing resources. We have created a dataset of positive examples from descriptions of biological terms from the MeSH and GO databases related to transcription factors. The experiments have shown that although not ideal, this dataset can be used as noisy positive training data.</p>
			<p>(3) Feature selection is one of the most important issues in an ML approach. Most existing approaches rely on weighted word-based features. We have used biological features (such as protein/gene names, molecular interaction words, and TF-related terms) and have shown that these features provide at least comparable performance with a significant reduction of the feature space.</p>
		</sec>
		<sec>
			<st>
				<p>Conclusions</p>
			</st>
			<p>We have presented a text-classification approach to automatically locate TF-related sentence contexts, in order to build a starting point for literature-based curation of transcription factor databases. The results are highly encouraging, with F-measure well above 90%. The extraction approach is built around an ML-based architecture, with a dedicated feature model based on specific biological features relevant to the task. We have investigated three different ML methods, and also presented a two-stage result-merging method that has been used to combine the results from both different types of machine-learning algorithms and the different training datasets.</p>
			<p>Our initial experiments have confirmed that reasonable training data can be obtained from existing resources, namely, MeSH and GO TF-related data. The testing results on the FlyTF data were encouraging, and strongly confirmed our assumptions that TF-related MeSH and GO term definitions are useful for the detection of TF-related contexts, but that real-world positive data (e.g. from FlyTF) are needed to improve recall. Another interesting finding from our experiments is that we were not able to confirm the strong similarity between TF and PPI contexts that we had expected. By using PPI data as negative examples for the TF-related sentence extraction, we were generally able to obtain comparable if not more accurate results when compared to negative data obtained from non-protein-description data (NonPF).</p>
			<p>The results reported here show that the proposed approach is capable of accurately identifying TF-related information from text. However, a number of interesting issues remain to be resolved. The first issue is related to distinguishing transcription factors from other proteins in a TF-related context in which two or more gene and protein names co-occur. A possible solution is to make use of syntactic relations, combined with biological feature terms, to judge the likelihood of a protein being a transcription factor. In addition, the FlyTF data, which is used as an important source of positive TF examples for classification, is an organism-specific corpus. It is likely that it does not cover all TF-related features for various organisms. Therefore, an analysis of more diverse TF data for the identification of transcription factors is needed.</p>
		</sec>
		<sec>
			<st>
				<p>Competing interests</p>
			</st>
			<p>The authors declare that they have no competing interests.</p>
		</sec>
		<sec>
			<st>
				<p>Authors' contributions</p>
			</st>
			<p>HY designed and implemented the system, performed experiments and evaluated the results. GN motivated and coordinated the study, and helped with the interpretation of the results. JAK participated in the conceptual design and machine learning. HY drafted the first version of the manuscript, and GN prepared the final manuscript, which all authors read and approved.</p>
		</sec>
	</bdy>
	<bm>
		<ack>
			<sec>
				<st>
					<p>Acknowledgements</p>
				</st>
				<p>We would like to thank Dr Casey Bergman (Faculty of Life Sciences, University of Manchester) for useful discussions on transcription factors. The authors would also like to thank the anonymous reviewers, whose comments were extremely helpful. This work was in part supported by the bio-MITA project (&#8220;Mining Term Associations from Literature to Support Knowledge Discovery in Biology&#8221;), funded by the UK Biotechnology and Biological Sciences Research Council (BBSRC).</p>
				<p>This article has been published as part of <it>BMC Bioinformatics</it> Volume 9 Supplement 3, 2008: Proceedings of the Second International Symposium on Languages in Biology and Medicine (LBM) 2007. The full contents of the supplement are available online at <url>http://www.biomedcentral.com/1471-2105/9?issue=S3</url>.</p>
			</sec>
		</ack>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>Automatic extraction of biological information from scientific text: protein&#8211;protein interactions</p>
				</title>
				<aug>
					<au>
						<snm>Blaschke</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Andrade</snm>
						<fnm>MA</fnm>
					</au>
					<au>
						<snm>Ouzounis</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Valencia</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Proc Int Conf Intell Syst Mol Biol</source>
				<pubdate>1999</pubdate>
				<fpage>60</fpage>
				<lpage>67</lpage>
				<xrefbib>
					<pubid idtype="pmpid">10786287</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B2">
				<title>
					<p>MeKE: discovering the functions of gene products from biomedical literature via sentence alignment</p>
				</title>
				<aug>
					<au>
						<snm>Chiang</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Yu</snm>
						<fnm>H</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19</volume>
				<fpage>1417</fpage>
				<lpage>1422</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12874055</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B3">
				<title>
					<p>Discovering patterns to extract protein-protein interactions from the literature</p>
				</title>
				<aug>
					<au>
						<snm>Hao</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Zhu</snm>
						<fnm>X</fnm>
					</au>
					<au>
						<snm>Huang</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Li</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>21</volume>
				<fpage>3294</fpage>
				<lpage>3300</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15890744</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B4">
				<title>
					<p>GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles</p>
				</title>
				<aug>
					<au>
						<snm>Friedman</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Kra</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Yu</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Krauthammer</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Rzhetsky</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2001</pubdate>
				<volume>17</volume>
				<fpage>S74</fpage>
				<lpage>S82</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">11472995</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B5">
				<title>
					<p>Discovering patterns to extract protein-protein interactions from full texts</p>
				</title>
				<aug>
					<au>
						<snm>Huang</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Zhu</snm>
						<fnm>X</fnm>
					</au>
					<au>
						<snm>Hao</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Payan</snm>
						<fnm>DG</fnm>
					</au>
					<au>
						<snm>Qu</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Li</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2004</pubdate>
				<volume>20</volume>
				<fpage>3604</fpage>
				<lpage>3612</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15284092</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B6">
				<title>
					<p>PRIME: Automatically Extracted PRotein Interactions and Molecular Information Database</p>
				</title>
				<aug>
					<au>
						<snm>Koike</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Takagi</snm>
						<fnm>T</fnm>
					</au>
				</aug>
				<source>In Silico Biol</source>
				<pubdate>2005</pubdate>
				<volume>5</volume>
				<issue>1</issue>
				<fpage>9</fpage>
				<lpage>20</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15972002</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B7">
				<title>
					<p>Mining literature for protein-protein interactions</p>
				</title>
				<aug>
					<au>
						<snm>Marcotte</snm>
						<fnm>EM</fnm>
					</au>
					<au>
						<snm>Xenarios</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Eisenberg</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2001</pubdate>
				<volume>17</volume>
				<fpage>359</fpage>
				<lpage>363</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">11301305</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B8">
				<title>
					<p>Automated extraction of information on protein&#8211;protein interactions from the biological literature</p>
				</title>
				<aug>
					<au>
						<snm>Ono</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Hishigaki</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Tanigami</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Takagi</snm>
						<fnm>T</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2001</pubdate>
				<volume>17</volume>
				<fpage>155</fpage>
				<lpage>161</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">11238071</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B9">
				<title>
					<p>Automatic extraction of protein interactions from scientific abstracts</p>
				</title>
				<aug>
					<au>
						<snm>Thomas</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Milward</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Ouzounis</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Pulman</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Carroll</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Pac Symp Biocomput</source>
				<pubdate>2000</pubdate>
				<fpage>541</fpage>
				<lpage>552</lpage>
				<xrefbib>
					<pubid idtype="pmpid">10902201</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B10">
				<title>
					<p>Extracting human protein interactions from MEDLINE using a full-sentence parser</p>
				</title>
				<aug>
					<au>
						<snm>Daraselia</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Yuryev</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Egorov</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Novichkova</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Nikitin</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Mazo</snm>
						<fnm>I</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2004</pubdate>
				<volume>20</volume>
				<fpage>604</fpage>
				<lpage>611</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15033866</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B11">
				<title>
					<p>Applying GIFT, a Gene Interactions Finder in Text, to fly literature</p>
				</title>
				<aug>
					<au>
						<snm>Domedel-Puig</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Wernisch</snm>
						<fnm>L</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>21</volume>
				<fpage>3582</fpage>
				<lpage>3583</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">16014369</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B12">
				<title>
					<p>A comprehensive analysis of protein&#8211;protein interactions in Saccharomyces cerevisiae</p>
				</title>
				<aug>
					<au>
						<snm>Uetz</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Giot</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Cagney</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Mansfield</snm>
						<fnm>TA</fnm>
					</au>
					<au>
						<snm>Judson</snm>
						<fnm>RS</fnm>
					</au>
					<au>
						<snm>Knight</snm>
						<fnm>JR</fnm>
					</au>
					<au>
						<snm>Lockshon</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Nature</source>
				<pubdate>2000</pubdate>
				<volume>403</volume>
				<fpage>623</fpage>
				<lpage>627</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">10688190</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B13">
				<title>
					<p>Robust relational parsing over biomedical literature: extracting inhibit relations</p>
				</title>
				<aug>
					<au>
						<snm>Pustejovsky</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Casta&#241;o</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Zhang</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Kotecki</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Cochran</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Pac Symp Biocomput</source>
				<pubdate>2002</pubdate>
				<fpage>362</fpage>
				<lpage>373</lpage>
				<xrefbib>
					<pubid idtype="pmpid">11928490</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B14">
				<title>
					<p>Two Applications of Information Extraction to Biological Science Journal Articles: Enzyme Interactions and Protein Structures</p>
				</title>
				<aug>
					<au>
						<snm>Humphreys</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Demetriou</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Gaizauskas</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>Pac Symp Biocomput</source>
				<pubdate>2000</pubdate>
				<fpage>505</fpage>
				<lpage>516</lpage>
				<xrefbib>
					<pubid idtype="pmpid">10902198</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B15">
				<title>
					<p>BioCreative II Protein-Protein Interaction Task</p>
				</title>
				<url>http://biocreative.sourceforge.net/biocreative_2_ppi.html</url>
			</bibl>
			<bibl id="B16">
				<title>
					<p>LLL05 Challenge</p>
				</title>
				<url>http://genome.jouy.inra.fr/texte/LLLchallenge/</url>
			</bibl>
			<bibl id="B17">
				<title>
					<p>Developmental Biology</p>
				</title>
				<aug>
					<au>
						<snm>Gilbert</snm>
						<fnm>SF</fnm>
					</au>
				</aug>
				<publisher>Sunderland, Mass: Sinauer Associates</publisher>
				<pubdate>2006</pubdate>
			</bibl>
			<bibl id="B18">
				<title>
					<p>TRANSFAC: an integrated system for gene expression regulation</p>
				</title>
				<aug>
					<au>
						<snm>Wingender</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Chen</snm>
						<fnm>X</fnm>
					</au>
					<au>
						<snm>Hehl</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Karas</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Liebich</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Matys</snm>
						<fnm>V</fnm>
					</au>
					<au>
						<snm>Meinhardt</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Reuter</snm>
						<fnm>TPM</fnm>
					</au>
					<au>
						<snm>Schacherer</snm>
						<fnm>F</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2000</pubdate>
				<volume>28</volume>
				<issue>1</issue>
				<fpage>316</fpage>
				<lpage>319</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">102445</pubid>
						<pubid idtype="pmpid" link="fulltext">10592259</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B19">
				<title>
					<p>FlyBase: Genes and Gene Models</p>
				</title>
				<aug>
					<au>
						<snm>Drysdale</snm>
						<fnm>RA</fnm>
					</au>
					<au>
						<snm>Crosby</snm>
						<fnm>MA</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2005</pubdate>
				<volume>33</volume>
				<issue>Database issue</issue>
				<fpage>D390</fpage>
				<lpage>D395</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">540000</pubid>
						<pubid idtype="pmpid" link="fulltext">15608223</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B20">
				<title>
					<p>ORegAnno: an Open Access Database and Curation System for Literature-derived Promoters, Transcription Factor Binding Sites and Regulatory Variation</p>
				</title>
				<aug>
					<au>
						<snm>Montgomery</snm>
						<fnm>SB</fnm>
					</au>
					<au>
						<snm>Griffith</snm>
						<fnm>OL</fnm>
					</au>
					<au>
						<snm>Sleumer</snm>
						<fnm>MC</fnm>
					</au>
					<au>
						<snm>Bergman</snm>
						<fnm>CM</fnm>
					</au>
					<au>
						<snm>Bilenky</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Pleasance</snm>
						<fnm>ED</fnm>
					</au>
					<au>
						<snm>Prychyna</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Zhang</snm>
						<fnm>X</fnm>
					</au>
					<au>
						<snm>Jones</snm>
						<fnm>SJ</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2006</pubdate>
				<volume>22</volume>
				<fpage>637</fpage>
				<lpage>640</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">16397004</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B21">
				<title>
					<p>RegCreative Jamboree</p>
				</title>
				<url>http://www.dmbr.ugent.be/bioit/contents/regcreative/</url>
			</bibl>
			<bibl id="B22">
				<title>
					<p>Medical Subject Headings (MeSH)</p>
				</title>
				<url>http://www.nlm.nih.gov/mesh/meshhome.html</url>
			</bibl>
			<bibl id="B23">
				<title>
					<p>Gene Ontology</p>
				</title>
				<url>http://www.geneontology.org</url>
			</bibl>
			<bibl id="B24">
				<title>
					<p>Developing a Robust Part-of-Speech Tagger for Biomedical Text</p>
				</title>
				<aug>
					<au>
						<snm>Tsuruoka</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Tateishi</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Kim</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Ohta</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>McNaught</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Ananiadou</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Tsujii</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Advances in Informatics</source>
				<pubdate>2005</pubdate>
				<fpage>382</fpage>
				<lpage>392</lpage>
			</bibl>
			<bibl id="B25">
				<title>
					<p>The Statistical Analysis of Discrete Data</p>
				</title>
				<aug>
					<au>
						<snm>Santner</snm>
						<fnm>TJ</fnm>
					</au>
					<au>
						<snm>Duffy</snm>
						<fnm>DE</fnm>
					</au>
				</aug>
				<publisher>Springer Verlag</publisher>
				<pubdate>1989</pubdate>
			</bibl>
			<bibl id="B26">
				<title>
					<p>Learning anchor verbs for biological interaction patterns from published text articles</p>
				</title>
				<aug>
					<au>
						<snm>Hatzivassiloglou</snm>
						<fnm>V</fnm>
					</au>
					<au>
						<snm>Weng</snm>
						<fnm>W</fnm>
					</au>
				</aug>
				<source>International Journal of Medical Informatics</source>
				<pubdate>2002</pubdate>
				<volume>67</volume>
				<fpage>19</fpage>
				<lpage>32</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12460629</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B27">
				<title>
					<p>ABNER: An Open Source Tool for Automatically Tagging Genes</p>
				</title>
				<aug>
					<au>
						<snm>Settles</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>21</volume>
				<issue>14</issue>
				<fpage>3191</fpage>
				<lpage>3192</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15860559</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B28">
				<title>
					<p>Phrasal Queries with LingPipe and Lucene: Ad Hoc Genomics Text Retrieval</p>
				</title>
				<aug>
					<au>
						<snm>Carpenter</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Proceedings of the 13th Annual Text Retrieval Conference</source>
				<pubdate>2004</pubdate>
			</bibl>
			<bibl id="B29">
				<title>
					<p>A genome-wide and nonredundant mouse transcription factor database</p>
				</title>
				<aug>
					<au>
						<snm>Kanamori</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Konno</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Osato</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Kawai</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Hayashizaki</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Suzuki</snm>
						<fnm>H</fnm>
					</au>
				</aug>
				<source>Biochem Biophys Res Commun</source>
				<pubdate>2004</pubdate>
				<volume>322</volume>
				<issue>3</issue>
				<fpage>787</fpage>
				<lpage>793</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15336533</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B30">
				<title>
					<p>SMART Stop Word List</p>
				</title>
				<url>ftp://ftp.cs.cornell.edu/pub/smart/english.stop</url>
			</bibl>
			<bibl id="B31">
				<title>
					<p>FlyTF Database</p>
				</title>
				<url>http://www.mrc-lmb.cam.ac.uk/genomes/FlyTF/info.html</url>
			</bibl>
			<bibl id="B32">
				<title>
					<p>DBD: a transcription factor prediction database</p>
				</title>
				<aug>
					<au>
						<snm>Kummerfeld</snm>
						<fnm>SK</fnm>
					</au>
					<au>
						<snm>Teichmann</snm>
						<fnm>SA</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2006</pubdate>
				<volume>34</volume>
				<issue>Database issue</issue>
				<fpage>D74</fpage>
				<lpage>D81</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1347493</pubid>
						<pubid idtype="pmpid" link="fulltext">16381970</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B33">
				<title>
					<p>Prodisen Corpus</p>
				</title>
				<url>http://www.pdg.cnb.uam.es/martink/PRODISEN/index.html</url>
			</bibl>
			<bibl id="B34">
				<title>
					<p>BioCreAtIvE-PPI Corpus</p>
				</title>
				<url>http://www2.informatik.hu-berlin.de/~hakenber/corpora/</url>
			</bibl>
			<bibl id="B35">
				<title>
					<p>PICorpus</p>
				</title>
				<url>http://bionlp-corpora.sourceforge.net/index.shtml</url>
			</bibl>
			<bibl id="B36">
				<title>
					<p>GeneRIF HIV Corpus</p>
				</title>
				<url>ftp://ftp.ncbi.nih.gov/gene/GeneRIF/</url>
			</bibl>
			<bibl id="B37">
				<title>
					<p>TinySVM</p>
				</title>
				<url>http://chasen.org/~taku/software/TinySVM/</url>
			</bibl>
			<bibl id="B38">
				<title>
					<p>MALLET Toolkit</p>
				</title>
				<url>http://mallet.cs.umass.edu/index.php/Main_Page</url>
			</bibl>
			<bibl id="B39">
				<title>
					<p>On information and sufficiency</p>
				</title>
				<aug>
					<au>
						<snm>Kullback</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Leibler</snm>
						<fnm>RA</fnm>
					</au>
				</aug>
				<source>Annals of Mathematical Statistics</source>
				<pubdate>1951</pubdate>
				<volume>22</volume>
				<fpage>79</fpage>
				<lpage>86</lpage>
			</bibl>
			<bibl id="B40">
				<title>
					<p>Biobibliometrics: Information retrieval and visualization from co-occurrences of gene names in Medline abstracts</p>
				</title>
				<aug>
					<au>
						<snm>Stapley</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Benoit</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<source>Pacific Symposium on Biocomputing</source>
				<pubdate>2000</pubdate>
				<fpage>529</fpage>
				<lpage>540</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmpid">10902200</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B41">
				<title>
					<p>Automatic Construction of Knowledge Base from Biological Papers</p>
				</title>
				<aug>
					<au>
						<snm>Ohta</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Yamamoto</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Okazaki</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Uchiyama</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Takagi</snm>
						<fnm>T</fnm>
					</au>
				</aug>
				<source>Proc Int Conf Intell Syst Mol Biol</source>
				<pubdate>1997</pubdate>
				<volume>5</volume>
				<fpage>218</fpage>
				<lpage>225</lpage>
				<xrefbib>
					<pubid idtype="pmpid">9322040</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B42">
				<title>
					<p>Bidirectional incremental parsing for automatic pathway identification with combinatory categorial grammar</p>
				</title>
				<aug>
					<au>
						<snm>Park</snm>
						<fnm>JC</fnm>
					</au>
					<au>
						<snm>Kim</snm>
						<fnm>HS</fnm>
					</au>
					<au>
						<snm>Kim</snm>
						<fnm>JJ</fnm>
					</au>
				</aug>
				<source>Pac Symp Biocomput</source>
				<pubdate>2001</pubdate>
				<fpage>396</fpage>
				<lpage>407</lpage>
				<xrefbib>
					<pubid idtype="pmpid">11262958</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B43">
				<title>
					<p>The frame-based module of the Suiseki information extraction system</p>
				</title>
				<aug>
					<au>
						<snm>Blaschke</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Valencia</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>IEEE Intelligent Systems</source>
				<pubdate>2002</pubdate>
				<volume>17</volume>
				<fpage>14</fpage>
				<lpage>20</lpage>
			</bibl>
			<bibl id="B44">
				<title>
					<p>Event extraction from biomedical papers using a full parser</p>
				</title>
				<aug>
					<au>
						<snm>Yakushiji</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Tateisi</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Miyao</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Tsujii</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Pac Symp Biocomput</source>
				<pubdate>2001</pubdate>
				<fpage>408</fpage>
				<lpage>419</lpage>
				<xrefbib>
					<pubid idtype="pmpid">11262959</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B45">
				<title>
					<p>BioRAT: extracting biological information from full-length papers</p>
				</title>
				<aug>
					<au>
						<snm>Corney</snm>
						<fnm>DPA</fnm>
					</au>
					<au>
						<snm>Buxton</snm>
						<fnm>BF</fnm>
					</au>
					<au>
						<snm>Langdon</snm>
						<fnm>WB</fnm>
					</au>
					<au>
						<snm>Jones</snm>
						<fnm>DT</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2004</pubdate>
				<volume>20</volume>
				<issue>17</issue>
				<fpage>3206</fpage>
				<lpage>3213</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15231534</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B46">
				<title>
					<p>PreBIND and Textomy&#8211;mining the biomedical literature for protein-protein interactions using a support vector machine</p>
				</title>
				<aug>
					<au>
						<snm>Donaldson</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Martin</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>de Bruijn</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Wolting</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Lay</snm>
						<fnm>V</fnm>
					</au>
					<au>
						<snm>Tuekam</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Zhang</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Baskin</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Bader</snm>
						<fnm>GD</fnm>
					</au>
					<au>
						<snm>Michalickova</snm>
						<fnm>K</fnm>
					</au>
					<etal/>
				</aug>
				<source>BMC Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>4</volume>
				<issue>11</issue>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">153503</pubid>
						<pubid idtype="pmpid" link="fulltext">12689350</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B47">
				<title>
					<p>A Bayesian networks approach for predicting protein-protein interactions from genomic data</p>
				</title>
				<aug>
					<au>
						<snm>Jansen</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Yu</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Greenbaum</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Kluger</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Krogan</snm>
						<fnm>NJ</fnm>
					</au>
					<au>
						<snm>Chung</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Emili</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Snyder</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Greenblatt</snm>
						<fnm>JF</fnm>
					</au>
					<au>
						<snm>Gerstein</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Science</source>
				<pubdate>2003</pubdate>
				<volume>302</volume>
				<fpage>449</fpage>
				<lpage>453</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">14564010</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B48">
				<title>
					<p>Extracting Information on Protein-Protein Interactions from Biological Literature Based on Machine Learning Approaches</p>
				</title>
				<aug>
					<au>
						<snm>Sugiyama</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Hatano</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Yoshikawa</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Uemura</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Genome Informatics</source>
				<pubdate>2003</pubdate>
				<volume>14</volume>
				<fpage>699</fpage>
				<lpage>700</lpage>
			</bibl>
		</refgrp>
	</bm>
</art>
