<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>1471-2105-9-S3-S4</ui>
	<ji>1471-2105</ji>
	<fm>
		<dochead>Proceedings</dochead>
		<bibl>
			<title>
				<p>Exploiting and integrating rich features for biological literature classification</p>
			</title>
			<aug>
				<au id="A1" ce="yes">
					<snm>Wang</snm>
					<fnm>Hongning</fnm>
					<insr iid="I1"/>
					<email>whn03@mails.tsinghua.edu.cn</email>
				</au>
				<au id="A2" ce="yes">
					<snm>Huang</snm>
					<fnm>Minlie</fnm>
					<insr iid="I1"/>
					<email>aihuang@tsinghua.edu.cn</email>
				</au>
				<au id="A3" ce="yes">
					<snm>Ding</snm>
					<fnm>Shilin</fnm>
					<insr iid="I1"/>
					<email>dsl05@mails.tsinghua.edu.cn</email>
				</au>
				<au id="A4" ce="yes" ca="yes">
					<snm>Zhu</snm>
					<fnm>Xiaoyan</fnm>
					<insr iid="I1"/>
					<email>zxy-dcs@tsinghua.edu.cn</email>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>State Key Laboratory of Intelligent Technology and Systems Tsinghua National Laboratory for Information Science and Technology Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China</p>
				</ins>
			</insg>
			<source>BMC Bioinformatics</source>
			<supplement>
				<title>
					<p>Proceedings of the Second International Symposium on Languages in Biology and Medicine (LBM) 2007</p>
				</title>
				<editor>Christopher JO Baker and Su Jian</editor>
				<note>Proceedings</note>
				<url>http://www.biomedcentral.com/1471-2105-9-S3-info.pdf</url>
			</supplement>
			<conference>
				<title>
					<p>The Second International Symposium on Languages in Biology and Medicine (LBM) 2007</p>
				</title>
				<location>Singapore</location>
				<date-range>6-7 December 2007</date-range>
				<url>http://lbm2007.biopathway.org/</url>
			</conference>
			<issn>1471-2105</issn>
			<pubdate>2008</pubdate>
			<volume>9</volume>
			<issue>Suppl 3</issue>
			<fpage>S4</fpage>
			<url>http://www.biomedcentral.com/1471-2105/9/S3/S4</url>
			<xrefbib>
				<pubidlist><pubid idtype="pmpid">18426549</pubid><pubid idtype="doi">10.1186/1471-2105-9-S3-S4</pubid>
				</pubidlist></xrefbib>
		</bibl>
		<history>
			<pub>
				<date>
					<day>11</day>
					<month>04</month>
					<year>2008</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2008</year>
			<collab>Wang et al.; licensee BioMed Central Ltd.</collab>
			<note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
		</cpyrt>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<sec>
					<st>
						<p>Background</p>
					</st>
					<p>Effective features play an important role in automated text classification, which facilitates access to large-scale data. In the biosciences, biological structures and terminologies are described by a large number of features, and domain-dependent features can significantly improve classification performance. This paper studies how to select and integrate different types of features to improve the classification of biological literature.</p>
				</sec>
				<sec>
					<st>
						<p>Results</p>
					</st>
					<p>To classify biological literature efficiently, we propose a novel feature value schema, <it>TF</it>*<it>ML</it>; features ranging from the lower-level, domain-independent &#8220;string feature&#8221; to the higher-level, domain-dependent &#8220;semantic template feature&#8221;; and appropriate integrations of these features. Compared with our previous approaches, performance improves by 11.5% in <it>AUC</it> and 8.8% in <it>F-Score</it>, and exceeds the best performance achieved in BioCreAtIvE 2006.</p>
				</sec>
				<sec>
					<st>
						<p>Conclusions</p>
					</st>
					<p>Different types of features have different discriminative capabilities in literature classification; properly integrating domain-independent and domain-dependent features can significantly improve performance and reduce over-fitting to the data distribution.</p>
				</sec>
			</sec>
		</abs>
	</fm>
	<bdy>
		<sec>
			<st>
				<p>Background</p>
			</st>
			<p>In general text classification, effective features are essential to making the learning task more efficient and accurate. No classifier, however sophisticated, can make up for a lack of predictive information in the input features <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. In bioscientific literature, where biological structures and terminologies are described by a large number of features, the problem is more serious: well-chosen features can substantially improve classification accuracy and decrease the risk of over-fitting <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>.</p>
			<p>In the early days of biological literature classification, most researchers depended on domain experts to pick out the informative features. Regev et al. used expert-defined rules to extract features from semi-structured text and figure legends; they also used external lexical resources and semantic constraints to achieve better coverage and accuracy <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. Min Shi et al. employed two types of keywords as features: one type came from the given evidence and the other was manually extracted from the training texts by domain experts <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. Moustafa M. Ghanem et al. used expert-edited regular expressions to capture frequently occurring keyword combinations (or motifs) within short segments of the text in a document <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. All these approaches require domain experts to identify the specific textual objects and the informative templates, so they cannot easily be extended automatically into an efficient and scalable model for other biological datasets <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>.</p>
			<p>In recent years, fully automatic and scalable text classification algorithms have provided an alternative to these methods. Wilbur employed unigrams, bigrams and all of the <it>MeSH</it> terms as the feature set to represent documents <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. Dobrokhotov et al. used the words processed by the XEROX natural language processing tool as discriminating attributes <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. Aaron et al. used a &#8220;Bag of Words&#8221; model: content was tokenized and stemmed into unigram features, and the samples were modelled as binary feature vectors <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>.</p>
			<p>Although all of these features capture some aspects of biological and statistical meaning, they still cannot automatically exploit the domain-dependent information in complex biological literature. Automatically introducing higher-level domain-dependent features into the classification process, and integrating them with lower-level domain-independent features, remains a challenge in biological text mining.</p>
			<p>In this paper, we investigate biological literature classification from the perspective of feature selection and integration, evaluated on BioCreAtIvE <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, an international evaluation in biological text mining. In the IAS (Protein Interaction Article Sub-task) of BioCreAtIvE 2006, participants were asked to classify a given set of <it>MEDLINE</it> titles and abstracts according to whether a document contains at least one physical protein-protein interaction (PPI). Such classification would make manual curation considerably more efficient, since it filters out most of the irrelevant documents. In the evaluation, one of our implemented classifiers achieved outstanding results: it ranked 1<sup>st</sup> in <it>Accuracy</it> and 2<sup>nd</sup> in both <it>AUC</it> and <it>F-Score</it>.</p>
			<p>Although the result is encouraging, performance dropped significantly from 5-fold cross validation on the training set to evaluation on the official testing set (15.2% lower in <it>AUC</it>, 11.8% lower in <it>F-Score</it>). The main differences between these two data sets are: 1) the testing documents were mainly published in 2006, while the training documents are distributed evenly over the preceding years; 2) the relevant/irrelevant document ratio is nearly 2:1 in the training set but 1:1 in the testing set. To analyze this phenomenon statistically, we use a symmetrized variant of the <it>Kullback-Leibler</it> divergence to compare the distributions of the top 50 employed features on the training and testing sets, as follows:</p>
			<p>
				<display-formula id="M1">
					<m:math name="1471-2105-9-S3-S4-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
						<m:semantics>
							<m:mrow>
								<m:mi>K</m:mi>
								<m:mi>L</m:mi>
								<m:mo>`</m:mo>
								<m:mrow>
									<m:mo>(</m:mo>
									<m:mrow>
										<m:mi>P</m:mi>
										<m:mo>,</m:mo>
										<m:mi>Q</m:mi>
									</m:mrow>
									<m:mo>)</m:mo>
								</m:mrow>
								<m:mo>=</m:mo>
								<m:mfrac>
									<m:mn>1</m:mn>
									<m:mn>2</m:mn>
								</m:mfrac>
								<m:mstyle displaystyle="true">
									<m:msub>
										<m:mo>&#8721;</m:mo>
										<m:mi>x</m:mi>
									</m:msub>
									<m:mrow>
										<m:mrow>
											<m:mo>(</m:mo>
											<m:mrow>
												<m:mi>P</m:mi>
												<m:mrow>
													<m:mo>(</m:mo>
													<m:mi>x</m:mi>
													<m:mo>)</m:mo>
												</m:mrow>
												<m:mi>log</m:mi>
												<m:mo>&#8289;</m:mo>
												<m:mfrac>
													<m:mrow>
														<m:mi>P</m:mi>
														<m:mrow>
															<m:mo>(</m:mo>
															<m:mi>x</m:mi>
															<m:mo>)</m:mo>
														</m:mrow>
													</m:mrow>
													<m:mrow>
														<m:mi>Q</m:mi>
														<m:mrow>
															<m:mo>(</m:mo>
															<m:mi>x</m:mi>
															<m:mo>)</m:mo>
														</m:mrow>
													</m:mrow>
												</m:mfrac>
												<m:mo>+</m:mo>
												<m:mi>Q</m:mi>
												<m:mrow>
													<m:mo>(</m:mo>
													<m:mi>x</m:mi>
													<m:mo>)</m:mo>
												</m:mrow>
												<m:mi>log</m:mi>
												<m:mo>&#8289;</m:mo>
												<m:mfrac>
													<m:mrow>
														<m:mi>Q</m:mi>
														<m:mrow>
															<m:mo>(</m:mo>
															<m:mi>x</m:mi>
															<m:mo>)</m:mo>
														</m:mrow>
													</m:mrow>
													<m:mrow>
														<m:mi>P</m:mi>
														<m:mrow>
															<m:mo>(</m:mo>
															<m:mi>x</m:mi>
															<m:mo>)</m:mo>
														</m:mrow>
													</m:mrow>
												</m:mfrac>
											</m:mrow>
											<m:mo>)</m:mo>
										</m:mrow>
									</m:mrow>
								</m:mstyle>
							</m:mrow>
							<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaGaem4saSKaemitaWKaeiiyaa2aaeWaaeaacqWGqbaucqGGSaalcqWGrbquaiaawIcacaGLPaaacqGH9aqpdaWcaaqaaiabigdaXaqaaiabikdaYaaadaaeqaqaamaabmaabaGaemiuaa1aaeWaaeaacqWG4baEaiaawIcacaGLPaaacyGGSbaBcqGGVbWBcqGGNbWzdaWcaaqaaiabdcfaqnaabmaabaGaemiEaGhacaGLOaGaayzkaaaabaGaemyuae1aaeWaaeaacqWG4baEaiaawIcacaGLPaaaaaGaey4kaSIaemyuae1aaeWaaeaacqWG4baEaiaawIcacaGLPaaacyGGSbaBcqGGVbWBcqGGNbWzdaWcaaqaaiabdgfarnaabmaabaGaemiEaGhacaGLOaGaayzkaaaabaGaemiuaa1aaeWaaeaacqWG4baEaiaawIcacaGLPaaaaaaacaGLOaGaayzkaaaaleaacqWG4baEaeqaniabggHiLdaaaa@6266@</m:annotation>
						</m:semantics>
					</m:math>
				</display-formula>
			</p>
			<p>where <it>x</it> ranges over the word and phrase features employed in IAS, and <it>P(x)</it> and <it>Q(x)</it> are the probabilities of <it>x</it> in the training and testing sets, respectively.</p>
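			<p>As an illustration only, the divergence in formula (1) can be sketched in a few lines of Python. The function name, the toy probabilities and the smoothing constant eps are assumptions made for this sketch and are not taken from the paper; eps merely keeps the logarithms finite when a feature is absent from one of the two sets.</p>

```python
import math

def symmetric_kl(p, q, eps=1e-9):
    """Half the sum of KL(P||Q) and KL(Q||P) over a shared feature set.

    p and q map each feature x to its probability in the training and
    testing set. eps is an assumed smoothing constant (not from the paper)
    that keeps the logarithms finite when a feature is missing on one side.
    """
    features = set(p) | set(q)
    total = 0.0
    for x in features:
        px = p.get(x, 0.0) + eps
        qx = q.get(x, 0.0) + eps
        total += px * math.log(px / qx) + qx * math.log(qx / px)
    return 0.5 * total

# Hypothetical feature probabilities, for illustration only.
train = {"interact": 0.5, "bind": 0.3, "complex": 0.2}
test = {"interact": 0.2, "bind": 0.3, "complex": 0.5}
print(symmetric_kl(train, train))  # identical distributions
print(symmetric_kl(train, test))   # shifted distributions
```

			<p>Because both directions of the divergence are summed, the value does not depend on which set is treated as <it>P</it> and which as <it>Q</it>.</p>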
			<p>The result (see Table <tblr tid="T1">1</tblr>) shows a large divergence between the training and testing feature distributions in the irrelevant document set. Moreover, only one third of the top 300 features selected from the training set also occur in the testing set (see Figure <figr fid="F1">1</figr>). Clearly, our previously selected features are limited and sensitive to the data distribution. This motivates the in-depth investigation in this paper: how to exploit the domain-independent and domain-dependent features in biological literature efficiently while avoiding over-dependence on the data distribution.</p>
			<tbl id="T1" hint_layout="single">
				<title>
					<p>Table 1</p>
				</title>
				<caption>
					<p>KL Divergence on Training, Cross Validation and Testing Set</p>
				</caption>
				<tblbdy cols="3">
					<r>
						<c>
							<p>
								<b>
									<it>Unigram Feature</it>
								</b>
							</p>
						</c>
						<c>
							<p>
								<b>
									<it>Relevant Probability</it>
								</b>
							</p>
						</c>
						<c>
							<p>
								<b>
									<it>Irrelevant Probability</it>
								</b>
							</p>
						</c>
					</r>
					<r>
						<c cspan="3">
							<hr/>
						</c>
					</r>
					<r>
						<c>
							<p>
								<it>Training Set Vs Cross Validation Set</it>
							</p>
						</c>
						<c>
							<p>0.0216</p>
						</c>
						<c>
							<p>0.0703</p>
						</c>
					</r>
					<r>
						<c>
							<p>
								<it>Training Set Vs Testing Set</it>
							</p>
						</c>
						<c>
							<p>0.0369</p>
						</c>
						<c>
							<p>
								<b>0.9926</b>
							</p>
						</c>
					</r>
				</tblbdy>
				<tblfn>
					<p>(Top 50 features according to <it>Chi-Square</it> statistics)</p>
				</tblfn>
			</tbl>
			<fig id="F1">
				<title>
					<p>Figure 1</p>
				</title>
				<caption>
					<p>Overlap of Features between Training and Testing Set</p>
				</caption>
				<text>
					<p>Overlap of Features between Training and Testing Set </p>
					<p>(Top 300 selected distinct features from the training and testing set according to <it>Chi-Square</it> statistics respectively)</p>
				</text>
				<graphic file="1471-2105-9-S3-S4-1"/>
			</fig>
			<p>The rest of the paper is organized as follows. The Methods section describes the proposed methodologies in detail. The Results and discussion section presents the experimental results and analysis. The Conclusion section summarizes our contributions.</p>
		</sec>
		<sec>
			<st>
				<p>Methods</p>
			</st>
			<p>In this paper, we investigate the issue from the perspective of feature selection and integration. Our main contributions are: 1) a domain-independent feature value schema, <it>TF</it>*<it>ML</it>, and a fixed-length string feature; 2) a domain-dependent &#8220;semantic template&#8221; feature; 3) efficient integrations of these features. These methods are described in turn below.</p>
			<sec>
				<st>
					<p>Probabilistic schema</p>
				</st>
				<p>The traditional <it>TF</it>*<it>IDF</it> schema <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> considers only the occurrences of words in the whole corpus and ignores the distribution of words across categories. In contrast, we propose a novel probabilistic feature value schema, <it>TF</it>*<it>ML</it> (the product of Term Frequency and Maximum Likelihood), to replace the traditional <it>TF</it>*<it>IDF</it>, as follows:</p>
				<p>
					<display-formula id="M2">
						<m:math name="1471-2105-9-S3-S4-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:mi>T</m:mi>
									<m:mi>F</m:mi>
									<m:mtext>&#8201;</m:mtext>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mi>t</m:mi>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mtext>&#8201;</m:mtext>
									<m:mo>&#215;</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>M</m:mi>
									<m:mi>L</m:mi>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mi>t</m:mi>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mtext>&#8201;</m:mtext>
									<m:mo>=</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>T</m:mi>
									<m:mi>F</m:mi>
									<m:mtext>&#8201;</m:mtext>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mi>t</m:mi>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mtext>&#8201;</m:mtext>
									<m:mo>&#215;</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:mi>log</m:mi>
											<m:mo>&#8289;</m:mo>
											<m:mtext>&#8201;</m:mtext>
											<m:mi>P</m:mi>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mi>t</m:mi>
													<m:mo>|</m:mo>
													<m:msup>
														<m:mi>c</m:mi>
														<m:mo>+</m:mo>
													</m:msup>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
											<m:mtext>&#8201;</m:mtext>
											<m:mo>&#8722;</m:mo>
											<m:mtext>&#8201;</m:mtext>
											<m:mi>log</m:mi>
											<m:mo>&#8289;</m:mo>
											<m:mtext>&#8201;</m:mtext>
											<m:mi>P</m:mi>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mi>t</m:mi>
													<m:mo>|</m:mo>
													<m:msup>
														<m:mi>c</m:mi>
														<m:mo>&#8722;</m:mo>
													</m:msup>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaqcLbxacqWGubavcqWGgbGrcaaMe8UcdaqadaqaaKqzWfGaemiDaqhakiaawIcacaGLPaaajugCbiaaysW7cqGHxdaTcaaMe8Uaemyta0KaemitaWKcdaqadaqaaKqzWfGaemiDaqhakiaawIcacaGLPaaajugCbiaaysW7cqGH9aqpcaaMe8UaemivaqLaemOrayKaaGjbVRWaaeWaaeaajugCbiabdsha0bGccaGLOaGaayzkaaqcLbxacaaMe8Uaey41aqRaaGjbVRWaaeWaaeaajugCbiGbcYgaSjabc+gaVjabcEgaNjaaysW7cqWGqbaukmaabmaabaqcLbxacqWG0baDcqGG8baFcqWGJbWykmaaCaaaleqabaGaey4kaScaaaGccaGLOaGaayzkaaqcLbxacaaMe8UaeyOeI0IaaGjbVlGbcYgaSjabc+gaVjabcEgaNjaaysW7cqWGqbaukmaabmaabaqcLbxacqWG0baDcqGG8baFcqWGJbWykmaaCaaaleqabaGaeyOeI0caaaGccaGLOaGaayzkaaaacaGLOaGaayzkaaaaaa@7C62@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>where <it>t</it> denotes the selected feature word, <it>c</it><sup>+</sup> and <it>c</it><sup>&#8722;</sup> denote the relevant and irrelevant categories, and <it>P(t|c<sup>+</sup>)</it> and <it>P(t|c<sup>&#8722;</sup>)</it> denote the probabilities that <it>t</it> occurs in category <it>c</it><sup>+</sup> and <it>c</it><sup>&#8722;</sup>, respectively.</p>
				<p>The sign of <it>ML</it> indicates the category relevance of the feature, and its magnitude reflects the classification confidence. Following the same idea as <it>TF</it>*<it>IDF</it>, which expresses the specificity of features in different documents, we also multiply <it>TF</it> by <it>ML</it>.</p>
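				<p>A minimal sketch of how the values in formula (2) might be computed is given below. The paper does not state how the class-conditional probabilities are estimated, so this sketch assumes a smoothed document-frequency estimate; the function names, the smoothing constant alpha and the toy documents are all hypothetical.</p>

```python
import math
from collections import Counter

def ml_weights(docs, labels, alpha=1.0):
    """ML(t) = log P(t|c+) - log P(t|c-) for every term in the corpus.

    Assumption: P(t|c) is estimated as the smoothed fraction of documents
    of class c that contain t; alpha is a hypothetical smoothing constant.
    docs are lists of tokens, labels are True (relevant) or False.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (df_pos if y else df_neg).update(set(tokens))
    vocab = set(df_pos) | set(df_neg)
    return {
        t: math.log((df_pos[t] + alpha) / (n_pos + 2 * alpha))
        - math.log((df_neg[t] + alpha) / (n_neg + 2 * alpha))
        for t in vocab
    }

def tf_ml_vector(tokens, ml):
    """Feature values TF(t) * ML(t) for one document."""
    tf = Counter(tokens)
    return {t: tf[t] * ml[t] for t in tf if t in ml}

# Toy corpus, for illustration only.
docs = [["binds", "kinase", "binds"], ["binds", "complex"], ["patients", "trial"]]
labels = [True, True, False]
ml = ml_weights(docs, labels)
vec = tf_ml_vector(["binds", "binds", "trial"], ml)
print(vec)
```

				<p>In this toy example the term seen only in relevant documents receives a positive value and the term seen only in irrelevant documents a negative one, matching the role of the sign described above.</p>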
				<p>Here we do not rely on the posterior distribution of features to make the prediction. To explain why, we rewrite the <it>ML</it> term of formula (2) as follows:</p>
				<p>
					<display-formula id="M3">
						<m:math name="1471-2105-9-S3-S4-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:mi>M</m:mi>
									<m:mi>L</m:mi>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mi>t</m:mi>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mtext>&#8201;</m:mtext>
									<m:mo>=</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>log</m:mi>
									<m:mo>&#8289;</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>P</m:mi>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:mi>t</m:mi>
											<m:mo>|</m:mo>
											<m:msup>
												<m:mi>c</m:mi>
												<m:mo>+</m:mo>
											</m:msup>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mtext>&#8201;</m:mtext>
									<m:mo>&#8722;</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>log</m:mi>
									<m:mo>&#8289;</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>P</m:mi>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:mi>t</m:mi>
											<m:mo>|</m:mo>
											<m:msup>
												<m:mi>c</m:mi>
												<m:mo>&#8722;</m:mo>
											</m:msup>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mtext>&#8201;</m:mtext>
									<m:mo>=</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>log</m:mi>
									<m:mo>&#8289;</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>P</m:mi>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:mi>t</m:mi>
											<m:mo>,</m:mo>
											<m:msup>
												<m:mi>c</m:mi>
												<m:mo>+</m:mo>
											</m:msup>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mtext>&#8201;</m:mtext>
									<m:mo>&#8722;</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>log</m:mi>
									<m:mo>&#8289;</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>P</m:mi>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:mi>t</m:mi>
											<m:mo>,</m:mo>
											<m:msup>
												<m:mi>c</m:mi>
												<m:mo>&#8722;</m:mo>
											</m:msup>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mtext>&#8201;</m:mtext>
									<m:mo>&#8722;</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>log</m:mi>
									<m:mo>&#8289;</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mfrac>
										<m:mrow>
											<m:mi>P</m:mi>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:msup>
														<m:mi>c</m:mi>
														<m:mo>+</m:mo>
													</m:msup>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
										</m:mrow>
										<m:mrow>
											<m:mi>P</m:mi>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:msup>
														<m:mi>c</m:mi>
														<m:mo>&#8722;</m:mo>
													</m:msup>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
										</m:mrow>
									</m:mfrac>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaqcLbxacqWGnbqtcqWGmbatkmaabmaabaqcLbxacqWG0baDaOGaayjkaiaawMcaaKqzWfGaaGjbVlabg2da9iaaysW7cyGGSbaBcqGGVbWBcqGGNbWzcaaMe8UaemiuaaLcdaqadaqaaKqzWfGaemiDaqNaeiiFaWNaem4yamMcdaahaaWcbeqaaiabgUcaRaaaaOGaayjkaiaawMcaaKqzWfGaaGjbVlabgkHiTiaaysW7cyGGSbaBcqGGVbWBcqGGNbWzcaaMe8UaemiuaaLcdaqadaqaaKqzWfGaemiDaqNaeiiFaWNaem4yamMcdaahaaWcbeqaaiabgkHiTaaaaOGaayjkaiaawMcaaKqzWfGaaGjbVlabg2da9iaaysW7cyGGSbaBcqGGVbWBcqGGNbWzcaaMe8UaemiuaaLcdaqadaqaaKqzWfGaemiDaqNaeiilaWIaem4yamMcdaahaaWcbeqaaiabgUcaRaaaaOGaayjkaiaawMcaaKqzWfGaaGjbVlabgkHiTiaaysW7cyGGSbaBcqGGVbWBcqGGNbWzcaaMe8UaemiuaaLcdaqadaqaaKqzWfGaemiDaqNaeiilaWIaem4yamMcdaahaaWcbeqaaiabgkHiTaaaaOGaayjkaiaawMcaaKqzWfGaaGjbVlabgUcaRiaaysW7cyGGSbaBcqGGVbWBcqGGNbWzcaaMe8UcdaWcaaqaaKqzWfGaemiuaaLcdaqadaqaaKqzWfGaem4yamMcdaahaaWcbeqaaiabgUcaRaaaaOGaayjkaiaawMcaaaqaaKqzWfGaemiuaaLcdaqadaqaaKqzWfGaem4yamMcdaahaaWcbeqaaiabgkHiTaaaaOGaayjkaiaawMcaaaaaaaa@9CD5@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>And the posterior distribution is:</p>
				<p>
					<display-formula id="M4">
						<m:math name="1471-2105-9-S3-S4-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:mtext mathvariant="italic">Posterior</m:mtext>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mi>t</m:mi>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mtext>&#8201;</m:mtext>
									<m:mo>=</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>log</m:mi>
									<m:mo>&#8289;</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>P</m:mi>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:msup>
												<m:mi>c</m:mi>
												<m:mo>+</m:mo>
											</m:msup>
											<m:mo>|</m:mo>
											<m:mi>t</m:mi>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mtext>&#8201;</m:mtext>
									<m:mo>&#8722;</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>log</m:mi>
									<m:mo>&#8289;</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>P</m:mi>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:msup>
												<m:mi>c</m:mi>
												<m:mo>&#8722;</m:mo>
											</m:msup>
											<m:mo>|</m:mo>
											<m:mi>t</m:mi>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mtext>&#8201;</m:mtext>
									<m:mo>=</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>log</m:mi>
									<m:mo>&#8289;</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>P</m:mi>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:mi>t</m:mi>
											<m:mo>,</m:mo>
											<m:msup>
												<m:mi>c</m:mi>
												<m:mo>+</m:mo>
											</m:msup>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mtext>&#8201;</m:mtext>
									<m:mo>&#8722;</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>log</m:mi>
									<m:mo>&#8289;</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>P</m:mi>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:mi>t</m:mi>
											<m:mo>,</m:mo>
											<m:msup>
												<m:mi>c</m:mi>
												<m:mo>&#8722;</m:mo>
											</m:msup>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaqcLbxacqWGqbaucqWGVbWBcqWGZbWCcqWG0baDcqWGLbqzcqWGYbGCcqWGPbqAcqWGVbWBcqWGYbGCkmaabmaabaqcLbxacqWG0baDaOGaayjkaiaawMcaaKqzWfGaaGjbVlabg2da9iaaysW7cyGGSbaBcqGGVbWBcqGGNbWzcaaMe8UaemiuaaLcdaqadaqaaKqzWfGaem4yamMcdaahaaWcbeqaaiabgUcaRaaajugCbiabcYha8jabdsha0bGccaGLOaGaayzkaaqcLbxacaaMe8UaeyOeI0IaaGjbVlGbcYgaSjabc+gaVjabcEgaNjaaysW7cqWGqbaukmaabmaabaqcLbxacqWGJbWykmaaCaaaleqabaGaeyOeI0caaKqzWfGaeiiFaWNaemiDaqhakiaawIcacaGLPaaajugCbiaaysW7cqGH9aqpcaaMe8UagiiBaWMaei4Ba8Maei4zaCMaaGjbVlabdcfaqPWaaeWaaeaajugCbiabdsha0jabcYcaSiabdogaJPWaaWbaaSqabeaacqGHRaWkaaaakiaawIcacaGLPaaajugCbiaaysW7cqGHsislcaaMe8UagiiBaWMaei4Ba8Maei4zaCMaaGjbVlabdcfaqPWaaeWaaeaajugCbiabdsha0jabcYcaSiabdogaJPWaaWbaaSqabeaacqGHsislaaaakiaawIcacaGLPaaaaaa@902B@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>In the <it>ML</it> schema, the relevant/irrelevant document rate <inline-formula><m:math name="1471-2105-9-S3-S4-i5" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mfrac><m:mrow><m:mi>P</m:mi><m:mrow><m:mo>(</m:mo><m:mrow><m:msup><m:mi>c</m:mi><m:mo>+</m:mo></m:msup></m:mrow><m:mo>)</m:mo></m:mrow></m:mrow><m:mrow><m:mi>P</m:mi><m:mrow><m:mo>(</m:mo><m:mrow><m:msup><m:mi>c</m:mi><m:mo>&#8722;</m:mo></m:msup></m:mrow><m:mo>)</m:mo></m:mrow></m:mrow></m:mfrac></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaWaaSaaaeaajugCbiabdcfaqPWaaeWaaeaajugCbiabdogaJPWaaWbaaSqabeaacqGHRaWkaaaakiaawIcacaGLPaaaaeaajugCbiabdcfaqPWaaeWaaeaajugCbiabdogaJPWaaWbaaSqabeaacqGHsislaaaakiaawIcacaGLPaaaaaaaaa@3DF1@</m:annotation></m:semantics></m:math></inline-formula> is taken into account as a compensating factor. In the posterior probability schema, by contrast, the effect of the relevant/irrelevant document rate is dropped under the assumption that the training and testing data are independent and identically distributed. That assumption does not hold in our setting, because the relevant/irrelevant rate differs between the training and testing sets.</p>
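				<p>The practical consequence can be checked numerically. The sketch below uses hypothetical class-conditional probabilities for a single feature and compares the two scores under a 2:1 and a 1:1 class ratio, mirroring the training and testing sets; the function names and numbers are assumptions made for illustration.</p>

```python
import math

def ml_score(p_t_pos, p_t_neg):
    """ML(t): log-ratio of the class-conditional probabilities."""
    return math.log(p_t_pos) - math.log(p_t_neg)

def posterior_log_odds(p_t_pos, p_t_neg, prior_pos):
    """log P(c+|t) - log P(c-|t), computed from the joint probabilities."""
    joint_pos = p_t_pos * prior_pos
    joint_neg = p_t_neg * (1.0 - prior_pos)
    return math.log(joint_pos) - math.log(joint_neg)

# Hypothetical class-conditional probabilities for one feature t.
p_t_pos, p_t_neg = 0.30, 0.10

for name, prior_pos in [("training, 2:1", 2 / 3), ("testing, 1:1", 1 / 2)]:
    print(name, ml_score(p_t_pos, p_t_neg),
          posterior_log_odds(p_t_pos, p_t_neg, prior_pos))
```

				<p>The <it>ML</it> score depends only on the class-conditional probabilities, so it is the same under both ratios, whereas the posterior score shifts by the logarithm of the class ratio when that ratio changes.</p>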
				<p>The essence of the <it>TF</it>*<it>ML</it> schema is to make full use of the category-relevance information in the annotated samples, which the <it>TF</it>*<it>IDF</it> schema cannot capture. Experimental results (see Table <tblr tid="T2">2</tblr>) show that <it>ML</it> consistently improves the discriminative capability of features.</p>
				<tbl id="T2" hint_layout="single">
					<title>
						<p>Table 2</p>
					</title>
					<caption>
						<p><it>TF</it>*<it>ML</it> Feature Value Schema.  </p>
						<p>The <it>Precision/Recall/F-Score</it> measure the classification capability of the model, and <it>AUC</it> (area under the receiver operating characteristic curve) evaluates its ranking capability.</p>
					</caption>
					<tblbdy cols="5">
						<r>
							<c>
								<p>
									<b>
										<it>Feature value</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Precision</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Recall</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>F-Score</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>AUC</it>
									</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="5">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p><it>TF</it>*<it>IDF</it></p>
							</c>
							<c>
								<p>0.7015</p>
							</c>
							<c>
								<p>0.8213</p>
							</c>
							<c>
								<p>0.7567</p>
							</c>
							<c>
								<p>0.8036</p>
							</c>
						</r>
						<r>
							<c>
								<p><it>TF</it>*<it>ML</it></p>
							</c>
							<c>
								<p>0.7014</p>
							</c>
							<c>
								<p>
									<b>0.8773</b>
								</p>
							</c>
							<c>
								<p>0.7796</p>
							</c>
							<c>
								<p>0.8231</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>(Performance under unigram feature)</p>
					</tblfn>
				</tbl>
			</sec>
			<sec>
				<st>
					<p>String feature</p>
				</st>
				<p>In many text classification applications, it is appealing to treat each document as a string of characters rather than a bag of words <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. In bioscience especially, tokenizing and stemming can discard informative attributes, because many semantically related biomedical terms that share a stem or morpheme are often not reduced to the same stem <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. We therefore propose to use length-fixed strings directly as features in order to exploit most of the informative segments.</p>
				<p>To the best of our knowledge, length-fixed strings have not been used explicitly as features before, because the resulting feature space is very large and sparse. However, the statistical analysis based on formula (1) shows that the distributional divergence between the training and testing sets becomes much smaller under the length-fixed string feature (see Table <tblr tid="T3">3</tblr>). We therefore adopt length-fixed strings as features: the strings are extracted from the whole sequential text without regard to sentence boundaries and consist strictly of the 26 lowercase English letters (all letters are first converted to lowercase), the 10 digits (0-9) and the white space. <it>Chi-Square</it> statistics <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> are used to select the significant features, and <it>TF</it>*<it>IDF</it> is computed to build the feature vector (we substitute <it>TF</it>*<it>ML</it> for <it>TF</it>*<it>IDF</it> for further improvement).</p>
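				<p>The extraction step described above can be illustrated with a minimal Python sketch. It assumes that every character outside the 37-symbol alphabet is mapped to a white space and that runs of white space are collapsed, neither of which is specified in the text. The function names <it>normalize</it> and <it>fixed_length_strings</it> are hypothetical, and <it>Chi-Square</it> selection and feature weighting are omitted.</p>
				<p><![CDATA[
```python
import re
from collections import Counter


def normalize(text):
    """Lowercase the text and keep only a-z, 0-9 and single white spaces.

    Assumption (not stated in the paper): every other character is replaced
    by a white space and runs of white space are collapsed.
    """
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def fixed_length_strings(text, p=7):
    """Count every overlapping substring of length p in the normalized text.

    The window slides over the whole text, so it crosses word and
    sentence boundaries.
    """
    s = normalize(text)
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))


counts = fixed_length_strings("CDC42 interacts with PAK1. The interaction is direct.")
print(counts["interac"])  # number of occurrences of the 7-character string "interac"
```
]]></p>
				<p>In this toy sentence the string &#8220;interac&#8221; is counted once for &#8220;interacts&#8221; and once for &#8220;interaction&#8221;, which is the kind of shared evidence that word-level features would split across two tokens.</p>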
				<tbl id="T3" hint_layout="single">
					<title>
						<p>Table 3</p>
					</title>
					<caption>
						<p>KL Divergence on Training, Cross Validation and Testing Set</p>
					</caption>
					<tblbdy cols="3">
						<r>
							<c>
								<p>
									<b>
										<it>String Feature(p=7)</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Relevant Probability</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Irrelevant Probability</it>
									</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="3">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>Training Set Vs Cross Validation Set</it>
								</p>
							</c>
							<c>
								<p>0.0029</p>
							</c>
							<c>
								<p>0.0163</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>Training Set Vs Testing Set</it>
								</p>
							</c>
							<c>
								<p>0.0357</p>
							</c>
							<c>
								<p>
									<b>0.1887</b>
								</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>(Top 50 features according to <it>Chi-Square</it> statistics)</p>
					</tblfn>
				</tbl>
				<p>Table <tblr tid="T4">4</tblr> lists the top 10 most distinctive features among the selected unigram features and string features, respectively. The length-fixed string feature has at least the following potential advantages. First, inter-word features (e.g. phrasal effects) are exploited automatically: the segmentation spans the boundaries between adjacent words and therefore captures information from both. Second, intra-word features (e.g. morphological variants) are captured. For example, the string &#8220;interac&#8221; occurs in both &#8220;interact&#8221; and &#8220;interaction&#8221;, which are important indicators of PPI relations. Third, and of particular relevance to the biological literature, the length-fixed string feature exploits subtle but informative commonalities in word structure. For example, different biological terms often share a common root (e.g. &#8216;phosph&#8217; indicates protein phosphorylation), and many suffixes are informative (e.g. &#8216;ase&#8217; is a common suffix of proteins that function as enzymes). Such specific information cannot be recovered once general tokenizing and stemming have been applied.</p>
				<tbl id="T4" hint_layout="single">
					<title>
						<p>Table 4</p>
					</title>
					<caption>
						<p>Top 10 Unigram Features and String Features. &#8216;_&#8217; denotes a white space.</p>
					</caption>
					<tblbdy cols="2">
						<r>
							<c>
								<p>
									<b>
										<it>Unigram Feature</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>String Feature</it>
									</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="2">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>interaction</p>
							</c>
							<c>
								<p>interac</p>
							</c>
						</r>
						<r>
							<c>
								<p>bind</p>
							</c>
							<c>
								<p>nteract</p>
							</c>
						</r>
						<r>
							<c>
								<p>interact</p>
							</c>
							<c>
								<p>_intera</p>
							</c>
						</r>
						<r>
							<c>
								<p>domain</p>
							</c>
							<c>
								<p>teracti</p>
							</c>
						</r>
						<r>
							<c>
								<p>proteome</p>
							</c>
							<c>
								<p>eractio</p>
							</c>
						</r>
						<r>
							<c>
								<p>proteomic</p>
							</c>
							<c>
								<p>proteom</p>
							</c>
						</r>
						<r>
							<c>
								<p>complex</p>
							</c>
							<c>
								<p>raction</p>
							</c>
						</r>
						<r>
							<c>
								<p>protein</p>
							</c>
							<c>
								<p>_domain</p>
							</c>
						</r>
						<r>
							<c>
								<p>yeast</p>
							</c>
							<c>
								<p>binding</p>
							</c>
						</r>
						<r>
							<c>
								<p>kinase</p>
							</c>
							<c>
								<p>_proteo</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>(According to <it>Chi-Square</it> statistics)</p>
					</tblfn>
				</tbl>
			</sec>
			<sec>
				<st>
					<p>Named entities and semantic template features</p>
				</st>
				<p>Both of the methods proposed above are domain independent: they generalize well and are not limited to the bioscience domain. Introducing domain-dependent features, however, can filter out many false positive samples and further improve performance <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. In the biological literature, named entities (words and phrases belonging to certain predefined classes, e.g. protein and gene), such as CDC42 (protein), and semantic templates (co-occurrences of a pre-specified type of relationship between entities of given types), such as &#8220;ProteinA interact with ProteinB&#8221;, are the most meaningful concepts in PPI documents and preserve well the syntactic and semantic structures used to describe protein interactions. We therefore introduce named entities and semantic templates as features to exploit domain-dependent information.</p>
				<p>With the help of <it>ABNER</it> <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, a named entity recognition tool, 5 types of named entities can be identified in a given document: protein, DNA, RNA, cell type and cell line. Since the space of recognized entities is large and sparse, we use only the entity types as features, which reduces the dimensionality of the feature space without losing generality.</p>
				<p>After recognizing the named entities, semantic templates are ready to be extracted from the documents. We propose a novel template extraction algorithm named <it>KeyBT</it>, i.e. <it>Key</it>word <it>B</it>ased <it>T</it>emplate extraction algorithm, to extract the semantic templates describing the interaction patterns among all of the recognized entities.</p>
				<p>Unlike the traditional local alignment algorithm, <it>KeyBT</it> first locates statistically significant words as seeds, then iteratively expands the seeds within their context, and finally retains the most &#8220;powerful&#8221; templates as the result.</p>
				<p>The <it>KeyBT</it> algorithm proceeds as follows:</p>
				<p>1) Locate the occurrences of the predefined candidate keywords in each sentence; discard the sentences without any keyword; obtain the initial candidate sentence set S<sub>0</sub>;</p>
				<p>2) Locate each entity type in S<sub>0</sub>; discard the sentences without any entity; obtain the initial candidate template set T<sub>0</sub>;</p>
				<p>3) Iteratively normalize each template in T<sub>0</sub>, removing redundant templates by syntax parsing; obtain the raw template set T<sub>1</sub>;</p>
				<p>4) Evaluate the templates in T<sub>1</sub>; filter out the templates of low quality; obtain the final template set T<sub>f</sub>.</p>
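				<p>Steps 1 and 2 above can be sketched in Python as follows. This is only an illustrative reading, not the <it>KeyBT</it> implementation: it assumes that sentences are already tokenized, that recognized entities have been replaced by type tags such as &lt;PTN&gt;, and that the words between retained tokens are collapsed into the wildcard E*. The helper names <it>is_entity</it> and <it>candidate_templates</it> are hypothetical, and the normalization and evaluation of steps 3 and 4 are not covered.</p>
				<p><![CDATA[
```python
def is_entity(token):
    """Treat any token of the form <XXX> as an entity-type tag (assumption)."""
    return token.startswith("<") and token.endswith(">")


def candidate_templates(sentences, keywords):
    """Build initial candidate templates from tokenized sentences.

    A sentence is kept only if it contains at least one keyword (step 1)
    and at least one entity tag (step 2). Words that are neither keywords
    nor entity tags are collapsed into a single "E*" wildcard.
    """
    templates = []
    for tokens in sentences:
        has_keyword = any(tok in keywords for tok in tokens)
        has_entity = any(is_entity(tok) for tok in tokens)
        if not (has_keyword and has_entity):
            continue
        template = []
        for tok in tokens:
            if tok in keywords or is_entity(tok):
                template.append(tok)
            elif template and template[-1] != "E*":
                # Collapse a run of ordinary words into one wildcard;
                # words before the first retained token are dropped.
                template.append("E*")
        if template and template[-1] == "E*":
            template.pop()  # drop a trailing wildcard
        templates.append(" ".join(template))
    return templates


sentences = [
    ["<PTN>", "can", "interact", "with", "<PTN>", "in", "vivo"],
    ["<PTN>", "is", "expressed", "in", "<CEL>"],  # no keyword: discarded
    ["cells", "interact", "with", "surfaces"],    # no entity: discarded
]
print(candidate_templates(sentences, {"interact"}))
# ['<PTN> E* interact E* <PTN>']
```
]]></p>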
				<p><it>KeyBT</it> relies not only on <it>Chi-Square</it> statistics to select the most distinctive keywords but also on <it>ML</it> to determine the category relevance of the keywords, because <it>Chi-Square</it> does not distinguish which category a feature is associated with: a few high-quality features of the irrelevant category may be overwhelmed by the large number of features of the relevant category. <it>Chi-Square</it> is first used to select a raw candidate keyword list (with a low threshold), and the top 50 features of each category are then retained according to <it>ML</it>.</p>
				<p>We use the following formula <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> to evaluate the relevance of templates based on the balance between generality and specificity of the templates.</p>
				<p>
					<display-formula id="M5">
						<m:math name="1471-2105-9-S3-S4-i6" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:mi>S</m:mi>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mi>t</m:mi>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mtext>&#8201;</m:mtext>
									<m:mo>=</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:mi>&#946;</m:mi>
											<m:mtext>&#8201;</m:mtext>
											<m:mo>+</m:mo>
											<m:mtext>&#8201;</m:mtext>
											<m:msub>
												<m:mrow>
													<m:mi>log</m:mi>
													<m:mo>&#8289;</m:mo>
												</m:mrow>
												<m:mn>2</m:mn>
											</m:msub>
											<m:mtext>&#8201;</m:mtext>
											<m:mfrac>
												<m:mrow>
													<m:mi>t</m:mi>
													<m:mo>.</m:mo>
													<m:mi mathvariant="italic">pos</m:mi>
													<m:mtext>&#8201;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#8201;</m:mtext>
													<m:mn>0.5</m:mn>
												</m:mrow>
												<m:mrow>
													<m:mi>t</m:mi>
													<m:mo>.</m:mo>
													<m:mi mathvariant="italic">neg</m:mi>
													<m:mtext>&#8201;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#8201;</m:mtext>
													<m:mn>0.5</m:mn>
												</m:mrow>
											</m:mfrac>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mtext>&#8201;</m:mtext>
									<m:mo>&#215;</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mi>ln</m:mi>
									<m:mo>&#8289;</m:mo>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:mi>t</m:mi>
											<m:mo>.</m:mo>
											<m:mi mathvariant="italic">pos</m:mi>
											<m:mtext>&#8201;</m:mtext>
											<m:mo>+</m:mo>
											<m:mtext>&#160;</m:mtext>
											<m:mi>t</m:mi>
											<m:mo>.</m:mo>
											<m:mi mathvariant="italic">neg</m:mi>
											<m:mtext>&#8201;</m:mtext>
											<m:mo>+</m:mo>
											<m:mtext>&#8201;</m:mtext>
											<m:mn>1</m:mn>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaqcLbxacqWGtbWukmaabmaabaqcLbxacqWG0baDaOGaayjkaiaawMcaaKqzWfGaaGjbVlabg2da9iaaysW7kmaabmaabaqcLbxacqaHYoGycaaMe8Uaey4kaSIaaGjbVlGbcYgaSjabc+gaVjabcEgaNPWaaSbaaSqaaiabikdaYaqabaqcLbxacaaMe8UcdaWcaaqaaKqzWfGaemiDaqNaeiOla4IaemiCaaNaem4Ba8Maem4CamNaaGjbVlabgUcaRiaaysW7cqaIWaamcqGGUaGlcqaI1aqnaOqaaKqzWfGaemiDaqNaeiOla4IaemOBa4MaemyzauMaem4zaCMaaGjbVlabgUcaRiaaysW7cqaIWaamcqGGUaGlcqaI1aqnaaaakiaawIcacaGLPaaajugCbiaaysW7cqGHxdaTcaaMe8UagiiBaWMaeiOBa4McdaqadaqaaKqzWfGaemiDaqNaeiOla4IaemiCaaNaem4Ba8Maem4CamNaaGjbVlabgUcaRiabbccaGiabdsha0jabc6caUiabd6gaUjabdwgaLjabdEgaNjaaysW7cqGHRaWkcaaMe8UaeGymaedakiaawIcacaGLPaaaaaa@8795@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>where <it>t.pos</it> and <it>t.neg</it> are the numbers of positive and negative matches of template <it>t</it> in the training set, and &#946; is a parameter that tunes the positive/negative matching rate.</p>
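				<p>As a worked illustration, the scoring formula can be transcribed directly into Python. The function name <it>template_score</it> and the default value of &#946; are assumptions made for this sketch; the text does not report the value of &#946; that was used.</p>
				<p><![CDATA[
```python
import math


def template_score(pos, neg, beta=1.0):
    """S(t) = (beta + log2((pos + 0.5) / (neg + 0.5))) * ln(pos + neg + 1).

    pos and neg are the positive and negative match counts of a template.
    beta=1.0 is a placeholder default, not a value taken from the paper.
    """
    # First factor: how strongly the template favours positive matches.
    ratio_term = beta + math.log2((pos + 0.5) / (neg + 0.5))
    # Second factor: grows with the total number of matches.
    coverage_term = math.log(pos + neg + 1)
    return ratio_term * coverage_term


# A template matching 20 positive and 2 negative sentences scores higher
# than one matching 2 positive and 2 negative sentences.
print(template_score(20, 2) > template_score(2, 2))
```
]]></p>
				<p>The first factor rewards templates that match mostly positive sentences, and the second rewards templates that match often, so a template with no matches at all receives a score of zero.</p>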
				<p>Once the final template set T<sub>f</sub> is obtained, we do not rely solely on the positive/negative matching rate of each template to make the prediction. Instead, we use the templates to build feature vectors and train a classifier.</p>
				<p>The top 5 <it>KeyBT</it>-extracted templates are listed in Table <tblr tid="T5">5</tblr>.</p>
				<tbl id="T5" hint_layout="single">
					<title>
						<p>Table 5</p>
					</title>
					<caption>
						<p><it>KeyBT</it>-extracted Templates. </p>
						<p>&lt;PTN&gt;, &lt;DNA&gt; and &lt;CEL&gt; denote protein, DNA and cell line; E* denotes an occurrence of any words.</p>
					</caption>
					<tblbdy cols="1">
						<r>
							<c>
								<p>
									<b>
										<it>KeyBT Templates</it>
									</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>&lt;PTN&gt; E* &lt;DNA&gt; E* association E* &lt;PTN&gt;</p>
							</c>
						</r>
						<r>
							<c>
								<p>&lt;PTN&gt; E* bind E* &lt;DNA&gt;</p>
							</c>
						</r>
						<r>
							<c>
								<p>&lt;PTN&gt; E* interact E* &lt;PTN&gt;</p>
							</c>
						</r>
						<r>
							<c>
								<p>&lt;PTN&gt; E* colocalize E* &lt;CEL&gt;</p>
							</c>
						</r>
						<r>
							<c>
								<p>&lt;PTN&gt; E* contact E* &lt;DNA&gt; E* &lt;PTN&gt;</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
				<p>Compared with the local alignment algorithm, which depends on post-evaluation to remove meaningless and noisy templates, the <it>KeyBT</it> algorithm has the following potential advantages: 1) <it>KeyBT</it> uses the statistical characteristics of the candidate keywords to remove most of the noise before extraction; 2) <it>KeyBT</it> templates need not fix the entity types beforehand, so they can capture the distribution of templates in both categories and discriminate between the relevant and irrelevant categories; 3) the heuristic rules applied to the relation between named entities and candidate words (such as their order, the average template length and the types of distinct entities) help to ensure the biological meaning of the extracted templates.</p>
			</sec>
			<sec>
				<st>
					<p>Feature integration</p>
				</st>
				<p>An analysis of the overlap among the samples misclassified under different features shows that the features are highly complementary: in many cases, a false prediction caused by one feature is corrected by another. Moreover, a single type of feature easily leads the classifier to over-fit the data distribution (see Table <tblr tid="T1">1</tblr> and Figure <figr fid="F1">1</figr>). Integrating different features should therefore be beneficial. We propose two kinds of integration, at the feature level and at the classifier level, to combine all of the features proposed above.</p>
				<p>We perform the feature-level integration in a typical way: each group of features is normalized, and the groups are unified into a new feature vector. The normalization is as follows <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>:</p>
				<p>
					<display-formula id="M6">
						<m:math name="1471-2105-9-S3-S4-i7" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:mtext mathvariant="italic">norm</m:mtext>
									<m:mo>_</m:mo>
									<m:mtext mathvariant="italic">value</m:mtext>
									<m:mtext>&#8201;</m:mtext>
									<m:mo>=</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mfrac>
										<m:mrow>
											<m:mtext mathvariant="italic">unnorm</m:mtext>
											<m:mo>_</m:mo>
											<m:mtext mathvariant="italic">value</m:mtext>
											<m:mo>&#8722;</m:mo>
											<m:mtext mathvariant="italic">min</m:mtext>
											<m:mo>_</m:mo>
											<m:mtext mathvariant="italic">value&#160;</m:mtext>
										</m:mrow>
										<m:mrow>
											<m:mtext mathvariant="italic">max</m:mtext>
											<m:mo>_</m:mo>
											<m:mtext mathvariant="italic">value</m:mtext>
											<m:mo>&#8722;</m:mo>
											<m:mtext mathvariant="italic">min</m:mtext>
											<m:mo>_</m:mo>
											<m:mtext mathvariant="italic">value</m:mtext>
										</m:mrow>
									</m:mfrac>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaqcLbxacqWGUbGBcqWGVbWBcqWGYbGCcqWGTbqBcqGGFbWxcqWG2bGDcqWGHbqycqWGSbaBcqWG1bqDcqWGLbqzcaaMe8Uaeyypa0JaaGjbVRWaaSaaaeaajugCbiabdwha1jabd6gaUjabd6gaUjabd+gaVjabdkhaYjabd2gaTjabc+faFjabdAha2jabdggaHjabdYgaSjabdwha1jabdwgaLjaaysW7cqGHsislieGacqWFTbqBcqWFPbqAcqWFUbGBcqGGFbWxcqWG2bGDcqWGHbqycqWGSbaBcqWG1bqDcqWGLbqzcqqGGaaiaOqaaKqzWfGae8xBa0Mae8xyaeMae8hEaGNaei4xa8LaemODayNaemyyaeMaemiBaWMaemyDauNaemyzauMaeyOeI0Iae8xBa0Mae8xAaKMae8NBa4Maei4xa8LaemODayNaemyyaeMaemiBaWMaemyDauNaemyzaugaaaaa@7E52@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>where <it>max_value</it> and <it>min_value</it> are the maximum and minimum values that are actually seen in the input feature set.</p>
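				<p>A minimal Python sketch of this normalization is given below. It assumes that the formula is applied to the values of one feature across all input documents and that a constant feature is mapped to zero; the text specifies neither detail. The function name <it>min_max_normalize</it> is hypothetical.</p>
				<p><![CDATA[
```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using the observed minimum and maximum.

    Assumption: a feature whose observed values are all equal is mapped
    to 0.0, which avoids a division by zero.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


feature_column = [0.0, 1.5, 3.0]
print(min_max_normalize(feature_column))  # [0.0, 0.5, 1.0]
```
]]></p>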
				<p>This method has an obvious drawback: lower-dimensional features may be overwhelmed by higher-dimensional ones (e.g. the named entity feature has only 5 dimensions, whereas the length-fixed string feature has more than 10 thousand). We therefore also perform the integration at the classifier level and propose two ways to implement it. The first integrates the outputs of the classifiers: after training a classifier on each type of feature separately, we normalize the classifier outputs, unify them into feature vectors and train a further classifier on these vectors. The second is <it>AdaBoost</it><abbrgrp><abbr bid="B18">18</abbr></abbrgrp>, a general classifier integration method with two major advantages: first, <it>AdaBoost</it> tunes the weight of each classifier according to its performance on each kind of training sample, which makes full use of the discriminative capability of the features; second, the soft margin of <it>AdaBoost</it> helps to avoid over-fitting during training. Both approaches overcome the drawback mentioned above.</p>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Results and discussion</p>
			</st>
			<p>The benchmark corpus is provided by BioCreAtIvE 2006. The training set contains 3536 relevant documents (title and abstract) and 1959 irrelevant ones. The testing set contains 750 documents, 375 of which are labelled as relevant. All of the proposed features and integration methods are implemented with a linear-kernel SVM.</p>
			<sec>
				<st>
					<p>Probabilistic schema</p>
				</st>
				<p>In Table <tblr tid="T2">2</tblr>, the <it>TF</it>*<it>ML</it> schema improves recall by 6.9% with essentially no loss of precision compared with the traditional <it>TF</it>*<it>IDF</it> schema. This improvement confirms the effectiveness of exploiting the category relevance information of features and suggests that <it>ML</it> is a more effective and general feature value schema for text classification.</p>
			</sec>
			<sec>
				<st>
					<p>String feature</p>
				</st>
				<p>In our experiments, the best performance is achieved when the string length <it>p</it> is set to 7. In Table <tblr tid="T6">6</tblr>, the length-fixed string feature (<it>p</it>=7) improves recall by more than 12.0% compared with the unigram and bigram features. This comes at the expense of a drop in precision of about 7.2%, which can be compensated by employing <it>TF</it>*<it>ML</it> as the feature value. These results are consistent with our statistical analysis of the feature distributions and give insight into the selection of lower-level features.</p>
				<tbl id="T6" hint_layout="single">
					<title>
						<p>Table 6</p>
					</title>
					<caption>
						<p>Length-fixed String Feature (TF*IDF)</p>
					</caption>
					<tblbdy cols="5">
						<r>
							<c>
								<p>
									<b>
										<it>Feature</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Precision</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Recall</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>F-Score</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>AUC</it>
									</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="5">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>Unigram + Bigram</it>
								</p>
							</c>
							<c>
								<p>0.7015</p>
							</c>
							<c>
								<p>0.8213</p>
							</c>
							<c>
								<p>0.7567</p>
							</c>
							<c>
								<p>0.8036</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>String (p=7)</it>
								</p>
							</c>
							<c>
								<p>0.6497</p>
							</c>
							<c>
								<p>
									<b>0.9200</b>
								</p>
							</c>
							<c>
								<p>0.7615</p>
							</c>
							<c>
								<p>0.8245</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>(Performance under <it>TF</it>*<it>IDF</it> schema)</p>
					</tblfn>
				</tbl>
			</sec>
			<sec>
				<st>
					<p>Named entities and semantic template features</p>
				</st>
				<p>In Table <tblr tid="T7">7</tblr>, a simple criterion that judges a document relevant if it contains at least one protein entity, and irrelevant otherwise, already achieves a very high recall (0.96) with an acceptable precision (0.58). Our proposed template extraction algorithm <it>KeyBT</it> captures the complex association between keywords and named entities well: it improves precision by 11.8% and yields a larger improvement than our former approach, <it>ONBIRES</it> templates <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>, which is based on a local alignment algorithm.</p>
				<tbl id="T7" hint_layout="single">
					<title>
						<p>Table 7</p>
					</title>
					<caption>
						<p>Named Entity and Semantic Template Feature</p>
					</caption>
					<tblbdy cols="5">
						<r>
							<c>
								<p>
									<b>
										<it>Feature</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Precision</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Recall</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>F-Score</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>AUC</it>
									</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="5">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p><it>Unigram + Bigram</it> (<it>TF</it>*<it>IDF</it>)</p>
							</c>
							<c>
								<p>0.7015</p>
							</c>
							<c>
								<p>0.8213</p>
							</c>
							<c>
								<p>0.7567</p>
							</c>
							<c>
								<p>0.8036</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>Protein Entity occurrence</it>
								</p>
							</c>
							<c>
								<p>0.5815</p>
							</c>
							<c>
								<p>
									<b>0.9600</b>
								</p>
							</c>
							<c>
								<p>0.7243</p>
							</c>
							<c>
								<p>0.7570</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>ONBIRES template</it>
								</p>
							</c>
							<c>
								<p>0.7647</p>
							</c>
							<c>
								<p>0.7973</p>
							</c>
							<c>
								<p>0.7806</p>
							</c>
							<c>
								<p>0.8156</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>KeyBT template</it>
								</p>
							</c>
							<c>
								<p>
									<b>0.7841</b>
								</p>
							</c>
							<c>
								<p>0.7653</p>
							</c>
							<c>
								<p>0.7746</p>
							</c>
							<c>
								<p>
									<b>0.8239</b>
								</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
			</sec>
			<sec>
				<st>
					<p>Feature integration</p>
				</st>
				<p>In Table <tblr tid="T8">8</tblr>, feature-level integration improves <it>F-score</it> by 5.2% and <it>AUC</it> by 5.9% relative to the unigram and bigram baseline; in Table <tblr tid="T9">9</tblr>, integration based on the classifier outputs achieves slightly larger improvements of 5.3% in <it>F-score</it> and 6.3% in <it>AUC</it>. The best performance is reached by <it>AdaBoost</it>, which improves <it>F-score</it> by 11.5% and <it>AUC</it> by 8.8%. The advantage of feature integration is clear: the different types of features are selected independently from the corpus, focus on different aspects of the feature space and reinforce each other. The results show that the different integration methods make good use of the capabilities of the different feature types and achieve promising improvements.</p>
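				<p>The percentages quoted for <it>AdaBoost</it> can be reproduced from the values in Table <tblr tid="T9">9</tblr>. The sketch below assumes that the <it>F-score</it> is the balanced F1 measure and that the percentages are relative gains over the unigram and bigram baseline; both assumptions agree with the tabulated values but are not stated explicitly in the text. The function names are hypothetical.</p>
				<p><![CDATA[
```python
def f_score(precision, recall):
    """Balanced F1: harmonic mean of precision and recall (assumed definition)."""
    return 2 * precision * recall / (precision + recall)


def relative_gain(baseline, value):
    """Relative improvement of value over baseline, as a fraction."""
    return (value - baseline) / baseline


# AdaBoost row of Table 9: precision 0.7995, recall 0.8933.
print(round(f_score(0.7995, 0.8933), 4))              # 0.8438

# Gains over the unigram + bigram baseline (F-score 0.7567, AUC 0.8036).
print(round(100 * relative_gain(0.7567, 0.8438), 1))  # 11.5
print(round(100 * relative_gain(0.8036, 0.8746), 1))  # 8.8
```
]]></p>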
				<tbl id="T8" hint_layout="single">
					<title>
						<p>Table 8</p>
					</title>
					<caption>
						<p>Feature-level Integration</p>
					</caption>
					<tblbdy cols="5">
						<r>
							<c>
								<p>
									<b>
										<it>Feature</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Precision</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Recall</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>F-Score</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>AUC</it>
									</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="5">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>String</it>
								</p>
							</c>
							<c>
								<p>0.7044</p>
							</c>
							<c>
								<p>0.8960</p>
							</c>
							<c>
								<p>0.7887</p>
							</c>
							<c>
								<p>0.8416</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>String + Entity</it>
								</p>
							</c>
							<c>
								<p>0.7360</p>
							</c>
							<c>
								<p>0.8773</p>
							</c>
							<c>
								<p>0.8004</p>
							</c>
							<c>
								<p>0.8479</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>String + Template</it>
								</p>
							</c>
							<c>
								<p>0.7416</p>
							</c>
							<c>
								<p>0.8880</p>
							</c>
							<c>
								<p>0.8082</p>
							</c>
							<c>
								<p>0.8372</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>String + Entity + Template</it>
								</p>
							</c>
							<c>
								<p>0.7584</p>
							</c>
							<c>
								<p>0.8373</p>
							</c>
							<c>
								<p>0.7959</p>
							</c>
							<c>
								<p>
									<b>0.8507</b>
								</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>(Normalize each part of the features and unify them into new feature vectors)</p>
					</tblfn>
				</tbl>
				<tbl id="T9" hint_layout="single">
					<title>
						<p>Table 9</p>
					</title>
					<caption>
						<p>Classifier-level Integration.  </p>
						<p>Integration on length-fixed string feature, entity feature and template feature</p>
					</caption>
					<tblbdy cols="5">
						<r>
							<c>
								<p>
									<b>
										<it>Feature</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Precision</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Recall</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>F-Score</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>AUC</it>
									</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="5">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>Unigram + Bigram</it>
								</p>
							</c>
							<c>
								<p>0.7015</p>
							</c>
							<c>
								<p>0.8213</p>
							</c>
							<c>
								<p>0.7567</p>
							</c>
							<c>
								<p>0.8036</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>Output based Integration</it>
								</p>
							</c>
							<c>
								<p>0.7248</p>
							</c>
							<c>
								<p>0.8853</p>
							</c>
							<c>
								<p>
									<b>0.7971</b>
								</p>
							</c>
							<c>
								<p>
									<b>0.8539</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p>AdaBoost</p>
							</c>
							<c>
								<p>0.7995</p>
							</c>
							<c>
								<p>0.8933</p>
							</c>
							<c>
								<p>
									<b>0.8438</b>
								</p>
							</c>
							<c>
								<p>
									<b>0.8746</b>
								</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>(Normalize the output of each classifier and unify them into new feature vectors)</p>
					</tblfn>
				</tbl>
			</sec>
			<sec>
				<st>
					<p>Statistical significance test</p>
				</st>
				<p>Because the evaluation corpus is relatively small, a statistical significance test is needed to validate the reliability of the proposed features and integration methods. Here we employ the <it>s-test</it> to compare systems on the pooled decisions over the individual document/category pairs <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>.</p>
				<p>Table <tblr tid="T10">10</tblr> shows that the proposed feature value schema <it>TF</it>*<it>ML</it>, the length-fixed string feature and the semantic template feature are significantly better than their counterparts (<it>p value</it> lower than 0.05), and that both levels of feature integration significantly improve the classification performance (<it>p value</it> lower than 0.005).</p>
				<tbl id="T10" hint_layout="single">
					<title>
						<p>Table 10</p>
					</title>
					<caption>
						<p>Statistical Significance Test (<it>s-test</it>).  </p>
						<p>The null hypothesis is that the performance of two methods is the same; the alternative hypothesis is that the former is better than the latter.</p>
					</caption>
					<tblbdy cols="4">
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p>
									<b>
										<it>String Vs. Unigram+Bigram</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b><it>TF</it>*<it>ML</it> Vs. <it>TF</it>*<it>IDF</it></b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>KeyBT Template Vs. Unigram+Bigram</it>
									</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="4">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>
									<b>
										<it>p value</it>
									</b>
								</p>
							</c>
							<c>
								<p>0.015</p>
							</c>
							<c>
								<p>0.012</p>
							</c>
							<c>
								<p>0.0188</p>
							</c>
						</r>
						<r>
							<c cspan="4">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c cspan="2">
								<p>
									<b>
										<it>Feature Level Integration Vs. Unigram+Bigram</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Classifier Level Integration Vs. Unigram+Bigram</it>
									</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="4">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>
									<b>
										<it>p value</it>
									</b>
								</p>
							</c>
							<c cspan="2">
								<p>0.0026</p>
							</c>
							<c>
								<p>0.0010</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
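				<p>As an illustration of how a sign test on pooled decisions can be computed, the sketch below implements a one-sided sign test with an exact binomial tail. It is an assumption-laden example rather than the procedure used to produce Table <tblr tid="T10">10</tblr>: the function name and the toy decisions are hypothetical, and only the pairs on which exactly one of the two systems is correct contribute to the statistic.</p>

```python
from math import comb
from typing import Sequence


def sign_test_p_value(gold: Sequence[int],
                      sys_a: Sequence[int],
                      sys_b: Sequence[int]) -> float:
    """One-sided sign test of 'A is better than B' on pooled binary decisions."""
    a_only = 0  # pairs where A is correct and B is wrong
    b_only = 0  # pairs where B is correct and A is wrong
    for g, a, b in zip(gold, sys_a, sys_b):
        a_ok, b_ok = (a == g), (b == g)
        if a_ok and not b_ok:
            a_only += 1
        elif b_ok and not a_ok:
            b_only += 1
    n = a_only + b_only  # ties (both right or both wrong) are ignored
    if n == 0:
        return 1.0
    # Under the null hypothesis, a_only follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(a_only, n + 1))
    return tail / 2 ** n


# Toy data: A is correct on all ten pairs, B is wrong on six of them.
gold = [1] * 10
sys_a = [1] * 10
sys_b = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
p = sign_test_p_value(gold, sys_a, sys_b)
```

				<p>With these toy decisions the six disagreements all favor the first system, so the p value is 1/64; for a large number of disagreements a normal approximation to the binomial tail is a common alternative.</p>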
			</sec>
			<sec>
				<st>
					<p>Comparison with the state of the art</p>
				</st>
				<p>In Table <tblr tid="T11">11</tblr>, the mean, standard deviation and best performance for BioCreAtIvE 2006 are taken over 51 runs from 19 teams. With our feature selection and integration procedure, the system outperforms the previous best results: relative to the best reported scores, the F-score improves by 8.2% (from 0.7800 to 0.8438) and the AUC by 2.2% (from 0.8554 to 0.8746).</p>
				<tbl id="T11" hint_layout="single">
					<title>
						<p>Table 11</p>
					</title>
					<caption>
						<p>Mean, Standard Deviation and Best Performance from BioCreAtIvE 2006 Vs Our Final Performance.  </p>
						<p>The best performance from BioCreAtIvE 2006 is selected from 51 runs of 19 teams.</p>
					</caption>
					<tblbdy cols="6">
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p>
									<b>
										<it>Precision</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Recall</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>F-Score</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>AUC</it>
									</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="6">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p>
									<it>Mean</it>
								</p>
							</c>
							<c>
								<p>0.6642</p>
							</c>
							<c>
								<p>0.7636</p>
							</c>
							<c>
								<p>0.6868</p>
							</c>
							<c>
								<p>0.7351</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>BioCreAtIvE 2006</it>
								</p>
							</c>
							<c>
								<p>
									<it>Standard Deviation</it>
								</p>
							</c>
							<c>
								<p>0.0810</p>
							</c>
							<c>
								<p>0.1926</p>
							</c>
							<c>
								<p>0.1035</p>
							</c>
							<c>
								<p>0.0741</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p>
									<it>Best Reported</it>
								</p>
							</c>
							<c>
								<p>-</p>
							</c>
							<c>
								<p>-</p>
							</c>
							<c>
								<p>0.7800</p>
							</c>
							<c>
								<p>0.8554</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>Final Performance</it>
								</p>
							</c>
							<c>
								<p>-</p>
							</c>
							<c>
								<p>0.7995</p>
							</c>
							<c>
								<p>0.8933</p>
							</c>
							<c>
								<p>
									<b>0.8438</b>
								</p>
							</c>
							<c>
								<p>
									<b>0.8746</b>
								</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
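				<p>The figures in Table <tblr tid="T11">11</tblr> can be cross-checked with a few lines of arithmetic. The sketch below recomputes the F-score from the reported precision and recall and expresses the gains over the best reported BioCreAtIvE 2006 scores as relative improvements; the function names are illustrative, and the balanced F-score (harmonic mean of precision and recall) is assumed.</p>

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Values copied from Table 11.
final_f = f_score(0.7995, 0.8933)
f_gain = relative_gain(0.8438, 0.7800)
auc_gain = relative_gain(0.8746, 0.8554)

print(round(final_f, 4), round(f_gain, 1), round(auc_gain, 1))
# prints: 0.8438 8.2 2.2
```

				<p>The recomputed F-score agrees with the tabulated 0.8438, and the percentages are relative gains rather than absolute differences in score.</p>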
			</sec>
		</sec>
		<sec>
			<st>
				<p>Conclusions</p>
			</st>
			<p>The experimental results demonstrate that lower-level features generalize better but are less accurate, whereas higher-level features carry rich domain-dependent information and therefore offer better specificity but poorer generality. Integrating features from different levels exploits complementary aspects of the feature space, which reinforces domain-dependent classification and mitigates bias in the data distribution.</p>
			<p>The main contributions of this paper are as follows:</p>
			<p>(1) We propose a novel domain-independent feature value schema, <it>TF</it>*<it>ML</it>, and a length-fixed string feature;</p>
			<p>(2) We introduce domain-dependent features (e.g. named entities and semantic templates) into biological literature classification, and propose a novel template extraction algorithm, <it>KeyBT</it>;</p>
			<p>(3) We investigate feature-level and classifier-level integration methods that incorporate information from different levels and perspectives.</p>
			<p>The proposed methods are currently being integrated into our online service <it>ONBIRES</it> <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> as a pre-processing module. As future work, we will investigate incremental learning to make our approach portable to different datasets.</p>
		</sec>
		<sec>
			<st>
				<p>Competing interests</p>
			</st>
			<p>The authors declare that they have no competing interests.</p>
		</sec>
		<sec>
			<st>
				<p>Authors' contributions</p>
			</st>
			<p>HW carried out the main work of the paper, proposed the methods and drafted the manuscript. MH provided guidance throughout the process and revised the draft. SD participated in the design and implementation of the experiments. XZ supervised the whole work, made a number of valuable suggestions and helped to revise the manuscript. All authors have read and approved the final manuscript.</p>
		</sec>
	</bdy>
	<bm>
		<ack>
			<sec>
				<st>
					<p>Acknowledgements</p>
				</st>
				<p>This work was supported by the Chinese Natural Science Foundation under grants No. 60572084 and No. 60621062, the National High Technology Research and Development Program of China (863 Program) under grant No. 2006AA02Z321, and the Tsinghua Basic Research Foundation under grants No. 052220205 and No. 053220002.</p>
				<p>This article has been published as part of <it>BMC Bioinformatics</it> Volume 9 Supplement 3, 2008: Proceedings of the Second International Symposium on Languages in Biology and Medicine (LBM) 2007. The full contents of the supplement are available online at <url>http://www.biomedcentral.com/1471-2105/9?issue=S3</url>.</p>
			</sec>
		</ack>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>An Extensive Empirical Study of Feature Selection Metrics for Text Classification</p>
				</title>
				<aug>
					<au>
						<snm>Forman</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<source>Journal of Machine Learning Research, Special Issue on Variable and Feature Selection</source>
				<pubdate>2002</pubdate>
			</bibl>
			<bibl id="B2">
				<title>
					<p>A Review of feature selection techniques in bioinformatics</p>
				</title>
				<aug>
					<au>
						<snm>Saeys</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Inza</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Larranaga</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<publisher>Oxford University Press</publisher>
				<pubdate>2007</pubdate>
				<volume>23</volume>
				<issue>19</issue>
				<fpage>2507</fpage>
				<lpage>2517</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmpid" link="fulltext">17720704</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B3">
				<title>
					<p>Rule-based extraction of experimental evidence in the biomedical domain: The KDD Cup 2002 (task 1)</p>
				</title>
				<aug>
					<au>
						<snm>Regev</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Finkelstein-Landau</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Feldman</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>ACM SIGKDD Explorations Newsletter</source>
				<volume>4</volume>
				<issue>2</issue>
				<fpage>90</fpage>
				<lpage>92</lpage>
			</bibl>
			<bibl id="B4">
				<title>
					<p>A machine learning approach for the curation of biomedical literature-KDD Cup 2002 (task 1)</p>
				</title>
				<aug>
					<au>
						<snm>Shi</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Edwin</snm>
						<fnm>DS</fnm>
					</au>
					<au>
						<snm>Menon</snm>
						<fnm>R</fnm>
					</au>
					<etal/>
				</aug>
				<source>ACM SIGKDD Explorations Newsletter</source>
				<volume>4</volume>
				<issue>2</issue>
				<fpage>93</fpage>
				<lpage>94</lpage>
			</bibl>
			<bibl id="B5">
				<title>
					<p>Automatic scientific text classification using local patterns: KDD Cup 2002 (task 1)</p>
				</title>
				<aug>
					<au>
						<snm>Ghanem</snm>
						<fnm>MM</fnm>
					</au>
					<au>
						<snm>Guo</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Lodhi</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Zhang</snm>
						<fnm>Y</fnm>
					</au>
				</aug>
				<source>ACM SIGKDD Explorations Newsletter</source>
				<volume>4</volume>
				<issue>2</issue>
				<fpage>95</fpage>
				<lpage>96</lpage>
			</bibl>
			<bibl id="B6">
				<title>
					<p>Substring selection for biomedical document classification</p>
				</title>
				<aug>
					<au>
						<snm>Han</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Obradovic</snm>
						<fnm>Z</fnm>
					</au>
					<au>
						<snm>Hu</snm>
						<fnm>ZZ</fnm>
					</au>
					<au>
						<snm>Wu</snm>
						<fnm>CH</fnm>
					</au>
					<au>
						<snm>Vucetic</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2006</pubdate>
				<volume>22</volume>
				<issue>17</issue>
				<fpage>2136</fpage>
				<lpage>2142</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmpid" link="fulltext">16837530</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B7">
				<title>
					<p>Boosting Naive Bayesian Learning on a Large Subset of MEDLINE</p>
				</title>
				<aug>
					<au>
						<snm>Wilbur</snm>
						<fnm>JW</fnm>
					</au>
				</aug>
				<source>Proceedings of AMIA Symposium</source>
				<publisher>Los Angeles, CA</publisher>
				<fpage>918</fpage>
				<lpage>922</lpage>
			</bibl>
			<bibl id="B8">
				<title>
					<p>Combining NLP and probabilistic categorization for document and term selection for Swiss-Prot medical annotation</p>
				</title>
				<aug>
					<au>
						<snm>Dobrokhotov</snm>
						<fnm>PB</fnm>
					</au>
					<etal/>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19</volume>
				<fpage>91</fpage>
				<lpage>94</lpage>
			</bibl>
			<bibl id="B9">
				<title>
					<p>Automatically Expanded Dictionaries with Exclusion Rules and Support Vector Machine Text Classifiers: Approaches to the BioCreAtIve 2 GN and PPI_IAS Tasks</p>
				</title>
				<aug>
					<au>
						<snm>Cohen</snm>
						<fnm>AM</fnm>
					</au>
				</aug>
				<source>Proceedings of the BioCreative Workshop: 22-25 April 2007. Madrid.</source>
				<publisher>Spanish National Cancer Research Centre</publisher>
				<editor>Krallinger M</editor>
				<pubdate>2007</pubdate>
				<fpage>169</fpage>
				<lpage>174</lpage>
				<note/>
			</bibl>
			<bibl id="B10">
				<title>
					<p>BioCreAtIvE</p>
				</title>
				<note>[<url>http://biocreative.sourceforge.net/</url>]</note>
			</bibl>
			<bibl id="B11">
				<title>
					<p>Introduction to Modern Information Retrieval</p>
				</title>
				<aug>
					<au>
						<snm>Salton</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>McGill</snm>
						<fnm>MJ</fnm>
					</au>
				</aug>
				<publisher>McGraw-Hill, Inc</publisher>
				<pubdate>1986</pubdate>
			</bibl>
			<bibl id="B12">
				<title>
					<p>Extracting Key-Substring-Group Features for Text Classification</p>
				</title>
				<aug>
					<au>
						<snm>Zhang</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Lee</snm>
						<fnm>WS</fnm>
					</au>
				</aug>
				<source>Proceedings of KDD'06</source>
				<publisher>Philadelphia, Pennsylvania, USA</publisher>
				<pubdate>2006</pubdate>
				<fpage>474</fpage>
				<lpage>483</lpage>
				<note>August 20-23</note>
			</bibl>
			<bibl id="B13">
				<title>
					<p>A Comparative Study on Feature Selection in Text Categorization</p>
				</title>
				<aug>
					<au>
						<snm>Yang</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Pedersen</snm>
						<fnm>JO</fnm>
					</au>
				</aug>
				<note>School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA; Verity, Inc., Sunnyvale, CA, USA.</note>
			</bibl>
			<bibl id="B14">
				<title>
					<p>Adding Domain Knowledge to SBL through Feature Construction</p>
				</title>
				<aug>
					<au>
						<snm>Matheus</snm>
						<fnm>CJ</fnm>
					</au>
				</aug>
				<source>Proceedings of the Eighth National Conference on Artificial Intelligence</source>
				<publisher>Boston</publisher>
				<pubdate>1990</pubdate>
				<fpage>803</fpage>
				<lpage>808</lpage>
			</bibl>
			<bibl id="B15">
				<title>
					<p>ABNER v1.5 homepage</p>
				</title>
				<note>[<url>http://pages.cs.wisc.edu/~bsettles/abner/#performance</url>]</note>
			</bibl>
			<bibl id="B16">
				<title>
					<p>Semi-supervised Pattern Learning for Extracting Relations from Bioscience Texts</p>
				</title>
				<aug>
					<au>
						<snm>Ding</snm>
						<fnm>SL</fnm>
					</au>
					<au>
						<snm>Huang</snm>
						<fnm>ML</fnm>
					</au>
					<au>
						<snm>Zhu</snm>
						<fnm>XY</fnm>
					</au>
				</aug>
				<source>Proceedings of the 5th Asia-Pacific Bioinformatics Conference</source>
				<pubdate>2007</pubdate>
				<fpage>307</fpage>
				<lpage>316</lpage>
			</bibl>
			<bibl id="B17">
				<title>
					<p>ProbFuse: A Probabilistic Approach to Data Fusion</p>
				</title>
				<aug>
					<au>
						<snm>Lillis</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Toolan</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Collier</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Dunnion</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Proceedings of SIGIR'06</source>
				<publisher>August. Seattle, Washington, USA</publisher>
				<pubdate>2006</pubdate>
				<note/>
			</bibl>
			<bibl id="B18">
				<title>
					<p>Soft Margins for AdaBoost</p>
				</title>
				<aug>
					<au>
						<snm>Ratsch</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Onoda</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Muller</snm>
						<fnm>KR</fnm>
					</au>
				</aug>
				<source>Machine Learning</source>
				<pubdate>2001</pubdate>
				<volume>42</volume>
				<fpage>287</fpage>
				<lpage>320</lpage>
			</bibl>
			<bibl id="B19">
				<title>
					<p>ONBIRES: Ontology-based biological relation extraction system</p>
				</title>
				<aug>
					<au>
						<snm>Huang</snm>
						<fnm>ML</fnm>
					</au>
					<au>
						<snm>Zhu</snm>
						<fnm>XY</fnm>
					</au>
					<au>
						<snm>Ding</snm>
						<fnm>SL</fnm>
					</au>
					<au>
						<snm>Yu</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Li</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>4th Asia-Pacific Bioinformatics Conference</source>
				<pubdate>2006</pubdate>
				<fpage>327</fpage>
				<lpage>336</lpage>
				<note>FEB 13-16</note>
			</bibl>
			<bibl id="B20">
				<title>
					<p>A re-examination of text categorization methods</p>
				</title>
				<aug>
					<au>
						<snm>Yang</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Liu</snm>
						<fnm>X</fnm>
					</au>
				</aug>
				<source>Proceedings of SIGIR'99 August</source>
				<publisher>Berkeley, CA, USA</publisher>
				<fpage>42</fpage>
				<lpage>49</lpage>
			</bibl>
			<bibl id="B21">
				<title>
					<p>ONBIRES Homepage</p>
				</title>
				<note>[<url>http://spies.cs.tsinghua.edu.cn:8080</url>]</note>
			</bibl>
		</refgrp>
	</bm>
</art>
