<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>1471-2105-9-S3-S2</ui>
	<ji>1471-2105</ji>
	<fm>
		<dochead>Proceedings</dochead>
		<bibl>
			<title>
				<p>Normalizing biomedical terms by minimizing ambiguity and variability</p>
			</title>
			<aug>
				<au id="A1" ca="yes">
					<snm>Tsuruoka</snm>
					<fnm>Yoshimasa</fnm>
					<insr iid="I1"/>
					<email>yoshimasa.tsuruoka@manchester.ac.uk</email>
				</au>
				<au id="A2">
					<snm>McNaught</snm>
					<fnm>John</fnm>
					<insr iid="I1"/>
					<insr iid="I2"/>
					<email>john.mcnaught@manchester.ac.uk</email>
				</au>
				<au id="A3">
					<snm>Ananiadou</snm>
					<fnm>Sophia</fnm>
					<insr iid="I1"/>
					<insr iid="I2"/>
					<email>sophia.ananiadou@manchester.ac.uk</email>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>School of Computer Science, The University of Manchester, MIB, 131 Princess Street, Manchester, M1 7DN, UK</p>
				</ins>
				<ins id="I2">
					<p>National Centre for Text Mining (NaCTeM), MIB, 131 Princess Street, Manchester, M1 7DN, UK</p>
				</ins>
			</insg>
			<source>BMC Bioinformatics</source>
			<supplement>
				<title>
					<p>Proceedings of the Second International Symposium on Languages in Biology and Medicine (LBM) 2007</p>
				</title>
				<editor>Christopher JO Baker and Su Jian</editor>
				<note>Proceedings</note>
				<url>http://www.biomedcentral.com/1471-2105-9-S3-info.pdf</url>
			</supplement>
			<conference>
				<title>
					<p>The Second International Symposium on Languages in Biology and Medicine (LBM) 2007</p>
				</title>
				<location>Singapore</location>
				<date-range>6-7 December 2007</date-range>
				<url>http://lbm2007.biopathway.org/</url>
			</conference>
			<issn>1471-2105</issn>
			<pubdate>2008</pubdate>
			<volume>9</volume>
			<issue>Suppl 3</issue>
			<fpage>S2</fpage>
			<url>http://www.biomedcentral.com/1471-2105/9/S3/S2</url>
			<xrefbib>
				<pubidlist><pubid idtype="pmpid">18426547</pubid><pubid idtype="doi">10.1186/1471-2105-9-S3-S2</pubid>
				</pubidlist></xrefbib>
		</bibl>
		<history>
			<pub>
				<date>
					<day>11</day>
					<month>04</month>
					<year>2008</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2008</year>
			<collab>Tsuruoka et al.; licensee BioMed Central Ltd.</collab>
			<note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
		</cpyrt>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<sec>
					<st>
						<p>Background</p>
					</st>
					<p>One of the difficulties in mapping biomedical named entities, e.g. genes, proteins, chemicals and diseases, to their concept identifiers stems from the potential variability of the terms. Soft string matching is a possible solution to the problem, but its inherent heavy computational cost discourages its use when the dictionaries are large or when real-time processing is required. A less computationally demanding approach is to normalize the terms by using heuristic rules, which enables us to look up a dictionary in constant time regardless of its size. The development of good heuristic rules, however, requires extensive knowledge of the terminology in question and thus is the bottleneck of the normalization approach.</p>
				</sec>
				<sec>
					<st>
						<p>Results</p>
					</st>
					<p>We present a novel framework for discovering a list of normalization rules from a dictionary in a fully automated manner. The rules are discovered in such a way that they minimize the ambiguity and variability of the terms in the dictionary. We evaluated our algorithm using two large dictionaries: a human gene/protein name dictionary built from BioThesaurus and a disease name dictionary built from UMLS.</p>
				</sec>
				<sec>
					<st>
						<p>Conclusions</p>
					</st>
					<p>The experimental results showed that automatically discovered rules can perform comparably to carefully crafted heuristic rules in term mapping tasks, and the computational overhead of rule application is small enough that a very fast implementation is possible. This work will help improve the performance of term-concept mapping tasks in biomedical information extraction, especially when good normalization heuristics for the target terminology are not fully known.</p>
				</sec>
			</sec>
		</abs>
	</fm>
	<bdy>
		<sec>
			<st>
				<p>Background</p>
			</st>
			<p>Named entities such as names of genes, proteins, chemicals, tissues, and diseases play a central role in information extraction from biomedical documents <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>. To fully utilize the information they convey in the document, we generally need to perform two steps. In the first step, which is commonly called named entity recognition, we identify the regions of text that are likely to be named entities and classify them into predefined categories. Substantial research efforts have been devoted to the improvement of named entity recognizers, and today we can identify biomedical named entities in the literature with reasonable (although still not entirely satisfactory) accuracy by using rule-based or machine learning-based techniques <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>.</p>
			<p>In the second step, we map the recognized entities to the corresponding concepts in the dictionary (or ontology). This step is crucial for making the extracted information exchangeable at the concept level <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. This mapping task has proven to be non-trivial, especially in the biomedical domain <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>. One of the main problems is that biomedical terms have many potential variants, and it is not possible for a dictionary to cover all possible variants in advance.</p>
			<p>One possible approach to tackle this problem is to use soft string matching techniques. Soft matching enables us to compute the degree of similarity between strings, and thus we can associate a term with its concept even when the dictionary fails to contain the exact spelling of the term. In fact, soft matching methods have been shown to be useful in several gene/protein name mapping tasks <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr></abbrgrp>. Soft string matching, however, is not without drawbacks: the method incurs a considerable computational cost when looking up the dictionary <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. This problem is particularly serious when we use large dictionaries such as those for gene/protein names and disease names, which can contain hundreds of thousands of terms. Although there are techniques to speed up the computation for simple similarity measures like uniform-cost edit distance <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, it is hard to apply those techniques to the sophisticated similarity measures needed in real mapping tasks. To make matters worse, the size of the literature that we need to analyze for biomedical information extraction could be huge: MEDLINE abstracts contain more than 70 million sentences, let alone full papers.</p>
			<p>Another approach to alleviate the problem of term variation is to normalize the terms by using heuristic rules <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp>. For example, converting capital letters to lower case has been shown to be an effective normalization rule for gene/protein names <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. The distinct advantage of the normalization approach over the soft matching approach is the speed of looking up the dictionary. Once the terms are normalized, we can use a <it>hashing</it> technique to look up a dictionary with a constant computational cost regardless of its size, while the cost for soft matching increases linearly with the size of the dictionary.</p>
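			<p>The hash-based lookup described above can be sketched as follows. This is only a minimal illustration and not the implementation used in this work: the single lower-casing rule, the two-entry toy dictionary, and the names <it>normalize</it>, <it>build_index</it> and <it>look_up</it> are all assumptions made for the example.</p>
			<p>
```python
# Illustrative sketch only: the rule, the toy entries and all names are assumptions.

def normalize(term):
    """Apply one example normalization rule: convert the term to lower case."""
    return term.lower()


def build_index(entries):
    """Map each normalized term to the set of concept IDs that contain it."""
    index = {}
    for concept_id, term in entries:
        index.setdefault(normalize(term), set()).add(concept_id)
    return index


def look_up(index, term):
    """Hash-based lookup: the expected cost does not grow with dictionary size."""
    return index.get(normalize(term), set())


# Hypothetical two-entry dictionary of (concept ID, term) pairs.
entries = [(1, "IL2"), (2, "IL3")]
index = build_index(entries)

# 'il2' is found under concept 1 even though the dictionary spelling is 'IL2'.
print(look_up(index, "il2"))
```
			</p>
			<p>The index is built once from the normalized dictionary terms; each query term is then normalized with the same rule and resolved by a single hash-table access.</p>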
			<p>What is most important in the normalization approach is how we normalize the terms. Bad heuristic rules often lose important information in the terms. For example, deleting <it>all</it> digits from a term is probably a bad rule for gene/protein names, because the rule makes it impossible to distinguish &#8216;ACE1&#8217; and &#8216;ACE2&#8217; although it enables us to match &#8216;ACE&#8217; with &#8216;ACE1.&#8217;</p>
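			<p>The side effect of such a rule can be reproduced in a few lines. The following sketch is purely illustrative; the function name <it>delete_all_digits</it> is hypothetical, and the three terms are the ones used in the example above.</p>
			<p>
```python
import re


def delete_all_digits(term):
    """A deliberately over-aggressive rule: remove every digit from the term."""
    return re.sub(r"[0-9]", "", term)


terms = ["ACE", "ACE1", "ACE2"]
normalized = {term: delete_all_digits(term) for term in terms}

# All three terms collapse to the same string, so 'ACE' now matches 'ACE1',
# but 'ACE1' and 'ACE2' can no longer be told apart.
print(normalized)
```
			</p>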
			<p>Although using good heuristic rules is certainly important, their development is not straightforward. It requires good intuition and extensive knowledge of the terminology in question; the developer has to know the types of variation and potential side effects of normalization. Consequently, it remains to be seen what normalization rules would work well for various classes of named entities in the biomedical domain.</p>
			<p>In this paper, we present a novel approach for the automatic discovery of term normalization rules, which requires no expert knowledge of the terminology. To achieve this goal, we leverage the important insight provided in previous studies <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B20">20</abbr></abbrgrp> in which contrast and variability in gene names were analyzed to test the effectiveness of several normalization heuristics. Their work suggests that one could distinguish good normalization rules from bad ones by analyzing the effect of normalization on the relationships between terms and their concept IDs in the dictionary. We take their work one step further and present a framework for discovering a list of &#8220;good&#8221; normalization rules from a dictionary in a fully automated manner.</p>
		</sec>
		<sec>
			<st>
				<p>Methods</p>
			</st>
			<sec>
				<st>
					<p>Ambiguity and variability</p>
				</st>
				<p>In this section, we describe two notions that are needed to quantify the utility of a normalization rule. We call them <it>ambiguity</it> and <it>variability</it>.</p>
				<p>First, let us define a dictionary simply as a list of terms {<it>t</it><sub>1</sub>, &#8230;, <it>t<sub>N</sub></it>} where each term is associated with a concept ID <it>c<sub>j</sub></it> &#8712; {<it>c</it><sub>1</sub>, &#8230;, <it>c<sub>M</sub></it>}. In the biomedical domain, concept IDs typically correspond to the unique identifiers for conceptual entities such as genes, chemicals, and diseases defined in biomedical databases (e.g. UniProt, InChI, and OMIM).</p>
				<p>Table <tblr tid="T1">1</tblr> shows an imaginary dictionary consisting of only three concept IDs. Here, we define two values for the dictionary: the ambiguity value and the variability value. The ambiguity value quantifies how ambiguous, on average, the terms in the dictionary are. More specifically, we define the ambiguity value as follows:</p>
				<p>
					<display-formula id="M1">
						<m:math name="1471-2105-9-S3-S2-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:mi>a</m:mi>
											<m:mi>m</m:mi>
											<m:mi>b</m:mi>
											<m:mi>i</m:mi>
											<m:mi>g</m:mi>
											<m:mi>u</m:mi>
											<m:mi>i</m:mi>
											<m:mi>t</m:mi>
											<m:mi>y</m:mi>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mi/>
									<m:mo>=</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mfrac>
										<m:mn>1</m:mn>
										<m:mi>N</m:mi>
									</m:mfrac>
									<m:mtext>&#8201;</m:mtext>
									<m:mstyle displaystyle="true">
										<m:munderover>
											<m:mo>&#8721;</m:mo>
											<m:mrow>
												<m:mi>i</m:mi>
												<m:mo>=</m:mo>
												<m:mn>1</m:mn>
											</m:mrow>
											<m:mi>N</m:mi>
										</m:munderover>
										<m:mrow>
											<m:mi>C</m:mi>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:msub>
														<m:mi>t</m:mi>
														<m:mi>i</m:mi>
													</m:msub>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
											<m:mo>,</m:mo>
										</m:mrow>
									</m:mstyle>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaWaaeWaaeaajugCbiabdggaHjabd2gaTjabdkgaIjabdMgaPjabdEgaNjabdwha1jabdMgaPjabdsha0jabdMha5bGccaGLOaGaayzkaaacbaqcLbxacqWFGaaicqGH9aqpcaaMe8UcdaWcaaqaaKqzWfGaeGymaedakeaajugCbiabd6eaobaacaaMe8UcdaaeWbqaaiabdoeadnaabmaabaGaemiDaq3aaSbaaSqaaiabdMgaPbqabaaakiaawIcacaGLPaaacqGGSaalaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabd6eaobqdcqGHris5aaaa@55D7@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>where <it>N</it> is the number of terms in the dictionary, and <it>C</it>(<it>t<sub>i</sub></it>) is the number of concept IDs that include a term whose spelling is identical to <it>t<sub>i</sub></it>.</p>
				<tbl id="T1" hint_layout="single">
					<title>
						<p>Table 1</p>
					</title>
					<caption>
						<p>Ambiguity and variability in a dictionary. </p>
						<p> This is an imaginary dictionary consisting of three concept IDs. All terms belonging to the same concept ID are assumed to be synonymous (conveying the same meaning).</p>
					</caption>
					<tblbdy cols="2">
						<r>
							<c>
								<p>Concept ID</p>
							</c>
							<c>
								<p>Term</p>
							</c>
						</r>
						<r>
							<c cspan="2">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>1</p>
							</c>
							<c>
								<p>IL2</p>
							</c>
						</r>
						<r>
							<c>
								<p>1</p>
							</c>
							<c>
								<p>IL-2</p>
							</c>
						</r>
						<r>
							<c>
								<p>1</p>
							</c>
							<c>
								<p>Interleukin</p>
							</c>
						</r>
						<r>
							<c>
								<p>2</p>
							</c>
							<c>
								<p>IL3</p>
							</c>
						</r>
						<r>
							<c>
								<p>2</p>
							</c>
							<c>
								<p>IL-3</p>
							</c>
						</r>
						<r>
							<c>
								<p>2</p>
							</c>
							<c>
								<p>Interleukin</p>
							</c>
						</r>
						<r>
							<c>
								<p>3</p>
							</c>
							<c>
								<p>ZFP580</p>
							</c>
						</r>
						<r>
							<c>
								<p>3</p>
							</c>
							<c>
								<p>ZFP581</p>
							</c>
						</r>
						<r>
							<c>
								<p>3</p>
							</c>
							<c>
								<p>Zinc finger protein</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
				<p>The variability value, in contrast, quantifies how variable the terms are. This is calculated as:</p>
				<p>
					<display-formula id="M2">
						<m:math name="1471-2105-9-S3-S2-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:mi>v</m:mi>
											<m:mi>a</m:mi>
											<m:mi>r</m:mi>
											<m:mi>i</m:mi>
											<m:mi>a</m:mi>
											<m:mi>b</m:mi>
											<m:mi>i</m:mi>
											<m:mi>l</m:mi>
											<m:mi>i</m:mi>
											<m:mi>t</m:mi>
											<m:mi>y</m:mi>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mi/>
									<m:mo>=</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mfrac>
										<m:mn>1</m:mn>
										<m:mi>M</m:mi>
									</m:mfrac>
									<m:mtext>&#8201;</m:mtext>
									<m:mstyle displaystyle="true">
										<m:munderover>
											<m:mo>&#8721;</m:mo>
											<m:mrow>
												<m:mi>j</m:mi>
												<m:mo>=</m:mo>
												<m:mn>1</m:mn>
											</m:mrow>
											<m:mi>M</m:mi>
										</m:munderover>
										<m:mrow>
											<m:mi>T</m:mi>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:msub>
														<m:mi>c</m:mi>
														<m:mi>j</m:mi>
													</m:msub>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
											<m:mo>,</m:mo>
										</m:mrow>
									</m:mstyle>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaWaaeWaaeaaieGajugCbiab=zha2jab=fgaHjab=jhaYjab=LgaPjab=fgaHjab=jgaIjab=LgaPjab=XgaSjab=LgaPjab=rha0jab=Lha5bGccaGLOaGaayzkaaacbaqcLbxacqGFGaaicqGH9aqpcaaMe8UcdaWcaaqaaKqzWfGaeGymaedakeaajugCbiabd2eanbaacaaMe8UcdaaeWbqaaiabdsfaunaabmaabaGaem4yam2aaSbaaSqaaiabdQgaQbqabaaakiaawIcacaGLPaaacqGGSaalaSqaaiabdQgaQjabg2da9iabigdaXaqaaiabd2eanbqdcqGHris5aaaa@5871@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>where <it>M</it> is the number of concept IDs in the dictionary, and <it>T</it>(<it>c<sub>j</sub></it>) is the number of <it>unique</it> terms that the concept <it>c<sub>j</sub></it> includes.</p>
				<p>For the dictionary shown in Table <tblr tid="T1">1</tblr>, we can calculate these values as follows:</p>
				<p>
					<display-formula>
						<m:math name="1471-2105-9-S3-S2-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mtable>
									<m:mtr>
										<m:mtd>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mi>a</m:mi>
													<m:mi>m</m:mi>
													<m:mi>b</m:mi>
													<m:mi>i</m:mi>
													<m:mi>g</m:mi>
													<m:mi>u</m:mi>
													<m:mi>i</m:mi>
													<m:mi>t</m:mi>
													<m:mi>y</m:mi>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
											<m:mtext>&#160;</m:mtext>
											<m:mo>=</m:mo>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mfrac>
												<m:mn>1</m:mn>
												<m:mn>9</m:mn>
											</m:mfrac>
											<m:mtext>&#8201;</m:mtext>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mn>1</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>1</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>2</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>1</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>1</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>2</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>1</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>1</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>1</m:mn>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
										</m:mtd>
									</m:mtr>
									<m:mtr>
										<m:mtd>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mo>=</m:mo>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#160;</m:mtext>
											<m:mn>1.22...</m:mn>
										</m:mtd>
									</m:mtr>
									<m:mtr>
										<m:mtd>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mi>v</m:mi>
													<m:mi>a</m:mi>
													<m:mi>r</m:mi>
													<m:mi>i</m:mi>
													<m:mi>a</m:mi>
													<m:mi>b</m:mi>
													<m:mi>i</m:mi>
													<m:mi>l</m:mi>
													<m:mi>i</m:mi>
													<m:mi>t</m:mi>
													<m:mi>y</m:mi>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
											<m:mtext>&#160;</m:mtext>
											<m:mo>=</m:mo>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mfrac>
												<m:mn>1</m:mn>
												<m:mn>3</m:mn>
											</m:mfrac>
											<m:mtext>&#8201;</m:mtext>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mn>3</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>3</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>3</m:mn>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
										</m:mtd>
									</m:mtr>
									<m:mtr>
										<m:mtd>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mo>=</m:mo>
											<m:mtext>&#160;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mn>3</m:mn>
										</m:mtd>
									</m:mtr>
								</m:mtable>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGceaabbeaadaqadaqaaGqacKqzWfGae8xyaeMae8xBa0Mae8NyaiMae8xAaKMae83zaCMae8xDauNae8xAaKMae8hDaqNae8xEaKhakiaawIcacaGLPaaajugCbiabbccaGiabg2da9iaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaaysW7kmaalaaabaqcLbxacqaIXaqmaOqaaKqzWfGaeGyoaKdaaiaaysW7kmaabmaabaqcLbxacqaIXaqmcqqGGaaicqGHRaWkcqqGGaaicqaIXaqmcqqGGaaicqGHRaWkcqqGGaaicqaIYaGmcqqGGaaicqGHRaWkcqqGGaaicqaIXaqmcqqGGaaicqGHRaWkcqqGGaaicqaIXaqmcqqGGaaicqGHRaWkcqqGGaaicqaIYaGmcqqGGaaicqGHRaWkcqqGGaaicqaIXaqmcqqGGaaicqGHRaWkcqqGGaaicqaIXaqmcqqGGaaicqGHRaWkcqqGGaaicqaIXaqmaOGaayjkaiaawMcaaaqaaKqzWfGaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8Uaeyypa0JaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7cqqGGaaicqaIXaqmcqGGUaGlcqaIYaGmcqaIYaGmcqGGUaGlcqGGUaGlcqGGUaGlaOqaamaabmaabaqcLbxacqWF2bGDcqWFHbqycqWFYbGCcqWFPbqAcqWFHbqycqWFIbGycqWFPbqAcqWFSbaBcqWFPbqAcqWF0baDcqWF5bqEaOGaayjkaiaawMcaaKqzWfGaeeiiaaIaeyypa0JaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMe8UcdaWcaaqaaKqzWfGaeGymaedakeaajugCbiabiodaZaaacaaMe8UcdaqadaqaaKqzWfGaeG4mamJaeeiiaaIaey4kaSIaeeiiaaIaeG4mamJaeeiiaaIaey4kaSIaeeiiaaIaeG4mamdakiaawIcacaGLPaaaaeaajugCbiaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7caaMi8Uaeyypa0JaeeiiaaIaaGjcVlaayIW7caaMi8UaaGjcVlabiodaZaaaaa@2B7B@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
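				<p>The calculation above can be checked with a short script. This is a minimal sketch that follows the two definitions directly; the function names <it>ambiguity</it> and <it>variability</it> and the representation of the dictionary as a list of (concept ID, term) pairs are assumptions made for illustration, not part of the method itself.</p>
				<p>
```python
from fractions import Fraction

# Table 1 as (concept ID, term) pairs.
TABLE_1 = [
    (1, "IL2"), (1, "IL-2"), (1, "Interleukin"),
    (2, "IL3"), (2, "IL-3"), (2, "Interleukin"),
    (3, "ZFP580"), (3, "ZFP581"), (3, "Zinc finger protein"),
]


def ambiguity(entries):
    """Average of C(t_i): the number of concept IDs containing the spelling t_i."""
    concepts_of = {}
    for concept_id, term in entries:
        concepts_of.setdefault(term, set()).add(concept_id)
    total = sum(len(concepts_of[term]) for _, term in entries)
    return Fraction(total, len(entries))


def variability(entries):
    """Average of T(c_j): the number of unique terms under the concept c_j."""
    terms_of = {}
    for concept_id, term in entries:
        terms_of.setdefault(concept_id, set()).add(term)
    total = sum(len(terms) for terms in terms_of.values())
    return Fraction(total, len(terms_of))


print(ambiguity(TABLE_1))    # the exact fraction 11/9 (about 1.22)
print(variability(TABLE_1))  # the exact value 9/3, which reduces to 3
```
				</p>
				<p>Exact fractions are used here only so that the results can be compared with the hand calculation without rounding.</p>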
				<p>These values can be seen as indicators of the complexity of the terminology. Ideally, we do not want the terms to be ambiguous or variable, because both lead to impaired performance in mapping tasks. We thus favour smaller values for these factors.</p>
				<p>Now let us see how a normalization rule can change the situation. Suppose that we have the normalization rule that removes hyphens. By applying it to the terms in the dictionary, &#8216;IL-2&#8217; becomes &#8216;IL2&#8217;, and &#8216;IL-3&#8217; becomes &#8216;IL3&#8217;. Then we obtain new values for ambiguity and variability:</p>
				<p>
					<display-formula>
						<m:math name="1471-2105-9-S3-S2-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mtable>
									<m:mtr>
										<m:mtd>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mi>a</m:mi>
													<m:mi>m</m:mi>
													<m:mi>b</m:mi>
													<m:mi>i</m:mi>
													<m:mi>g</m:mi>
													<m:mi>u</m:mi>
													<m:mi>i</m:mi>
													<m:mi>t</m:mi>
													<m:mi>y</m:mi>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
											<m:mtext>&#160;</m:mtext>
											<m:mo>=</m:mo>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mfrac>
												<m:mn>1</m:mn>
												<m:mn>9</m:mn>
											</m:mfrac>
											<m:mtext>&#8201;</m:mtext>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mn>1</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>1</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>2</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>1</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>1</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>2</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>1</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>1</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>1</m:mn>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
										</m:mtd>
									</m:mtr>
									<m:mtr>
										<m:mtd>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mo>=</m:mo>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#160;</m:mtext>
											<m:mn>1.22...</m:mn>
										</m:mtd>
									</m:mtr>
									<m:mtr>
										<m:mtd>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mi>v</m:mi>
													<m:mi>a</m:mi>
													<m:mi>r</m:mi>
													<m:mi>i</m:mi>
													<m:mi>a</m:mi>
													<m:mi>b</m:mi>
													<m:mi>i</m:mi>
													<m:mi>l</m:mi>
													<m:mi>i</m:mi>
													<m:mi>t</m:mi>
													<m:mi>y</m:mi>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mo>=</m:mo>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mfrac>
												<m:mn>1</m:mn>
												<m:mn>3</m:mn>
											</m:mfrac>
											<m:mtext>&#8202;</m:mtext>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mn>2</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>2</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>3</m:mn>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
										</m:mtd>
									</m:mtr>
									<m:mtr>
										<m:mtd>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mo>=</m:mo>
											<m:mtext>&#160;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mn>2.33...</m:mn>
										</m:mtd>
									</m:mtr>
								</m:mtable>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGceaabbeaadaqadaqaaKqzWfGaemyyaeMaemyBa0MaemOyaiMaemyAaKMaem4zaCMaemyDauNaemyAaKMaemiDaqNaemyEaKhakiaawIcacaGLPaaajugCbiabbccaGiabg2da9iaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaaysW7kmaalaaabaqcLbxacqaIXaqmaOqaaKqzWfGaeGyoaKdaaiaaysW7kmaabmaabaqcLbxacqaIXaqmcqqGGaaicqGHRaWkcqqGGaaicqaIXaqmcqqGGaaicqGHRaWkcqqGGaaicqaIYaGmcqqGGaaicqGHRaWkcqqGGaaicqaIXaqmcqqGGaaicqGHRaWkcqqGGaaicqaIXaqmcqqGGaaicqGHRaWkcqqGGaaicqaIYaGmcqqGGaaicqGHRaWkcqqGGaaicqaIXaqmcqqGGaaicqGHRaWkcqqGGaaicqaIXaqmcqqGGaaicqGHRaWkcqqGGaaicqaIXaqmaOGaayjkaiaawMcaaaqaaKqzWfGaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlabg2da9iaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaeeiiaaIaeGymaeJaeiOla4IaeGOmaiJaeGOmaiJaeiOla4IaeiOla4IaeiOla4cakeaadaqadaqaaGqacKqzWfGae8NDayNae8xyaeMae8NCaiNae8xAaKMae8xyaeMae8NyaiMae8xAaKMae8hBaWMae8xAaKMae8hDaqNae8xEaKhakiaawIcacaGLPaaacaaMc8UaaGPaVNqzWfGaaGjbVlabg2da9iaaysW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7kmaalaaabaqcLbxacqaIXaqmaOqaaKqzWfGaeG4mamdaaOGaaGjcVpaabmaabaqcLbxacqaIYaGmcqqGGaaicqGHRaWkcqqGGaaicqaIYaGmcqqGGaaicqGHRaWkcqqGGaaicqaIZaWmaOGaayjkaiaawMcaaaqaaKqzWfGaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlabg2da9iabbccaGiaayIW7caaMi8UaaGjcVlabikdaYiabc6caUiabiodaZiabiodaZiabc6caUiabc6caUiabc6caUaaaaa@F323@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>We have succeeded in reducing the variability value without increasing the ambiguity. This indicates that this normalization rule is a good one.</p>
				<p>For the same dictionary shown in Table <tblr tid="T1">1</tblr>, we could think of a different normalization rule that replaces all digits with the special symbol &#8216;#&#8217;. If we apply this rule to the dictionary, &#8216;IL2&#8217; and &#8216;IL3&#8217; become &#8216;IL#&#8217;, &#8216;IL-2&#8217; and &#8216;IL-3&#8217; become &#8216;IL-#&#8217;, and &#8216;ZFP580&#8217; and &#8216;ZFP581&#8217; become &#8216;ZFP###&#8217;. We then obtain:</p>
				<p>
					<display-formula>
						<m:math name="1471-2105-9-S3-S2-i5" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mtable>
									<m:mtr>
										<m:mtd>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mi>a</m:mi>
													<m:mi>m</m:mi>
													<m:mi>b</m:mi>
													<m:mi>i</m:mi>
													<m:mi>g</m:mi>
													<m:mi>u</m:mi>
													<m:mi>i</m:mi>
													<m:mi>t</m:mi>
													<m:mi>y</m:mi>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
											<m:mtext>&#160;</m:mtext>
											<m:mo>=</m:mo>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mfrac>
												<m:mn>1</m:mn>
												<m:mn>9</m:mn>
											</m:mfrac>
											<m:mtext>&#8201;</m:mtext>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mn>2</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>2</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>2</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>2</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>2</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>2</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>1</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>1</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>1</m:mn>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
										</m:mtd>
									</m:mtr>
									<m:mtr>
										<m:mtd>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mo>=</m:mo>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#160;</m:mtext>
											<m:mn>1.66...</m:mn>
										</m:mtd>
									</m:mtr>
									<m:mtr>
										<m:mtd>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mi>v</m:mi>
													<m:mi>a</m:mi>
													<m:mi>r</m:mi>
													<m:mi>i</m:mi>
													<m:mi>a</m:mi>
													<m:mi>b</m:mi>
													<m:mi>i</m:mi>
													<m:mi>l</m:mi>
													<m:mi>i</m:mi>
													<m:mi>t</m:mi>
													<m:mi>y</m:mi>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mo>=</m:mo>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mfrac>
												<m:mn>1</m:mn>
												<m:mn>3</m:mn>
											</m:mfrac>
											<m:mtext>&#8202;</m:mtext>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mn>3</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>3</m:mn>
													<m:mtext>&#160;</m:mtext>
													<m:mo>+</m:mo>
													<m:mtext>&#160;</m:mtext>
													<m:mn>2</m:mn>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
										</m:mtd>
									</m:mtr>
									<m:mtr>
										<m:mtd>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mtext>&#8201;</m:mtext>
											<m:mo>=</m:mo>
											<m:mtext>&#160;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mtext>&#8202;</m:mtext>
											<m:mn>2.66...</m:mn>
										</m:mtd>
									</m:mtr>
								</m:mtable>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGceaabbeaadaqadaqaaKqzWfGaemyyaeMaemyBa0MaemOyaiMaemyAaKMaem4zaCMaemyDauNaemyAaKMaemiDaqNaemyEaKhakiaawIcacaGLPaaajugCbiabbccaGiabg2da9iaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaaysW7kmaalaaabaqcLbxacqaIXaqmaOqaaKqzWfGaeGyoaKdaaiaaysW7kmaabmaabaqcLbxacqaIYaGmcqqGGaaicqGHRaWkcqqGGaaicqaIYaGmcqqGGaaicqGHRaWkcqqGGaaicqaIYaGmcqqGGaaicqGHRaWkcqqGGaaicqaIYaGmcqqGGaaicqGHRaWkcqqGGaaicqaIYaGmcqqGGaaicqGHRaWkcqqGGaaicqaIYaGmcqqGGaaicqGHRaWkcqqGGaaicqaIXaqmcqqGGaaicqGHRaWkcqqGGaaicqaIXaqmcqqGGaaicqGHRaWkcqqGGaaicqaIXaqmaOGaayjkaiaawMcaaaqaaKqzWfGaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlabg2da9iaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaeeiiaaIaeGymaeJaeiOla4IaeGOnayJaeGOnayJaeiOla4IaeiOla4IaeiOla4cakeaadaqadaqaaGqacKqzWfGae8NDayNae8xyaeMae8NCaiNae8xAaKMae8xyaeMae8NyaiMae8xAaKMae8hBaWMae8xAaKMae8hDaqNae8xEaKhakiaawIcacaGLPaaacaaMc8UaaGPaVNqzWfGaaGjbVlabg2da9iaaysW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7kmaalaaabaqcLbxacqaIXaqmaOqaaKqzWfGaeG4mamdaaOGaaGjcVpaabmaabaqcLbxacqaIZaWmcqqGGaaicqGHRaWkcqqGGaaicqaIZaWmcqqGGaaicqGHRaWkcqqGGaaicqaIYaGmaOGaayjkaiaawMcaaaqaaKqzWfGaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlaaysW7caaMe8UaaGjbVlabg2da9iabbccaGiaayIW7caaMi8UaaGjcVlaayIW7caaMi8UaaGjcVlabikdaYiabc6caUiabiAda2iabiAda2iabc6caUiabc6caUiabc6caUaaaaa@F7FC@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>Although the variability value has decreased, the ambiguity value has increased. This indicates that this normalization rule may not be a good one.</p>
				<p>The examples above demonstrate that we can distinguish good normalization rules from bad ones by observing how they change the ambiguity and variability values of the dictionary. In general, a normalization rule reduces the variability value at the cost of an increase in the ambiguity value. What we want, therefore, is a rule that maximizes the reduction in variability while keeping the increase in ambiguity minimal.</p>
				<p>We now need to integrate the two values in order to quantify the overall &#8220;goodness&#8221; of a normalization rule. We define a new value, which we call <it>complexity</it>, as follows:</p>
				<p>
					<display-formula id="M3">
						<m:math name="1471-2105-9-S3-S2-i6" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:mi>c</m:mi>
											<m:mi>o</m:mi>
											<m:mi>m</m:mi>
											<m:mi>p</m:mi>
											<m:mi>l</m:mi>
											<m:mi>e</m:mi>
											<m:mi>x</m:mi>
											<m:mi>i</m:mi>
											<m:mi>t</m:mi>
											<m:mi>y</m:mi>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mtext>&#8202;</m:mtext>
									<m:mtext>&#8202;</m:mtext>
									<m:mtext>&#8202;</m:mtext>
									<m:mtext>&#8202;</m:mtext>
									<m:mo>=</m:mo>
									<m:mtext>&#8202;</m:mtext>
									<m:mtext>&#8202;</m:mtext>
									<m:mtext>&#8202;</m:mtext>
									<m:mtext>&#8202;</m:mtext>
									<m:mtext>&#8202;</m:mtext>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:mi>a</m:mi>
											<m:mi>m</m:mi>
											<m:mi>b</m:mi>
											<m:mi>i</m:mi>
											<m:mi>g</m:mi>
											<m:mi>u</m:mi>
											<m:mi>i</m:mi>
											<m:mi>t</m:mi>
											<m:mi>y</m:mi>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mtext>&#8202;</m:mtext>
									<m:mtext>&#8202;</m:mtext>
									<m:mtext>&#8202;</m:mtext>
									<m:mtext>&#8202;</m:mtext>
									<m:mo>&#215;</m:mo>
									<m:mtext>&#8202;</m:mtext>
									<m:mtext>&#8202;</m:mtext>
									<m:mtext>&#8202;</m:mtext>
									<m:mtext>&#8202;</m:mtext>
									<m:msup>
										<m:mrow>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mi>v</m:mi>
													<m:mi>a</m:mi>
													<m:mi>r</m:mi>
													<m:mi>i</m:mi>
													<m:mi>a</m:mi>
													<m:mi>b</m:mi>
													<m:mi>i</m:mi>
													<m:mi>l</m:mi>
													<m:mi>i</m:mi>
													<m:mi>t</m:mi>
													<m:mi>y</m:mi>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
										</m:mrow>
										<m:mi>&#945;</m:mi>
									</m:msup>
									<m:mo>,</m:mo>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaWaaeWaaeaajugCbiabdogaJjabd+gaVjabd2gaTjabdchaWjabdYgaSjabdwgaLjabdIha4jabdMgaPjabdsha0jabdMha5bGccaGLOaGaayzkaaGaaGjcVlaayIW7caaMi8UaaGjcVNqzWfGaeyypa0JaaGjcVlaayIW7caaMi8UaaGjcVlaayIW7kmaabmaabaqcLbxacqWGHbqycqWGTbqBcqWGIbGycqWGPbqAcqWGNbWzcqWG1bqDcqWGPbqAcqWG0baDcqWG5bqEaOGaayjkaiaawMcaaiaayIW7caaMi8UaaGjcVlaayIW7cqGHxdaTcaaMi8UaaGjcVlaayIW7caaMi8+aaeWaaeaaieGajugCbiab=zha2jab=fgaHjab=jhaYjab=LgaPjab=fgaHjab=jgaIjab=LgaPjab=XgaSjab=LgaPjab=rha0jab=Lha5bGccaGLOaGaayzkaaWaaWbaaSqabeaajugCbiabeg7aHbaacqGGSaalaaa@82E5@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>where &#945; is the constant that determines the trade-off between ambiguity and variability.</p>
				<p>The problem now becomes simple: if a normalization rule reduces the complexity value of the dictionary, it is a good rule; otherwise it is a bad rule.</p>
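				<p>The three measures can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The toy dictionary is an assumption reconstructed to be consistent with the worked values above: the concept IDs and the third term of the last concept are hypothetical. The default value 0.1 for &#945; is the one used in the experiments reported later in this paper.</p>

```python
import re

def ambiguity(dictionary):
    """Average, over all term entries, of the number of concept IDs sharing that term string."""
    owners = {}
    for concept_id, terms in dictionary.items():
        for term in terms:
            owners.setdefault(term, set()).add(concept_id)
    entries = [t for terms in dictionary.values() for t in terms]
    return sum(len(owners[t]) for t in entries) / len(entries)

def variability(dictionary):
    """Average, over all concepts, of the number of distinct term strings per concept."""
    return sum(len(set(terms)) for terms in dictionary.values()) / len(dictionary)

def complexity(dictionary, alpha=0.1):
    """complexity = ambiguity x variability^alpha (Equation 3)."""
    return ambiguity(dictionary) * variability(dictionary) ** alpha

def apply_rule(dictionary, rule):
    """Apply a string-to-string function to every term; entries are kept even if they collapse."""
    return {cid: [rule(t) for t in terms] for cid, terms in dictionary.items()}

# Hypothetical toy dictionary (concept IDs and the last term are assumptions).
TOY = {
    "C1": ["IL2", "IL-2", "Interleukin"],
    "C2": ["IL3", "IL-3", "Interleukin"],
    "C3": ["ZFP580", "ZFP581", "Zinc finger protein"],
}

def remove_hyphens(term):
    return term.replace("-", "")

def mask_digits(term):
    return re.sub(r"\d", "#", term)

for name, rule in [("remove hyphens", remove_hyphens), ("mask digits", mask_digits)]:
    after = apply_rule(TOY, rule)
    print(name, round(ambiguity(after), 2), round(variability(after), 2),
          complexity(TOY) > complexity(after))
```

				<p>Under these assumptions, removing hyphens lowers the complexity of the toy dictionary, whereas replacing digits with &#8216;#&#8217; raises it, in line with the two worked examples.</p>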
			</sec>
			<sec>
				<st>
					<p>Generating rule candidates</p>
				</st>
				<p>The next problem is how we automatically generate normalization rules. Ideally, we want to allow normalization rules to be of any type, such as regular expressions and context-free grammars. However, we found it difficult to incorporate such complex rules in a fully automatic manner because it entails a huge search space for rule hypotheses.</p>
				<p>In this work, we focus only on character-level replacement rules. By focusing on this type, we can easily generate rule candidates from the terms in the dictionary. For example, the first and the second terms in the dictionary given in Table <tblr tid="T1">1</tblr> constitute the following pair.</p>
				<p>IL2</p>
				<p>IL-2</p>
				<p>From this pair, we can easily see that we will be able to match the two terms if we remove the hyphens (i.e. replace the hyphens with the <it>null</it> character), which in turn will reduce the variability value of the dictionary. In other words, we can automatically generate the rule that removes hyphens, by observing this term pair.</p>
				<p>More formally, we can represent any pair of terms X and Y as follows:</p>
				<p>LX<sub>C</sub>R</p>
				<p>LY<sub>C</sub>R</p>
				<p>where L is the left common substring shared by X and Y, R is the right common substring, and X<sub>C</sub> and Y<sub>C</sub> are the substrings at the center that are not shared by the two strings. From this representation, we create the rule that replaces Y<sub>C</sub> with X<sub>C</sub>, which will transform Y into X.</p>
				<p>For the above example pair &#8216;IL2&#8217; and &#8216;IL-2&#8217;, L is &#8216;IL&#8217;, R is &#8216;2&#8217;, Y<sub>C</sub> is &#8216;-&#8217;, and X<sub>C</sub> is the null character. If we take the first term &#8216;IL2&#8217; and the third term &#8216;Interleukin&#8217; from the dictionary in Table <tblr tid="T1">1</tblr>, L is &#8216;I&#8217;, R is the null character, Y<sub>C</sub> is &#8216;nterleukin&#8217;, and X<sub>C</sub> is &#8216;L2&#8217;.</p>
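				<p>The decomposition into L, X<sub>C</sub>, Y<sub>C</sub> and R can be sketched as follows. The function name <it>rule_from_pair</it> is hypothetical, and taking the longest common prefix first and then the longest common suffix of what remains is an assumption about how ties between L and R are resolved.</p>

```python
import os

def rule_from_pair(x, y):
    """Return the rule (Y_C, X_C) that rewrites term y into term x.

    L is the longest common prefix of x and y; R is the longest common
    suffix of the remainders, so L and R never overlap.
    """
    left = os.path.commonprefix([x, y])
    x_rest, y_rest = x[len(left):], y[len(left):]
    # Longest common suffix = reversed common prefix of the reversed remainders.
    right = os.path.commonprefix([x_rest[::-1], y_rest[::-1]])[::-1]
    x_c = x_rest[:len(x_rest) - len(right)]
    y_c = y_rest[:len(y_rest) - len(right)]
    return y_c, x_c

print(rule_from_pair("IL2", "IL-2"))         # ('-', '')
print(rule_from_pair("IL2", "Interleukin"))  # ('nterleukin', 'L2')
```

				<p>An empty string plays the role of the null character here, so the first rule reads as &#8220;delete every hyphen&#8221;.</p>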
			</sec>
			<sec>
				<st>
					<p>Discovering rules</p>
				</st>
				<p>In the previous sections, we have defined a measure to quantify the utility of a normalization rule and presented a method to generate a rule candidate from any given term pair. Now we describe the whole process of rule discovery. The process is as follows:</p>
				<p>1. Generate rule candidates from all possible pairs of synonymous terms in the dictionary (i.e. terms sharing the same concept ID).</p>
				<p>2. Select a rule that can reduce the complexity value defined by Equation <formr mid="M3">3</formr>.</p>
				<p>3. Apply the rule to all the terms in the dictionary.</p>
				<p>4. Go back to step 1 and repeat until the predefined number of iterations is reached.</p>
				<p>Notice that the process is iterative: we apply the discovered rule immediately to the dictionary and then use the updated dictionary for the next iteration. This is because the discovered rules are to be used sequentially; the end product of our rule discovery system is a list of normalization rules, and we apply them in exactly the order specified in the list. The terms in the dictionary therefore have to be updated sequentially during rule discovery, so that they go through the same sequence of rule applications.</p>
				<p>In step 2, we need to select a good rule from the rule candidates generated in step 1. The obvious strategy would be to select the rule that maximizes the reduction of the complexity value of the dictionary. However, we found this strategy impractical when the dictionary is large, because it requires applying every rule candidate to the dictionary to measure its utility. In this work, we use a less computationally intensive strategy. First, we sort the rule candidates in descending order of frequency of occurrence. We then pick the first rule that decreases the complexity value. This strategy worked reasonably well, since rule candidates that are generated many times tend to decrease the variability value more than infrequent ones do.</p>
				<p>To further improve the efficiency of the entire process, we do not reconsider any rule candidate that has once failed to reduce the complexity value. This pruning method is not completely safe, because the terms in the dictionary change as the process proceeds, and a candidate that has been rejected once could become acceptable at some later point. However, we found that the speed-up outweighs this drawback when the dictionary is large.</p>
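				<p>The four steps, the frequency-ordered selection and the pruning of rejected candidates can be put together in a short sketch. It is an illustration under stated assumptions rather than the authors' code: candidates are generated from ordered pairs (so both directions of each pair are proposed), candidates with an empty left-hand side are skipped, a rule is applied by replacing every occurrence of its left-hand side, and the toy dictionary and all function names are hypothetical.</p>

```python
from collections import Counter
from itertools import permutations
import os

def complexity(dictionary, alpha=0.1):
    """ambiguity x variability^alpha for a dict of concept ID to list of terms."""
    owners = {}
    for cid, terms in dictionary.items():
        for t in terms:
            owners.setdefault(t, set()).add(cid)
    entries = [t for terms in dictionary.values() for t in terms]
    ambiguity = sum(len(owners[t]) for t in entries) / len(entries)
    variability = sum(len(set(ts)) for ts in dictionary.values()) / len(dictionary)
    return ambiguity * variability ** alpha

def rule_from_pair(x, y):
    """Rule (Y_C, X_C) that rewrites y into x, from the L X_C R / L Y_C R decomposition."""
    left = os.path.commonprefix([x, y])
    x_rest, y_rest = x[len(left):], y[len(left):]
    right = os.path.commonprefix([x_rest[::-1], y_rest[::-1]])
    return y_rest[:len(y_rest) - len(right)], x_rest[:len(x_rest) - len(right)]

def apply_rule(dictionary, rule):
    src, dst = rule
    return {cid: [t.replace(src, dst) for t in ts] for cid, ts in dictionary.items()}

def discover_rules(dictionary, n_iterations):
    rules, rejected = [], set()
    for _ in range(n_iterations):
        # Step 1: candidates from all pairs of synonymous terms, counted by frequency.
        counts = Counter()
        for terms in dictionary.values():
            for x, y in permutations(set(terms), 2):
                rule = rule_from_pair(x, y)
                if rule[0] and rule not in rejected:
                    counts[rule] += 1
        # Step 2: take the most frequent candidate that lowers the complexity.
        current = complexity(dictionary)
        for rule, _count in counts.most_common():
            candidate = apply_rule(dictionary, rule)
            if current > complexity(candidate):
                rules.append(rule)
                dictionary = candidate  # Step 3: keep the updated dictionary.
                break
            rejected.add(rule)  # Pruning: never retry a candidate that failed once.
        else:
            break  # No remaining candidate lowers the complexity.
    return rules, dictionary

toy = {
    "C1": ["il2", "il-2", "interleukin"],
    "C2": ["il3", "il-3", "interleukin"],
    "C3": ["zfp580", "zfp581", "zinc finger protein"],
}
rules, normalized = discover_rules(toy, n_iterations=1)
print(rules)  # [('-', '')]
```

				<p>On this toy input the hyphen-removal rule is generated twice and every other candidate once, so it is tried first and accepted. The returned list preserves discovery order, which is the order in which the rules would later be applied to new terms.</p>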
			</sec>
		</sec>
		<sec>
			<st>
				<p>Results and discussion</p>
			</st>
			<sec>
				<st>
					<p>Dictionaries</p>
				</st>
				<p>We used two large-scale dictionaries for the experiments to evaluate our rule discovery algorithm. One is a dictionary for human gene/protein names, and the other for disease names.</p>
				<p>The gene/protein name dictionary was created from BioThesaurus <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>, which is a collection of more than two million gene/protein names for various species. We selected only the human genes/proteins by consulting the UniProt database <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> and removed the names that were nonsensical (e.g. IDs for other databases). The resulting dictionary consisted of 14,893 concept IDs and 205,909 terms.</p>
				<p>The disease dictionary was created from UMLS Metathesaurus <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>, which is a large multi-lingual vocabulary database that contains biomedical and health related concepts and their various names. We extracted all entries that were associated with the semantic type &#8220;Disease or Syndrome.&#8221; The resulting dictionary consisted of 48,391 concept IDs and 148,531 terms.</p>
				<p>Table <tblr tid="T2">2</tblr> shows statistics of the dictionaries. Note that the terms in the gene/protein dictionary are highly ambiguous in the first place. For evaluation purposes, we also created a reduced version of each dictionary by removing all ambiguous terms.</p>
				<tbl id="T2" hint_layout="single">
					<title>
						<p>Table 2</p>
					</title>
					<caption>
						<p>Statistics of the dictionaries</p>
					</caption>
					<tblbdy cols="5">
						<r>
							<c>
								<p>Dictionary</p>
							</c>
							<c>
								<p>#Concept IDs</p>
							</c>
							<c>
								<p>#Terms</p>
							</c>
							<c>
								<p>Ambiguity</p>
							</c>
							<c>
								<p>Variability</p>
							</c>
						</r>
						<r>
							<c cspan="5">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>Gene/protein name dictionary (original)</p>
							</c>
							<c>
								<p>14,893</p>
							</c>
							<c>
								<p>205,909</p>
							</c>
							<c>
								<p>5.715</p>
							</c>
							<c>
								<p>13.826</p>
							</c>
						</r>
						<r>
							<c>
								<p>Gene/protein name dictionary (reduced)</p>
							</c>
							<c>
								<p>14,882</p>
							</c>
							<c>
								<p>174,162</p>
							</c>
							<c>
								<p>1.000</p>
							</c>
							<c>
								<p>11.703</p>
							</c>
						</r>
						<r>
							<c>
								<p>Disease dictionary (original)</p>
							</c>
							<c>
								<p>48,391</p>
							</c>
							<c>
								<p>148,531</p>
							</c>
							<c>
								<p>1.005</p>
							</c>
							<c>
								<p>3.069</p>
							</c>
						</r>
						<r>
							<c>
								<p>Disease dictionary (reduced)</p>
							</c>
							<c>
								<p>48,391</p>
							</c>
							<c>
								<p>147,859</p>
							</c>
							<c>
								<p>1.000</p>
							</c>
							<c>
								<p>3.056</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
			</sec>
			<sec>
				<st>
					<p>Evaluation using held-out terms</p>
				</st>
				<p>As discussed in the Methods section, we create normalization rules in such a way that they minimize the variability and ambiguity of the terms in the dictionary. We thus know that they are &#8220;good&#8221; rules for the terms included in the dictionary. It is, however, not clear whether those rules are also appropriate for terms that are not included in the dictionary. In other words, we need to evaluate how well the discovered rules help map <it>unseen</it> terms to their correct concept IDs.</p>
				<p>One way of evaluating such performance is to use a <it>held-out</it> data set. Before executing the rule discovery process, we remove some randomly selected terms from the dictionary and keep them as separate data. We then execute the rule discovery process. The mapping performance is evaluated by also applying the discovered rules to the held-out terms and looking them up in the dictionary, where the lookup system produces, for each held-out term, zero or more concept IDs by exact string matching. The overall lookup performance can be evaluated in terms of precision and recall. Precision is given by</p>
				<p>
					<display-formula id="M4">
						<m:math name="1471-2105-9-S3-S2-i7" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:mi>p</m:mi>
											<m:mi>r</m:mi>
											<m:mi>e</m:mi>
											<m:mi>c</m:mi>
											<m:mi>i</m:mi>
											<m:mi>s</m:mi>
											<m:mi>i</m:mi>
											<m:mi>o</m:mi>
											<m:mi>n</m:mi>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mo>=</m:mo>
									<m:mfrac>
										<m:mrow>
											<m:msub>
												<m:mi>n</m:mi>
												<m:mi>c</m:mi>
											</m:msub>
										</m:mrow>
										<m:mrow>
											<m:msub>
												<m:mi>n</m:mi>
												<m:mi>m</m:mi>
											</m:msub>
										</m:mrow>
									</m:mfrac>
									<m:mo>,</m:mo>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaWaaeWaaeaajugCbiabdchaWjabdkhaYjabdwgaLjabdogaJjabdMgaPjabdohaZjabdMgaPjabd+gaVjabd6gaUbGccaGLOaGaayzkaaGaeyypa0ZaaSaaaeaajugCbiabd6gaUPWaaSbaaSqaaiabdogaJbqabaaakeaajugCbiabd6gaUPWaaSbaaSqaaiabd2gaTbqabaaaaKqzWfGaeiilaWcaaa@495E@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>where <it>n<sub>m</sub></it> is the total number of concept IDs output by the lookup system, and <it>n<sub>c</sub></it> is the total number of correct concept IDs output by the system. Recall is given by</p>
				<p>
					<display-formula id="M5">
						<m:math name="1471-2105-9-S3-S2-i8" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:mi>r</m:mi>
											<m:mi>e</m:mi>
											<m:mi>c</m:mi>
											<m:mi>a</m:mi>
											<m:mi>l</m:mi>
											<m:mi>l</m:mi>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mtext>&#8201;</m:mtext>
									<m:mo>=</m:mo>
									<m:mtext>&#8201;</m:mtext>
									<m:mfrac>
										<m:mrow>
											<m:msub>
												<m:mi>n</m:mi>
												<m:mi>c</m:mi>
											</m:msub>
										</m:mrow>
										<m:mrow>
											<m:msub>
												<m:mi>n</m:mi>
												<m:mi>h</m:mi>
											</m:msub>
										</m:mrow>
									</m:mfrac>
									<m:mo>,</m:mo>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaWaaeWaaeaajugCbiabdkhaYjabdwgaLjabdogaJjabdggaHjabdYgaSjabdYgaSbGccaGLOaGaayzkaaqcLbxacaaMe8Uaeyypa0JaaGjbVRWaaSaaaeaajugCbiabd6gaUPWaaSbaaSqaaGqaciab=ngaJbqabaaakeaajugCbiabd6gaUPWaaSbaaSqaaiabdIgaObqabaaaaKqzWfGaeiilaWcaaa@4911@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>where <it>n<sub>h</sub></it> is the number of held-out terms.</p>
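				<p>A minimal sketch of this evaluation is given below. The function name, the shape of the inputs (a lookup index from normalized term to a set of concept IDs, and a list of held-out term and correct concept ID pairs) and the example data are assumptions made for illustration; returning 0.0 when the system outputs nothing is likewise a convention chosen here.</p>

```python
def evaluate_lookup(index, heldout, normalize):
    """Return (precision, recall) of exact-match lookup on held-out terms.

    index:     dict mapping a normalized term to the set of concept IDs it denotes
    heldout:   list of (term, correct_concept_id) pairs removed before rule discovery
    normalize: function applying the discovered rules, in order, to a term
    """
    n_m = 0  # total number of concept IDs output by the lookup system
    n_c = 0  # number of correct concept IDs among them
    for term, correct_id in heldout:
        found = index.get(normalize(term), set())
        n_m += len(found)
        n_c += correct_id in found
    precision = n_c / n_m if n_m else 0.0
    recall = n_c / len(heldout) if heldout else 0.0
    return precision, recall

# Hypothetical example: lower-casing followed by hyphen removal.
index = {"il2": {"C1"}, "interleukin": {"C1", "C2"}}
heldout = [("IL-2", "C1"), ("Interleukin", "C2"), ("IL-7", "C9")]
precision, recall = evaluate_lookup(index, heldout, lambda t: t.lower().replace("-", ""))
print(precision, recall)
```

				<p>In this example the system outputs three concept IDs, two of which are correct, for three held-out terms, so precision and recall are both 2/3. An ambiguous match lowers precision, while an unmatched term lowers recall.</p>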
				<p>With these performance measures, we carried out several sets of experiments for automatic rule discovery. Throughout all experiments reported in this paper, we set the trade-off parameter &#945; in Equation <formr mid="M3">3</formr> to 0.1. All capital letters in the terms are converted to lower case before applying our rule discovery algorithm.</p>
				<p>Table <tblr tid="T3">3</tblr> shows the result for the human gene/protein dictionary. We used the reduced version of the dictionary in this experiment in order to make clear how normalization affects the precision of lookup (i.e. the lookup precision without normalization is guaranteed to be 100%). We used 1,000 held-out terms for evaluation. The first column shows the iteration count of the rule discovery process. The second and third columns show the ambiguity and variability values of the dictionary. The fourth column shows the rule discovered at that iteration. The fifth and sixth columns show the lookup precision and recall evaluated using the held-out terms.</p>
				<tbl id="T3" hint_layout="single">
					<title>
						<p>Table 3</p>
					</title>
					<caption>
						<p>Discovering rules from a gene/protein dictionary</p>
					</caption>
					<tblbdy cols="6">
						<r>
							<c>
								<p/>
							</c>
							<c cspan="2" ca="center">
								<p>Dictionary</p>
							</c>
							<c>
								<p/>
							</c>
							<c cspan="2" ca="center">
								<p>Lookup performance</p>
							</c>
						</r>
						<r>
							<c cspan="6">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>Iter.</p>
							</c>
							<c>
								<p>Ambiguity</p>
							</c>
							<c>
								<p>Variability</p>
							</c>
							<c>
								<p>Rule</p>
							</c>
							<c>
								<p>Precision</p>
							</c>
							<c>
								<p>Recall</p>
							</c>
						</r>
						<r>
							<c cspan="6">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>0</p>
							</c>
							<c>
								<p>1.004</p>
							</c>
							<c>
								<p>10.399</p>
							</c>
							<c>
								<p>
									<it>(convert capital letters to lower case)</it>
								</p>
							</c>
							<c>
								<p>0.975</p>
							</c>
							<c>
								<p>0.194</p>
							</c>
						</r>
						<r>
							<c>
								<p>1</p>
							</c>
							<c>
								<p>1.006</p>
							</c>
							<c>
								<p>10.101</p>
							</c>
							<c>
								<p>&#8216; &#8217; &#8594; &#8216;-&#8217;</p>
							</c>
							<c>
								<p>0.967</p>
							</c>
							<c>
								<p>0.233</p>
							</c>
						</r>
						<r>
							<c>
								<p>2</p>
							</c>
							<c>
								<p>1.009</p>
							</c>
							<c>
								<p>9.759</p>
							</c>
							<c>
								<p>&#8216;-&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.966</p>
							</c>
							<c>
								<p>0.280</p>
							</c>
						</r>
						<r>
							<c>
								<p>3</p>
							</c>
							<c>
								<p>1.012</p>
							</c>
							<c>
								<p>9.318</p>
							</c>
							<c>
								<p>&#8216;protein&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.958</p>
							</c>
							<c>
								<p>0.340</p>
							</c>
						</r>
						<r>
							<c>
								<p>4</p>
							</c>
							<c>
								<p>1.013</p>
							</c>
							<c>
								<p>9.155</p>
							</c>
							<c>
								<p>&#8216;precursor&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.959</p>
							</c>
							<c>
								<p>0.347</p>
							</c>
						</r>
						<r>
							<c>
								<p>5</p>
							</c>
							<c>
								<p>1.013</p>
							</c>
							<c>
								<p>9.038</p>
							</c>
							<c>
								<p>&#8216;,&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.961</p>
							</c>
							<c>
								<p>0.366</p>
							</c>
						</r>
						<r>
							<c>
								<p>6</p>
							</c>
							<c>
								<p>1.013</p>
							</c>
							<c>
								<p>9.006</p>
							</c>
							<c>
								<p>&#8216;incfinger&#8217; &#8594; &#8216;nf&#8217;</p>
							</c>
							<c>
								<p>0.961</p>
							</c>
							<c>
								<p>0.368</p>
							</c>
						</r>
						<r>
							<c>
								<p>7</p>
							</c>
							<c>
								<p>1.013</p>
							</c>
							<c>
								<p>8.979</p>
							</c>
							<c>
								<p>&#8216;isoforma&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.962</p>
							</c>
							<c>
								<p>0.375</p>
							</c>
						</r>
						<r>
							<c>
								<p>8</p>
							</c>
							<c>
								<p>1.013</p>
							</c>
							<c>
								<p>8.953</p>
							</c>
							<c>
								<p>&#8216;isoformb&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.962</p>
							</c>
							<c>
								<p>0.377</p>
							</c>
						</r>
						<r>
							<c>
								<p>9</p>
							</c>
							<c>
								<p>1.013</p>
							</c>
							<c>
								<p>8.937</p>
							</c>
							<c>
								<p>&#8216;prepro&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.962</p>
							</c>
							<c>
								<p>0.379</p>
							</c>
						</r>
						<r>
							<c>
								<p>10</p>
							</c>
							<c>
								<p>1.013</p>
							</c>
							<c>
								<p>8.916</p>
							</c>
							<c>
								<p>&#8216;ike&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.962</p>
							</c>
							<c>
								<p>0.380</p>
							</c>
						</r>
						<r>
							<c>
								<p>11</p>
							</c>
							<c>
								<p>1.013</p>
							</c>
							<c>
								<p>8.911</p>
							</c>
							<c>
								<p>&#8216;rotocadherin&#8217; &#8594; &#8216;cdh&#8217;</p>
							</c>
							<c>
								<p>0.962</p>
							</c>
							<c>
								<p>0.380</p>
							</c>
						</r>
						<r>
							<c>
								<p>12</p>
							</c>
							<c>
								<p>1.013</p>
							</c>
							<c>
								<p>8.891</p>
							</c>
							<c>
								<p>&#8216;(drosophila)&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.962</p>
							</c>
							<c>
								<p>0.383</p>
							</c>
						</r>
						<r>
							<c>
								<p>13</p>
							</c>
							<c>
								<p>1.013</p>
							</c>
							<c>
								<p>8.873</p>
							</c>
							<c>
								<p>&#8216;variant&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.962</p>
							</c>
							<c>
								<p>0.384</p>
							</c>
						</r>
						<r>
							<c>
								<p>14</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.867</p>
							</c>
							<c>
								<p>&#8216;nterleukin&#8217; &#8594; &#8216;l&#8217;</p>
							</c>
							<c>
								<p>0.962</p>
							</c>
							<c>
								<p>0.384</p>
							</c>
						</r>
						<r>
							<c>
								<p>15</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.857</p>
							</c>
							<c>
								<p>&#8216;drosophilahomologof&#8217; &#8594; &#8216;homolog&#8217;</p>
							</c>
							<c>
								<p>0.963</p>
							</c>
							<c>
								<p>0.385</p>
							</c>
						</r>
						<r>
							<c>
								<p>16</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.846</p>
							</c>
							<c>
								<p>&#8216;coupledrecepto&#8217; &#8594; &#8216;p&#8217;</p>
							</c>
							<c>
								<p>0.963</p>
							</c>
							<c>
								<p>0.387</p>
							</c>
						</r>
						<r>
							<c>
								<p>17</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.830</p>
							</c>
							<c>
								<p>&#8216;(s.cerevisiae)&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.963</p>
							</c>
							<c>
								<p>0.390</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>20</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.805</p>
							</c>
							<c>
								<p>&#8216;oncogene&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.963</p>
							</c>
							<c>
								<p>0.393</p>
							</c>
						</r>
						<r>
							<c>
								<p>21</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.796</p>
							</c>
							<c>
								<p>&#8216;ingfinger&#8217; &#8594; &#8216;nf&#8217;</p>
							</c>
							<c>
								<p>0.963</p>
							</c>
							<c>
								<p>0.394</p>
							</c>
						</r>
						<r>
							<c>
								<p>22</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.790</p>
							</c>
							<c>
								<p>&#8216;isoformc&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.963</p>
							</c>
							<c>
								<p>0.395</p>
							</c>
						</r>
						<r>
							<c>
								<p>23</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.783</p>
							</c>
							<c>
								<p>&#8216;ransmembrane&#8217; &#8594; &#8216;mem&#8217;</p>
							</c>
							<c>
								<p>0.963</p>
							</c>
							<c>
								<p>0.395</p>
							</c>
						</r>
						<r>
							<c>
								<p>24</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.778</p>
							</c>
							<c>
								<p>&#8216;ibosomal&#8217; &#8594; &#8216;p&#8217;</p>
							</c>
							<c>
								<p>0.964</p>
							</c>
							<c>
								<p>0.396</p>
							</c>
						</r>
						<r>
							<c>
								<p>25</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.770</p>
							</c>
							<c>
								<p>&#8216;subunit&#8217; &#8594; &#8216;chain&#8217;</p>
							</c>
							<c>
								<p>0.964</p>
							</c>
							<c>
								<p>0.397</p>
							</c>
						</r>
						<r>
							<c>
								<p>26</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.761</p>
							</c>
							<c>
								<p>&#8216;s.cerevisiaehomologof&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.964</p>
							</c>
							<c>
								<p>0.398</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>34</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.719</p>
							</c>
							<c>
								<p>&#8216;/&#8217; &#8594; &#8216;f&#8217;</p>
							</c>
							<c>
								<p>0.962</p>
							</c>
							<c>
								<p>0.400</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>37</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.703</p>
							</c>
							<c>
								<p>&#8216;hypothetical&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.962</p>
							</c>
							<c>
								<p>0.402</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>41</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.685</p>
							</c>
							<c>
								<p>&#8216;eptid&#8217; &#8594; &#8216;rote&#8217;</p>
							</c>
							<c>
								<p>0.962</p>
							</c>
							<c>
								<p>0.403</p>
							</c>
						</r>
						<r>
							<c>
								<p>42</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.682</p>
							</c>
							<c>
								<p>&#8216;eucinerichrepeatcontaining&#8217; &#8594; &#8216;rrc&#8217;</p>
							</c>
							<c>
								<p>0.962</p>
							</c>
							<c>
								<p>0.403</p>
							</c>
						</r>
						<r>
							<c>
								<p>43</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.678</p>
							</c>
							<c>
								<p>&#8216;betadefensin&#8217; &#8594; &#8216;defb&#8217;</p>
							</c>
							<c>
								<p>0.962</p>
							</c>
							<c>
								<p>0.404</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>57</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.639</p>
							</c>
							<c>
								<p>&#8216;molecule&#8217; &#8594; &#8216;antigen&#8217;</p>
							</c>
							<c>
								<p>0.962</p>
							</c>
							<c>
								<p>0.405</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>62</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.631</p>
							</c>
							<c>
								<p>&#8216;oxonly&#8217; &#8594; &#8216;x&#8217;</p>
							</c>
							<c>
								<p>0.962</p>
							</c>
							<c>
								<p>0.406</p>
							</c>
						</r>
						<r>
							<c>
								<p>63</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.627</p>
							</c>
							<c>
								<p>&#8216;hromosome21openreadingframe&#8217; &#8594; &#8216;21orf&#8217;</p>
							</c>
							<c>
								<p>0.962</p>
							</c>
							<c>
								<p>0.407</p>
							</c>
						</r>
						<r>
							<c>
								<p>64</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.625</p>
							</c>
							<c>
								<p>&#8216;typeicytoskeletal&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.962</p>
							</c>
							<c>
								<p>0.408</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>68</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.611</p>
							</c>
							<c>
								<p>&#8216;member&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.962</p>
							</c>
							<c>
								<p>0.410</p>
							</c>
						</r>
						<r>
							<c>
								<p>69</p>
							</c>
							<c>
								<p>1.014</p>
							</c>
							<c>
								<p>8.587</p>
							</c>
							<c>
								<p>&#8216;lfactoryreceptorfamily&#8217; &#8594; &#8216;r&#8217;</p>
							</c>
							<c>
								<p>0.963</p>
							</c>
							<c>
								<p>0.413</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
				<p>The table shows that lookup recall improved steadily as the discovered rules were applied to the terms, rising from 0.194 at iteration 0 to 0.413 at iteration 69. More importantly, the degradation of precision was kept minimal (from 0.975 to 0.963 over the same iterations). The discovered rules indicate that some technical words such as &#8216;protein&#8217;, &#8216;precursor&#8217;, &#8216;variant&#8217; and &#8216;hypothetical&#8217; are not important in conceptually characterizing a term and thus can be safely removed. The 14th rule concerns the acronym &#8216;il&#8217;: by rewriting &#8216;nterleukin&#8217; as &#8216;l&#8217;, it effectively converts the long form &#8216;interleukin&#8217; into the acronym. Some of the rules capture synonymous expressions. For example, the 25th rule replaces the word &#8216;subunit&#8217; with &#8216;chain&#8217;.</p>
				<p>Table <tblr tid="T4">4</tblr> shows the result for the (reduced) disease dictionary. Again, the discovered rules improved lookup recall (from 0.158 at iteration 0 to 0.388 at iteration 93) without causing a significant deterioration of precision (from 0.994 to 0.987). The rules discovered were very different from those for the gene/protein dictionary. For example, the fourth rule, which removes &#8216;o&#8217;s, enables us to match words in British spelling (e.g. &#8216;oesophageal&#8217;) with their American counterparts (e.g. &#8216;esophageal&#8217;). The fifth rule, which replaces &#8216;ies&#8217; with &#8216;y&#8217;, converts plural forms into singular ones. The 13th rule captures synonymous expressions in medical terminology, namely &#8216;kidniy&#8217; (kidney) and &#8216;rinal&#8217; (renal); note that &#8216;e&#8217;s have already been converted to &#8216;i&#8217;s by a previous rule.</p>
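				<p>As a rough illustration of how such an ordered rule list acts on terms, the following Python sketch applies a hand-picked subset of the rules in Table <tblr tid="T4">4</tblr> (iterations 0, 4, 5, 8 and 13) in sequence. The names RULES and normalize are hypothetical and purely illustrative. The sketch assumes that each rule rewrites every occurrence of its left-hand side, and it omits the intervening rules, so its outputs differ from the fully normalized forms.</p>

```python
# Illustrative subset of the disease-dictionary rules, in discovery order.
# Assumption: each rule is a plain substring replacement applied everywhere.
RULES = [
    ("o", ""),            # iteration 4: remove every 'o'
    ("ies", "y"),         # iteration 5: plural to singular
    ("e", "i"),           # iteration 8: replace every 'e' with 'i'
    ("kidniy", "rinal"),  # iteration 13: 'kidney' and 'renal' as synonyms
]


def normalize(term):
    """Lower-case the term (iteration 0), then apply each rule in order."""
    result = term.lower()
    for source, target in RULES:
        result = result.replace(source, target)
    return result


if __name__ == "__main__":
    pairs = [
        ("Oesophageal cancer", "Esophageal cancer"),
        ("Kidney diseases", "Renal diseases"),
        ("Allergies", "Allergy"),
    ]
    for left, right in pairs:
        # Each pair should collapse to the same normalized string.
        print(normalize(left), "|", normalize(right))
```

				<p>With this subset, &#8216;Oesophageal cancer&#8217; and &#8216;Esophageal cancer&#8217; both map to &#8216;isphagial cancir&#8217;, and &#8216;Kidney diseases&#8217; and &#8216;Renal diseases&#8217; both map to &#8216;rinal disiasis&#8217;. The order of the rules matters: the 13th rule only fires because the earlier rule has already turned &#8216;kidney&#8217; into &#8216;kidniy&#8217;.</p>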
				<tbl id="T4" hint_layout="single">
					<title>
						<p>Table 4</p>
					</title>
					<caption>
						<p>Discovering rules from a disease dictionary</p>
					</caption>
					<tblbdy cols="6">
						<r>
							<c>
								<p/>
							</c>
							<c cspan="2" ca="center">
								<p>Dictionary</p>
							</c>
							<c>
								<p/>
							</c>
							<c cspan="2" ca="center">
								<p>Lookup performance</p>
							</c>
						</r>
						<r>
							<c cspan="6">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>Iter.</p>
							</c>
							<c>
								<p>Ambiguity</p>
							</c>
							<c>
								<p>Variability</p>
							</c>
							<c>
								<p>Rule</p>
							</c>
							<c>
								<p>Precision</p>
							</c>
							<c>
								<p>Recall</p>
							</c>
						</r>
						<r>
							<c cspan="6">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>0</p>
							</c>
							<c>
								<p>1.001</p>
							</c>
							<c>
								<p>2.794</p>
							</c>
							<c>
								<p>
									<it>(convert capital letters to lower case)</it>
								</p>
							</c>
							<c>
								<p>0.994</p>
							</c>
							<c>
								<p>0.158</p>
							</c>
						</r>
						<r>
							<c>
								<p>1</p>
							</c>
							<c>
								<p>1.002</p>
							</c>
							<c>
								<p>2.747</p>
							</c>
							<c>
								<p>&#8216;,&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.989</p>
							</c>
							<c>
								<p>0.184</p>
							</c>
						</r>
						<r>
							<c>
								<p>2</p>
							</c>
							<c>
								<p>1.002</p>
							</c>
							<c>
								<p>2.667</p>
							</c>
							<c>
								<p>&#8216; nos&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.986</p>
							</c>
							<c>
								<p>0.216</p>
							</c>
						</r>
						<r>
							<c>
								<p>3</p>
							</c>
							<c>
								<p>1.003</p>
							</c>
							<c>
								<p>2.609</p>
							</c>
							<c>
								<p>&#8216;[x]&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.985</p>
							</c>
							<c>
								<p>0.263</p>
							</c>
						</r>
						<r>
							<c>
								<p>4</p>
							</c>
							<c>
								<p>1.003</p>
							</c>
							<c>
								<p>2.580</p>
							</c>
							<c>
								<p>&#8216;o&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.982</p>
							</c>
							<c>
								<p>0.275</p>
							</c>
						</r>
						<r>
							<c>
								<p>5</p>
							</c>
							<c>
								<p>1.003</p>
							</c>
							<c>
								<p>2.554</p>
							</c>
							<c>
								<p>&#8216;ies&#8217; &#8594; &#8216;y&#8217;</p>
							</c>
							<c>
								<p>0.983</p>
							</c>
							<c>
								<p>0.291</p>
							</c>
						</r>
						<r>
							<c>
								<p>6</p>
							</c>
							<c>
								<p>1.003</p>
							</c>
							<c>
								<p>2.529</p>
							</c>
							<c>
								<p>&#8216; &#8217; &#8594; &#8216;-&#8217;</p>
							</c>
							<c>
								<p>0.984</p>
							</c>
							<c>
								<p>0.305</p>
							</c>
						</r>
						<r>
							<c>
								<p>7</p>
							</c>
							<c>
								<p>1.003</p>
							</c>
							<c>
								<p>2.504</p>
							</c>
							<c>
								<p>&#8216;-&#8217; &#8594; &#8216;;&#8217;</p>
							</c>
							<c>
								<p>0.984</p>
							</c>
							<c>
								<p>0.317</p>
							</c>
						</r>
						<r>
							<c>
								<p>8</p>
							</c>
							<c>
								<p>1.003</p>
							</c>
							<c>
								<p>2.484</p>
							</c>
							<c>
								<p>&#8216;e&#8217; &#8594; &#8216;i&#8217;</p>
							</c>
							<c>
								<p>0.985</p>
							</c>
							<c>
								<p>0.332</p>
							</c>
						</r>
						<r>
							<c>
								<p>9</p>
							</c>
							<c>
								<p>1.004</p>
							</c>
							<c>
								<p>2.472</p>
							</c>
							<c>
								<p>&#8216;iasi&#8217; &#8594; &#8216;rdir&#8217;</p>
							</c>
							<c>
								<p>0.986</p>
							</c>
							<c>
								<p>0.336</p>
							</c>
						</r>
						<r>
							<c>
								<p>10</p>
							</c>
							<c>
								<p>1.004</p>
							</c>
							<c>
								<p>2.459</p>
							</c>
							<c>
								<p>&#8216;&#8217;s&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.986</p>
							</c>
							<c>
								<p>0.345</p>
							</c>
						</r>
						<r>
							<c>
								<p>11</p>
							</c>
							<c>
								<p>1.004</p>
							</c>
							<c>
								<p>2.449</p>
							</c>
							<c>
								<p>&#8216;s&#8217; &#8594; &#8216;z&#8217;</p>
							</c>
							<c>
								<p>0.986</p>
							</c>
							<c>
								<p>0.347</p>
							</c>
						</r>
						<r>
							<c>
								<p>12</p>
							</c>
							<c>
								<p>1.004</p>
							</c>
							<c>
								<p>2.448</p>
							</c>
							<c>
								<p>&#8216;;(nz)&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.986</p>
							</c>
							<c>
								<p>0.347</p>
							</c>
						</r>
						<r>
							<c>
								<p>13</p>
							</c>
							<c>
								<p>1.004</p>
							</c>
							<c>
								<p>2.447</p>
							</c>
							<c>
								<p>&#8216;kidniy&#8217; &#8594; &#8216;rinal&#8217;</p>
							</c>
							<c>
								<p>0.986</p>
							</c>
							<c>
								<p>0.347</p>
							</c>
						</r>
						<r>
							<c>
								<p>14</p>
							</c>
							<c>
								<p>1.004</p>
							</c>
							<c>
								<p>2.446</p>
							</c>
							<c>
								<p>&#8216;pulmnary&#8217; &#8594; &#8216;lung&#8217;</p>
							</c>
							<c>
								<p>0.986</p>
							</c>
							<c>
								<p>0.347</p>
							</c>
						</r>
						<r>
							<c>
								<p>15</p>
							</c>
							<c>
								<p>1.004</p>
							</c>
							<c>
								<p>2.443</p>
							</c>
							<c>
								<p>&#8216;ir&#8217; &#8594; &#8216;ri&#8217;</p>
							</c>
							<c>
								<p>0.986</p>
							</c>
							<c>
								<p>0.348</p>
							</c>
						</r>
						<r>
							<c>
								<p>16</p>
							</c>
							<c>
								<p>1.004</p>
							</c>
							<c>
								<p>2.441</p>
							</c>
							<c>
								<p>&#8216;aimia&#8217; &#8594; &#8216;imiaz&#8217;</p>
							</c>
							<c>
								<p>0.986</p>
							</c>
							<c>
								<p>0.349</p>
							</c>
						</r>
						<r>
							<c>
								<p>17</p>
							</c>
							<c>
								<p>1.004</p>
							</c>
							<c>
								<p>2.439</p>
							</c>
							<c>
								<p>&#8216;[d]&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.986</p>
							</c>
							<c>
								<p>0.349</p>
							</c>
						</r>
						<r>
							<c>
								<p>18</p>
							</c>
							<c>
								<p>1.004</p>
							</c>
							<c>
								<p>2.436</p>
							</c>
							<c>
								<p>&#8216;aimlytic;animiaz&#8217; &#8594; &#8216;imlytic;animia&#8217;</p>
							</c>
							<c>
								<p>0.986</p>
							</c>
							<c>
								<p>0.351</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>24</p>
							</c>
							<c>
								<p>1.004</p>
							</c>
							<c>
								<p>2.427</p>
							</c>
							<c>
								<p>&#8216;z;thi&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.986</p>
							</c>
							<c>
								<p>0.354</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>31</p>
							</c>
							<c>
								<p>1.004</p>
							</c>
							<c>
								<p>2.420</p>
							</c>
							<c>
								<p>&#8216;z;&#8217; &#8594; &#8216;/&#8217;</p>
							</c>
							<c>
								<p>0.986</p>
							</c>
							<c>
								<p>0.355</p>
							</c>
						</r>
						<r>
							<c>
								<p>32</p>
							</c>
							<c>
								<p>1.004</p>
							</c>
							<c>
								<p>2.348</p>
							</c>
							<c>
								<p>&#8216;/&#8217; &#8594; &#8216;;&#8217;</p>
							</c>
							<c>
								<p>0.987</p>
							</c>
							<c>
								<p>0.377</p>
							</c>
						</r>
						<r>
							<c>
								<p>33</p>
							</c>
							<c>
								<p>1.004</p>
							</c>
							<c>
								<p>2.348</p>
							</c>
							<c>
								<p>&#8216;dizrdri;liv&#8217; &#8594; &#8216;livri;dizrd&#8217;</p>
							</c>
							<c>
								<p>0.987</p>
							</c>
							<c>
								<p>0.377</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>38</p>
							</c>
							<c>
								<p>1.004</p>
							</c>
							<c>
								<p>2.345</p>
							</c>
							<c>
								<p>&#8216;uding&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.987</p>
							</c>
							<c>
								<p>0.378</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>42</p>
							</c>
							<c>
								<p>1.005</p>
							</c>
							<c>
								<p>2.343</p>
							</c>
							<c>
								<p>&#8216;zufficiincy&#8217; &#8594; &#8216;cmpitinci&#8217;</p>
							</c>
							<c>
								<p>0.987</p>
							</c>
							<c>
								<p>0.380</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>50</p>
							</c>
							<c>
								<p>1.005</p>
							</c>
							<c>
								<p>2.339</p>
							</c>
							<c>
								<p>&#8216;(in;zputum)&#8217; &#8594; &#8216;in;zputum&#8217;</p>
							</c>
							<c>
								<p>0.987</p>
							</c>
							<c>
								<p>0.381</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>57</p>
							</c>
							<c>
								<p>1.005</p>
							</c>
							<c>
								<p>2.335</p>
							</c>
							<c>
								<p>&#8216;iincy&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.987</p>
							</c>
							<c>
								<p>0.382</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>70</p>
							</c>
							<c>
								<p>1.005</p>
							</c>
							<c>
								<p>2.333</p>
							</c>
							<c>
								<p>&#8216;[idta]&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.987</p>
							</c>
							<c>
								<p>0.385</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>89</p>
							</c>
							<c>
								<p>1.005</p>
							</c>
							<c>
								<p>2.327</p>
							</c>
							<c>
								<p>&#8216;ph&#8217; &#8594; &#8216;f&#8217;</p>
							</c>
							<c>
								<p>0.987</p>
							</c>
							<c>
								<p>0.387</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>93</p>
							</c>
							<c>
								<p>1.005</p>
							</c>
							<c>
								<p>2.325</p>
							</c>
							<c>
								<p>&#8216;ci&#8217; &#8594; &#8216;x&#8217;</p>
							</c>
							<c>
								<p>0.987</p>
							</c>
							<c>
								<p>0.388</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
			</sec>
			<sec>
				<st>
					<p>Evaluation using MEDLINE snippets</p>
				</st>
				<p>In the experiments presented in the previous section, we demonstrated that the normalization rules discovered by our algorithm also work well for unseen terms. It is, however, still not entirely clear how useful and safe those rules are. Although we used heldout data for evaluation, the heldout terms might be too similar in nature to the remaining terms in the dictionary, and thus we cannot rule out the possibility that the rules were actually overfitting the data. Moreover, the distribution of the terms in the dictionary is different from that of the terms appearing in real text, so rules that are harmless within the dictionary might introduce ambiguity when applied to terms in text.</p>
				<p>To confirm the effectiveness of our normalization method, we need evaluation data that stem from real text rather than a dictionary. Fortunately, the BioCreAtIvE II gene normalization task <abbrgrp><abbr bid="B24">24</abbr></abbrgrp> provides data which can be used for our experiments. The data (the &#8220;training.genelist&#8221; file) include gene/protein name snippets extracted from MEDLINE abstracts, and each snippet is assigned an EntrezGene ID. Table <tblr tid="T5">5</tblr> shows some examples of the snippets. This evaluation setting can be seen as a situation in which we have a named entity recognizer that can <it>perfectly</it> identify the regions of gene/protein names in text. We converted the EntrezGene IDs to UniProt IDs so that they can be compared to the IDs in our human gene/protein dictionary. The resulting evaluation data consisted of 965 gene/protein name snippets and their IDs (there were 33 EntrezGene IDs that we failed to convert to UniProt IDs).</p>
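				<p>The evaluation setting can be sketched in Python as follows. This is a toy example under stated assumptions: the dictionary entries, snippets and identifiers (G1, G2, and so on) are invented, the normalizer applies only the first three rules discovered for the gene/protein dictionary (lower-casing, &#8216; &#8217; &#8594; &#8216;-&#8217;, &#8216;-&#8217; &#8594; &#8216;&#8217;), and precision and recall follow one plausible definition (correct identifiers over returned identifiers, and correctly resolved snippets over all snippets), which may differ in detail from the one used for the reported figures.</p>

```python
from collections import defaultdict

# Hypothetical toy data; the identifiers are placeholders, not real IDs.
DICTIONARY = [("Ah receptor", "G1"), ("VHR", "G2"), ("VH-R", "G3")]
SNIPPETS = [
    ("ah-receptor", "G1"),
    ("VHR", "G2"),
    ("keratinocyte growth factor receptor", "G4"),
]


def normalize(term):
    """Lower-case, turn spaces into hyphens, then drop hyphens."""
    return term.lower().replace(" ", "-").replace("-", "")


def build_index(dictionary):
    """Map each normalized dictionary term to the set of IDs it denotes."""
    index = defaultdict(set)
    for term, concept_id in dictionary:
        index[normalize(term)].add(concept_id)
    return index


def evaluate(index, snippets):
    """Return (precision, recall) under the assumed definitions."""
    returned = 0  # total number of IDs returned by lookup
    correct = 0   # number of snippets whose gold ID was returned
    for text, gold_id in snippets:
        ids = index.get(normalize(text), set())
        returned += len(ids)
        if gold_id in ids:
            correct += 1
    precision = correct / returned if returned else 0.0
    recall = correct / len(snippets) if snippets else 0.0
    return precision, recall


index = build_index(DICTIONARY)
precision, recall = evaluate(index, SNIPPETS)
print(round(precision, 3), round(recall, 3))
```

				<p>In this toy setting, normalization lets &#8216;ah-receptor&#8217; match &#8216;Ah receptor&#8217; (a gain in recall), but it also merges &#8216;VHR&#8217; and &#8216;VH-R&#8217;, so a lookup for &#8216;VHR&#8217; returns two identifiers (a loss in precision). This is the trade-off between variability and ambiguity that the evaluation measures.</p>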
				<tbl id="T5" hint_layout="single">
					<title>
						<p>Table 5</p>
					</title>
					<caption>
						<p>Gene/protein name snippets. </p>
						<p>Examples of the gene/protein name snippets used in the lookup experiments reported in Tables <tblr tid="T6">6</tblr> and <tblr tid="T7">7</tblr>. The snippets are indicated in boldface type.</p>
					</caption>
					<tblbdy cols="2">
						<r>
							<c>
								<p>Snippets in context</p>
							</c>
							<c>
								<p>EntrezGene IDs</p>
							</c>
						</r>
						<r>
							<c cspan="2">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>&#8230; conserved in <b>VH1</b> and the <b>VH1-related (VHR) human protein</b>.</p>
							</c>
							<c>
								<p>1845</p>
							</c>
						</r>
						<r>
							<c>
								<p>These properties suggest that <b>VHR</b> is capable of regulating intracellular &#8230;</p>
							</c>
							<c>
								<p>1845</p>
							</c>
						</r>
						<r>
							<c>
								<p>&#8230; the kinase domain of the <b>keratinocyte growth factor receptor</b> ( &#8230;</p>
							</c>
							<c>
								<p>2263</p>
							</c>
						</r>
						<r>
							<c>
								<p>&#8230; (<b>bek/fibroblast growth factor receptor 2</b>) were infected with &#8230;</p>
							</c>
							<c>
								<p>2263</p>
							</c>
						</r>
						<r>
							<c>
								<p>The <b>Ah (dioxin) receptor</b> binds a number of widely disseminated &#8230;</p>
							</c>
							<c>
								<p>196</p>
							</c>
						</r>
						<r>
							<c>
								<p>&#8230; as a component of the DNA binding form of the <b>Ah receptor</b>.</p>
							</c>
							<c>
								<p>196</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
				<p>With this evaluation data, we ran experiments using our gene/protein name dictionary (not the reduced version). The results are shown in Table <tblr tid="T6">6</tblr>. Again, the discovered rules improved the recall of lookup (from 0.582 to 0.633 after 100 rules) with only a small loss of precision (from 0.782 to 0.767). The main reason why the improvement in recall is smaller than in Tables <tblr tid="T3">3</tblr> and <tblr tid="T4">4</tblr> is that, unlike the heldout terms, many of the snippets can be mapped to terms in the dictionary without any normalization. The useful rules were slightly different from the ones in Table <tblr tid="T3">3</tblr>. For example, the 38th rule, in effect, converts &#8216;receptor&#8217; to &#8216;r&#8217;. The 44th rule converts &#8216;alpha&#8217; to &#8216;a&#8217;. The Roman numeral &#8216;i&#8217; is converted to its Arabic counterpart &#8216;1&#8217; by the 75th rule.</p>
				<tbl id="T6" hint_layout="single">
					<title>
						<p>Table 6</p>
					</title>
					<caption>
						<p>Evaluation using gene/protein name snippets from MEDLINE abstracts</p>
					</caption>
					<tblbdy cols="6">
						<r>
							<c>
								<p/>
							</c>
							<c cspan="2" ca="center">
								<p>Dictionary</p>
							</c>
							<c>
								<p/>
							</c>
							<c cspan="2" ca="center">
								<p>Lookup performance</p>
							</c>
						</r>
						<r>
							<c cspan="6">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>Iter.</p>
							</c>
							<c>
								<p>Ambiguity</p>
							</c>
							<c>
								<p>Variability</p>
							</c>
							<c>
								<p>Rule</p>
							</c>
							<c>
								<p>Precision</p>
							</c>
							<c>
								<p>Recall</p>
							</c>
						</r>
						<r>
							<c cspan="6">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>0</p>
							</c>
							<c>
								<p>5.797</p>
							</c>
							<c>
								<p>12.479</p>
							</c>
							<c>
								<p>
									<it>(convert capital letters to lower case)</it>
								</p>
							</c>
							<c>
								<p>0.782</p>
							</c>
							<c>
								<p>0.582</p>
							</c>
						</r>
						<r>
							<c>
								<p>1</p>
							</c>
							<c>
								<p>5.807</p>
							</c>
							<c>
								<p>12.161</p>
							</c>
							<c>
								<p>&#8216;-&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.766</p>
							</c>
							<c>
								<p>0.603</p>
							</c>
						</r>
						<r>
							<c>
								<p>2</p>
							</c>
							<c>
								<p>5.811</p>
							</c>
							<c>
								<p>12.025</p>
							</c>
							<c>
								<p>&#8216; precursor&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.767</p>
							</c>
							<c>
								<p>0.611</p>
							</c>
						</r>
						<r>
							<c>
								<p>3</p>
							</c>
							<c>
								<p>5.812</p>
							</c>
							<c>
								<p>11.941</p>
							</c>
							<c>
								<p>&#8216;,&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.767</p>
							</c>
							<c>
								<p>0.611</p>
							</c>
						</r>
						<r>
							<c>
								<p>4</p>
							</c>
							<c>
								<p>5.812</p>
							</c>
							<c>
								<p>11.907</p>
							</c>
							<c>
								<p>&#8216;inc finger protein&#8217; &#8594; &#8216;nf&#8217;</p>
							</c>
							<c>
								<p>0.767</p>
							</c>
							<c>
								<p>0.611</p>
							</c>
						</r>
						<r>
							<c>
								<p>5</p>
							</c>
							<c>
								<p>5.812</p>
							</c>
							<c>
								<p>11.868</p>
							</c>
							<c>
								<p>&#8216; isoform 1&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.767</p>
							</c>
							<c>
								<p>0.611</p>
							</c>
						</r>
						<r>
							<c>
								<p>6</p>
							</c>
							<c>
								<p>5.813</p>
							</c>
							<c>
								<p>11.832</p>
							</c>
							<c>
								<p>&#8216; isoform 2&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.766</p>
							</c>
							<c>
								<p>0.611</p>
							</c>
						</r>
						<r>
							<c>
								<p>7</p>
							</c>
							<c>
								<p>5.813</p>
							</c>
							<c>
								<p>11.806</p>
							</c>
							<c>
								<p>&#8216; isoform a&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.766</p>
							</c>
							<c>
								<p>0.611</p>
							</c>
						</r>
						<r>
							<c>
								<p>8</p>
							</c>
							<c>
								<p>5.813</p>
							</c>
							<c>
								<p>11.781</p>
							</c>
							<c>
								<p>&#8216; isoform b&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.766</p>
							</c>
							<c>
								<p>0.611</p>
							</c>
						</r>
						<r>
							<c>
								<p>9</p>
							</c>
							<c>
								<p>5.813</p>
							</c>
							<c>
								<p>11.748</p>
							</c>
							<c>
								<p>&#8216; containing protein&#8217; &#8594; &#8216;containing&#8217;</p>
							</c>
							<c>
								<p>0.766</p>
							</c>
							<c>
								<p>0.611</p>
							</c>
						</r>
						<r>
							<c>
								<p>10</p>
							</c>
							<c>
								<p>5.813</p>
							</c>
							<c>
								<p>11.730</p>
							</c>
							<c>
								<p>&#8216; variant&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.766</p>
							</c>
							<c>
								<p>0.611</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>21</p>
							</c>
							<c>
								<p>5.815</p>
							</c>
							<c>
								<p>11.597</p>
							</c>
							<c>
								<p>&#8216;nterleukin&#8217; &#8594; &#8216;l&#8217;</p>
							</c>
							<c>
								<p>0.767</p>
							</c>
							<c>
								<p>0.613</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>24</p>
							</c>
							<c>
								<p>5.816</p>
							</c>
							<c>
								<p>11.566</p>
							</c>
							<c>
								<p>&#8216;specific&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.767</p>
							</c>
							<c>
								<p>0.615</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>33</p>
							</c>
							<c>
								<p>5.816</p>
							</c>
							<c>
								<p>11.450</p>
							</c>
							<c>
								<p>&#8216;protein&#8217; &#8594; &#8216;gene&#8217;</p>
							</c>
							<c>
								<p>0.765</p>
							</c>
							<c>
								<p>0.616</p>
							</c>
						</r>
						<r>
							<c>
								<p>34</p>
							</c>
							<c>
								<p>5.828</p>
							</c>
							<c>
								<p>11.056</p>
							</c>
							<c>
								<p>&#8216; gene&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.765</p>
							</c>
							<c>
								<p>0.619</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>38</p>
							</c>
							<c>
								<p>5.829</p>
							</c>
							<c>
								<p>11.016</p>
							</c>
							<c>
								<p>&#8216; recepto&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.767</p>
							</c>
							<c>
								<p>0.623</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>44</p>
							</c>
							<c>
								<p>5.830</p>
							</c>
							<c>
								<p>10.970</p>
							</c>
							<c>
								<p>&#8216; alph&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.765</p>
							</c>
							<c>
								<p>0.625</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>75</p>
							</c>
							<c>
								<p>5.831</p>
							</c>
							<c>
								<p>10.838</p>
							</c>
							<c>
								<p>&#8216; i&#8217; &#8594; &#8216;1&#8217;</p>
							</c>
							<c>
								<p>0.766</p>
							</c>
							<c>
								<p>0.626</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>84</p>
							</c>
							<c>
								<p>5.831</p>
							</c>
							<c>
								<p>10.790</p>
							</c>
							<c>
								<p>&#8216; lpha&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.766</p>
							</c>
							<c>
								<p>0.627</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>86</p>
							</c>
							<c>
								<p>5.831</p>
							</c>
							<c>
								<p>10.782</p>
							</c>
							<c>
								<p>&#8216; beta&#8217; &#8594; &#8216;b&#8217;</p>
							</c>
							<c>
								<p>0.767</p>
							</c>
							<c>
								<p>0.630</p>
							</c>
						</r>
						<r>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
							<c>
								<p>:</p>
							</c>
						</r>
						<r>
							<c>
								<p>100</p>
							</c>
							<c>
								<p>5.832</p>
							</c>
							<c>
								<p>10.732</p>
							</c>
							<c>
								<p>&#8216; type&#8217; &#8594; &#8216;&#8217;</p>
							</c>
							<c>
								<p>0.767</p>
							</c>
							<c>
								<p>0.633</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
			</sec>
			<sec>
				<st>
					<p>Lookup performance</p>
				</st>
				<p>The greatest advantage of the normalization approach is the speed of dictionary lookup. Once we normalize the terms in the dictionary and the input term, we can use a <it>hashing</it> technique to look the term up in constant time regardless of the dictionary size. The cost of normalizing the terms in the dictionary is not a problem, since this is done before the text is processed. In contrast, the cost of normalizing the input term could be an issue, because we need to invoke the normalization process every time we come across a term in the course of text processing.</p>
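				<p>The lookup scheme described above can be sketched as follows. This is only an illustration in Python, not our C++ implementation: the function names, the three-rule list and the dictionary entries (with placeholder IDs) are hypothetical, and the rules are applied as ordered substring replacements after lower-casing, in the style of Table 6.</p>

```python
# Illustrative sketch only: the rule list, terms and IDs below are hypothetical.

def normalize(term, rules):
    """Lower-case the term, then apply each (source, target) replacement in order."""
    text = term.lower()
    for source, target in rules:
        text = text.replace(source, target)
    return text


def build_index(dictionary, rules):
    """Normalize every dictionary term once and map it to the IDs that share it."""
    index = {}
    for term, concept_id in dictionary:
        index.setdefault(normalize(term, rules), set()).add(concept_id)
    return index


def lookup(term, index, rules):
    """Normalize the input term and fetch its IDs with a single hash probe."""
    return index.get(normalize(term, rules), set())


# Three rules in the style of Table 6, kept in discovery order.
RULES = [("-", ""), (" recepto", ""), (" alph", "")]

# Hypothetical dictionary entries: (term, placeholder concept ID).
DICTIONARY = [
    ("IL-2 receptor alpha", "ID_A"),
    ("IL2Ra", "ID_A"),
    ("Ah receptor", "ID_B"),
]

INDEX = build_index(DICTIONARY, RULES)
print(normalize("IL-2 receptor alpha", RULES))
print(sorted(lookup("il2 receptor alpha", INDEX, RULES)))
```

				<p>In this sketch the dictionary is normalized once when the index is built, so the per-term cost at text-processing time is one pass over the rule list plus one hash probe.</p>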
				<p>To measure the computational overhead of normalization, we carried out experiments using the same dictionary and evaluation data as in the above experiment. We implemented the methods in C++ and ran the experiments on AMD Opteron 2.2 GHz servers.</p>
				<p>Table <tblr tid="T7">7</tblr> shows the results. The bottom row shows the result of our automatic normalization method, in which we used the 100 normalization rules discovered by the algorithm. The application of 100 rules made the lookup process about four times slower than the case without any normalization (29 versus 7 microseconds per lookup). Note, however, that it is still more than ten thousand times faster than the soft matching cases, in which a simple character-level bigram similarity was employed. A lookup time of 0.67 seconds with soft matching may not appear to be hugely problematic, but it is not a desirable speed when we want to process a large amount of text or when real-time processing is required (recall that we used only the human gene/protein dictionary in this experiment, which is a tiny fraction of the biomedical terminology).</p>
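				<p>The speed ratios quoted above, and the F-score reported in Table 7, can be re-derived from the table entries. The short Python check below is an illustrative sketch; the variable names are our own, and the inputs are the rounded values printed in the table, so F-scores recomputed this way may differ from the tabulated ones in the last digit.</p>

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Average lookup times in microseconds, copied from Table 7.
soft_matching_us = 6.7e5   # bigram similarity, threshold 0.97
automatic_us = 29          # automatic normalization with 100 rules
no_normalization_us = 7    # exact matching without normalization

# F-score of automatic normalization from its precision and recall.
print(round(f_score(0.767, 0.633), 3))

# Soft matching expressed in seconds per lookup.
print(soft_matching_us / 1e6)

# How many times faster automatic normalization is than soft matching.
print(round(soft_matching_us / automatic_us))

# How many times slower automatic normalization is than no normalization.
print(round(automatic_us / no_normalization_us, 1))
```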
				<tbl id="T7" hint_layout="single">
					<title>
						<p>Table 7</p>
					</title>
					<caption>
						<p>Dictionary lookup performance. </p>
						<p>This table shows the speed and accuracy of dictionary lookup tasks using the human gene/protein dictionary and gene/protein name snippets. F-score is the harmonic mean of precision and recall. The values in parentheses are the threshold values used in soft string matching.</p>
					</caption>
					<tblbdy cols="5">
						<r>
							<c>
								<p>Method</p>
							</c>
							<c>
								<p>Precision</p>
							</c>
							<c>
								<p>Recall</p>
							</c>
							<c>
								<p>F-score</p>
							</c>
							<c>
								<p>Average lookup time (microseconds)</p>
							</c>
						</r>
						<r>
							<c cspan="5">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>Bigram similarity (0.97)</p>
							</c>
							<c>
								<p>0.758</p>
							</c>
							<c>
								<p>0.587</p>
							</c>
							<c>
								<p>0.661</p>
							</c>
							<c>
								<p>6.7 &#215; 10<sup>5</sup></p>
							</c>
						</r>
						<r>
							<c>
								<p>Bigram similarity (0.95)</p>
							</c>
							<c>
								<p>0.691</p>
							</c>
							<c>
								<p>0.592</p>
							</c>
							<c>
								<p>0.638</p>
							</c>
							<c>
								<p>6.8 &#215; 10<sup>5</sup></p>
							</c>
						</r>
						<r>
							<c>
								<p>Bigram similarity (0.93)</p>
							</c>
							<c>
								<p>0.612</p>
							</c>
							<c>
								<p>0.610</p>
							</c>
							<c>
								<p>0.611</p>
							</c>
							<c>
								<p>6.8 &#215; 10<sup>5</sup></p>
							</c>
						</r>
						<r>
							<c>
								<p>No normalization</p>
							</c>
							<c>
								<p>0.809</p>
							</c>
							<c>
								<p>0.502</p>
							</c>
							<c>
								<p>0.619</p>
							</c>
							<c>
								<p>7</p>
							</c>
						</r>
						<r>
							<c>
								<p>Case normalization</p>
							</c>
							<c>
								<p>0.782</p>
							</c>
							<c>
								<p>0.582</p>
							</c>
							<c>
								<p>0.666</p>
							</c>
							<c>
								<p>8</p>
							</c>
						</r>
						<r>
							<c>
								<p>Heuristic normalization <abbrgrp><abbr bid="B18">18</abbr></abbrgrp></p>
							</c>
							<c>
								<p>0.730</p>
							</c>
							<c>
								<p>0.657</p>
							</c>
							<c>
								<p>0.692</p>
							</c>
							<c>
								<p>8</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<b>Automatic normalization</b>
								</p>
							</c>
							<c>
								<p>0.767</p>
							</c>
							<c>
								<p>0.633</p>
							</c>
							<c>
								<p>0.694</p>
							</c>
							<c>
								<p>29</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
				<p>We should nevertheless emphasize that the purpose of this work is not to claim that our automatic term normalization approach is superior to soft string matching approaches. Soft matching methods have the distinct advantage of being able to output similarity scores for matched terms. Soft matching is also, in general, more robust to various transformations than normalization approaches, and its heavy computational cost is not a problem in certain applications. Soft matching and normalization are, in fact, complementary.</p>
				<p>The table also shows the performance achieved by the heuristic rules given in Fang et al. <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. Their normalization consists of case normalization, replacement of hyphens with spaces, removal of punctuation, removal of parenthesized material, and removal of spaces. It gave better recall than our system at the price of lower precision. Among their normalization rules, removal of parenthesized material is particularly interesting, because this rule can never be produced by our algorithm. This is an instance of the &#8220;clever&#8221; rules that are difficult to discover without the help of human knowledge.</p>
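				<p>For illustration, the five heuristic steps listed above could be composed as in the following Python sketch. It is not the implementation of Fang et al.: the function name, the order of the steps, the regular expression for parenthesized material (which does not handle nested parentheses) and the ASCII-only notion of punctuation are all assumptions made here. Parenthesized material is removed before punctuation so that the parentheses are still present when that step runs.</p>

```python
import re
import string


def heuristic_normalize(term):
    """Apply the five heuristic steps named in the text, in one plausible order."""
    text = term.lower()                      # case normalization
    text = re.sub(r"\([^)]*\)", "", text)    # drop parenthesized material
    text = text.replace("-", " ")            # hyphens become spaces
    # Remove the remaining ASCII punctuation characters.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = "".join(text.split())             # remove all whitespace
    return text


# Two snippets from Table 5 that share an EntrezGene ID.
print(heuristic_normalize("Ah (dioxin) receptor"))
print(heuristic_normalize("Ah receptor"))
```

				<p>Both snippets normalize to the same string under this sketch, which is the kind of match that a character-level replacement rule cannot produce, because the parenthesized content varies from term to term.</p>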
				<p>We conducted a brief error analysis on the results of this mapping task to see what types of term variation were yet to be captured by the system. Somewhat surprisingly, there were still many terms that could be mapped via character-level replacement rules. This indicates that we could improve the rule discovery process by employing a more sophisticated method to explore the hypothesis space. Our rule discovery algorithm has some commonalities with Transformation Based Learning (TBL) <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>, so the approaches proposed to improve the training process in TBL (e.g. <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> and <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>) may also be useful in pursuing this direction. The other types of unresolved variation include different word ordering (e.g. &#8216;IgA Fc receptor&#8217; and &#8216;Fc receptor for IgA&#8217;) and coordination (e.g. &#8216;ZNF133, 136 and 140&#8217;).</p>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Conclusions</p>
			</st>
			<p>Developing good heuristics for term normalization requires extensive knowledge of the terminology in question, and it is the bottleneck of normalization approaches for term-concept mapping tasks. In this paper, we have shown that the automatic development of normalization rules is a viable solution to the problem, by presenting an algorithm that can discover effective normalization rules from a dictionary. The algorithm is easy to implement and efficient enough that it is applicable to large dictionaries. Experimental results using a human gene/protein dictionary and a disease dictionary have shown that the automatically discovered rules can improve recall without a significant loss of precision in term-concept mapping tasks. This work should be particularly useful for terminologies for which good normalization rules are not fully known.</p>
			<p>In this work, we limited the type of normalization rules to character-level replacement. There are, however, many good heuristics that cannot be captured in this framework. Extending the scope of normalization rules to more flexible expressions is certainly an interesting direction of future work.</p>
		</sec>
		<sec>
			<st>
				<p>Competing interests</p>
			</st>
			<p>The authors declare that they have no competing interests.</p>
		</sec>
		<sec>
			<st>
				<p>Authors' contributions</p>
			</st>
			<p>YT developed the algorithm, carried out the experiments and drafted the manuscript. JM and SA conceived the study and participated in its design and coordination. All authors read and approved the final manuscript.</p>
		</sec>
	</bdy>
	<bm>
		<ack>
			<sec>
				<st>
					<p>Acknowledgements</p>
				</st>
				<p>We thank Y. Sasaki and J. Tsujii for many valuable comments and discussions, and we also thank the reviewers. Our thanks also go to the Rebholz Text Mining Group at EMBL-EBI, Hinxton, for domain expertise related to bio-resources. This research was supported by the EC project BOOTStrep FP6-028099 (<url>http://www.bootstrep.org</url>). The UK National Centre for Text Mining is sponsored by the JISC/BBSRC/EPSRC.</p>
				<p>This article has been published as part of <it>BMC Bioinformatics</it> Volume 9 Supplement 3, 2008: Proceedings of the Second International Symposium on Languages in Biology and Medicine (LBM) 2007. The full contents of the supplement are available online at <url>http://www.biomedcentral.com/1471-2105/9?issue=S3</url>.</p>
			</sec>
		</ack>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>GENIA corpus&#8212;semantically annotated corpus for bio-textmining</p>
				</title>
				<aug>
					<au>
						<snm>Kim</snm>
						<fnm>JD</fnm>
					</au>
					<au>
						<snm>Ohta</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Tateisi</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Tsujii</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19</volume>
				<issue>Suppl 1</issue>
				<fpage>i180</fpage>
				<lpage>i182</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/btg1023</pubid>
						<pubid idtype="pmpid" link="fulltext">12855455</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B2">
				<title>
					<p>Integrated Annotation for Biomedical Information Extraction</p>
				</title>
				<aug>
					<au>
						<snm>Kulick</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Bies</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Liberman</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Mandel</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>McDonald</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Palmer</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Schein</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Ungar</snm>
						<fnm>L</fnm>
					</au>
				</aug>
				<source>Proceedings of HLT-NAACL 2004 Workshop: Biolink 2004</source>
				<pubdate>2004</pubdate>
				<fpage>61</fpage>
				<lpage>68</lpage>
			</bibl>
			<bibl id="B3">
				<title>
					<p>GENETAG: a tagged corpus for gene/protein named entity recognition</p>
				</title>
				<aug>
					<au>
						<snm>Tanabe</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Xie</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Thom</snm>
						<fnm>LH</fnm>
					</au>
					<au>
						<snm>Matten</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Wilbur</snm>
						<fnm>WJ</fnm>
					</au>
				</aug>
				<source>BMC Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>6</volume>
				<issue>Suppl 1</issue>
				<fpage>S3</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1869017</pubid>
						<pubid idtype="pmpid" link="fulltext">15960837</pubid>
						<pubid idtype="doi">10.1186/1471-2105-6-S1-S3</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B4">
				<title>
					<p>Tagging gene and protein names in biomedical text</p>
				</title>
				<aug>
					<au>
						<snm>Tanabe</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Wilbur</snm>
						<fnm>WJ</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2002</pubdate>
				<volume>18</volume>
				<issue>8</issue>
				<fpage>1124</fpage>
				<lpage>1132</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/18.8.1124</pubid>
						<pubid idtype="pmpid" link="fulltext">12176836</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B5">
				<title>
					<p>Recognizing names in biomedical texts: a machine learning approach</p>
				</title>
				<aug>
					<au>
						<snm>Zhou</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Zhang</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Su</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Shen</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Tan</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2004</pubdate>
				<volume>20</volume>
				<issue>7</issue>
				<fpage>1178</fpage>
				<lpage>1190</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/bth060</pubid>
						<pubid idtype="pmpid" link="fulltext">14871877</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B6">
				<title>
					<p>ProMiner: rule-based protein and gene entity recognition</p>
				</title>
				<aug>
					<au>
						<snm>Hanisch</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Fundel</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Mevissen</snm>
						<fnm>HT</fnm>
					</au>
					<au>
						<snm>Zimmer</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Fluck</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>BMC Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>6</volume>
				<issue>Suppl 1</issue>
				<fpage>S14</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1869006</pubid>
						<pubid idtype="pmpid" link="fulltext">15960826</pubid>
						<pubid idtype="doi">10.1186/1471-2105-6-S1-S14</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B7">
				<title>
					<p>ABNER: an open source tool for automatically tagging genes, proteins, and other named entities in text</p>
				</title>
				<aug>
					<au>
						<snm>Settles</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>21</volume>
				<fpage>3191</fpage>
				<lpage>3192</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/bti475</pubid>
						<pubid idtype="pmpid" link="fulltext">15860559</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B8">
				<title>
					<p>A scalable machine-learning approach to recognize chemical names within large text databases</p>
				</title>
				<aug>
					<au>
						<snm>Wren</snm>
						<fnm>JD</fnm>
					</au>
				</aug>
				<source>BMC Bioinformatics</source>
				<pubdate>2006</pubdate>
				<volume>7</volume>
				<issue>Suppl 2</issue>
				<fpage>S3</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1683569</pubid>
						<pubid idtype="pmpid" link="fulltext">17118146</pubid>
						<pubid idtype="doi">10.1186/1471-2105-7-S2-S3</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B9">
				<title>
					<p>Information extraction in molecular biology</p>
				</title>
				<aug>
					<au>
						<snm>Blaschke</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Hirschman</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Valencia</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Briefings in Bioinformatics</source>
				<pubdate>2002</pubdate>
				<volume>3</volume>
				<issue>2</issue>
				<fpage>154</fpage>
				<lpage>165</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bib/3.2.154</pubid>
						<pubid idtype="pmpid" link="fulltext">12139435</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B10">
				<title>
					<p>Using BLAST for identifying gene and protein names in journal articles</p>
				</title>
				<aug>
					<au>
						<snm>Krauthammer</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Rzhetsky</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Morozov</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Friedman</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>Gene</source>
				<pubdate>2000</pubdate>
				<volume>259</volume>
				<fpage>245</fpage>
				<lpage>252</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/S0378-1119(00)00431-5</pubid>
						<pubid idtype="pmpid" link="fulltext">11163982</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B11">
				<title>
					<p>Identification of related gene/protein names based on an HMM of name variations</p>
				</title>
				<aug>
					<au>
						<snm>Yeganova</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Smith</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Wilbur</snm>
						<fnm>WJ</fnm>
					</au>
				</aug>
				<source>Comput Biol Chem</source>
				<pubdate>2004</pubdate>
				<volume>28</volume>
				<fpage>97</fpage>
				<lpage>107</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/j.compbiolchem.2003.12.003</pubid>
						<pubid idtype="pmpid" link="fulltext">15130538</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B12">
				<title>
					<p>Overview of BioCreAtIvE task 1B: normalized gene lists</p>
				</title>
				<aug>
					<au>
						<snm>Hirschman</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Colosimo</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Morgan</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Yeh</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>BMC Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>6</volume>
				<issue>Suppl 1</issue>
				<fpage>S11</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1869004</pubid>
						<pubid idtype="pmpid" link="fulltext">15960823</pubid>
						<pubid idtype="doi">10.1186/1471-2105-6-S1-S11</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B13">
				<title>
					<p>A graph-search framework for associating gene identifiers with documents</p>
				</title>
				<aug>
					<au>
						<snm>Cohen</snm>
						<fnm>WW</fnm>
					</au>
					<au>
						<snm>Minkov</snm>
						<fnm>E</fnm>
					</au>
				</aug>
				<source>BMC Bioinformatics</source>
				<pubdate>2006</pubdate>
				<volume>7</volume>
				<fpage>440</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1617121</pubid>
						<pubid idtype="pmpid" link="fulltext">17032441</pubid>
						<pubid idtype="doi">10.1186/1471-2105-7-440</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B14">
				<title>
					<p>Learning string similarity measures for gene/protein name dictionary look-up using logistic regression</p>
				</title>
				<aug>
					<au>
						<snm>Tsuruoka</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>McNaught</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Tsujii</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Ananiadou</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2007</pubdate>
				<volume>23</volume>
				<issue>20</issue>
				<fpage>2768</fpage>
				<lpage>2774</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/btm393</pubid>
						<pubid idtype="pmpid" link="fulltext">17698493</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B15">
				<title>
					<p>Improving the Performance of Dictionary-based Approaches in Protein Name Recognition</p>
				</title>
				<aug>
					<au>
						<snm>Tsuruoka</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Tsujii</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Journal of Biomedical Informatics</source>
				<pubdate>2004</pubdate>
				<volume>37</volume>
				<fpage>461</fpage>
				<lpage>470</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/j.jbi.2004.08.003</pubid>
						<pubid idtype="pmpid" link="fulltext">15542019</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B16">
				<title>
					<p>A guided tour to approximate string matching</p>
				</title>
				<aug>
					<au>
						<snm>Navarro</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<source>ACM Computing Surveys</source>
				<pubdate>2001</pubdate>
				<volume>33</volume>
				<fpage>31</fpage>
				<lpage>88</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1145/375360.375365</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B17">
				<title>
					<p>Contrast and Variability in Gene Names</p>
				</title>
				<aug>
					<au>
						<snm>Cohen</snm>
						<fnm>KB</fnm>
					</au>
					<au>
						<snm>Dolbey</snm>
						<fnm>AE</fnm>
					</au>
					<au>
						<snm>Acquaah-Mensah</snm>
						<fnm>GK</fnm>
					</au>
					<au>
						<snm>Hunter</snm>
						<fnm>L</fnm>
					</au>
				</aug>
				<source>Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain</source>
				<pubdate>2002</pubdate>
				<fpage>14</fpage>
				<lpage>20</lpage>
			</bibl>
			<bibl id="B18">
				<title>
					<p>Human Gene Name Normalization using Text Matching with Automatically Extracted Synonym Dictionaries</p>
				</title>
				<aug>
					<au>
						<snm>Fang</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Murphy</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Jin</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Kim</snm>
						<fnm>JS</fnm>
					</au>
					<au>
						<snm>White</snm>
						<fnm>PS</fnm>
					</au>
				</aug>
				<source>Proceedings of BioNLP'06</source>
				<pubdate>2006</pubdate>
			</bibl>
			<bibl id="B19">
				<title>
					<p>Evaluation of techniques for increasing recall in a dictionary appr