<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>1471-2105-9-S5-S5</ui>
	<ji>1471-2105</ji>
	<fm>
		<dochead>Proceedings</dochead>
		<bibl>
			<title>
				<p>Facilitating the development of controlled vocabularies for metabolomics technologies with text mining</p>
			</title>
			<aug>
				<au id="A1" ca="yes">
					<snm>Spasi&#263;</snm>
					<fnm>Irena</fnm>
					<insr iid="I1"/>
					<insr iid="I2"/>
					<email>i.spasic@manchester.ac.uk</email>
				</au>
				<au id="A2">
					<snm>Schober</snm>
					<fnm>Daniel</fnm>
					<insr iid="I3"/>
					<email>schober@ebi.ac.uk</email>
				</au>
				<au id="A3">
					<snm>Sansone</snm>
					<fnm>Susanna-Assunta</fnm>
					<insr iid="I3"/>
					<email>sansone@ebi.ac.uk</email>
				</au>
				<au id="A4">
					<snm>Rebholz-Schuhmann</snm>
					<fnm>Dietrich</fnm>
					<insr iid="I3"/>
					<email>rebholz@ebi.ac.uk</email>
				</au>
				<au id="A5">
					<snm>Kell</snm>
					<mi>B</mi>
					<fnm>Douglas</fnm>
					<insr iid="I1"/>
					<insr iid="I4"/>
					<email>dbk@manchester.ac.uk</email>
				</au>
				<au id="A6">
					<snm>Paton</snm>
					<mi>W</mi>
					<fnm>Norman</fnm>
					<insr iid="I1"/>
					<insr iid="I2"/>
					<email>norm@cs.man.ac.uk</email>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>Manchester Centre for Integrative Systems Biology, The University of Manchester, 131 Princess Street, Manchester, M1 7ND, UK</p>
				</ins>
				<ins id="I2">
					<p>School of Computer Science, The University of Manchester, Oxford Road, Manchester, M13 9PL, UK</p>
				</ins>
				<ins id="I3">
					<p>The European Bioinformatics Institute, EMBL Outstation - Hinxton, Wellcome Trust Genome Campus, Cambridge, CB10 1SD, UK</p>
				</ins>
				<ins id="I4">
					<p>School of Chemistry, The University of Manchester, Oxford Road, Manchester, M13 9PL, UK</p>
				</ins>
			</insg>
			<source>BMC Bioinformatics</source>
			<supplement>
				<title>
					<p>Proceedings of the 10<sup>th</sup> Bio-Ontologies Special Interest Group Workshop 2007. Ten years past and looking to the future</p>
				</title>
				<editor>Phillip Lord, Robert Stevens, Susanna-Assunta Sansone, Robin MacEntire</editor>
				<note>Proceedings</note>
			</supplement>
			<conference>
				<title>
					<p>10<sup>th</sup> Bio-Ontologies Special Interest Group Workshop 2007. Ten years past and looking to the future</p>
				</title>
				<location>Vienna, Austria</location>
				<date-range>20 July 2007</date-range>
			</conference>
			<issn>1471-2105</issn>
			<pubdate>2008</pubdate>
			<volume>9</volume>
			<issue>Suppl 5</issue>
			<fpage>S5</fpage>
			<url>http://www.biomedcentral.com/1471-2105/9/S5/S5</url>
			<xrefbib>
				<pubidlist><pubid idtype="pmpid">18460187</pubid><pubid idtype="doi">10.1186/1471-2105-9-S5-S5</pubid>
				</pubidlist></xrefbib>
		</bibl>
		<history>
			<pub>
				<date>
					<day>29</day>
					<month>04</month>
					<year>2008</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2008</year>
			<collab>Spasi&#263; et al.; licensee BioMed Central Ltd.</collab>
			<note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
		</cpyrt>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<sec>
					<st>
						<p>Background</p>
					</st>
					<p>Many bioinformatics applications rely on controlled vocabularies or ontologies to consistently interpret and seamlessly integrate information scattered across public resources. Experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology, hence the pressing need for vocabularies and ontologies in metabolomics. However, constructing these resources manually is time-consuming and non-trivial.</p>
				</sec>
				<sec>
					<st>
						<p>Results</p>
					</st>
					<p>We describe a methodology for rapid development of controlled vocabularies, a study originally motivated by the need for vocabularies describing metabolomics technologies. We present case studies involving two controlled vocabularies (for nuclear magnetic resonance spectroscopy and gas chromatography) whose development is currently underway as part of the Metabolomics Standards Initiative. The initial vocabularies were compiled manually, providing a total of 243 and 152 terms respectively. A total of 5,699 and 2,612 new terms respectively were acquired automatically from the literature. The analysis of the results showed that full-text articles (especially the Materials and Methods sections), rather than paper abstracts, are the major source of technology-specific terms.</p>
				</sec>
				<sec>
					<st>
						<p>Conclusions</p>
					</st>
					<p>We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature. We adopted an integrative approach, combining relatively generic software and data resources for time- and cost-effective development of a text mining tool for expansion of controlled vocabularies across various domains, as a practical alternative to both manual term collection and tailor-made named entity recognition methods.</p>
				</sec>
			</sec>
		</abs>
	</fm>
	<bdy>
		<sec>
			<st>
				<p>Background</p>
			</st>
			<p>The lack of a suitable means for formally describing the semantic aspects of omics investigations presents challenges to effective information exchange between biologists <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>. The inherent imprecision of free-text descriptions of experimental procedures hinders computational approaches to the interpretation of experimental results. Controlled vocabularies and/or ontologies can be used as a means of adding an interpretative annotation layer to the textual information <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>. A controlled vocabulary (CV) is a structured set of terms (i.e. linguistic representations of domain-specific concepts <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>, and as such a means of conveying scientific and technical information <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>) and definitions agreed by an authority or a community. An ontology includes CV terms to refer to concepts at the linguistic level, but also utilises a richer semantic representation to characterise the ways in which these concepts are related <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. Many scientific communities, including those operating in the metabolomics domain <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, have started developing ontologies for data annotation <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. The Metabolomics Standards Initiative (MSI) <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp> Ontology Working Group (OWG) <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> has been appointed to establish a common semantic framework (i.e. a set of ontologies and their CVs) for metabolomics studies to be used to describe the experimental process consistently, and to ensure meaningful and unambiguous data exchange <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. 
While ontologies and CVs provide a mechanism for coherent and rigorous structuring of domain-specific knowledge, in an expanding domain such as metabolomics they must also be easily extensible. The new knowledge, largely generated by high-throughput screening, is communicated through the biotechnology literature, which can be exploited by text mining (TM) tools to facilitate the process of keeping ontologies and their CVs up to date <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B16">16</abbr></abbrgrp>. In this article we describe a TM approach for rapidly expanding a set of CVs maintained by the MSI OWG with terms extracted from the scientific literature, following initial term acquisition from sources such as domain specialists, literature, databases, existing ontologies, etc.</p>
			<p>The MSI OWG <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> aims to develop a set of ontologies and CVs in metabolomics as a direct support to the activities of other MSI WGs <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, which are responsible for: Biological Context Metadata, Chemical Analysis, Data Processing and Exchange Formats. The coverage of the domain has been divided in accordance with the typical structure of metabolomics investigations:</p>
			<p>&#8226; general components (investigation design; sample source, characteristics, treatments and collection; computational analysis), and</p>
			<p>&#8226; technology-specific components (sample preparation; instrumental analysis; data pre-processing).</p>
			<p>The ongoing standardisation endeavours in other omics domains, such as the Human Proteome Organization (HUPO) Proteomics Standards Initiatives (PSI) <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp>, the Microarray Gene Expression Data Society (MGED) <abbrgrp><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr></abbrgrp> and other ontology communities under the Open Biomedical Ontologies (OBO) Foundry <abbrgrp><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr></abbrgrp> umbrella can largely be re-used to describe the general aspects of metabolomics investigations. Therefore, the MSI OWG has focused initially on the technology-specific components. Further, development activities in this sub-domain have been prioritised according to the pervasiveness of the analytical platforms used.</p>
			<p>A range of analytical technologies have been employed in metabolomics studies <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. Mass spectrometry (MS) is the most widely used analytical technology in metabolomics, as it enables rapid, sensitive and selective qualitative and quantitative analyses with the ability to identify individual metabolites. In particular, the combined chromatography-MS technologies have proven to be highly effective in this respect. Gas chromatography-mass spectrometry (GC-MS) uses GC to separate volatile and thermally stable compounds prior to detection via MS. Similarly, liquid chromatography-mass spectrometry (LC-MS) provides the separation of compounds by LC, which is again followed by MS. On the other hand, nuclear magnetic resonance (NMR) spectroscopy does not require any separation of the compounds prior to analysis, thus providing a non-destructive, high-throughput detection method with minimal sample preparation, which has made it highly popular in metabolomics investigations despite being relatively insensitive in comparison to the MS-based methods.</p>
			<p>For MS, the MSI OWG will leverage previous work by the PSI MS Standards WG <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. For chromatography, which is used in both proteomics and metabolomics, the MSI OWG is closely collaborating with the PSI Sample Processing Ontology WG. Consequently, the technologies the MSI OWG is currently focusing on are NMR and GC. These two technologies are used in this paper to illustrate the effectiveness of the proposed TM approach.</p>
			<p>The MSI OWG efforts are divided into two key stages: (1) reaching a consensus on the CVs, and (2) developing the corresponding ontology as part of the Ontology for Biomedical Investigations (OBI, previously FuGO) <abbrgrp><abbr bid="B27">27</abbr><abbr bid="B28">28</abbr></abbrgrp>. In this paper, we focus on the first stage. Each CV is compiled in the following three steps:</p>
			<p>1. Compilation: An initial CV is created by re-using the existing terminologies from database models (e.g. <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp>), glossaries, etc. and normalising the terms according to some common naming conventions <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. The result of this phase is a draft CV encompassing terms of different types: methods, instruments, parameters that can be measured, etc.</p>
			<p>2. Expansion: In the highly dynamic metabolomics domain, experts often use non-standardised terms. Therefore, in order to reduce the time and cost of compiling a CV and to strive for its completeness, we use a TM approach to automatically identify additional technology-related terms frequently occurring in the scientific literature.</p>
			<p>3. Curation: The CV is discussed within the MSI OWG and is passed on to the practitioners in the relevant metabolomics area for validation in order to ensure the quality and completeness of the proposed CV.</p>
			<p>We expect the CVs to evolve over time, reflecting changes in the domain and the availability of new literature; steps 2 and 3 should therefore be repeated at intervals.</p>
		</sec>
		<sec>
			<st>
				<p>Implementation</p>
			</st>
			<p>A set of relevant tasks regarding CV term acquisition has been identified, including information retrieval, term recognition and term filtering. Figure <figr fid="F1">1</figr> summarises the main steps taken in our TM approach to CV expansion. First, the information retrieval module is used to gather documents relevant for a given CV from the literature databases. Once a domain-specific corpus of documents has been assembled, it is searched for potential terms unaccounted for in the initial CV. Automatic term recognition is performed to extract terms as domain-specific lexical units, i.e. the ones that frequently occur in the corpus and bear special meaning in the domain. In order to reduce the number of terms not directly related to a given technology, and therefore not relevant for the given CV, we filter out typically co-occurring types of terms denoting substances, organisms, organs, diseases, etc. In contrast to the considered analytical techniques, these sub-domains have more established CVs, which can be exploited to recognise these terms using a dictionary-based approach <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>. Each of the TM steps is described in more detail in the forthcoming sub-sections.</p>
			<fig id="F1">
				<title>
					<p>Figure 1</p>
				</title>
				<caption>
					<p>The flow of data in a TM approach to CV expansion</p>
				</caption>
				<text>
					<p><b>The flow of data in a TM approach to CV expansion.</b> The information retrieval module is used to gather a corpus of documents relevant for a given CV from the literature databases. Automatic term recognition is applied against the corpus to extract terms as domain-specific lexical units. Some of the extracted terms not directly related to the CV are filtered out by using the knowledge about typically co-occurring types of terms.</p>
				</text>
				<graphic file="1471-2105-9-S5-S5-1"/>
			</fig>
			<sec>
				<st>
					<p>Information retrieval</p>
				</st>
				<p><it>Information retrieval</it> (IR) implements the representation, storage and organisation of textual data to enable a user to access relevant pieces of information <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>. Biomedical experts regularly exploit IR to locate relevant information (most often in the form of scientific publications) on the Internet. Apart from general-purpose search engines such as Google&#8482; <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>, many IR systems have been designed specifically to query databases of biomedical publications (e.g. <abbrgrp><abbr bid="B35">35</abbr><abbr bid="B36">36</abbr><abbr bid="B37">37</abbr><abbr bid="B38">38</abbr><abbr bid="B39">39</abbr></abbrgrp>) such as Medical Literature Analysis and Retrieval System Online (MEDLINE) <abbrgrp><abbr bid="B40">40</abbr></abbrgrp> and PubMed Central (PMC) <abbrgrp><abbr bid="B41">41</abbr></abbrgrp> (henceforth referred to together as <it>PubMed</it>), which provide peer-reviewed literature and make it freely accessible in a uniform format. MEDLINE distributes <it>abstracts</it> only, while PMC provides <it>full-text articles</it>. PubMed is accessible through <it>Entrez </it><abbrgrp><abbr bid="B42">42</abbr></abbrgrp>, an integrated retrieval system that provides access to a family of related biomedical databases maintained by the National Center for Biotechnology Information (NCBI).</p>
				<p>Documents available in PubMed are indexed by Medical Subject Headings (MeSH) <abbrgrp><abbr bid="B43">43</abbr></abbrgrp> terms (<it>index terms</it> are pre-selected to refer to the content of a document <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>). MeSH is a CV consisting of hierarchically organised terms that serve as descriptors to index and annotate documents. This permits direct access to relevant documents at various levels of specificity, thus improving the performance of IR in terms of speed as well as precision and recall. Entrez uses automatic term mapping to match terms against the MeSH hierarchy and to expand a query with (near-)synonyms and subsumed terms. For example, all of the following terms are explicitly listed as terms matching <it>Magnetic Resonance Spectroscopy</it> in MeSH:</p>
				<p>&#8226; <it>In Vivo NMR Spectroscopy</it></p>
				<p>&#8226; <it>Magnetic Resonance</it></p>
				<p>&#8226; <it>MR Spectroscopy</it></p>
				<p>&#8226; <it>NMR Spectroscopy</it></p>
				<p>&#8226; <it>NMR Spectroscopy, In Vivo</it></p>
				<p>&#8226; <it>Nuclear Magnetic Resonance</it></p>
				<p>&#8226; <it>Spectroscopy, Magnetic Resonance</it></p>
				<p>&#8226; <it>Spectroscopy, NMR</it></p>
				<p>&#8226; <it>Spectroscopy, Nuclear Magnetic Resonance</it></p>
				<p>Similarly, a query searching for information on <it>Gas Chromatography</it> can be expanded automatically to include <it>Gas Chromatography-Mass Spectrometry</it> as a more specific term (see Figure <figr fid="F2">2</figr>).</p>
				<fig id="F2">
					<title>
						<p>Figure 2</p>
					</title>
					<caption>
						<p>A sub-tree of the MeSH hierarchy</p>
					</caption>
					<text>
						<p><b>A sub-tree of the MeSH hierarchy.</b> We show part of the MeSH hierarchy relevant for the two CVs (i.e. NMR and GC) considered.</p>
					</text>
					<graphic file="1471-2105-9-S5-S5-2"/>
				</fig>
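				<p>The effect of this kind of query expansion can be illustrated with a minimal Python sketch. It is not Entrez's actual term-mapping implementation: the two dictionaries are a hand-built excerpt containing only the terms quoted above, and the names ENTRY_TERMS, NARROWER, expand and to_query are hypothetical.</p>

```python
# Toy excerpt of a MeSH-like vocabulary (assumption: only the terms quoted in the text).
ENTRY_TERMS = {
    "Magnetic Resonance Spectroscopy": [
        "In Vivo NMR Spectroscopy",
        "Magnetic Resonance",
        "MR Spectroscopy",
        "NMR Spectroscopy",
        "NMR Spectroscopy, In Vivo",
        "Nuclear Magnetic Resonance",
        "Spectroscopy, Magnetic Resonance",
        "Spectroscopy, NMR",
        "Spectroscopy, Nuclear Magnetic Resonance",
    ],
}

# More specific (subsumed) terms, keyed by the broader term.
NARROWER = {
    "Gas Chromatography": ["Gas Chromatography-Mass Spectrometry"],
}


def expand(term):
    """Return the term, its entry terms and all subsumed terms, depth-first."""
    expanded, seen, stack = [], set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        expanded.append(current)
        expanded.extend(ENTRY_TERMS.get(current, []))
        # Reverse so that narrower terms are visited in their listed order.
        stack.extend(reversed(NARROWER.get(current, [])))
    return expanded


def to_query(terms):
    """Join the expanded terms into a single quoted OR query string."""
    return " OR ".join('"{}"'.format(t) for t in terms)
```

				<p>In this sketch, expanding <it>Gas Chromatography</it> yields a query that also matches documents mentioning only the more specific term.</p>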
				<p>While the use of MeSH for indexing and query expansion in Entrez is undoubtedly useful, these benefits cannot be fully exploited for the particular problem of accessing articles describing research that utilises a given analytical technology. In particular, an analytical technique employed in metabolomics is unlikely to be the main focus of the reported studies. Consequently, the corresponding documents may not necessarily be indexed with technology-related MeSH terms. Further, the abstracts of such articles are more likely to report the actual findings rather than the technology-specific experimental conditions applied. These parameters are usually described in the <it>Materials and methods</it> section or as part of the supplementary material. Hence, two points arise when retrieving documents containing information pertinent to analytical techniques deployed in metabolomics studies. First, it is important to search full-text articles as opposed to abstracts only. For this reason we used PMC, which provides access to full-text articles, in addition to MEDLINE, which offers only abstracts. Second, it is necessary to go beyond MeSH terms in query formulation. We address this problem with the following assumption: terms denoting related concepts tend to co-occur within textual documents <abbrgrp><abbr bid="B44">44</abbr><abbr bid="B45">45</abbr></abbrgrp>. On this basis, terms from an initially compiled CV can be combined in a search query to retrieve additional documents that describe research that utilises a technology, i.e. the ones that do not necessarily deal with the technology <it>per se</it> and thus may not be indexed by technology-related MeSH terms. To achieve this, we index the literature with the CV terms. Each CV term is used to search the literature via Entrez. As a result, each term is mapped to the set of documents it matches. This information is stored in a local database using the following structure, expressed in SQL:</p>
				<p>CREATE TABLE index</p>
				<p>(</p>
				<p>term VARCHAR(200) NOT NULL,</p>
				<p>document VARCHAR(50) NOT NULL</p>
				<p>);</p>
				<p>A cut-off point (this is a configurable parameter; the specific values used in our case studies are reported in the <it>Results &amp; Discussion</it> section) is set to remove the non-discriminatory terms, i.e. the ones that return too many documents. These are likely to be broad terms not limited to a specific analytical technique, which consequently introduce unwanted noise into the domain-specific corpus. For example, in the case of the NMR CV, the mean number of abstracts returned was 2,772, whereas the median was just 0, because the NMR CV was constructed using a considerable number of terms coming from database schemata. These terms are semi-formal in the sense that they do not necessarily reflect the terminology used in the literature, e.g. <it>AMIX VIEWER &amp; AMIX-TOOLS</it> and <it>JEOL NMR instrument</it>. At the other extreme, terms returning the maximal number of abstracts (set to 50,000) were: <it>analysis, characteristic, concentration, Delta, instrument, method, reference, software, states</it> and <it>tube</it>. The following SQL query can be used to identify such terms:</p>
				<p>SELECT term, COUNT(document) AS matching_documents</p>
				<p>FROM index</p>
				<p>GROUP BY term</p>
				<p>HAVING COUNT(document) &gt;= D;</p>
				<p>where D is the chosen cut-off point. Once such terms have been removed from further consideration for IR purposes, a second cut-off point (as before, this is a configurable parameter, and the specific values used in our case studies are reported in the <it>Results &amp; Discussion</it> section) is set to remove the documents that do not contain a sufficient number of the CV terms. The following SQL query can be used to identify such documents:</p>
				<p>SELECT document, COUNT(term) AS matching_terms</p>
				<p>FROM index</p>
				<p>GROUP BY document</p>
				<p>HAVING COUNT(term) &lt;= T;</p>
				<p>where T is the chosen cut-off point. For example, some of the documents with the highest number of matching terms from the NMR CV were <abbrgrp><abbr bid="B46">46</abbr><abbr bid="B47">47</abbr><abbr bid="B48">48</abbr></abbrgrp>.</p>
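				<p>The indexing and cut-off steps can be tried out with a minimal sketch, assuming Python and its built-in sqlite3 module rather than the PostgreSQL database used in this work. The table name term_index, the function names and any data passed in are hypothetical. Conditions on aggregated counts are placed in a HAVING clause, because they must be evaluated after GROUP BY.</p>

```python
import sqlite3


def build_index(pairs):
    """Load (term, document) pairs into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    # Named term_index here to avoid clashing with the SQL keyword INDEX.
    conn.execute(
        "CREATE TABLE term_index ("
        "term VARCHAR(200) NOT NULL, "
        "document VARCHAR(50) NOT NULL)"
    )
    conn.executemany("INSERT INTO term_index VALUES (?, ?)", pairs)
    return conn


def broad_terms(conn, d):
    """Terms matching at least d documents, i.e. non-discriminatory terms."""
    rows = conn.execute(
        "SELECT term, COUNT(document) AS matching_documents "
        "FROM term_index GROUP BY term "
        "HAVING COUNT(document) >= ? ORDER BY term",
        (d,),
    )
    return [term for term, _ in rows]


def drop_terms(conn, terms):
    """Remove the given terms from further consideration."""
    conn.executemany(
        "DELETE FROM term_index WHERE term = ?", [(t,) for t in terms]
    )


def sparse_documents(conn, t):
    """Documents matching at most t of the remaining terms."""
    rows = conn.execute(
        "SELECT document, COUNT(term) AS matching_terms "
        "FROM term_index GROUP BY document "
        "HAVING ? >= COUNT(term) ORDER BY document",
        (t,),
    )
    return [document for document, _ in rows]
```

				<p>A typical sequence in this sketch is to call broad_terms with the chosen D, pass the result to drop_terms, and then call sparse_documents with the chosen T to list the documents to discard.</p>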
				<p>The IR module based on the methods described above is implemented in Java. The Java application takes advantage of E-Utilities <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>, a web service which enables users to run Entrez queries and download data using their own applications. The information gathered about terms, documents and their relations is stored in a local database (DB) hosted on a PostgreSQL <abbrgrp><abbr bid="B49">49</abbr></abbrgrp> system. By storing the mappings between terms and documents, the querying ability of the DB management system can be combined with that of Entrez. The local DB is also accessible from Java applications (using JDBC, a standard Java API for SQL DB access). Hence, all our implemented IR modules can be incorporated into customised workflows <abbrgrp><abbr bid="B50">50</abbr></abbrgrp>.</p>
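				<p>As an illustration of how a single CV term might be submitted to Entrez, the following Python sketch builds an E-Utilities ESearch request. It is not the Java code used in this work: the function name esearch_url, the default database and the default retmax value are assumptions, and the sketch only constructs the URL without contacting the service.</p>

```python
from urllib.parse import urlencode

# Base address of the ESearch utility provided by NCBI.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="pmc", retmax=100):
    """Build an ESearch request URL that searches for the term as a phrase."""
    params = {
        "db": db,                       # e.g. "pubmed" for abstracts, "pmc" for full text
        "term": '"{}"'.format(term),    # quote the CV term to request a phrase search
        "retmax": retmax,               # upper bound on the number of returned identifiers
    }
    return ESEARCH + "?" + urlencode(params)
```

				<p>The identifiers returned for each such request could then be stored as (term, document) pairs in the local DB.</p>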
			</sec>
			<sec>
				<st>
					<p>Term recognition</p>
				</st>
				<p>In the literature dealing with terminology issues, a term is intuitively defined as a phrase (typically a noun phrase <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B51">51</abbr></abbrgrp>): (1) frequently occurring in texts restricted to a specific domain, and (2) having a special meaning in the given domain <abbrgrp><abbr bid="B52">52</abbr></abbrgrp>. Bearing in mind the potentially unlimited number of different domains and the dynamic nature of newly emerging ones (many of which expand rapidly together with the corresponding terminologies, as is the case in metabolomics), the need for efficient term recognition becomes apparent. Manual term recognition approaches are time-consuming, labour-intensive and prone to error due to subjective judgement. These shortcomings can be addressed by automatic term recognition (ATR), the process of annotating an electronic document with a set of terms extracted from the document <abbrgrp><abbr bid="B53">53</abbr></abbrgrp>. Here, we emphasise that ATR refers to the computer-based extraction of terms from a domain-specific corpus as opposed to merely matching the corpus against a dictionary of terms <abbrgrp><abbr bid="B54">54</abbr></abbrgrp>. It has been suggested that scientific corpora can be used as reliable sources for terminology construction exploiting <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>:</p>
				<p>&#8226; the growing number of electronic corpora,</p>
				<p>&#8226; efficient NLP tools (such as part-of-speech taggers, parsers, etc.),</p>
				<p>&#8226; linguistically and/or statistically based ATR procedures, and</p>
				<p>&#8226; the fact that domain experts often use terms that have not been standardised, and as such are not included in standardised dictionaries.</p>
				<p>The lack of terminological standards is especially apparent in the rapidly expanding domain of metabolomics, where there is no exact consensus on what constitutes a metabolite name although naming conventions do exist for some entities, e.g. the Chemical Entities of Biological Interest (ChEBI) dictionary that is emerging for small molecules <abbrgrp><abbr bid="B55">55</abbr></abbrgrp>. Still, these are only guidelines and as such do not impose restrictions on domain experts.</p>
				<p>Manual term recognition is performed by relying on conceptual knowledge, i.e. humans identify terms by relating them to the corresponding concepts. It is currently not feasible to implement an ATR approach following such a paradigm due to the lack of appropriate knowledge representation systems and the difficulty of automatically performing &#8220;intelligent&#8221; tasks. For these reasons, ATR approaches resort to other types of knowledge that can provide clues about the terminological status of a given natural language clause <abbrgrp><abbr bid="B56">56</abbr></abbrgrp>. Generally, the knowledge used for ATR may involve two types of information:</p>
				<p>&#8226; internal: morphological, syntactic, semantic and/or statistical knowledge about terms and/or their constituents (nested terms, words, morphemes), and</p>
				<p>&#8226; external: linguistic and/or statistical knowledge regarding the term context, together with the knowledge contained in external resources, such as electronic dictionaries, ontologies, corpora, etc.</p>
				<p>ATR methods typically combine two approaches: linguistic (or symbolic) and statistical (or numeric) <abbrgrp><abbr bid="B51">51</abbr></abbrgrp>. Linguistic approaches to ATR usually involve pattern matching to recognise candidate terms by checking if their internal structure conforms to a predefined set of morpho-syntactic rules. Statistical methods rely on at least one of the following hypotheses regarding the term usage <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>:</p>
				<p>&#8226; specificity: terms are likely to be confined to a single or few domains,</p>
				<p>&#8226; absolute frequency: terms tend to appear frequently in their domain, and</p>
				<p>&#8226; relative frequency: terms tend to appear more frequently in their domain than in general.</p>
				<p>Statistical approaches are prone to extracting not only terms, but also other types of collocations (sequences of words co-occurring more frequently than would be expected by chance) <abbrgrp><abbr bid="B57">57</abbr></abbrgrp>: functional, semantic, thematic and others, e.g. &#8220;&#8230;<it>to play an important role in</it>&#8230;&#8221;. This problem is typically remedied by employing linguistic filters to extract candidate terms from a corpus, which are then ranked using statistical methods.</p>
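				<p>The combination of a linguistic filter with statistical ranking can be sketched in Python as follows. This is a simplified illustration and not the C-value method: it assumes that the input is already part-of-speech tagged with the coarse tags ADJ, N and PREP, it uses a simple noun-phrase pattern, and it ranks candidates by raw frequency only. All names are hypothetical.</p>

```python
import re
from collections import Counter

# One letter per coarse part-of-speech tag; any other tag becomes "x".
TAG_CODES = {"ADJ": "A", "N": "N", "PREP": "P"}

# Linguistic filter: any run of adjectives and nouns that ends in a noun.
CANDIDATE = re.compile(r"[AN]*N")


def candidates(tagged_sentence):
    """Return the word sequences whose tags pass the noun-phrase filter."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged_sentence)
    found = []
    for match in CANDIDATE.finditer(codes):
        words = tagged_sentence[match.start():match.end()]
        found.append(" ".join(word.lower() for word, _ in words))
    return found


def rank(tagged_sentences):
    """Rank the candidate terms of a corpus by absolute frequency."""
    counts = Counter()
    for sentence in tagged_sentences:
        counts.update(candidates(sentence))
    return counts.most_common()
```

				<p>Note that the filter alone still admits non-terms such as <it>important role</it>; it is the subsequent statistical ranking over a domain-specific corpus that is expected to push such collocations down the list.</p>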
				<p>In this work, we used the C-value method <abbrgrp><abbr bid="B58">58</abbr></abbrgrp>, which is publicly available to the TM community as a web service <abbrgrp><abbr bid="B59">59</abbr></abbrgrp>. It first applies syntactic pattern matching to select term candidates, e.g. noun phrases having the structure described by the following regular expression:</p>
				<p>
					<display-formula>
						<m:math name="1471-2105-9-S5-S5-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:msup>
										<m:mrow>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mi>A</m:mi>
													<m:mi>D</m:mi>
													<m:mi>J</m:mi>
													<m:mo>|</m:mo>
													<m:mi>N</m:mi>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
										</m:mrow>
										<m:mo>+</m:mo>
									</m:msup>
									<m:mo>|</m:mo>
									<m:mtext>&#160;</m:mtext>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mrow>
											<m:msup>
												<m:mrow>
													<m:mrow>
														<m:mo>(</m:mo>
														<m:mrow>
															<m:mi>A</m:mi>
															<m:mi>D</m:mi>
															<m:mi>J</m:mi>
															<m:mo>|</m:mo>
															<m:mi>N</m:mi>
														</m:mrow>
														<m:mo>)</m:mo>
													</m:mrow>
												</m:mrow>
												<m:mo>*</m:mo>
											</m:msup>
											<m:mtext>&#160;</m:mtext>
											<m:mrow>
												<m:mo>[</m:mo>
												<m:mrow>
													<m:mi>N</m:mi>
													<m:mtext>&#160;</m:mtext>
													<m:mi>P</m:mi>
													<m:mi>R</m:mi>
													<m:mi>E</m:mi>
													<m:mi>P</m:mi>
												</m:mrow>
												<m:mo>]</m:mo>
											</m:mrow>
											<m:mtext>&#160;</m:mtext>
											<m:msup>
												<m:mrow>
													<m:mrow>
														<m:mo>(</m:mo>
														<m:mrow>
															<m:mi>A</m:mi>
															<m:mi>D</m:mi>
															<m:mi>J</m:mi>
															<m:mo>|</m:mo>
															<m:mi>N</m:mi>
														</m:mrow>
														<m:mo>)</m:mo>
													</m:mrow>
												</m:mrow>
												<m:mo>*</m:mo>
											</m:msup>
										</m:mrow>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mtext>&#160;</m:mtext>
									<m:mi>N</m:mi>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaWaaeWaaeaacqWGbbqqcqWGebarcqWGkbGscqGG8baFcqWGobGtaiaawIcacaGLPaaadaahaaWcbeqaaiabgUcaRaaakiabcYha8jabbccaGmaabmaabaWaaeWaaeaacqWGbbqqcqWGebarcqWGkbGscqGG8baFcqWGobGtaiaawIcacaGLPaaadaahaaWcbeqaaiabcQcaQaaakiabbccaGmaadmaabaGaemOta4KaeeiiaaIaemiuaaLaemOuaiLaemyrauKaemiuaafacaGLBbGaayzxaaGaeeiiaaYaaeWaaeaacqWGbbqqcqWGebarcqWGkbGscqGG8baFcqWGobGtaiaawIcacaGLPaaadaahaaWcbeqaaiabcQcaQaaaaOGaayjkaiaawMcaaiabbccaGiabd6eaobaa@592C@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>where <it>ADJ</it>, <it>N</it> and <it>PREP</it> denote adjective, noun and preposition respectively. The C-value of each candidate term <it>t</it> is then calculated as:</p>
				<p>
					<display-formula>
						<m:math name="1471-2105-9-S5-S5-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:mi>C</m:mi>
									<m:mo>&#8722;</m:mo>
									<m:mi>v</m:mi>
									<m:mi>a</m:mi>
									<m:mi>l</m:mi>
									<m:mi>u</m:mi>
									<m:mi>e</m:mi>
									<m:mrow>
										<m:mo>(</m:mo>
										<m:mi>t</m:mi>
										<m:mo>)</m:mo>
									</m:mrow>
									<m:mo>=</m:mo>
									<m:mrow>
										<m:mo>{</m:mo>
										<m:mrow>
											<m:mtable columnalign="left">
												<m:mtr columnalign="left">
													<m:mtd columnalign="left">
														<m:mrow>
															<m:mi>ln</m:mi>
															<m:mo>&#8289;</m:mo>
															<m:mo>|</m:mo>
															<m:mi>t</m:mi>
															<m:mo>|</m:mo>
															<m:mo>&#8901;</m:mo>
															<m:mi>f</m:mi>
															<m:mrow>
																<m:mo>(</m:mo>
																<m:mi>t</m:mi>
																<m:mo>)</m:mo>
															</m:mrow>
														</m:mrow>
													</m:mtd>
													<m:mtd columnalign="left">
														<m:mrow>
															<m:mo>,</m:mo>
															<m:mi>i</m:mi>
															<m:mi>f</m:mi>
															<m:mtext>&#160;</m:mtext>
															<m:mi>S</m:mi>
															<m:mrow>
																<m:mo>(</m:mo>
																<m:mi>t</m:mi>
																<m:mo>)</m:mo>
															</m:mrow>
															<m:mo>=</m:mo>
															<m:mo>&#8709;</m:mo>
														</m:mrow>
													</m:mtd>
												</m:mtr>
												<m:mtr columnalign="left">
													<m:mtd columnalign="left">
														<m:mrow>
															<m:mi>ln</m:mi>
															<m:mo>&#8289;</m:mo>
															<m:mo>|</m:mo>
															<m:mi>t</m:mi>
															<m:mo>|</m:mo>
															<m:mo>&#8901;</m:mo>
															<m:mrow>
																<m:mo>(</m:mo>
																<m:mrow>
																	<m:mi>f</m:mi>
																	<m:mrow>
																		<m:mo>(</m:mo>
																		<m:mi>t</m:mi>
																		<m:mo>)</m:mo>
																	</m:mrow>
																	<m:mo>&#8722;</m:mo>
																	<m:mfrac>
																		<m:mn>1</m:mn>
																		<m:mrow>
																			<m:mo>|</m:mo>
																			<m:mi>S</m:mi>
																			<m:mrow>
																				<m:mo>(</m:mo>
																				<m:mi>t</m:mi>
																				<m:mo>)</m:mo>
																			</m:mrow>
																			<m:mo>|</m:mo>
																		</m:mrow>
																	</m:mfrac>
																	<m:mstyle displaystyle="true">
																		<m:munder>
																			<m:mo>&#8721;</m:mo>
																			<m:mrow>
																				<m:mi>s</m:mi>
																				<m:mo>&#8712;</m:mo>
																				<m:mi>S</m:mi>
																				<m:mrow>
																					<m:mo>(</m:mo>
																					<m:mi>t</m:mi>
																					<m:mo>)</m:mo>
																				</m:mrow>
																			</m:mrow>
																		</m:munder>
																		<m:mrow>
																			<m:mi>f</m:mi>
																			<m:mrow>
																				<m:mo>(</m:mo>
																				<m:mi>s</m:mi>
																				<m:mo>)</m:mo>
																			</m:mrow>
																		</m:mrow>
																	</m:mstyle>
																</m:mrow>
																<m:mo>)</m:mo>
															</m:mrow>
														</m:mrow>
													</m:mtd>
													<m:mtd columnalign="left">
														<m:mrow>
															<m:mo>,</m:mo>
															<m:mi>i</m:mi>
															<m:mi>f</m:mi>
															<m:mtext>&#160;</m:mtext>
															<m:mi>S</m:mi>
															<m:mrow>
																<m:mo>(</m:mo>
																<m:mi>t</m:mi>
																<m:mo>)</m:mo>
															</m:mrow>
															<m:mo>&#8800;</m:mo>
															<m:mo>&#8709;</m:mo>
														</m:mrow>
													</m:mtd>
												</m:mtr>
											</m:mtable>
										</m:mrow>
									</m:mrow>
								</m:mrow>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>where |<it>t</it>| is the length of <it>t</it> in words, <it>f</it>(<it>t</it>) is <it>t</it>'s frequency of occurrence and <it>S</it>(<it>t</it>) is the set of other term candidates containing <it>t</it> as a sub-phrase. All candidates whose C-value exceeds a certain threshold are proposed as domain-specific terms by this method. The threshold chosen affects the performance of ATR in terms of precision and recall, which are calculated as <it>P</it> = <it>A</it> / (<it>A</it> + <it>B</it>) and <it>R</it> = <it>A</it> / (<it>A</it> + <it>C</it>), where <it>A</it> is the number of true positives (correctly recognised terms), <it>B</it> is the number of false positives (phrases incorrectly recognised as terms) and <it>C</it> is the number of false negatives (non-recognised terms). Higher thresholds typically result in higher precision and lower recall; conversely, lower thresholds increase recall at the expense of precision. In general, the threshold should be corpus-specific (e.g. the average C-value found in the given corpus), as the C-value of each term candidate also depends on the corpus.</p>
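				<p>As an illustration only, the C-value formula and the precision and recall measures defined above can be sketched in Python as follows. The function names and the representation of term candidates as tuples of words are assumptions made for this sketch; they are not taken from the implementation used in this work.</p>

```python
import math


def is_subphrase(short, long):
    """True if the word tuple `short` occurs contiguously inside the strictly longer `long`."""
    n = len(short)
    if n >= len(long):
        return False
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_values(freq):
    """C-value for every candidate in `freq`, a dict mapping tuples of words to frequencies.

    Assumption of this sketch: candidates have already passed the syntactic
    filter, so every candidate has at least two words (ln 1 would give zero).
    """
    scores = {}
    for t, f_t in freq.items():
        # S(t): the other candidates that contain t as a sub-phrase
        nested_in = [s for s in freq if is_subphrase(t, s)]
        adjusted = f_t
        if nested_in:
            # subtract the mean frequency of the candidates in S(t)
            adjusted = f_t - sum(freq[s] for s in nested_in) / len(nested_in)
        scores[t] = math.log(len(t)) * adjusted
    return scores


def select_terms(scores, threshold):
    """Candidates whose C-value exceeds the threshold."""
    return {t for t, value in scores.items() if value > threshold}


def precision_recall(proposed, gold):
    """P = A / (A + B) and R = A / (A + C) for a set of proposed terms and a gold-standard set."""
    a = len(proposed.intersection(gold))  # true positives
    b = len(proposed) - a                 # false positives
    c = len(gold) - a                     # false negatives
    p = a / (a + b) if proposed else 0.0
    r = a / (a + c) if gold else 0.0
    return p, r
```

				<p>In this sketch a candidate such as <it>chemical shift</it> that occurs 10 times, and is nested in two longer candidates occurring 4 and 2 times, has its frequency reduced by their mean (3) before being weighted by the logarithm of its length.</p>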
				<p>By its definition, the C-value method favours longer and more frequent phrases that are not typically nested within a relatively small set of other phrases. The method thus relies primarily on the frequency of terms and their general syntactic properties rather than exploiting orthographic, morphological and lexical features of specific named entities. Protein names illustrate such features: while they may vary significantly between authors, some general characteristics still apply <abbrgrp><abbr bid="B60">60</abbr><abbr bid="B61">61</abbr></abbrgrp>:</p>
				<p>&#8226; distinctive orthographic characteristics of protein names such as capital letters, digits, special characters (e.g. <it>p<ul>54</ul><ul>SAP</ul> kinase</it>),</p>
				<p>&#8226; keywords (e.g. <it>protein</it>, <it>receptor</it>, etc.) describing the protein function in multi-word protein names (e.g. <it>Ras GTPase-activating <ul>protein</ul></it>, <it>EGF <ul>receptor</ul></it>), and</p>
				<p>&#8226; morphological principles for naming proteins, such as highly abundant affixes -<it>ase</it>, -<it>in</it>, etc. (e.g. <it>hexokin<ul>ase</ul></it>, <it>haemoglob<ul>in</ul></it>).</p>
				<p>Opting for a similar named entity recognition approach would significantly increase the time and cost of developing CV term acquisition methods, as these would have to be re-implemented for specific domains. Moreover, the type of terms sought may not necessarily exhibit sufficiently discriminatory textual properties <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>.</p>
				<p>On the other hand, a generic ATR approach (such as the C-value method) can be steered towards terms of the required type by targeting only relevant documents and, within them, specific sections likely to be dense with terms of that type. The extracted terms can then be filtered further: terms known to belong to other semantic types that are not directly relevant can be removed by using lexical resources covering those types, where such resources exist. The issue of targeting only relevant documents is addressed by the IR module described in the previous section. IR produces a domain-specific corpus by using either MeSH or CV terms in search queries over collections of either abstracts or full-text articles in PubMed.</p>
				<p>Further, it is particularly important to target only sections that are likely to contain terms relevant for an analytical technology as a preparation step for ATR in order to increase its precision. Therefore, when using full-text documents we reduce them to the <it>Materials and Methods</it> sections, which are recognised automatically utilising PMC's XML format in which articles are distributed. Once a domain-specific corpus is obtained, the C-value terms are extracted and further inspected to see if they include any terms known to belong to other sub-domains not directly related to the analytical technology under investigation, in which case they can be safely filtered out.</p>
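				<p>The reduction of a full-text article to its <it>Materials and Methods</it> section can be sketched as follows. This is a minimal illustration, assuming that each section is a sec element with a title child and that the heading matches one of a few common variants; the set of headings and the function name are hypothetical and are not taken from the tool described here.</p>

```python
import xml.etree.ElementTree as ET

# Assumed heading variants; real articles may use others.
METHOD_TITLES = {"materials and methods", "methods and materials", "methods"}


def methods_text(article_root):
    """Return the paragraph text of every section whose title looks like a methods heading."""
    chunks = []
    for sec in article_root.iter("sec"):
        title = sec.find("title")
        if title is None:
            continue
        heading = "".join(title.itertext()).strip().lower()
        if heading in METHOD_TITLES:
            # Collect every paragraph nested in the section, with whitespace collapsed.
            paragraphs = [" ".join("".join(p.itertext()).split()) for p in sec.iter("p")]
            chunks.append("\n".join(paragraphs))
    return "\n\n".join(chunks)
```

				<p>A parsed article (for example, the root element returned by ET.parse(path).getroot()) can be passed directly to methods_text, and the returned text appended to the corpus file.</p>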
			</sec>
			<sec>
				<st>
					<p>Term filtering</p>
				</st>
				<p>Given the initially compiled CVs for NMR and GC, we automatically obtained terms loosely related to these two analytical techniques by applying IR to compile a technology-specific corpus, followed by ATR to extract a list of terms from the corpus in the way described in the preceding sub-sections. Manual inspection of the extracted terms revealed typical types of terms frequently co-occurring with the NMR- and GC-specific terms, namely those denoting substances, organisms, organs, conditions/diseases, etc., which are not of direct interest for the analytical technology <it>per se</it>. Examples of such terms automatically extracted by the C-value method are: <it>amino acid</it>, <it>linseed oil</it>, <it>pancreatic juice</it>, <it>blood glucose</it>, <it>cell wall</it>, <it>Halophilic bacterium</it>, <it>Streptomyces antibioticus</it>, <it>systemic hypertension</it>, <it>cervical dislocation</it>, etc. Unlike analytical techniques, many of which are relatively recent, some of these terminologies are relatively stable with respect to the number of new terms being introduced, e.g. Linnaean taxonomy <abbrgrp><abbr bid="B62">62</abbr></abbrgrp> classifies living organisms in a systematic manner.</p>
				<p>The Unified Medical Language System <abbrgrp><abbr bid="B63">63</abbr></abbrgrp> is a multi-purpose resource merging information from over 100 biomedical source vocabularies developed for different purposes. By providing uniform access (including a web service) to terms belonging to various sub-domains of interest, UMLS aims to facilitate the development of information systems for text processing in biomedicine via a semi-formal representation of domain-specific knowledge in order to process, retrieve, integrate, and aggregate biomedical data and information contained in the relevant literature <abbrgrp><abbr bid="B64">64</abbr></abbrgrp>. It currently contains 1.4 million concepts named by 7.2 million terms, organised into a hierarchy of 135 semantic types and interconnected by 54 different relations.</p>
				<p>The following semantic types in the UMLS proved relevant to our problem of detecting technique-specific terms in a subtractive approach: <it>Organism</it>, <it>Anatomical Structure</it>, <it>Substance</it>, <it>Biological Function</it> and <it>Injury or Poisoning</it>. Given these semantic types as part of the input to the term filtering module (implemented as a Java application), the subsumed terms are automatically selected from the latest version of the UMLS thesaurus. Then, a simple pattern matching approach is applied to filter out these terms and their variations. For example, the filtering approach helped identify the following &#8220;outliers&#8221; amongst terms extracted by the C-value method: <it>experimental <ul>rat</ul></it>, <it><ul>bovine </ul><ul>heart</ul><ul> muscle</ul></it>, <it>maternal <ul>blood</ul> sera specimen</it>, <it>farmworker <ul>pesticide</ul> exposure</it>, <it>arterial <ul>carbon dioxide </ul><ul>tension</ul></it>, etc., simply by matching the UMLS terms from the above mentioned classes (e.g. <it>rat</it>, <it>bovine</it>, <it>heart</it>, <it>muscle</it>, <it>blood</it>, <it>pesticide</it>, <it>carbon dioxide</it>, <it>tension</it>).</p>
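				<p>The subtractive filtering step described above can be sketched as a simple whole-word pattern match. The sketch below is in Python for brevity (the module itself is described as a Java application), and it assumes that the UMLS terms of the selected semantic types have already been exported to a plain list; the function names and the optional plural suffix are assumptions of this illustration.</p>

```python
import re


def build_filter(umls_terms):
    """Compile one case-insensitive pattern matching any term as whole words, optionally plural."""
    # Longer alternatives first, so multi-word terms are preferred over their parts.
    alternatives = sorted((re.escape(t) for t in umls_terms), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")s?\b", re.IGNORECASE)


def filter_candidates(candidates, umls_terms):
    """Split candidates into (kept, removed); a candidate is removed if it contains a UMLS term."""
    pattern = build_filter(umls_terms)
    kept, removed = [], []
    for term in candidates:
        (removed if pattern.search(term) else kept).append(term)
    return kept, removed
```

				<p>With the UMLS terms <it>rat</it> and <it>carbon dioxide</it>, for instance, this sketch removes <it>experimental rat</it> and <it>arterial carbon dioxide tension</it> but keeps <it>flow rate</it>, because matching is restricted to whole words.</p>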
			</sec>
			<sec>
				<st>
					<p>Output</p>
				</st>
				<p>We have described an integrative approach combining relatively generic software (e.g. Entrez for IR, C-value for ATR) and data resources (e.g. UMLS as a semantic network of biomedical terms) for the rapid development of a TM tool for automatic expansion of CVs as a practical alternative to tailor-made named entity recognition methods (see discussion above). An HTML report is generated as a result of the automated CV expansion (see Figure <figr fid="F3">3</figr> for an example report generated for the NMR CV). The report summarises the output of each module described earlier, i.e.:</p>
				<p>&#8226; the number of documents collected by the IR module with a link to the list of their citation details (see Figure <figr fid="F4">4</figr>) and cross-references to the actual documents in PubMed (see Figure <figr fid="F5">5</figr>),</p>
				<p>&#8226; the size of the final text corpus with a link to the corresponding ASCII file (see Figure <figr fid="F6">6</figr>), and</p>
				<p>&#8226; the number of new terms extracted by ATR with a link to the list of terms sorted by their C-values.</p>
				<fig id="F3">
					<title>
						<p>Figure 3</p>
					</title>
					<caption>
						<p>An HTML report summarising CV expansion results</p>
					</caption>
					<text>
						<p>
							<b>An HTML report summarising CV expansion results</b>
						</p>
					</text>
					<graphic file="1471-2105-9-S5-S5-3"/>
				</fig>
				<fig id="F4">
					<title>
						<p>Figure 4</p>
					</title>
					<caption>
						<p>Citation details of the retrieved documents</p>
					</caption>
					<text>
						<p>
							<b>Citation details of the retrieved documents</b>
						</p>
					</text>
					<graphic file="1471-2105-9-S5-S5-4"/>
				</fig>
				<fig id="F5">
					<title>
						<p>Figure 5</p>
					</title>
					<caption>
						<p>A full-text document retrieved from PMC</p>
					</caption>
					<text>
						<p>
							<b>A full-text document retrieved from PMC</b>
						</p>
					</text>
					<graphic file="1471-2105-9-S5-S5-5"/>
				</fig>
				<fig id="F6">
					<title>
						<p>Figure 6</p>
					</title>
					<caption>
						<p>A corpus of &#8220;Materials and Methods&#8221; sections</p>
					</caption>
					<text>
						<p>
							<b>A corpus of &#8220;Materials and Methods&#8221; sections</b>
						</p>
					</text>
					<graphic file="1471-2105-9-S5-S5-6"/>
				</fig>
				<p>Terms extracted from four different corpora are also amalgamated into a single, alphabetically ordered list (see Figure <figr fid="F7">7</figr>, left-hand side window). To aid the curation of automatically extracted terms and their incorporation into the CV, the context of a term can be obtained on-the-fly. The context should help the curator interpret the intended meaning of a term and provide clues useful for generating its textual definition. The context of a term rather than its definition may be more crucial for the association of a term with its correct meaning <abbrgrp><abbr bid="B65">65</abbr></abbrgrp>. Terms sharing the same context are likely to have similar (or even the same) meaning <abbrgrp><abbr bid="B66">66</abbr></abbrgrp>. Conversely, different contexts of the same term may point to the problem of term ambiguity (the same term denoting different concepts). Less drastically, the context may &#8220;deviate&#8221; the meaning of a term by emphasising only certain aspects of a term (e.g. insulin can be interpreted as both hormone and pharmacological substance). Bearing in mind the importance of contextual information in determining the correct meaning of a term and hence its position in a CV, we deployed a practical solution: all new terms reported are linked to MedEvi <abbrgrp><abbr bid="B67">67</abbr></abbrgrp>, a service providing local context (extracted from MEDLINE) for query terms <abbrgrp><abbr bid="B68">68</abbr></abbrgrp>. Clicking on a term launches a query to MedEvi, which in turn returns the aligned concordance (words used in a context) lines together with some handy features such as lists of co-occurring keywords and terms (see Figure <figr fid="F7">7</figr>, right-hand side window).</p>
				<fig id="F7">
					<title>
						<p>Figure 7</p>
					</title>
					<caption>
						<p>A list of automatically extracted terms with links to their concordances</p>
					</caption>
					<text>
						<p>
							<b>A list of automatically extracted terms with links to their concordances</b>
						</p>
					</text>
					<graphic file="1471-2105-9-S5-S5-7"/>
				</fig>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Results and discussion</p>
			</st>
			<p>We performed two case studies to evaluate the effectiveness of the proposed CV expansion approach, using the CVs for NMR and GC that are currently under development as part of the MSI OWG activities. The initial CVs were compiled manually by the MSI OWG members and comprise 243 terms for NMR and 152 terms for GC. In addition to these terms, we hand-picked the MeSH terms relevant for the techniques of interest (<it>Magnetic Resonance Spectroscopy</it> and <it>Chromatography, Gas</it>) by using the web-based MeSH browser. We used these MeSH terms to retrieve documents from PubMed that had been manually annotated with them. A complementary IR approach was based on search queries combining the CV terms, requiring at least 3 matching terms for abstracts and at least 7 for full papers.</p>
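			<p>The CV-based relevance criterion can be illustrated locally as follows. This sketch only mirrors the thresholds quoted above (3 matching terms for abstracts, 7 for full papers); it does not reproduce how the search queries were actually issued, and the function names and the use of case-insensitive substring matching are assumptions.</p>

```python
# Minimum number of distinct CV terms a document must contain (values from the text).
MIN_MATCHES = {"abstract": 3, "full paper": 7}


def count_matching_terms(text, cv_terms):
    """Number of distinct CV terms occurring in the text (case-insensitive substring match)."""
    lowered = text.lower()
    distinct = set(term.lower() for term in cv_terms)
    return sum(1 for term in distinct if term in lowered)


def is_relevant(text, cv_terms, document_type):
    """True if the document reaches the threshold for its type."""
    return count_matching_terms(text, cv_terms) >= MIN_MATCHES[document_type]
```

			<p>The higher threshold for full papers reflects their greater length: a long document is more likely to mention a few CV terms in passing without being about the technique.</p>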
			<p>Tables <tblr tid="T1">1</tblr> and <tblr tid="T2">2</tblr> provide the IR and ATR results. The top two rows refer to the IR approach used for collecting a corpus of relevant documents. The use of MeSH and CV terms to conduct searches over abstracts and full-text documents results in a total of four corpora, whose numerical properties are described in separate columns. The size of each corpus is given as the number of documents retrieved and its size in KBs (rows three and four). Although freely available for browsing, for most articles in PMC the publisher does not allow downloading of the text in XML format; neither does PMC allow bulk downloading in HTML format. Hence, we were able to process only a small number of full-text documents (the numbers in brackets refer to these papers). Total numbers of C-value terms extracted from each corpus are given in the bottom two rows, one referring to the total number of terms recognised by the C-value method and the other referring to the number of these terms remaining after applying the filtering approach based on the available knowledge about their semantic types.</p>
			<tbl id="T1" hint_layout="single">
				<title>
					<p>Table 1</p>
				</title>
				<caption>
					<p>Term acquisition results for NMR</p>
				</caption>
				<tblbdy cols="6">
					<r>
						<c rspan="3">
							<p>
								<b>IR</b>
							</p>
						</c>
						<c>
							<p>
								<b>search terms</b>
							</p>
						</c>
						<c cspan="2" ca="center">
							<p>
								<b>MeSH</b>
							</p>
						</c>
						<c cspan="2" ca="center">
							<p>
								<b>CV</b>
							</p>
						</c>
					</r>
					<r>
						<c cspan="5">
							<hr/>
						</c>
					</r>
					<r>
						<c>
							<p>
								<b>document type</b>
							</p>
						</c>
						<c>
							<p>
								<b>abstracts</b>
							</p>
						</c>
						<c>
							<p>
								<b>full papers</b>
							</p>
						</c>
						<c>
							<p>
								<b>abstracts</b>
							</p>
						</c>
						<c>
							<p>
								<b>full papers</b>
							</p>
						</c>
					</r>
					<r>
						<c cspan="6">
							<hr/>
						</c>
					</r>
					<r>
						<c rspan="3">
							<p>
								<b>corpus size</b>
							</p>
						</c>
						<c>
							<p>
								<b>documents</b>
							</p>
						</c>
						<c>
							<p>122,867</p>
						</c>
						<c>
							<p>6,125 (141)</p>
						</c>
						<c>
							<p>1,613</p>
						</c>
						<c>
							<p>758 (29)</p>
						</c>
					</r>
					<r>
						<c cspan="5">
							<hr/>
						</c>
					</r>
					<r>
						<c>
							<p>
								<b>KBs</b>
							</p>
						</c>
						<c>
							<p>113,191</p>
						</c>
						<c>
							<p>663</p>
						</c>
						<c>
							<p>2,047</p>
						</c>
						<c>
							<p>270</p>
						</c>
					</r>
					<r>
						<c cspan="6">
							<hr/>
						</c>
					</r>
					<r>
						<c rspan="3">
							<p>
								<b>C-value terms</b>
							</p>
						</c>
						<c>
							<p>
								<b>before filtering</b>
							</p>
						</c>
						<c>
							<p>5,602</p>
						</c>
						<c>
							<p>6,215</p>
						</c>
						<c>
							<p>124</p>
						</c>
						<c>
							<p>2,601</p>
						</c>
					</r>
					<r>
						<c cspan="5">
							<hr/>
						</c>
					</r>
					<r>
						<c>
							<p>
								<b>after filtering</b>
							</p>
						</c>
						<c>
							<p>2,298</p>
						</c>
						<c>
							<p>3,257</p>
						</c>
						<c>
							<p>61</p>
						</c>
						<c>
							<p>1,385</p>
						</c>
					</r>
				</tblbdy>
			</tbl>
			<tbl id="T2" hint_layout="single">
				<title>
					<p>Table 2</p>
				</title>
				<caption>
					<p>Term acquisition results for GC</p>
				</caption>
				<tblbdy cols="6">
					<r>
						<c rspan="3">
							<p>
								<b>IR</b>
							</p>
						</c>
						<c>
							<p>
								<b>search terms</b>
							</p>
						</c>
						<c cspan="2" ca="center">
							<p>
								<b>MeSH</b>
							</p>
						</c>
						<c cspan="2" ca="center">
							<p>
								<b>CV</b>
							</p>
						</c>
					</r>
					<r>
						<c cspan="5">
							<hr/>
						</c>
					</r>
					<r>
						<c>
							<p>
								<b>document type</b>
							</p>
						</c>
						<c>
							<p>
								<b>abstracts</b>
							</p>
						</c>
						<c>
							<p>
								<b>full papers</b>
							</p>
						</c>
						<c>
							<p>
								<b>abstracts</b>
							</p>
						</c>
						<c>
							<p>
								<b>full papers</b>
							</p>
						</c>
					</r>
					<r>
						<c cspan="6">
							<hr/>
						</c>
					</r>
					<r>
						<c rspan="3">
							<p>
								<b>corpus size</b>
							</p>
						</c>
						<c>
							<p>
								<b>documents</b>
							</p>
						</c>
						<c>
							<p>60,338</p>
						</c>
						<c>
							<p>1,351 (79)</p>
						</c>
						<c>
							<p>3,948</p>
						</c>
						<c>
							<p>1,383 (58)</p>
						</c>
					</r>
					<r>
						<c cspan="5">
							<hr/>
						</c>
					</r>
					<r>
						<c>
							<p>
								<b>KBs</b>
							</p>
						</c>
						<c>
							<p>42,418</p>
						</c>
						<c>
							<p>68</p>
						</c>
						<c>
							<p>3,012</p>
						</c>
						<c>
							<p>97</p>
						</c>
					</r>
					<r>
						<c cspan="6">
							<hr/>
						</c>
					</r>
					<r>
						<c rspan="3">
							<p>
								<b>C-value terms</b>
							</p>
						</c>
						<c>
							<p>
								<b>before filtering</b>
							</p>
						</c>
						<c>
							<p>2,708</p>
						</c>
						<c>
							<p>811</p>
						</c>
						<c>
							<p>2,442</p>
						</c>
						<c>
							<p>1,114</p>
						</c>
					</r>
					<r>
						<c cspan="5">
							<hr/>
						</c>
					</r>
					<r>
						<c>
							<p>
								<b>after filtering</b>
							</p>
						</c>
						<c>
							<p>567</p>
						</c>
						<c>
							<p>348</p>
						</c>
						<c>
							<p>1,323</p>
						</c>
						<c>
							<p>526</p>
						</c>
					</r>
				</tblbdy>
			</tbl>
			<p>By amalgamating all filtered terms, a total of 5,699 and 2,612 new terms were acquired for NMR and GC respectively. The bottom rows in Tables <tblr tid="T1">1</tblr> and <tblr tid="T2">2</tblr> show their distribution across the four corpora. Note that the total number of new terms is smaller than the sum of these numbers because some terms were extracted from more than one corpus. For each type of search terms (i.e. MeSH or CV terms), we compared the ATR results acquired from abstracts with those obtained from the <it>Materials and Methods</it> sections of full-text articles. We determined that the overlap between the terms extracted from abstracts and those from the body of full-text articles was 2% on average. By further contrasting the results acquired from abstracts and full-text articles, we determined that the average ratio between the number of acquired technology-specific terms and the corpus size was 16.25 for full-text articles and only 0.13 for abstracts. This comparison confirms that the <it>Materials and Methods</it> sections represent a significant source of technology-specific terms and emphasises the benefit to the wider biomedical community of making full-text articles available to TM applications.</p>
			<p>The preliminary results are available at <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>, where the potential CV terms are accessible to the metabolomics community for comments and curation. The official version of the NMR CV has been made publicly available at <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> as part of the NMR ontology. We have to note that the integration of new terms into the MSI CVs has only just started and a full evaluation can only be published later on the web pages. Nevertheless, we performed a preliminary evaluation using the following setup. For each case study, we selected a test set of 100 terms chosen randomly from the resulting set of candidate CV terms. Each test set was evaluated independently by two domain experts. Each term from the test sets was scored from 1 to 5 reflecting an expert opinion about the degree to which the term in question is related to the technology described by the CV: 1 &#8211; no, definitely; 2 &#8211; no, probably; 3 &#8211; don't know / not sure; 4 &#8211; yes, probably; 5 &#8211; yes, definitely. The detailed evaluation results are given in Additional File <supplr sid="S1">1</supplr>, where a reader can find the score given to each term by each of the curators. We also provide a mean score for each evaluated term and we measure the agreement between the curators by giving the score difference for each of the terms. The mean and median values for all scores are summarised in Tables <tblr tid="T3">3</tblr> and <tblr tid="T4">4</tblr>. In both cases, the mean value of the average score was around 3.5 with the average difference in scores given by two curators not being greater than one. The distribution of the scores is shown in Figures <figr fid="F8">8</figr> and <figr fid="F9">9</figr>. 
These results show that, in the case of NMR, 51 terms were deemed relevant (average score greater than 3), 22 terms were undecided (average score of 3) and 27 terms were deemed irrelevant (average score less than 3). Similarly, in the case of GC, 61 terms were deemed relevant, 4 undecided and 35 irrelevant. Projecting these proportions onto the total of 5,699 candidate NMR terms, we estimate the numbers of relevant, undecided and irrelevant terms to be 2,906, 1,254 and 1,539 respectively. For the total of 2,612 candidate GC terms, the projection gives 1,593 relevant, 104 undecided and 914 irrelevant terms. By adding &#8776;2,900 new terms to the NMR CV (initially containing 243 terms) and &#8776;1,600 new terms to the GC CV (initially containing 152 terms), both CVs can be expanded by more than ten times their original size simply by curating automatically proposed terms, as opposed to collecting CV terms through interviewing techniques and reading the relevant literature.</p>
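			<p>The projection from the 100-term test sets to the full candidate lists is simple proportional scaling, which the following sketch reproduces. The counts and totals are those reported above; the function names are chosen for illustration only.</p>

```python
def classify(mean_score):
    """Label a term by the mean of the two curators' scores (scale 1 to 5)."""
    if mean_score > 3:
        return "relevant"
    if mean_score == 3:
        return "undecided"
    return "irrelevant"


def project(sample_counts, total_candidates):
    """Scale the counts observed in the test sample up to the full candidate list."""
    sample_size = sum(sample_counts.values())
    return {label: round(total_candidates * n / sample_size)
            for label, n in sample_counts.items()}


# Sample counts and candidate totals as reported in the text.
nmr = project({"relevant": 51, "undecided": 22, "irrelevant": 27}, 5699)
gc = project({"relevant": 61, "undecided": 4, "irrelevant": 35}, 2612)

# Size of each expanded CV relative to its initial size (243 and 152 terms).
nmr_factor = (243 + nmr["relevant"]) / 243
gc_factor = (152 + gc["relevant"]) / 152
```

			<p>Because each class is rounded independently, the projected GC figures add up to 2,611 rather than 2,612. These projections rest on a sample of 100 terms per technique and should be read as rough estimates.</p>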
			<suppl id="S1">
				<title>
					<p>Additional File 1</p>
				</title>
				<text>
					<p>Evaluation results: each test set was evaluated independently by two domain experts. Each term from the test sets was scored from 1 to 5 reflecting an expert opinion about the degree to which the term in question is related to the technology described by the CV: 1 &#8211; no, definitely; 2 &#8211; no, probably; 3 &#8211; don't know / not sure; 4 &#8211; yes, probably; 5 &#8211; yes, definitely.</p>
				</text>
				<file name="1471-2105-9-S5-S5-S1.xls">
					<p>Click here for file</p>
				</file>
			</suppl>
			<tbl id="T3" hint_layout="single">
				<title>
					<p>Table 3</p>
				</title>
				<caption>
					<p>Evaluation of term acquisition results for NMR</p>
				</caption>
				<tblbdy cols="5">
					<r>
						<c>
							<p>
								<b>score</b>
							</p>
						</c>
						<c>
							<p>
								<b>by curator #1</b>
							</p>
						</c>
						<c>
							<p>
								<b>by curator #2</b>
							</p>
						</c>
						<c>
							<p>
								<b>mean between #1 &amp; #2</b>
							</p>
						</c>
						<c>
							<p>
								<b>difference between #1 &amp; #2</b>
							</p>
						</c>
					</r>
					<r>
						<c cspan="5">
							<hr/>
						</c>
					</r>
					<r>
						<c>
							<p>
								<b>mean</b>
							</p>
						</c>
						<c>
							<p>3.81</p>
						</c>
						<c>
							<p>3.19</p>
						</c>
						<c>
							<p>3.5</p>
						</c>
						<c>
							<p>0.88</p>
						</c>
					</r>
					<r>
						<c>
							<p>
								<b>median</b>
							</p>
						</c>
						<c>
							<p>4</p>
						</c>
						<c>
							<p>3</p>
						</c>
						<c>
							<p>3.5</p>
						</c>
						<c>
							<p>1</p>
						</c>
					</r>
				</tblbdy>
			</tbl>
			<tbl id="T4" hint_layout="single">
				<title>
					<p>Table 4</p>
				</title>
				<caption>
					<p>Evaluation of term acquisition results for GC</p>
				</caption>
				<tblbdy cols="5">
					<r>
						<c>
							<p>
								<b>score</b>
							</p>
						</c>
						<c>
							<p>
								<b>by curator #1</b>
							</p>
						</c>
						<c>
							<p>
								<b>by curator #2</b>
							</p>
						</c>
						<c>
							<p>
								<b>mean between #1 &amp; #2</b>
							</p>
						</c>
						<c>
							<p>
								<b>difference between #1 &amp; #2</b>
							</p>
						</c>
					</r>
					<r>
						<c cspan="5">
							<hr/>
						</c>
					</r>
					<r>
						<c>
							<p>
								<b>mean</b>
							</p>
						</c>
						<c>
							<p>3.06</p>
						</c>
						<c>
							<p>3.79</p>
						</c>
						<c>
							<p>3.425</p>
						</c>
						<c>
							<p>0.93</p>
						</c>
					</r>
					<r>
						<c>
							<p>
								<b>median</b>
							</p>
						</c>
						<c>
							<p>4</p>
						</c>
						<c>
							<p>4</p>
						</c>
						<c>
							<p>4</p>
						</c>
						<c>
							<p>1</p>
						</c>
					</r>
				</tblbdy>
			</tbl>
			<fig id="F8">
				<title>
					<p>Figure 8</p>
				</title>
				<caption>
					<p>Distribution of evaluation scores for NMR</p>
				</caption>
				<text>
					<p>
						<b>Distribution of evaluation scores for NMR</b>
					</p>
				</text>
				<graphic file="1471-2105-9-S5-S5-8"/>
			</fig>
			<fig id="F9">
				<title>
					<p>Figure 9</p>
				</title>
				<caption>
					<p>Distribution of evaluation scores for GC</p>
				</caption>
				<text>
					<p>
						<b>Distribution of evaluation scores for GC</b>
					</p>
				</text>
				<graphic file="1471-2105-9-S5-S5-9"/>
			</fig>
			<p>In addition to the preliminary quantitative evaluation, we offer some qualitative remarks about our TM approach to CV expansion, which will be taken into account to improve the functionality of the tool. Some of the extracted terms were &#8220;incomplete&#8221;. For example, the term <it>comparative NMR</it>, as found in the result list, lacks the headword it needs to be understandable enough for insertion into a CV: its concordance (<url>http://www.ebi.ac.uk/tc-test/textmining/medevi/results.jsp?query=%22comparative%20nmr%22&amp;submitbutton=Submit</url>) reveals that this term should be <it>comparative NMR analysis</it> or <it>comparative NMR study</it>. This is due to the term variation phenomenon, in which the same concept is designated by more than one term. When such term candidates are processed separately, each variant receives its own frequency and C-value instead of a single frequency unifying all of the variants. Hence, in order to make the most of the statistical part of the C-value method, term candidates need to be normalised prior to statistical analysis <abbrgrp><abbr bid="B69">69</abbr></abbrgrp>.</p>
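			<p>As a minimal illustration of why normalisation matters, the following Python sketch merges the frequencies of term variants before any statistical scoring takes place. The function names (<it>normalise</it>, <it>merge_variant_frequencies</it>), the crude normalisation rules and the example counts are all assumptions made for this illustration; they are not part of the tool described here or of the C-value method itself.</p>
			<p><![CDATA[
```python
import re
from collections import Counter


def normalise(term):
    """Toy normalisation (an assumption, not the published procedure):
    lower-case the term, treat hyphens as spaces, and strip a plural
    's' from the last word unless it ends in 'ss', 'is' or 'us'."""
    tokens = [t for t in re.split(r"[\s\-]+", term.lower().strip()) if t]
    if tokens:
        last = tokens[-1]
        if len(last) > 3 and last.endswith("s") and not last.endswith(("ss", "is", "us")):
            tokens[-1] = last[:-1]
    return " ".join(tokens)


def merge_variant_frequencies(freqs):
    """Sum the frequencies of all variants that share a normalised form."""
    merged = Counter()
    for term, count in freqs.items():
        merged[normalise(term)] += count
    return merged


# Hypothetical raw counts for three variants of one concept plus one other term.
raw = {
    "chemical shift": 12,
    "chemical shifts": 7,
    "Chemical-shift": 3,
    "NMR spectrum": 5,
}
merged = merge_variant_frequencies(raw)
# The three variants now contribute to one entry: merged["chemical shift"] is 22.
```
]]></p>
			<p>With the variants merged, a frequency-based termhood measure sees one candidate with a combined count rather than three weaker candidates. A realistic normaliser would also need to handle irregular plurals, word-order variation and acronyms, which this sketch deliberately ignores.</p>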
			<p>Further, the CV expansion process could be supported by presenting the resulting terms differently. Clustering the candidate terms according to their head noun phrases (e.g. <it>experiment</it>, <it>assay</it>, <it>spectrum</it>, <it>chemical shift</it>) would facilitate term integration and hierarchical structuring of the CV.</p>
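			<p>One simple way to produce such a presentation is sketched below in Python. It assumes that the head of an English noun phrase is its rightmost word, unless the term ends in one of a supplied list of multi-word heads. The function names, the <it>known_heads</it> parameter and the example terms are hypothetical and serve only to illustrate the idea.</p>
			<p><![CDATA[
```python
from collections import defaultdict


def head_of(term, known_heads=()):
    """Return the head of a non-empty term: the longest known head that
    the term ends with, otherwise simply its last word (a heuristic)."""
    tokens = term.lower().split()
    for head in sorted(known_heads, key=lambda h: -len(h.split())):
        head_tokens = head.lower().split()
        if tokens[-len(head_tokens):] == head_tokens:
            return " ".join(head_tokens)
    return tokens[-1]


def cluster_by_head(terms, known_heads=()):
    """Group candidate terms under their head, keeping input order."""
    clusters = defaultdict(list)
    for term in terms:
        clusters[head_of(term, known_heads)].append(term)
    return dict(clusters)


# Made-up candidate terms, used only to show the grouping.
terms = [
    "COSY experiment",
    "NOESY experiment",
    "proton chemical shift",
    "carbon chemical shift",
    "NMR spectrum",
]
clusters = cluster_by_head(terms, known_heads=["chemical shift"])
for head, members in clusters.items():
    print(head, "->", members)
```
]]></p>
			<p>Each cluster key is a candidate parent concept, and its members are candidate children, which is the kind of grouping a curator could use when placing terms in the CV hierarchy.</p>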
		</sec>
		<sec>
			<st>
				<p>Conclusions</p>
			</st>
			<p>We described an integrative approach combining relatively generic, public software and data resources for time- and cost-effective development of a TM tool to aid the expansion of CVs across various domains. This should serve as a practical alternative to both manual term collection and tailor-made named entity recognition methods. The software makes use of web services to access three key resources:</p>
			<p>&#8226; Entrez for IR,</p>
			<p>&#8226; C-value for ATR, and</p>
			<p>&#8226; UMLS as a semantic network of biomedical terms.</p>
			<p>It is disseminated under an open-source licence. Although originally developed to the specification of the MSI OWG, it is generic enough to be applied to the expansion of other CVs in biomedicine simply by changing the input parameters:</p>
			<p>&#8226; the initially compiled CV,</p>
			<p>&#8226; the MeSH terms that reflect the domain of the CV, and</p>
			<p>&#8226; the UMLS semantic types of terms indirectly related to those covered by the CV.</p>
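			<p>To make the role of the MeSH parameter concrete, the Python sketch below builds an Entrez search request from a list of MeSH terms. It assumes the public NCBI E-utilities <it>esearch</it> endpoint with its <it>db</it>, <it>term</it> and <it>retmax</it> parameters; the function name <it>build_esearch_url</it> and the two example headings are choices made for this illustration, and the tool itself may access Entrez differently.</p>
			<p><![CDATA[
```python
from urllib.parse import parse_qs, urlencode, urlparse

# NCBI E-utilities search endpoint (assumed here; the tool may use another interface).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(mesh_terms, db="pubmed", retmax=100):
    """Build a search URL that matches documents indexed with any of the
    given MeSH terms. No network request is made here."""
    query = " OR ".join('"{}"[MeSH Terms]'.format(t) for t in mesh_terms)
    return ESEARCH + "?" + urlencode({"db": db, "term": query, "retmax": retmax})


# Example MeSH headings for the NMR and GC domains (illustrative choice).
url = build_esearch_url(["Magnetic Resonance Spectroscopy", "Chromatography, Gas"])

# Decode the query string again to check what would be sent.
params = parse_qs(urlparse(url).query)
print(params["term"][0])
```
]]></p>
			<p>Swapping the list of MeSH terms is then enough to point the retrieval step at a different domain, which mirrors how the input parameters above are meant to be changed.</p>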
			<p>The output terms are presented to the user in HTML format so that they can be inspected in a web browser. The context of each term as used in the scientific literature can be explored through the hyperlinked MedEvi service (a web-based search tool for the MEDLINE corpus), which aids the curation of the potential CV terms.</p>
		</sec>
		<sec>
			<st>
				<p>Availability and requirements</p>
			</st>
			<p>Project name: CVexpand</p>
			<p>Project home page: <url>http://mcisb.org/resources/CVexpand/</url></p>
			<p>Operating system(s): Platform independent</p>
			<p>Programming language: Java (version 1.6)</p>
			<p>Other requirements: Access to SQL database</p>
			<p>License: Academic Free License v3.0</p>
			<p>Any restrictions to use by non-academics: None</p>
		</sec>
		<sec>
			<st>
				<p>List of abbreviations used</p>
			</st>
			<p>ATR automatic term recognition</p>
			<p>CV controlled vocabulary</p>
			<p>DB database</p>
			<p>GC gas chromatography</p>
			<p>GC-MS gas chromatography &#8211; mass spectrometry</p>
			<p>HUPO human proteome organization</p>
			<p>HTML hypertext markup language</p>
			<p>IR information retrieval</p>
			<p>JDBC Java database connectivity</p>
			<p>MEDLINE medical literature analysis and retrieval system online</p>
			<p>MeSH medical subject headings</p>
			<p>MGED microarray gene expression data society</p>
			<p>MS mass spectrometry</p>
			<p>MSI metabolomics standards initiative</p>
			<p>NMR nuclear magnetic resonance</p>
			<p>OBI ontology for biomedical investigations</p>
			<p>OBO open biomedical ontologies</p>
			<p>OWG ontology working group</p>
			<p>PSI proteomics standards initiative</p>
			<p>PMC PubMed Central</p>
			<p>SQL structured query language</p>
			<p>TM text mining</p>
			<p>UMLS unified medical language system</p>
			<p>XML extensible markup language</p>
		</sec>
		<sec>
			<st>
				<p>Competing interests</p>
			</st>
			<p>The authors declare that they have no competing interests.</p>
		</sec>
		<sec>
			<st>
				<p>Authors' contributions</p>
			</st>
			<p>IS designed and implemented the text mining application and drafted the manuscript. DS provided the initial data, evaluated the results and helped to draft the manuscript. SAS conceived the overall study and participated in its design and coordination. DRS participated in the design and coordination of the text mining aspects of the study. DBK provided his expertise in metabolomics to help evaluate the results. NP supervised the bioinformatics integration aspects. MSI OWG members participated in provision of the data, discussions and evaluation. All authors read and approved the final manuscript.</p>
		</sec>
	</bdy>
	<bm>
		<ack>
			<sec>
				<st>
					<p>Acknowledgements</p>
				</st>
				<p>We kindly acknowledge other members of the MSI Ontology WG, the MSI Oversight Committee, other MSI WGs, National Centre for Text Mining, the OBI WG, the OBO Foundry leaders and the Ontogenesis Networks members for their contributions in fruitful discussions. We also owe thanks to our colleagues for their assistance in the evaluation of the results. Their names are (in alphabetical order): Warwick Dunn, Farid Khan and Denis V. Rubtsov. We gratefully acknowledge the support of the BBSRC/EPSRC via &#8220;The Manchester Centre for Integrative Systems Biology&#8221; grant (BB/C008219/1: DBK, NP and IS), the BBSRC e-Science Development Fund (BB/D524283/1: SAS and DS) and the EU Network of Excellence Semantic Interoperability and Data Mining in Biomedicine (NoE 507505: IS and DS).</p>
				<p>This article has been published as part of <it>BMC Bioinformatics</it> Volume 9 Supplement 5, 2008: Proceedings of the 10th Bio-Ontologies Special Interest Group Workshop 2007. Ten years past and looking to the future. The full contents of the supplement are available online at <url>http://www.biomedcentral.com/1471-2105/9?issue=S5</url>.</p>
			</sec>
		</ack>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>A special issue on data standards</p>
				</title>
				<aug>
					<au>
						<snm>Field</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Sansone</snm>
						<fnm>S-A</fnm>
					</au>
				</aug>
				<source>OMICS</source>
				<pubdate>2006</pubdate>
				<volume>10</volume>
				<fpage>84</fpage>
				<lpage>93</lpage>
			</bibl>
			<bibl id="B2">
				<title>
					<p>Data standards for &#8216;omic&#8217; science</p>
				</title>
				<aug>
					<au>
						<snm>Quackenbush</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Nature Biotechnology</source>
				<pubdate>2004</pubdate>
				<volume>22</volume>
				<fpage>613</fpage>
				<lpage>614</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15122299</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B3">
				<title>
					<p>Metabolomics technology and bioinformatics</p>
				</title>
				<aug>
					<au>
						<snm>Shulaev</snm>
						<fnm>V</fnm>
					</au>
				</aug>
				<source>Briefings in Bioinformatics</source>
				<pubdate>2006</pubdate>
				<volume>7</volume>
				<fpage>128</fpage>
				<lpage>139</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">16772266</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B4">
				<title>
					<p>The practical impact of ontologies on biomedical informatics</p>
				</title>
				<aug>
					<au>
						<snm>Cimino</snm>
						<fnm>JJ</fnm>
					</au>
					<au>
						<snm>Zhu</snm>
						<fnm>X</fnm>
					</au>
				</aug>
				<source>Methods of Information in Medicine</source>
				<pubdate>2006</pubdate>
				<volume>45</volume>
				<fpage>124</fpage>
				<lpage>135</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">17051306</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B5">
				<title>
					<p>Ontologies for molecular biology and bioinformatics</p>
				</title>
				<aug>
					<au>
						<snm>Schulze-Kremer</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>In Silico Biol</source>
				<pubdate>2002</pubdate>
				<volume>2</volume>
				<fpage>179</fpage>
				<lpage>193</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12542404</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B6">
				<title>
					<p>Text mining and ontologies in biomedicine: making sense of raw text</p>
				</title>
				<aug>
					<au>
						<snm>Spasic</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Ananiadou</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>McNaught</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Kumar</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Briefings in Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>6</volume>
				<fpage>239</fpage>
				<lpage>251</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">16212772</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B7">
				<title>
					<p>Methods of automatic term recognition: a review</p>
				</title>
				<aug>
					<au>
						<snm>Kageura</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Umino</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Terminology</source>
				<pubdate>1996</pubdate>
				<volume>3</volume>
				<fpage>259</fpage>
				<lpage>289</lpage>
			</bibl>
			<bibl id="B8">
				<aug>
					<au>
						<snm>Jacquemin</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>Spotting and discovering terms through natural language processing</source>
				<publisher>Cambridge, Mass, USA: The MIT Press</publisher>
				<pubdate>2001</pubdate>
			</bibl>
			<bibl id="B9">
				<title>
					<p>From concepts to clinical reality: an essay on the benchmarking of biomedical terminologies</p>
				</title>
				<aug>
					<au>
						<snm>Smith</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Journal of Biomedical Informatics</source>
				<pubdate>2006</pubdate>
				<volume>39</volume>
				<fpage>288</fpage>
				<lpage>298</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">16293444</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B10">
				<title>
					<p>Metabolomics Standards Workshop and the development of international standards for reporting metabolomics experimental results</p>
				</title>
				<aug>
					<au>
						<snm>Castle</snm>
						<fnm>AL</fnm>
					</au>
					<au>
						<snm>Fiehn</snm>
						<fnm>O</fnm>
					</au>
					<au>
						<snm>Kaddurah-Daouk</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Lindon</snm>
						<fnm>JC</fnm>
					</au>
				</aug>
				<source>Briefings in Bioinformatics</source>
				<pubdate>2006</pubdate>
				<volume>7</volume>
				<fpage>159</fpage>
				<lpage>165</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">16772263</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B11">
				<title>
					<p>Bio-ontologies: current trends and future directions</p>
				</title>
				<aug>
					<au>
						<snm>Bodenreider</snm>
						<fnm>O</fnm>
					</au>
					<au>
						<snm>Stevens</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>Briefings in Bioinformatics</source>
				<pubdate>2006</pubdate>
				<volume>7</volume>
				<fpage>256</fpage>
				<lpage>274</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1847325</pubid>
						<pubid idtype="pmpid" link="fulltext">16899495</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B12">
				<title>
					<p>MSI</p>
				</title>
				<pubdate>2007</pubdate>
				<note>[<url>http://msi-workgroups.sf.net/</url>].</note>
			</bibl>
			<bibl id="B13">
				<title>
					<p>The Metabolomics Standards Initiative</p>
				</title>
				<source>Nat Biotechnol</source>
				<pubdate>2007</pubdate>
				<volume>25</volume>
				<fpage>846</fpage>
				<lpage>848</lpage>
				<xrefbib>
					<pubidlist>
						<pubid>17687353</pubid>
						<pubid idtype="pmpid" link="fulltext">17687353</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B14">
				<title>
					<p>MSI OWG</p>
				</title>
				<pubdate>2007</pubdate>
				<note>[<url>http://msi-ontology.sf.net/</url>].</note>
			</bibl>
			<bibl id="B15">
				<title>
					<p>The metabolomics standards initiative (MSI)</p>
				</title>
				<aug>
					<au>
						<snm>Fiehn</snm>
						<fnm>O</fnm>
					</au>
					<au>
						<snm>Robertson</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Griffin</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>van der Werf</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Nikolau</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Morrison</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Sumner</snm>
						<fnm>LW</fnm>
					</au>
					<au>
						<snm>Goodacre</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Hardy</snm>
						<fnm>NW</fnm>
					</au>
					<au>
						<snm>Taylor</snm>
						<fnm>C</fnm>
					</au>
					<etal/>
				</aug>
				<source>Metabolomics</source>
				<pubdate>2007</pubdate>
				<volume>3</volume>
				<fpage>175</fpage>
				<lpage>178</lpage>
			</bibl>
			<bibl id="B16">
				<title>
					<p>Text-based knowledge discovery: search and mining of life-sciences documents</p>
				</title>
				<aug>
					<au>
						<snm>Mack</snm>
						<fnm>RL</fnm>
					</au>
					<au>
						<snm>Hehenberger</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Drug Discovery Today</source>
				<pubdate>2002</pubdate>
				<volume>7</volume>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12047886</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B17">
				<title>
					<p>Metabolomics Standards Initiative - Ontology Working Group: Work in progress</p>
				</title>
				<aug>
					<au>
						<snm>Sansone</snm>
						<fnm>S-A</fnm>
					</au>
					<au>
						<snm>Schober</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Atherton</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Fiehn</snm>
						<fnm>O</fnm>
					</au>
					<au>
						<snm>Jenkins</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Rocca-Serra</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Rubtsov</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Spasic</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Soldatova</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Taylor</snm>
						<fnm>C</fnm>
					</au>
					<etal/>
				</aug>
				<source>Metabolomics</source>
				<pubdate>2007</pubdate>
				<volume>3</volume>
				<fpage>249</fpage>
				<lpage>256</lpage>
			</bibl>
			<bibl id="B18">
				<title>
					<p>HUPO-PSI</p>
				</title>
				<pubdate>2007</pubdate>
				<note>[<url>http://www.psidev.info/</url>].</note>
			</bibl>
			<bibl id="B19">
				<title>
					<p>The work of the Human Proteome Organisation's Proteomics Standards Initiative (HUPO PSI)</p>
				</title>
				<aug>
					<au>
						<snm>Taylor</snm>
						<fnm>CF</fnm>
					</au>
					<au>
						<snm>Hermjakob</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Julian</snm>
						<fnm>RK</fnm>
					</au>
					<au>
						<snm>Garavelli</snm>
						<fnm>JS</fnm>
					</au>
					<au>
						<snm>Aebersold</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>OMICS</source>
				<pubdate>2006</pubdate>
				<volume>10</volume>
				<fpage>145</fpage>
				<lpage>151</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">16901219</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B20">
				<title>
					<p>MGED</p>
				</title>
				<pubdate>2007</pubdate>
				<note>[<url>http://www.mged.org/</url>].</note>
			</bibl>
			<bibl id="B21">
				<title>
					<p>The MGED Ontology: a resource for semantics-based description of microarray experiments</p>
				</title>
				<aug>
					<au>
						<snm>Whetzel</snm>
						<fnm>PL</fnm>
					</au>
					<au>
						<snm>Parkinson</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Causton</snm>
						<fnm>HC</fnm>
					</au>
					<au>
						<snm>Fan</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Fostel</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Fragoso</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Game</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Heiskanen</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Morrison</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Rocca-Serra</snm>
						<fnm>P</fnm>
					</au>
					<etal/>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2006</pubdate>
				<volume>22</volume>
				<fpage>866</fpage>
				<lpage>873</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">16428806</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B22">
				<title>
					<p>OBO</p>
				</title>
				<pubdate>2007</pubdate>
				<note>[<url>http://obo.sourceforge.net/</url>].</note>
			</bibl>
			<bibl id="B23">
				<title>
					<p>National Center for Biomedical Ontology: advancing biomedicine through structured organization of scientific knowledge</p>
				</title>
				<aug>
					<au>
						<snm>Rubin</snm>
						<fnm>DL</fnm>
					</au>
					<au>
						<snm>Lewis</snm>
						<fnm>SE</fnm>
					</au>
					<au>
						<snm>Mungall</snm>
						<fnm>CJ</fnm>
					</au>
					<au>
						<snm>Misra</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Westerfield</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Ashburner</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Sim</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Chute</snm>
						<fnm>CG</fnm>
					</au>
					<au>
						<snm>Solbrig</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Storey</snm>
						<fnm>M-A</fnm>
					</au>
					<etal/>
				</aug>
				<source>OMICS</source>
				<pubdate>2006</pubdate>
				<volume>10</volume>
				<fpage>185</fpage>
				<lpage>198</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">16901225</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B24">
				<title>
					<p>The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration</p>
				</title>
				<aug>
					<au>
						<snm>Smith</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Ashburner</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Rosse</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Bard</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Bug</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Ceusters</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Goldberg</snm>
						<fnm>LJ</fnm>
					</au>
					<au>
						<snm>Eilbeck</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Ireland</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Mungall</snm>
						<fnm>CJ</fnm>
					</au>
					<etal/>
				</aug>
				<source>Nat Biotechnol</source>
				<pubdate>2007</pubdate>
				<volume>25</volume>
				<fpage>1251</fpage>
				<lpage>1255</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">17989687</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B25">
				<title>
					<p>Metabolomics: Current analytical platforms and methodologies</p>
				</title>
				<aug>
					<au>
						<snm>Dunn</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Ellis</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Trends in Analytical Chemistry</source>
				<pubdate>2005</pubdate>
				<volume>24</volume>
				<fpage>285</fpage>
				<lpage>294</lpage>
			</bibl>
			<bibl id="B26">
				<title>
					<p>PSI</p>
				</title>
				<pubdate>2007</pubdate>
				<note>[<url>http://www.psidev.info/</url>].</note>
			</bibl>
			<bibl id="B27">
				<title>
					<p>OBI</p>
				</title>
				<pubdate>2007</pubdate>
				<note>[<url>http://obi.sf.net/</url>].</note>
			</bibl>
			<bibl id="B28">
				<title>
					<p>Development of FuGO: An ontology for functional genomics investigations</p>
				</title>
				<aug>
					<au>
						<snm>Whetzel</snm>
						<fnm>PL</fnm>
					</au>
					<au>
						<snm>Brinkman</snm>
						<fnm>RR</fnm>
					</au>
					<au>
						<snm>Causton</snm>
						<fnm>HC</fnm>
					</au>
					<au>
						<snm>Fan</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Field</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Fostel</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Fragoso</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Gray</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Heiskanen</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Hernandez-Boussard</snm>
						<fnm>T</fnm>
					</au>
					<etal/>
				</aug>
				<source>OMICS A Journal of Integrative Biology</source>
				<pubdate>2006</pubdate>
				<volume>10</volume>
				<fpage>199</fpage>
				<lpage>204</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">16901226</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B29">
				<title>
					<p>A proposed framework for the description of plant metabolomics experiments and their results</p>
				</title>
				<aug>
					<au>
						<snm>Jenkins</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Hardy</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Beckmann</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Draper</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Smith</snm>
						<fnm>AR</fnm>
					</au>
					<au>
						<snm>Taylor</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Fiehn</snm>
						<fnm>O</fnm>
					</au>
					<au>
						<snm>Goodacre</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Bino</snm>
						<fnm>RJ</fnm>
					</au>
					<au>
						<snm>Hall</snm>
						<fnm>R</fnm>
					</au>
					<etal/>
				</aug>
				<source>Nat Biotechnol</source>
				<pubdate>2004</pubdate>
				<volume>22</volume>
				<fpage>1601</fpage>
				<lpage>1606</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15583675</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B30">
				<title>
					<p>MeMo: a hybrid SQL/XML approach to metabolomic data management for functional genomics</p>
				</title>
				<aug>
					<au>
						<snm>Spasi&#263;</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Dunn</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Velarde</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Tseng</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Jenkins</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Hardy</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Oliver</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Kell</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>BMC Bioinformatics</source>
				<pubdate>2006</pubdate>
				<volume>7</volume>
				<fpage>281</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1522028</pubid>
						<pubid idtype="pmpid" link="fulltext">16753052</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B31">
				<title>
					<p>Towards naming conventions for use in controlled vocabulary and ontology engineering</p>
				</title>
				<aug>
					<au>
						<snm>Schober</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Kusnirczyk</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Lewis</snm>
						<fnm>SE</fnm>
					</au>
					<au>
						<snm>Lomax</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<cnm>members of the MSI PWG</cnm>
					</au>
					<au>
						<snm>Mungall</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Rocca-Serra</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Smith</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Sansone</snm>
						<fnm>S-A</fnm>
					</au>
				</aug>
				<source>ISMB/ECCB Special Interest Group (SIG) Meeting Program Materials, Bio-Ontologies SIG Workshop Vienna, Austria</source>
				<publisher>Vienna, Austria</publisher>
				<pubdate>2007</pubdate>
			</bibl>
			<bibl id="B32">
				<title>
					<p>Term identification in the biomedical literature</p>
				</title>
				<aug>
					<au>
						<snm>Krauthammer</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Nenadic</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<source>Journal of Biomedical Informatics</source>
				<pubdate>2004</pubdate>
				<volume>37</volume>
				<fpage>512</fpage>
				<lpage>526</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15542023</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B33">
				<aug>
					<au>
						<snm>Baeza-Yates</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Ribeiro-Neto</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Modern Information Retrieval</source>
				<publisher>Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.</publisher>
				<pubdate>1999</pubdate>
			</bibl>
			<bibl id="B34">
				<title>
					<p>Information retrieval: an overview of system characteristics</p>
				</title>
				<aug>
					<au>
						<snm>Wiesman</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Hasman</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>van den Herik</snm>
						<fnm>HJ</fnm>
					</au>
				</aug>
				<source>International Journal of Medical Informatics</source>
				<pubdate>1997</pubdate>
				<volume>47</volume>
				<fpage>5</fpage>
				<lpage>26</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">9506386</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B35">
				<title>
					<p>MeSHmap: a text mining tool for MEDLINE</p>
				</title>
				<aug>
					<au>
						<snm>Srinivasan</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Proc AMIA Symp</source>
				<pubdate>2001</pubdate>
				<fpage>642</fpage>
				<lpage>646</lpage>
				<xrefbib>
					<pubid idtype="pmpid">11825264</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B36">
				<title>
					<p>Update on XplorMed: A web server for exploring scientific literature</p>
				</title>
				<aug>
					<au>
						<snm>Perez-Iratxeta</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>P&#233;rez</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Bork</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Andrade</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2003</pubdate>
				<volume>31</volume>
				<fpage>3866</fpage>
				<lpage>3868</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">168945</pubid>
						<pubid idtype="pmpid" link="fulltext">12824439</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B37">
				<title>
					<p>Integrating query of relational and textual data in clinical databases: a case study</p>
				</title>
				<aug>
					<au>
						<snm>Fisk</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Mutalik</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Levin</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Erdos</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Taylor</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Nadkarni</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>J Am Med Inform Assoc</source>
				<pubdate>2003</pubdate>
				<volume>10</volume>
				<fpage>21</fpage>
				<lpage>38</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">150357</pubid>
						<pubid idtype="pmpid" link="fulltext">12509355</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B38">
				<title>
					<p>PubMatrix: a tool for multiplex literature mining</p>
				</title>
				<aug>
					<au>
						<snm>Becker</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Hosack</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Dennis</snm>
						<fnm>G</fnm>
						<suf>Jr</suf>
					</au>
					<au>
						<snm>Lempicki</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Bright</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Cheadle</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Engel</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>BMC Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>4</volume>
				<fpage>61</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">317283</pubid>
						<pubid idtype="pmpid" link="fulltext">14667255</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B39">
				<title>
					<p>Using the biological taxonomy to access biological literature with PathBinderH</p>
				</title>
				<aug>
					<au>
						<snm>Ding</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Viswanathan</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Berleant</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Hughes</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Wurtele</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Ashlock</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Dickerson</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Fulmer</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Schnable</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>21</volume>
				<fpage>2560</fpage>
				<lpage>2562</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15769838</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B40">
				<title>
					<p>MEDLINE</p>
				</title>
				<pubdate>2007</pubdate>
				<note>[<url>http://www.pubmed.gov/</url>].</note>
			</bibl>
			<bibl id="B41">
				<title>
					<p>PMC</p>
				</title>
				<pubdate>2007</pubdate>
				<note>[<url>http://www.pubmedcentral.nih.gov/</url>].</note>
			</bibl>
			<bibl id="B42">
				<title>
					<p>Entrez</p>
				</title>
				<pubdate>2007</pubdate>
				<note>[<url>http://www.ncbi.nlm.nih.gov/Entrez/</url>].</note>
			</bibl>
			<bibl id="B43">
				<title>
					<p>MeSH</p>
				</title>
				<pubdate>2007</pubdate>
				<note>[<url>http://www.nlm.nih.gov/mesh/</url>].</note>
			</bibl>
			<bibl id="B44">
				<title>
					<p>Literature mining for the biologist: from information retrieval to biological discovery</p>
				</title>
				<aug>
					<au>
						<snm>Jensen</snm>
						<fnm>LJ</fnm>
					</au>
					<au>
						<snm>Saric</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Bork</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Nat Rev Genet</source>
				<pubdate>2006</pubdate>
				<volume>7</volume>
				<fpage>119</fpage>
				<lpage>129</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">16418747</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B45">
				<title>
					<p>Characterizing Biomedical Concept Relationships</p>
				</title>
				<aug>
					<au>
						<snm>Revere</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Fuller</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Medical Informatics</source>
				<pubdate>2005</pubdate>
				<fpage>183</fpage>
				<lpage>210</lpage>
			</bibl>
			<bibl id="B46">
				<title>
					<p>Hemoglobin affinity for 2,3-bisphosphoglycerate in solutions and intact erythrocytes: studies using pulsed-field gradient nuclear magnetic resonance and Monte Carlo simulations</p>
				</title>
				<aug>
					<au>
						<snm>Lennon</snm>
						<fnm>AJ</fnm>
					</au>
					<au>
						<snm>Scott</snm>
						<fnm>NR</fnm>
					</au>
					<au>
						<snm>Chapman</snm>
						<fnm>BE</fnm>
					</au>
					<au>
						<snm>Kuchel</snm>
						<fnm>PW</fnm>
					</au>
				</aug>
				<source>Biophys J</source>
				<pubdate>1994</pubdate>
				<volume>67</volume>
				<fpage>2096</fpage>
				<lpage>2109</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1225585</pubid>
						<pubid idtype="pmpid" link="fulltext">7858147</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B47">
				<title>
					<p>Automated microflow NMR: routine analysis of five-microliter samples</p>
				</title>
				<aug>
					<au>
						<snm>Jansma</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Chuan</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Albrecht</snm>
						<fnm>RW</fnm>
					</au>
					<au>
						<snm>Olson</snm>
						<fnm>DL</fnm>
					</au>
					<au>
						<snm>Peck</snm>
						<fnm>TL</fnm>
					</au>
					<au>
						<snm>Geierstanger</snm>
						<fnm>BH</fnm>
					</au>
				</aug>
				<source>Anal Chem</source>
				<pubdate>2005</pubdate>
				<volume>77</volume>
				<fpage>6509</fpage>
				<lpage>6515</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1395504</pubid>
						<pubid idtype="pmpid" link="fulltext">16194121</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B48">
				<title>
					<p>Magnetic resonance imaging, microscopy, and spectroscopy of the central nervous system in experimental animals</p>
				</title>
				<aug>
					<au>
						<snm>Pirko</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Fricke</snm>
						<fnm>ST</fnm>
					</au>
					<au>
						<snm>Johnson</snm>
						<fnm>AJ</fnm>
					</au>
					<au>
						<snm>Rodriguez</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Macura</snm>
						<fnm>SI</fnm>
					</au>
				</aug>
				<source>NeuroRx</source>
				<pubdate>2005</pubdate>
				<volume>2</volume>
				<fpage>250</fpage>
				<lpage>264</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1064990</pubid>
						<pubid idtype="pmpid">15897949</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B49">
				<title>
					<p>PostgreSQL</p>
				</title>
				<pubdate>2007</pubdate>
				<note>[<url>http://www.postgresql.org/</url>].</note>
			</bibl>
			<bibl id="B50">
				<title>
					<p>Taverna / myGrid: aligning a workflow system with the life sciences community</p>
				</title>
				<aug>
					<au>
						<snm>Oinn</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Li</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Kell</snm>
						<fnm>DB</fnm>
					</au>
					<au>
						<snm>Goble</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Goderis</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Greenwood</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Hull</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Stevens</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Turi</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Zhao</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Workflows for e-Science: scientific workflows for grids</source>
				<publisher>Guildford, UK: Springer</publisher>
				<editor>Taylor IJ, Deelman E, Gannon DB, Shields M</editor>
				<pubdate>2007</pubdate>
				<fpage>300</fpage>
				<lpage>319</lpage>
			</bibl>
			<bibl id="B51">
				<title>
					<p>Study and Implementation of Combined Techniques for Automatic Extraction of Terminology</p>
				</title>
				<aug>
					<au>
						<snm>Daille</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>The Balancing Act - Combining Symbolic and Statistical Approaches to Language</source>
				<publisher>MIT Press</publisher>
				<editor>Resnik P, Klavans J</editor>
				<pubdate>1996</pubdate>
				<fpage>49</fpage>
				<lpage>66</lpage>
			</bibl>
			<bibl id="B52">
				<title>
					<p>Term Extraction from Unrestricted Text</p>
				</title>
				<aug>
					<au>
						<snm>Arppe</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>10th Nordic Conference of Computational Linguistics (NODALIDA-95); Helsinki, Finland</source>
				<pubdate>1995</pubdate>
			</bibl>
			<bibl id="B53">
				<title>
					<p>Text Mining at the Term Level</p>
				</title>
				<aug>
					<au>
						<snm>Feldman</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Fresko</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Kinar</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Lindell</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Liphstat</snm>
						<fnm>O</fnm>
					</au>
					<au>
						<snm>Rajman</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Schler</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Zamir</snm>
						<fnm>O</fnm>
					</au>
				</aug>
				<source>Principles of Data Mining and Knowledge Discovery, Second European Symposium, PKDD '98 Nantes, France, Proceedings</source>
				<publisher>Springer-Verlag</publisher>
				<editor>Zytkow J, Quafafou M</editor>
				<pubdate>1998</pubdate>
				<volume>1510</volume>
				<fpage>65</fpage>
				<lpage>73</lpage>
				<note>Lecture Notes in Computer Science</note>
			</bibl>
			<bibl id="B54">
				<title>
					<p>Automatic Term Recognition using Contextual Cues</p>
				</title>
				<aug>
					<au>
						<snm>Frantzi</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Ananiadou</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Proceedings of 3rd DELOS Workshop, Zurich, Switzerland</source>
				<pubdate>1997</pubdate>
			</bibl>
			<bibl id="B55">
				<title>
					<p>ChEBI</p>
				</title>
				<pubdate>2007</pubdate>
				<note>[<url>http://www.ebi.ac.uk/chebi/</url>].</note>
			</bibl>
			<bibl id="B56">
				<title>
					<p>A Methodology for Automatic Term Recognition</p>
				</title>
				<aug>
					<au>
						<snm>Ananiadou</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Proceedings of the 15th International Conference on Computational Linguistics (COLING '94), Kyoto, Japan</source>
				<pubdate>1994</pubdate>
				<fpage>1034</fpage>
				<lpage>1038</lpage>
			</bibl>
			<bibl id="B57">
				<title>
					<p>Mining Terminological Knowledge in Large Biomedical Corpora</p>
				</title>
				<aug>
					<au>
						<snm>Liu</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Friedman</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>Proceedings of the 8th Pacific Symposium on Biocomputing (PSB 2003), Lihue, Hawaii, USA</source>
				<pubdate>2003</pubdate>
				<fpage>415</fpage>
				<lpage>426</lpage>
			</bibl>
			<bibl id="B58">
				<title>
					<p>The C-value/NC-value Domain Independent Method for Multiword Term Extraction</p>
				</title>
				<aug>
					<au>
						<snm>Frantzi</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Ananiadou</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Journal of Natural Language Processing</source>
				<pubdate>1999</pubdate>
				<volume>6</volume>
				<fpage>145</fpage>
				<lpage>180</lpage>
			</bibl>
			<bibl id="B59">
				<title>
					<p>NaCTeM</p>
				</title>
				<pubdate>2007</pubdate>
				<note>[<url>http://www.nactem.ac.uk/</url>].</note>
			</bibl>
			<bibl id="B60">
				<title>
					<p>Exploiting Syntax when Detecting Protein Names in Text</p>
				</title>
				<aug>
					<au>
						<snm>Eriksson</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Franzen</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Olsson</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Asker</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Linden</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Proceedings of Workshop on Natural Language Processing in Biomedical Applications - NLPBA 2002 Nicosia, Cyprus</source>
				<pubdate>2002</pubdate>
			</bibl>
			<bibl id="B61">
				<title>
					<p>Toward Information Extraction: Identifying Protein Names from Biological Papers</p>
				</title>
				<aug>
					<au>
						<snm>Fukuda</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Tsunoda</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Tamura</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Takagi</snm>
						<fnm>T</fnm>
					</au>
				</aug>
				<source>Proceedings of the 3rd Pacific Symposium on Biocomputing (PSB 1998), Hawaii, USA</source>
				<pubdate>1998</pubdate>
				<fpage>705</fpage>
				<lpage>716</lpage>
			</bibl>
			<bibl id="B62">
				<aug>
					<au>
						<snm>Linnaeus</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>Species plantarum</source>
				<publisher>Stockholm</publisher>
				<pubdate>1753</pubdate>
			</bibl>
			<bibl id="B63">
				<title>
					<p>UMLS</p>
				</title>
				<pubdate>2007</pubdate>
				<note>[<url>http://umlsinfo.nlm.nih.gov/</url>].</note>
			</bibl>
			<bibl id="B64">
				<title>
					<p>The Unified Medical Language System (UMLS): integrating biomedical terminology</p>
				</title>
				<aug>
					<au>
						<snm>Bodenreider</snm>
						<fnm>O</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Research</source>
				<pubdate>2004</pubdate>
				<volume>32</volume>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">308795</pubid>
						<pubid idtype="pmpid" link="fulltext">14681409</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B65">
				<title>
					<p>Terminological Acquaintance: The Importance of Contextual Information in Terminology</p>
				</title>
				<aug>
					<au>
						<snm>Maynard</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Ananiadou</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Natural Language Processing - NLP 2000 Second International Conference, Patras, Greece, Proceedings</source>
				<publisher>Springer-Verlag</publisher>
				<editor>Christodoulakis D</editor>
				<pubdate>2000</pubdate>
				<volume>1835</volume>
				<note>Lecture Notes in Computer Science</note>
			</bibl>
			<bibl id="B66">
				<title>
					<p>Explorations in Automatic Thesaurus Discovery</p>
				</title>
				<aug>
					<au>
						<snm>Grefenstette</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<pubdate>1994</pubdate>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">7714149</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B67">
				<title>
					<p>MedEvi</p>
				</title>
				<pubdate>2007</pubdate>
				<note>[<url>http://www.ebi.ac.uk/tc-test/textmining/medevi/</url>].</note>
			</bibl>
			<bibl id="B68">
				<title>
					<p>MedEvi: Retrieving textual evidence of relations between biomedical concepts from Medline</p>
				</title>
				<aug>
					<au>
						<snm>Kim</snm>
						<fnm>JJ</fnm>
					</au>
					<au>
						<snm>Pezik</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Rebholz-Schuhmann</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2008</pubdate>
			</bibl>
			<bibl id="B69">
				<title>
					<p>Automatic Acronym Acquisition and Management within Domain-Specific Texts</p>
				</title>
				<aug>
					<au>
						<snm>Nenadic</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Spasic</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Ananiadou</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Proceedings of 3rd International Conference on Language Resources and Evaluation</source>
				<publisher>Las Palmas, Spain</publisher>
				<pubdate>2002</pubdate>
				<fpage>2155</fpage>
				<lpage>2162</lpage>
			</bibl>
		</refgrp>
	</bm>
</art>
