<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-10-S11-S18</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Proceedings</dochead>
      <bibl>
         <title>
            <p>PathBinder &#8211; text empirics and automatic extraction of biomolecular interactions</p>
         </title>
         <aug>
            <au id="A1"><snm>Zhang</snm><fnm>Lifeng</fnm><insr iid="I1"/><email>zlfpeak@iastate.edu</email></au>
            <au ca="yes" id="A2"><snm>Berleant</snm><fnm>Daniel</fnm><insr iid="I2"/><email>berleant@gmail.com</email></au>
            <au id="A3"><snm>Ding</snm><fnm>Jing</fnm><insr iid="I3"/><email>jing.ding@osumc.edu</email></au>
            <au id="A4"><snm>Cao</snm><fnm>Tuan</fnm><insr iid="I1"/><email>antuan@iastate.edu</email></au>
            <au id="A5"><snm>Syrkin Wurtele</snm><fnm>Eve</fnm><insr iid="I1"/><email>mash@iastate.edu</email></au>
         </aug>
         <insg>
            <ins id="I1"><p>Iowa State University, Ames, Iowa, USA</p></ins>
            <ins id="I2"><p>University of Arkansas at Little Rock, Little Rock, Arkansas, USA</p></ins>
            <ins id="I3"><p>Ohio State University Medical Center, Columbus, Ohio, USA</p></ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <supplement>
            <title>
               <p>Proceedings of the Sixth Annual MCBIOS Conference. Transformational Bioinformatics: Delivering Value from Genomes</p>
            </title>
            <editor>Jonathan D Wren (Senior Editor), Yuriy Gusev, Raphael D Isokpehi, Dan Berleant, Ulisses Braga-Neto, Dawn Wilkins and Susan Bridges</editor>
            <note>Proceedings</note>
            <url>http://www.biomedcentral.com/content/pdf/1471-2105-10-S11-info.pdf</url>
         </supplement>
         <conference>
            <title>
               <p>Sixth Annual MCBIOS Conference. Transformational Bioinformatics: Delivering Value from Genomes</p>
            </title>
            <location>Starkville, MS, USA</location>
            <date-range>20&#8211;21 February 2009</date-range>
            <url>http://www.mcbios.org/</url>
         </conference>
         <issn>1471-2105</issn>
         <pubdate>2009</pubdate>
         <volume>10</volume>
         <issue>Suppl 11</issue>
         <fpage>S18</fpage>
         <url>http://www.biomedcentral.com/1471-2105/10/S11/S18</url>
         <xrefbib><pubidlist><pubid idtype="pmpid">19811683</pubid><pubid idtype="doi">10.1186/1471-2105-10-S11-S18</pubid></pubidlist></xrefbib>
      </bibl>
      <history><pub><date><day>8</day><month>10</month><year>2009</year></date></pub></history>
      <cpyrt><year>2009</year><collab>Zhang et al; licensee BioMed Central Ltd.</collab><note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Motivation</p>
               </st>
               <p>The increasingly large amount of free, online biological text makes automatic interaction extraction correspondingly attractive. Machine learning is one strategy that works by uncovering and using useful properties that are implicit in the text. However, these properties are usually not reported explicitly in the literature. By investigating specific properties of biological text passages in this paper, we aim to facilitate an alternative strategy, the use of <it>text empirics</it>, to support mining of biomedical texts for biomolecular interactions. We report on our application of this approach and present empirical findings about an important class of passages, which others may wish to use in their own work.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We manually analyzed syntactic and semantic properties of sentences likely to describe interactions between biomolecules. The resulting empirical data were used to design an algorithm for the PathBinder system to extract biomolecular interactions from texts. PathBinder searches PubMed for sentences describing interactions between two given biomolecules. PathBinder then uses probabilistic methods to combine evidence from multiple relevant sentences in PubMed to assess the relative likelihood of interaction between two arbitrary biomolecules. A biomolecular interaction network was constructed based on those likelihoods.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>The text empirics approach used here supports computationally friendly, performance competitive, automatic extraction of biomolecular interactions from texts.</p>
            </sec>
            <sec>
               <st>
                  <p>Availability</p>
               </st>
               <p><url>http://www.metnetdb.org/pathbinder</url>.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Introduction</p>
         </st>
         <p>Increasingly large collections of gene sequence and expression data continue to appear. Biomolecular interaction databases are one kind of collection and are useful for such tasks as understanding biological processes <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, extrapolating knowledge about organisms to make predictions about other organisms as in BioCyc <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>, and serving as components of larger resources like MetNet <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. A database can be populated through expert curation, like MIPS <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> and KEGG <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. In particular, extracting interactions from literature by expert curation has attracted considerable attention. Efforts include the Database of Interacting Proteins <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, BIND <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> and BioCyc. Manual methods are costly, however, so work has increasingly focused on automatic interaction extraction from scientific literature based on text mining technology. Extracted interactions can help researchers use knowledge buried in the literature and can even be used to construct interaction databases automatically.</p>
         <p>Analysis of passages containing biological term co-occurrences or tri-occurrences enables the extraction of relations among biological entities. There are different methods of automatically extracting interactions between pairs of biomolecules from the literature, including readily implemented co-occurrence based methods <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>, corpus-based statistical methods, template matching methods, and natural language processing <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>.</p>
         <sec>
            <st>
               <p>Natural language processing</p>
            </st>
            <p>Santos et al. <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, Natarajan et al. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, Fundel et al. <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> and Rinaldi et al. <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> used full parsing to verify matches to predefined rules about descriptions of relations. Miyao et al. <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> used different natural language parsing tools to extract interactions and compared the results. Giles and Wren <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> applied full parsing in conjunction with a support vector machine (SVM) to extract the directions of interactions, since in a pair of interacting entities one tends to be the cause of an effect on the other. However, full parsing is computationally expensive and relatively slow, is subject to ambiguous parse results, and is only a partial solution to the natural language processing (NLP) problem, which also includes semantic and other issues.</p>
            <p>Yakushiji et al. <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> built a term recognizer to identify multi-word terms and a shallow parser to reduce lexical ambiguity. Then, they applied full parses over the preprocessed sentences. From the full parses, domain-specific knowledge including a set of target verbs and mapping rules provided by domain specialists was used to construct frame representations of interactions. Another example, GENIES <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>, extracted semantic patterns by observing typical semantic and syntactic co-occurrence patterns in a sample corpus using semantic relationship categories and biological objects. It fully parsed sentences and outputted a frame structure when pattern matching was successful. GIS <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> and GIFT <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> also matched sentences to predefined interaction description patterns to identify the interactions.</p>
            <p>There are different degrees of NLP, of course, and one way to make NLP more practical with large amounts of text is to use shallower analyses. Chilibot <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> takes this approach, using POS tagging followed by shallow parsing to extract interactions from MEDLINE and support a search engine for interactions in MEDLINE.</p>
         </sec>
         <sec>
            <st>
               <p>Template matching</p>
            </st>
            <p>Template matching approaches form another, typically more computationally tractable, strategy. A sentence, abstract or parsed result is matched against predefined <it>patterns</it> associated with interactions <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. A pattern is a partial specification of words and locations in a passage, such as &lt;biomolecule1 verb "the" verb "of" biomolecule2 "into"&gt;. The template term 'biomolecule' in such a pattern might match, for example, any molecule synthesized by living organisms. The matching process can involve a simple match using shallow parsing to identify terms meeting category or other constraints, or a complicated full parsing that analyzes the syntactic structure of the passage before matching against parse result templates.</p>
            <p>Although pattern-matching can yield relatively high precision because patterns may be derived from existing sentences describing interactions, recall may be limited because it is not possible to manually describe all possible patterns of biomolecular interaction descriptions <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. Some interaction descriptions will therefore not match the manually derived patterns, and the corresponding interactions will not be extracted by the template approach. For example, MedScan <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> obtained a recall of 21% with relatively restrictive templates, while Koike et al. <abbrgrp><abbr bid="B24">24</abbr></abbrgrp> achieved 54% with less constrained, more inclusive templates that assumed some syntactic analysis.</p>
         </sec>
         <sec>
            <st>
               <p>Term occurrence</p>
            </st>
            <p>Term occurrence based approaches can avoid the recall issue just noted. Marcotte et al. <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> identified discriminating words from a training set of 260 MEDLINE abstracts describing yeast protein interactions, using differences in the frequencies with which those words occur. They used the probabilities of each word's appearance in documents describing interactions to train Na&#239;ve Bayesian classifiers to score a document and judge whether the document describes an interaction.</p>
            <p>A direct approach to identifying an interaction is to find co-occurrences of two biomolecules in the literature. Dragon Plant Biology Explorer (DPBE) <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> parses documents provided by users using this type of co-occurrence criterion and displays the results in, among other forms, a network of interactions. Albert et al. <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> applied co-occurrence extraction to create a protein interaction database for nuclear receptors, then post-processed this database by manual curation to delete false interactions. PDQ Wizard <abbrgrp><abbr bid="B28">28</abbr></abbrgrp> and Hofmann and Schomburg <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> also used co-occurrences and a subsequent filtering stage to extract interactions between biomolecules.</p>
            <p>The iHOP system (e.g. <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>) converts MEDLINE into a navigable hyperlinked resource by extracting sentences from it that contain biomolecules and annotating them with hyperlinks from the biomolecular and interaction terms to related sentences. A Web-based interface provides flexible access to this resource. This and similar systems extract sentences that appear to provide evidence for biomolecular interactions from the literature, but do not analyze this evidence further for probabilities of interaction based on empirical investigations of sets of related sentences. This motivates the current work, which fills that gap.</p>
            <p>Wren and Garner <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> assigned a weight 1 - <it>r</it><sup><it>n</it></sup> to the potential relationship between co-occurring terms, where <it>n</it> is the number of times they co-occur and <it>r</it> takes one value when the co-occurrence is within a sentence and another value, 0.58, when the co-occurrence is within an abstract but not within the same sentence. Ding et al. <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> also reported 0.58 for abstracts, but found a value for sentences different from Wren and Garner's.</p>
            <p>Because co-occurrence based methods are relatively simple, they cannot, in theory, match the potential performance of methods that incorporate information obtained by additional computation such as sentence parsing. However, they are computationally simpler and faster. NLP can get more out of text than co-occurrence based methods, while facts derived from empirical analyses can provide heuristic guidance to NLP-based methods, enhancing computational speed and helping to resolve ambiguities that arise. Thus, automated text analysis using a hybrid of empirical facts about texts and deeper NLP-based analyses is expected to do better than either method alone. As an example, Zhou and He <abbrgrp><abbr bid="B32">32</abbr></abbrgrp> used a machine learning method to estimate probabilities that help parse a document.</p>
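            <p>As a minimal sketch of the weighting just described, the following Python function evaluates 1 - r^n. The function name and the choice of example values of n are illustrative assumptions. Only the abstract-level value of r, 0.58, is taken from the text above; the sentence-level value is not given here and is therefore left out.</p>

```python
# Abstract-level value of r as stated in the text above.
R_ABSTRACT = 0.58


def cooccurrence_weight(n, r):
    """Return 1 - r**n for n co-occurrences and per-co-occurrence factor r.

    The function name and argument checks are illustrative assumptions.
    """
    if 0 > n:
        raise ValueError("n must be non-negative")
    if not (r >= 0.0 and 1.0 >= r):
        raise ValueError("r must lie in the interval [0, 1]")
    return 1.0 - r ** n


# The weight grows toward 1 as the number of co-occurrences increases.
for n in (1, 2, 5):
    print(n, round(cooccurrence_weight(n, R_ABSTRACT), 3))
```

            <p>With r = 0.58, a single abstract-level co-occurrence receives a weight of 0.42, and each additional co-occurrence moves the weight closer to 1.</p>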
         </sec>
      </sec>
      <sec>
         <st>
            <p>Methods and analysis</p>
         </st>
         <p>This paper seeks to advance understanding about the properties of biomedical texts and to apply this knowledge to automatic identification of biomolecular interactions. Properties of texts were identified empirically (i.e. by examining actual sentences) and used to evaluate the probability that a given sentence describes an interaction between a specific biomolecule pair. A major issue in evaluating such extracted interactions is how to specify a good ranking policy. Such a policy would facilitate assessment of putative interactions.</p>
         <p>By <it>empirical</it> we refer to knowledge about text properties derived from "experience or observation" <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>. Our observations are derived by manually examining corpora, and tabulating and analyzing the passages therein. This is distinguished from other common approaches to extracting knowledge from text such as Natural Language Processing, which deduces knowledge from passages based on syntactic and semantic rules, and Machine Learning (ML). Machine learning offers a corpus-based, statistical approach like the text empirics approach, but differs in that with ML, text properties are found automatically by a computer. This has the following shortcomings compared to using text empirics.</p>
         <p indent="1">1) Classification rule sets (typically arranged in decision trees) derived by ML usually include uninformative rules mixed in with the useful ones. As a result,</p>
         <p indent="1">2) the rules derived by ML are typically omitted from publications, in favor of conclusions about the parameters of the ML process itself. As a result,</p>
         <p indent="1">3) the outcome of ML can be harder to apply than the results of an empirical text analysis, since ML-derived knowledge tends to be less readily available in a directly usable form, while text empirics-derived results must necessarily be disseminated in an explicit form readily used by software designers.</p>
         <p>Our software, PathBinder, extracts ranked interactions and provides query functions. Users can search for sentences describing interactions in MEDLINE by providing a pair of biomolecules. The entire MEDLINE collection is searched for these sentences, and the returned sentences can be ranked by their calculated likelihood of describing an interaction between the biomolecules. PathBinder can combine the evidence from multiple sentences to assess the relative likelihood of an interaction between two given biomolecules, and construct a biomolecular interaction network from MEDLINE automatically.</p>
         <p>We chose MEDLINE as the repository to analyze. Much text mining research uses the MEDLINE collection <url>http://www.nlm.nih.gov/pubs/factsheets/medline.html</url>. MEDLINE contains approximately 18 million citation records for articles in the life sciences. A query interface, PubMed <url>http://www.ncbi.nlm.nih.gov/pubmed/</url>, enables users to search the records, and the Entrez Programming Utilities <url>http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html</url> let developers write software to access these data. While these records may not completely reflect the idea that an article tries to communicate, they usually contain the abstract and thus the most important information that the authors wish to convey. Using MEDLINE, Ding et al. <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> showed that sentences are useful text units for automatically extracting interactions. Therefore we collected sentences containing biomolecule co-occurrences to analyze as the basis of this work.</p>
         <p>To extract an interaction we require a sentence to contain two biomolecules of interest. However, such a sentence does not necessarily describe an interaction. For example, the sentence</p>
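         <p>To illustrate programmatic access of the kind mentioned above, the following Python sketch builds a PubMed search URL for a pair of biomolecule names using the ESearch service of the Entrez Programming Utilities. It only constructs the URL and sends no request. The function name, the quoting of each name, and the default result limit are illustrative assumptions, and the sketch is not a description of how PathBinder itself queries MEDLINE.</p>

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query_url(biomolecule_a, biomolecule_b, retmax=20):
    """Build an ESearch URL for records mentioning both biomolecule names.

    The helper name and the default retmax are illustrative assumptions.
    """
    # Quote each name so multi-word names are searched as phrases.
    term = f'"{biomolecule_a}" AND "{biomolecule_b}"'
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


print(build_pubmed_query_url("glucose", "starch"))
```

         <p>The response to such a request lists PubMed identifiers, whose records would then have to be fetched and split into sentences before any co-occurrence analysis.</p>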
         <p indent="1">"Both A and B can bind to C."</p>
         <p>does not describe an interaction between A and B, even though it describes interactions between A and C, and between B and C. Our hypothesis is that we can find properties of sentences from the MEDLINE collection that can support automatic interaction extraction. The first goal is therefore to advance understanding of relevant sentence properties. The second and related goal is to better understand properties of interaction-indicating terms (IITs). The third goal is to use results of the first and second goals to predict whether a sentence describes an interaction. The fourth goal is to scale up by generating and evaluating a database of biomolecular interactions.</p>
         <p>By analyzing typical passages from MEDLINE it is possible to focus on those goals by empirically investigating certain questions such as the following.</p>
         <p indent="1">1) How can the presence of IITs (interaction-indicating terms) be used to infer the type of interaction between two specific biomolecules?</p>
         <p indent="1">2) If <it>p</it><sub>phrase</sub> is the likelihood that biomolecules co-occurring in the same phrase are described by the phrase as interacting, how does <it>p</it><sub>phrase</sub> differ from <it>p</it><sub>sentence</sub>, the analogous likelihood when they are in different phrases of the same sentence?</p>
         <p indent="1">3) How does the order of appearance of three important words, two biomolecules and an IIT, in a phrase or sentence affect the probability that the biomolecules are described as interacting?</p>
         <p indent="1">4) How do properties of IITs occurring near two biomolecule names, such as their identities, inflections, roots, and semantic categories, affect the probability that they help describe an interaction between the biomolecules?</p>
         <p>For questions 1&#8211;4, we collected 303 MEDLINE abstracts and extracted 664 sentences, based on ten queries to PubMed. Each query consisted of two biomolecule names known to interact, and was elicited from biologists to be typical of the kinds of queries biologists are likely to make. Some further details about this corpus appear in Ding et al. (2002 <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>), and a list of the abstracts in the corpus may be downloaded from <url>http://ifsc.ualr.edu/jdberleant/IEPA/IEPA.htm</url>. Each sentence was manually analyzed with respect to the properties related to questions 1&#8211;4 above and tagged as to whether or not it described an interaction between the two query biomolecules.</p>
         <p>To support the accurate description of passage properties for interaction extraction, we use the definitions shown in Table <tblr tid="T1">1</tblr>.</p>
         <tbl id="T1"><title><p>Table 1</p></title><caption><p>Definitions used in text analyses.</p></caption><tblbdy cols="2">
      <r>
         <c ca="center">
            <p>
               <b>Term</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Definition</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>
               <it>sentence</it>
            </p>
         </c>
         <c ca="left">
            <p>Either an article title, or a word sequence beginning with a capital letter and ending with a period.</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>
               <it>phrase</it>
            </p>
         </c>
         <c ca="left">
            <p>A word sequence that occurs inside a <it>sentence</it>, and begins and ends with: , | ; | : | . | &lt;the beginning of the sentence&gt; | &lt;whitespace&gt;-&lt;whitespace&gt; | ( | ).</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>
               <it>IIT</it>
            </p>
         </c>
         <c ca="left">
            <p><it>Interaction-indicating term. </it>A word, often a verb, that can describe an interaction between two biomolecules.</p>
         </c>
      </r>
   </tblbdy></tbl>
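         <p>The phrase definition in Table 1 can be approximated in a few lines of Python. The sketch below splits a sentence at the listed delimiters and tests whether two names fall within the same phrase. The function names, the example sentence, and the use of case-insensitive substring matching are illustrative assumptions; in particular, a period inside a number or abbreviation would also be treated as a boundary, which a production implementation would need to handle.</p>

```python
import re

# Delimiters from Table 1: comma, semicolon, colon, period, parentheses,
# and a hyphen surrounded by whitespace. A hyphen inside a name such as
# "glucose-6-p" is not a boundary.
PHRASE_DELIMITERS = re.compile(r"\s-\s|[,;:.()]")


def split_into_phrases(sentence):
    """Split a sentence into phrases at the Table 1 delimiters."""
    parts = PHRASE_DELIMITERS.split(sentence)
    return [part.strip() for part in parts if part.strip()]


def cooccur_in_phrase(sentence, name_a, name_b):
    """Return True if both names occur within a single phrase (naive substring match)."""
    a, b = name_a.lower(), name_b.lower()
    return any(a in phrase.lower() and b in phrase.lower()
               for phrase in split_into_phrases(sentence))


# Made-up example sentence for illustration only.
example = "Glucose is converted to starch in leaves; pyruvate is unaffected."
print(split_into_phrases(example))
print(cooccur_in_phrase(example, "glucose", "starch"))
print(cooccur_in_phrase(example, "glucose", "pyruvate"))
```

         <p>In this example, glucose and starch share a phrase, whereas glucose and pyruvate co-occur only at the sentence level.</p>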
         <p>We have manually created a list of IITs based on reading several hundred MEDLINE abstracts. For example, <it>activate</it>, <it>activation</it>, etc., can describe an interaction between two biomolecules, as in "the activation of A by B."</p>
         <p>The results for questions 1 and 2 (Table <tblr tid="T2">2</tblr>) indicate that the probability an interaction is described when two biomolecules co-occur in a phrase is higher than when they are in different phrases in a sentence (67% vs. 33%). Secondly, if an IIT appears with the two biomolecules, the probability that an interaction is described is higher than without an IIT present (55% vs. 7.9% for sentences and 71% vs. 0% for phrases). These two comparisons are statistically significant (p &lt; 0.001, &#967;<sup>2</sup> test).</p>
         <tbl id="T2"><title><p>Table 2</p></title><caption><p>Biomolecule co-occurrences in sentences and phrases, with and without IITs.</p></caption><tblbdy cols="3">
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>
               <b># (%) that describe the interaction</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Total number</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="3">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <b>Sentences where two biomolecules tri-occur with at least one IIT</b>
            </p>
         </c>
         <c ca="center">
            <p>331 (55%)</p>
         </c>
         <c ca="right">
            <p>606</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <b>Sentences where two biomolecules co-occur without any IIT</b>
            </p>
         </c>
         <c ca="center">
            <p>3 (7.9%)</p>
         </c>
         <c ca="right">
            <p>38</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <b>All sentences where two biomolecules co-occur</b>
            </p>
         </c>
         <c ca="center">
            <p>334 (52%)</p>
         </c>
         <c ca="right">
            <p>644</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <b>Phrases where two biomolecules tri-occur with at least one IIT</b>
            </p>
         </c>
         <c ca="center">
            <p>236 (71%)</p>
         </c>
         <c ca="right">
            <p>334</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <b>Phrases where two biomolecules co-occur without any IIT</b>
            </p>
         </c>
         <c ca="center">
            <p>0 (0%)</p>
         </c>
         <c ca="right">
            <p>17</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <b>All phrases where two biomolecules co-occur</b>
            </p>
         </c>
         <c ca="center">
            <p>236 (67%)</p>
         </c>
         <c ca="right">
            <p>351</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <b>Sentence co-occurrences not in phrases</b>
            </p>
         </c>
         <c ca="center">
            <p>98 (33%)</p>
         </c>
         <c ca="right">
            <p>293</p>
         </c>
      </r>
   </tblbdy></tbl>
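         <p>The significance of the Table 2 comparisons can be checked with a short calculation. The Python sketch below computes the Pearson chi-square statistic for 2 x 2 tables built from the counts in Table 2 and compares it with 10.828, the critical value for p = 0.001 with one degree of freedom. The function name and the grouping of counts into tables are illustrative assumptions, and no continuity correction is applied, since the text does not say which variant of the test was used.</p>

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Critical value of the chi-square distribution for p = 0.001, 1 degree of freedom.
CRITICAL_P001_DF1 = 10.828

# Each tuple is (describing, not describing) for group 1, then for group 2,
# using the counts and totals given in Table 2.
comparisons = {
    "sentences: with IIT vs. without IIT": (331, 606 - 331, 3, 38 - 3),
    "phrases: with IIT vs. without IIT": (236, 334 - 236, 0, 17 - 0),
    "same phrase vs. different phrases": (236, 351 - 236, 98, 293 - 98),
}

for label, cells in comparisons.items():
    statistic = chi_square_2x2(*cells)
    significant = statistic > CRITICAL_P001_DF1
    print(f"{label}: chi-square = {statistic:.1f}, exceeds critical value: {significant}")
```

         <p>Under these assumptions, all three statistics exceed the critical value, which is consistent with the reported significance level.</p>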
         <p>For question (3), we investigated how the presence of an IIT between the two biomolecules compares with the presence of an IIT elsewhere in the sentence. The results are shown in Table <tblr tid="T3">3</tblr>.</p>
         <tbl id="T3"><title><p>Table 3</p></title><caption><p>Percentages of sentences and phrases describing interactions (i.e., precisions), by IIT location.</p></caption><tblbdy cols="4">
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>
               <b>IIT intervening</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>IIT elsewhere in sentence</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>IIT in either place</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="4">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <b>Phrases in which two biomolecules co-occur</b>
            </p>
         </c>
         <c ca="center">
            <p>63%</p>
         </c>
         <c ca="center">
            <p>24%</p>
         </c>
         <c ca="center">
            <p>45%</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <b>Sentence co-occurrences that are not also phrase co-occurrences</b>
            </p>
         </c>
         <c ca="center">
            <p>30%</p>
         </c>
         <c ca="center">
            <p>9.1%</p>
         </c>
         <c ca="center">
            <p>21%</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <b>Both phrase and sentence co-occurrences</b>
            </p>
         </c>
         <c ca="center">
            <p>48%</p>
         </c>
         <c ca="center">
            <p>17%</p>
         </c>
         <c ca="center">
            <p>34%</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <b>Percent of interaction descriptions</b>
            </p>
         </c>
         <c ca="center">
            <p>77%</p>
         </c>
         <c ca="center">
            <p>23%</p>
         </c>
         <c ca="center">
            <p>100%</p>
         </c>
      </r>
   </tblbdy></tbl>
         <p>Table <tblr tid="T3">3</tblr> shows that the presence of an IIT intervening between the two biomolecule names is associated with a relatively high likelihood that an interaction is described. Correspondingly, of the interaction descriptions in which one or more IITs were present, most (77%) had an IIT between the biomolecule names.</p>
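         <p>The distinction drawn in Table 3 between an intervening IIT and an IIT elsewhere can be sketched as a simple position test on a tokenized sentence. In the Python example below, the function name, the tiny IIT list, the whitespace tokenization, and the two example sentences are all illustrative assumptions. The sketch uses only the first occurrence of each biomolecule name and assumes that both names are present as single tokens.</p>

```python
def iit_position(tokens, biomolecule_a, biomolecule_b, iits):
    """Classify a tokenized sentence as 'intervening', 'elsewhere', or 'none'.

    'intervening' means at least one IIT lies between the first occurrences
    of the two biomolecule names; 'elsewhere' means IITs occur only outside
    that span; 'none' means no IIT is present.
    """
    lowered = [token.lower() for token in tokens]
    first = lowered.index(biomolecule_a.lower())
    second = lowered.index(biomolecule_b.lower())
    start, end = sorted((first, second))
    iit_indices = [k for k, token in enumerate(lowered) if token in iits]
    if not iit_indices:
        return "none"
    if any(k > start and end > k for k in iit_indices):
        return "intervening"
    return "elsewhere"


# Tiny illustrative IIT list, not the list compiled for this study.
IITS = {"activate", "activates", "activation", "bind", "binds"}

print(iit_position("ATP activates myosin".split(), "ATP", "myosin", IITS))
print(iit_position("Activation of ATP and myosin was observed".split(), "ATP", "myosin", IITS))
```

         <p>The first sentence is classified as intervening and the second as elsewhere. A label of this kind could then be used to look up the corresponding precision in Table 3.</p>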
         <p>For question (4), we collected a new set of 320 sentences from the results of 10 queries to PubMed. The queries were picked by biologists to represent typical interests. In addition, these 320 sentences were required to contain at least one IIT, thus permitting us to analyze IIT properties. The queries were <it>nitrite &amp; xanthine</it>, <it>pyruvate dehydrogenase &amp; phosphofructokinase</it>, <it>indole acetic acid &amp; starch</it>, <it>glucose &amp; starch</it>, <it>glucose-6-p &amp; starch</it>, <it>carotenoid &amp; IPP</it>, <it>cre &amp; cytokinin</it>, <it>acetyl-CoA &amp; leucine</it>, <it>glucose &amp; pyruvate</it>, and <it>ATP &amp; myosin</it>.</p>
         <p>Syntactic and semantic categories of the IITs in each sentence were recorded along with whether an interaction was described between the pair of biomolecules specified by the query. From these data, we investigated the possibility that IIT <it>form</it> (noun, adjective, adverb, present, present continuous and past/perfect) and <it>semantic category</it> (association, modification, negative regulation, positive regulation, transportation, transcription, create, and vague) can be used as evidence for mining interactions from text. 'Vague' was used as the category when an IIT could not be clearly placed in one of the other categories. The past and perfect forms of IITs are sometimes the same, and the frequency of the perfect form is low, so we did not distinguish between them.</p>
         <p>The noun form and the 'modification' category appeared more often than other forms and categories in sentences describing interactions. However, this form and this category also appeared in more sentences overall than the others. More details appear in Tables <tblr tid="T4">4</tblr> and <tblr tid="T5">5</tblr>, which give the percentages of sentences and phrases describing interactions between two given biomolecules, broken out by IIT forms and categories. Note that some IITs have the same spelling for both the noun and present tense forms. We can differentiate them manually, but using those results in automatic methods would require parsing at least to the extent of POS tagging.</p>
         <tbl id="T4"><title><p>Table 4</p></title><caption><p>Data on likelihoods that sentences describe interactions when they contain biomolecule co-occurrences that are not in the same phrase.</p></caption><tblbdy cols="3">
      <r>
         <c ca="center">
            <p>
               <b>Forms</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b># (%) of sentences describing interactions</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Total sentences</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="3">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Noun</p>
         </c>
         <c ca="right">
            <p>141 (59%)</p>
         </c>
         <c ca="right">
            <p>237</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Adjective</p>
         </c>
         <c ca="right">
            <p>9 (45%)</p>
         </c>
         <c ca="right">
            <p>20</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Present</p>
         </c>
         <c ca="right">
            <p>50 (66%)</p>
         </c>
         <c ca="right">
            <p>76</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>-ing</p>
         </c>
         <c ca="right">
            <p>35 (51%)</p>
         </c>
         <c ca="right">
            <p>69</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Past/Perfect</p>
         </c>
         <c ca="right">
            <p>77 (55%)</p>
         </c>
         <c ca="right">
            <p>141</p>
         </c>
      </r>
      <r>
         <c ca="left" cspan="3">
            <p>
               <b>Categories</b>
            </p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Association</p>
         </c>
         <c ca="right">
            <p>60 (67%)</p>
         </c>
         <c ca="right">
            <p>89</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Modification</p>
         </c>
         <c ca="right">
            <p>80 (66%)</p>
         </c>
         <c ca="right">
            <p>121</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Negative regulation</p>
         </c>
         <c ca="right">
            <p>33 (39%)</p>
         </c>
         <c ca="right">
            <p>84</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Positive regulation</p>
         </c>
         <c ca="right">
            <p>47 (42%)</p>
         </c>
         <c ca="right">
            <p>112</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Transportation</p>
         </c>
         <c ca="right">
            <p>14 (67%)</p>
         </c>
         <c ca="right">
            <p>21</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Transcription</p>
         </c>
         <c ca="right">
            <p>5 (71%)</p>
         </c>
         <c ca="right">
            <p>7</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Create</p>
         </c>
         <c ca="right">
            <p>63 (66%)</p>
         </c>
         <c ca="right">
            <p>96</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Vague</p>
         </c>
         <c ca="right">
            <p>41 (54%)</p>
         </c>
         <c ca="right">
            <p>76</p>
         </c>
      </r>
   </tblbdy></tbl>
         <p>Comparing the 13 rows in Table <tblr tid="T4">4</tblr> and the corresponding rows in Table <tblr tid="T5">5</tblr>, a phrase containing a biomolecule pair has a higher probability of describing an interaction than a sentence containing the pair not within a single phrase in that sentence (<it>p </it>&lt; 0.005, <it>t </it>test on the 13 <it>z </it>values).</p>
         <tbl id="T5"><title><p>Table 5</p></title><caption><p>Data on likelihoods that phrases containing biomolecule co-occurrences describe interactions, by interaction-indicating term form and category.</p></caption><tblbdy cols="3">
      <r>
         <c ca="center">
            <p>
               <b>Forms</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b># (%) phrases describing interactions</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Total phrases</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="3">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Noun</p>
         </c>
         <c ca="right">
            <p>97 (66%)</p>
         </c>
         <c ca="right">
            <p>148</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Adjective</p>
         </c>
         <c ca="right">
            <p>3 (43%)</p>
         </c>
         <c ca="right">
            <p>7</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Present</p>
         </c>
         <c ca="right">
            <p>31 (74%)</p>
         </c>
         <c ca="right">
            <p>42</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>-ing</p>
         </c>
         <c ca="right">
            <p>16 (55%)</p>
         </c>
         <c ca="right">
            <p>29</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Past/Perfect</p>
         </c>
         <c ca="right">
            <p>56 (65%)</p>
         </c>
         <c ca="right">
            <p>86</p>
         </c>
      </r>
      <r>
         <c ca="left" cspan="3">
            <p>
               <b>Categories</b>
            </p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Association</p>
         </c>
         <c ca="right">
            <p>41 (75%)</p>
         </c>
         <c ca="right">
            <p>55</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Modification</p>
         </c>
         <c ca="right">
            <p>60 (78%)</p>
         </c>
         <c ca="right">
            <p>77</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Negative regulation</p>
         </c>
         <c ca="right">
            <p>24 (49%)</p>
         </c>
         <c ca="right">
            <p>49</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Positive regulation</p>
         </c>
         <c ca="right">
            <p>30 (52%)</p>
         </c>
         <c ca="right">
            <p>58</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Transportation</p>
         </c>
         <c ca="right">
            <p>7 (54%)</p>
         </c>
         <c ca="right">
            <p>13</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Transcription</p>
         </c>
         <c ca="right">
            <p>2 (100%)</p>
         </c>
         <c ca="right">
            <p>2</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Create</p>
         </c>
         <c ca="right">
            <p>37 (73%)</p>
         </c>
         <c ca="right">
            <p>51</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Vague</p>
         </c>
         <c ca="right">
            <p>31 (65%)</p>
         </c>
         <c ca="right">
            <p>48</p>
         </c>
      </r>
   </tblbdy></tbl>
         <p>PathBinder combines the evidence provided by various attributes of a sentence by multiplying odds for each attribute to calculate the overall probability that the sentence describes the putative interaction (e.g., Manning et al. 2008, sections 11.1, 11.3 <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>; Davis 1990, pp. 128-130 <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>). The formula used (Dickerson et al. 2005, section 2.3.3 <abbrgrp><abbr bid="B36">36</abbr><abbr bid="B37">37</abbr></abbrgrp>) is <it>O</it>(<it>h</it>|<it>f</it><sub>1</sub>,..., <it>f</it><sub><it>n</it></sub>) = <it>O</it>(<it>h|f</it><sub>1</sub>)<it>O</it>(<it>h|f</it><sub>2</sub>)...<it>O</it>(<it>h|f</it><sub><it>n</it></sub>)/<it>O</it>(<it>h</it>)<sup><it>n</it>-1</sup>, which expresses the odds of hypothesis <it>h </it>(in this case that a given passage describes an interaction between given biomolecules) given <it>n </it>items of evidence in terms of a default odds <it>O</it>(<it>h</it>) modeling the entire corpus, and <it>O</it>(<it>h|f</it><sub><it>k</it></sub>), <it>k </it>= 1,..., <it>n</it>, which are the odds of the hypothesis given evidence item (in this case, sentence feature or attribute) <it>k</it>. Odds convert to probability by <it>p </it>= odds/(1+odds); for example, the odds of flipping heads rather than tails are H:T = 1:1 = 1, so <it>p </it>= 1/(1+1) = 0.5, as expected.</p>
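         <p>The odds-combination formula above can be illustrated with a minimal Python sketch. The function names and the numeric values in the usage note are hypothetical, chosen only for illustration; they are not taken from PathBinder's implementation or data.</p>

```python
def probability_to_odds(p):
    """Convert a probability p (with p below 1) to odds p / (1 - p)."""
    return p / (1.0 - p)


def odds_to_probability(odds):
    """Convert odds back to a probability: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)


def combine_odds(prior_odds, feature_odds):
    """Combine per-feature odds O(h|f_k) with default odds O(h).

    Implements O(h|f_1..f_n) = O(h|f_1) * ... * O(h|f_n) / O(h)**(n - 1).
    With an empty feature list this returns the default odds unchanged.
    """
    product = 1.0
    for odds in feature_odds:
        product *= odds
    return product / prior_odds ** (len(feature_odds) - 1)


# Illustrative (hypothetical) values: default odds 1, two features with odds 2 and 3.
example_odds = combine_odds(1.0, [2.0, 3.0])
example_probability = odds_to_probability(example_odds)
```

         <p>With these illustrative inputs the combined odds are 6, which corresponds to a probability of 6/7. The coin example in the text is recovered by converting odds of 1 to a probability of 0.5.</p>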
         <p>Calculating probabilities sentence by sentence permits ranking sentences based on those probability scores. However, when the goal is to obtain the overall probability of an interaction, we must also combine the evidence provided by multiple sentences containing the same biomolecule co-occurrence. This is explained next.</p>
         <sec>
            <st>
               <p>Combining evidence from multiple passages</p>
            </st>
            <p>A sentence can be given a likelihood of describing an interaction based on its containing a co-occurrence, whether in a phrase, or in the sentence but across phrases. Multiple sentences containing the same co-occurrence often exist in MEDLINE, so to extract interactions from MEDLINE we would like to combine the multiple sources of evidence constituted by the multiple sentences. This can be done probabilistically as follows. Let <it>p </it>be the probability that a sentence describes an interaction. Then <it>q </it>= 1-<it>p </it>is the probability that it does not. Given <it>n </it>such independent sentences, and assuming for a moment that probability <it>p </it>is the same for all sentences, then <it>q</it><sup><it>n </it></sup>would be the probability that none of them describe an interaction, thus 1-<it>q</it><sup><it>n </it></sup>the probability that at least one does. Since <it>q </it>= 1-<it>p</it>, the formula for the probability of an interaction between a pair of biomolecules being described within <it>n </it>relevant sentences is 1-(1-<it>p</it>)<sup><it>n</it></sup>.</p>
            <p>In the more typical case of <it>n </it>sentences each with its own value <it>p</it><sub><it>i</it></sub>, <it>i </it>= 1,..., <it>n </it>for the probability that it describes an interaction, the formula generalizes to:</p>
            <p>
               <display-formula id="M1">
                  <graphic file="1471-2105-10-S11-S18-i1.gif"/>
               </display-formula>
            </p>
            <p>assuming the sentences provide independent evidence, an assumption commonly made and found to lead to useful results though in general incorrect.</p>
            <p>It is reasonable to ask if the value of <it>n </it>should be constrained. Some new interactions may be mentioned only in the most recent publications, limiting the number of publications describing these interactions. Thus, particularly for a recent discovery, the fact that only a few sentences exist containing two given biomolecules might not suggest lack of interaction. Therefore, we also assessed two variant methods for estimating the probability of an interaction between two biomolecules. These are as follows:</p>
            <p indent="1">&#8226; <it>Best 5</it>: use the <it>average </it>of the scores of the top 5 sentences, those having the highest probability of describing an interaction between the two biomolecules: <it>p</it>(interaction) = (<it>p</it><sub>1</sub>+<it>p</it><sub>2</sub>+<it>p</it><sub>3</sub>+<it>p</it><sub>4</sub>+<it>p</it><sub>5</sub>)/5.</p>
            <p indent="1">&#8226; <it>Best 2</it>: use the <it>average </it>of the scores of the top 2 sentences: <it>p</it>(interaction) = (<it>p</it><sub>1</sub>+<it>p</it><sub>2</sub>)/2.</p>
            <p>Formula (1) we will call the <it>All </it>method. For the <it>Best 2 </it>and <it>Best 5 </it>methods, if a biomolecule pair co-occurs in fewer than 2 or 5 sentences, 0 was used for the missing probabilities to reach 2 or 5 terms in their formulas.</p>
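            <p>The three combination methods can be sketched in a few lines of Python. This is an illustrative reading of the definitions above, not PathBinder's code: the function names are hypothetical, and the <it>All </it>method is written as one minus the product of the per-sentence values 1-<it>p</it><sub><it>i</it></sub>, the direct generalization of 1-(1-<it>p</it>)<sup><it>n</it></sup>.</p>

```python
def combine_all(probabilities):
    """'All' method: probability that at least one sentence describes
    the interaction, assuming the sentences are independent."""
    none_describe = 1.0
    for p in probabilities:
        none_describe *= 1.0 - p
    return 1.0 - none_describe


def combine_best_k(probabilities, k):
    """'Best k' method: average of the k highest sentence scores.

    Dividing by k even when fewer than k scores exist is the same as
    padding the missing scores with 0, as described in the text.
    """
    top_scores = sorted(probabilities, reverse=True)[:k]
    return sum(top_scores) / k


def combine_best_5(probabilities):
    return combine_best_k(probabilities, 5)


def combine_best_2(probabilities):
    return combine_best_k(probabilities, 2)
```

            <p>For example, two sentences each scored 0.5 give 0.75 under <it>All </it>and 0.5 under <it>Best 2</it>, while a single sentence scored 0.6 gives 0.3 under <it>Best 2 </it>because the missing second score counts as 0.</p>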
            <p>With these 3 evidence combination methods, given a list of biomolecule pairs we can process MEDLINE to extract biomolecular interactions and construct an interaction network. The biomolecules are the vertices in this network, and if two biomolecules are found to interact, there is an edge between their vertices. We obtained the biomolecule name list from an existing database about genome-wide plant mRNA, protein, and metabolite data, MetNetDB (<url>http://metnet.vrac.iastate.edu/MetNet_db.htm</url>, <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>). This database focuses especially on <it>Arabidopsis </it>and soy. We created an interaction network from this database to demonstrate our system.</p>
            <p>Any two biomolecules in the database can be checked to see if they interact. For each such pair, any sentences where the biomolecule pair co-occurs can be collected and analyzed to estimate the probability that the corpus describes them as interacting. However, checking all pairs is computationally inefficient because there are about 2*10<sup>6 </sup>biomolecule records in the database, hence about 4*10<sup>12 </sup>pairs. Instead, we scanned sentences in MEDLINE one by one, identified biomolecule pairs in the sentences, recorded the probability score that each sentence gives to its pairs, and finally generated the network using the <it>All, Best 5 </it>and <it>Best 2 </it>combination methods on the sentences for each pair. The overall structure of the system is shown in Figure <figr fid="F1">1</figr>.</p>
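            <p>The single-pass strategy described above can be sketched as follows. The record layout (a dictionary with a "biomolecules" list), the scoring callback, the combination callback, and the threshold are all assumptions made for illustration; the actual PathBinder data structures are not described at this level of detail.</p>

```python
from collections import defaultdict
from itertools import combinations


def collect_pair_scores(tagged_sentences, score_fn):
    """Scan sentences once and group per-sentence scores by biomolecule pair.

    Only pairs that actually co-occur are ever touched, which avoids
    enumerating every possible pair of database records.
    """
    pair_scores = defaultdict(list)
    for sentence in tagged_sentences:
        names = sorted(set(sentence["biomolecules"]))
        for first, second in combinations(names, 2):
            pair_scores[(first, second)].append(score_fn(sentence, first, second))
    return pair_scores


def build_network(pair_scores, combine, threshold):
    """Keep an edge for each pair whose combined score meets the threshold."""
    edges = {}
    for pair, scores in pair_scores.items():
        combined = combine(scores)
        if combined >= threshold:
            edges[pair] = combined
    return edges
```

            <p>Here <it>combine </it>stands for any function that maps a list of sentence scores to one score, such as an implementation of <it>All</it>, <it>Best 5</it>, or <it>Best 2</it>.</p>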
            <fig id="F1"><title><p>Figure 1</p></title><caption><p>PathBinder system structure</p></caption><text>
   <p><b>PathBinder system structure</b>.</p>
</text><graphic file="1471-2105-10-S11-S18-1"/></fig>
            <p>There are two main parts.</p>
            <p indent="1">1. <it>Interaction Extractor</it>.</p>
            <p indent="2">a. The system examines each sentence in MEDLINE for keywords (biomolecules, IITs, &amp; cellular locations) stored in MetNetDB, tags them and stores the tagged sentences into the PathBinder system database, PathBinderDB.</p>
            <p indent="2">b. When scanning each sentence, the system determines the interaction likelihood for the biomolecule co-occurrence of interest inside the sentence and combines the scores of multiple sentences containing the pair using <it>All</it>, <it>Best 5</it>, and <it>Best 2</it>. The database has two tables for biomolecules, one for their appearances in MEDLINE records and one for entity names recognized by biologists but which might not appear in MEDLINE records under those names. These tables were imported from the MetNet system (Wurtele et al. 2007 <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>). When combining the scores, we first calculated the score for the actual co-occurring pair, then found the entity names in the database corresponding to the co-occurring terms appearing in the text, and finally calculated the composite score for the pair of entity names based on the set of sentences containing co-occurrences of other terms associated with those entity names.</p>
            <p indent="1">2. <it>User Gateway</it>. PathBinder is the user portal to PathBinderDB, serving as a query gateway to the interaction descriptions stored there. Users can provide two biomolecules to PathBinder, which will access PathBinderDB and return all sentences in which the two biomolecules appear. It calculates a probability score for each returned sentence, ranks sentences based on their scores, and then shows them to the user. On the other hand, if a user provides just one biomolecule, PathBinder returns a list of other biomolecules potentially interacting with it.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results and testing</p>
         </st>
         <sec>
            <st>
               <p>Evaluating sentences as interaction descriptions</p>
            </st>
            <p>We began with the test corpus of 320 sentences described earlier, for which we computed 320 probability estimates for the likelihood that they described an interaction between a given biomolecule pair. We also manually judged whether each sentence actually does describe an interaction between the queried biomolecule pair, recording 1 if so, or 0 if not, in order to facilitate doing a linear regression to fit the 320 computed likelihoods to the 320 corresponding manual data. If the probability that a sentence describes an interaction is computed accurately, then for a set of sentences with the same computed probability of describing the interaction (e.g., 0.75), that probability is also the expected fraction of those sentences manually found to actually describe the interaction. For example, given a set of sentences each computed to describe an interaction with probability <it>p </it>= 0.75, the statistically expected fraction of them to, in fact, describe an interaction would also be 0.75 (75%), if the computed probability was accurate. Therefore, we can test the accuracy of the computed probabilities by checking how close the linear regression result is to the line <it>y </it>= <it>x </it>(or for axes labeled as in Figure <figr fid="F2">2</figr>, <it>p</it><sub>manual </sub>= <it>p</it><sub>computed</sub>). We consider the actual regression result next.</p>
            <fig id="F2"><title><p>Figure 2</p></title><caption><p>Linear regression results: computation vs. manual analysis (theoretical ideal: <it>p</it><sub>manual </sub>= <it>p</it><sub>computed</sub>)</p></caption><text>
   <p><b>Linear regression results: computation vs. manual analysis (theoretical ideal: <it>p</it><sub>manual </sub>= <it>p</it><sub>computed</sub>)</b>. Note that the 320 manually determined data points all have probability values of 0 or 1 (either they describe an interaction or not), so many of them overlap in the graph.</p>
</text><graphic file="1471-2105-10-S11-S18-2"/></fig>
            <p>The regression line shown in Figure <figr fid="F2">2</figr> is not precisely <it>p</it><sub>manual </sub>= <it>p</it><sub>computed</sub>, but is fairly close:</p>
            <p>
               <display-formula id="M2">
                  <graphic file="1471-2105-10-S11-S18-i2.gif"/>
               </display-formula>
            </p>
            <p>To make our computed probabilities more accurately reflect manually determined reality (i.e., give a regression line of <it>y </it>= <it>x</it>), we can adjust them by defining a <it>p</it><sub>adjusted</sub>:</p>
            <p>
               <display-formula id="M3">
                  <graphic file="1471-2105-10-S11-S18-i3.gif"/>
               </display-formula>
            </p>
            <p>It is no accident that Eqs. (2) and (3) are so similar: showing <it>p</it><sub>adjusted </sub>on the <it>x </it>axis will then give a regression line of <it>y </it>= <it>x </it>or, in the present case, <it>p</it><sub>manual </sub>= <it>p</it><sub>adjusted</sub>, as desired.</p>
            <p>We applied (3) in PathBinder, so that for each sentence <it>s</it>, a computed probability score <it>p</it><sub>computed</sub>(<it>s</it>), is calculated and then adjusted to give a probability score <it>p</it><sub>adjusted</sub>(<it>s</it>) for the probability that it describes an interaction between two given biomolecules.</p>
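            <p>The calibration idea can be sketched in Python under the assumption that the adjustment simply applies the fitted regression line to each computed probability. The function names and the toy data are hypothetical, and the coefficients of Eqs. (2) and (3) are not used here.</p>

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def adjust(p_computed, slope, intercept):
    """Map a computed probability onto the regression line, so that a
    regression of manual judgments on the adjusted values is y = x."""
    return slope * p_computed + intercept


# Toy data only: computed probabilities and manual 0/1 judgments.
toy_computed = [0.2, 0.2, 0.8, 0.8, 0.8, 0.5]
toy_manual = [0, 0, 1, 1, 0, 1]
toy_slope, toy_intercept = fit_line(toy_computed, toy_manual)
toy_adjusted = [adjust(p, toy_slope, toy_intercept) for p in toy_computed]
```

            <p>Refitting the manual judgments against the adjusted values returns a slope of 1 and an intercept of 0 up to rounding, which is the property the adjustment is meant to provide.</p>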
            <p>The discrepancy between <it>p</it><sub>computed </sub>and <it>p</it><sub>manual </sub>has two possible causes. First, it can simply be a statistical artifact of noisy data. Second, the computational model underlying <it>p</it><sub>computed </sub>might represent reality imperfectly, as models in general often do, and as probabilistic models in particular often do due to implicit independence assumptions that only approximately hold.</p>
            <p>To help determine the cause here, and thus test the validity of the <it>p</it><sub><it>adjusted </it></sub>calculation, we collected a test set of 600 sentences. Of these, 123 contained biomolecule pairs from among the 10 we used to create the training corpus, but were not already in the 320 sentence experimental set. To get the remaining 477, we collected sentences with <it>p</it><sub><it>adjusted </it></sub>values of 0, 0.1 &#177; 0.01, 0.2 &#177; 0.02, 0.3 &#177; 0.03, 0.4 &#177; 0.04, 0.5 &#177; 0.05, 0.6 &#177; 0.06, 0.7 &#177; 0.07 and 0.739 &#177; 0.07 (the <it>p</it><sub><it>adjusted </it></sub>computation gives results up to about 0.739). About 50 sentences for each of those values were collected from search results using the new pairs: <it>ethanol &amp; acetaldehyde, acetyl-CoA &amp; NADH, dynamin &amp; GTP, adenylate cyclase &amp; ATP</it>, and <it>ATP &amp; creatine</it>.</p>
            <p>For each of the 600 test sentences, whether it really described the interaction was judged manually and recorded as 0 (no) or 1 (yes). Then we did a linear regression on the test set (Figure <figr fid="F3">3</figr>) as was done earlier in Figure <figr fid="F2">2</figr>.</p>
            <fig id="F3"><title><p>Figure 3</p></title><caption><p>The linear regression results for the test set of 600 sentences</p></caption><text>
   <p><b>The linear regression results for the test set of 600 sentences</b>. Note that the 600 manually determined data points often overlap because (i) they all have a height of either 0 or 1, as they were all manually determined to describe an interaction (1), or not (0), and (ii) most of them have horizontal axis values very close to 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, or 0.7.</p>
</text><graphic file="1471-2105-10-S11-S18-3"/></fig>
            <p>The regression line we get is</p>
            <p>
               <display-formula id="M4">
                  <graphic file="1471-2105-10-S11-S18-i4.gif"/>
               </display-formula>
            </p>
            <p>which is very close to the ideal of <it>y </it>= <it>x</it>. Thus PathBinder's calculation of <it>p</it><sub>adjusted </sub>is justified by the 600-sentence test set.</p>
         </sec>
         <sec>
            <st>
               <p>Combining evidence across sentences to create an interaction network</p>
            </st>
            <p>Equation (3) is used to evaluate the likelihood that <it>each </it>sentence describes an interaction. As mentioned earlier, we combine evidence from <it>multiple </it>sentences to evaluate the likelihood that a pair of biomolecules interacts using Equation (1) (the <it>All </it>method), as well as the <it>Best 2 </it>and <it>Best 5 </it>methods. The result is an interaction network of thousands of biomolecules and the interaction relationships among them. The key information retrieval measures of precision and recall were used to compare <it>All</it>, <it>Best 5</it>, and <it>Best 2</it>. Some key results are shown in Figure <figr fid="F4">4</figr>.</p>
            <fig id="F4"><title><p>Figure 4</p></title><caption><p>Recalls and precisions of the three methods for combining evidence from multiple sentences</p></caption><text>
   <p><b>Recalls and precisions of the three methods for combining evidence from multiple sentences</b>.</p>
</text><graphic file="1471-2105-10-S11-S18-4"/></fig>
            <p>To determine the precisions in Figure <figr fid="F4">4</figr>, we randomly sampled a set of 400 pairs of biomolecules co-occurring in MEDLINE from the previously generated interaction network. The sentences for each pair were each evaluated by the three methods (<it>All</it>, <it>Best 5 </it>&amp;<it> Best 2</it>), and the resulting computationally estimated probabilities of interaction were recorded for each pair. The 400 pairs were also manually analyzed to see whether they do in fact interact. One hundred eight of them did interact. The overall precision was thus 108/400 = 0.27 for this random set. More importantly, we calculated the precisions analogously for 7 subsets of the 400 pairs meeting 7 different thresholds for interaction probability. This was done separately for <it>All</it>, <it>Best 5</it>, and <it>Best 2 </it>(making 7*3 = 21 subsets). Thus each subset was associated with a threshold, a calculation method, a precision, and a recall which was the fraction of the 108 interacting pairs meeting the threshold using the calculation method. The overall recall for the whole set is necessarily 1.</p>
            <p>Some aspects of Figure <figr fid="F4">4</figr> are worth considering further. For the <it>All </it>method, the leftmost data point refers to co-occurrences with a calculated interaction probability of 1. Such a high value arises when many sentences provide evidence. Combining that evidence using equation (1) leads to score values that are effectively 1 (for example, co-occurrences of "bilirubin" and "cytochrome P450" and their synonyms were computed to have a score of 1 &#8722; 10<sup>-11</sup>). We counted any score over 1 &#8722; 10<sup>-6 </sup>as 1. This was therefore the most selective threshold for the <it>All </it>method and it occurred for 342,492 biomolecule pairs (for MEDLINE as of October 2008).</p>
            <p>Unlike the <it>All </it>method, the <it>Best 5 </it>and <it>Best 2 </it>methods only look at average scores of sentences, so calculated probability scores tend to be lower for these methods than for the <it>All </it>method. Thus <it>Best 5 </it>and <it>Best 2 </it>permit score thresholds met by fewer than 342,492 pairs.</p>
            <p>The curves in Fig. <figr fid="F4">4</figr> are not always monotonic. For example, the first part of the <it>Best 5 </it>curve is not monotonic. The leftmost point on that curve, (0.16, 0.61) is based on the 28 pairs meeting or exceeding a threshold score value of 0.58, computed by the <it>Best 5 </it>method. This was the most selective threshold used to generate the curve. Yet the 62 pairs that met a lower threshold of 0.53 actually had a higher precision, giving point (0.35, 0.63) in Figure <figr fid="F4">4</figr>. One possible reason is noise from the limited data. Another possibility is that <it>Best 5 </it>actually does produce this effect for some reason.</p>
            <p>Recall and precision are often combined to get a single, composite measure of information retrieval quality called the effectiveness, or <it>F</it>-measure, of an information retrieval method: <it>F </it>= 2(recall*precision)/(recall+precision). Figure <figr fid="F5">5</figr> shows the effectiveness for the three methods as a function of the size of the subset meeting a given threshold, with size expressed as a percentage of the full 400-member set.</p>
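            <p>The threshold-based evaluation can be sketched as follows. The input layout, a list of (score, interacts) tuples with the manual judgment as a Boolean, and the function name are assumptions for illustration only.</p>

```python
def precision_recall_f(scored_pairs, threshold):
    """Precision, recall, and F-measure for the pairs meeting a threshold.

    scored_pairs: list of (score, interacts) tuples, where interacts is
    the manual judgment (True or False) for that biomolecule pair.
    """
    total_interacting = sum(1 for _, interacts in scored_pairs if interacts)
    selected = [interacts for score, interacts in scored_pairs if score >= threshold]
    true_positives = sum(1 for interacts in selected if interacts)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / total_interacting if total_interacting else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure
```

            <p>As a consistency check against the text, the full 400-pair set has precision 0.27 and recall 1, which gives <it>F </it>= 2(1*0.27)/(1+0.27), or about 0.43.</p>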
            <fig id="F5"><title><p>Figure 5</p></title><caption><p>Effectiveness (<it>F</it>-measure) comparison of the three methods</p></caption><text>
   <p><b>Effectiveness (<it>F</it>-measure) comparison of the three methods</b>.</p>
</text><graphic file="1471-2105-10-S11-S18-5"/></fig>
            <p>For the <it>F </it>measure, the <it>Best 2 </it>method gave the highest peak value, for a threshold met by 137 pairs. For the full result interaction network, there are 1,646,337 pairs that meet that threshold.</p>
         </sec>
         <sec>
            <st>
               <p>Use in PathBinder</p>
            </st>
            <p>Our technique has been applied in the PathBinder System, which provides a query gateway to users. If a user provides a biomolecule, PathBinder can find other biomolecules potentially interacting with it. Users can choose a biomolecule pair as a query for sentences describing interactions, as illustrated in Figure <figr fid="F6">6</figr>. Users can also specify more query conditions, like cellular locations (e.g., nucleus, mitochondrion), categories of IITs appearing with the co-occurring biomolecule names (e.g., association, modification), specific IITs appearing with a co-occurrence (e.g., bind, increase) and Linnaean taxonomic categories. All these data are obtained when processing MEDLINE and are pre-recorded in the database. Once PathBinder gets a query, it will search for all sentences satisfying the query and display them in a new window, as in Figure <figr fid="F7">7</figr>. It can order the result sentences by PMID or (as in Figure <figr fid="F7">7</figr>) by their estimated probability of describing an interaction between the biomolecules. Users can click the PMID to read the PubMed record containing the sentence directly on the PubMed Web site.</p>
            <fig id="F6"><title><p>Figure 6</p></title><caption><p>PathBinder main screen</p></caption><text>
   <p><b>PathBinder main screen</b>.</p>
</text><graphic file="1471-2105-10-S11-S18-6"/></fig>
            <fig id="F7"><title><p>Figure 7</p></title><caption><p>PathBinder search results</p></caption><text>
   <p><b>PathBinder search results</b>.</p>
</text><graphic file="1471-2105-10-S11-S18-7"/></fig>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>As explained earlier, we calculate a rather precise probability estimate that a sentence describes an interaction between a given biomolecule pair. However, this precision can be misleading. A typical problem is that an IIT describes the interaction of one biomolecule in the given pair with another biomolecule not in the pair, but the non-syntactic approach of PathBinder mistakenly concludes the interaction may be between the biomolecules of interest. For example, consider the sentence</p>
         <p>
            <it>Sodium dichloroacetate increased </it>
            <b>
               <it>glucose </it>
            </b>
            <it><ul>oxidation</ul> and </it>
            <b>
               <it>pyruvate </it>
            </b>
            <it>oxidation in hearts from fed normal or alloxan-diabetic rats perfused with glucose and insulin. </it>
            <abbrgrp>
               <abbr bid="B38">38</abbr>
            </abbrgrp>
         </p>
         <p>The term "oxidation" is between the biomolecules "glucose" and "pyruvate" but it does not describe an interaction between them. PathBinder, however, gives a high score to this sentence anyway. Analyzing the syntactic structure of the sentence, as with full parsing or link grammar <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>, would help solve this problem, but is computationally more expensive.</p>
         <p>Another typical problem is that some IITs are not recognized. An unusual IIT might not be stored in our database and so would not be recognized. For example, consider the following sentence.</p>
         <p>
            <b>
               <it>GTP</it>
            </b>
            <it>-<ul>dependent</ul> twisting of </it>
            <b>
               <it>dynamin</it>
            </b>
            <it> implicates constriction and tension in membrane fission. </it>
            <abbrgrp>
               <abbr bid="B40">40</abbr>
            </abbrgrp>
         </p>
         <p>If we try to find an interaction between GTP and dynamin, there is no obvious IIT describing their interaction. But the word "dependent" describes a relation between "GTP" and "twisting of dynamin," so that there is indeed an interaction described. However, neither "dependent" nor "twist" is currently used by the system as an IIT, and so this sentence gets too low a score.</p>
         <p>Another problem occurs with biomolecules that are very common in MEDLINE. The chance that two of them co-occur in one sentence can be elevated even if they do not interact just because they are so common overall. Most sentences that they co-occur in might not get a high estimated probability of describing an interaction, but if even a small fraction of them do, the estimated probability of interaction can still be high. An example is "ATP" and "starch."</p>
         <p>A different problem in network construction is posed by biomolecule names that look like common English words. For example, since the word "no" and "NO," the formula for nitric oxide, have the same spelling, and the token "no" appears very often in MEDLINE, a na&#239;ve analysis will mistakenly conclude that nitric oxide interacts with thousands of biomolecules. In addition, some non-biomolecule terms tend to creep into lexicons of biomolecules, like "resistance" in our case. Such terms then tend to become members of invalid "interactions." In fact, if we eliminate the effects of words like "no" and "resistance," the precision of our results increases significantly, as shown in Figure <figr fid="F8">8</figr>. The higher precision in turn improves effectiveness, as shown in Figure <figr fid="F9">9</figr> (recall stays the same in this test because no new interacting pairs appear).</p>
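         <p>A minimal sketch of this kind of filtering is given below. The stop-list, the placeholder molecule names, and the toy predicted and gold-standard sets are all hypothetical; the sketch only shows why removing pairs that involve ambiguous names can raise precision while leaving recall unchanged.</p>

```python
# Hypothetical stop-list of ambiguous "biomolecule names".
AMBIGUOUS_NAMES = {"no", "resistance"}


def filter_pairs(pairs, stop_names):
    """Drop every predicted pair that contains a stop-listed name."""
    return {pair for pair in pairs if stop_names.isdisjoint(pair)}


def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted pairs against gold pairs."""
    true_pos = len(predicted.intersection(gold))
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Toy data with placeholder names; pairs are unordered, hence frozenset.
gold = {frozenset(("mol_a", "mol_b")), frozenset(("mol_c", "mol_d"))}
predicted = {
    frozenset(("mol_a", "mol_b")),
    frozenset(("no", "mol_a")),
    frozenset(("no", "mol_c")),
    frozenset(("resistance", "mol_b")),
}

print(precision_recall(predicted, gold))
print(precision_recall(filter_pairs(predicted, AMBIGUOUS_NAMES), gold))
```

         <p>On this toy data, filtering raises precision from 0.25 to 1.0 while recall remains 0.5, because the removed pairs were all false positives and no new true pairs were added.</p>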
         <fig id="F8"><title><p>Figure 8</p></title><caption><p>Updated recalls and precisions of the three methods for combining evidence from multiple sentences</p></caption><text>
   <p><b>Updated recalls and precisions of the three methods for combining evidence from multiple sentences</b>. Precisions are markedly improved when problematic "biomolecule names" are manually removed from consideration (compare this with Figure 4).</p>
</text><graphic file="1471-2105-10-S11-S18-8"/></fig>
         <fig id="F9"><title><p>Figure 9</p></title><caption><p>Updated effectiveness comparison of the three methods</p></caption><text>
   <p><b>Updated effectiveness comparison of the three methods</b>. When problematic "biomolecule names" are manually removed from consideration, effectiveness increases relative to Figure 5.</p>
</text><graphic file="1471-2105-10-S11-S18-9"/></fig>
         <p>Our precision is higher than that reported for some other interaction extraction applications, and our highest precision of 95% is among the best results reported for extracting interactions so far. Full NLP methods should in principle be capable of approaching 100% precision and recall. By avoiding full NLP, however, our system saves considerable computation time. Our results could be improved, while retaining the computational efficiency of shallow methods, by investigating and using empirics for more text features. Even when full NLP becomes available at some future time, easily computed text empirics will still have potential value as an ancillary evidence source that could improve and speed up NLP-based analyses.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>We developed algorithms to extract sentences describing interactions between biomolecules based on <it>text empirics</it>, that is, observed characteristics of textual passages. Using this approach, we designed a software system that serves users by extracting interaction descriptions from MEDLINE. The extracted sentences can be ranked by their estimated probability of describing an interaction between the two biomolecules. We compared the probability estimates to manually generated ("gold standard") data to test their accuracy. Results were close, as shown by Eq. (2), and nearly identical when estimates were linearly adjusted and then tested against a new test set. From MEDLINE, we extracted an interaction network that contains more than 300,000 probable interactions. The approach was demonstrated in a system architecture designed for human searchers. However, the underlying text empirics results we offer here could be used by other researchers and system designers as well.</p>
      </sec>
      <sec>
         <st>
            <p>List of abbreviations used</p>
         </st>
         <p>(NLP): Natural language processing; (ML): Machine Learning; (IITs): Interaction-indicating terms.</p>
      </sec>
      <sec>
         <st>
            <p>Competing interests</p>
         </st>
         <p>The authors declare that they have no competing interests.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>LZ and DB analyzed the corpus and designed the algorithm. LZ, JD and TC developed the software. DB and ESW determined the goals, system architecture and usability design.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We are grateful for support for this work from the NSF Arabidopsis 2010 project under grant DBI-0520267.</p>
            <p>This article has been published as part of <it>BMC Bioinformatics </it>Volume 10 Supplement 11, 2009: Proceedings of the Sixth Annual MCBIOS Conference. Transformational Bioinformatics: Delivering Value from Genomes. The full contents of the supplement are available online at <url>http://www.biomedcentral.com/1471-2105/10?issue=S11</url>.</p>
         </sec>
      </ack>
      <refgrp><bibl id="B1"><title><p>Automated extraction of information on protein-protein interactions from the biological literature</p></title><aug><au><snm>Ono</snm><fnm>T</fnm></au><au><snm>Hishigaki</snm><fnm>H</fnm></au><au><snm>Tanigami</snm><fnm>A</fnm></au><au><snm>Takagi</snm><fnm>T</fnm></au></aug><source>Bioinformatics</source><pubdate>2001</pubdate><volume>17</volume><fpage>155</fpage><lpage>161</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/17.2.155</pubid><pubid idtype="pmpid" link="fulltext">11238071</pubid></pubidlist></xrefbib></bibl><bibl id="B2"><title><p>Expansion of the BioCyc collection of pathway/genome databases to 160 genomes</p></title><aug><au><snm>Karp</snm><fnm>PD</fnm></au><au><snm>Ouzounis</snm><fnm>CA</fnm></au><au><snm>Moore-Kochlacs</snm><fnm>C</fnm></au><au><snm>Goldovsky</snm><fnm>L</fnm></au><au><snm>Kaipa</snm><fnm>P</fnm></au><au><snm>Ahr&#233;n</snm><fnm>D</fnm></au><au><snm>Tsoka</snm><fnm>S</fnm></au><au><snm>Darzentas</snm><fnm>N</fnm></au><au><snm>Kunin</snm><fnm>V</fnm></au><au><snm>L&#243;pez-Bigas</snm><fnm>N</fnm></au></aug><source>Nucleic Acids Research</source><pubdate>2005</pubdate><volume>33</volume><issue>19</issue><fpage>6083</fpage><lpage>6089</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gki892</pubid><pubid idtype="pmcid">1266070</pubid><pubid idtype="pmpid" link="fulltext">16246909</pubid></pubidlist></xrefbib></bibl><bibl id="B3"><title><p>MetNet: systems biology software for 
Arabidopsis</p></title><aug><au><snm>Wurtele</snm><fnm>ES</fnm></au><au><snm>Li</snm><fnm>L</fnm></au><au><snm>Berleant</snm><fnm>D</fnm></au><au><snm>Cook</snm><fnm>D</fnm></au><au><snm>Dickerson</snm><fnm>JA</fnm></au><au><snm>Ding</snm><fnm>J</fnm></au><au><snm>Hofmann</snm><fnm>H</fnm></au><au><snm>Lawrence</snm><fnm>M</fnm></au><au><snm>Lee</snm><fnm>EK</fnm></au><au><snm>Li</snm><fnm>J</fnm></au><au><snm>Mentzen</snm><fnm>W</fnm></au><au><snm>Miller</snm><fnm>L</fnm></au><au><snm>Nikolau</snm><fnm>BJ</fnm></au><au><snm>Ransom</snm><fnm>N</fnm></au><au><snm>Wang</snm><fnm>Y</fnm></au></aug><source>Concepts in Plant Metabolomics</source><publisher>Springer</publisher><pubdate>2007</pubdate><fpage>145</fpage><lpage>158</lpage></bibl><bibl id="B4"><title><p>The MIPS mammalian protein-protein interaction database</p></title><aug><au><snm>Pagel</snm><fnm>P</fnm></au><au><snm>Kovac</snm><fnm>S</fnm></au><au><snm>Oesterheld</snm><fnm>M</fnm></au><au><snm>Brauner</snm><fnm>B</fnm></au><au><snm>Dunger-Kaltenbach</snm><fnm>I</fnm></au><au><snm>Frishman</snm><fnm>G</fnm></au><au><snm>Montrone</snm><fnm>C</fnm></au><au><snm>Mark</snm><fnm>P</fnm></au><au><snm>St&#252;mpflen</snm><fnm>V</fnm></au><au><snm>Mewes</snm><fnm>H-W</fnm></au><au><snm>Ruepp</snm><fnm>A</fnm></au><au><snm>Frishman</snm><fnm>D</fnm></au></aug><source>Bioinformatics</source><pubdate>2005</pubdate><volume>21</volume><fpage>832</fpage><lpage>834</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bti115</pubid><pubid idtype="pmpid" link="fulltext">15531608</pubid></pubidlist></xrefbib></bibl><bibl id="B5"><title><p>From genomics to chemical genomics: new developments in 
KEGG</p></title><aug><au><snm>Kanehisa</snm><fnm>M</fnm></au><au><snm>Goto</snm><fnm>S</fnm></au><au><snm>Hattori</snm><fnm>M</fnm></au><au><snm>Aoki-Kinoshita</snm><fnm>KF</fnm></au><au><snm>Itoh</snm><fnm>M</fnm></au><au><snm>Kawashima</snm><fnm>S</fnm></au><au><snm>Katayama</snm><fnm>T</fnm></au><au><snm>Araki</snm><fnm>M</fnm></au><au><snm>Hirakawa</snm><fnm>M</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2006</pubdate><volume>34</volume><issue>Database issue</issue><fpage>D354</fpage><lpage>D357</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gkj102</pubid><pubid idtype="pmcid">1347464</pubid><pubid idtype="pmpid" link="fulltext">16381885</pubid></pubidlist></xrefbib></bibl><bibl id="B6"><title><p>The Database of Interacting Proteins: 2004 update</p></title><aug><au><snm>Salwinski</snm><fnm>L</fnm></au><au><snm>Miller</snm><fnm>CS</fnm></au><au><snm>Smith</snm><fnm>AJ</fnm></au><au><snm>Pettit</snm><fnm>FK</fnm></au><au><snm>Bowie</snm><fnm>JU</fnm></au><au><snm>Eisenberg</snm><fnm>D</fnm></au></aug><source>Nucleic Acids Research</source><pubdate>2004</pubdate><volume>32</volume><fpage>D449</fpage><lpage>D451</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gkh086</pubid><pubid idtype="pmcid">308820</pubid><pubid idtype="pmpid" link="fulltext">14681454</pubid></pubidlist></xrefbib></bibl><bibl id="B7"><title><p>BIND: the Biomolecular Interaction Network Database</p></title><aug><au><snm>Bader</snm><fnm>GD</fnm></au><au><snm>Betel</snm><fnm>D</fnm></au><au><snm>Hogue</snm><fnm>CW</fnm></au></aug><source>Nucleic Acids Research</source><pubdate>2003</pubdate><volume>31</volume><issue>1</issue><fpage>248</fpage><lpage>250</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gkg056</pubid><pubid idtype="pmcid">165503</pubid><pubid idtype="pmpid" link="fulltext">12519993</pubid></pubidlist></xrefbib></bibl><bibl id="B8"><title><p>Mining MEDLINE: abstracts, sentences, or 
phrases?</p></title><aug><au><snm>Ding</snm><fnm>J</fnm></au><au><snm>Berleant</snm><fnm>D</fnm></au><au><snm>Nettleton</snm><fnm>D</fnm></au><au><snm>Wurtele</snm><fnm>E</fnm></au></aug><source>Pacific Symposium on Biocomputing</source><pubdate>2002</pubdate><fpage>326</fpage><lpage>337</lpage><xrefbib><pubid idtype="pmpid">11928487</pubid></xrefbib></bibl><bibl id="B9"><title><p>PathwayFinder: paving the way towards automatic pathway extraction</p></title><aug><au><snm>Yao</snm><fnm>D</fnm></au><au><snm>Wang</snm><fnm>J</fnm></au><au><snm>Lu</snm><fnm>Y</fnm></au><au><snm>Noble</snm><fnm>N</fnm></au><au><snm>Sun</snm><fnm>H</fnm></au><au><snm>Zhu</snm><fnm>X</fnm></au><au><snm>Payan</snm><fnm>DG</fnm></au><au><snm>Li</snm><fnm>M</fnm></au><au><snm>Qu</snm><fnm>K</fnm></au></aug><source>Proceedings of the second conference on Asia-Pacific bioinformatics</source><pubdate>2004</pubdate><volume>29</volume><fpage>53</fpage><lpage>62</lpage></bibl><bibl id="B10"><title><p>Wnt pathway curation using automated natural language processing: combining statistical methods with partial and full parse for knowledge extraction</p></title><aug><au><snm>Santos</snm><fnm>C</fnm></au><au><snm>Eggle</snm><fnm>D</fnm></au><au><snm>States</snm><fnm>DJ</fnm></au></aug><source>Bioinformatics</source><pubdate>2005</pubdate><volume>21</volume><fpage>1653</fpage><lpage>1658</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bti165</pubid><pubid idtype="pmpid" link="fulltext">15564295</pubid></pubidlist></xrefbib></bibl><bibl id="B11"><title><p>Text mining of full-text journal articles combined with gene expression analysis reveals a relationship between sphingosine-1-phosphate and invasiveness of a glioblastoma cell 
line</p></title><aug><au><snm>Natarajan</snm><fnm>J</fnm></au><au><snm>Berrar</snm><fnm>D</fnm></au><au><snm>Dubitzky</snm><fnm>W</fnm></au><au><snm>Hack</snm><fnm>C</fnm></au><au><snm>Zhang</snm><fnm>Y</fnm></au><au><snm>DeSesa</snm><fnm>C</fnm></au><au><snm>Van Brocklyn</snm><fnm>JR</fnm></au><au><snm>Bremer</snm><fnm>EG</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2006</pubdate><volume>7</volume><fpage>373</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-7-373</pubid><pubid idtype="pmcid">1557675</pubid><pubid idtype="pmpid" link="fulltext">16901352</pubid></pubidlist></xrefbib></bibl><bibl id="B12"><title><p>RelEx &#8211; Relation extraction using dependency parse trees</p></title><aug><au><snm>Fundel</snm><fnm>K</fnm></au><au><snm>K&#252;ffner</snm><fnm>R</fnm></au><au><snm>Zimmer</snm><fnm>R</fnm></au></aug><source>Bioinformatics</source><pubdate>2007</pubdate><volume>23</volume><issue>3</issue><fpage>365</fpage><lpage>371</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btl616</pubid><pubid idtype="pmpid" link="fulltext">17142812</pubid></pubidlist></xrefbib></bibl><bibl id="B13"><title><p>Mining of relations between proteins over biomedical scientific literature using a deep-linguistic approach</p></title><aug><au><snm>Rinaldi</snm><fnm>F</fnm></au><au><snm>Schneider</snm><fnm>G</fnm></au><au><snm>Kaljurand</snm><fnm>K</fnm></au><au><snm>Hess</snm><fnm>M</fnm></au><au><snm>Andronis</snm><fnm>C</fnm></au><au><snm>Konstandi</snm><fnm>O</fnm></au><au><snm>Persidis</snm><fnm>A</fnm></au></aug><source>Artificial Intelligence in Medicine</source><pubdate>2007</pubdate><volume>39</volume><fpage>127</fpage><lpage>136</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.artmed.2006.08.005</pubid><pubid idtype="pmpid" link="fulltext">17052900</pubid></pubidlist></xrefbib></bibl><bibl id="B14"><title><p>Evaluating contributions of natural language parsers to protein-protein interaction 
extraction</p></title><aug><au><snm>Miyao</snm><fnm>Y</fnm></au><au><snm>Sagae</snm><fnm>K</fnm></au><au><snm>S&#230;tre</snm><fnm>R</fnm></au><au><snm>Matsuzaki</snm><fnm>T</fnm></au><au><snm>Tsujii</snm><fnm>T</fnm></au></aug><source>Bioinformatics</source><pubdate>2009</pubdate><volume>25</volume><issue>3</issue><fpage>394</fpage><lpage>400</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btn631</pubid><pubid idtype="pmcid">2639072</pubid><pubid idtype="pmpid" link="fulltext">19073593</pubid></pubidlist></xrefbib></bibl><bibl id="B15"><title><p>Large-scale directional relationship extraction and resolution</p></title><aug><au><snm>Giles</snm><fnm>CB</fnm></au><au><snm>Wren</snm><fnm>JD</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2008</pubdate><volume>9</volume><issue>Suppl 9</issue><fpage>S11</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-9-S9-S11</pubid><pubid idtype="pmcid">2537562</pubid><pubid idtype="pmpid" link="fulltext">18793456</pubid></pubidlist></xrefbib></bibl><bibl id="B16"><title><p>Event extraction from biomedical papers using a full parser in biocomputing</p></title><aug><au><snm>Yakushiji</snm><fnm>A</fnm></au><au><snm>Tateisi</snm><fnm>Y</fnm></au><au><snm>Miyao</snm><fnm>Y</fnm></au><au><snm>Tsujii</snm><fnm>Y</fnm></au></aug><source>Pac Symp Biocomput</source><pubdate>2001</pubdate><fpage>408</fpage><lpage>419</lpage><xrefbib><pubid idtype="pmpid">11262959</pubid></xrefbib></bibl><bibl id="B17"><title><p>GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles</p></title><aug><au><snm>Friedman</snm><fnm>C</fnm></au><au><snm>Kra</snm><fnm>P</fnm></au><au><snm>Yu</snm><fnm>H</fnm></au><au><snm>Krauthammer</snm><fnm>M</fnm></au><au><snm>Rzhetsky</snm><fnm>A</fnm></au></aug><source>Bioinformatics</source><pubdate>2001</pubdate><volume>17</volume><issue>Suppl 1</issue><fpage>S74</fpage><lpage>82</lpage><xrefbib><pubid idtype="pmpid" 
link="fulltext">11472995</pubid></xrefbib></bibl><bibl id="B18"><title><p>GIS: a biomedical text-mining system for gene information discovery</p></title><aug><au><snm>Chiang</snm><fnm>J</fnm></au><au><snm>Yu</snm><fnm>H</fnm></au><au><snm>Hsu</snm><fnm>H</fnm></au></aug><source>Bioinformatics</source><pubdate>2004</pubdate><volume>20</volume><fpage>120</fpage><lpage>121</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btg369</pubid><pubid idtype="pmpid" link="fulltext">14693818</pubid></pubidlist></xrefbib></bibl><bibl id="B19"><title><p>Applying GIFT, a Gene Interactions Finder in Text, to fly literature</p></title><aug><au><snm>Domedel-Puig</snm><fnm>N</fnm></au><au><snm>Wernisch</snm><fnm>L</fnm></au></aug><source>Bioinformatics</source><pubdate>2005</pubdate><volume>21</volume><fpage>3582</fpage><lpage>3583</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bti578</pubid><pubid idtype="pmpid" link="fulltext">16014369</pubid></pubidlist></xrefbib></bibl><bibl id="B20"><title><p>Content-rich biological network constructed by mining PubMed abstracts</p></title><aug><au><snm>Chen</snm><fnm>H</fnm></au><au><snm>Sharp</snm><fnm>BM</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2004</pubdate><volume>5</volume><fpage>147</fpage><url>Http://www.biomedcentral.com/1471-2105/5/147</url><note>The Chilibot system is on-line at [<url>http://www.chilibot.net/</url>]</note><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-5-147</pubid><pubid idtype="pmcid">528731</pubid><pubid idtype="pmpid" link="fulltext">15473905</pubid></pubidlist></xrefbib></bibl><bibl id="B21"><title><p>Pharmspresso: a text mining tool for extraction of pharmacogenomic concepts and relationships from full text</p></title><aug><au><snm>Garten</snm><fnm>Y</fnm></au><au><snm>Altman</snm><fnm>RB</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2009</pubdate><volume>10</volume><issue>Suppl 
2</issue><fpage>S6</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-10-S2-S6</pubid><pubid idtype="pmcid">2646239</pubid><pubid idtype="pmpid" link="fulltext">19208194</pubid></pubidlist></xrefbib></bibl><bibl id="B22"><title><p>Discovering patterns to extract protein-protein interactions from full texts</p></title><aug><au><snm>Huang</snm><fnm>M</fnm></au><au><snm>Zhu</snm><fnm>X</fnm></au><au><snm>Hao</snm><fnm>Y</fnm></au><au><snm>Payan</snm><fnm>DG</fnm></au><au><snm>Qu</snm><fnm>K</fnm></au><au><snm>Li</snm><fnm>M</fnm></au></aug><source>Bioinformatics</source><pubdate>2004</pubdate><volume>20</volume><fpage>3604</fpage><lpage>3612</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bth451</pubid><pubid idtype="pmpid" link="fulltext">15284092</pubid></pubidlist></xrefbib></bibl><bibl id="B23"><title><p>Extracting protein function information from MEDLINE using a full-sentence parser</p></title><aug><au><snm>Daraselia</snm><fnm>N</fnm></au><au><snm>Yuryev</snm><fnm>A</fnm></au><au><snm>Egorov</snm><fnm>S</fnm></au><au><snm>Novichkova</snm><fnm>S</fnm></au><au><snm>Nikitin</snm><fnm>A</fnm></au><au><snm>Mazo</snm><fnm>I</fnm></au></aug><source>Bioinformatics</source><pubdate>2004</pubdate><volume>20</volume><fpage>604</fpage><lpage>611</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btg452</pubid><pubid idtype="pmpid" link="fulltext">15033866</pubid></pubidlist></xrefbib></bibl><bibl id="B24"><title><p>Automatic extraction of gene/protein biological functions from biomedical text</p></title><aug><au><snm>Koike</snm><fnm>A</fnm></au><au><snm>Niwa</snm><fnm>Y</fnm></au><au><snm>Takagi</snm><fnm>T</fnm></au></aug><source>Bioinformatics</source><pubdate>2005</pubdate><volume>21</volume><fpage>1227</fpage><lpage>1236</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bti084</pubid><pubid idtype="pmpid" link="fulltext">15509601</pubid></pubidlist></xrefbib></bibl><bibl id="B25"><title><p>Mining 
literature for protein-protein interactions</p></title><aug><au><snm>Marcotte</snm><fnm>EM</fnm></au><au><snm>Xenarios</snm><fnm>I</fnm></au><au><snm>Eisenberg</snm><fnm>D</fnm></au></aug><source>Bioinformatics</source><pubdate>2001</pubdate><volume>17</volume><fpage>359</fpage><lpage>63</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/17.4.359</pubid><pubid idtype="pmpid" link="fulltext">11301305</pubid></pubidlist></xrefbib></bibl><bibl id="B26"><title><p>Dragon Plant Biology Explorer. A text-mining tool for integrating associations between genetic and biochemical entities with genome annotation and biochemical terms lists</p></title><aug><au><snm>Bajic</snm><fnm>VB</fnm></au><au><snm>Veronika</snm><fnm>M</fnm></au><au><snm>Veladandi</snm><fnm>PS</fnm></au><au><snm>Meka</snm><fnm>A</fnm></au><au><snm>Heng</snm><fnm>MW</fnm></au><au><snm>Rajaraman</snm><fnm>K</fnm></au><au><snm>Pan</snm><fnm>H</fnm></au><au><snm>Swarup</snm><fnm>S</fnm></au></aug><source>Plant Physiol</source><pubdate>2005</pubdate><volume>138</volume><issue>4</issue><fpage>1914</fpage><lpage>25</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1104/pp.105.060863</pubid><pubid idtype="pmcid">1183383</pubid><pubid idtype="pmpid" link="fulltext">16172098</pubid></pubidlist></xrefbib></bibl><bibl id="B27"><title><p>Computer-assisted generation of a protein-interaction database for nuclear receptors</p></title><aug><au><snm>Albert</snm><fnm>S</fnm></au><au><snm>Gaudan</snm><fnm>S</fnm></au><au><snm>Knigge</snm><fnm>H</fnm></au><au><snm>Raetsch</snm><fnm>A</fnm></au><au><snm>Delgado</snm><fnm>A</fnm></au><au><snm>Huhse</snm><fnm>B</fnm></au><au><snm>Kirsch</snm><fnm>H</fnm></au><au><snm>Albers</snm><fnm>M</fnm></au><au><snm>Rebholz-Schuhmann</snm><fnm>D</fnm></au><au><snm>Koegl</snm><fnm>M</fnm></au></aug><source>Molecular Endocrinology</source><pubdate>2003</pubdate><volume>17</volume><issue>8</issue><fpage>1555</fpage><lpage>1567</lpage><xrefbib><pubidlist><pubid 
idtype="doi">10.1210/me.2002-0424</pubid><pubid idtype="pmpid" link="fulltext">12738764</pubid></pubidlist></xrefbib></bibl><bibl id="B28"><title><p>PDQ Wizard: automated prioritization and characterization of gene and protein lists using biomedical literature</p></title><aug><au><snm>Grimes</snm><fnm>GR</fnm></au><au><snm>Wen</snm><fnm>TQ</fnm></au><au><snm>Mewissen</snm><fnm>M</fnm></au><au><snm>Baxter</snm><fnm>RM</fnm></au><au><snm>Moodie</snm><fnm>S</fnm></au><au><snm>Beattie</snm><fnm>JS</fnm></au><au><snm>Ghazal</snm><fnm>P</fnm></au></aug><source>Bioinformatics</source><pubdate>2006</pubdate><volume>22</volume><issue>16</issue><fpage>2055</fpage><lpage>2057</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btl342</pubid><pubid idtype="pmpid" link="fulltext">16809392</pubid></pubidlist></xrefbib></bibl><bibl id="B29"><title><p>Concept-based annotation of enzyme classes</p></title><aug><au><snm>Hofmann</snm><fnm>O</fnm></au><au><snm>Schomburg</snm><fnm>D</fnm></au></aug><source>Bioinformatics</source><pubdate>2005</pubdate><volume>21</volume><fpage>2059</fpage><lpage>2066</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bti284</pubid><pubid idtype="pmpid" link="fulltext">15661799</pubid></pubidlist></xrefbib></bibl><bibl id="B30"><title><p>A gene network for navigating the literature</p></title><aug><au><snm>Hoffmann</snm><fnm>R</fnm></au><au><snm>Valencia</snm><fnm>A</fnm></au></aug><source>Nature Genetics</source><pubdate>2004</pubdate><volume>36</volume><fpage>664</fpage><note>The iHOP system is on-line at [<url>http://www.ihop-net.org/</url>]</note><xrefbib><pubidlist><pubid idtype="doi">10.1038/ng0704-664</pubid><pubid idtype="pmpid" link="fulltext">15226743</pubid></pubidlist></xrefbib></bibl><bibl id="B31"><title><p>Shared relationship analysis: ranking set cohesion and commonalities within a literature-derived relationship 
network</p></title><aug><au><snm>Wren</snm><fnm>JD</fnm></au><au><snm>Garner</snm><fnm>HR</fnm></au></aug><source>Bioinformatics</source><pubdate>2004</pubdate><volume>20</volume><fpage>191</fpage><lpage>198</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btg390</pubid><pubid idtype="pmpid" link="fulltext">14734310</pubid></pubidlist></xrefbib></bibl><bibl id="B32"><title><p>Extracting protein-protein interactions from MEDLINE using the Hidden Vector State model</p></title><aug><au><snm>Zhou</snm><fnm>D</fnm></au><au><snm>He</snm><fnm>Y</fnm></au><au><snm>Kwoh</snm><fnm>CK</fnm></au></aug><source>Int J Bioinform Res Appl</source><pubdate>2008</pubdate><volume>4</volume><fpage>64</fpage><lpage>80</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1504/IJBRA.2008.017164</pubid><pubid idtype="pmpid" link="fulltext">18283029</pubid></pubidlist></xrefbib></bibl><bibl id="B33"><title><p>Definition</p></title><aug><au><cnm>Empirical</cnm></au></aug><publisher>Dictionary.com Unabridged, based on the Random House Dictionary, Random House, Inc</publisher><url>Http://dictionary.reference.com/browse/empirical</url><note>(downloaded 5/27/09).</note></bibl><bibl id="B34"><aug><au><snm>Manning</snm><fnm>CD</fnm></au><au><snm>Raghavan</snm><fnm>R</fnm></au><au><snm>Sch&#252;tze</snm><fnm>H</fnm></au></aug><source>Introduction to Information Retrieval</source><publisher>Cambridge University Press</publisher><pubdate>2008</pubdate></bibl><bibl id="B35"><aug><au><snm>Davis</snm><fnm>E</fnm></au></aug><source>Representations of Commonsense Knowledge</source><publisher>Morgan Kaufmann</publisher><pubdate>1990</pubdate><url>http://www.cs.nyu.edu/faculty/davise/ai/independentEvidence.pdf</url></bibl><bibl id="B36"><title><p>Creating, modeling, and visualizing metabolic 
networks</p></title><aug><au><snm>Dickerson</snm><fnm>JA</fnm></au><au><snm>Berleant</snm><fnm>D</fnm></au><au><snm>Du</snm><fnm>P</fnm></au><au><snm>Ding</snm><fnm>J</fnm></au><au><snm>Foster</snm><fnm>CM</fnm></au><au><snm>Li</snm><fnm>L</fnm></au><au><snm>Wurtele</snm><fnm>ES</fnm></au></aug><source>Medical Informatics: Knowledge Management and Data Mining in Biomedicine</source><publisher>Springer</publisher><editor>Chen H, Fuller SS, Friedman C, Hersh W</editor><pubdate>2005</pubdate><volume>chapter 17</volume><fpage>491</fpage><lpage>518</lpage></bibl><bibl id="B37"><title><p>Combining evidence: the Na&#239;ve Bayes model vs. semi-na&#239;ve evidence combination, Technical Report SARD04-11</p></title><aug><au><snm>Berleant</snm><fnm>D</fnm></au></aug><pubdate>2004 </pubdate><url>http://ifsc.ualr.edu/jdberleant/papers/seminaivemodel.pdf</url><xrefbib><pubid idtype="pmpid" link="fulltext">15509599</pubid></xrefbib></bibl><bibl id="B38"><title><p>Effects of dichloroacetate on the metabolism of glucose, pyruvate, acetate, 3-hydroxybutyrate and palmitate in rat diaphragm and heart muscle in vitro and on extraction of glucose, lactate, pyruvate and free fatty acids by dog heart in vivo</p></title><aug><au><snm>McAllister</snm><fnm>A</fnm></au><au><snm>Allison</snm><fnm>SP</fnm></au><au><snm>Randle</snm><fnm>PJ</fnm></au></aug><source>Biochem J</source><pubdate>1973</pubdate><volume>134</volume><issue>4</issue><fpage>1067</fpage><lpage>1081</lpage><xrefbib><pubidlist><pubid idtype="pmcid">1177916</pubid><pubid idtype="pmpid">4762752</pubid></pubidlist></xrefbib></bibl><bibl id="B39"><title><p>Extracting biochemical interactions from MEDLINE using a link grammar parser</p></title><aug><au><snm>Ding</snm><fnm>J</fnm></au><au><snm>Berleant</snm><fnm>D</fnm></au><au><snm>Xu</snm><fnm>J</fnm></au><au><snm>Fulmer</snm><fnm>AW</fnm></au></aug><source>Proceedings of the Fifteenth IEEE Conference on Tools with Artificial Intelligence (ICTAI 2003), Nov. 
3&#8211;5, Sacramento, </source><fpage>467</fpage><lpage>471</lpage><url>http://ifsc.ualr.edu/jdberleant/papers/LGPmanuscript8-8-03a.pdf</url></bibl><bibl id="B40"><title><p>GTP-dependent twisting of dynamin implicates constriction and tension in membrane fission</p></title><aug><au><snm>Roux</snm><fnm>A</fnm></au><au><snm>Uyhazi</snm><fnm>K</fnm></au><au><snm>Frost</snm><fnm>A</fnm></au><au><snm>De Camilli</snm><fnm>P</fnm></au></aug><source>Nature</source><pubdate>2006</pubdate><volume>441</volume><fpage>528</fpage><lpage>531</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nature04718</pubid><pubid idtype="pmpid" link="fulltext">16648839</pubid></pubidlist></xrefbib></bibl></refgrp>
   </bm>
</art>
