<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-8-24</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>Benchmarking natural-language parsers for biological applications using dependency graphs</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Clegg</snm>
               <mi>B</mi>
               <fnm>Andrew</fnm>
               <insr iid="I1"/>
               <email>a.clegg@mail.cryst.bbk.ac.uk</email>
            </au>
            <au id="A2">
               <snm>Shepherd</snm>
               <mi>J</mi>
               <fnm>Adrian</fnm>
               <insr iid="I1"/>
               <email>a.shepherd@mail.cryst.bbk.ac.uk</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>School of Crystallography, Birkbeck, University of London, Malet Street, London WC1E 7HX, UK</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2007</pubdate>
         <volume>8</volume>
         <issue>1</issue>
         <fpage>24</fpage>
         <url>http://www.biomedcentral.com/1471-2105/8/24</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17254351</pubid>
               <pubid idtype="doi">10.1186/1471-2105-8-24</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>28</day>
               <month>9</month>
               <year>2006</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>25</day>
               <month>1</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>25</day>
               <month>1</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>Clegg and Shepherd; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Interest is growing in the application of syntactic parsers to natural language processing problems in biology, but assessing their performance is difficult because differences in linguistic convention can falsely appear to be errors. We present a method for evaluating their accuracy using an intermediate representation based on dependency graphs, in which the semantic relationships important in most information extraction tasks are closer to the surface. We also demonstrate how this method can be easily tailored to various application-driven criteria.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Using the GENIA corpus as a gold standard, we tested four open-source parsers which have been used in bioinformatics projects. We first present overall performance measures, and test the two leading tools, the Charniak-Lease and Bikel parsers, on subtasks tailored to reflect the requirements of a system for extracting gene expression relationships. These two tools clearly outperform the other parsers in the evaluation, and achieve accuracy levels comparable to or exceeding native dependency parsers on similar tasks in previous biological evaluations.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Evaluating using dependency graphs allows parsers to be tested easily on criteria chosen according to the semantics of particular biological applications, drawing attention to important mistakes and soaking up many insignificant differences that would otherwise be reported as errors. Generating high-accuracy dependency graphs from the output of phrase-structure parsers also provides access to the more detailed syntax trees that are used in several natural-language processing techniques.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>In the last few years, natural language processing (NLP) has become a rapidly expanding field within bioinformatics, as the literature continues to grow exponentially <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, beyond what human researchers can keep track of without computer assistance. NLP methods have been used successfully to extract various classes of data from biological texts, including protein-protein interactions <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>, protein function assignments <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>, regulatory networks <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> and gene-disease relationships <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>.</p>
         <p>Although much headway has been made using text processing methods based on linear pattern matching (e.g. regular expressions), the diversity and complexity of natural language have caused many researchers to integrate more sophisticated parsing methods into their biological NLP pipelines <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>. This enables NLP systems to take into account the grammatical content of each sentence, including deeply nested structures, and dependencies between widely separated words or phrases that are hard to capture with superficial patterns.</p>
         <p>General-purpose full-sentence parsers fall into two broad categories depending on the formalisms they use to model language and the corresponding outputs they produce. Constituent parsers (or treebank parsers) recursively break the input text down into clauses and phrases, and produce a tree structure where the root represents the sentence as a whole and the leaves represent words (see Figure <figr fid="F1">1</figr>). Dependency parsers model language as a set of relationships between words, and do not make widespread use of concepts like 'phrase' or 'clause'. Instead they produce a graph for each sentence, where each node represents a word, and each arc a grammatical dependency such as that which holds between a verb and its subject (see Figure <figr fid="F2">2</figr>).</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>A constituent (phrase structure) tree</p>
            </caption>
            <text>
               <p><b>A constituent (phrase structure) tree</b>. The phrase structure of the sentence "Two homologues of the rhombotin gene have now been isolated" from the GENIA treebank. The definitions of the linguistic labels used in this and all other diagrams are given in the List of Abbreviations section.</p>
            </text>
            <graphic file="1471-2105-8-24-1"/>
         </fig>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>A dependency graph</p>
            </caption>
            <text>
               <p><b>A dependency graph</b>. The dependency graph of the sentence in Figure 1.</p>
            </text>
            <graphic file="1471-2105-8-24-2"/>
         </fig>
         <p>While constituent parsers are closer to the theoretical models of language employed in mainstream linguistics, dependency parsers are popular in applied NLP circles because the grammatical relationships that they specify resemble the semantic relationships, encoding logical predicates, to which an NLP developer would like to be able to reduce a sentence. However, there is no such thing as a standard grammar for dependency parsers. Each parser uses a different set of dependency types and a different set of attachment rules, meaning that there is often disagreement between dependency parsers regarding graph topology and arc labels <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>. As a result, evaluating dependency parsers, and comparing the results of one to another, can be difficult.</p>
         <p>For constituent parsers, on the other hand, there is a <it>de facto </it>standard to follow, due to the impact on computational linguistics of the Penn Treebank (PTB) <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, a vast collection of hand-annotated constituent trees for many thousands of sentences drawn mostly from news reports. This means that there are several high-performance parsers available, trained on the PTB, which produce a pre-defined set of clause, phrase and word category (part-of-speech or POS) labels. There are also standardised evaluation measures by which these parsers are benchmarked against a set-aside portion of the original treebank. The most frequently published scores for parser performance use precision and recall measures based on the presence or absence of constituents in each parser's output, compared to the gold standard. These are sometimes referred to as GEIG or PARSEVAL measures, and although their limitations are well known <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp> &#8211; for example, they have problems distinguishing between genuine errors which would affect the output of NLP applications, and minor differences of convention (see Figures <figr fid="F3">3</figr> and <figr fid="F4">4</figr>) which would not &#8211; they are still in wide use.</p>
         <fig id="F3">
            <title>
               <p>Figure 3</p>
            </title>
            <caption>
               <p>Adverbial attachment conventions</p>
            </caption>
            <text>
               <p><b>Adverbial attachment conventions</b>. In (a), the lexicalised version of the Stanford parser attaches the adverb (RB) "constitutively" via an adverbial phrase (ADVP) to its parent verb phrase (VP). In (b), the unlexicalised Stanford parser skips this step and attaches the adverb directly to the verb phrase. These two representations are semantically equivalent.</p>
            </text>
            <graphic file="1471-2105-8-24-3"/>
         </fig>
         <fig id="F4">
            <title>
               <p>Figure 4</p>
            </title>
            <caption>
               <p>Co-ordinating conjunction conventions</p>
            </caption>
            <text>
               <p><b>Co-ordinating conjunction conventions</b>. Two alternative ways of joining two nouns with a conjunction ("and") &#8211; the GENIA corpus uses convention (a), while all of the parsers tested use (b). The additional level of noun phrase (NP) constituents, however, makes no difference to the meaning.</p>
            </text>
            <graphic file="1471-2105-8-24-4"/>
         </fig>
         <p>The impact of the PTB is also such that both the major linguistic annotation projects for molecular biology corpora <abbrgrp><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr></abbrgrp> employ largely PTB-like conventions, although the amount of annotated biological text is currently at least an order of magnitude less than that which is available in the general-English domain. Although the quantities available are insufficient for retraining parsers, evaluation of the performance of parsers for bioinformatics applications is possible given a meaningful evaluation technique.</p>
         <p>Although a dependency graph for a sentence will not, typically, contain as much information as a constituent tree for the same sentence, it is possible to transform the tree structure into a dependency graph by employing a set of deterministic mapping functions <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. The mapping procedure often results in the elimination of redundant information found in the tree structure, and thus tends to level out many of the insignificant differences in convention between alternative constituent parses (see Figures <figr fid="F5">5</figr> and <figr fid="F6">6</figr>).</p>
         <fig id="F5">
            <title>
               <p>Figure 5</p>
            </title>
            <caption>
               <p>Adverbial attachment using dependencies</p>
            </caption>
            <text>
               <p><b>Adverbial attachment using dependencies</b>. Either of the representations in Figure 3 results in this graph.</p>
            </text>
            <graphic file="1471-2105-8-24-5"/>
         </fig>
         <fig id="F6">
            <title>
               <p>Figure 6</p>
            </title>
            <caption>
               <p>Co-ordinating conjunctions using dependencies</p>
            </caption>
            <text>
               <p><b>Co-ordinating conjunctions using dependencies</b>. Either of the representations in Figure 4 results in this graph fragment.</p>
            </text>
            <graphic file="1471-2105-8-24-6"/>
         </fig>
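         <p>The levelling effect of the tree-to-graph mapping can be illustrated with a minimal sketch. The Python code below is not the Stanford toolkit's algorithm: the head rules in HEAD_CHILD, the tuple-based tree encoding and the function names are hypothetical, and the verb "expressed" is an assumed neighbour for the adverb shown in Figure 3. It only shows how two bracketing conventions for an adverb can collapse to the same head-dependent pair once each phrase is replaced by its head word.</p>

```python
# Toy constituent-to-dependency conversion (illustrative only; this is
# NOT the Stanford toolkit's algorithm or rule set).
# A tree is (label, word) for a leaf, or (label, [children]) otherwise.

# Hypothetical head rules: for each phrase label, the child labels that
# may act as its head, in priority order.
HEAD_CHILD = {
    "VP": ["VBN", "VBD", "VP"],
    "ADVP": ["RB"],
    "NP": ["NN", "NNS", "NP"],
}


def head_word(tree):
    """Return the word that heads a (sub)tree under the toy rules."""
    label, body = tree
    if isinstance(body, str):
        return body
    for wanted in HEAD_CHILD.get(label, []):
        for child in body:
            if child[0] == wanted:
                return head_word(child)
    return head_word(body[-1])  # fallback: rightmost child


def dependencies(tree, deps=None):
    """Collect untyped (head, dependent) pairs from a tree.

    Words are compared as strings, so this toy version assumes that no
    word occurs twice in the sentence; a real converter would use token
    positions instead.
    """
    if deps is None:
        deps = set()
    label, body = tree
    if isinstance(body, str):
        return deps
    head = head_word(tree)
    for child in body:
        child_head = head_word(child)
        if child_head != head:
            deps.add((head, child_head))
        dependencies(child, deps)
    return deps


# Convention (a): the adverb is wrapped in an adverbial phrase.
tree_a = ("VP", [("ADVP", [("RB", "constitutively")]), ("VBN", "expressed")])
# Convention (b): the adverb attaches directly to the verb phrase.
tree_b = ("VP", [("RB", "constitutively"), ("VBN", "expressed")])

print(dependencies(tree_a) == dependencies(tree_b))
```

         <p>Both trees yield the single pair linking "expressed" to "constitutively", so a comparison made at the dependency level would not count the extra ADVP node as an error.</p>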
         <p>This process therefore provides a convenient way to evaluate constituent parsers on those aspects of their output that most affect meaning, as well as forming a useful intermediate representation between phrase structure and logical predicates. Furthermore, given such a framework, it becomes easy to define application-specific evaluation criteria reflecting the requirements that will be placed upon a parser in a biological NLP scenario. Using this approach, we have evaluated several leading open-source parsers on general syntactic accuracy, as well as their ability to extract dependencies important to correct interpretation of a corpus of texts relating to biomolecular interactions in humans. The parsers are scored on their ability to correctly generate the grammatical dependencies in each sentence, by comparing the corresponding dependency graphs from their output and from the constituent structure of the original treebank. The results are presented below.</p>
      </sec>
      <sec>
         <st>
            <p>Results and Discussion</p>
         </st>
         <p>The software packages used in our evaluation are the Bikel parser <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, the Collins parser <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, the Stanford parser <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr></abbrgrp> and the Charniak parser <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> &#8211; including a modified version known herein as the Charniak-Lease parser <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. All of these are widely used by the computational linguistics community, and have been employed to parse molecular biology data (see Related Work section), despite having been developed and trained on sentences from the Penn Treebank. While it may be the case that, over the coming years, enough consistently-annotated biological treebank data becomes available to make retraining parsers on biological text a feasible proposition, this is by no means true yet. Furthermore, when choosing which parser to retrain with such data as and when it becomes available, one would wish to pick one which had already demonstrated good cross-domain portability, since the biomedical domain in fact encompasses multiple subdomains with distinct sublanguages <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>.</p>
         <p>We tested at least two versions of each parser as it is by no means certain <it>a priori </it>that the best-performing version on the PTB will likewise perform best on biological text. Our gold standard corpus was 1757 sentences from the GENIA treebank <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>, which were mapped from their original tree structures to dependency graphs by the same deterministic algorithm from the Stanford toolkit that we used to convert the output of each parser <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. See the Methods section for more details of our parsing pipeline.</p>
         <sec>
            <st>
               <p>Overall parse accuracy</p>
            </st>
            <p>For each parser, we calculated two scores, constituent effectiveness (F<sub><it>const</it></sub>) and dependency effectiveness (F<sub><it>dep</it></sub>), against the original constituent trees in the treebank and their dependency graph equivalents, respectively (see Tables <tblr tid="T1">1</tblr> and <tblr tid="T2">2</tblr>). These scores are measures of tree or graph similarity between the parser output and the gold standard corpus, penalising false negatives and false positives &#8211; see the Methods section for the formulae used to calculate them. When comparing the parsers' output in terms of dependency graphs rather than raw trees &#8211; that is, using F<sub><it>dep </it></sub>rather than F<sub><it>const </it></sub>&#8211; there is a much less gradual spread, with the three front-runners being clearly separated from the rest.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>F<sub><it>const </it></sub>score, all sentences</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="left">
                        <p>Parser</p>
                     </c>
                     <c ca="center">
                        <p>F<sub><it>const</it></sub></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Charniak-Lease</p>
                     </c>
                     <c ca="center">
                        <p>80.2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Bikel (0.9.8)</p>
                     </c>
                     <c ca="center">
                        <p>79.4</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Bikel (0.9.9c)</p>
                     </c>
                     <c ca="center">
                        <p>79.4</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Charniak (Aug 05)</p>
                     </c>
                     <c ca="center">
                        <p>78.1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Collins (model 2)</p>
                     </c>
                     <c ca="center">
                        <p>77.8</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Collins (model 3)</p>
                     </c>
                     <c ca="center">
                        <p>77.2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Collins (model 1)</p>
                     </c>
                     <c ca="center">
                        <p>76.4</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Stanford (unlexicalised)</p>
                     </c>
                     <c ca="center">
                        <p>72.3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Stanford (lexicalised)</p>
                     </c>
                     <c ca="center">
                        <p>71.1</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Parser effectiveness based on comparison of constituent trees to the GENIA treebank, summed over entire corpus.</p>
               </tblfn>
            </tbl>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>F<sub><it>dep </it></sub>score, all sentences</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="left">
                        <p>Parser</p>
                     </c>
                     <c ca="center">
                        <p>F<sub><it>dep</it></sub></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Charniak-Lease</p>
                     </c>
                     <c ca="center">
                        <p>77.0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Bikel (0.9.8)</p>
                     </c>
                     <c ca="center">
                        <p>77.0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Bikel (0.9.9c)</p>
                     </c>
                     <c ca="center">
                        <p>77.0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Stanford (lexicalised)</p>
                     </c>
                     <c ca="center">
                        <p>70.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Charniak (Aug 05)</p>
                     </c>
                     <c ca="center">
                        <p>68.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Stanford (unlexicalised)</p>
                     </c>
                     <c ca="center">
                        <p>68.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Collins (model 2)</p>
                     </c>
                     <c ca="center">
                        <p>68.0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Collins (model 1)</p>
                     </c>
                     <c ca="center">
                        <p>68.0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Collins (model 3)</p>
                     </c>
                     <c ca="center">
                        <p>67.0</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Parser effectiveness based on comparison of dependency graphs to the graphs generated from the GENIA treebank, summed over the entire corpus.</p>
               </tblfn>
            </tbl>
            <p>Note that the F<sub><it>dep </it></sub>scores given in Table <tblr tid="T2">2</tblr> use the strictest criterion for a match between a dependency in the parse and the corresponding dependency in the gold standard. A match is only recorded if an arc with the same start node, end node and label (dependency type) exists. This is important as the type of a dependency can be crucial for correct interpretation, discriminating for example between the subject and direct object of a verb. However, many assessments of dependency parsers use a weaker matching criterion which disregards the dependency type, and thus only takes into account the topology of the graph and not the arc labels. For comparison purposes, the mean scores using this weaker untyped criterion are given in Table <tblr tid="T3">3</tblr> (see also the Related Work section). Note that the rank order of the parsers is the same when using the less strict matching criterion, apart from some slippage by the lexicalised version of the Stanford parser, suggesting that this parser's scores on the strict test are boosted by comparatively good dependency type identification. All scores in this paper use the strict matching criterion unless otherwise specified.</p>
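            <p>The difference between the strict and loose matching criteria can be sketched in a few lines of Python. This is an illustration only: it assumes the usual balanced F-measure (the harmonic mean of precision and recall), which may differ in detail from the formulae given in the Methods section, and the function name, the triple encoding of arcs and the arc labels in the example are hypothetical.</p>

```python
def f_score(gold, parsed, typed=True):
    """Precision, recall and balanced F-measure over dependency arcs.

    Each arc is a (head, dependent, label) triple. With typed=False the
    label is dropped, so only the graph topology is compared.
    """
    if not typed:
        gold = {(head, dep) for head, dep, _ in gold}
        parsed = {(head, dep) for head, dep, _ in parsed}
    matched = len(gold.intersection(parsed))
    if matched == 0:
        return 0.0, 0.0, 0.0
    precision = matched / len(parsed)
    recall = matched / len(gold)
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical arcs for part of the sentence in Figure 1; the parser
# output gets every attachment right but mislabels the subject arc.
gold = {
    ("isolated", "homologues", "nsubjpass"),
    ("isolated", "now", "advmod"),
    ("homologues", "Two", "num"),
}
parsed = {
    ("isolated", "homologues", "nsubj"),
    ("isolated", "now", "advmod"),
    ("homologues", "Two", "num"),
}

strict = f_score(gold, parsed, typed=True)
loose = f_score(gold, parsed, typed=False)
print(round(strict[2], 3), round(loose[2], 3))
```

            <p>In this example the mislabelled subject arc is penalised under the strict criterion but passes unnoticed under the loose one, which is why a parser with good attachment but weaker label identification scores relatively better on the untyped measure.</p>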
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>F<sub><it>dep </it></sub>score, all sentences, loose matching criterion</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="left">
                        <p>Parser</p>
                     </c>
                     <c ca="center">
                        <p>F<sub><it>dep</it></sub></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Charniak-Lease</p>
                     </c>
                     <c ca="center">
                        <p>81.0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Bikel (0.9.8)</p>
                     </c>
                     <c ca="center">
                        <p>81.0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Bikel (0.9.9c)</p>
                     </c>
                     <c ca="center">
                        <p>81.0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Charniak (Aug 05)</p>
                     </c>
                     <c ca="center">
                        <p>78.0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Stanford (unlexicalised)</p>
                     </c>
                     <c ca="center">
                        <p>74.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Collins (model 2)</p>
                     </c>
                     <c ca="center">
                        <p>72.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Collins (model 1)</p>
                     </c>
                     <c ca="center">
                        <p>72.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Stanford (lexicalised)</p>
                     </c>
                     <c ca="center">
                        <p>72.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Collins (model 3)</p>
                     </c>
                     <c ca="center">
                        <p>71.5</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Parser effectiveness based on comparison of dependency graphs to the graphs generated from the GENIA treebank, summed over the entire corpus, disregarding dependency types.</p>
               </tblfn>
            </tbl>
            <p>The overall effectiveness scores for some of the parsers are distorted, however, by the fact that they encountered sentences which could not be parsed at all (Table <tblr tid="T4">4</tblr>). It is useful to separate out the effects on the mean scores of complete parse failures as opposed to individual errors in successfully-parsed sentences. The F<sub><it>dep </it></sub>scores in Table <tblr tid="T5">5</tblr> show the mean effectiveness for each parser averaged <it>only </it>over those sentences which resulted in a successful parse. The release notes for version 0.9.9c of the Bikel parser state that a new robustness feature means that it "should <it>always </it>produce <it>some </it>kind of a parse for every input sentence" (original emphasis) <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>, but this does not appear to be true for biological texts. It is nevertheless an improvement over version 0.9.9 (not featured in this investigation), which we found to suffer from 440 failures (25% of the corpus) on the GENIA treebank <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. The parse failures for all of the parsers tended to occur in longer, more complex sentences.</p>
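            <p>The effect of complete parse failures on an averaged score can be seen in a small sketch. It assumes, purely for illustration, that a score is available for each sentence and that a failed sentence contributes zero to the overall mean; the function name and the example values are hypothetical and are not taken from the evaluation.</p>

```python
def mean_scores(per_sentence):
    """Return (mean over all sentences, mean over parsed sentences only).

    per_sentence is a list of per-sentence scores in which None marks a
    sentence that the parser failed to parse at all.
    """
    if not per_sentence:
        return 0.0, 0.0
    parsed = [score for score in per_sentence if score is not None]
    # Assumption: a complete failure counts as a score of zero overall.
    overall = sum(parsed) / len(per_sentence)
    parsed_only = sum(parsed) / len(parsed) if parsed else 0.0
    return overall, parsed_only


# Four hypothetical sentences, one of which could not be parsed.
overall, parsed_only = mean_scores([0.75, None, 0.5, 1.0])
print(overall, parsed_only)
```

            <p>With one failure in four sentences the overall mean drops to 0.5625 while the mean over parsed sentences stays at 0.75, so reporting both figures separates robustness from per-sentence accuracy.</p>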
            <tbl id="T4">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Parse failures</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="left">
                        <p>Parser</p>
                     </c>
                     <c ca="center">
                        <p>Failures</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Charniak-Lease</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Charniak (Aug 05)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Stanford (unlexicalised)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Stanford (lexicalised)</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Bikel (0.9.8)</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Bikel (0.9.9c)</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Collins (model 1)</p>
                     </c>
                     <c ca="center">
                        <p>12</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Collins (model 2)</p>
                     </c>
                     <c ca="center">
                        <p>25</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Collins (model 3)</p>
                     </c>
                     <c ca="center">
                        <p>40</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Number of sentences which completely failed to parse, out of a total of 1757 in the whole corpus.</p>
               </tblfn>
            </tbl>
            <tbl id="T5">
               <title>
                  <p>Table 5</p>
               </title>
               <caption>
                  <p>F<sub><it>dep </it></sub>score, successfully parsed sentences only</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="left">
                        <p>Parser</p>
                     </c>
                     <c ca="center">
                        <p>F<sub><it>dep</it></sub></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Charniak-Lease</p>
                     </c>
                     <c ca="center">
                        <p>77.0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Bikel (0.9.8)</p>
                     </c>
                     <c ca="center">
                        <p>77.0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Bikel (0.9.9c)</p>
                     </c>
                     <c ca="center">
                        <p>77.0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Collins (model 3)</p>
                     </c>
                     <c ca="center">
                        <p>71.0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Collins (model 2)</p>
                     </c>
                     <c ca="center">
                        <p>70.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Stanford (lexicalised)</p>
                     </c>
                     <c ca="center">
                        <p>70.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Collins (model 1)</p>
                     </c>
                     <c ca="center">
                        <p>69.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Charniak (Aug 05)</p>
                     </c>
                     <c ca="center">
                        <p>68.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Stanford (unlexicalised)</p>
                     </c>
                     <c ca="center">
                        <p>68.5</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Parser effectiveness based on comparison of dependency graphs to the graphs generated from the GENIA treebank, summed over the sentences for which each parser returned a successful parse.</p>
               </tblfn>
            </tbl>
            <p>The highest-scoring parsers overall, the Charniak-Lease parser and the Bikel parser, achieve very similar scores. Therefore, we decided to subject these two parsers to a series of tests designed to determine where the strengths and weaknesses of each lay when assessed on tasks important to biological language processing applications. We used the older version of the Bikel parser (0.9.8) as it suffered only one failure, as opposed to two by version 0.9.9c.</p>
         </sec>
         <sec>
            <st>
               <p>Prepositional phrase attachment</p>
            </st>
            <p>One problem that is frequently cited as hard for parsers is the correct attachment of prepositional phrases &#8211; modifiers attached to nouns or verbs that convey additional information regarding time, duration, location, manner, cause and so on. It is important to attach such modifiers correctly, as errors can alter the meaning of a sentence considerably. For example, consider the phrase "Induction of NF-KB during monocyte differentiation by HIV type 1 infection." Is it the induction (correct) or the differentiation (incorrect) which is caused by the infection? Furthermore, the targets of many biological interactions are expressed in prepositional phrases, e.g. "<it>X </it>binds <b>to <it>Y</it></b>" &#8211; the bold section is a prepositional phrase. However, this problem is non-trivial because correct attachment relies on the use of background knowledge (for humans), or an approximation of background knowledge based on frequencies of particular words in particular positions in the training corpus (for parsers). These frequencies are often sparse, and for previously unseen words (e.g. many of the technical terms in biology) they will be missing altogether.</p>
            <p>To assess the potential impact of this phenomenon, we tested the two best parsers on their ability to correctly generate dependencies between prepositions and both the head words of the phrases they modify and the head words of the modifying phrases, by calculating F<sub><it>dep </it></sub>scores over just these arcs. (We did not penalise the Bikel parser for missing dependencies in the one sentence it failed to parse at all, in any of these tasks.) For example, in the phrase "inducing NF-KB expression in the nuclei," the modifying phrase of the preposition "in" is "the nuclei" &#8211; "nuclei" being the head of this phrase &#8211; and the modified word is "inducing". The results are given in Table <tblr tid="T6">6</tblr>. Surprisingly, both parsers scored slightly higher on the harder portion of this task (attaching prepositions to the appropriate modified words) than they did across all dependency types, where both achieved an F<sub><it>dep </it></sub>of 77.0 as shown in Table <tblr tid="T5">5</tblr>. On the easier portion of this task (attaching prepositions to the appropriate modifying words), both scored considerably higher. This ran contrary to our expectations, and indicates that the conventional 'folk wisdom' that prepositional phrase attachment is a particularly hard task is not necessarily true within the constrained environment of biological texts.</p>
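            <p>The restricted scoring described above can be sketched as follows. This is a minimal illustration, not the evaluation code used in this study: it assumes that F<sub><it>dep</it></sub> is the harmonic mean of precision and recall over typed arcs (see Methods section), represents each arc as a hypothetical (head, dependent, type) triple, and uses illustrative type labels; only PREPOSITIONAL_MODIFIER is taken from the text.</p>

```python
def f_dep(gold, parsed):
    """Harmonic mean of precision and recall over two arc sets, as a percentage."""
    true_pos = len(gold.intersection(parsed))
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(parsed)
    recall = true_pos / len(gold)
    return 100.0 * 2 * precision * recall / (precision + recall)


def arcs_of_type(arcs, dep_type):
    """Keep only the arcs carrying the given dependency type."""
    return {arc for arc in arcs if arc[2] == dep_type}


# Arcs are (head, dependent, type) triples for "inducing NF-KB expression
# in the nuclei". DIRECT_OBJECT and PREPOSITIONAL_OBJECT are assumed labels.
gold = {
    ("inducing", "expression", "DIRECT_OBJECT"),
    ("inducing", "in", "PREPOSITIONAL_MODIFIER"),
    ("in", "nuclei", "PREPOSITIONAL_OBJECT"),
}
# A parse that wrongly attaches "in" to "expression" instead of "inducing".
parsed = {
    ("inducing", "expression", "DIRECT_OBJECT"),
    ("expression", "in", "PREPOSITIONAL_MODIFIER"),
    ("in", "nuclei", "PREPOSITIONAL_OBJECT"),
}

# Score the two halves of the task separately.
modified_score = f_dep(arcs_of_type(gold, "PREPOSITIONAL_MODIFIER"),
                       arcs_of_type(parsed, "PREPOSITIONAL_MODIFIER"))
modifying_score = f_dep(arcs_of_type(gold, "PREPOSITIONAL_OBJECT"),
                        arcs_of_type(parsed, "PREPOSITIONAL_OBJECT"))
```

            <p>In this toy case the misattached preposition gives a modified-word score of 0.0, while the modifying-word score remains 100.0, which shows why the two halves of the task are reported separately.</p>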
            <tbl id="T6">
               <title>
                  <p>Table 6</p>
               </title>
               <caption>
                  <p>F<sub><it>dep </it></sub>for prepositional phrase attachment</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="left">
                        <p>Parser</p>
                     </c>
                     <c ca="center">
                        <p>Modified words F<sub><it>dep</it></sub></p>
                     </c>
                     <c ca="center">
                        <p>Modifying words F<sub><it>dep</it></sub></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Charniak-Lease</p>
                     </c>
                     <c ca="center">
                        <p>79.5</p>
                     </c>
                     <c ca="center">
                        <p>89.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Bikel (0.9.8)</p>
                     </c>
                     <c ca="center">
                        <p>78.0</p>
                     </c>
                     <c ca="center">
                        <p>91.0</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Parser effectiveness for the task of attaching prepositions correctly to the words they modify and to the words which are doing the modifying.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Reconstructing co-ordinating conjunctions</p>
            </st>
            <p>Another syntactic phenomenon that is problematic for similar reasons is co-ordinating conjunction &#8211; the joining on an equal footing of two equivalent grammatical units (e.g. two noun phrases) by a conjunction such as 'and' or 'or'. Since the scope of the conjunction relies on extra-linguistic knowledge or assumptions, there are often several equally grammatical but semantically quite different readings available. An example of this is given in Figures <figr fid="F7">7</figr> and <figr fid="F8">8</figr>. The correct reading (Figure <figr fid="F7">7</figr>) refers to the cloning of GATA-1 genes from mice and humans &#8211; "mouse" and "human" are both attached directly to "genes". An alternative, grammatical, yet incorrect reading is shown in Figure <figr fid="F8">8</figr>, where "human" is attached to "genes", but "mouse" is attached directly to "cloned", implying that some human genes and a whole mouse were cloned.</p>
            <fig id="F7">
               <title>
                  <p>Figure 7</p>
               </title>
               <caption>
                  <p>Co-ordination ambiguity I</p>
               </caption>
               <text>
                  <p><b>Co-ordination ambiguity I</b>. The correct dependency graph for the sentence "We have cloned the mouse and human GATA-1 genes."</p>
               </text>
               <graphic file="1471-2105-8-24-7"/>
            </fig>
            <fig id="F8">
               <title>
                  <p>Figure 8</p>
               </title>
               <caption>
                  <p>Co-ordination ambiguity II</p>
               </caption>
               <text>
                  <p><b>Co-ordination ambiguity II</b>. An incorrect graph for the sentence in Figure 7, implying that some genes and a mouse have been cloned.</p>
               </text>
               <graphic file="1471-2105-8-24-8"/>
            </fig>
            <p>To measure the ability of the parsers to make the right choices in these situations, we recalculated the F<sub><it>dep </it></sub>score over only those subgraphs (in the parse or the gold standard) whose root words are at either end of a conjunction dependency. For example, if we were comparing the incorrect parse in Figure <figr fid="F8">8</figr> to the sentence in Figure <figr fid="F7">7</figr>, our gold standard would consist of all the dependencies from Figure <figr fid="F7">7</figr> that go to or from the words "mouse" and "human", as these are connected by the conjunction AND. Our test set would consist of all the dependencies in Figure <figr fid="F8">8</figr> that connect to any of the words "the", "mouse", "human", "GATA-1" and "genes", as the conjunction joins the words "mouse" and "genes" upon which the words "the", "human" and "GATA-1" depend. True and false positive counts, and thus precision, recall and F<sub><it>dep </it></sub>(see Methods section) can then be calculated over just these dependencies. It would not be sufficient to compare the conjunction dependency alone between the two graphs as this would not measure the extent of this initial error's consequences. In some circumstances, such as nested co-ordinations involving complex multiword phrases &#8211; e.g. "the octamer site and the Y, X1 and X2 boxes" &#8211; these consequences can be particularly far-reaching. Both parsers' scores on this task (Table <tblr tid="T7">7</tblr>) were slightly lower than their averages of 77.0 across all dependency types, but not spectacularly lower.</p>
            <tbl id="T7">
               <title>
                  <p>Table 7</p>
               </title>
               <caption>
                  <p>F<sub><it>dep </it></sub>for co-ordinating conjunctions</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="left">
                        <p>Parser</p>
                     </c>
                     <c ca="center">
                        <p>F<sub><it>dep</it></sub></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Bikel (0.9.8)</p>
                     </c>
                     <c ca="center">
                        <p>75.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Charniak-Lease</p>
                     </c>
                     <c ca="center">
                        <p>75.0</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Parser effectiveness for the task of reconstructing phrases joined by conjunctions such as 'and' and 'or'.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Detecting negation</p>
            </st>
            <p>Reliably distinguishing between positive and negative assertions and determining the scope of negation markers are perennial difficulties in NLP, and have been well studied in the medical informatics context <abbrgrp><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr></abbrgrp>. It is not uncommon in information extraction projects to skip sentences containing negation words <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, but 'not' appears in 10% of the sentences in our test corpus, and this figure does not count all the other ways of negating a statement in English. Thus a case should be made for attempting to tackle the problem in a more methodical way. In order to gain some initial insight into whether dependency parses might be of use here, we calculated the F<sub><it>dep </it></sub>score for all dependency arcs beginning or ending at any of these words: 'not', 'n't', 'no', 'none', 'negative', 'without', 'absence', 'cannot', 'fail', 'failure', 'never', 'unlikely', 'exclude', 'disprove', 'insignificant'. The results (Table <tblr tid="T8">8</tblr>) are encouraging and the use of dependency graphs in resolving negations warrants further investigation. The difference between these two parsers is much clearer in this task than in any of the others, and demonstrates that the Charniak-Lease parser may be particularly suited to tackling this problem, as it scores higher than its all-dependencies average while the Bikel parser scores considerably lower.</p>
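            <p>As a rough sketch of this filter, the following code keeps only the arcs that begin or end at one of the listed words and scores them. It is illustrative only: arcs are hypothetical (head, dependent, type) triples, the type labels are invented for the example, F<sub><it>dep</it></sub> is assumed to be the harmonic mean of precision and recall, and matching exact lower-cased tokens (rather than inflected forms) is an assumption.</p>

```python
NEGATION_WORDS = {
    "not", "n't", "no", "none", "negative", "without", "absence", "cannot",
    "fail", "failure", "never", "unlikely", "exclude", "disprove",
    "insignificant",
}


def touches_negation(arc):
    """True if the arc begins or ends at a negation word."""
    head, dependent, _dep_type = arc
    return head.lower() in NEGATION_WORDS or dependent.lower() in NEGATION_WORDS


def f_dep(gold, parsed):
    """Harmonic mean of precision and recall over two arc sets, as a percentage."""
    true_pos = len(gold.intersection(parsed))
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(parsed)
    recall = true_pos / len(gold)
    return 100.0 * 2 * precision * recall / (precision + recall)


def negation_f_dep(gold, parsed):
    """Score only the arcs that touch a negation word in either graph."""
    return f_dep({a for a in gold if touches_negation(a)},
                 {a for a in parsed if touches_negation(a)})


# Toy graphs for "X did not bind Y without Z"; type labels are assumed.
gold = {
    ("bind", "X", "NOMINAL_SUBJECT"),
    ("bind", "not", "NEGATION_MODIFIER"),
    ("bind", "without", "PREPOSITIONAL_MODIFIER"),
    ("without", "Z", "PREPOSITIONAL_OBJECT"),
}
# The parse attaches "without" to "Y" rather than to "bind".
parsed = {
    ("bind", "X", "NOMINAL_SUBJECT"),
    ("bind", "not", "NEGATION_MODIFIER"),
    ("Y", "without", "PREPOSITIONAL_MODIFIER"),
    ("without", "Z", "PREPOSITIONAL_OBJECT"),
}

score = negation_f_dep(gold, parsed)
```

            <p>Here two of the three negation-related arcs agree, so the restricted score is about 66.7, while the correctly parsed subject arc is ignored.</p>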
            <tbl id="T8">
               <title>
                  <p>Table 8</p>
               </title>
               <caption>
                  <p>F<sub><it>dep </it></sub>for negation words</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="left">
                        <p>Parser</p>
                     </c>
                     <c ca="center">
                        <p>F<sub><it>dep</it></sub></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Charniak-Lease</p>
                     </c>
                     <c ca="center">
                        <p>80.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Bikel (0.9.8)</p>
                     </c>
                     <c ca="center">
                        <p>70.5</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Parser effectiveness for the task of attaching negation words correctly.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Verb argument assignment</p>
            </st>
            <p>Although there are uncountably many ways to express most logical predicates in natural language, molecular biology texts and abstracts in particular are generally rather constrained and essentially designed for the efficient reporting of sequences of facts, observations and inferences. As a result, much of the important semantic content in this genre is encoded in the form of declarative statements, where a main verb expresses a single predicate more or less exactly, and its syntactic arguments (the subject, direct object and any indirect or prepositional objects) correspond to the entities over which the predicate holds. This being the case, it is important that the arguments of content-bearing verbs are assigned correctly. Failing to recover the subject or object of a verb will render it less useful &#8211; not completely useless, however, since we may like to know e.g. that "<it>X </it>inhibits B cell Ig secretion" even if we do not yet know what <it>X </it>is. Furthermore, most biologically-important predicates are very much directional, meaning that a confusion between subject and object at the level of syntax will lead to a disastrous reversal of the roles of agent and target at the level of semantics. Put more simply, "<it>X </it>phosphorylates <it>Y</it>" and "<it>Y </it>phosphorylates <it>X</it>" are very different statements.</p>
            <p>In order to detect any latent parsing problems that might hinder this process, we chose one of the most common biological predicate verbs in the corpus ('induce' in any of its forms) and divided the dependency types that can hold between it and its (non-prepositional) arguments into two sets: those which one would expect to find linking it to its agent, and those which one would expect to find linking it to its target. For example, in the statement "Cortivazol significantly induced GR mRNA," 'Cortivazol' is the agent and 'GR mRNA' is the target. We then calculated an F<sub><it>dep </it></sub>score for each parser over these dependencies only, counting as a match those which connect the correct two nodes and which are from the correct set, even if the exact dependency type is different. For example, if the gold standard contained a NOMINAL_SUBJECT dependency between two nodes, and the parse contained a CLAUSAL_SUBJECT dependency between the same two nodes, this would count as a match since both are in the agent dependencies set.</p>
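            <p>The relaxed matching criterion can be sketched as below. This is an illustration under stated assumptions rather than the code used here: arcs are hypothetical (head, dependent, type) triples, the two subject types are those named in the text, DIRECT_OBJECT is an assumed member of the target set, and the surface forms of 'induce' are listed by hand.</p>

```python
# Dependency types grouped by the role they signal. The two subject types
# come from the text; DIRECT_OBJECT is an assumed member of the target set.
AGENT_TYPES = {"NOMINAL_SUBJECT", "CLAUSAL_SUBJECT"}
TARGET_TYPES = {"DIRECT_OBJECT"}


def argument_roles(arcs, verb_forms):
    """Reduce (head, dependent, type) arcs headed by the verb to role triples."""
    roles = set()
    for head, dependent, dep_type in arcs:
        if head not in verb_forms:
            continue
        if dep_type in AGENT_TYPES:
            roles.add((head, dependent, "agent"))
        elif dep_type in TARGET_TYPES:
            roles.add((head, dependent, "target"))
    return roles


def compare(gold, parsed, verb_forms):
    """Return (true positives, false positives, false negatives) by role."""
    gold_roles = argument_roles(gold, verb_forms)
    parsed_roles = argument_roles(parsed, verb_forms)
    return (len(gold_roles.intersection(parsed_roles)),
            len(parsed_roles - gold_roles),
            len(gold_roles - parsed_roles))


# Hand-listed surface forms of the verb (an assumption for this sketch).
INDUCE_FORMS = {"induce", "induces", "induced", "inducing"}

gold = {
    ("induced", "Cortivazol", "NOMINAL_SUBJECT"),
    ("induced", "mRNA", "DIRECT_OBJECT"),
}
# Different subject type, same role set: still counts as a match.
relaxed = {
    ("induced", "Cortivazol", "CLAUSAL_SUBJECT"),
    ("induced", "mRNA", "DIRECT_OBJECT"),
}
# Subject and object swapped: both arguments land in the wrong set.
swapped = {
    ("induced", "mRNA", "NOMINAL_SUBJECT"),
    ("induced", "Cortivazol", "DIRECT_OBJECT"),
}

print(compare(gold, relaxed, INDUCE_FORMS))
print(compare(gold, swapped, INDUCE_FORMS))
```

            <p>The first comparison yields (2, 0, 0), since a CLAUSAL_SUBJECT arc is accepted in place of NOMINAL_SUBJECT; the second yields (0, 2, 2), the role reversal that this test is designed to expose.</p>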
            <p>The resulting F<sub><it>dep </it></sub>scores are given in Table <tblr tid="T9">9</tblr>, together with a breakdown of false negatives (recall errors): the numbers of mismatches (substitutions for dependencies from the other set), non-matches (substitutions for dependencies from neither set), and completely missing dependencies. The scores for both parsers are very high, with the Charniak-Lease parser only mis-categorising one out of 145 instances of arguments for 'induce' (putting it in the wrong category) and proposing only three other erroneous arguments for this verb in the whole corpus. These results bode well for the semantic accuracy of information extraction systems based on these principles.</p>
            <tbl id="T9">
               <title>
                  <p>Table 9</p>
               </title>
               <caption>
                  <p>Verb argument assignment for 'induce'</p>
               </caption>
               <tblbdy cols="7">
                  <r>
                     <c ca="left">
                        <p>Parser</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>F</it>
                           <sub>
                              <it>dep</it>
                           </sub>
                        </p>
                     </c>
                     <c ca="center">
                        <p>False positives</p>
                     </c>
                     <c ca="center">
                        <p>False negatives</p>
                     </c>
                     <c ca="center">
                        <p>Mismatches</p>
                     </c>
                     <c ca="center">
                        <p>Nonmatches</p>
                     </c>
                     <c ca="center">
                        <p>Missing</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Charniak-Lease</p>
                     </c>
                     <c ca="left">
                        <p>98.0</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Bikel (0.9.8)</p>
                     </c>
                     <c ca="left">
                        <p>97.0</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Parser effectiveness at assigning the arguments of the verb 'induce' into the correct category (agent or target). For each recall error (false negative), we counted the number of mismatches (substitutions for dependencies from the other set), non-matches (substitutions for dependencies from neither set), and completely missing dependencies.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Error analysis</p>
            </st>
            <p>Our previous experiences with parser evaluation have indicated the importance of correct POS tagging for accurate parsing; this is demonstrated by the difference in performance between the Charniak-Lease parser, and the other &#8211; newer &#8211; version of the Charniak parser which does not have the benefit of biomedical-domain POS tagging. To measure the consequences of POS errors, we counted the number of false negatives (recall errors) in the outputs of our two leading parsers where either one or two of the words which should have been joined by the missing dependency were incorrectly tagged. (Remember that, since the strict matching criterion is being applied here, a recall error means that a dependency <it>of a specific type </it>is missing; it will often be the case that another dependency of a different type has been substituted.)</p>
            <p>Also, in a very small minority of cases, it is possible for nodes to be present in a dependency graph from the gold standard, but actually missing from the same graph in a parser's output, or <it>vice versa</it>. This comes about since punctuation symbols are not always retained as nodes in the graph in the same way that words are. If a word is mistakenly treated as a discardable punctuation symbol, it will be omitted from the dependency graph. This can come about as a result of a POS tagging error, an error in the Stanford algorithm or a mismatch between the conventions used by a parser or the gold standard and those used by the Stanford algorithm's developers. Conversely, if a punctuation symbol is treated as a word for the same reasons, it may be present as a node in its own right in the resulting graph even if it would otherwise have been suppressed. Therefore, we also counted the number of missing dependencies in each parser's output where one or both of the nodes that the dependency should have connected were also missing. The results of both of these tests are given in Table <tblr tid="T10">10</tblr>. The results &#8211; one in five missing dependencies being associated with at least one POS error for the Charniak-Lease parser, and almost one in three for the Bikel parser &#8211; should provide all the more motivation for the development and refinement of biological POS tagging software.</p>
            <tbl id="T10">
               <title>
                  <p>Table 10</p>
               </title>
               <caption>
                  <p>Reasons for recall errors</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c ca="left">
                        <p>Parser</p>
                     </c>
                     <c ca="center">
                        <p>1 bad tag</p>
                     </c>
                     <c ca="center">
                        <p>2 bad tags</p>
                     </c>
                     <c ca="center">
                        <p>1 missing</p>
                     </c>
                     <c ca="center">
                        <p>2 missing</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Charniak-Lease</p>
                     </c>
                     <c ca="center">
                        <p>20.6%</p>
                     </c>
                     <c ca="center">
                        <p>2.0%</p>
                     </c>
                     <c ca="center">
                        <p>0.4%</p>
                     </c>
                     <c ca="center">
                        <p>0.0%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Bikel (0.9.8)</p>
                     </c>
                     <c ca="center">
                        <p>28.6%</p>
                     </c>
                     <c ca="center">
                        <p>3.4%</p>
                     </c>
                     <c ca="center">
                        <p>0.4%</p>
                     </c>
                     <c ca="center">
                        <p>0.0%</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>For each dependency from the gold standard that was not generated by each parser, we determined whether one or both of the words it should have joined were badly POS-tagged, or entirely missing from the dependency graph of the parser's output.</p>
               </tblfn>
            </tbl>
            <p>In addition, we counted the missing dependencies for each parser by type, in order to get an idea of which types were the most problematic. The results are given in Table <tblr tid="T11">11</tblr>. The same five types (out of roughly 50) account for the majority of errors in both cases (about 61% for each parser), although there is some difference in the relative proportions. One in five missing dependencies is of the generic DEPENDENT type, which the Stanford algorithm produces when it cannot match a syntactic construction in a phrase structure tree to a more specific type of dependency. The presence of large numbers of DEPENDENT arcs in the graphs of the gold standard corpus indicates that the GENIA annotators are using syntactic constructions that are unfamiliar to the Stanford algorithm. On closer inspection, we discovered that one fifth of the DEPENDENT arcs missed by each parser had been replaced by more specific dependencies joining the same words; it is impossible for us to judge by comparison to GENIA whether the types of these dependencies are truly correct or not.</p>
            <tbl id="T11">
               <title>
                  <p>Table 11</p>
               </title>
               <caption>
                  <p>Recall errors by type (top five types only)</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c cspan="2" ca="left">
                        <p>Bikel (0.9.8)</p>
                     </c>
                     <c cspan="2" ca="left">
                        <p>Charniak-Lease</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>DEPENDENT</p>
                     </c>
                     <c ca="center">
                        <p>20.8%</p>
                     </c>
                     <c ca="left">
                        <p>DEPENDENT</p>
                     </c>
                     <c ca="center">
                        <p>20.4%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PREPOSITIONAL_MODIFIER</p>
                     </c>
                     <c ca="center">
                        <p>12.4%</p>
                     </c>
                     <c ca="left">
                        <p>NOUN_COMPOUND_MODIFIER</p>
                     </c>
                     <c ca="center">
                        <p>11.7%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PUNCTUATION</p>
                     </c>
                     <c ca="center">
                        <p>11.6%</p>
                     </c>
                     <c ca="left">
                        <p>PREPOSITIONAL_MODIFIER</p>
                     </c>
                     <c ca="center">
                        <p>11.6%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>ADJECTIVAL_MODIFIER</p>
                     </c>
                     <c ca="center">
                        <p>8.2%</p>
                     </c>
                     <c ca="left">
                        <p>PUNCTUATION</p>
                     </c>
                     <c ca="center">
                        <p>10.5%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>NOUN_COMPOUND_MODIFIER</p>
                     </c>
                     <c ca="center">
                        <p>8.0%</p>
                     </c>
                     <c ca="left">
                        <p>ADJECTIVAL_MODIFIER</p>
                     </c>
                     <c ca="center">
                        <p>7.0%</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>For each parser, these are the five most common dependency types that were not correctly generated, with the proportion of all recall errors they account for.</p>
               </tblfn>
            </tbl>
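            <p>The tally behind Table <tblr tid="T11">11</tblr> can be sketched in a few lines of Python. This is a minimal illustration rather than the scripts used in the study: it assumes each dependency arc is stored as a (head, label, dependent) tuple of token positions, and the function names <it>recall_errors_by_type</it> and <it>relabelled</it>, as well as the toy arcs, are invented for the example.</p>
            <p>
```python
from collections import Counter

# Arcs are (head, label, dependent) tuples with token positions as nodes.
# Both arc sets are invented toy data, not taken from the corpus.
gold = {(2, "ADJECTIVAL_MODIFIER", 1), (3, "NOMINAL_SUBJECT", 2),
        (3, "DEPENDENT", 5), (5, "NOUN_COMPOUND_MODIFIER", 4)}
parsed = {(2, "ADJECTIVAL_MODIFIER", 1), (3, "NOMINAL_SUBJECT", 2),
          (3, "DIRECT_OBJECT", 5), (3, "NOUN_COMPOUND_MODIFIER", 4)}


def recall_errors_by_type(gold_arcs, parsed_arcs):
    """Share of all missing gold arcs accounted for by each dependency type."""
    missing = gold_arcs - parsed_arcs
    counts = Counter(label for _, label, _ in missing)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def relabelled(gold_arcs, parsed_arcs, label="DEPENDENT"):
    """Missing arcs of one type whose word pair the parser joined under another label."""
    parsed_pairs = {(h, d) for h, _, d in parsed_arcs}
    return {(h, lab, d) for h, lab, d in gold_arcs - parsed_arcs
            if lab == label and (h, d) in parsed_pairs}


print(recall_errors_by_type(gold, parsed))
print(relabelled(gold, parsed))
```
            </p>
            <p>In the toy data the DEPENDENT arc is missed only because the parser proposed a more specific label for the same pair of words, which is the situation described above for one fifth of the missed DEPENDENT arcs.</p>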
         </sec>
         <sec>
            <st>
               <p>Computational efficiency</p>
            </st>
            <p>Full syntactic parsing is a computationally demanding process, and although it is trivial to parallelise by parsing separate sentences on separate CPUs, processing speed is nevertheless an important consideration. We measured the parsing time of the 1757-sentence corpus using the GNU time utility, calculating the total processor time for each parser as the sum of the user and system times for the process. The Charniak-Lease parser took 1 h:18 m:36 s while the Bikel parser took much longer at 7 h:21 m:08 s. These times do not include pre- or post-processing scripts, or the time required to generate the dependency graphs, although these are minor compared to the actual parsing process. All processes were running on one processor of a 3 GHz SMP Linux PC.</p>
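            <p>To put the two totals on a common scale, the following short Python sketch converts the reported processor times to seconds and derives the ratio and the mean time per sentence. Only the two durations and the sentence count come from the text; the helper name <it>to_seconds</it> is introduced for the example.</p>
            <p>
```python
def to_seconds(hours, minutes, seconds):
    """Convert an h:m:s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Total processor times (user + system) as reported in the text.
charniak_lease = to_seconds(1, 18, 36)
bikel = to_seconds(7, 21, 8)
sentences = 1757

ratio = bikel / charniak_lease
print(f"Bikel / Charniak-Lease: {ratio:.1f}x")
print(f"Charniak-Lease: {charniak_lease / sentences:.1f} s per sentence")
print(f"Bikel: {bikel / sentences:.1f} s per sentence")
```
            </p>
            <p>On these figures the Bikel parser needed roughly 5.6 times as much processor time as the Charniak-Lease parser.</p>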
            <p>The difference between these two results is startling. The Bikel parser is written in Java and the Charniak parser in C++, but this in itself does not explain the difference. Analysis of the time command's output indicated that the Bikel parser had vastly greater memory requirements, and while the Charniak-Lease parser ran without needing to swap any of its data out to the hard disk, the Bikel parser made very frequent use of the swapfile. The newer version of the Bikel parser, while not quite as robust, made a time saving of over 50% compared to its predecessor, which indicates that computational speedups are possible and practical with Bikel's architecture. The other parsers in the evaluation varied hugely, ranging from slightly under an hour (for the model 1 Collins parser) to nearly 10 hours (for the lexicalised Stanford parser).</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>We have presented a method for evaluating treebank parsers based on dependency graphs that is particularly suitable for analysing their capabilities with respect to semantically-important tasks crucial to biological information extraction systems. Applying this method to various versions of four popular, open-source parsers that have been deployed in the bioinformatics domain has produced some interesting and occasionally surprising results relevant to previous and future NLP projects in this domain.</p>
         <p>In terms of overall parse accuracy, the Charniak-Lease parser &#8211; a version of the venerable Charniak parser enhanced with access to a biomedical vocabulary for POS-tagging purposes &#8211; and version 0.9.8 of the Bikel parser achieved joint highest results. Both parsers relied on good POS tagging to achieve their scores, with large proportions of the dependency recall errors being attributable to POS errors. An interesting comparison can be drawn here between the Charniak-Lease parser, for which just over 20% of the missing dependencies connect to at least one incorrectly-tagged word, and the original Charniak parser, which uses a POS-tagging component trained on newspaper English, and for which almost 60% of the recall errors relate to at least one incorrectly-tagged word.</p>
         <p>Both parsers performed well on tasks simulating the semantic requirements of a real-world NLP project based on dependency graph analysis, and achieved mostly similar scores. The reconstruction of co-ordinating conjunctions (e.g. 'and'/'or' constructs) was slightly more difficult than average for each parser, and the correct attachment of negation words (e.g. 'not' or 'without') proved problematic for the Bikel parser, although the Charniak-Lease parser was more successful on this task. Both parsers identified the arguments of the verb 'induce' almost perfectly when we relaxed the matching criterion to allow substitutions between agent-argument dependencies (e.g. NOMINAL_SUBJECT and CLAUSAL_SUBJECT) and between target-argument dependencies (e.g. DIRECT_OBJECT and INDIRECT_OBJECT).</p>
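         <p>The relaxed matching criterion can be illustrated with a small Python sketch. It is an assumption-laden illustration, not the code used in the study: only the four labels named above are placed in the two equivalence classes (the classes actually used may contain further members), arcs are assumed to be (head, label, dependent) tuples, and the names <it>role</it> and <it>relaxed_match</it> are invented for the example.</p>
         <p>
```python
# Equivalence classes for the relaxed criterion. Only the four labels named in
# the text are listed; the classes used in practice may contain more members.
AGENT = {"NOMINAL_SUBJECT", "CLAUSAL_SUBJECT"}
TARGET = {"DIRECT_OBJECT", "INDIRECT_OBJECT"}


def role(label):
    """Map a dependency label to a coarse role, or to itself if it has none."""
    if label in AGENT:
        return "AGENT"
    if label in TARGET:
        return "TARGET"
    return label


def relaxed_match(gold_arc, parsed_arc):
    """Arcs match if they join the same words and their labels share a role."""
    g_head, g_label, g_dep = gold_arc
    p_head, p_label, p_dep = parsed_arc
    return (g_head, g_dep) == (p_head, p_dep) and role(g_label) == role(p_label)


# Invented arcs: the verb is token 2, its subject token 1, its object token 3.
print(relaxed_match((2, "NOMINAL_SUBJECT", 1), (2, "CLAUSAL_SUBJECT", 1)))
print(relaxed_match((2, "DIRECT_OBJECT", 3), (2, "NOMINAL_SUBJECT", 3)))
```
         </p>
         <p>The first comparison succeeds because both labels are agent-argument dependencies; the second fails because an agent label has been proposed where a target label was expected.</p>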
         <sec>
            <st>
               <p>Practical considerations</p>
            </st>
            <p>There are two additional criteria upon which one might choose a parser for an information extraction project, all other things being equal: robustness and computational efficiency. On the former criterion, the Charniak-Lease parser is slightly more desirable, as it did not fail to parse any of the sentences in the corpus, whereas version 0.9.8 of the Bikel parser failed on one sentence. This seems to reflect an architectural difference between the two parsers; the version of the Charniak parser tested here did not suffer any failures either, and neither did two previous versions that we tested in earlier experiments, whereas the later version of the Bikel parser tested here failed twice (and was itself a bugfix release for a version that failed a staggering 440 times on our corpus). In terms of efficiency, the Charniak parser family is the clear winner, with the Charniak-Lease parser taking a fraction of the time of the Bikel parser to produce slightly better results.</p>
         </sec>
         <sec>
            <st>
               <p>Advantages of dependency graphs</p>
            </st>
            <p>Given that none of the parsers in this evaluation use dependency grammars natively, one might ask two questions. Firstly, what are the practical advantages of translating the output of treebank-style constituent parsers into dependency graphs? And secondly, how do the graphs thus generated compare to the raw output of dependency parsers on biological texts? We will address the latter question below in the Related Work section. In answer to the former question, the benefits are manifold and apply to both the evaluation process and the engineering of NLP applications.</p>
            <p>We hope that the semantic evaluation tasks presented in this paper demonstrate the ease with which application-specific benchmarks can be designed and applied with reference to dependency graphs. Granted, one could conceive of similar phrase-structure tree-based algorithms to test the positioning of, say, negation words with respect to the words they modify, but these would have to compare two subtrees and would therefore need much more coding and processing than their dependency equivalents. Indeed, since several subtrees can result in the same grammatical relation (e.g. Figures <figr fid="F3">3</figr> and <figr fid="F4">4</figr>), one would have to manually account for a degree of allowable variation.</p>
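            <p>As a rough indication of how little code such a benchmark needs, the sketch below scores the attachment of negation words on a single invented sentence. The label names NEGATION_MODIFIER and AUXILIARY, the (head, label, dependent) arc format and the function name <it>attachment_recall</it> are assumptions made for the example.</p>
            <p>
```python
def attachment_recall(gold, parsed, label):
    """Fraction of gold arcs of one type that the parser reproduced exactly."""
    wanted = {arc for arc in gold if arc[1] == label}
    if not wanted:
        return None  # nothing of this type to score
    return len(wanted.intersection(parsed)) / len(wanted)


# Invented example: "IL-2 did not induce proliferation", tokens numbered from 1.
gold = {(4, "NOMINAL_SUBJECT", 1), (4, "AUXILIARY", 2),
        (4, "NEGATION_MODIFIER", 3), (4, "DIRECT_OBJECT", 5)}
# A hypothetical parse that attaches 'not' to the auxiliary instead of the verb.
parsed = {(4, "NOMINAL_SUBJECT", 1), (4, "AUXILIARY", 2),
          (2, "NEGATION_MODIFIER", 3), (4, "DIRECT_OBJECT", 5)}

print(attachment_recall(gold, parsed, "NEGATION_MODIFIER"))  # 0.0
print(attachment_recall(gold, parsed, "DIRECT_OBJECT"))      # 1.0
```
            </p>
            <p>The test reduces to a set intersection over arcs; an equivalent test on constituent trees would first have to locate and compare the subtrees containing the negation word and the word it modifies.</p>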
            <p>Furthermore, some application-specific tests &#8211; such as the analysis of arguments for the verb 'induce' in our investigation &#8211; would be impossible using raw constituent trees. This kind of information is not explicitly represented in constituent trees, but rather is implicit (albeit buried rather deeply) in the phrase structures and the rules of English, and to test such relations from trees alone requires the design and implementation of mapping rules that would essentially result in dependency structures anyway.</p>
            <p>That said, there is more information in a constituent tree than in its dependency equivalent, and there are many algorithms that make use of the richness of trees in order to tackle such problems as pronoun resolution <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>, labelling phrases with semantic roles such as CAUSE, EXPERIENCER, RESULT or INSTRUMENT <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>, automatic document summarisation <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>, unsupervised lexicon acquisition <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>, and the assignment of functional category tags like TEMPORAL, MANNER, LOCATION or PURPOSE to phrases <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. All of these features may be of use in a fully-featured NLP system, so it is desirable to retain the original phrase-structure representation of each sentence as well as the final dependency graph. Therefore, a parsing pipeline that produces both a constituent tree and a dependency graph has an advantage over one that produces only one of these.</p>
         </sec>
         <sec>
            <st>
               <p>Related work</p>
            </st>
            <p>The inspiration for this paper came from the observation that constituent parsers are beginning to appear in bioinformatics papers on a wide variety of topics, but without any analysis of how well they perform as isolated components in broader projects. For example, the Bikel parser has been used to produce rough treebanks for human correction in a biological treebanking initiative <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. Subtrees from the Collins parser have been used as features in a protein interaction extractor <abbrgrp><abbr bid="B2">2</abbr></abbrgrp> and in a classifier for semantic relations between biomedical phrases <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. The Charniak parser has been employed to assist in the re-ranking of search results in a search engine for genomics documents <abbrgrp><abbr bid="B32">32</abbr></abbrgrp> and in the acquisition of causal chains from texts about protein interactions <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>.</p>
            <p>The Stanford parser has been used to provide syntactic clues for identifying key clinical terms in the medical domain <abbrgrp><abbr bid="B34">34</abbr></abbrgrp> and gene and protein names in the biological domain <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>, although we disagree with the latter paper that unlexicalised parsers &#8211; those that represent words simply by their POS tags &#8211; are more suited to the biological domain than lexicalised parsers equipped with a general-English lexicon. While the relative positions of the lexicalised and unlexicalised versions of the Stanford parser in our study depend on which evaluation measure is used, both versions were consistently out-performed by the Bikel and Charniak-Lease parsers, both of whose parsing engines are lexicalised with a general-English vocabulary.</p>
            <p>A thorough analysis of the effectiveness of these parsers in this domain is vital to identifying the source of errors, to developing workarounds for these errors, and indeed to selecting the right parser to begin with. The work reported here builds on a previous paper on the same subject <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> but the dependency-based approach circumvents many of the limitations of constituent-based evaluation that were identified in the course of that investigation. However, there have been a few papers that deal with the benchmarking of parsers of various kinds on biological or biomedical tasks. Lease and Charniak <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>, in introducing the modified version of the Charniak parser that performed so well here, present some comparative scores for various versions of the parser on both the GENIA treebank and the Penn Treebank, but they use constituent-based precision, recall and F-measure (F<sub><it>const</it></sub>) and therefore implicitly suffer from the inability of such measures to distinguish between differences of meaning and convention (as discussed above in the Background section).</p>
            <p>Grover <it>et al</it>. <abbrgrp><abbr bid="B36">36</abbr></abbrgrp> present several experiments on parsing MEDLINE abstracts with three hand-crafted grammars. First they demonstrate that although the low-coverage but high-accuracy ANLT parser <abbrgrp><abbr bid="B37">37</abbr></abbrgrp> can return a successful parse on only 39.5% of the sentences in their 79-sentence test set, 77.2% of those sentences (30.5% overall) were parsed perfectly. This strategy seems somewhat dubious for real-world applications, however, since a parse with a handful of minor errors is surely more desirable in practice than no parse at all. The ANLT parser also returns a set of logical predicates representing the sentence; whether this is more or less useful for application development than a dependency graph remains to be seen. They then present some experiments on using the Cass <abbrgrp><abbr bid="B38">38</abbr></abbrgrp> and TSG <abbrgrp><abbr bid="B39">39</abbr></abbrgrp> parsers to correctly interpret compound nouns which encode predicate relationships, differentiating for example between 'treatment response' = response TO treatment, and 'aerosol administration' = administration BY aerosol. Their results for this unique investigation are interesting and encouraging, but it is unfortunate that they do not apply the ANLT parser to the compound noun task, and conversely, they do not provide general measures of coverage and accuracy for the Cass and TSG parsers.</p>
            <p>Other papers have been published on the behaviour of native dependency parsers on biomedical text. The paper by Pyysalo <it>et al</it>. <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> is perhaps the closest to our own work. They compare the free Link Grammar parser <abbrgrp><abbr bid="B40">40</abbr></abbrgrp> to a commercial parser, the Connexor Machinese Syntax parser <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>, both of which have been used in bioinformatics <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B42">42</abbr></abbrgrp>. The parsers use different dependency grammars, so the authors prepared a 300-sentence protein-protein interaction corpus with a dual annotation scheme that accommodated the major differences between the two parsers' dependency types. They also disregarded dependency types, as well as directions, as the Link parser's 'links' are not explicitly directional, resulting in an even looser matching criterion than the loose criterion mentioned in our Results section.</p>
            <p>The Link parser can return multiple parses in ranked order of likelihood, and taking only the first parse for each sentence, it achieved a recall of 72.9%, and parsed 7.0% of sentences perfectly, although the same group shows elsewhere <abbrgrp><abbr bid="B43">43</abbr></abbrgrp> that this figure may be raised slightly by using an independently-trained re-ranker. The Connexor parser returns a single parse for each sentence; it scored 80.0% for recall and also achieved 7.0% perfect parses. For comparison, our best parser (Charniak-Lease) achieved an overall recall of 81.0% and parsed an impressive 23.1% of sentences perfectly, even given a slightly stricter dependency matching criterion. The authors also scored the parsers on their ability to return perfect interaction subgraphs &#8211; minimal subgraphs joining two protein names and the word or phrase stating their interaction &#8211; although we disagree that a <it>perfect </it>interaction subgraph is necessarily a pre-requisite for successful retrieval of an actual interaction. (Neither is it sufficient, since a negation word might be outside the interaction subgraph yet still able to completely reverse its meaning.)</p>
            <p>Schneider <it>et al</it>. <abbrgrp><abbr bid="B44">44</abbr></abbrgrp> present results comparable to ours for the Pro3Gres parser <abbrgrp><abbr bid="B45">45</abbr></abbrgrp> on performing several specific syntactic tasks over a small subset of GENIA. Their general approach is very similar to ours, but they do not provide performance indicators over all dependency types, and they chunk multi-word terms into single elements before parsing. They report F<sub><it>dep </it></sub>scores of 88.5 and 92.0 for identifying the subjects and objects of verbs respectively, although it is not clear whether or not these relation types are defined as broadly as the categories we used above in the study of the verb 'induce', where the Charniak-Lease parser scored 98.0 and the Bikel parser scored 97.0, averaged across both agent and target relations. They also report F<sub><it>dep </it></sub>scores of 83.5 and 83.0 for prepositional modification of nouns and verbs respectively, which are slightly better than our best parsers' scores on this task; their system contains a module specifically written to correct ambiguous prepositional phrase attachments. (Note that the F<sub><it>dep </it></sub>scores reported here are calculated from the individual precision and recall scores given in the original Schneider <it>et al</it>. paper.)</p>
            <p>One factor common to the Pyysalo <it>et al</it>. paper and the Schneider <it>et al</it>. paper is the small size of the evaluation datasets (300 and 100 sentences respectively) since both required the manual preparation of a dependency corpus tailored to the parsers under inspection. Another advantage of producing dependency parses from constituent parses is that we can make use of the larger and rapidly-growing body of treebank-annotated biological text. Since this project was begun, the GENIA treebank has grown from 200 to 500 MEDLINE abstracts, and the BioIE project <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> has released 642 abstracts annotated in a similar format. The Stanford algorithm provides a <it>de facto </it>standard for comparing a variety of constituent parsers and treebanks at the dependency level; if the dependency parser community were to adopt the same set of grammatical relations as standard, then native dependency parsers could be compared to constituent parsers and to biological treebanks fairly and transparently.</p>
            <p>The use of dependency graph analysis as an evaluation tool is not a new idea, having been discussed by the NLP community for several years, but to the best of our knowledge the application of such methods to specific problem domains like bioinformatics is a recent development. An early proposal along these lines <abbrgrp><abbr bid="B46">46</abbr></abbrgrp> also acknowledged that inconsequential differences exist between different dependency representations of the same text, and included some suggested ways to exclude these phenomena, although without a comprehensive treatment. While such differences do exist, we believe that dependency graphs are much less prone to this problem than constituent trees. The same paper also discussed the mapping of constituent trees to dependency graphs via phrasal heads; the Stanford toolkit relies on a more sophisticated version of this process. Its author later used this approach to evaluate his own MINIPAR dependency parser <abbrgrp><abbr bid="B47">47</abbr></abbrgrp>. Later, the EAGLE and SPARKLE projects used hierarchically-classified grammatical relations, which are comparable to the Stanford toolkit's dependency types, to evaluate parsers in several languages <abbrgrp><abbr bid="B48">48</abbr><abbr bid="B49">49</abbr><abbr bid="B50">50</abbr></abbrgrp>. Similar scoring measures have been proposed for partial parsers <abbrgrp><abbr bid="B51">51</abbr><abbr bid="B52">52</abbr></abbrgrp> &#8211; those parsers which only return complete syntactic analyses of parts of each sentence. However, despite the well-known issues with constituent-based methods and the wealth of research on alternatives such as these, constituent precision and recall (along with supplementary information like number of crossing brackets per sentence) remain the <it>de facto </it>standard for reporting parser accuracy.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Preparing the corpus</p>
            </st>
            <p>We took the initial release of the GENIA treebank, which contains 200 abstracts from MEDLINE matching the query terms <it>human, blood cell </it>and <it>transcription factor</it>, corrected several minor errors, and removed a small number of truncated sentences. This left us with 45406 tokens (words and punctuation symbols) in 1757 sentences, from which we stripped all annotations.</p>
            <p>Before parsing, the words in the corpus needed to be assigned POS tags. We did not use the gold standard POS tags as this would not reflect the typical use case for a parser, where the text is completely unseen. The Charniak and Charniak-Lease parsers perform POS-tagging internally, but the difference between them is that while the original Charniak parser learns to POS-tag as part of the parsing engine's training process &#8211; and therefore is distributed with a general-English POS-tagging vocabulary learnt from the Penn Treebank &#8211; the Charniak-Lease parser has a decoupled POS-tagging module which can be trained separately, and is provided pre-trained on a different part of the GENIA corpus from that which is included in the GENIA treebank. (Note that it still uses lexical statistics learnt from the Penn Treebank for the actual syntactic parsing step as there is not yet sufficient syntactically-annotated biological text for retraining the parsing engine.) The other parsers in the experiment expect pre-tagged text, for which we used the MedPost tagger <abbrgrp><abbr bid="B53">53</abbr></abbrgrp> which is trained on text from a variety of MEDLINE abstracts.</p>
         </sec>
         <sec>
            <st>
               <p>Parsing the corpus</p>
            </st>
            <p>All the parsers were invoked with default compile-time and command-line options, with the exception that all resource limits were set to their most generous levels to allow for particularly long/complex sentences. Some post-processing was required to normalise punctuation symbols and deal with other formatting issues, and to insert 'dummy' trees with no nesting each time one of the parsers completely failed to process a sentence. Prior to scoring, some additional operations were carried out on both the gold standard treebank and the parser output files. PRT labels were replaced with ADVP, and NAC and NX labels were replaced with NP, as these constituent types are not used in GENIA. Any constituents with a single daughter of the same type were removed, as were all constituents that did not cover any words in the sentence, and TOP nodes (S1 nodes in the case of the Charniak parser) which are meaningless top-level container constituents inserted by the parsers at the root of every sentence as a processing convenience.</p>
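            <p>The normalisation steps listed above can be sketched as a single recursive pass over a tree. The following Python is a minimal illustration under stated assumptions, not the scripts used in the study: a tree is assumed to be a nested (label, children) tuple whose leaves are word strings, empty elements are assumed to be encoded as -NONE- preterminals in the Penn Treebank manner, and the names <it>RELABEL</it>, <it>ROOTS</it> and <it>normalise</it> are invented for the example.</p>
            <p>
```python
RELABEL = {"PRT": "ADVP", "NAC": "NP", "NX": "NP"}  # labels not used in GENIA
ROOTS = {"TOP", "S1"}                               # top-level container nodes


def normalise(node):
    """Return the list of normalised nodes that replaces `node` (possibly empty)."""
    if isinstance(node, str):          # a word: keep as it is
        return [node]
    label, children = node
    if label == "-NONE-":              # assumed encoding of empty elements
        return []
    label = RELABEL.get(label, label)
    kids = [k for child in children for k in normalise(child)]
    if not kids:                       # constituent covers no words
        return []
    if label in ROOTS:                 # drop the container, keep its contents
        return kids
    if len(kids) == 1 and not isinstance(kids[0], str) and kids[0][0] == label:
        return kids                    # single daughter of the same type
    return [(label, kids)]


# Invented example tree exercising every rule.
tree = ("S1", [("S", [
    ("NP", [("NP", [("NN", ["IL-2"])])]),
    ("VP", [("VBZ", ["acts"]),
            ("PRT", [("RP", ["up"])]),
            ("NP", [("-NONE-", ["*"])])])])])

print(normalise(tree))
```
            </p>
            <p>Relabelling is applied before the single-daughter check, so that an NX directly under an NP is also collapsed once it has been renamed.</p>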
            <p>The Penn Treebank defines a set of grammatical function suffixes on constituents, such as -LGS for logical subject, -LOC for location and -TMP for temporal modifier, that allow certain aspects of meaning to be represented more specifically than a purely syntactic annotation allows. GENIA uses a subset of these suffixes, the Stanford parser can generate a different subset, and the dependency graph generation algorithm can use another subset to provide additional clues for identifying the correct dependency to hold between two words. However, since these subsets do not match, and the other parsers in the evaluation do not produce any function suffixes at all, we completely discarded them in order to maintain a level playing field. There is a tool which adds these suffixes probabilistically to raw trees <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>, but it was designed for the Charniak parser and is very sensitive to small differences in output between different parsers; its performance on biological text is so far untested, and evaluating it would make an interesting experiment.</p>
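            <p>Discarding the function suffixes amounts to truncating each constituent label at its first separator. The Python sketch below shows one way this might be done; the function name <it>strip_function_tags</it> is invented, and the treatment of numeric indices and of bracket tags such as -LRB- follows general Penn Treebank conventions rather than anything stated in this paper.</p>
            <p>
```python
import re


def strip_function_tags(label):
    """Reduce a constituent label to its bare syntactic category.

    NP-SBJ becomes NP and ADVP-LOC-TMP becomes ADVP. Labels that begin with
    a hyphen (-NONE-, -LRB-, -RRB-) are not suffixes and are left unchanged.
    """
    if label.startswith("-"):
        return label
    return re.split(r"[-=]", label, maxsplit=1)[0]


for label in ["NP-SBJ", "NP-LGS", "ADVP-LOC-TMP", "PP-TMP", "-LRB-", "VP"]:
    print(label, strip_function_tags(label))
```
            </p>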
         </sec>
         <sec>
            <st>
               <p>Generating the dependency graphs</p>
            </st>
            <p>We will not discuss in detail the system for mapping from phrase structure trees to dependency graphs as it is described thoroughly in <abbrgrp><abbr bid="B9">9</abbr></abbrgrp> and in the documentation for the Stanford NLP tools <abbrgrp><abbr bid="B54">54</abbr></abbrgrp>. Briefly, it defines a taxonomy of directed, labelled grammatical relations, from the most general default type, DEPENDENT, to highly specific types such as NOMINAL_PASSIVE_SUBJECT or PHRASAL_VERB_PARTICLE. Each type has a list of allowable source constituents, target constituents and local tree structures that may hold between source and target; these definitions can include both structural constraints and lexical constraints (e.g. lists of valid words within the constituents). The algorithm attempts to match the patterns against the supplied tree structure of a sentence, from most specific to most general, and when a match is found, a dependency arc is added to the output graph from the head word of the source constituent to the head word of the target constituent. (A head word of a constituent is the word that is central to that constituent's meaning, upon which all the other words within it ultimately depend; e.g. the head of a verb phrase is the verb itself, and the head of a noun phrase is the rightmost noun.)</p>
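            <p>The general shape of this procedure can be conveyed by a deliberately simplified Python sketch. It is not the Stanford algorithm: the head rules and the three patterns below are hypothetical and far cruder than the real definitions, trees are assumed to be nested (label, children) tuples with each preterminal holding one word, and nodes are identified by their word strings, which would be ambiguous for sentences with repeated words.</p>
            <p>
```python
# Toy head rules: for each constituent label, the child labels that may head it,
# in priority order; ties are broken in favour of the rightmost child.
HEAD_CHILDREN = {"S": ["VP"], "VP": ["VBZ", "VBD", "VB", "VP"],
                 "NP": ["NN", "NNS", "NNP", "NP"], "PP": ["IN"]}

# Patterns ordered from most specific to most general, as
# (dependency type, parent label, dependent label); None is a wildcard.
# The final entry is the catch-all DEPENDENT type.
PATTERNS = [("NOMINAL_SUBJECT", "S", "NP"),
            ("DIRECT_OBJECT", "VP", "NP"),
            ("ADJECTIVAL_MODIFIER", "NP", "JJ"),
            ("DEPENDENT", None, None)]


def head_child(label, children):
    """Pick the child that heads a constituent, falling back to the rightmost."""
    for wanted in HEAD_CHILDREN.get(label, []):
        for child in reversed(children):
            if child[0] == wanted:
                return child
    return children[-1]


def head_word(node):
    """Follow head children down to a word."""
    label, children = node
    if isinstance(children[0], str):   # preterminal: (tag, [word])
        return children[0]
    return head_word(head_child(label, children))


def dependency_arcs(node, out=None):
    """Collect (head word, dependency type, dependent word) arcs for a tree."""
    out = [] if out is None else out
    label, children = node
    if isinstance(children[0], str):
        return out
    head = head_child(label, children)
    for child in children:
        if child is not head:
            dep_type = next(name for name, parent, dep in PATTERNS
                            if parent in (None, label) and dep in (None, child[0]))
            out.append((head_word(node), dep_type, head_word(child)))
        dependency_arcs(child, out)
    return out


tree = ("S", [("NP", [("NN", ["IL-2"])]),
              ("VP", [("ADVP", [("RB", ["strongly"])]),
                      ("VBZ", ["induces"]),
                      ("NP", [("JJ", ["rapid"]), ("NN", ["proliferation"])])])])

for arc in dependency_arcs(tree):
    print(arc)
```
            </p>
            <p>Because the toy pattern list has no entry for adverbial phrases, the arc from 'induces' to 'strongly' falls through to DEPENDENT, which mirrors how the generic type arises when no more specific pattern matches.</p>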
            <p>The algorithm also provides the facility to 'collapse' graphs into a slightly simplified form, replacing certain words such as prepositions or possessives with dependencies, and optionally adding extra dependencies that make the semantics of each sentence slightly more explicit (at the expense of making the sentence's graph potentially cyclic rather than guaranteed acyclic). When scoring the parsers' overall performance, we used the collapsed versions of the dependency graphs with all additional dependencies added in, as this is the kind of graph one would find most useful in an information extraction project. The specific subtasks for the Charniak-Lease and Bikel parsers, however, used the unmodified graphs, as these allowed a more fine-grained analysis of behaviour.</p>
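            <p>Collapsing can be pictured as replacing a two-arc chain through a preposition with a single arc named after that preposition. The Python sketch below illustrates the idea for prepositions only. It is not the Stanford implementation: the label PREPOSITIONAL_OBJECT, the PREP_ naming scheme for collapsed arcs and the function name <it>collapse_prepositions</it> are assumptions, and nodes are identified by word strings for readability.</p>
            <p>
```python
def collapse_prepositions(arcs):
    """Replace head -> preposition -> object chains with one labelled arc."""
    # Map each preposition to its object.
    objects = {prep: obj for prep, label, obj in arcs
               if label == "PREPOSITIONAL_OBJECT"}
    # Prepositions that have both an attachment point and an object.
    collapsible = {dep for _, label, dep in arcs
                   if label == "PREPOSITIONAL_MODIFIER" and dep in objects}
    collapsed = []
    for head, label, dep in arcs:
        if label == "PREPOSITIONAL_MODIFIER" and dep in collapsible:
            collapsed.append((head, "PREP_" + dep.upper(), objects[dep]))
        elif label == "PREPOSITIONAL_OBJECT" and head in collapsible:
            continue  # absorbed into the collapsed arc
        else:
            collapsed.append((head, label, dep))
    return collapsed


# Invented arcs for the phrase "activation of T cells".
arcs = [("activation", "PREPOSITIONAL_MODIFIER", "of"),
        ("of", "PREPOSITIONAL_OBJECT", "cells"),
        ("cells", "NOUN_COMPOUND_MODIFIER", "T")]

print(collapse_prepositions(arcs))
```
            </p>
            <p>After collapsing, 'activation' and 'cells' are joined directly, which is usually the relation of interest in an information extraction setting.</p>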
         </sec>
         <sec>
            <st>
               <p>Scoring the parsers</p>
            </st>
            <p>The effectiveness scores F<sub><it>const </it></sub>and F<sub><it>dep </it></sub>are constituent tree and dependency graph similarity measures, respectively. Each is the harmonic mean of the Precision (P) and Recall (R) values achieved by a parser, and is thus designed to penalise parsers that favour one at the expense of the other:</p>
            <p>
               <m:math name="1471-2105-8-24-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:mi>F</m:mi>
                        <m:mo>=</m:mo>
                        <m:mfrac>
                           <m:mrow>
                              <m:mn>2</m:mn>
                              <m:mo>&#215;</m:mo>
                              <m:mi>P</m:mi>
                              <m:mo>&#215;</m:mo>
                              <m:mi>R</m:mi>
                           </m:mrow>
                           <m:mrow>
                              <m:mi>P</m:mi>
                              <m:mo>+</m:mo>
                              <m:mi>R</m:mi>
                           </m:mrow>
                        </m:mfrac>
                     </m:mrow>
                  </m:semantics>
               </m:math>
            </p>
            <p>Precision is the proportion of constituents or dependencies in the parsed corpus that are actually present in the gold standard:</p>
            <p>
               <m:math name="1471-2105-8-24-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:mi>P</m:mi>
                        <m:mo>=</m:mo>
                        <m:mfrac>
                           <m:mrow>
                              <m:mo>#</m:mo>
                              <m:mtext>true&#160;positives</m:mtext>
                           </m:mrow>
                           <m:mrow>
                              <m:mo>#</m:mo>
                              <m:mtext>true&#160;positives</m:mtext>
                              <m:mo>+</m:mo>
                              <m:mo>#</m:mo>
                              <m:mtext>false&#160;positives</m:mtext>
                           </m:mrow>
                        </m:mfrac>
                     </m:mrow>
                  </m:semantics>
               </m:math>
            </p>
            <p>Recall is the proportion of constituents or dependencies in the gold standard corpus that are correctly proposed by the parser:</p>
            <p>
               <m:math name="1471-2105-8-24-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:mi>R</m:mi>
                        <m:mo>=</m:mo>
                        <m:mfrac>
                           <m:mrow>
                              <m:mo>#</m:mo>
                              <m:mtext>true&#160;positives</m:mtext>
                           </m:mrow>
                           <m:mrow>
                              <m:mo>#</m:mo>
                              <m:mtext>true&#160;positives</m:mtext>
                              <m:mo>+</m:mo>
                              <m:mo>#</m:mo>
                              <m:mtext>false&#160;negatives</m:mtext>
                           </m:mrow>
                        </m:mfrac>
                     </m:mrow>
                     <m:annotation encoding="application/x-tex">R = \frac{\#\,\text{true positives}}{\#\,\text{true positives} + \#\,\text{false negatives}}</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
            <p>When calculating <it>F</it><sub><it>const</it></sub>, a constituent is treated as a true positive only if both its label (constituent type) and its span (the portion of the sentence covered by the constituent, not counting punctuation) are correct. When calculating <it>F</it><sub><it>dep</it></sub>, a dependency arc is treated as a true positive only if its label (dependency type), start node and end node are all correct; under the loose matching criterion, the label is disregarded.</p>
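            <p>As an illustration only, the dependency matching criteria described above can be sketched in Python. The Arc type, its field names, the score function and the toy arcs are assumptions made for this sketch and are not taken from the scripts used in the study; the balanced F-measure (the harmonic mean of precision and recall) is likewise assumed.</p>
            <p>
```python
from typing import Iterable, NamedTuple


class Arc(NamedTuple):
    """One dependency arc; the field names are illustrative assumptions."""
    label: str   # dependency type, e.g. "nsubj"
    start: int   # token index of the start node
    end: int     # token index of the end node


def score(gold: Iterable[Arc], proposed: Iterable[Arc], loose: bool = False):
    """Return (precision, recall, F) for one sentence or a pooled corpus."""
    def key(arc: Arc):
        # Loose matching disregards the label and compares the end points only.
        return (arc.start, arc.end) if loose else (arc.label, arc.start, arc.end)

    # Sets are used, so arcs sharing the same key are counted once.
    gold_keys = {key(a) for a in gold}
    proposed_keys = {key(a) for a in proposed}

    tp = len(gold_keys.intersection(proposed_keys))  # arcs present in both
    fp = len(proposed_keys) - tp                     # proposed, not in gold
    fn = len(gold_keys) - tp                         # in gold, not proposed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


# Toy sentence: the parser finds every arc but mislabels one of them.
gold = [Arc("nsubj", 2, 1), Arc("dobj", 2, 4), Arc("det", 4, 3)]
proposed = [Arc("nsubj", 2, 1), Arc("dep", 2, 4), Arc("det", 4, 3)]

strict = score(gold, proposed)
loose = score(gold, proposed, loose=True)
```
            </p>
            <p>In this toy sentence the mislabelled arc counts as one false positive and one false negative under strict matching, so precision and recall are both 2/3; under loose matching all three arcs match and every score is 1.</p>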
            <p>For brevity, individual precision and recall scores are not reported in this study. In constituent terms, and considering successfully parsed sentences only, all parsers scored slightly higher on precision than on recall, indicating that they produced somewhat sparser trees than the GENIA annotators. In dependency terms, by contrast, all parsers scored almost exactly the same for precision and recall on successfully parsed sentences. Since equal precision and recall imply equal numbers of false positives and false negatives, this suggests that each omitted dependency was usually replaced with a single erroneous arc.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>List of abbreviations</p>
         </st>
         <p><b>F </b>Parser effectiveness (F-measure)</p>
         <p><b>F</b><sub><it>const </it></sub>Effectiveness based on constituents</p>
         <p><b>F</b><sub><it>dep </it></sub>Effectiveness based on dependencies</p>
         <p><b>NLP </b>Natural language processing</p>
         <p><b>P </b>Precision</p>
         <p><b>POS </b>Part of speech</p>
         <p><b>PTB </b>Penn Treebank</p>
         <p><b>R </b>Recall</p>
         <p>The following list covers only the linguistic abbreviations used in the phrase-structure tree diagrams in this paper. See <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> for explanations of their names and a comprehensive list.</p>
         <p><b>ADVP </b>Adverbial phrase</p>
         <p><b>CC </b>Coordinating conjunction</p>
         <p><b>CD </b>Cardinal number</p>
         <p><b>DT </b>Determiner</p>
         <p><b>IN </b>Preposition or subordinating conjunction</p>
         <p><b>NN </b>Noun, singular or mass</p>
         <p><b>NNS </b>Noun, plural</p>
         <p><b>NP </b>Noun phrase</p>
         <p><b>PP </b>Prepositional phrase</p>
         <p><b>RB </b>Adverb</p>
         <p><b>S </b>Simple declarative clause</p>
         <p><b>VBD </b>Verb, past tense</p>
         <p><b>VBN </b>Verb, past participle</p>
         <p><b>VBP </b>Verb, non-3rd person singular present</p>
         <p><b>VP </b>Verb phrase</p>
         <p><b>WDT </b><it>Wh</it>-determiner (e.g. "which", "that", "whatever")</p>
         <p><b>WHNP </b><it>Wh</it>-noun phrase (noun is replaced by "which", "who" etc.)</p>
         <p>The following list covers only the linguistic abbreviations used in the dependency graph diagrams in this paper. See <abbrgrp><abbr bid="B9">9</abbr></abbrgrp> and <abbrgrp><abbr bid="B54">54</abbr></abbrgrp> for comprehensive lists.</p>
         <p><b>ADVMOD </b>Adverbial modifier</p>
         <p><b>AND </b>Conjunction 'and'</p>
         <p><b>AUX </b>Auxiliary</p>
         <p><b>AUXPASS </b>Passive auxiliary</p>
         <p><b>BY </b>Preposition 'by'</p>
         <p><b>DEP </b>Dependent</p>
         <p><b>DET </b>Determiner</p>
         <p><b>DOBJ </b>Direct object</p>
         <p><b>DURING </b>Preposition 'during'</p>
         <p><b>IN </b>Preposition 'in'</p>
         <p><b>NN </b>Noun compound modifier</p>
         <p><b>NSUBJ </b>Nominal subject</p>
         <p><b>NSUBJPASS </b>Passive nominal subject</p>
         <p><b>NUM </b>Numeric modifier</p>
         <p><b>OF </b>Preposition 'of'</p>
         <p><b>RCMOD </b>Relative clause modifier</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>ABC designed and wrote the Perl scripts and Java classes used in the experiment, analysed the results and drafted the manuscript. AJS participated in the experimental design and data analysis, and co-edited the manuscript. Both authors were involved in planning the study, and both read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>This work was funded by the Biotechnology and Biological Sciences Research Council and AstraZeneca PLC.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Natural language processing and systems biology</p>
            </title>
            <aug>
               <au>
                  <snm>Cohen</snm>
                  <fnm>KB</fnm>
               </au>
               <au>
                  <snm>Hunter</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Artificial intelligence methods and tools for systems biology</source>
            <publisher>Dordrecht: Kluwer</publisher>
            <editor>Dubitzky W, Azuaje F</editor>
            <pubdate>2004</pubdate>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Protein-Protein Interaction: A Supervised Learning Approach</p>
            </title>
            <aug>
               <au>
                  <snm>Xiao</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Su</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Zhou</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Tan</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Proceedings of the First International Symposium on Semantic Mining in Biomedicine</source>
            <publisher>Hinxton, UK</publisher>
            <pubdate>2005</pubdate>
         </bibl>
         <bibl id="B3">
            <title>
               <p>An evaluation of GO annotation retrieval for BioCreAtIvE and GOA</p>
            </title>
            <aug>
               <au>
                  <snm>Camon</snm>
                  <fnm>EB</fnm>
               </au>
               <au>
                  <snm>Barrell</snm>
                  <fnm>DG</fnm>
               </au>
               <au>
                  <snm>Dimmer</snm>
                  <fnm>EC</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Magrane</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Maslen</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Binns</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Apweiler</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <issue>Suppl. 1</issue>
            <fpage>S17</fpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15960829</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Applying GIFT, a Gene Interactions Finder in Text, to fly literature</p>
            </title>
            <aug>
               <au>
                  <snm>Domedel-Puig</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Wernisch</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>17</issue>
            <fpage>3582</fpage>
            <lpage>3583</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti578</pubid>
                  <pubid idtype="pmpid" link="fulltext">16014369</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Semantic Relations Asserting the Etiology of Genetic Diseases</p>
            </title>
            <aug>
               <au>
                  <snm>Rindflesch</snm>
                  <fnm>TC</fnm>
               </au>
               <au>
                  <snm>Libbus</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Hristovski</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Aronson</snm>
                  <fnm>AR</fnm>
               </au>
               <au>
                  <snm>Kilicoglu</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Proceedings of the American Medical Informatics Association Annual Symposium</source>
            <publisher>Hanley and Belfus, Inc</publisher>
            <pubdate>2003</pubdate>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Extraction of protein interaction information from unstructured text using a context-free grammar</p>
            </title>
            <aug>
               <au>
                  <snm>Temkin</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Gilder</snm>
                  <fnm>MR</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <issue>16</issue>
            <fpage>2046</fpage>
            <lpage>2053</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btg279</pubid>
                  <pubid idtype="pmpid" link="fulltext">14594709</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>IntEx: A Syntactic Role Driven Protein-Protein Interaction Extractor for Bio-Medical Text</p>
            </title>
            <aug>
               <au>
                  <snm>Ahmed</snm>
                  <fnm>ST</fnm>
               </au>
               <au>
                  <snm>Chidambaram</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Davulcu</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Baral</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics</source>
            <publisher>Detroit: Association for Computational Linguistics</publisher>
            <pubdate>2005</pubdate>
            <fpage>54</fpage>
            <lpage>61</lpage>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Evaluation of two dependency parsers on biomedical corpus targeted at protein-protein interactions</p>
            </title>
            <aug>
               <au>
                  <snm>Pyysalo</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ginter</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Pahikkala</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Boberg</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>J&#228;rvinen</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Salakoski</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>International Journal of Medical Informatics</source>
            <pubdate>2006</pubdate>
            <volume>75</volume>
            <issue>6</issue>
            <fpage>430</fpage>
            <lpage>442</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.ijmedinf.2005.06.009</pubid>
                  <pubid idtype="pmpid" link="fulltext">16099201</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Generating Typed Dependency Parses from Phrase Structure Parses</p>
            </title>
            <aug>
               <au>
                  <snm>de Marneffe</snm>
                  <fnm>MC</fnm>
               </au>
               <au>
                  <snm>MacCartney</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Manning</snm>
                  <fnm>CD</fnm>
               </au>
            </aug>
            <source>Proceedings of 5th International Conference on Language Resources and Evaluation (LREC2006)</source>
            <publisher>Genoa, Italy</publisher>
            <pubdate>2006</pubdate>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Building a Large Annotated Corpus of English: The Penn Treebank</p>
            </title>
            <aug>
               <au>
                  <snm>Marcus</snm>
                  <fnm>MP</fnm>
               </au>
               <au>
                  <snm>Santorini</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Marcinkiewicz</snm>
                  <fnm>MA</fnm>
               </au>
            </aug>
            <source>Computational Linguistics</source>
            <pubdate>1993</pubdate>
            <volume>19</volume>
            <issue>2</issue>
            <fpage>313</fpage>
            <lpage>330</lpage>
         </bibl>
         <bibl id="B11">
            <title>
               <p>A test of the leaf-ancestor metric for parse accuracy</p>
            </title>
            <aug>
               <au>
                  <snm>Sampson</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Babarczy</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Proceedings of the Beyond PARSEVAL workshop of the third LREC conference</source>
            <publisher>Las Palmas, Canary Islands</publisher>
            <pubdate>2002</pubdate>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Evaluating and integrating treebank parsers on a biomedical corpus</p>
            </title>
            <aug>
               <au>
                  <snm>Clegg</snm>
                  <fnm>AB</fnm>
               </au>
               <au>
                  <snm>Shepherd</snm>
                  <fnm>AJ</fnm>
               </au>
            </aug>
            <source>Association for Computational Linguistics Workshop on Software CDROM</source>
            <publisher>Ann Arbor, MI</publisher>
            <editor>Jansche M</editor>
            <pubdate>2005</pubdate>
         </bibl>
         <bibl id="B13">
            <title>
               <p>GENIA Treebank Beta Version</p>
            </title>
            <url>http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/topics/Corpus/GTB.html</url>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Mining the Bibliome</p>
            </title>
            <url>http://bioie.ldc.upenn.edu/</url>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Design of a Multi-lingual, Parallel-processing Statistical Parsing Engine</p>
            </title>
            <aug>
               <au>
                  <snm>Bikel</snm>
                  <fnm>DM</fnm>
               </au>
            </aug>
            <source>Proceedings of the Human Language Technology Conference 2002 (HLT2002)</source>
            <publisher>San Diego</publisher>
            <pubdate>2002</pubdate>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Head-Driven Statistical Models for Natural Language Parsing</p>
            </title>
            <aug>
               <au>
                  <snm>Collins</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>PhD thesis</source>
            <publisher>University of Pennsylvania</publisher>
            <pubdate>1999</pubdate>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Fast Exact Inference with a Factored Model for Natural Language Parsing</p>
            </title>
            <aug>
               <au>
                  <snm>Klein</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Manning</snm>
                  <fnm>CD</fnm>
               </au>
            </aug>
            <source>Advances in Neural Information Processing Systems</source>
            <pubdate>2002</pubdate>
            <fpage>3</fpage>
            <lpage>10</lpage>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Accurate Unlexicalized Parsing</p>
            </title>
            <aug>
               <au>
                  <snm>Klein</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Manning</snm>
                  <fnm>CD</fnm>
               </au>
            </aug>
            <source>Proceedings of the 41st Meeting of the Association for Computational Linguistics (ACL'03), Main Volume</source>
            <publisher>Sapporo, Japan: ACL</publisher>
            <pubdate>2003</pubdate>
         </bibl>
         <bibl id="B19">
            <title>
               <p>A Maximum-Entropy-Inspired Parser</p>
            </title>
            <aug>
               <au>
                  <snm>Charniak</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Technical report, Brown University</source>
            <pubdate>1999</pubdate>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Parsing Biomedical Literature</p>
            </title>
            <aug>
               <au>
                  <snm>Lease</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Charniak</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Proceedings of the Second International Joint Conference on Natural Language Processing (IJCNLP'05)</source>
            <publisher>Jeju Island, Korea</publisher>
            <editor>Dale R, Wong KF, Su J, Kwong OY</editor>
            <pubdate>2005</pubdate>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Two biomedical sublanguages: a description based on the theories of Zellig Harris</p>
            </title>
            <aug>
               <au>
                  <snm>Friedman</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Kra</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Rzhetsky</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Journal of Biomedical Informatics</source>
            <pubdate>2002</pubdate>
            <volume>35</volume>
            <issue>4</issue>
            <fpage>222</fpage>
            <lpage>235</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S1532-0464(03)00012-1</pubid>
                  <pubid idtype="pmpid">12755517</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Bikel Parser</p>
            </title>
            <url>http://www.cis.upenn.edu/~dbikel/software.html</url>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Learning to Detect Negation with 'Not' in Medical Texts</p>
            </title>
            <aug>
               <au>
                  <snm>Goldin</snm>
                  <fnm>IM</fnm>
               </au>
               <au>
                  <snm>Chapman</snm>
                  <fnm>WW</fnm>
               </au>
            </aug>
            <source>ACM SIGIR '03 Workshop on Text Analysis and Search for Bioinformatics: Participant Notebook</source>
            <publisher>Toronto, Canada: Association for Computing Machinery</publisher>
            <pubdate>2003</pubdate>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Use of General-purpose Negation Detection to Augment Concept Indexing of Medical Documents: A Quantitative Study Using the UMLS</p>
            </title>
            <aug>
               <au>
                  <snm>Mutalik</snm>
                  <fnm>PG</fnm>
               </au>
               <au>
                  <snm>Deshpande</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Nadkarni</snm>
                  <fnm>PM</fnm>
               </au>
            </aug>
            <source>Journal of the American Medical Informatics Association</source>
            <pubdate>2001</pubdate>
            <volume>8</volume>
            <issue>6</issue>
            <fpage>598</fpage>
            <lpage>609</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">130070</pubid>
                  <pubid idtype="pmpid" link="fulltext">11687566</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>A Statistical Approach to Anaphora Resolution</p>
            </title>
            <aug>
               <au>
                  <snm>Ge</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Hale</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Charniak</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Proceedings of the Sixth Workshop on Very Large Corpora, Hong Kong</source>
            <pubdate>1998</pubdate>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Automatic labeling of semantic roles</p>
            </title>
            <aug>
               <au>
                  <snm>Gildea</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Jurafsky</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Computational Linguistics</source>
            <pubdate>2002</pubdate>
            <volume>28</volume>
            <issue>3</issue>
            <fpage>245</fpage>
            <lpage>288</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1162/089120102760275983</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Statistics-based summarization &#8211; Step one: Sentence compression</p>
            </title>
            <aug>
               <au>
                  <snm>Knight</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Marcu</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proceedings of the 17th National Conference on Artificial Intelligence (AAAI)</source>
            <publisher>Austin, Texas</publisher>
            <pubdate>2000</pubdate>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Automatic Verb Classification Based on Statistical Distributions of Argument Structure</p>
            </title>
            <aug>
               <au>
                  <snm>Merlo</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Stevenson</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Computational Linguistics</source>
            <pubdate>2001</pubdate>
            <volume>27</volume>
            <issue>3</issue>
            <fpage>373</fpage>
            <lpage>408</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1162/089120101317066122</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Assigning function tags to parsed text</p>
            </title>
            <aug>
               <au>
                  <snm>Blaheta</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Charniak</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Proceedings of the 1st Annual Meeting of the North American Chapter of the Association for Computational Linguistics</source>
            <pubdate>2000</pubdate>
            <fpage>234</fpage>
            <lpage>240</lpage>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Parallel Entity and Treebank Annotation</p>
            </title>
            <aug>
               <au>
                  <snm>Bies</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Kulick</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Mandel</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky</source>
            <publisher>Ann Arbor, Michigan: Association for Computational Linguistics</publisher>
            <pubdate>2005</pubdate>
            <fpage>21</fpage>
            <lpage>28</lpage>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Classifying Semantic Relations in Bioscience Texts</p>
            </title>
            <aug>
               <au>
                  <snm>Rosario</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Hearst</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume</source>
            <publisher>Barcelona, Spain</publisher>
            <pubdate>2004</pubdate>
            <fpage>430</fpage>
            <lpage>437</lpage>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Synonym-based Query Expansion and Boosting-based Re-ranking: A Two-phase Approach for Genomic Information Retrieval</p>
            </title>
            <aug>
               <au>
                  <snm>Shi</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Gu</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Popowich</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Sarkar</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Proceedings of the Fourteenth Text REtrieval Conference (TREC 2005)</source>
            <publisher>Gaithersburg, Maryland</publisher>
            <pubdate>2005</pubdate>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Acquisition of Causal Knowledge from Text: Applications to Bioinformatics</p>
            </title>
            <aug>
               <au>
                  <snm>Sanchez</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Poesio</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>First International Symposium on Semantic Mining in Biomedicine (SMBM) poster session</source>
            <publisher>Hinxton, UK</publisher>
            <pubdate>2005</pubdate>
         </bibl>
         <bibl id="B34">
            <title>
               <p>Improved Identification of Noun Phrases in Clinical Radiology Reports Using a High-Performance Statistical Natural Language Parser Augmented with the UMLS Specialist Lexicon</p>
            </title>
            <aug>
               <au>
                  <snm>Huang</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Lowe</snm>
                  <fnm>HJ</fnm>
               </au>
               <au>
                  <snm>Klein</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Cucina</snm>
                  <fnm>RJ</fnm>
               </au>
            </aug>
            <source>Journal of the American Medical Informatics Association</source>
            <pubdate>2005</pubdate>
            <volume>12</volume>
            <issue>3</issue>
            <fpage>275</fpage>
            <lpage>285</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1090458</pubid>
                  <pubid idtype="pmpid" link="fulltext">15684131</pubid>
                  <pubid idtype="doi">10.1197/jamia.M1695</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Exploiting concepts for biomedical entity recognition: From syntax to the Web</p>
            </title>
            <aug>
               <au>
                  <snm>Finkel</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Dingare</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Nguyen</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Nissim</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Manning</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Sinclair</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA)</source>
            <publisher>Geneva, Switzerland</publisher>
            <editor>Collier N, Ruch P, Nazarenko A</editor>
            <pubdate>2004</pubdate>
            <fpage>88</fpage>
            <lpage>91</lpage>
         </bibl>
         <bibl id="B36">
            <title>
               <p>A comparison of parsing technologies for the biomedical domain</p>
            </title>
            <aug>
               <au>
                  <snm>Grover</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Lapata</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Lascarides</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Natural Language Engineering</source>
            <pubdate>2005</pubdate>
            <volume>11</volume>
            <fpage>27</fpage>
            <lpage>65</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1017/S1351324904003547</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>The Alvey Natural Language Tools grammar (4th release)</p>
            </title>
            <aug>
               <au>
                  <snm>Grover</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Carroll</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Briscoe</snm>
                  <fnm>EJ</fnm>
               </au>
            </aug>
            <source>Technical report 284, Cambridge University</source>
            <pubdate>1993</pubdate>
            <url>http://www.cl.cam.ac.uk/users/ejb/anlt-gram.pdf</url>
         </bibl>
         <bibl id="B38">
            <title>
               <p>Partial Parsing via Finite-State Cascades</p>
            </title>
            <aug>
               <au>
                  <snm>Abney</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Workshop on Robust Parsing. 8th European Summer School in Logic, Language and Information (ESSLLI)</source>
            <publisher>Prague, Czech Republic</publisher>
            <editor>Carroll J</editor>
            <pubdate>1996</pubdate>
            <fpage>8</fpage>
            <lpage>15</lpage>
         </bibl>
         <bibl id="B39">
            <title>
               <p>Robust Accurate Statistical Annotation of General Text</p>
            </title>
            <aug>
               <au>
                  <snm>Briscoe</snm>
                  <fnm>EJ</fnm>
               </au>
               <au>
                  <snm>Carroll</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002)</source>
            <publisher>Las Palmas, Canary Islands</publisher>
            <pubdate>2002</pubdate>
            <fpage>1499</fpage>
            <lpage>1504</lpage>
         </bibl>
         <bibl id="B40">
            <title>
               <p>Parsing English with a Link Grammar</p>
            </title>
            <aug>
               <au>
                  <snm>Sleator</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Temperley</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proceedings of the Third International Workshop on Parsing Technologies</source>
            <publisher>Tilburg, Netherlands</publisher>
            <pubdate>1993</pubdate>
         </bibl>
         <bibl id="B41">
            <title>
               <p>Connexor Oy</p>
            </title>
            <url>http://www.connexor.com/</url>
         </bibl>
         <bibl id="B42">
            <title>
               <p>Protein names and how to find them</p>
            </title>
            <aug>
               <au>
                  <snm>Franz&#233;n</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Eriksson</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Olsson</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Lid&#233;n</snm>
                  <fnm>LAP</fnm>
               </au>
               <au>
                  <snm>Coster</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>International Journal of Medical Informatics</source>
            <pubdate>2002</pubdate>
            <volume>67</volume>
            <issue>1&#8211;3</issue>
            <fpage>49</fpage>
            <lpage>61</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S1386-5056(02)00052-7</pubid>
                  <pubid idtype="pmpid" link="fulltext">12460631</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B43">
            <title>
               <p>Regularized Least-Squares for Parse Ranking</p>
            </title>
            <aug>
               <au>
                  <snm>Tsivtsivadze</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Pahikkala</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Pyysalo</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Boberg</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Myll&#228;ri</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Salakoski</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Proceedings of the 6th International Symposium on Intelligent Data Analysis (IDA 2005)</source>
            <publisher>Madrid, Spain</publisher>
            <pubdate>2005</pubdate>
            <fpage>464</fpage>
            <lpage>474</lpage>
         </bibl>
         <bibl id="B44">
            <title>
               <p>Steps towards a GENIA Dependency Treebank</p>
            </title>
            <aug>
               <au>
                  <snm>Schneider</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Rinaldi</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Kaljurand</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Hess</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Proceedings of the Third Workshop on Treebanks and Linguistic Theories (TLT 2004)</source>
            <publisher>T&#252;bingen, Germany</publisher>
            <pubdate>2004</pubdate>
            <fpage>137</fpage>
            <lpage>148</lpage>
         </bibl>
         <bibl id="B45">
            <title>
               <p>A robust and deep-linguistic theory applied to large-scale parsing</p>
            </title>
            <aug>
               <au>
                  <snm>Schneider</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Dowdall</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Rinaldi</snm>
                  <fnm>F</fnm>
               </au>
            </aug>
            <source>Coling 2004 Workshop on Robust Methods in the Analysis of Natural Language Data (ROMAND 2004)</source>
            <publisher>Geneva</publisher>
            <pubdate>2004</pubdate>
         </bibl>
         <bibl id="B46">
            <title>
               <p>A dependency-based method for evaluating broad-coverage parsers</p>
            </title>
            <aug>
               <au>
                  <snm>Lin</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-95)</source>
            <publisher>Montreal, Quebec</publisher>
            <pubdate>1995</pubdate>
         </bibl>
         <bibl id="B47">
            <title>
               <p>Dependency-Based Evaluation Of Minipar</p>
            </title>
            <aug>
               <au>
                  <snm>Lin</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Building and using Parsed Corpora</source>
            <publisher>Dordrecht: Kluwer</publisher>
            <editor>Abeill&#233; A</editor>
            <pubdate>2003</pubdate>
         </bibl>
         <bibl id="B48">
            <title>
               <p>Parser evaluation: a survey and new proposal</p>
            </title>
            <aug>
               <au>
                  <snm>Carroll</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Briscoe</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Sanfilippo</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Proceedings of the first international conference on language resources and evaluation (LREC)</source>
            <publisher>Granada, Spain</publisher>
            <pubdate>1998</pubdate>
         </bibl>
         <bibl id="B49">
            <title>
               <p>Corpus annotation for parser evaluation</p>
            </title>
            <aug>
               <au>
                  <snm>Carroll</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Minnen</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Briscoe</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Proceedings of the EACL workshop on linguistically interpreted corpora (LINC)</source>
            <publisher>Bergen, Norway</publisher>
            <pubdate>1999</pubdate>
         </bibl>
         <bibl id="B50">
            <title>
               <p>Relational evaluation schemes</p>
            </title>
            <aug>
               <au>
                  <snm>Briscoe</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Carroll</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Graham</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Copestake</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Proceedings of the Beyond PARSEVAL workshop of the third LREC conference</source>
            <publisher>Las Palmas, Canary Islands</publisher>
            <pubdate>2002</pubdate>
         </bibl>
         <bibl id="B51">
            <title>
               <p>An approach to robust partial parsing and evaluation metrics</p>
            </title>
            <aug>
               <au>
                  <snm>Srinivas</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Doran</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Hockey</snm>
                  <fnm>BA</fnm>
               </au>
               <au>
                  <snm>Joshi</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Proceedings of the Eighth European Summer School in Logic, Language and Information</source>
            <publisher>Prague, Czech Republic</publisher>
            <pubdate>1996</pubdate>
         </bibl>
         <bibl id="B52">
            <title>
               <p>Towards a dependency-oriented evaluation for partial parsing</p>
            </title>
            <aug>
               <au>
                  <snm>K&#252;bler</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Telljohann</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Proceedings of the Beyond PARSEVAL workshop of the third LREC conference</source>
            <publisher>Las Palmas, Canary Islands</publisher>
            <pubdate>2002</pubdate>
         </bibl>
         <bibl id="B53">
            <title>
               <p>MedPost: a part-of-speech tagger for bioMedical text</p>
            </title>
            <aug>
               <au>
                  <snm>Smith</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Rindflesch</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Wilbur</snm>
                  <fnm>WJ</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <issue>14</issue>
            <fpage>2320</fpage>
            <lpage>2321</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bth227</pubid>
                  <pubid idtype="pmpid" link="fulltext">15073016</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B54">
            <title>
               <p>Stanford NLP tools</p>
            </title>
            <url>http://nlp.stanford.edu/software/index.shtml</url>
         </bibl>
      </refgrp>
   </bm>
</art>

