<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-9-108</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>MScanner: a classifier for retrieving Medline citations</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Poulter</snm>
               <mi>L</mi>
               <fnm>Graham</fnm>
               <insr iid="I1"/>
               <email>graham.poulter@gmail.com</email>
            </au>
            <au id="A2">
               <snm>Rubin</snm>
               <mi>L</mi>
               <fnm>Daniel</fnm>
               <insr iid="I2"/>
               <email>dlrubin@stanford.edu</email>
            </au>
            <au id="A3">
               <snm>Altman</snm>
               <mi>B</mi>
               <fnm>Russ</fnm>
               <insr iid="I3"/>
               <email>russ.altman@stanford.edu</email>
            </au>
            <au id="A4">
               <snm>Seoighe</snm>
               <fnm>Cathal</fnm>
               <insr iid="I1"/>
               <email>cathal.seoighe@uct.ac.za</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>UCT NBN Node, Department of Molecular and Cell Biology, University of Cape Town, Cape Town, South Africa</p>
            </ins>
            <ins id="I2">
               <p>Stanford Medical Informatics, Stanford University, San Francisco, USA</p>
            </ins>
            <ins id="I3">
               <p>Department of Bioengineering and Department of Genetics, Stanford University, San Francisco, USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2008</pubdate>
         <volume>9</volume>
         <issue>1</issue>
         <fpage>108</fpage>
         <url>http://www.biomedcentral.com/1471-2105/9/108</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">18284683</pubid>
               <pubid idtype="doi">10.1186/1471-2105-9-108</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>07</day>
               <month>9</month>
               <year>2007</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>19</day>
               <month>2</month>
               <year>2008</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>19</day>
               <month>2</month>
               <year>2008</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2008</year>
         <collab>Poulter et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Keyword searching through PubMed and other systems is the standard means of retrieving information from Medline. However, ad-hoc retrieval systems do not meet all of the needs of databases that curate information from literature, or of text miners developing a corpus on a topic that has many terms indicative of relevance. Several databases have developed supervised learning methods that operate on a filtered subset of Medline, to classify Medline records so that fewer articles have to be manually reviewed for relevance. A few studies have considered generalisation of Medline classification to operate on the entire Medline database in a non-domain-specific manner, but existing applications lack speed, available implementations, or a means to measure performance in new domains.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>MScanner is an implementation of a Bayesian classifier that provides a simple web interface for submitting a corpus of relevant training examples in the form of PubMed IDs and returning results ranked by decreasing probability of relevance. For maximum speed it uses the Medical Subject Headings (MeSH) and journal of publication as a concise document representation, and takes roughly 90 seconds to return results against the 16 million records in Medline. The web interface provides interactive exploration of the results, and cross validated performance evaluation on the relevant input against a random subset of Medline. We describe the classifier implementation, cross validate it on three domain-specific topics, and compare its performance to that of an expert PubMed query for a complex topic. In cross validation on the three sample topics against 100,000 random articles, the classifier achieved excellent separation of relevant and irrelevant article score distributions, ROC areas between 0.97 and 0.99, and averaged precision between 0.69 and 0.92.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>MScanner is an effective non-domain-specific classifier that operates on the entire Medline database, and is suited to retrieving topics for which many features may indicate relevance. Its web interface simplifies the task of classifying Medline citations, compared to building a pre-filter and classifier specific to the topic. The data sets and open source code used to obtain the results in this paper are available on-line and as supplementary material, and the web interface may be accessed at <url>http://mscanner.stanford.edu</url>.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <sec>
            <st>
               <p>Ad-hoc information retrieval</p>
            </st>
            <p>Information retrieval on the biomedical literature indexed by Medline <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> is most often carried out using ad-hoc retrieval. The PubMed <abbrgrp><abbr bid="B2">2</abbr></abbrgrp> boolean search engine is the most widely used Medline retrieval system. Other interfaces for searching Medline include relevance ranking systems such as Relemed <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> and systems such as EBIMed <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> that perform information extraction and clustering on results. Certain web search engines such as Google Scholar <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> also index much of the same literature as Medline. Alternatives to ordinary queries include the related articles feature of PubMed <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, which returns the Medline records most similar to a given record of interest, and the eTBlast <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> search engine, which ranks Medline abstracts by their similarity to a given paragraph of text.</p>
         </sec>
         <sec>
            <st>
               <p>Supervised learning for database curation</p>
            </st>
            <p>Ad-hoc retrieval in general has proven inefficient for the task of identifying articles relevant to databases that require manual curation of entries from biomedical literature, such as the Pharmacogenetics Knowledgebase (PharmGKB) <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>, and for constructing corpora for automated text mining systems such as Textpresso <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr></abbrgrp>. It is difficult to design an expert boolean query (the knowledge engineering approach to document classification <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>) that recalls most of the relevant documents without retrieving many irrelevant documents at the same time, when there are many document features that potentially indicate relevance.</p>
            <p>The case of many relevant features is, however, effectively handled using supervised learning, in which a text classifier is inductively trained from labelled examples <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>. Several databases have therefore used supervised learning to filter Medline for relevant documents <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>, a recent example being the Immune Epitope Database (IEDB) <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. IEDB researchers first used a sensitive PubMed query several pages in length to obtain a Medline subset of 20,910 records. The components of the query had previously been used by IEDB curators, whose manual relevance judgements formed a "gold standard" training corpus of 5,712 relevant and 15,198 irrelevant documents. Different classifier algorithms and document representations were evaluated under cross validation, and their performance compared using the area under the Receiver Operating Characteristic (ROC) curve <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. The best of the trained classifiers is to be applied to future results of the sensitive query to reduce the number of documents that have to be manually reviewed.</p>
            <p>Supervised learning has also been used to identify Medline records relevant to the Biomolecular Interaction Network Database <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>, the ACP Journal Club for evidence-based medicine <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>, the Textpresso resource <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>, and the Database of Interacting Proteins (DIP) <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>. Classification may also be performed on full-text articles as in the TREC 2005 Genomics Track <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>, and Cohen <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> provides a general-purpose classifier for the task. Most classifiers have been developed for filtering sets of a few thousand Medline records, but it is possible to classify larger subsets of Medline and even the whole Medline database. A small number of methods have been developed for larger data sets, including an ad-hoc scoring method that has been tested on a stem cell subset of Medline <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>, the PharmGKB curation filter <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>, and the PubFinder <abbrgrp><abbr bid="B24">24</abbr></abbrgrp> web application derived from the DIP curation filter <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>. However, tasks submitted to the PubFinder site in mid-2006 were still processing at the time of writing, and its maintainers could not be reached. In some cases, text mining for relationships between named entities is used instead of supervised learning to judge relevance &#8211; for example in the more recent curation filter developed for the DIP <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. The most closely related articles <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> to individual articles in a collection have also been used to update a bibliography <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> or a database <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>Comparison of information retrieval approaches</p>
            </st>
            <p>Approaches to retrieving relevant Medline records for database curation have included ad-hoc retrieval (boolean retrieval in particular), related article search, and supervised learning. Pure boolean retrieval systems like PubMed return (without ranking) all documents that satisfy the logical conditions specified in the query. The vector space models used by web search engines rank documents by similarity to the query, and probabilistic retrieval models rank documents by decreasing probability of relevance to the topics in the query <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. Related article search retrieves documents by their similarity to a query document, which can be accomplished by using the document as a query string in a ranking ad-hoc retrieval system tuned for long queries <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B29">29</abbr></abbrgrp>. Overlap in citation lists has also been used as a benchmark for relatedness <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. The method used in PubMed related articles <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> directly evaluates the similarity between a pair of documents over all topics (corresponding to vocabulary terms) using a probabilistic model. Supervised learning trains a document classifier from labelled examples, framing the problem of Medline retrieval as a problem of classifying documents into the categories of "relevant" and "irrelevant". Classifiers may either produce ranked outputs or make hard judgements like a boolean query <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. Statistical classifiers, such as the Na&#239;ve Bayes classifier used here, use the same Probability Ranking Principle as probabilistic ad-hoc retrieval systems <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. Ranked classifier results may loosely be considered to contain documents closely related to the relevant examples as a whole.</p>
         </sec>
         <sec>
            <st>
               <p>Overview of MScanner</p>
            </st>
            <p>We have developed MScanner, a classifier of Medline records that uses supervised learning to identify relevant records in a non-domain-specific manner. The user provides only relevant citations as training examples, with the rest of Medline approximating the irrelevant examples for training purposes. Most classifiers are developed for particular databases, a limitation that we address by demonstrating effectiveness in multiple domains and providing facilities to evaluate the classifier on new inputs. We make it easier to use text classification by providing a web interface and operating on all of Medline instead of a Medline subset. To attain the high speeds necessary for online use, we used an optimised implementation of a Na&#239;ve Bayes classifier, and a compact document representation derived from two feature spaces in the Medline record metadata, namely the Medical Subject Headings (MeSH) and the journal of publication (ISSN). The choice of the MeSH feature space is informed by a previous study <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>, in which classification using MeSH features performed well on PharmGKB citations. We describe the use of the classifier, present example cross validation results, and evaluate the classifier on a gold standard data set derived from an expert PubMed query.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Web interface workflow</p>
            </st>
            <p>The web interface, shown in Figure <figr fid="F1">1</figr>, takes as input a list of PubMed IDs representing the relevant training examples. In the case of a database curated from published literature, the PubMed IDs can be extracted from line-of-evidence annotations in the database itself. An existing domain-specific text mining corpus or bibliography may also serve as input. The classifier is then trained to calculate a support score for each distinct term in the feature space (see Methods and Table <tblr tid="T1">1</tblr>). It uses the input corpus to estimate term frequencies in relevant articles, and the remainder of Medline to estimate term frequencies in irrelevant articles. The remainder of Medline provides a reasonable approximation of the term frequencies in irrelevant articles, provided the frequency of relevant articles in Medline is low. The trained classifier then ranks each article in Medline by score (log of the odds of relevance) and returns the articles that score greater than 0, subject to an upper limit on the number of results.</p>
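            <p>The training and ranking steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MScanner implementation: the function names are hypothetical, the pseudocount is assumed to equal each feature's overall frequency in the combined collection, and a document's score is taken as the sum of the support scores of the features it contains, ignoring any contribution from absent features or class priors.</p>

```python
import math
from collections import Counter


def feature_counts(docs):
    """Count how many documents contain each feature (a MeSH term or an ISSN)."""
    counts = Counter()
    for features in docs:
        counts.update(set(features))
    return counts


def support_score(p_relevant, p_irrelevant):
    """Log of the ratio of feature occurrence probabilities in the two classes."""
    return math.log(p_relevant / p_irrelevant)


def train(relevant, background):
    """Return a support score for every feature seen in either collection."""
    rel_counts = feature_counts(relevant)
    bg_counts = feature_counts(background)
    n_rel, n_bg = len(relevant), len(background)
    scores = {}
    for feature in set(rel_counts) | set(bg_counts):
        # Assumption: the pseudocount z is the feature's overall frequency.
        z = (rel_counts[feature] + bg_counts[feature]) / (n_rel + n_bg)
        p_rel = (rel_counts[feature] + z) / (n_rel + 1)
        p_bg = (bg_counts[feature] + z) / (n_bg + 1)
        scores[feature] = support_score(p_rel, p_bg)
    return scores


def rank(scores, candidates, limit=10):
    """Sum feature scores per document; keep documents scoring above 0."""
    scored = []
    for doc_id, features in candidates.items():
        total = sum(scores.get(f, 0.0) for f in set(features))
        if total > 0:
            scored.append((total, doc_id))
    scored.sort(reverse=True)
    return scored[:limit]
```

            <p>As a check on the score definition, applying <it>support_score</it> to the probabilities in the first row of Table <tblr tid="T1">1</tblr> gives ln(3.5E-2 / 9.0E-6), which is approximately 8.27, in agreement with the tabulated score.</p>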
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Web interface</p>
               </caption>
               <text>
                  <p><b>Web interface</b>. Submission form for Medline retrieval and cross validation. Relevant training examples are provided as a list of PubMed IDs.</p>
               </text>
               <graphic file="1471-2105-9-108-1"/>
            </fig>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Feature scores for PG07.</p>
               </caption>
               <tblbdy cols="8">
                  <r>
                     <c ca="right">
                        <p>Score</p>
                     </c>
                     <c ca="right">
                        <p>
                           <it>R</it>
                        </p>
                     </c>
                     <c ca="right">
                        <p>
                           <inline-formula>
                              <m:math name="1471-2105-9-108-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                                 <m:semantics>
                                    <m:mover accent="true">
                                       <m:mi>R</m:mi>
                                       <m:mo>&#175;</m:mo>
                                    </m:mover>
                                    <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmOuaiLbaebaaaa@2D18@</m:annotation>
                                 </m:semantics>
                              </m:math>
                           </inline-formula>
                        </p>
                     </c>
                     <c ca="right">
                        <p><it>p</it>(<it>F</it><sub><it>i </it></sub>= 1|<it>R</it>)</p>
                     </c>
                     <c ca="right">
                        <p><it>p</it>(<it>F</it><sub><it>i </it></sub>= 1|<inline-formula><m:math name="1471-2105-9-108-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>R</m:mi><m:mo>&#175;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmOuaiLbaebaaaa@2D18@</m:annotation></m:semantics></m:math></inline-formula>)</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>z</it>
                           <sub>
                              <it>i</it>
                           </sub>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Type</p>
                     </c>
                     <c ca="left">
                        <p>Term String</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>8.27</p>
                     </c>
                     <c ca="right">
                        <p>56</p>
                     </c>
                     <c ca="right">
                        <p>140</p>
                     </c>
                     <c ca="right">
                        <p>3.5E-2</p>
                     </c>
                     <c ca="right">
                        <p>9.0E-6</p>
                     </c>
                     <c ca="center">
                        <p>1.3E-5</p>
                     </c>
                     <c ca="left">
                        <p>issn</p>
                     </c>
                     <c ca="left">
                        <p>1744-6872 (Pharmacogenet. Genomics)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>7.36</p>
                     </c>
                     <c ca="right">
                        <p>137</p>
                     </c>
                     <c ca="right">
                        <p>855</p>
                     </c>
                     <c ca="right">
                        <p>8.6E-2</p>
                     </c>
                     <c ca="right">
                        <p>5.5E-5</p>
                     </c>
                     <c ca="center">
                        <p>6.4E-5</p>
                     </c>
                     <c ca="left">
                        <p>issn</p>
                     </c>
                     <c ca="left">
                        <p>0960-314X (Pharmacogenetics)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>7.24</p>
                     </c>
                     <c ca="right">
                        <p>41</p>
                     </c>
                     <c ca="right">
                        <p>287</p>
                     </c>
                     <c ca="right">
                        <p>2.6E-2</p>
                     </c>
                     <c ca="right">
                        <p>1.8E-5</p>
                     </c>
                     <c ca="center">
                        <p>2.1E-5</p>
                     </c>
                     <c ca="left">
                        <p>issn</p>
                     </c>
                     <c ca="left">
                        <p>1470-269X (Pharmacogenomics J.)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>6.85</p>
                     </c>
                     <c ca="right">
                        <p>6</p>
                     </c>
                     <c ca="right">
                        <p>62</p>
                     </c>
                     <c ca="right">
                        <p>3.8E-3</p>
                     </c>
                     <c ca="right">
                        <p>4.0E-6</p>
                     </c>
                     <c ca="center">
                        <p>4.4E-6</p>
                     </c>
                     <c ca="left">
                        <p>mesh</p>
                     </c>
                     <c ca="left">
                        <p>Organic Anion Transport Polypeptide C</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>5.95</p>
                     </c>
                     <c ca="right">
                        <p>20</p>
                     </c>
                     <c ca="right">
                        <p>509</p>
                     </c>
                     <c ca="right">
                        <p>1.3E-2</p>
                     </c>
                     <c ca="right">
                        <p>3.3E-5</p>
                     </c>
                     <c ca="center">
                        <p>3.4E-5</p>
                     </c>
                     <c ca="left">
                        <p>issn</p>
                     </c>
                     <c ca="left">
                        <p>1462-2416 (Pharmacogenomics)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>5.88</p>
                     </c>
                     <c ca="right">
                        <p>31</p>
                     </c>
                     <c ca="right">
                        <p>847</p>
                     </c>
                     <c ca="right">
                        <p>1.9E-2</p>
                     </c>
                     <c ca="right">
                        <p>5.4E-5</p>
                     </c>
                     <c ca="center">
                        <p>5.6E-5</p>
                     </c>
                     <c ca="left">
                        <p>mesh</p>
                     </c>
                     <c ca="left">
                        <p>Steroid 16-alpha-Hydroxylase</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>5.84</p>
                     </c>
                     <c ca="right">
                        <p>70</p>
                     </c>
                     <c ca="right">
                        <p>1986</p>
                     </c>
                     <c ca="right">
                        <p>4.4E-2</p>
                     </c>
                     <c ca="right">
                        <p>1.3E-4</p>
                     </c>
                     <c ca="center">
                        <p>1.3E-4</p>
                     </c>
                     <c ca="left">
                        <p>mesh</p>
                     </c>
                     <c ca="left">
                        <p>Cytochrome P-450 CYP2D6</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>5.84</p>
                     </c>
                     <c ca="right">
                        <p>2</p>
                     </c>
                     <c ca="right">
                        <p>57</p>
                     </c>
                     <c ca="right">
                        <p>1.3E-3</p>
                     </c>
                     <c ca="right">
                        <p>3.7E-6</p>
                     </c>
                     <c ca="center">
                        <p>3.8E-6</p>
                     </c>
                     <c ca="left">
                        <p>mesh</p>
                     </c>
                     <c ca="left">
                        <p>Glucuronic Acids</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>5.79</p>
                     </c>
                     <c ca="right">
                        <p>13</p>
                     </c>
                     <c ca="right">
                        <p>390</p>
                     </c>
                     <c ca="right">
                        <p>8.2E-3</p>
                     </c>
                     <c ca="right">
                        <p>2.5E-5</p>
                     </c>
                     <c ca="center">
                        <p>2.6E-5</p>
                     </c>
                     <c ca="left">
                        <p>mesh</p>
                     </c>
                     <c ca="left">
                        <p>Mephenytoin</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>5.78</p>
                     </c>
                     <c ca="right">
                        <p>114</p>
                     </c>
                     <c ca="right">
                        <p>3434</p>
                     </c>
                     <c ca="right">
                        <p>7.1E-2</p>
                     </c>
                     <c ca="right">
                        <p>2.2E-4</p>
                     </c>
                     <c ca="center">
                        <p>2.3E-4</p>
                     </c>
                     <c ca="left">
                        <p>mesh</p>
                     </c>
                     <c ca="left">
                        <p>Pharmacogenetics</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>5.69</p>
                     </c>
                     <c ca="right">
                        <p>1</p>
                     </c>
                     <c ca="right">
                        <p>33</p>
                     </c>
                     <c ca="right">
                        <p>6.3E-4</p>
                     </c>
                     <c ca="right">
                        <p>2.1E-6</p>
                     </c>
                     <c ca="center">
                        <p>2.2E-6</p>
                     </c>
                     <c ca="left">
                        <p>mesh</p>
                     </c>
                     <c ca="left">
                        <p>Methenyltetrahydrofolate Cyclohydrolase</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>5.54</p>
                     </c>
                     <c ca="right">
                        <p>7</p>
                     </c>
                     <c ca="right">
                        <p>268</p>
                     </c>
                     <c ca="right">
                        <p>4.4E-3</p>
                     </c>
                     <c ca="right">
                        <p>1.7E-5</p>
                     </c>
                     <c ca="center">
                        <p>1.8E-5</p>
                     </c>
                     <c ca="left">
                        <p>mesh</p>
                     </c>
                     <c ca="left">
                        <p>Xeroderma Pigmentosum Group D Protein</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>5.53</p>
                     </c>
                     <c ca="right">
                        <p>2</p>
                     </c>
                     <c ca="right">
                        <p>78</p>
                     </c>
                     <c ca="right">
                        <p>1.3E-3</p>
                     </c>
                     <c ca="right">
                        <p>5.0E-6</p>
                     </c>
                     <c ca="center">
                        <p>5.1E-6</p>
                     </c>
                     <c ca="left">
                        <p>mesh</p>
                     </c>
                     <c ca="left">
                        <p>Methylthioinosine</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>5.42</p>
                     </c>
                     <c ca="right">
                        <p>5</p>
                     </c>
                     <c ca="right">
                        <p>216</p>
                     </c>
                     <c ca="right">
                        <p>3.1E-3</p>
                     </c>
                     <c ca="right">
                        <p>1.4E-5</p>
                     </c>
                     <c ca="center">
                        <p>1.4E-5</p>
                     </c>
                     <c ca="left">
                        <p>mesh</p>
                     </c>
                     <c ca="left">
                        <p>Organic Anion Transporters, Sodium-Independent</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>The support scores for feature occurrence, for retrieval using PG07. "<it>R</it>" denotes |<it>R</it> &#8745; Fi = 1| and "<inline-formula><m:math name="1471-2105-9-108-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>R</m:mi><m:mo>&#175;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmOuaiLbaebaaaa@2D18@</m:annotation></m:semantics></m:math></inline-formula>" denotes |<inline-formula><m:math name="1471-2105-9-108-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>R</m:mi><m:mo>&#175;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmOuaiLbaebaaaa@2D18@</m:annotation></m:semantics></m:math></inline-formula> &#8745; Fi = 1|, which are the numbers of example and rest-of-Medline articles with each feature. p(Fi = 1|<it>R</it>) and p(Fi = 1|<inline-formula><m:math name="1471-2105-9-108-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>R</m:mi><m:mo>&#175;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmOuaiLbaebaaaa@2D18@</m:annotation></m:semantics></m:math></inline-formula>) are the posterior probabilities for feature occurrence, using zi as the prior count.</p>
               </tblfn>
            </tbl>
            <p>The results pages, an example of which is shown in Figure <figr fid="F2">2</figr>, contain full abstracts and links to PubMed, and feature a JavaScript application for instantaneous filtering and sorting by different fields. The pages also have a facility for manually marking relevant abstracts to open in PubMed or save to disk. The complete output directory can be downloaded as a zip file. Additionally, the front output page lists the MeSH terms with the greatest Term Frequency/Inverse Document Frequency <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>, which provides potentially useful information about the nature of the input set and suggests useful keywords to use with a search engine. In total, the whole-Medline classification takes 60&#8211;90 seconds to return results on a Sun Fire 280R, which is comparable to web services such as NCBI BLAST. The core step of calculating classifier scores for the over 16 million articles in Medline has been optimised down to 32 seconds.</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Results page</p>
               </caption>
               <text>
                  <p><b>Results page</b>. The first page of results when trained on the PG07 corpus. The page contains JavaScript for sorting and searching within results, saving manual selections to disk and opening selected results in PubMed.</p>
               </text>
               <graphic file="1471-2105-9-108-2"/>
            </fig>
            <p>The submission form allows some of the classifier parameters to be adjusted. These include setting an upper limit on the number of results, or restricting Medline to records completed after a particular date (useful when monitoring for new results). More specialised options include the estimated fraction of relevant articles in Medline (prevalence), and the minimum score to classify an article as relevant. Higher estimated prevalence produces more results by raising the prior probability of relevance (see Methods), while higher prediction thresholds return fewer results, for greater overall precision at the cost of recall.</p>
         </sec>
         <sec>
            <st>
               <p>Cross validation protocol</p>
            </st>
            <p>The web interface provides a 10-fold cross validation function. The input examples are used as the relevant corpus, and up to 100,000 PubMed IDs are selected at random from the remainder of Medline to approximate an irrelevant corpus. In each round of cross validation, 90% of the data is used to estimate term frequencies, and the trained classifier is used to calculate article scores for the remaining 10%. Graphs derived from the cross validated scores include article score distributions, the ROC curve <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> and the curve of precision as a function of recall. Metrics include area under ROC and average precision <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>.</p>
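            <p>The fold assignment described above can be sketched as follows. This is a minimal illustration rather than the MScanner implementation: the function name cross_validation_folds, the fixed random seed and the interleaved slicing scheme are assumptions made for this example, and in practice the split would be applied separately to the relevant and irrelevant corpora so that each fold keeps roughly the same prevalence.</p>

```python
import random


def cross_validation_folds(pmids, folds=10, seed=0):
    """Yield (training, held_out) lists of PubMed IDs, one pair per fold.

    Each ID is held out exactly once. With folds=10, about 90% of the
    IDs are used for training in every round and about 10% are scored.
    """
    shuffled = list(pmids)
    # Shuffle with a private generator so the split is reproducible.
    random.Random(seed).shuffle(shuffled)
    for k in range(folds):
        # Every folds-th item, starting at offset k, forms the held-out set.
        held_out = shuffled[k::folds]
        training = [p for i, p in enumerate(shuffled) if i % folds != k]
        yield training, held_out
```

            <p>A caller would estimate term frequencies from each training list, score the corresponding held-out list, and pool the held-out scores across all ten rounds before computing the graphs and metrics.</p>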
            <p>Below, we applied cross validation to training examples from three topics (detailed in Methods) and one control corpus, to illustrate different use cases. The PG07 corpus consists of 1,663 pharmacogenetics articles, for the use case of curating a domain-specific database. The AIDSBio corpus consists of 10,727 articles about AIDS and bioethics, for the case of approximating a complex query or extending a text mining corpus. The Radiology corpus consists of 67 articles focusing on splenic imaging, for the case of extending a personal bibliography. The Control corpus consists of 10,000 randomly selected citations, and exists to demonstrate worst-case performance when the input has the same term distribution as Medline. We derived the irrelevant corpus for each topic from a single corpus, Medline100K, of 100,000 random Medline records. For each topic, we created the irrelevant corpus by taking Medline100K and subtracting any overlap with the relevant training examples. This differs from the web interface, which generates an independent irrelevant corpus every time it is used. A summary of the cross validation statistics for the sample topics is presented in Table <tblr tid="T2">2</tblr>.</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Cross validation statistics.</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c ca="left">
                        <p>
                           <b>Statistic</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>PG07</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>Radiology</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>AIDSBio</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>Control</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b># Relevant</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>1663</p>
                     </c>
                     <c ca="right">
                        <p>67</p>
                     </c>
                     <c ca="right">
                        <p>10727</p>
                     </c>
                     <c ca="right">
                        <p>10000</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b># Irrelevant</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>99986</p>
                     </c>
                     <c ca="right">
                        <p>100000</p>
                     </c>
                     <c ca="right">
                        <p>99927</p>
                     </c>
                     <c ca="right">
                        <p>99955</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Prevalence</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>0.01636</p>
                     </c>
                     <c ca="right">
                        <p>0.00067</p>
                     </c>
                     <c ca="right">
                        <p>0.09702</p>
                     </c>
                     <c ca="right">
                        <p>0.09095</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>ROC Area</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>0.9754</p>
                     </c>
                     <c ca="right">
                        <p>0.9923</p>
                     </c>
                     <c ca="right">
                        <p>0.9913</p>
                     </c>
                     <c ca="right">
                        <p>0.4975</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>ROC Std Error</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>0.0020</p>
                     </c>
                     <c ca="right">
                        <p>0.0047</p>
                     </c>
                     <c ca="right">
                        <p>0.0004</p>
                     </c>
                     <c ca="right">
                        <p>0.0030</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Averaged Precision</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>0.693</p>
                     </c>
                     <c ca="right">
                        <p>0.711</p>
                     </c>
                     <c ca="right">
                        <p>0.924</p>
                     </c>
                     <c ca="right">
                        <p>0.090</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Break-Even</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>0.652</p>
                     </c>
                     <c ca="right">
                        <p>0.642</p>
                     </c>
                     <c ca="right">
                        <p>0.884</p>
                     </c>
                     <c ca="right">
                        <p>0.089</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Summary of the cross validation training sets and performance metrics. Prevalence is the fraction of the data that is relevant, and break-even is the point where cross validation precision equals recall.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Distributions of article scores</p>
            </st>
            <p>The article score distributions for relevant and irrelevant documents for each topic are shown in Figure <figr fid="F3">3</figr>. We have marked with a vertical line the score threshold that would result in equal precision and recall. The areas above the threshold represent the true and false positive rates, while areas below the threshold represent true and false negative rates <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. The low prevalence of relevant documents in Medline for a given topic of interest places stringent requirements on acceptable false positive rates when the classifier is applied to all of Medline. For example, a score threshold capturing 90% of relevant articles and 1% of irrelevant articles yields only 8% precision if relevant articles occur at a rate of one in a thousand. For our sample topics, the article score distributions for AIDSBio and Radiology were better separated from their irrelevant corpus than for PG07. As expected, the distribution of the Control corpus overlapped entirely with the irrelevant articles, indicating no ability to distinguish the control articles from Medline.</p>
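            <p>The arithmetic behind the 8% figure can be checked with a short calculation. The sketch below is illustrative only; the function name precision_at is an assumption made for this example. It treats recall as the true positive rate and weights the two rates by the prevalence of relevant articles.</p>

```python
def precision_at(recall: float, false_positive_rate: float, prevalence: float) -> float:
    """Precision implied by a recall, a false positive rate and a prevalence."""
    # Expected fraction of all articles that are relevant and retrieved.
    true_positives = recall * prevalence
    # Expected fraction of all articles that are irrelevant but retrieved.
    false_positives = false_positive_rate * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # 90% recall, 1% false positive rate, one relevant article per thousand.
    print(round(precision_at(0.90, 0.01, 0.001), 3))  # prints 0.083
```

            <p>With one relevant article per thousand, the roughly ten false positives retrieved per thousand articles outnumber the 0.9 true positives by about eleven to one, which is why precision falls to about 8%.</p>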
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Article score distributions</p>
               </caption>
               <text>
                  <p><b>Article score distributions</b>. For each topic, a pair of article score distributions is shown: one for the relevant articles (red curve) and one for the irrelevant articles (blue curve). The vertical lines mark the score threshold that has precision equal to recall in each case. Irrelevant articles were derived from Medline100K in each case.</p>
               </text>
               <graphic file="1471-2105-9-108-3"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Receiver Operating Characteristic</p>
            </st>
            <p>The ROC curve <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> for each topic is shown in Figure <figr fid="F4">4</figr>. We summarise the ROC using the area under curve (AUC) statistic, representing the probability that a randomly selected relevant article will be ranked above a randomly selected irrelevant article. We calculated the standard error of the AUC using the tabular method of Hanley <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. Worst-case performance was obtained for the Control corpus, as expected, with equal true and false positive rates and 0.5 falling within the standard error of the AUC. In the theoretical best case, all relevant articles would be retrieved before any false positives occur (top left corner of the graph). The AUC for PG07 in Table <tblr tid="T2">2</tblr> (0.9754 &#177; 0.0020) was significantly lower than the AUC for AIDSBio (0.9913 &#177; 0.0004) and Radiology (0.9923 &#177; 0.0047), which did not differ significantly. The lower AUC for PG07, and the poorer separation of its score distribution from Medline background, may be because pharmacogenetics articles discuss the interaction of a drug and a gene (requiring the use of relationship extraction <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>), which may not always be represented in the MeSH feature space.</p>
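            <p>The probabilistic reading of the AUC can be made concrete with a direct pairwise count. The sketch below is an illustration under stated assumptions, not the code used to produce Table 2: the function name roc_area is hypothetical, ties are counted as half a win, and the quadratic loop is only practical for small samples (a rank-based computation would be used for corpora of the size discussed here). It does not compute the standard error.</p>

```python
def roc_area(relevant_scores, irrelevant_scores):
    """Probability that a random relevant article outscores a random irrelevant one.

    Ties contribute one half, the usual convention for the
    rank-based estimate of the area under the ROC curve.
    """
    wins = 0.0
    for r in relevant_scores:
        for s in irrelevant_scores:
            if r > s:
                wins += 1.0
            elif r == s:
                wins += 0.5
    return wins / (len(relevant_scores) * len(irrelevant_scores))
```

            <p>A classifier that ranks every relevant article above every irrelevant one scores 1.0 under this definition, while scores drawn from the same distribution for both classes, as in the Control corpus, give a value near 0.5.</p>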
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Receiver Operating Characteristic</p>
               </caption>
               <text>
                  <p><b>Receiver Operating Characteristic</b>. ROC curves from cross validation of each sample topic against 100,000 irrelevant articles. Because Medline retrieval requires low false positive rates, we have shown the ROC curve up to 1% false positives on the right.</p>
               </text>
               <graphic file="1471-2105-9-108-4"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Precision under cross validation</p>
            </st>
            <p>We evaluated cross validation precision at different levels of recall in Figure <figr fid="F5">5</figr>, where the precision represents the fraction of articles above the prediction threshold that were relevant. To summarise the curve we evaluated precision at each point where a relevant article occurred and averaged over the relevant articles <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. The averaged precisions for AIDSBio, Radiology and PG07 in Table <tblr tid="T2">2</tblr> were 0.92, 0.71 and 0.69 respectively. As an overall summary, the Mean of Averaged Precisions (MAP) <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> across the three topics was 0.77. In Additional File <supplr sid="S1">1</supplr> we provide 11-point interpolated precision curves for these topics and for the IEDB tasks below, to facilitate future comparisons to our results. As expected for the Control corpus, precision at all thresholds was roughly equal to the 9% prevalence of relevant articles in the data. AIDSBio and Radiology had comparable ROC areas, but the averaged precision for Radiology was much lower than for AIDSBio. This is because prevalence (prior probability of relevance) is much lower for Radiology than AIDSBio: 0.067% vs 9.7% in Table <tblr tid="T2">2</tblr>. For a given recall and false positive rate, precision depends non-linearly on the ratio of relevant to irrelevant documents, while ROC is independent of that ratio <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>.</p>
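            <p>The averaged precision summary described above can be sketched in a few lines. This is a minimal illustration, assuming the cross validated articles have already been sorted by descending score and reduced to a list of relevance labels; the function name averaged_precision is an assumption made for this example.</p>

```python
def averaged_precision(ranked_labels):
    """Average of the precision values at the rank of each relevant article.

    ranked_labels holds one boolean per article, in descending score
    order, with True marking a relevant article.
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        if is_relevant:
            hits += 1
            # Precision among the top `rank` articles.
            total += hits / rank
    # Every relevant article appears in the ranking, so hits is their count.
    return total / hits if hits else 0.0
```

            <p>For the ranking relevant, irrelevant, relevant, the precisions at the two relevant articles are 1/1 and 2/3, giving an averaged precision of about 0.83. Because each term is a precision, the statistic inherits the dependence on prevalence discussed above.</p>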
            <suppl id="S1">
               <title>
                  <p>Additional file 1</p>
               </title>
               <text>
                  <p><b>11-point precision-recall curves</b>. 11pointcurves.pdf is a PDF file containing a table of 11-point interpolated precision curves for all experiments in the paper. The interpolated precision at a specified recall is the highest precision found for any value of recall greater than or equal to the specified recall.</p>
               </text>
               <file name="1471-2105-9-108-S1.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Precision as a function of recall</p>
               </caption>
               <text>
                  <p><b>Precision as a function of recall</b>. Cross validation precision as a function of recall for each sample topic. The points where precision equals recall occur at the intersection with the dotted diagonal line.</p>
               </text>
               <graphic file="1471-2105-9-108-5"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Performance in a retrieval situation</p>
            </st>
            <p>To evaluate classification performance in a retrieval situation we compared the performance of MScanner to the performance of an expert PubMed query that was used to identify articles for the Immune Epitope Database (IEDB). We made use of the 20,910 results of a sensitive expert query that had been manually split into 5,712 relevant and 15,198 irrelevant articles for the purpose of training the IEDB classifier <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. MeSH terms were available for 20,812 of the articles, of which 5,680 were relevant and 15,132 irrelevant. The final data set is provided in Additional File <supplr sid="S2">2</supplr>. To create training and testing corpora, we first restricted Medline to the 783,028 records completed in 2004, a year within the date ranges of all components of the IEDB query. For relevant training examples we used the 3,488 relevant IEDB results from before 2004, and we approximated irrelevant training examples using the whole of 2004 Medline. We then used the trained classifier to rank the articles in 2004 Medline.</p>
            <suppl id="S2">
               <title>
                  <p>Additional file 2</p>
               </title>
               <text>
                  <p><b>Corpora used in the IEDB comparison</b>. iedb.zip is a ZIP archive containing text files, where each line contains the PubMed ID and completion date of a Medline record. iedb-all-relevant.txt and iedb-all-irrelevant.txt are the relevant and irrelevant cross validation corpora used in the IEDB cross validation. iedb-pre2004-relevant.txt are the relevant training examples for the retrieval comparison. iedb-2004-relevant.txt and iedb-2004-irrelevant.txt are the manually evaluated IEDB query results from 2004 Medline. PubMed IDs for 2004 Medline may be obtained using the PubMed query 2004 [DateCompleted] AND medline [sb].</p>
               </text>
               <file name="1471-2105-9-108-S2.zip">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>We compared precision and recall as a function of rank for MScanner and the IEDB boolean query in Figure <figr fid="F6">6</figr>, for the task of retrieving IEDB-relevant citations from 2004 Medline. The IEDB query had 3,544 results in 2004 Medline, of which 1,089 had been judged relevant and 2,465 irrelevant, for 30.6% precision and 100% recall (since the data set was defined by the query). Since the IEDB query results were unranked, we assumed constant precision for plotting its curves. Up until about 900 results, MScanner recall and precision are above those of the IEDB query. At 3,544 results, MScanner's relative recall was 57% and its precision was 17.4%. Precisions after retrieving <it>N </it>results are as follows: P10 = 50%, P50 = 44%, P100 = 49%, P200 = 44% and P500 = 37%.</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Comparison to the IEDB expert query</p>
               </caption>
               <text>
                  <p><b>Comparison to the IEDB expert query</b>. Precision and recall as a function of rank, comparing MScanner and the IEDB query at the task of retrieving IEDB-relevant articles from 2004 Medline. MScanner was trained on the pre-2004 relevant results of the IEDB query.</p>
               </text>
               <graphic file="1471-2105-9-108-6"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Performance/speed trade-off</p>
            </st>
            <p>We also compared MScanner to the IEDB classifier on its cross validation data, to evaluate the trade-off between performance and speed. The IEDB uses a Na&#239;ve Bayes classifier with word features derived from a concatenation of abstract, authors, title, journal and MeSH, followed by an information gain feature selection step and extraction of domain-specific features (peptides and MHC alleles). Using cross-validation to calculate scores for the collection of 20,910 documents, the IEDB classifier obtained an area under ROC curve of 0.855, with a classification speed (after training) of 1,000 articles per 30 seconds. MScanner, using whole MeSH terms and ISSN features, obtained an area under ROC of 0.782 &#177; 0.003, with a classification speed of approximately 15 million articles per 30 seconds. However, the prior we used for frequency of term occurrence (see Methods) is designed for training data where the prevalence of relevant examples is low. The prevalence of 0.27 in the IEDB data is much higher than the prevalences in Table <tblr tid="T2">2</tblr>, and using the Laplace prior here would improve the ROC area to 0.825 &#177; 0.003 but degrade performance in cross validation against Medline100K. The remaining difference in ROC between MScanner and the IEDB classifier reflects information from the abstract and domain-specific features not captured by the MeSH feature space. All ROC AUC values on the IEDB data are much lower than in the sample cross validation topics. This is because it is more difficult to distinguish between relevant and irrelevant articles among the closely related articles resulting from an expert query, than to distinguish relevant articles from the rest of Medline.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <sec>
            <st>
               <p>Uses of supervised learning for Medline retrieval</p>
            </st>
            <p>Supervised learning has already been applied to the problem of database curation and the development of text mining resources. However, using a web service like MScanner to perform supervised learning is a simple operation compared to constructing a boolean filter, gold standard training set, and custom-built classifier. MScanner may supplement existing workflows that use a pre-filter query by detecting relevant articles inadvertently excluded by the filter. Another possibility is using MScanner in place of a filter query when one is unavailable, and confirming relevance by passing on the results to a stronger classifier or an information extraction method such as that used by the Database of Interacting Proteins <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. Supervised learning can also be used in other scenarios where relevant training examples are readily available and the presence of many relevant features hinders ad-hoc retrieval. For example, individual researchers could leverage the documents in a personal bibliography to identify additional articles relevant to their research interests.</p>
         </sec>
         <sec>
            <st>
               <p>Performance evaluation</p>
            </st>
            <p>MScanner's performance varies by topic, depending on the degree to which features are enriched or depleted in relevant articles compared to Medline. The relative performance on different corpora also depends on the evaluation metric used. For example, ROC performance on PG07 shows lower overall ability to distinguish pharmacogenetics articles from Medline, but the right hand sub-plot of Figure <figr fid="F4">4</figr> shows higher recall at low false positive rates on PG07 than AIDSBio. Besides the topic itself, the size of the training set can also influence performance. For the complex topics curated by databases, many relevant examples may be needed to obtain good coverage of terms indicating relevance. Narrower topics, such as the Radiology corpus, require fewer training examples to obtain good estimates of the frequencies of important terms. Too few training examples, however, will result in poor estimates of term frequencies (over-estimates, or failure of important terms to be represented), degrading performance. The use of a random set of Medline articles as the set of irrelevant articles in training (Medline100K in the use cases we presented) can also influence performance in cross validation. It can inflate the false positive rate to some extent because it contains relevant articles that are not part of the relevant training set.</p>
            <p>The score distributions for the Control corpus (Figure <figr fid="F3">3</figr>) were somewhat anomalous, with multiple narrow modes. This is due to the larger irrelevant corpus derived from Medline100K containing low-frequency features not present in the Control corpus. Each iteration of training therefore yielded many rare features with scores around -8 to -10. The four narrow peaks correspond to the chance presence of 0, 1, 2 or 3 of those features, which were influential because other features scored close to zero. In non-random corpora (AIDSBio, PG07 and Radiology), the other non-zero features dominate to produce broader unimodal distributions. Removing features unique to Medline100K reduced the Control distribution to the expected single narrow peak between -5 and +5.</p>
         </sec>
         <sec>
            <st>
               <p>Document representations</p>
            </st>
            <p>We represented Medline records as binary feature vectors derived from MeSH terms and journal ISSNs. These are separate feature spaces: a MeSH term and ISSN consisting of the same string would not be considered the same feature. Medline provides each MeSH term in a record as a descriptor in association with zero or more qualifiers, as in "Nevirapine/administration &amp; dosage". To reduce the dimensionality of the feature space we treat the descriptor and qualifier as separate features. We detected 24,069 distinct MeSH features in use, and 17,191 ISSN features, for an average of 13.5 features per record. The 2007 MeSH vocabulary comprises 24,357 descriptors and 83 qualifiers. Of the journals, about 5,000 are monitored by PubMed and the rest are represented by only a few records each. An advantage of the MeSH and ISSN feature spaces is that they allow a compact document representation using 16-bit features, which increases classification speed. MeSH is also a controlled vocabulary, and so does not have word sense ambiguities like free text. However, the vocabulary does not cover all concepts, and covers some areas of biology and medicine (such as medical terminology) more densely than others. Also, not every article has all relevant MeSH terms assigned, and there is a tendency for certain terms to be assigned to articles that just discuss the topic, such as articles "about dental research" rather than dental research articles themselves <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>.</p>
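            <p>As a rough illustration of this representation, the following sketch maps a record to a sorted array of 16-bit feature identifiers. It is not taken from the MScanner source code: the names FeatureIndex and encode_record, the slash-separated heading strings, and the choice to file qualifiers in the same "mesh" space as descriptors are all assumptions made for this example.</p>

```python
from array import array


class FeatureIndex:
    """Assigns a small integer identifier to each (space, name) feature."""

    def __init__(self):
        self._ids = {}

    def id_for(self, space, name):
        key = (space, name)
        if key not in self._ids:
            # Identifiers must fit in an unsigned 16-bit integer.
            if len(self._ids) == 65536:
                raise OverflowError("more than 65536 distinct features")
            self._ids[key] = len(self._ids)
        return self._ids[key]


def encode_record(index, mesh_headings, issn):
    """Encode one record as a sorted array of 16-bit feature identifiers."""
    ids = set()
    for heading in mesh_headings:
        # The descriptor and each qualifier become separate features.
        for part in heading.split("/"):
            ids.add(index.id_for("mesh", part.strip()))
    if issn:
        # The space is part of the key, so an ISSN never collides with a MeSH term.
        ids.add(index.id_for("issn", issn))
    return array("H", sorted(ids))
```

            <p>Keying each feature on its space as well as its string keeps the MeSH and ISSN spaces separate. Because the 24,069 MeSH features and 17,191 ISSN features reported above total 41,260, every identifier fits below the 16-bit limit of 65,536 distinct values.</p>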
            <p>Performance can be improved by adding a further space of binary features derived from the title and abstract of the document. Not relying solely on MeSH features would also enable classification of Medline records that have not yet been assigned MeSH descriptors. The additional features would, however, reduce classification speed due to larger document representations, introduce redundancy with the MeSH feature space, and require a feature selection step. The IEDB classifier <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> avoids redundancy by concatenating the abstract with the MeSH terms and using a single feature space of text words. Binary features should model short abstracts relatively well, although performance on longer texts is known to benefit from considering multiple occurrences of terms <abbrgrp><abbr bid="B35">35</abbr><abbr bid="B36">36</abbr></abbrgrp>.</p>
            <p>MeSH annotations and journal ISSNs are domain-specific resources in the biomedical literature. The articles cited by a given article (although not provided in Medline) are another domain-specific resource that may prove useful in retrieval tasks, in addition to their uses in navigating the citation network. For example, the overlap in citation lists has been used as a benchmark for article relatedness <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. In supervised learning, it may be possible to incorporate the number of co-citations between a document and relevant articles, or to use the citing of an article as a binary feature.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>MScanner inductively learns topics of interest from example citations, with the aim of retrieving a large number of topical citations more effectively than with boolean queries. It represents an advance on previous tools for Medline classification by performing well across a range of topics and input sizes, by making available implementation source code, and by operating on all of Medline fast enough to use over a web interface. As a non-domain-specific classifier, it has a facility for performing cross validation to obtain ROC and precision statistics on new inputs. MScanner should be useful as a filter for database curation where a sensitive filter query and customised classifier are not already available, and in general for constructing large bibliographies, text mining corpora and other domain-specific Medline subsets.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Bayesian classification</p>
            </st>
            <p>MScanner uses a Na&#239;ve Bayes classifier, which places documents in the class with the greatest posterior probability, and is derived by assuming that feature occurrences are conditionally independent with respect to the class variable. In the multivariate Bernoulli document model <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>, each document is represented as a binary vector, <it>f </it>= (<it>f</it><sub>1</sub>, <it>f</it><sub>2</sub>,...,<it>f</it><sub><it>k</it></sub>), with 1 or 0 specifying the presence or absence of each feature. The score of the article is the logarithm of the posterior probability ratio for the article being relevant versus irrelevant, which reduces to a sum of feature support scores and a prior score:</p>
            <p>
               <display-formula>
                  <m:math name="1471-2105-9-108-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mtable columnalign="left">
                              <m:mtr columnalign="left">
                                 <m:mtd columnalign="left">
                                    <m:mrow>
                                       <m:mi>S</m:mi>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>f</m:mi>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:mtd>
                                 <m:mtd columnalign="left">
                                    <m:mo>=</m:mo>
                                 </m:mtd>
                                 <m:mtd columnalign="left">
                                    <m:mrow>
                                       <m:mi>log</m:mi>
                                       <m:mo>&#8289;</m:mo>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mi>p</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>F</m:mi>
                                             <m:mo>=</m:mo>
                                             <m:mi>f</m:mi>
                                             <m:mo>|</m:mo>
                                             <m:mi>R</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mi>p</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>R</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mi>p</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>F</m:mi>
                                             <m:mo>=</m:mo>
                                             <m:mi>f</m:mi>
                                             <m:mo>|</m:mo>
                                             <m:mover accent="true">
                                                <m:mi>R</m:mi>
                                                <m:mo>&#175;</m:mo>
                                             </m:mover>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mi>p</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mover accent="true">
                                                <m:mi>R</m:mi>
                                                <m:mo>&#175;</m:mo>
                                             </m:mover>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                 </m:mtd>
                              </m:mtr>
                              <m:mtr columnalign="left">
                                 <m:mtd columnalign="left">
                                    <m:mrow/>
                                 </m:mtd>
                                 <m:mtd columnalign="left">
                                    <m:mo>=</m:mo>
                                 </m:mtd>
                                 <m:mtd columnalign="left">
                                    <m:mrow>
                                       <m:mi>log</m:mi>
                                       <m:mo>&#8289;</m:mo>
                                       <m:mstyle displaystyle="true">
                                          <m:munderover>
                                             <m:mo>&#8719;</m:mo>
                                             <m:mrow>
                                                <m:mi>i</m:mi>
                                                <m:mo>=</m:mo>
                                                <m:mn>1</m:mn>
                                             </m:mrow>
                                             <m:mi>k</m:mi>
                                          </m:munderover>
                                          <m:mrow>
                                             <m:mfrac>
                                                <m:mrow>
                                                   <m:mi>p</m:mi>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:msub>
                                                      <m:mi>F</m:mi>
                                                      <m:mi>i</m:mi>
                                                   </m:msub>
                                                   <m:mo>=</m:mo>
                                                   <m:msub>
                                                      <m:mi>f</m:mi>
                                                      <m:mi>i</m:mi>
                                                   </m:msub>
                                                   <m:mo>|</m:mo>
                                                   <m:mi>R</m:mi>
                                                   <m:mo stretchy="false">)</m:mo>
                                                </m:mrow>
                                                <m:mrow>
                                                   <m:mi>p</m:mi>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:msub>
                                                      <m:mi>F</m:mi>
                                                      <m:mi>i</m:mi>
                                                   </m:msub>
                                                   <m:mo>=</m:mo>
                                                   <m:msub>
                                                      <m:mi>f</m:mi>
                                                      <m:mi>i</m:mi>
                                                   </m:msub>
                                                   <m:mo>|</m:mo>
                                                   <m:mover accent="true">
                                                      <m:mi>R</m:mi>
                                                      <m:mo>&#175;</m:mo>
                                                   </m:mover>
                                                   <m:mo stretchy="false">)</m:mo>
                                                </m:mrow>
                                             </m:mfrac>
                                             <m:mo>+</m:mo>
                                             <m:mi>log</m:mi>
                                             <m:mo>&#8289;</m:mo>
                                             <m:mfrac>
                                                <m:mrow>
                                                   <m:mi>p</m:mi>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:mi>R</m:mi>
                                                   <m:mo stretchy="false">)</m:mo>
                                                </m:mrow>
                                                <m:mrow>
                                                   <m:mi>p</m:mi>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:mover accent="true">
                                                      <m:mi>R</m:mi>
                                                      <m:mo>&#175;</m:mo>
                                                   </m:mover>
                                                   <m:mo stretchy="false">)</m:mo>
                                                </m:mrow>
                                             </m:mfrac>
                                          </m:mrow>
                                       </m:mstyle>
                                    </m:mrow>
                                 </m:mtd>
                              </m:mtr>
                              <m:mtr columnalign="left">
                                 <m:mtd columnalign="left">
                                    <m:mrow/>
                                 </m:mtd>
                                 <m:mtd columnalign="left">
                                    <m:mo>=</m:mo>
                                 </m:mtd>
                                 <m:mtd columnalign="left">
                                    <m:mrow>
                                       <m:mstyle displaystyle="true">
                                          <m:munderover>
                                             <m:mo>&#8721;</m:mo>
                                             <m:mrow>
                                                <m:mi>i</m:mi>
                                                <m:mo>=</m:mo>
                                                <m:mn>1</m:mn>
                                             </m:mrow>
                                             <m:mi>k</m:mi>
                                          </m:munderover>
                                          <m:mrow>
                                             <m:mi>Y</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:msub>
                                                <m:mi>F</m:mi>
                                                <m:mi>i</m:mi>
                                             </m:msub>
                                             <m:mo>=</m:mo>
                                             <m:msub>
                                                <m:mi>f</m:mi>
                                                <m:mi>i</m:mi>
                                             </m:msub>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo>+</m:mo>
                                             <m:mi>log</m:mi>
                                             <m:mo>&#8289;</m:mo>
                                             <m:mfrac>
                                                <m:mrow>
                                                   <m:mi>p</m:mi>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:mi>R</m:mi>
                                                   <m:mo stretchy="false">)</m:mo>
                                                </m:mrow>
                                                <m:mrow>
                                                   <m:mn>1</m:mn>
                                                   <m:mo>&#8722;</m:mo>
                                                   <m:mi>p</m:mi>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:mi>R</m:mi>
                                                   <m:mo stretchy="false">)</m:mo>
                                                </m:mrow>
                                             </m:mfrac>
                                          </m:mrow>
                                       </m:mstyle>
                                    </m:mrow>
                                 </m:mtd>
                              </m:mtr>
                           </m:mtable>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaqbaeaabmWaaaqaaiabdofatjabcIcaOiabdAgaMjabcMcaPaqaaiabg2da9aqaaiGbcYgaSjabc+gaVjabcEgaNLqbaoaalaaabaGaemiCaaNaeiikaGIaemOrayKaeyypa0JaemOzayMaeiiFaWNaemOuaiLaeiykaKIaemiCaaNaeiikaGIaemOuaiLaeiykaKcabaGaemiCaaNaeiikaGIaemOrayKaeyypa0JaemOzayMaeiiFaWNafmOuaiLbaebacqGGPaqkcqWGWbaCcqGGOaakcuWGsbGugaqeaiabcMcaPaaaaOqaaaqaaiabg2da9aqaaiGbcYgaSjabc+gaVjabcEgaNnaarahabaqcfa4aaSaaaeaacqWGWbaCcqGGOaakcqWGgbGrdaWgaaqaaiabdMgaPbqabaGaeyypa0JaemOzay2aaSbaaeaacqWGPbqAaeqaaiabcYha8jabdkfasjabcMcaPaqaaiabdchaWjabcIcaOiabdAeagnaaBaaabaGaemyAaKgabeaacqGH9aqpcqWGMbGzdaWgaaqaaiabdMgaPbqabaGaeiiFaWNafmOuaiLbaebacqGGPaqkaaGccqGHRaWkcyGGSbaBcqGGVbWBcqGGNbWzjuaGdaWcaaqaaiabdchaWjabcIcaOiabdkfasjabcMcaPaqaaiabdchaWjabcIcaOiqbdkfaszaaraGaeiykaKcaaaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaem4AaSganiabg+GivdaakeaaaeaacqGH9aqpaeaadaaeWbqaaiabdMfazjabcIcaOiabdAeagnaaBaaaleaacqWGPbqAaeqaaOGaeyypa0JaemOzay2aaSbaaSqaaiabdMgaPbqabaGccqGGPaqkcqGHRaWkcyGGSbaBcqGGVbWBcqGGNbWzjuaGdaWcaaqaaiabdchaWjabcIcaOiabdkfasjabcMcaPaqaaiabigdaXiabgkHiTiabdchaWjabcIcaOiabdkfasjabcMcaPaaaaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabdUgaRbqdcqGHris5aaaaaaa@A660@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>The feature support scores <abbrgrp><abbr bid="B37">37</abbr></abbrgrp> are:</p>
            <p>
               <display-formula>
                  <m:math name="1471-2105-9-108-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>Y</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:msub>
                              <m:mi>F</m:mi>
                              <m:mi>i</m:mi>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:msub>
                              <m:mi>f</m:mi>
                              <m:mi>i</m:mi>
                           </m:msub>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mi>log</m:mi>
                           <m:mo>&#8289;</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mi>p</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:msub>
                                    <m:mi>F</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                                 <m:mo>=</m:mo>
                                 <m:msub>
                                    <m:mi>f</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                                 <m:mo>|</m:mo>
                                 <m:mi>R</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>p</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:msub>
                                    <m:mi>F</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                                 <m:mo>=</m:mo>
                                 <m:msub>
                                    <m:mi>f</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                                 <m:mo>|</m:mo>
                                 <m:mover accent="true">
                                    <m:mi>R</m:mi>
                                    <m:mo>&#175;</m:mo>
                                 </m:mover>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemywaKLaeiikaGIaemOray0aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpcqWGMbGzdaWgaaWcbaGaemyAaKgabeaakiabcMcaPiabg2da9iGbcYgaSjabc+gaVjabcEgaNLqbaoaalaaabaGaemiCaaNaeiikaGIaemOray0aaSbaaeaacqWGPbqAaeqaaiabg2da9iabdAgaMnaaBaaabaGaemyAaKgabeaacqGG8baFcqWGsbGucqGGPaqkaeaacqWGWbaCcqGGOaakcqWGgbGrdaWgaaqaaiabdMgaPbqabaGaeyypa0JaemOzay2aaSbaaeaacqWGPbqAaeqaaiabcYha8jqbdkfaszaaraGaeiykaKcaaaaa@53DA@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>The greatest support scores for occurring features are shown in Table <tblr tid="T1">1</tblr>, when the classifier has been trained to perform PG07 retrieval. For computational efficiency, the non-occurrence support scores, <it>Y</it>(<it>F</it><sub><it>i </it></sub>= 0), are simplified to a base score (of an article with no features) and a small adjustment for each feature that occurs.</p>
            <p>We estimate the prior probability of relevance <it>p</it>(<it>R</it>) using the number of training examples divided by the number of articles in Medline, and the classifier predicts relevance for articles with <it>S</it>(<it>f</it>) &#8805; 0. The prior and minimum score for predicting relevance may also be set on the web interface.</p>
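            <p>The score and the base-score simplification described above can be sketched as follows. This is a minimal illustration rather than the MScanner implementation: the function names are hypothetical and the probabilities and training-set size are made-up values. It checks that summing a support score for every feature gives the same result as starting from the score of an article with no features and adding one adjustment per occurring feature.</p>

```python
import math


def support_scores(p_rel, p_irr):
    """Per-feature support scores for occurrence and non-occurrence.

    p_rel[i] and p_irr[i] are p(F_i = 1 | R) and p(F_i = 1 | not R).
    """
    present = [math.log(a / b) for a, b in zip(p_rel, p_irr)]
    absent = [math.log((1 - a) / (1 - b)) for a, b in zip(p_rel, p_irr)]
    return present, absent


def score_direct(features, present, absent, prior):
    """S(f): sum of Y(F_i = f_i) over all k features plus the prior log odds."""
    total = math.log(prior / (1 - prior))
    for i in range(len(present)):
        total += present[i] if i in features else absent[i]
    return total


def score_fast(features, present, absent, prior):
    """Same score: base score plus one adjustment per occurring feature."""
    base = sum(absent) + math.log(prior / (1 - prior))
    return base + sum(present[i] - absent[i] for i in features)


p_rel = [0.60, 0.10, 0.02]  # made-up p(F_i = 1 | R)
p_irr = [0.01, 0.10, 0.05]  # made-up p(F_i = 1 | not R)
present, absent = support_scores(p_rel, p_irr)
prior = 1000.0 / 16000000   # made-up training set size / articles in Medline
doc = {0, 1}                # indices of the features that occur in the article
s_direct = score_direct(doc, present, absent, prior)
s_fast = score_fast(doc, present, absent, prior)
is_relevant = s_fast >= 0   # relevance is predicted for S(f) >= 0
```

            <p>The second form touches only the features that occur in an article, so the cost per article depends on the handful of features it has rather than on the size of the feature space.</p>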
         </sec>
         <sec>
            <st>
               <p>Estimation of feature frequencies</p>
            </st>
            <p>We use posterior estimates for <it>p</it>(<it>F</it><sub><it>i </it></sub>= <it>f</it><sub><it>i</it></sub>|<it>R</it>) and <it>p</it>(<it>F</it><sub><it>i </it></sub>= <it>f</it><sub><it>i</it></sub>|<inline-formula><m:math name="1471-2105-9-108-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>R</m:mi><m:mo>&#175;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmOuaiLbaebaaaa@2D18@</m:annotation></m:semantics></m:math></inline-formula>). We choose the prior probability <it>z</it><sub><it>i </it></sub>of observing the feature to be the fraction of articles in all of Medline in which the feature occurs. The weight of the prior is equivalent to one article worth of evidence, resulting in probabilities of the following form:</p>
            <p>
               <display-formula>
                  <m:math name="1471-2105-9-108-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>p</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:msub>
                              <m:mi>F</m:mi>
                              <m:mi>i</m:mi>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:mn>1</m:mn>
                           <m:mo>|</m:mo>
                           <m:mi>R</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mi>p</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>R</m:mi>
                                 <m:mo>&#8745;</m:mo>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:msub>
                                    <m:mi>F</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                                 <m:mo>=</m:mo>
                                 <m:mn>1</m:mn>
                                 <m:mo stretchy="false">)</m:mo>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>p</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>R</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                           </m:mfrac>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mo>|</m:mo>
                                 <m:mi>R</m:mi>
                                 <m:mo>&#8745;</m:mo>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:msub>
                                    <m:mi>F</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                                 <m:mo>=</m:mo>
                                 <m:mn>1</m:mn>
                                 <m:mo stretchy="false">)</m:mo>
                                 <m:mo>|</m:mo>
                                 <m:mo>+</m:mo>
                                 <m:msub>
                                    <m:mi>z</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                              </m:mrow>
                              <m:mrow>
                                 <m:mo>|</m:mo>
                                 <m:mi>R</m:mi>
                                 <m:mo>|</m:mo>
                                 <m:mo>+</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemiCaaNaeiikaGIaemOray0aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpcqaIXaqmcqGG8baFcqWGsbGucqGGPaqkcqGH9aqpjuaGdaWcaaqaaiabdchaWjabcIcaOiabdkfasjabgMIihlabcIcaOiabdAeagnaaBaaabaGaemyAaKgabeaacqGH9aqpcqaIXaqmcqGGPaqkcqGGPaqkaeaacqWGWbaCcqGGOaakcqWGsbGucqGGPaqkaaGccqGH9aqpjuaGdaWcaaqaaiabcYha8jabdkfasjabgMIihlabcIcaOiabdAeagnaaBaaabaGaemyAaKgabeaacqGH9aqpcqaIXaqmcqGGPaqkcqGG8baFcqGHRaWkcqWG6bGEdaWgaaqaaiabdMgaPbqabaaabaGaeiiFaWNaemOuaiLaeiiFaWNaey4kaSIaeGymaedaaaaa@601C@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>And similarly for <it>p</it>(<it>F</it><sub><it>i </it></sub>= 1|<inline-formula><m:math name="1471-2105-9-108-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>R</m:mi><m:mo>&#175;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmOuaiLbaebaaaa@2D18@</m:annotation></m:semantics></m:math></inline-formula>). Probabilities for non-occurrence of features are of the form <it>p</it>(<it>F</it><sub><it>i </it></sub>= 0|<it>R</it>) = 1 - <it>p</it>(<it>F</it><sub><it>i </it></sub>= 1|<it>R</it>). Bayesian classifiers normally use a Laplace prior <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>, which specifies one prior success and one prior failure for each feature. However, the Laplace prior performs poorly here because of class skew in the training data: when irrelevant articles greatly outnumber relevant ones it over-estimates <it>P</it>(<it>F</it><sub><it>i </it></sub>= 1|<it>R</it>) relative to <it>P</it>(<it>F</it><sub><it>i </it></sub>= 1|<inline-formula><m:math name="1471-2105-9-108-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>R</m:mi><m:mo>&#175;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGafmOuaiLbaebaaaa@2D18@</m:annotation></m:semantics></m:math></inline-formula>), in particular for terms not observed in any relevant examples.</p>
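            <p>The effect of the two priors on a feature that never occurs in the relevant examples can be checked with a small numerical sketch. The counts and background frequency are invented for illustration, and the function names are hypothetical.</p>

```python
def posterior_estimate(count, n_docs, z):
    """p(F_i = 1 | class) with a prior worth one article of evidence.

    count  -- documents in the class that contain the feature
    n_docs -- documents in the class
    z      -- fraction of all Medline articles that contain the feature
    """
    return (count + z) / (n_docs + 1.0)


def laplace_estimate(count, n_docs):
    """Laplace prior: one prior success and one prior failure."""
    return (count + 1.0) / (n_docs + 2.0)


# Made-up rare feature: absent from 100 relevant examples, present in 50 of
# 100000 irrelevant ones, and in 0.05% of Medline overall.
z = 0.0005
ratio_background = posterior_estimate(0, 100, z) / posterior_estimate(50, 100000, z)
ratio_laplace = laplace_estimate(0, 100) / laplace_estimate(50, 100000)
```

            <p>With these invented counts the Laplace estimate gives a ratio above 1, so a feature never seen in a relevant example would count as evidence for relevance, whereas the background-frequency prior keeps the ratio below 1.</p>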
         </sec>
         <sec>
            <st>
               <p>Data structures enabling fast classification</p>
            </st>
            <p>MScanner's classification speed is due to the use of a Bayesian classifier, a compact feature space, and a customised implementation. Training in retrieval tasks is made much faster by keeping track of the total number of occurrences of each term in Medline. The MeSH and ISSN feature spaces fit in 16-bit feature IDs, and each Medline record has an average of 13.5 features. Including some overhead, this allows the features of all 16 million articles in Medline to be stored in a binary stream of around 600 MB. A C program takes 32 seconds to parse this file and calculate article scores for all of Medline, returning those above the specified threshold. The rest of the program is written in Python <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>, using the Numpy library for vector operations. Source code is provided in Additional File <supplr sid="S3">3</supplr>.</p>
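            <p>The idea of scoring from a compact binary stream can be sketched in Python. The record layout below (a 32-bit PubMed ID, a 16-bit feature count, then the 16-bit feature IDs) is an assumption made for illustration; the actual file format and the C scoring program are those in the distributed source code, and the PubMed IDs and adjustments here are made up.</p>

```python
import io
import struct

# Hypothetical record header: uint32 PubMed ID followed by uint16 feature count.
HEADER = struct.Struct("=IH")


def write_record(stream, pmid, feature_ids):
    """Append one record: header, then the feature IDs as 16-bit integers."""
    stream.write(HEADER.pack(pmid, len(feature_ids)))
    stream.write(struct.pack("=%dH" % len(feature_ids), *feature_ids))


def read_records(stream):
    """Yield (pmid, feature_ids) for each record until the stream ends."""
    while True:
        header = stream.read(HEADER.size)
        if not header:
            break
        pmid, count = HEADER.unpack(header)
        ids = struct.unpack("=%dH" % count, stream.read(2 * count))
        yield pmid, ids


def score_stream(stream, adjustment, base, threshold):
    """Yield (pmid, score) for records scoring at or above the threshold."""
    for pmid, ids in read_records(stream):
        score = base + sum(adjustment[i] for i in ids)
        if score >= threshold:
            yield pmid, score


buf = io.BytesIO()
write_record(buf, 111, [0, 2])
write_record(buf, 222, [1])
buf.seek(0)
adjustment = [4.0, -1.0, 2.5]  # made-up per-feature score adjustments
hits = list(score_stream(buf, adjustment, base=-5.0, threshold=0.0))

# Size check: 2 bytes per feature ID, before any per-record overhead.
payload_mb = 16000000 * 13.5 * 2 / 1e6
```

            <p>The size check gives 432 MB for the feature IDs alone, which is consistent with a stream of around 600 MB once per-record overhead is included.</p>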
            <suppl id="S3">
               <title>
                  <p>Additional file 3</p>
               </title>
               <text>
                  <p><b>Source code for MScanner</b>. mscanner-20071123.zip is a ZIP archive containing the Python 2.5 source code for MScanner, licensed under the GNU General Public License. It also contains API documentation in HTML format. Updated versions will be made available at <url>http://mscanner.stanford.edu</url>.</p>
               </text>
               <file name="1471-2105-9-108-S3.zip">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>For storing complete Medline records, we used a 22 GB Berkeley DB indexed by PubMed ID. It was generated by parsing the Medline Baseline <abbrgrp><abbr bid="B39">39</abbr></abbrgrp> distribution, which consists of 70 GB XML compressed to 7 GB and split into files of 30,000 records each. During parsing, a count of the number of occurrences of each feature in Medline is maintained, ready to be used for training the classifier. To look up feature vectors in cross validation, we use a 1.3 GB Berkeley DB instead of the binary stream.</p>
         </sec>
         <sec>
            <st>
               <p>Construction of PG07, AIDSBio, Radiology and Medline100K</p>
            </st>
            <p>The PG07, AIDSBio and Radiology corpora provided in Additional File <supplr sid="S4">4</supplr> are from different domains and are of different sizes, to illustrate the different use cases mentioned in the results. The PG07 corpus comprises literature annotations taken from the PharmGKB <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> on 5 February 2007. The AIDSBio corpus is the intersection of the PubMed AIDS <abbrgrp><abbr bid="B40">40</abbr></abbrgrp> and Bioethics <abbrgrp><abbr bid="B41">41</abbr></abbrgrp> subsets on 19 October 2006. The Radiology corpus is a bibliography of 67 radiology articles focusing on the spleen, obtained from a co-worker of DR. The corpora exclude records that do not have the status "MEDLINE" and therefore lack MeSH terms. The Medline100K corpus consists of 100,000 randomly selected Medline records with completion dates up to 21 January 2007, which is also the upper date for the Control corpus of 10,000 random citations. The size of Medline100K was chosen to provide a good approximation of the Medline background, while containing few unknown relevant articles.</p>
            <suppl id="S4">
               <title>
                  <p>Additional file 4</p>
               </title>
               <text>
                  <p><b>Sample cross validation corpora</b>. corpora.zip is a ZIP archive containing text files for the PG07, AIDSBio, Radiology, Control and Medline100K sample corpora. Each line contains the PubMed ID and completion date of a Medline record.</p>
               </text>
               <file name="1471-2105-9-108-S4.zip">
                  <p>Click here for file</p>
               </file>
            </suppl>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Availability and requirements</p>
         </st>
         <p>&#8226; <b>Project Name: </b>MScanner</p>
         <p>&#8226; <b>Home Page: </b><url>http://mscanner.stanford.edu</url></p>
         <p>&#8226; <b>Operating Systems: </b>Platform independent</p>
         <p>&#8226; <b>Programming Languages: </b>Python, JavaScript, C</p>
         <p>&#8226; <b>Minimum Requirements: </b>Internet Explorer 7, Mozilla Firefox 2, Opera 9, or Safari 3</p>
         <p>&#8226; <b>License: </b>GNU General Public License</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>GP and CS, in collaboration with DR and RA, conceived the goals for MScanner, including providing a web interface and refining the classifier formulation. GP programmed the MScanner software and web interface, developed and carried out the experiments to analyse MScanner's performance with feedback from CS, DR and RA, and wrote the manuscript drafts. CS supervised the research and reviewed all drafts of the manuscript. All authors read and approved the final draft of the paper.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>This work is supported by the University of Cape Town (UCT), the South African National Research Foundation (NRF), the National Bioinformatics Network (NBN), and the Stanford-South Africa Bio-Medical Informatics Programme (SSABMI), which is funded through US National Institutes of Health Fogarty International Center Grant D43 TW06993. PharmGKB associates are supported by NIH grant U01GM61374. We thank Tina Zhou for setting up the server space for MScanner, and Prof. Vladimir Bajic for a helpful discussion.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Fact Sheet: MEDLINE</p>
            </title>
            <url>http://www.nlm.nih.gov/pubs/factsheets/medline.html</url>
         </bibl>
         <bibl id="B2">
            <title>
                <p>Fact Sheet: PubMed<sup>&#174;</sup>: MEDLINE<sup>&#174;</sup> Retrieval on the World Wide Web</p>
            </title>
            <url>http://www.nlm.nih.gov/pubs/factsheets/pubmed.html</url>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Relemed: sentence-level search engine with relevance score for the MEDLINE database of biomedical articles</p>
            </title>
            <aug>
               <au>
                  <snm>Siadaty</snm>
                  <fnm>MS</fnm>
               </au>
               <au>
                  <snm>Shu</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Knaus</snm>
                  <fnm>WA</fnm>
               </au>
            </aug>
            <source>BMC Med Inform Decis Mak</source>
            <pubdate>2007</pubdate>
            <volume>7</volume>
            <fpage>1</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1780044</pubid>
                  <pubid idtype="pmpid" link="fulltext">17214888</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>EBIMed--text crunching to gather facts for proteins from Medline</p>
            </title>
            <aug>
               <au>
                  <snm>Rebholz-Schuhmann</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Kirsch</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Arregui</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Gaudan</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Riethoven</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Stoehr</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>23</volume>
            <issue>2</issue>
            <fpage>e237</fpage>
            <lpage>e244</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">17237098</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Google Scholar</p>
            </title>
            <url>http://scholar.google.com</url>
         </bibl>
         <bibl id="B6">
            <title>
               <p>PubMed related articles: a probabilistic topic-based model for content similarity</p>
            </title>
            <aug>
               <au>
                  <snm>Lin</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Wilbur</snm>
                  <fnm>WJ</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <fpage>423</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2212667</pubid>
                  <pubid idtype="pmpid" link="fulltext">17971238</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Text similarity: an alternative way to search MEDLINE</p>
            </title>
            <aug>
               <au>
                  <snm>Lewis</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Ossowski</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Hicks</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Errami</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Garner</snm>
                  <fnm>HR</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>22</volume>
            <issue>18</issue>
            <fpage>2298</fpage>
            <lpage>2304</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">16926219</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>PharmGKB: the Pharmacogenetics Knowledge Base</p>
            </title>
            <aug>
               <au>
                  <snm>Hewett</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Oliver</snm>
                  <fnm>DE</fnm>
               </au>
               <au>
                  <snm>Rubin</snm>
                  <fnm>DL</fnm>
               </au>
               <au>
                  <snm>Easton</snm>
                  <fnm>KL</fnm>
               </au>
               <au>
                  <snm>Stuart</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Altman</snm>
                  <fnm>RB</fnm>
               </au>
               <au>
                  <snm>Klein</snm>
                  <fnm>TE</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>163</fpage>
            <lpage>165</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">99138</pubid>
                  <pubid idtype="pmpid" link="fulltext">11752281</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Automatic document classification of biological literature</p>
            </title>
            <aug>
               <au>
                  <snm>Chen</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>M&#252;ller</snm>
                  <fnm>HM</fnm>
               </au>
               <au>
                  <snm>Sternberg</snm>
                  <fnm>PW</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>370</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1559726</pubid>
                  <pubid idtype="pmpid" link="fulltext">16893465</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Textpresso: an ontology-based information retrieval and extraction system for biological literature</p>
            </title>
            <aug>
               <au>
                  <snm>M&#252;ller</snm>
                  <fnm>HM</fnm>
               </au>
               <au>
                  <snm>Kenny</snm>
                  <fnm>EE</fnm>
               </au>
               <au>
                  <snm>Sternberg</snm>
                  <fnm>PW</fnm>
               </au>
            </aug>
            <source>PLoS Biol</source>
            <pubdate>2004</pubdate>
            <volume>2</volume>
            <issue>11</issue>
            <fpage>e309</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">517822</pubid>
                  <pubid idtype="pmpid" link="fulltext">15383839</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>A Tutorial on Automated Text Categorisation</p>
            </title>
            <aug>
               <au>
                  <snm>Sebastiani</snm>
                  <fnm>F</fnm>
               </au>
            </aug>
            <source>Proceedings of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence</source>
            <publisher>Buenos Aires, AR</publisher>
            <editor>Amandi A, Zunino R</editor>
            <pubdate>1999</pubdate>
            <fpage>7</fpage>
            <lpage>35</lpage>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Machine learning in automated text categorization</p>
            </title>
            <aug>
               <au>
                  <snm>Sebastiani</snm>
                  <fnm>F</fnm>
               </au>
            </aug>
            <source>ACM Comput Surv</source>
            <pubdate>2002</pubdate>
            <volume>34</volume>
            <fpage>1</fpage>
            <lpage>47</lpage>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Text Categorization with Support Vector Machines: Learning with Many Relevant Features</p>
            </title>
            <aug>
               <au>
                  <snm>Joachims</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>ECML '98: Proceedings of the 10th European Conference on Machine Learning</source>
            <publisher>London, UK: Springer-Verlag</publisher>
            <pubdate>1998</pubdate>
            <fpage>137</fpage>
            <lpage>142</lpage>
         </bibl>
         <bibl id="B14">
            <title>
               <p>A survey of current work in biomedical text mining</p>
            </title>
            <aug>
               <au>
                  <snm>Cohen</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Hersh</snm>
                  <fnm>WR</fnm>
               </au>
            </aug>
            <source>Brief Bioinform</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>57</fpage>
            <lpage>71</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15826357</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Automating document classification for the Immune Epitope Database</p>
            </title>
            <aug>
               <au>
                  <snm>Wang</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Morgan</snm>
                  <fnm>AA</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Q</fnm>
               </au>
               <au>
                  <snm>Sette</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Peters</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <fpage>269</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1965490</pubid>
                  <pubid idtype="pmpid" link="fulltext">17655769</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>The use of receiver operating characteristic curves in biomedical informatics</p>
            </title>
            <aug>
               <au>
                  <snm>Lasko</snm>
                  <fnm>TA</fnm>
               </au>
               <au>
                  <snm>Bhagwat</snm>
                  <fnm>JG</fnm>
               </au>
               <au>
                  <snm>Zou</snm>
                  <fnm>KH</fnm>
               </au>
               <au>
                  <snm>Ohno-Machado</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>J Biomed Inform</source>
            <pubdate>2005</pubdate>
            <volume>38</volume>
            <issue>5</issue>
            <fpage>404</fpage>
            <lpage>415</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">16198999</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
                <p>PreBIND and Textomy--mining the biomedical literature for protein-protein interactions using a support vector machine</p>
            </title>
            <aug>
               <au>
                  <snm>Donaldson</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>de Bruijn</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Wolting</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Lay</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Tuekam</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Baskin</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Bader</snm>
                  <fnm>GD</fnm>
               </au>
               <au>
                  <snm>Michalickova</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Pawson</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Hogue</snm>
                  <fnm>CWV</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>4</volume>
            <fpage>11</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">153503</pubid>
                  <pubid idtype="pmpid" link="fulltext">12689350</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Text categorization models for high-quality article retrieval in internal medicine</p>
            </title>
            <aug>
               <au>
                  <snm>Aphinyanaphongs</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Tsamardinos</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Statnikov</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Hardin</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Aliferis</snm>
                  <fnm>CF</fnm>
               </au>
            </aug>
            <source>J Am Med Inform Assoc</source>
            <pubdate>2005</pubdate>
            <volume>12</volume>
            <issue>2</issue>
            <fpage>207</fpage>
            <lpage>216</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">551552</pubid>
                  <pubid idtype="pmpid" link="fulltext">15561789</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Mining literature for protein-protein interactions</p>
            </title>
            <aug>
               <au>
                  <snm>Marcotte</snm>
                  <fnm>EM</fnm>
               </au>
               <au>
                  <snm>Xenarios</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Eisenberg</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2001</pubdate>
            <volume>17</volume>
            <issue>4</issue>
            <fpage>359</fpage>
            <lpage>363</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11301305</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>TREC 2005 Genomics Track Overview</p>
            </title>
            <aug>
               <au>
                  <snm>Hersh</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Cohen</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Yang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Bhupatiraju</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Roberts</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Hearst</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>The Fourteenth Text REtrieval Conference (TREC 2005)</source>
            <pubdate>2005</pubdate>
         </bibl>
         <bibl id="B21">
            <title>
               <p>An effective general purpose approach for automated biomedical document classification</p>
            </title>
            <aug>
               <au>
                  <snm>Cohen</snm>
                  <fnm>AM</fnm>
               </au>
            </aug>
            <source>AMIA Annu Symp Proc</source>
            <pubdate>2006</pubdate>
            <fpage>161</fpage>
            <lpage>165</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1839342</pubid>
                  <pubid idtype="pmpid">17238323</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Ranking the whole MEDLINE database according to a large training set using text indexing</p>
            </title>
            <aug>
               <au>
                  <snm>Suomela</snm>
                  <fnm>BP</fnm>
               </au>
               <au>
                  <snm>Andrade</snm>
                  <fnm>MA</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>75</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1274266</pubid>
                  <pubid idtype="pmpid" link="fulltext">15790421</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>A statistical approach to scanning the biomedical literature for pharmacogenetics knowledge</p>
            </title>
            <aug>
               <au>
                  <snm>Rubin</snm>
                  <fnm>DL</fnm>
               </au>
               <au>
                  <snm>Thorn</snm>
                  <fnm>CF</fnm>
               </au>
               <au>
                  <snm>Klein</snm>
                  <fnm>TE</fnm>
               </au>
               <au>
                  <snm>Altman</snm>
                  <fnm>RB</fnm>
               </au>
            </aug>
            <source>J Am Med Inform Assoc</source>
            <pubdate>2005</pubdate>
            <volume>12</volume>
            <issue>2</issue>
            <fpage>121</fpage>
            <lpage>129</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">551544</pubid>
                  <pubid idtype="pmpid" link="fulltext">15561790</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>PubFinder: a tool for improving retrieval rate of relevant PubMed abstracts</p>
            </title>
            <aug>
               <au>
                  <snm>Goetz</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>von der Lieth</snm>
                  <fnm>CW</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2005</pubdate>
            <volume>33</volume>
            <issue>Web Server issue</issue>
            <fpage>W774</fpage>
            <lpage>W778</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1160190</pubid>
                  <pubid idtype="pmpid" link="fulltext">15980583</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Finding the evidence for protein-protein interactions from PubMed abstracts</p>
            </title>
            <aug>
               <au>
                  <snm>Jang</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Lim</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Lim</snm>
                  <fnm>JH</fnm>
               </au>
               <au>
                  <snm>Park</snm>
                  <fnm>SJ</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>KC</fnm>
               </au>
               <au>
                  <snm>Park</snm>
                  <fnm>SH</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>22</volume>
            <issue>14</issue>
            <fpage>e220</fpage>
            <lpage>e226</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">16873475</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Updating a bibliography using the related articles function within PubMed</p>
            </title>
            <aug>
               <au>
                  <snm>Liu</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Altman</snm>
                  <fnm>RB</fnm>
               </au>
            </aug>
            <source>Proc AMIA Symp</source>
            <pubdate>1998</pubdate>
            <fpage>750</fpage>
            <lpage>754</lpage>
            <xrefbib>
               <pubid idtype="pmpid">9929319</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>A protocol for the update of references to scientific literature in biological databases</p>
            </title>
            <aug>
               <au>
                  <snm>Perez-Iratxeta</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Astola</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Ciccarelli</snm>
                  <fnm>FD</fnm>
               </au>
               <au>
                  <snm>Sha</snm>
                  <fnm>PK</fnm>
               </au>
               <au>
                  <snm>Bork</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Andrade</snm>
                  <fnm>MA</fnm>
               </au>
            </aug>
            <source>Appl Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>2</volume>
            <issue>3</issue>
            <fpage>189</fpage>
            <lpage>191</lpage>
            <xrefbib>
               <pubid idtype="pmpid">15130808</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Probabilistic models in information retrieval</p>
            </title>
            <aug>
               <au>
                  <snm>Fuhr</snm>
                  <fnm>N</fnm>
               </au>
            </aug>
            <source>Comput J</source>
            <pubdate>1992</pubdate>
            <volume>35</volume>
            <issue>3</issue>
            <fpage>243</fpage>
            <lpage>255</lpage>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Using argumentation to retrieve articles with similar citations: an inquiry into improving related articles search in the MEDLINE digital library</p>
            </title>
            <aug>
               <au>
                  <snm>Tbahriti</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Chichester</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Lisacek</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Ruch</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Int J Med Inform</source>
            <pubdate>2005</pubdate>
            <volume>75</volume>
            <issue>6</issue>
            <fpage>488</fpage>
            <lpage>495</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">16165395</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>A tutorial on information retrieval: basic terms and concepts</p>
            </title>
            <aug>
               <au>
                  <snm>Zhou</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Smalheiser</snm>
                  <fnm>NR</fnm>
               </au>
               <au>
                  <snm>Yu</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>J Biomed Discov Collab</source>
            <pubdate>2006</pubdate>
            <volume>1</volume>
            <fpage>2</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1459215</pubid>
                  <pubid idtype="pmpid" link="fulltext">16722601</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>The meaning and use of the area under a receiver operating characteristic (ROC) curve</p>
            </title>
            <aug>
               <au>
                  <snm>Hanley</snm>
                  <fnm>JA</fnm>
               </au>
               <au>
                  <snm>McNeil</snm>
                  <fnm>BJ</fnm>
               </au>
            </aug>
            <source>Radiology</source>
            <pubdate>1982</pubdate>
            <volume>143</volume>
            <fpage>29</fpage>
            <lpage>36</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">7063747</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Extracting and characterizing gene-drug relationships from the literature</p>
            </title>
            <aug>
               <au>
                  <snm>Chang</snm>
                  <fnm>JT</fnm>
               </au>
               <au>
                  <snm>Altman</snm>
                  <fnm>RB</fnm>
               </au>
            </aug>
            <source>Pharmacogenetics</source>
            <pubdate>2004</pubdate>
            <volume>14</volume>
            <issue>9</issue>
            <fpage>577</fpage>
            <lpage>586</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15475731</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>The relationship between Precision-Recall and ROC curves</p>
            </title>
            <aug>
               <au>
                  <snm>Davis</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Goadrich</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>ICML 2006: Proceedings of the 23rd International Conference on Machine learning</source>
            <publisher>New York, NY, USA: ACM Press</publisher>
            <pubdate>2006</pubdate>
            <fpage>233</fpage>
            <lpage>240</lpage>
         </bibl>
         <bibl id="B34">
            <title>
               <p>Retrieval and classification of dental research articles</p>
            </title>
            <aug>
               <au>
                  <snm>Bartling</snm>
                  <fnm>WC</fnm>
               </au>
               <au>
                  <snm>Schleyer</snm>
                  <fnm>TK</fnm>
               </au>
               <au>
                  <snm>Visweswaran</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Adv Dent Res</source>
            <pubdate>2003</pubdate>
            <volume>17</volume>
            <fpage>115</fpage>
            <lpage>120</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15126221</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>A comparison of event models for Naive Bayes text classification</p>
            </title>
            <aug>
               <au>
                  <snm>McCallum</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Nigam</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>Tech. rep., Just Research</source>
            <pubdate>1998</pubdate>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Poisson naive Bayes for text classification with feature weighting</p>
            </title>
            <aug>
               <au>
                  <snm>Kim</snm>
                  <fnm>SB</fnm>
               </au>
               <au>
                  <snm>Seo</snm>
                  <fnm>HC</fnm>
               </au>
               <au>
                  <snm>Rim</snm>
                  <fnm>HC</fnm>
               </au>
            </aug>
            <source>IRAL 2003: Proceedings of the Sixth International Workshop on Information Retrieval with Asian Languages</source>
            <publisher>Morristown, NJ, USA: Association for Computational Linguistics</publisher>
            <pubdate>2003</pubdate>
            <fpage>33</fpage>
            <lpage>40</lpage>
         </bibl>
         <bibl id="B37">
            <aug>
               <au>
                  <snm>Ewens</snm>
                  <fnm>WJ</fnm>
               </au>
               <au>
                  <snm>Grant</snm>
                  <fnm>GR</fnm>
               </au>
            </aug>
            <source>Statistical Methods in Bioinformatics: An Introduction</source>
            <publisher>Springer</publisher>
            <edition>2</edition>
            <pubdate>2005</pubdate>
         </bibl>
         <bibl id="B38">
            <aug>
               <au>
                  <snm>van Rossum</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Drake</snm>
                  <fnm>FL</fnm>
               </au>
            </aug>
            <source>Python Reference Manual. Virginia, USA</source>
            <pubdate>2001</pubdate>
            <url>http://www.python.org</url>
         </bibl>
         <bibl id="B39">
            <title>
               <p>2007 MEDLINE<sup>&#174;</sup>/PubMed<sup>&#174;</sup> Baseline Distribution</p>
            </title>
            <url>http://www.nlm.nih.gov/bsd/licensee/2007_stats/baseline_doc.html</url>
         </bibl>
         <bibl id="B40">
            <title>
               <p>National Library of Medicine AIDS Subset Strategy</p>
            </title>
            <url>http://www.nlm.nih.gov/bsd/pubmed_subsets/aids_strategy.html</url>
         </bibl>
         <bibl id="B41">
            <title>
               <p>National Library of Medicine Bioethics Subset Strategy</p>
            </title>
            <url>http://www.nlm.nih.gov/bsd/pubmed_subsets/bioethics_strategy.html</url>
         </bibl>
      </refgrp>
   </bm>
</art>

