<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-8-423</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>PubMed related articles: a probabilistic topic-based model for content similarity</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Lin</snm>
               <fnm>Jimmy</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>jimmylin@umd.edu</email>
            </au>
            <au id="A2">
               <snm>Wilbur</snm>
               <fnm>W John</fnm>
               <insr iid="I2"/>
               <email>wilbur@ncbi.nlm.nih.gov</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>College of Information Studies, University of Maryland, College Park, Maryland, USA</p>
            </ins>
            <ins id="I2">
               <p>National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland, USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2007</pubdate>
         <volume>8</volume>
         <issue>1</issue>
         <fpage>423</fpage>
         <url>http://www.biomedcentral.com/1471-2105/8/423</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17971238</pubid>
               <pubid idtype="doi">10.1186/1471-2105-8-423</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>25</day>
               <month>7</month>
               <year>2007</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>30</day>
               <month>10</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>30</day>
               <month>10</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>Lin and Wilbur; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>We present a probabilistic topic-based model for content similarity called <it>pmra </it>that underlies the related article search feature in PubMed. Whether or not a document is about a particular topic is computed from term frequencies, modeled as Poisson distributions. Unlike previous probabilistic retrieval models, we do not attempt to estimate relevance&#8211;but rather our focus is "relatedness", the probability that a user would want to examine a particular document given known interest in another. We also describe a novel technique for estimating parameters that does not require human relevance judgments; instead, the process is based on the existence of MeSH <sup>&#174; </sup>in MEDLINE <sup>&#174;</sup>.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>The <it>pmra </it>retrieval model was compared against <it>bm25</it>, a competitive probabilistic model that shares theoretical similarities. Experiments using the test collection from the TREC 2005 genomics track show a small but statistically significant improvement of <it>pmra </it>over <it>bm25 </it>in terms of precision.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Our experiments suggest that the <it>pmra </it>model provides an effective ranking algorithm for related article search.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>This article describes the retrieval model behind the related article search functionality in PubMed <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. Whenever the user examines a MEDLINE citation in detail, a panel to the right of the abstract text is automatically populated with titles of articles that may also be of interest (see Figure <figr fid="F1">1</figr>). We describe <it>pmra</it>, the topic-based content similarity model that underlies this feature.</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>A typical view in the PubMed search interface showing an abstract in detail</p>
            </caption>
            <text>
               <p>A typical view in the PubMed search interface showing an abstract in detail. The "Related Links" panel on the right is populated with titles of articles that may be of interest.</p>
            </text>
            <graphic file="1471-2105-8-423-1"/>
         </fig>
         <p>There is evidence to suggest that related article search is a useful feature. Based on PubMed query logs gathered during a one-week period in June 2007, we observed approximately 35 million page views across 8 million browser sessions. Of those sessions, 63% consisted of a single page view&#8211;representing bots and direct access into MEDLINE (e.g., from an embedded link or another search engine). Of all sessions in our data set, approximately 2 million include at least one PubMed search query and at least one view of an abstract&#8211;this figure roughly quantifies actual searches. About 19% of these involve at least one click on a related article. In other words, roughly a fifth of all non-trivial user sessions contain at least one invocation of related article search. In terms of overall frequency, approximately five percent of all page views in these non-trivial sessions were generated from clicks on related article links. More details can be found in <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>.</p>
         <p>We evaluate the <it>pmra </it>retrieval model with the test collection from the TREC 2005 genomics track. A test collection is a standard laboratory tool for evaluating retrieval systems, and it consists of three major components:</p>
         <p>&#8226; a corpus&#8211;a collection of documents on which retrieval is performed,</p>
         <p>&#8226; a set of information needs&#8211;written statements describing the desired information, which translate into queries to the system, and</p>
         <p>&#8226; relevance judgments&#8211;records specifying the documents that should be retrieved in response to each information need (typically, these are gathered from human assessors in large-scale evaluations <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>).</p>
         <p>The use of test collections to assess the performance of retrieval algorithms is a well-established methodology in the information retrieval (IR) literature, dating back to the Cranfield experiments in the 1960s <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. These tools enable rapid, reproducible experiments in a controlled setting without requiring users.</p>
         <p>The <it>pmra </it>model is compared against <it>bm25 </it><abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>, a competitive probabilistic model that shares theoretical similarities with <it>pmra</it>. On test data from the TREC 2005 genomics track, we observe a small but statistically significant improvement in terms of precision.</p>
         <p>Before proceeding, a clarification on terminology: although MEDLINE records contain only abstract text and associated bibliographic information, PubMed provides access to the full text articles (if available). Thus, it is not inaccurate to speak of searching for articles, even though the search itself is only performed on information in MEDLINE. Throughout this work, we use "document" and "article" interchangeably.</p>
         <sec>
            <st>
               <p>1.1 Formal Model</p>
            </st>
            <p>We formalize the related document search problem as follows: given a document that the user has indicated interest in, the system's task is to retrieve other documents that the user may also want to examine. Since this activity generally occurs in the context of broader information-seeking behaviors, relevance can serve as one indicator of interest, i.e., retrieve other relevant documents. However, we think of the problem in broader terms: other documents may be interesting because they discuss similar topics, share the same citations, provide general background, lead to interesting hypotheses, etc.</p>
            <p>To constrain this problem, we assume in our theoretical model that documents of interest are similar in terms of the topics or concepts that they are <it>about</it>; in the case of MEDLINE citations, we limit ourselves to the article title and abstract (the deployed algorithm in PubMed also takes advantage of MeSH terms, which we do not discuss here). Following typical assumptions in information retrieval <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>, we wish to rank documents (MEDLINE citations, in our case) based on the probability that the user will want to see them. Thus, our <it>pmra </it>retrieval model focuses on estimating <it>P</it>(<it>c</it>|<it>d</it>), the probability that the user will find document <it>c </it>interesting given expressed interest in document <it>d</it>.</p>
            <p>Let us begin by decomposing documents into mutually-exclusive and exhaustive "topics" (denoted by the set {<it>s</it><sub>1</sub>...<it>s</it><sub><it>N</it></sub>}). Assuming that the relatedness of documents is mediated through topics, we get the following:</p>
            <p>
               <display-formula id="M1">
                  <m:math name="1471-2105-8-423-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>P</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>c</m:mi>
                           <m:mo>|</m:mo>
                           <m:mi>d</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>j</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mi>N</m:mi>
                              </m:munderover>
                              <m:mrow>
                                 <m:mi>P</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>c</m:mi>
                                 <m:mo>|</m:mo>
                                 <m:msub>
                                    <m:mi>s</m:mi>
                                    <m:mi>j</m:mi>
                                 </m:msub>
                                 <m:mo stretchy="false">)</m:mo>
                                 <m:mi>P</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:msub>
                                    <m:mi>s</m:mi>
                                    <m:mi>j</m:mi>
                                 </m:msub>
                                 <m:mo>|</m:mo>
                                 <m:mi>d</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                           </m:mstyle>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaucqGGOaakcqWGJbWycqGG8baFcqWGKbazcqGGPaqkcqGH9aqpdaaeWbqaaiabdcfaqjabcIcaOiabdogaJjabcYha8jabdohaZnaaBaaaleaacqWGQbGAaeqaaOGaeiykaKIaemiuaaLaeiikaGIaem4Cam3aaSbaaSqaaiabdQgaQbqabaGccqGG8baFcqWGKbazcqGGPaqkaSqaaiabdQgaQjabg2da9iabigdaXaqaaiabd6eaobqdcqGHris5aaaa@4CC1@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>Expanding <it>P</it>(<it>s</it><sub><it>j</it></sub>|<it>d</it>) by Bayes' Theorem, we get:</p>
            <p>
               <display-formula id="M2">
                  <m:math name="1471-2105-8-423-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>P</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>c</m:mi>
                           <m:mo>|</m:mo>
                           <m:mi>d</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:msubsup>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>j</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mi>N</m:mi>
                                    </m:msubsup>
                                    <m:mrow>
                                       <m:mi>P</m:mi>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>c</m:mi>
                                       <m:mo>|</m:mo>
                                       <m:msub>
                                          <m:mi>s</m:mi>
                                          <m:mi>j</m:mi>
                                       </m:msub>
                                       <m:mo stretchy="false">)</m:mo>
                                       <m:mi>P</m:mi>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>d</m:mi>
                                       <m:mo>|</m:mo>
                                       <m:msub>
                                          <m:mi>s</m:mi>
                                          <m:mi>j</m:mi>
                                       </m:msub>
                                       <m:mo stretchy="false">)</m:mo>
                                       <m:mi>P</m:mi>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msub>
                                          <m:mi>s</m:mi>
                                          <m:mi>j</m:mi>
                                       </m:msub>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:msubsup>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>j</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mi>N</m:mi>
                                    </m:msubsup>
                                    <m:mrow>
                                       <m:mi>P</m:mi>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>d</m:mi>
                                       <m:mo>|</m:mo>
                                       <m:msub>
                                          <m:mi>s</m:mi>
                                          <m:mi>j</m:mi>
                                       </m:msub>
                                       <m:mo stretchy="false">)</m:mo>
                                       <m:mi>P</m:mi>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msub>
                                          <m:mi>s</m:mi>
                                          <m:mi>j</m:mi>
                                       </m:msub>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaucqGGOaakcqWGJbWycqGG8baFcqWGKbazcqGGPaqkcqGH9aqpdaWcaaqaamaaqadabaGaemiuaaLaeiikaGIaem4yamMaeiiFaWNaem4Cam3aaSbaaSqaaiabdQgaQbqabaGccqGGPaqkcqWGqbaucqGGOaakcqWGKbazcqGG8baFcqWGZbWCdaWgaaWcbaGaemOAaOgabeaakiabcMcaPiabdcfaqjabcIcaOiabdohaZnaaBaaaleaacqWGQbGAaeqaaOGaeiykaKcaleaacqWGQbGAcqGH9aqpcqaIXaqmaeaacqWGobGta0GaeyyeIuoaaOqaamaaqadabaGaemiuaaLaeiikaGIaemizaqMaeiiFaWNaem4Cam3aaSbaaSqaaiabdQgaQbqabaGccqGGPaqkcqWGqbaucqGGOaakcqWGZbWCdaWgaaWcbaGaemOAaOgabeaakiabcMcaPaWcbaGaemOAaOMaeyypa0JaeGymaedabaGaemOta4eaniabggHiLdaaaaaa@677D@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>Since we are concerned only with the ranking of documents, the denominator can be safely ignored because it is independent of <it>c</it>. Thus, we arrive at the following criterion for ranking documents:</p>
            <p>
               <display-formula id="M3">
                  <m:math name="1471-2105-8-423-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>P</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>c</m:mi>
                           <m:mo>|</m:mo>
                           <m:mi>d</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>&#8733;</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>j</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mi>N</m:mi>
                              </m:munderover>
                              <m:mrow>
                                 <m:mi>P</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>c</m:mi>
                                 <m:mo>|</m:mo>
                                 <m:msub>
                                    <m:mi>s</m:mi>
                                    <m:mi>j</m:mi>
                                 </m:msub>
                                 <m:mo stretchy="false">)</m:mo>
                                 <m:mi>P</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>d</m:mi>
                                 <m:mo>|</m:mo>
                                 <m:msub>
                                    <m:mi>s</m:mi>
                                    <m:mi>j</m:mi>
                                 </m:msub>
                                 <m:mo stretchy="false">)</m:mo>
                                 <m:mi>P</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:msub>
                                    <m:mi>s</m:mi>
                                    <m:mi>j</m:mi>
                                 </m:msub>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                           </m:mstyle>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaucqGGOaakcqWGJbWycqGG8baFcqWGKbazcqGGPaqkcqGHDisTdaaeWbqaaiabdcfaqjabcIcaOiabdogaJjabcYha8jabdohaZnaaBaaaleaacqWGQbGAaeqaaOGaeiykaKIaemiuaaLaeiikaGIaemizaqMaeiiFaWNaem4Cam3aaSbaaSqaaiabdQgaQbqabaGccqGGPaqkcqWGqbaucqGGOaakcqWGZbWCdaWgaaWcbaGaemOAaOgabeaakiabcMcaPaWcbaGaemOAaOMaeyypa0JaeGymaedabaGaemOta4eaniabggHiLdaaaa@5318@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>Rephrased in prose, <it>P</it>(<it>c</it>|<it>s</it><sub><it>j</it></sub>) is the probability that a user would want to see <it>c </it>given an interest in topic <it>s</it><sub><it>j</it></sub>, and similarly for <it>P</it>(<it>d</it>|<it>s</it><sub><it>j</it></sub>). Thus, the degree to which two documents are related can be computed by the product of these two probabilities and the prior probability on the topic <it>P</it>(<it>s</it><sub><it>j</it></sub>), summed across all topics.</p>
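            <p>The ranking criterion above can be illustrated with a minimal Python sketch. The function name <it>relatedness_score</it>, the three-topic setup, and all probability values are hypothetical and chosen only for illustration; they are not taken from the deployed PubMed system. The sketch also checks numerically that dividing by the denominator, which does not depend on <it>c</it>, leaves the ranking unchanged.</p>
            <p>
```python
from typing import Dict, List


def relatedness_score(p_c_given_s: List[float],
                      p_d_given_s: List[float],
                      prior: List[float]) -> float:
    """Unnormalized P(c|d): sum over topics of P(c|s_j) * P(d|s_j) * P(s_j)."""
    return sum(pc * pd * ps
               for pc, pd, ps in zip(p_c_given_s, p_d_given_s, prior))


# Hypothetical toy values for three topics (assumptions, not real estimates).
prior = [0.5, 0.3, 0.2]            # P(s_j)
p_d = [0.9, 0.1, 0.0]              # P(d|s_j) for the document of interest
candidates: Dict[str, List[float]] = {
    "c1": [0.8, 0.1, 0.1],         # P(c1|s_j)
    "c2": [0.1, 0.2, 0.9],         # P(c2|s_j)
}

scores = {name: relatedness_score(p_c, p_d, prior)
          for name, p_c in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

# The denominator, sum_j P(d|s_j) P(s_j), is the same for every candidate,
# so normalizing by it rescales all scores without reordering them.
evidence = sum(pd * ps for pd, ps in zip(p_d, prior))
normalized = {name: s / evidence for name, s in scores.items()}
normalized_ranking = sorted(normalized, key=normalized.get, reverse=True)

print(ranking, normalized_ranking)
```
            </p>
            <p>In this toy setting the candidate that shares the dominant topic with <it>d </it>is ranked first, with or without normalization.</p>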
            <p>Thus far, we have not addressed the important question of what a topic actually is. For computational tractability, we make the simplifying assumption that each term in a document represents a topic (that is, each term conveys an idea or concept). Thus, the "aboutness" of a document (i.e., what topics the document discusses) is conveyed through the terms in the document. As with most retrieval models, we assume single-word terms, as opposed to potentially complex multi-word concepts. This satisfies our requirement that the set of topics be exhaustive and mutually-exclusive.</p>
            <p>From this starting point, we leverage previous work in probabilistic retrieval models based on Poisson distributions (e.g., <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>). A Poisson distribution characterizes the probability of a specific number of events occurring in a fixed period of time if these events occur with a known average rate. The underlying assumption is a generative model of document content: let us suppose that an author uses a particular term with constant probability, and that documents are generated as a sequence of terms. A Poisson distribution specifies the probability that we would observe the term <it>n </it>times in a document. Obviously, this does not accurately reflect how content is actually produced&#8211;nevertheless, this simple model has served as the starting point for many effective retrieval algorithms.</p>
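            <p>As a small illustration of the Poisson assumption described above, the following Python sketch evaluates the probability of observing a term a given number of times in a document. The function name <it>poisson_pmf </it>and the mean of 2.0 occurrences are assumptions made for this example only.</p>
            <p>
```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k occurrences when the mean count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Hypothetical mean number of occurrences of a term per document.
lam = 2.0
probs = [poisson_pmf(k, lam) for k in range(6)]

for k, p in enumerate(probs):
    print(f"P(term observed {k} times) = {p:.4f}")
```
            </p>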
            <p>This content model also assumes that each term occurrence is independent. Although in reality term occurrences are <it>not </it>independent&#8211;for example, observing the term "breast" in a document makes the term "cancer" more likely to also be observed&#8211;such a simplification makes the problem computationally tractable. This is commonly known as the term-independence assumption and dates back to the earliest days of information retrieval research <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. See <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> for recent work that attempts to introduce term dependencies into retrieval algorithms.</p>
            <p>Building on this, we invoke the concept of <it>eliteness</it>, which is closely associated with probabilistic IR models <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. A given document <it>d </it>can be <it>about </it>a particular topic <it>s</it><sub><it>i </it></sub>or not. Following standard definitions, in the first case we say that the term <it>t</it><sub><it>i </it></sub>(representing the topic <it>s</it><sub><it>i</it></sub>) is <it>elite </it>for document <it>d </it>(and not elite in the second case).</p>
            <p>Let us further assume, as others have before, that elite terms and non-elite terms are used with different frequencies. That is, if the author intends to convey topic <it>s</it><sub><it>i </it></sub>in a document, the author will use term <it>t</it><sub><it>i </it></sub>with a certain probability (elite case); if the document is not about <it>s</it><sub><it>i</it></sub>, the author will use term <it>t</it><sub><it>i </it></sub>with a different (presumably smaller) probability. We can characterize the observed frequency of a term by a Poisson distribution, defined by a single parameter (the mean), which in our model is different for the elite and non-elite cases.</p>
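            <p>The two-Poisson assumption can be sketched in Python as follows. Given one Poisson mean for the elite case and a smaller one for the non-elite case, Bayes' rule yields the probability that a term is elite after it has been observed a given number of times. The function names, the two means, and the prior probability of eliteness are hypothetical values chosen for illustration; they are not parameters estimated from MEDLINE.</p>
            <p>
```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k occurrences when the mean count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def eliteness_posterior(k: int, lam_elite: float, lam_non_elite: float,
                        prior_elite: float) -> float:
    """Probability that a term is elite given k occurrences, by Bayes' rule."""
    elite = poisson_pmf(k, lam_elite) * prior_elite
    non_elite = poisson_pmf(k, lam_non_elite) * (1.0 - prior_elite)
    return elite / (elite + non_elite)


# Hypothetical parameters: elite terms are used more often than non-elite ones,
# and only a small fraction of documents are about any given topic.
posteriors = [eliteness_posterior(k, lam_elite=3.0, lam_non_elite=0.3,
                                  prior_elite=0.1)
              for k in range(5)]

for k, p in enumerate(posteriors):
    print(f"k = {k}: probability of eliteness = {p:.4f}")
```
            </p>
            <p>With these assumed values, the probability of eliteness starts below the prior when the term is absent and rises steadily as the term is observed more often.</p>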
            <p>Thus, we wish to compute <it>P</it>(<it>E</it>|<it>k</it>)&#8211;the probability that a document is <it>about </it>a topic, given that we observed its corresponding term <it>k </it>times in the document. By Bayes' rule:</p>
            <p>
               <display-formula id="M4">
                  <m:math name="1471-2105-8-423-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>P</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>E</m:mi>
                           <m:mo>|</m:mo>
                           <m:mi>k</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mi>P</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>k</m:mi>
                                 <m:mo>|</m:mo>
                                 <m:mi>E</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                                 <m:mi>P</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>E</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>P</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>k</m:mi>
                                 <m:mo>|</m:mo>
                                 <m:mi>E</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                                 <m:mi>P</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>E</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                                 <m:mo>+</m:mo>
                                 <m:mi>P</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>k</m:mi>
                                 <m:mo>|</m:mo>
                                 <m:mover accent="true">
                                    <m:mi>E</m:mi>
                                    <m:mo>&#175;</m:mo>
                                 </m:mover>
                                 <m:mo stretchy="false">)</m:mo>
                                 <m:mi>P</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mover accent="true">
                                    <m:mi>E</m:mi>
                                    <m:mo>&#175;</m:mo>
                                 </m:mover>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaucqGGOaakcqWGfbqrcqGG8baFcqWGRbWAcqGGPaqkcqGH9aqpdaWcaaqaaiabdcfaqjabcIcaOiabdUgaRjabcYha8jabdweafjabcMcaPiabdcfaqjabcIcaOiabdweafjabcMcaPaqaaiabdcfaqjabcIcaOiabdUgaRjabcYha8jabdweafjabcMcaPiabdcfaqjabcIcaOiabdweafjabcMcaPiabgUcaRiabdcfaqjabcIcaOiabdUgaRjabcYha8jqbdweafzaaraGaeiykaKIaemiuaaLaeiikaGIafmyrauKbaebacqGGPaqkaaaaaa@55D2@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>
               <display-formula id="M5">
                  <m:math name="1471-2105-8-423-i5" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mo>=</m:mo>
                           <m:msup>
                              <m:mrow>
                                 <m:mrow>
                                    <m:mo>(</m:mo>
                                    <m:mrow>
                                       <m:mn>1</m:mn>
                                       <m:mo>+</m:mo>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mi>P</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>k</m:mi>
                                             <m:mo>|</m:mo>
                                             <m:mover accent="true">
                                                <m:mi>E</m:mi>
                                                <m:mo>&#175;</m:mo>
                                             </m:mover>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mi>P</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mover accent="true">
                                                <m:mi>E</m:mi>
                                                <m:mo>&#175;</m:mo>
                                             </m:mover>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mi>P</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>k</m:mi>
                                             <m:mo>|</m:mo>
                                             <m:mi>E</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mi>P</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>E</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                    <m:mo>)</m:mo>
                                 </m:mrow>
                              </m:mrow>
                              <m:mrow>
                                 <m:mo>&#8722;</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                           </m:msup>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqGH9aqpdaqadaqaaiabigdaXiabgUcaRmaalaaabaGaemiuaaLaeiikaGIaem4AaSMaeiiFaWNafmyrauKbaebacqGGPaqkcqWGqbaucqGGOaakcuWGfbqrgaqeaiabcMcaPaqaaiabdcfaqjabcIcaOiabdUgaRjabcYha8jabdkfasjabcMcaPiabdcfaqjabcIcaOiabdweafjabcMcaPaaaaiaawIcacaGLPaaadaahaaWcbeqaaiabgkHiTiabigdaXaaaaaa@48E7@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>Next, we must compute the two probabilities <it>P</it>(<it>k</it>|<it>E</it>) and <it>P</it>(<it>k</it>|<inline-formula><m:math name="1471-2105-8-423-i6" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>E</m:mi><m:mo>&#175;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGfbqrgaqeaaaa@2DD7@</m:annotation></m:semantics></m:math></inline-formula>). As discussed above, we model the two as Poisson distributions. For the elite case, the distribution is defined by the parameter <it>&#955;</it>; for the non-elite case, by the parameter <it>&#956;</it>:</p>
            <p>
               <display-formula id="M6">
                  <m:math name="1471-2105-8-423-i7" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>P</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>k</m:mi>
                           <m:mo>|</m:mo>
                           <m:mi>E</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:msup>
                                    <m:mi>&#955;</m:mi>
                                    <m:mi>k</m:mi>
                                 </m:msup>
                                 <m:msup>
                                    <m:mi>e</m:mi>
                                    <m:mrow>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mi>&#955;</m:mi>
                                    </m:mrow>
                                 </m:msup>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>k</m:mi>
                                 <m:mo>!</m:mo>
                              </m:mrow>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaucqGGOaakcqWGRbWAcqGG8baFcqWGfbqrcqGGPaqkcqGH9aqpdaWcaaqaaGGaciab=T7aSnaaCaaaleqabaGaem4AaSgaaOGaemyzau2aaWbaaSqabeaacqGHsislcqWF7oaBaaaakeaacqWGRbWAcqGGHaqiaaaaaa@3E2F@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>
               <display-formula id="M7">
                  <m:math name="1471-2105-8-423-i8" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>P</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>k</m:mi>
                           <m:mo>|</m:mo>
                           <m:mover accent="true">
                              <m:mi>E</m:mi>
                              <m:mo>&#175;</m:mo>
                           </m:mover>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:msup>
                                    <m:mi>&#956;</m:mi>
                                    <m:mi>k</m:mi>
                                 </m:msup>
                                 <m:msup>
                                    <m:mi>e</m:mi>
                                    <m:mrow>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mi>&#956;</m:mi>
                                    </m:mrow>
                                 </m:msup>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>k</m:mi>
                                 <m:mo>!</m:mo>
                              </m:mrow>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaucqGGOaakcqWGRbWAcqGG8baFcuWGfbqrgaqeaiabcMcaPiabg2da9maalaaabaacciGae8hVd02aaWbaaSqabeaacqWGRbWAaaGccqWGLbqzdaahaaWcbeqaaiabgkHiTiab=X7aTbaaaOqaaiabdUgaRjabcgcaHaaaaaa@3E4B@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>After further algebraic manipulation, we get the expression in Equation 8. Since there are differences in length between documents in the same collection, we account for this by introducing <it>l</it>, the length of the document in words. Previous research has shown that document length normalization plays an important role in retrieval performance (e.g., <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>), since longer documents are likely to have more query terms <it>a priori</it>. Finally, we define the parameter <it>&#951; </it>= <it>P</it>(<inline-formula><m:math name="1471-2105-8-423-i6" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>E</m:mi><m:mo>&#175;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGfbqrgaqeaaaa@2DD7@</m:annotation></m:semantics></m:math></inline-formula>)/<it>P</it>(<it>E</it>).</p>
            <p>
               <display-formula id="M8">
                  <m:math name="1471-2105-8-423-i9" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>P</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>E</m:mi>
                           <m:mo>|</m:mo>
                           <m:mi>k</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:msup>
                              <m:mrow>
                                 <m:mrow>
                                    <m:mo>(</m:mo>
                                    <m:mrow>
                                       <m:mn>1</m:mn>
                                       <m:mo>+</m:mo>
                                       <m:mi>&#951;</m:mi>
                                       <m:msup>
                                          <m:mrow>
                                             <m:mrow>
                                                <m:mo>(</m:mo>
                                                <m:mrow>
                                                   <m:mfrac>
                                                      <m:mi>&#956;</m:mi>
                                                      <m:mi>&#955;</m:mi>
                                                   </m:mfrac>
                                                </m:mrow>
                                                <m:mo>)</m:mo>
                                             </m:mrow>
                                          </m:mrow>
                                          <m:mi>k</m:mi>
                                       </m:msup>
                                       <m:msup>
                                          <m:mi>e</m:mi>
                                          <m:mrow>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>&#956;</m:mi>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mi>&#955;</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mi>l</m:mi>
                                          </m:mrow>
                                       </m:msup>
                                    </m:mrow>
                                    <m:mo>)</m:mo>
                                 </m:mrow>
                              </m:mrow>
                              <m:mrow>
                                 <m:mo>&#8722;</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                           </m:msup>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGqbaucqGGOaakcqWGfbqrcqGG8baFcqWGRbWAcqGGPaqkcqGH9aqpdaqadaqaaiabigdaXiabgUcaRGGaciab=D7aOnaabmaabaWaaSaaaeaacqWF8oqBaeaacqWF7oaBaaaacaGLOaGaayzkaaWaaWbaaSqabeaacqWGRbWAaaGccqWGLbqzdaahaaWcbeqaaiabgkHiTiabcIcaOiab=X7aTjabgkHiTiab=T7aSjabcMcaPiabdYgaSbaaaOGaayjkaiaawMcaamaaCaaaleqabaGaeyOeI0IaeGymaedaaaaa@4BFD@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
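            <p>The algebraic manipulation leading to Equation 8 can be sketched as follows. This is only a sketch under one assumption that is our reading rather than something stated explicitly above: that document length <it>l</it> enters by scaling the Poisson means of Equations 6 and 7 to <it>&#955;l</it> and <it>&#956;l</it>. Under that assumption the factorials and the powers of <it>l</it> cancel in the likelihood ratio, and substituting the ratio into Equation 5 gives Equation 8:</p>
            <p>
```latex
\[
\frac{P(k \mid \bar{E})}{P(k \mid E)}
  = \frac{(\mu l)^{k} e^{-\mu l} / k!}{(\lambda l)^{k} e^{-\lambda l} / k!}
  = \left(\frac{\mu}{\lambda}\right)^{k} e^{-(\mu - \lambda) l}
\]
\[
P(E \mid k)
  = \left(1 + \frac{P(k \mid \bar{E})\, P(\bar{E})}{P(k \mid E)\, P(E)}\right)^{-1}
  = \left(1 + \eta \left(\frac{\mu}{\lambda}\right)^{k} e^{-(\mu - \lambda) l}\right)^{-1},
\qquad \eta = \frac{P(\bar{E})}{P(E)}
\]
```
            </p>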
            <p>How does Equation 8 relate to our retrieval model? Recall from Equation 3 that we need to compute <it>P</it>(<it>c</it>|<it>s</it><sub><it>j</it></sub>) and <it>P</it>(<it>d</it>|<it>s</it><sub><it>j</it></sub>)&#8211;the probability that a user would want to see a particular document given interest in a specific topic. Let us employ <it>P</it>(<it>E</it>|<it>k</it>) for exactly this purpose: we assume that users want to see the elite set of documents for a particular topic, and we estimate eliteness from the observed frequency of the term that represents the topic. Finally, we approximate <it>P</it>(<it>s</it><sub><it>i</it></sub>) with <it>idf</it>, that is, the inverse document frequency of <it>t</it><sub><it>i</it></sub>. Putting everything together, we derive the following term weighting and document ranking functions:</p>
            <p>
               <display-formula id="M9">
                  <m:math name="1471-2105-8-423-i10" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>w</m:mi>
                              <m:mi>t</m:mi>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:msup>
                              <m:mrow>
                                 <m:mrow>
                                    <m:mo>(</m:mo>
                                    <m:mrow>
                                       <m:mn>1</m:mn>
                                       <m:mo>+</m:mo>
                                       <m:mi>&#951;</m:mi>
                                       <m:msup>
                                          <m:mrow>
                                             <m:mrow>
                                                <m:mo>(</m:mo>
                                                <m:mrow>
                                                   <m:mfrac>
                                                      <m:mi>&#956;</m:mi>
                                                      <m:mi>&#955;</m:mi>
                                                   </m:mfrac>
                                                </m:mrow>
                                                <m:mo>)</m:mo>
                                             </m:mrow>
                                          </m:mrow>
                                          <m:mi>k</m:mi>
                                       </m:msup>
                                       <m:msup>
                                          <m:mi>e</m:mi>
                                          <m:mrow>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>&#956;</m:mi>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mi>&#955;</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mi>l</m:mi>
                                          </m:mrow>
                                       </m:msup>
                                    </m:mrow>
                                    <m:mo>)</m:mo>
                                 </m:mrow>
                              </m:mrow>
                              <m:mrow>
                                 <m:mo>&#8722;</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                           </m:msup>
                           <m:msqrt>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mi>d</m:mi>
                                 <m:msub>
                                    <m:mi>f</m:mi>
                                    <m:mi>t</m:mi>
                                 </m:msub>
                              </m:mrow>
                           </m:msqrt>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWG3bWDdaWgaaWcbaGaemiDaqhabeaakiabg2da9maabmaabaGaeGymaeJaey4kaSccciGae83TdG2aaeWaaeaadaWcaaqaaiab=X7aTbqaaiab=T7aSbaaaiaawIcacaGLPaaadaahaaWcbeqaaiabdUgaRbaakiabdwgaLnaaCaaaleqabaGaeyOeI0IaeiikaGIae8hVd0MaeyOeI0Iae83UdWMaeiykaKIaemiBaWgaaaGccaGLOaGaayzkaaWaaWbaaSqabeaacqGHsislcqaIXaqmaaGcdaGcaaqaaiabdMgaPjabdsgaKjabdAgaMnaaBaaaleaacqWG0baDaeqaaaqabaaaaa@4E06@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>
               <display-formula id="M10">
                  <m:math name="1471-2105-8-423-i11" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mtext>Sim</m:mtext>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>c</m:mi>
                           <m:mo>,</m:mo>
                           <m:mi>d</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>t</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mi>N</m:mi>
                              </m:munderover>
                              <m:mrow>
                                 <m:msub>
                                    <m:mi>w</m:mi>
                                    <m:mrow>
                                       <m:mi>t</m:mi>
                                       <m:mo>,</m:mo>
                                       <m:mi>c</m:mi>
                                    </m:mrow>
                                 </m:msub>
                                 <m:mo>&#8901;</m:mo>
                                 <m:msub>
                                    <m:mi>w</m:mi>
                                    <m:mrow>
                                       <m:mi>t</m:mi>
                                       <m:mo>,</m:mo>
                                       <m:mi>d</m:mi>
                                    </m:mrow>
                                 </m:msub>
                              </m:mrow>
                           </m:mstyle>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqqGtbWucqqGPbqAcqqGTbqBcqGGOaakcqWGJbWycqGGSaalcqWGKbazcqGGPaqkcqGH9aqpdaaeWbqaaiabdEha3naaBaaaleaacqWG0baDcqGGSaalcqWGJbWyaeqaaOGaeyyXICTaem4DaC3aaSbaaSqaaiabdsha0jabcYcaSiabdsgaKbqabaaabaGaemiDaqNaeyypa0JaeGymaedabaGaemOta4eaniabggHiLdaaaa@4A6A@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>A term's weight with respect to a particular document (<it>w</it><sub><it>t</it></sub>) can be computed using Equation 9, derived from the estimation of eliteness in our probabilistic topic similarity model. Similarity between two documents is computed by an inner product of term weights, and documents are sorted by their similarity to the current document <it>d </it>in the final output. We note that this derivation shares similarities with existing probabilistic retrieval models, which we discuss in Section 3.</p>
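            <p>As an illustration only, Equations 9 and 10 might be sketched in Python as follows. The function names, the dictionary representation of a document, and the parameter values in the usage note are hypothetical. The sketch also assumes that a term absent from a document contributes zero weight (a common sparse-vector convention), which the text above does not state:</p>
            <p>
```python
import math


def term_weight(k, l, lam, mu, eta, idf):
    """Equation 9: estimated eliteness P(E|k) multiplied by sqrt(idf).

    k   -- occurrences of the term in the document
    l   -- document length in words
    lam -- Poisson parameter for the elite case (lambda)
    mu  -- Poisson parameter for the non-elite case
    eta -- P(not E) / P(E)
    idf -- inverse document frequency of the term
    """
    ratio = eta * (mu / lam) ** k * math.exp(-(mu - lam) * l)
    return math.sqrt(idf) / (1.0 + ratio)


def similarity(weights_c, weights_d):
    """Equation 10: inner product of term weights.

    Each argument maps a term to its weight in one document. Terms that
    are missing from either mapping are treated as having weight zero
    (an assumption of this sketch).
    """
    return sum(w * weights_d[t] for t, w in weights_c.items() if t in weights_d)
```
            </p>
            <p>For example, with the hypothetical values <it>&#955;</it> = 0.02, <it>&#956;</it> = 0.01 and <it>&#951;</it> = 1, calling <it>term_weight</it> for a 100-word document yields a larger weight for <it>k</it> = 3 than for <it>k</it> = 1, as expected when the elite rate exceeds the non-elite rate.</p>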
         </sec>
         <sec>
            <st>
               <p>1.2 Parameter Estimation</p>
            </st>
            <p>The optimization of parameters is one key to good retrieval performance. In many cases, test collections with relevance judgments are required to tune parameters in terms of metrics such as mean average precision (the standard single-point measure for quantifying system performance in the IR literature). However, test collections are expensive to build and not available for many retrieval applications. To address this issue, we have developed a novel process for estimating <it>pmra </it>parameters that does not require relevance judgments.</p>
            <p>The <it>pmra </it>model has three parameters: <it>&#955;</it>, <it>&#956;</it>, and <it>&#951;</it>. The first two define the means of the elite and non-elite Poisson distributions, respectively, and the third is <it>P</it>(<inline-formula><m:math name="1471-2105-8-423-i6" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>E</m:mi><m:mo>&#175;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGfbqrgaqeaaaa@2DD7@</m:annotation></m:semantics></m:math></inline-formula>)/<it>P</it>(<it>E</it>). To make our model computationally tractable, we make one additional simplifying assumption: that half the term occurrences in the document are elite and the other half are not. This corresponds to assuming a uniform probability distribution in the absence of any other information&#8211;a similar principle underlies maximum entropy models commonly used in natural language processing <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. This leads to the following:</p>
            <p>
               <display-formula id="M11">
                  <m:math name="1471-2105-8-423-i12" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>&#951;</m:mi>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mrow>
                                 <m:mfrac>
                                    <m:mi>&#956;</m:mi>
                                    <m:mi>&#955;</m:mi>
                                 </m:mfrac>
                              </m:mrow>
                              <m:mo>)</m:mo>
                           </m:mrow>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mi>P</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mover accent="true">
                                    <m:mi>E</m:mi>
                                    <m:mo>&#175;</m:mo>
                                 </m:mover>
                                 <m:mo stretchy="false">)</m:mo>
                                 <m:mi>&#956;</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>P</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>E</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                                 <m:mi>&#955;</m:mi>
                              </m:mrow>
                           </m:mfrac>
                           <m:mo>=</m:mo>
                           <m:mn>1</m:mn>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWF3oaAdaqadaqaamaalaaabaGae8hVd0gabaGae83UdWgaaaGaayjkaiaawMcaaiabg2da9maalaaabaGaemiuaaLaeiikaGIafmyrauKbaebacqGGPaqkcqWF8oqBaeaacqWGqbaucqGGOaakcqWGfbqrcqGGPaqkcqWF7oaBaaGaeyypa0JaeGymaedaaa@41B8@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>Experimental results presented in Sections 2.2 and 2.3 suggest that this assumption works reasonably well. More importantly, it reduces the number of parameters in <it>pmra </it>from three to two, and yields a slightly simpler weighting function:</p>
            <p>
               <display-formula id="M12">
                  <m:math name="1471-2105-8-423-i13" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>w</m:mi>
                              <m:mi>t</m:mi>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:msup>
                              <m:mrow>
                                 <m:mrow>
                                    <m:mo>(</m:mo>
                                    <m:mrow>
                                       <m:mn>1</m:mn>
                                       <m:mo>+</m:mo>
                                       <m:msup>
                                          <m:mrow>
                                             <m:mrow>
                                                <m:mo>(</m:mo>
                                                <m:mrow>
                                                   <m:mfrac>
                                                      <m:mi>&#956;</m:mi>
                                                      <m:mi>&#955;</m:mi>
                                                   </m:mfrac>
                                                </m:mrow>
                                                <m:mo>)</m:mo>
                                             </m:mrow>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mi>k</m:mi>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mn>1</m:mn>
                                          </m:mrow>
                                       </m:msup>
                                       <m:msup>
                                          <m:mi>e</m:mi>
                                          <m:mrow>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>&#956;</m:mi>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mi>&#955;</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mi>l</m:mi>
                                          </m:mrow>
                                       </m:msup>
                                    </m:mrow>
                                    <m:mo>)</m:mo>
                                 </m:mrow>
                              </m:mrow>
                              <m:mrow>
                                 <m:mo>&#8722;</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                           </m:msup>
                           <m:msqrt>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mi>d</m:mi>
                                 <m:msub>
                                    <m:mi>f</m:mi>
                                    <m:mi>t</m:mi>
                                 </m:msub>
                              </m:mrow>
                           </m:msqrt>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWG3bWDdaWgaaWcbaGaemiDaqhabeaakiabg2da9maabmaabaGaeGymaeJaey4kaSYaaeWaaeaadaWcaaqaaGGaciab=X7aTbqaaiab=T7aSbaaaiaawIcacaGLPaaadaahaaWcbeqaaiabdUgaRjabgkHiTiabigdaXaaakiabdwgaLnaaCaaaleqabaGaeyOeI0IaeiikaGIae8hVd0MaeyOeI0Iae83UdWMaeiykaKIaemiBaWgaaaGccaGLOaGaayzkaaWaaWbaaSqabeaacqGHsislcqaIXaqmaaGcdaGcaaqaaiabdMgaPjabdsgaKjabdAgaMnaaBaaaleaacqWG0baDaeqaaaqabaaaaa@4E3C@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
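            <p>The step from Equation 9 to Equation 12 can be checked numerically. Equation 11 implies <it>&#951;</it> = <it>&#955;</it>/<it>&#956;</it>, so <it>&#951;</it>(<it>&#956;</it>/<it>&#955;</it>)<sup><it>k</it></sup> collapses to (<it>&#956;</it>/<it>&#955;</it>)<sup><it>k</it>&#8722;1</sup>. The following minimal sketch confirms that the two weighting functions agree; the function names and all numeric values are hypothetical and chosen purely for illustration:</p>
            <p>
```python
import math


def weight_eq9(k, l, lam, mu, eta, idf):
    """Three-parameter weight (Equation 9)."""
    ratio = eta * (mu / lam) ** k * math.exp(-(mu - lam) * l)
    return math.sqrt(idf) / (1.0 + ratio)


def weight_eq12(k, l, lam, mu, idf):
    """Two-parameter weight (Equation 12), with eta eliminated."""
    ratio = (mu / lam) ** (k - 1) * math.exp(-(mu - lam) * l)
    return math.sqrt(idf) / (1.0 + ratio)


# Hypothetical parameter values, for illustration only.
lam, mu = 0.02, 0.01
eta = lam / mu  # follows from Equation 11: eta * (mu / lam) = 1

for k in range(1, 6):
    a = weight_eq9(k, 150, lam, mu, eta, 3.0)
    b = weight_eq12(k, 150, lam, mu, 3.0)
    assert math.isclose(a, b)
```
            </p>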
            <p>Nevertheless, we must still determine the parameters <it>&#955; </it>and <it>&#956; </it>(Poisson parameters for the elite and non-elite distributions). If a document collection were annotated with actual topics, then these values could be estimated directly. Fortunately, for MEDLINE we have exactly this metadata&#8211;in the form of MeSH terms associated with each record. MeSH terms are useful for parameter estimation in our model precisely because they represent topics present in the articles. Thus, we can assume that if <it>H</it><sub><it>n </it></sub>is assigned to document <it>d</it>, the terms in the MeSH descriptor are elite. For example, if the MeSH descriptor "headache" [C10.597.617.470] were assigned to a citation, then the term "headache" must be elite in that abstract. We can record the frequency of the term and estimate <it>&#955; </it>from such observations. Similarly, we can treat as the non-elite case terms in a document that do not appear in any MeSH descriptors, and from this we can derive <it>&#956;</it>. There is, however, one additional consideration: from what set of citations should these parameters be estimated? Possibilities include the entire corpus, a random sample, or a biased sample (e.g., results of a search). In this work, we experiment with variants of the third approach.</p>
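            <p>One way such an estimation could be organized is sketched below. It is not the procedure actually used in this work, and it rests on several assumptions of our own: documents are given as token lists, MeSH descriptors are reduced to single words, "any MeSH descriptors" is read as a hypothetical vocabulary of all descriptor words (<it>mesh_vocabulary</it>), and each rate is taken as the maximum-likelihood estimate of a per-word Poisson rate, i.e., total occurrences divided by total document length:</p>
            <p>
```python
from collections import Counter


def estimate_rate(observations):
    """Per-word Poisson rate from (term_count, doc_length) pairs.

    Assumes each count follows Poisson(rate * doc_length); the
    maximum-likelihood estimate is then sum(counts) / sum(lengths).
    """
    total_count = sum(k for k, _ in observations)
    total_length = sum(l for _, l in observations)
    if total_length == 0:
        raise ValueError("no observations to estimate from")
    return total_count / total_length


def estimate_poisson_rates(docs, mesh_vocabulary):
    """Return (lam, mu) estimated from MeSH-annotated documents.

    docs            -- list of dicts with 'tokens' (list of str) and
                       'mesh_terms' (set of str assigned to that document)
    mesh_vocabulary -- hypothetical set of words occurring in any descriptor
    """
    elite, non_elite = [], []
    for doc in docs:
        counts = Counter(doc["tokens"])
        length = len(doc["tokens"])
        for term, k in counts.items():
            if term in doc["mesh_terms"]:
                elite.append((k, length))      # term names an assigned topic
            elif term not in mesh_vocabulary:
                non_elite.append((k, length))  # term names no topic at all
            # Descriptor words not assigned to this document are skipped.
    return estimate_rate(elite), estimate_rate(non_elite)
```
            </p>
            <p>The choice of which citations to pass in as <it>docs</it> corresponds to the sampling question raised above.</p>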
            <p>As a final note, while it is theoretically possible to estimate the parameter <it>&#951; </it>based on MeSH descriptors using a similar procedure, this assumes that the coverage of MeSH terms is complete, i.e., that they completely enumerate all topics present in the abstract. Since MeSH terms are assigned by humans, we suspect that recall is less than perfect&#8211;therefore, we do not explore this idea further.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>2 Results</p>
         </st>
         <sec>
            <st>
               <p>2.1 Experimental Design</p>
            </st>
            <p>We evaluated our <it>pmra </it>retrieval model against <it>bm25</it>&#8211;a comparison that is appropriate given their shared theoretical ancestry (see Section 3.2). Despite the popularity and performance of language modeling techniques for information retrieval (see <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> for an overview), <it>bm25 </it>remains a competitive baseline.</p>
            <p>Our experiments were conducted using the test collection from the TREC 2005 genomics track <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, which used a ten-year subset of MEDLINE. The test collection contains fifty information needs and relevance judgments for each, which take the form of lists of PMIDs (unique identifiers for MEDLINE citations) that were previously determined to be relevant by human assessors. See Section 5.1 for more details.</p>
            <p>The evaluation was designed to mimic the operational deployment of related article search in PubMed as much as possible. In total, there are 4584 known relevant documents in the test collection from the TREC 2005 genomics track. Each abstract served as a test "query", and we evaluated the top five results under different experimental conditions (the same number that the current PubMed interface shows). Precision, a standard metric for quantifying retrieval performance, is defined as:</p>
            <p>
               <display-formula id="M13">
                  <m:math name="1471-2105-8-423-i14" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mtext>Precision</m:mtext>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mo>#</m:mo>
                                 <m:mtext>of&#160;relevant&#160;documents</m:mtext>
                              </m:mrow>
                              <m:mrow>
                                 <m:mo>#</m:mo>
                                 <m:mtext>of&#160;retrieved&#160;documents</m:mtext>
                              </m:mrow>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqqGqbaucqqGYbGCcqqGLbqzcqqGJbWycqqGPbqAcqqGZbWCcqqGPbqAcqqGVbWBcqqGUbGBcqGH9aqpdaWcaaqaaiabcocaJiabb+gaVjabbAgaMjabbccaGiabbkhaYjabbwgaLjabbYgaSjabbwgaLjabbAha2jabbggaHjabb6gaUjabbsha0jabbccaGiabbsgaKjabb+gaVjabbogaJjabbwha1jabb2gaTjabbwgaLjabb6gaUjabbsha0jabbohaZbqaaiabcocaJiabb+gaVjabbAgaMjabbccaGiabbkhaYjabbwgaLjabbsha0jabbkhaYjabbMgaPjabbwgaLjabbAha2jabbwgaLjabbsgaKjabbccaGiabbsgaKjabb+gaVjabbogaJjabbwha1jabb2gaTjabbwgaLjabb6gaUjabbsha0jabbohaZbaaaaa@7414@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>More specifically, we measured precision at a cutoff of five retrieved documents, commonly written as P5 for short. Since our test collection contains a list of relevant PMIDs for each information need (i.e., the relevance judgments), this computation was straightforward.</p>
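            <p>As an illustration, P5 for a single test abstract can be computed as in the following Python sketch. The function name and the toy PMIDs are hypothetical; the only inputs are a ranked list of retrieved PMIDs and the set of PMIDs judged relevant for the corresponding information need.</p>
            <p>
```python
def precision_at_k(ranked_pmids, relevant_pmids, k=5):
    """Fraction of the top-k retrieved PMIDs that are judged relevant."""
    top_k = ranked_pmids[:k]
    hits = sum(1 for pmid in top_k if pmid in relevant_pmids)
    # Dividing by k (not by len(top_k)) penalizes runs that return
    # fewer than k documents.
    return hits / k


# Toy identifiers, invented for illustration.
ranked = ["111", "222", "333", "444", "555", "666"]
relevant = {"222", "555", "666", "999"}
p5 = precision_at_k(ranked, relevant)  # 2 of the top 5 are relevant
```
            </p>
            <p>The figure reported for a run is then the mean of this quantity over all test abstracts.</p>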
            <p>We performed two types of experiments:</p>
            <p>&#8226; a number of runs that exhaustively explored the parameter space to determine optimal values, and</p>
            <p>&#8226; additional runs of <it>pmra </it>using parameters that were estimated in different ways.</p>
            <p>The <it>pmra </it>experiments used the ranking algorithm described in the previous section. For <it>bm25</it>, we used the complete text of the abstract verbatim as the "query" and treated the resulting output as the ranked list of related documents. Finally, as a computational expedient, we ran retrieval experiments as a reranking task using the top 100 documents retrieved by <it>bm25 </it>with default parameter settings (<it>k</it><sub>1 </sub>= 1.2, <it>b </it>= 0.75), as implemented in the open source Lemur Toolkit for language modeling and information retrieval <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. Due to the large number of queries involved in our exhaustive exploration of the parameter space and the length of each query (the entire abstract text), this setup made the problem much more tractable given the computational resources we had access to (half a dozen commodity PCs). Since we were only evaluating the top five hits, we believe that this procedure is unlikely to yield different results from a retrieval run against the complete corpus. An experiment to validate this assumption is presented in Section 5.2.</p>
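            <p>The reranking setup can be sketched as follows. This is a schematic Python illustration, not the code used in our experiments: <it>score </it>stands for whichever model is being evaluated, the candidate list stands for the first-stage output, and the exclusion of the query citation from its own result list is an assumption of the sketch.</p>
            <p>
```python
def rerank_top_candidates(query_id, candidates, score, keep=5):
    """Rerank first-stage candidates with `score` and keep the best few."""
    # Assumption of this sketch: the query citation is not its own result.
    pool = [doc_id for doc_id in candidates if doc_id != query_id]
    # sorted() is stable, so tied documents keep their first-stage order.
    ranked = sorted(pool, key=lambda doc_id: score(query_id, doc_id), reverse=True)
    return ranked[:keep]


# Toy scores, invented for illustration.
toy_scores = {"a": 0.2, "b": 0.9, "c": 0.5}
top_two = rerank_top_candidates(
    "q", ["a", "q", "b", "c"], lambda q, d: toy_scores[d], keep=2
)
```
            </p>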
            <p>The following procedures were adopted for our exhaustive runs: For <it>bm25</it>, we tried all possible parameter combinations, with <it>k</it><sub>1 </sub>ranging from 0.5 to 3.0 in 0.1 increments and <it>b </it>from 0.6 to 1.0 in 0.05 increments. This range was selected based on the default settings of <it>k</it><sub>1 </sub>= 1.2, <it>b </it>= 0.75 widely reported in the literature. Our exploration of the <it>pmra </it>parameter space started with arbitrary values of <it>&#955; </it>and <it>&#956;</it>. Assuming that the performance surface was convex and smooth, we tried different values until its shape became apparent. This was accomplished by first fixing a <it>&#955; </it>value and varying <it>&#956; </it>values in increments of 0.001; this process was repeated for different <it>&#955; </it>values in 0.001 increments.</p>
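            <p>The exhaustive <it>bm25 </it>sweep described above amounts to a plain grid search over 26 values of <it>k</it><sub>1 </sub>and 9 values of <it>b </it>(234 combinations). A minimal Python sketch follows; <it>evaluate_p5 </it>is a hypothetical callable standing in for a full retrieval run, and the toy surrogate at the end is invented solely to exercise the loop.</p>
            <p>
```python
def frange(start, stop, step):
    """Inclusive list of floats, built from integer steps to avoid drift."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


K1_VALUES = frange(0.5, 3.0, 0.1)   # 0.5, 0.6, ..., 3.0
B_VALUES = frange(0.6, 1.0, 0.05)   # 0.60, 0.65, ..., 1.00


def grid_search(evaluate_p5):
    """Return (best_p5, k1, b) over the full grid.

    evaluate_p5 is a hypothetical callable that runs retrieval with the
    given parameters and returns mean P5 over all test abstracts.
    """
    best = None
    for k1 in K1_VALUES:
        for b in B_VALUES:
            p5 = evaluate_p5(k1, b)
            if best is None or p5 > best[0]:
                best = (p5, k1, b)
    return best


# Toy surrogate with an arbitrary peak, used only to exercise the loop.
best = grid_search(lambda k1, b: -((k1 - 1.0) ** 2 + (b - 0.8) ** 2))
```
            </p>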
            <p>In the second set of experiments, <it>&#955; </it>and <it>&#956; </it>for <it>pmra </it>were estimated using the procedure described in Section 1.2, on different sets of citations. We also performed cross-validation as necessary to further verify our experimental results.</p>
         </sec>
         <sec>
            <st>
               <p>2.2 Optimal Parameters</p>
            </st>
            <p>The results of our exhaustive parameter tuning experiments for <it>bm25 </it>are shown in Figure <figr fid="F2">2</figr>, which plots precision at five across a wide range of parameter values. We note that except for low values of <it>k</it><sub>1 </sub>and <it>b</it>, P5 performance is relatively insensitive to parameter settings (more on this below). Results for the <it>pmra </it>parameter tuning experiments are shown in Figure <figr fid="F3">3</figr>&#8211;regions in the parameter space that yield high precision lie along a prominent "ridge" that cuts diagonally from smaller to larger values of <it>&#955; </it>and <it>&#956; </it>(more on this in Section 3.3).</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>P5 for the <it>bm25 </it>model given different settings of the parameters <it>k</it><sub>1 </sub>and <it>b</it></p>
               </caption>
               <text>
                  <p>P5 for the <it>bm25 </it>model given different settings of the parameters <it>k</it><sub>1 </sub>and <it>b</it>. This plot was generated by exhaustively trying all <it>k</it><sub>1 </sub>values 0.5 to 3.0 (in 0.1 increments) and <it>b </it>values 0.6 to 1.0 (in 0.05 increments). Notice that except for low values of <it>k</it><sub>1 </sub>and <it>b</it>, P5 performance is relatively insensitive to parameter settings.</p>
               </text>
               <graphic file="1471-2105-8-423-2"/>
            </fig>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>P5 for the <it>pmra </it>model given different settings of the parameters <it>&#955; </it>(Poisson parameter for the elite distribution) and <it>&#956; </it>(Poisson parameter for the non-elite distribution)</p>
               </caption>
               <text>
                  <p>P5 for the <it>pmra </it>model given different settings of the parameters <it>&#955; </it>(Poisson parameter for the elite distribution) and <it>&#956; </it>(Poisson parameter for the non-elite distribution). Notice that the parameter settings resulting in high P5 values lie along a "ridge" in the parameter space.</p>
               </text>
               <graphic file="1471-2105-8-423-3"/>
            </fig>
            <p>The highest P5 performance for <it>bm25 </it>is achieved with <it>k</it><sub>1 </sub>= 1.9 and <it>b </it>= 1.0; by the same metric, the optimal setting for <it>pmra </it>is <it>&#955; </it>= 0.022 and <it>&#956; </it>= 0.013. Table <tblr tid="T1">1</tblr> shows precision at five values numerically for optimal <it>bm25 </it>and optimal <it>pmra</it>, which we refer to as <it>bm25</it>* and <it>pmra</it>* for convenience. For comparison, the performance of <it>bm25 </it>with default parameter values <it>k</it><sub>1 </sub>= 1.2, <it>b </it>= 0.75 (denoted as <it>bm25</it><sup>b</sup>) is also shown. We applied the Wilcoxon signed-rank test to determine if the differences in the evaluation metrics are statistically significant. Throughout this paper, significance at the 1% level is indicated by **; significance at the 5% level is indicated by *. Differences that are not statistically significant are marked with the symbol &#176;. Results show a small, but statistically significant improvement of <it>pmra </it>over <it>bm25 </it>(both default and optimized), but no significant difference between optimized and default <it>bm25</it>. Due to the large number of test abstracts, we are able to discriminate small differences in performance between the models (recall that each of the 4584 relevant documents from the test collection was used as a test abstract).</p>
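            <p>For readers unfamiliar with the test, the following Python sketch shows one common formulation of the two-sided Wilcoxon signed-rank test, using the normal approximation with average ranks for ties and with zero differences dropped. It is an illustration rather than the implementation used for the results reported here. The inputs are paired per-abstract scores for two runs; the toy data are invented and use integer counts of relevant documents in the top five, which avoids spurious floating-point ties.</p>
            <p>
```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Returns (w_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i != n:
        # Extend j to the end of the current group of tied values.
        j = i
        while j + 1 != n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    # Variance of W+ under the null hypothesis, with tie correction.
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, p_value


# Toy paired hit counts for two runs, invented for illustration.
run_a = [3, 2, 4, 1, 5, 2, 3, 4]
run_b = [2, 2, 3, 1, 3, 3, 2, 2]
w_plus, p_value = wilcoxon_signed_rank(run_a, run_b)
```
            </p>
            <p>The normal approximation is appropriate for large samples such as the 4584 test abstracts used here; for very small samples an exact test would be preferable.</p>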
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Overall comparison between the <it>bm25 </it>and <it>pmra </it>models.</p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c ca="left">
                        <p>
                           <b>Run</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Model</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Description</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>P5</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>vs. bm25</it>
                           <sup>b</sup>
                        </p>
                     </c>
                     <c ca="left">
                        <p><it>vs. bm25</it>*</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>bm25</it>
                           <sup>b</sup>
                        </p>
                     </c>
                     <c ca="left">
                        <p><it>bm25 </it>(<it>k</it><sub>1 </sub>= 1.2, <it>b </it>= 0.75)</p>
                     </c>
                     <c ca="left">
                        <p><it>bm25</it>, default parameters</p>
                     </c>
                     <c ca="left">
                        <p>0.381</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>-0.5%&#176;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p><it>bm25</it>*</p>
                     </c>
                     <c ca="left">
                        <p><it>bm25 </it>(<it>k</it><sub>1 </sub>= 1.9, <it>b </it>= 1.00)</p>
                     </c>
                     <c ca="left">
                        <p><it>bm25</it>, optimal parameters</p>
                     </c>
                     <c ca="left">
                        <p>0.383</p>
                     </c>
                     <c ca="left">
                        <p>+0.5%&#176;</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p><it>pmra</it>*</p>
                     </c>
                     <c ca="left">
                        <p><it>pmra </it>(<it>&#955; </it>= 0.022, <it>&#956; </it>= 0.013)</p>
                     </c>
                     <c ca="left">
                        <p><it>pmra</it>, optimal parameters</p>
                     </c>
                     <c ca="left">
                        <p>0.399</p>
                     </c>
                     <c ca="left">
                        <p>+4.7% **</p>
                     </c>
                     <c ca="left">
                        <p>+4.2% **</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Table shows model parameters, P5 values over the entire test collection, and relative differences.</p>
               </tblfn>
            </tbl>
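            <p>The relative differences in Table 1 follow directly from the P5 values, as the short Python check below illustrates. The dictionary keys are labels chosen for this sketch. Because the inputs are the rounded P5 figures from the table, recomputed percentages can differ in the last digit from those derived from unrounded scores.</p>
            <p>
```python
def relative_difference(x, y):
    """Relative improvement of x over y, in percent."""
    return 100.0 * (x - y) / y


# P5 values taken from Table 1.
p5 = {"bm25_default": 0.381, "bm25_optimal": 0.383, "pmra_optimal": 0.399}

pmra_vs_default = relative_difference(p5["pmra_optimal"], p5["bm25_default"])
pmra_vs_optimal = relative_difference(p5["pmra_optimal"], p5["bm25_optimal"])
optimal_vs_default = relative_difference(p5["bm25_optimal"], p5["bm25_default"])
default_vs_optimal = relative_difference(p5["bm25_default"], p5["bm25_optimal"])
```
            </p>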
            <p>Information needs from the TREC 2005 genomics track were grouped into five templates, each with ten different instantiations; see Section 5.1 for more details. Precision at five values broken down by template are shown in Table <tblr tid="T2">2</tblr>. Relative differences are shown in Table <tblr tid="T3">3</tblr>, along with the results of Wilcoxon signed-rank tests. We find that in general, differences between default and optimized <it>bm25 </it>are not statistically significant, except for template #3. Optimized <it>pmra </it>outperforms optimized <it>bm25 </it>on four out of five templates, three of which are statistically significant.</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Comparison between the <it>bm25 </it>and <it>pmra </it>models, broken down by template.</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="left">
                        <p>
                           <b>Template</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>bm25</it>
                           <sup>b</sup>
                        </p>
                     </c>
                     <c ca="center">
                        <p><it>bm25</it>*</p>
                     </c>
                     <c ca="center">
                        <p><it>pmra</it>*</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>#1: methods or protocols</p>
                     </c>
                     <c ca="center">
                        <p>0.211</p>
                     </c>
                     <c ca="center">
                        <p>0.210</p>
                     </c>
                     <c ca="center">
                        <p>0.253</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>#2: role of gene in disease</p>
                     </c>
                     <c ca="center">
                        <p>0.484</p>
                     </c>
                     <c ca="center">
                        <p>0.487</p>
                     </c>
                     <c ca="center">
                        <p>0.499</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>#3: role of gene in biological process</p>
                     </c>
                     <c ca="center">
                        <p>0.351</p>
                     </c>
                     <c ca="center">
                        <p>0.365</p>
                     </c>
                     <c ca="center">
                        <p>0.349</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>#4: gene interactions in organ/disease</p>
                     </c>
                     <c ca="center">
                        <p>0.297</p>
                     </c>
                     <c ca="center">
                        <p>0.281</p>
                     </c>
                     <c ca="center">
                        <p>0.303</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>#5: mutation of gene and its impact</p>
                     </c>
                     <c ca="center">
                        <p>0.440</p>
                     </c>
                     <c ca="center">
                        <p>0.438</p>
                     </c>
                     <c ca="center">
                        <p>0.462</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Table shows P5 values for <it>bm25</it><sup>b </sup>(default parameters), <it>bm25</it>* (optimized parameters), and <it>pmra</it>* (optimized parameters).</p>
               </tblfn>
            </tbl>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Relative differences between the <it>bm25 </it>and <it>pmra </it>models.</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="left">
                        <p>
                           <b>Template</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p><it>bm25</it>* <it>vs. bm25</it><sup>b</sup></p>
                     </c>
                     <c ca="left">
                        <p><it>pmra</it>* <it>vs. bm25</it><sup>b</sup></p>
                     </c>
                     <c ca="left">
                        <p><it>pmra</it>* <it>vs. bm25</it>*</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>#1: methods or protocols</p>
                     </c>
                     <c ca="left">
                        <p>-0.5%&#176;</p>
                     </c>
                     <c ca="left">
                        <p>+20.0% **</p>
                     </c>
                     <c ca="left">
                        <p>+20.5% **</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>#2: role of gene in disease</p>
                     </c>
                     <c ca="left">
                        <p>+0.6%&#176;</p>
                     </c>
                     <c ca="left">
                        <p>+3.1% *</p>
                     </c>
                     <c ca="left">
                        <p>+2.5% *</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>#3: role of gene in biological process</p>
                     </c>
                     <c ca="left">
                        <p>+4.0% **</p>
                     </c>
                     <c ca="left">
                        <p>-0.6%&#176;</p>
                     </c>
                     <c ca="left">
                        <p>-4.4% **</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>#4: gene interactions in organ/disease</p>
                     </c>
                     <c ca="left">
                        <p>-5.4%&#176;</p>
                     </c>
                     <c ca="left">
                        <p>+2.0%&#176;</p>
                     </c>
                     <c ca="left">
                        <p>+7.8%&#176;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>#5: mutation of gene and its impact</p>
                     </c>
                     <c ca="left">
                        <p>-0.5%&#176;</p>
                     </c>
                     <c ca="left">
                        <p>+5.0% *</p>
                     </c>
                     <c ca="left">
                        <p>+5.5% **</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Three conditions are compared: <it>bm25</it><sup>b </sup>(default parameters), <it>bm25</it>* (optimized parameters), and <it>pmra</it>* (optimized parameters). Each column represents an <it>x vs. y </it>comparison, where the figures indicate the relative improvements of <it>x </it>over <it>y</it>. In general, we see that the differences between optimized and default <it>bm25 </it>are not statistically significant, whereas the differences between <it>pmra </it>and <it>bm25 </it>are.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>2.3 Estimated Parameters</p>
            </st>
            <p>We also attempted to automatically estimate parameters for the <it>pmra </it>model using the method described in Section 1.2. However, that method is underspecified with respect to the set of MEDLINE citations over which it is applied. We experimented with the following possibilities:</p>
            <p>&#8226; The complete set of documents examined by human assessors in the TREC 2005 genomics track (see <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> for a description of how these documents were gathered).</p>
            <p>&#8226; The top 100 hits for each of the 4584 PMIDs that comprise our test abstracts, using <it>bm25 </it>with default parameters.</p>
            <p>&#8226; The top 100 hits for each of the 50 template queries that comprise the TREC 2005 genomics track, retrieved using Indri's default ranking algorithm based on language models. Indri is a component in the open source Lemur Toolkit.</p>
            <p>&#8226; Same as previous, except with top 1000 hits.</p>
            <p>The estimated parameters for each citation set are shown in Table <tblr tid="T4">4</tblr>, along with the size of each set and the precision achieved. In the first condition, the estimated parameters differ from the optimal ones, but the resulting P5 figure is statistically indistinguishable. For the three other citation sets, the estimated parameters were very close to the optimal parameters. Once again, the differences are not statistically significant. These results suggest that our parameter estimation method is robust and effective. Furthermore, it also appears to be insensitive with respect to the size and composition of the citation set.</p>
            <tbl id="T4">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Values of <it>pmra </it>parameters (<it>&#955;</it>, <it>&#956;</it>) estimated using different sets of MEDLINE citations.</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c ca="left">
                        <p>
                           <b>Set Used</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Size</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>&#955;</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>&#956;</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>P5</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>All assessed documents from TREC 2005 genomics track</p>
                     </c>
                     <c ca="left">
                        <p>39874</p>
                     </c>
                     <c ca="left">
                        <p>0.032</p>
                     </c>
                     <c ca="left">
                        <p>0.022</p>
                     </c>
                     <c ca="left">
                        <p>0.397&#176;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Top 100 hits for every relevant citation, <it>bm25</it></p>
                     </c>
                     <c ca="left">
                        <p>453402</p>
                     </c>
                     <c ca="left">
                        <p>0.023</p>
                     </c>
                     <c ca="left">
                        <p>0.013</p>
                     </c>
                     <c ca="left">
                        <p>0.398&#176;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Top 100 hits for every template query, Indri</p>
                     </c>
                     <c ca="left">
                        <p>4991</p>
                     </c>
                     <c ca="left">
                        <p>0.022</p>
                     </c>
                     <c ca="left">
                        <p>0.012</p>
                     </c>
                     <c ca="left">
                        <p>0.397&#176;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Top 1000 hits for every template query, Indri</p>
                     </c>
                     <c ca="left">
                        <p>49907</p>
                     </c>
                     <c ca="left">
                        <p>0.024</p>
                     </c>
                     <c ca="left">
                        <p>0.013</p>
                     </c>
                     <c ca="left">
                        <p>0.397&#176;</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Optimal parameters</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>0.022</p>
                     </c>
                     <c ca="left">
                        <p>0.013</p>
                     </c>
                     <c ca="left">
                        <p>0.399</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>We see that estimated values are close to optimal parameter values in many cases, and that differences in P5 performance are not statistically significant.</p>
               </tblfn>
            </tbl>
            <p>Finally, to further verify these results and to ensure that we were not estimating parameters from the same set used to measure precision, cross-validation experiments were performed on the second condition. The 4584 test abstracts were divided into five folds, stratified across the templates so that each template was represented in each fold. We conducted five separate experiments, using four of the folds for parameter estimation and the final fold for evaluation. The results were exactly the same&#8211;P5 figures were statistically indistinguishable from the optimal values.</p>
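            <p>The fold construction can be sketched in Python as follows. This is an illustrative reconstruction under assumptions of our own: test abstracts are represented as (PMID, template) pairs, PMIDs within each template are shuffled with a fixed seed and dealt round-robin into folds, and all function names are hypothetical.</p>
            <p>
```python
import random
from collections import defaultdict


def stratified_folds(items, num_folds=5, seed=0):
    """Split (pmid, template) pairs into folds, stratified by template."""
    by_template = defaultdict(list)
    for pmid, template in items:
        by_template[template].append(pmid)
    rng = random.Random(seed)
    folds = [[] for _ in range(num_folds)]
    for template in sorted(by_template):
        pmids = by_template[template]
        rng.shuffle(pmids)
        # Deal PMIDs round-robin so each fold gets a share of every template.
        for position, pmid in enumerate(pmids):
            folds[position % num_folds].append(pmid)
    return folds


def cross_validation_splits(folds):
    """Yield (estimation_pmids, evaluation_pmids) for each held-out fold."""
    for held_out in range(len(folds)):
        estimation = [
            pmid for i, fold in enumerate(folds) if i != held_out for pmid in fold
        ]
        yield estimation, folds[held_out]


# Toy data: 5 templates with 10 invented identifiers each.
toy_items = [(f"pmid{t}_{n}", t) for t in range(1, 6) for n in range(10)]
folds = stratified_folds(toy_items)
splits = list(cross_validation_splits(folds))
```
            </p>
            <p>In each split, parameters would be estimated from the first list and P5 measured on the second, so that no test abstract contributes to both.</p>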
            <p>In summary, we have empirically demonstrated the effectiveness of our <it>pmra </it>retrieval model and shown a small but statistically significant improvement in precision at five documents over the <it>bm25 </it>baseline. Furthermore, our novel parameter estimation method was found to be effective when applied to a wide range of citation sets varying in both composition and size. Notably, the tuning of parameters did not require relevance judgments, the component in a test collection that is the most expensive and time-consuming to gather.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>3 Discussion</p>
         </st>
         <sec>
            <st>
               <p>3.1 Significance of Results</p>
            </st>
            <p>Although we measured statistically significant differences in P5 between <it>pmra </it>and <it>bm25</it>, are the improvements meaningful in a real sense? The difference between baseline <it>bm25 </it>and optimal <it>pmra </it>(achievable by our parameter estimation process) is 4.7%. In terms of the PubMed interface, for each abstract, one would expect 2.0 <it>vs. </it>1.9 interesting articles in the related links display. We argue that although small, this is nevertheless a meaningful improvement.</p>
            <p>PubMed is one of the Internet's most-visited gateways to MEDLINE&#8211;small differences, multiplied by thousands of users and many more interactions, add up to substantial quantities. In addition, our metrics measure performance differences <it>per interaction</it>, since a list of related articles is retrieved for every citation that the user examines. In the course of a search session, a user may examine many citations, especially when conducting in-depth research on a particular subject. Thus, the effects of small performance improvements accumulate.</p>
            <p>One might also argue that this accumulation of benefits is not linear. Consider the case of repeatedly browsing related articles&#8211;the user views a citation, examines related articles, selects an interesting one, and repeats (cf. the simulation studies in <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr></abbrgrp>). In that case, the expected number of interesting links per interaction can be viewed as a branching factor if one wanted to quantify the total number of interesting articles that are accessible in this manner. In about 13 interactions, an improvement of 0.1 (i.e., 1.9<sup>13 </sup>vs. 2.0<sup>13</sup>) would result in potential access to twice as many interesting articles.</p>
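            <p>The arithmetic behind this branching-factor argument can be checked with a few lines of Python. The variable names are ours, and the rates are the rounded per-citation expectations quoted above.</p>
            <p>
```python
import math

old_rate, new_rate = 1.9, 2.0   # expected interesting links per citation
steps = 13

# Ratio of reachable articles after 13 browsing steps.
ratio_at_13 = (new_rate / old_rate) ** steps

# Number of steps at which the ratio reaches exactly 2.
doubling_steps = math.log(2) / math.log(new_rate / old_rate)
```
            </p>
            <p>The ratio is slightly below 2 after 13 steps and the exact doubling point falls between 13 and 14 steps, which is consistent with the "about 13 interactions" estimate.</p>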
         </sec>
         <sec>
            <st>
               <p>3.2 Comparison to Other Work</p>
            </st>
            <p>A suitable point of comparison for this work is the Binary Independence Retrieval (BIR) model for probabilistic IR <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>, which underlies <it>bm25</it>. Indeed, <it>bm25 </it>was chosen as a baseline not only for its performance, but also because it shares certain theoretical similarities with our model. Like related work dating back several decades <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>, both models attempt to capture term frequencies with Poisson distributions. However, there are important differences that set our work apart.</p>
            <p>The <it>pmra </it>model was designed for a fundamentally different task&#8211;related document search, not <it>ad hoc </it>retrieval. In the latter, the system's task is to return a ranked list of documents that are relevant to a user's query (what most people think of as "search"). One substantial difference is query length&#8211;in <it>ad hoc </it>retrieval, user queries are typically very short (a few words at the most). As a result, query-length normalization is not a critical problem, and hence has not received much attention. In contrast, since the "query" in related document search is a complete document, more care is required to account for document length differences.</p>
            <p>Another important difference between <it>pmra </it>and <it>bm25 </it>is that there is no notion of relevance in the <it>pmra </it>model, only that of relatedness, mediated via topic similarity. Note, however, that the concept of relevance is still <it>implicitly </it>present in the task definition&#8211;in that the examination of documents may take place in the context of broader information-seeking behaviors. In contrast, the starting point of BIR is a log-odds, i.e., <it>P</it>(<it>R</it>|<it>D</it>)/<it>P</it>(<inline-formula><m:math name="1471-2105-8-423-i15" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>R</m:mi><m:mo>&#175;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGsbGugaqeaaaa@2DF1@</m:annotation></m:semantics></m:math></inline-formula>|<it>D</it>), which explicitly attempts to estimate the relevance (<it>R</it>) and non-relevance (<inline-formula><m:math name="1471-2105-8-423-i15" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>R</m:mi><m:mo>&#175;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGsbGugaqeaaaa@2DF1@</m:annotation></m:semantics></m:math></inline-formula>) of a document (<it>D</it>). Relevance is then modeled in terms of eliteness (see below). The starting point of our task definition leads to a different derivation.</p>
            <p>Although both <it>bm25 </it>and <it>pmra </it>attempt to capture term frequencies in terms of Poisson distributions, they do so in different ways. BIR employs a more complex representation, where term frequencies are modeled as mixtures of two different Poisson distributions (elite and non-elite). In total, the complete model has four parameters&#8211;the two Poisson parameters, <it>P</it>(<it>E</it>|<it>R</it>), and <it>P</it>(<it>E</it>|<inline-formula><m:math name="1471-2105-8-423-i15" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>R</m:mi><m:mo>&#175;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGsbGugaqeaaaa@2DF1@</m:annotation></m:semantics></m:math></inline-formula>). Since eliteness is a hidden variable, there is no way to estimate the parameters directly. Instead, Robertson and Walker devised simple approximations that work well empirically <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>. One side effect of this 2-Poisson approximation is that <it>bm25 </it>parameters are not physically meaningful, unlike <it>&#955; </it>and <it>&#956; </it>in <it>pmra</it>, which correspond to comprehensible quantities. Unlike BIR, our model makes the simplifying assumption that terms are exclusively drawn from either the elite or non-elite distribution. That is, if the document is about a particular topic, then the corresponding term frequency is dictated solely by the elite Poisson distribution; similarly, the non-elite distribution for the non-elite case.</p>
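            <p>The contrast between the two representations can be sketched as follows. This is a minimal illustration rather than either system's implementation: the function names, the single mixing weight <it>p_elite </it>(standing in for either <it>P</it>(<it>E</it>|<it>R</it>) or its non-relevant counterpart), and the rates in the demonstration are all assumptions made for the example.</p>

```python
import math


def poisson_pmf(k, rate):
    """Probability of observing k occurrences under a Poisson with this rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def two_poisson_mixture(k, p_elite, elite_rate, non_elite_rate):
    """Mixture form: eliteness is hidden, so both components contribute."""
    return (p_elite * poisson_pmf(k, elite_rate)
            + (1.0 - p_elite) * poisson_pmf(k, non_elite_rate))


def exclusive_poisson(k, is_elite, elite_rate, non_elite_rate):
    """Simplified form: the count is drawn from exactly one distribution,
    chosen by whether the document is about the topic."""
    return poisson_pmf(k, elite_rate if is_elite else non_elite_rate)


if __name__ == "__main__":
    # Hypothetical rates chosen only for illustration (not from the paper).
    elite_rate, non_elite_rate = 3.0, 0.5
    for k in range(4):
        print(k,
              round(two_poisson_mixture(k, 0.2, elite_rate, non_elite_rate), 4),
              round(exclusive_poisson(k, True, elite_rate, non_elite_rate), 4),
              round(exclusive_poisson(k, False, elite_rate, non_elite_rate), 4))
```

            <p>Setting the mixing weight to 1 or 0 collapses the mixture onto the elite or non-elite distribution, which is the sense in which the exclusive form is a special case of the mixture.</p>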
            <p>Finally, the derivation of our model, coupled with the availability of MeSH headings in the biomedical domain, allows us to directly estimate parameters for our system. Most notably, the process does not require a test collection with relevance judgments, making the parameter optimization process far less onerous.</p>
         </sec>
         <sec>
            <st>
               <p>3.3 Parameter Estimation</p>
            </st>
            <p>The estimation of parameters in the <it>pmra </it>model depends on the existence of MeSH terms, which is indeed a fortuitous happenstance in the case of MEDLINE. Does this limit the applicability of our model to other domains in which topic indexing and controlled vocabularies are not available? We note that effective access to biomedical text is a sufficiently important application that even a narrowly-tailored solution represents a contribution. Nevertheless, we present evidence to suggest that the <it>pmra </it>model provides a general solution to related document search.</p>
            <p>We see from Figure <figr fid="F3">3</figr> that our model performs well with settings that lie along a ridge in the parameter space. This observation is confirmed in Figure <figr fid="F4">4</figr>&#8211;for each value of <it>&#955; </it>(from 0.015 to 0.035 in increments of 0.001), we plot the optimal value of <it>&#956;</it>. Superimposed on this graph is a linear regression line, which achieves an <it>R</it><sup>2</sup> value of 0.976, a very good fit. This finding suggests that the relationship between <it>&#955; </it>and <it>&#956; </it>is perhaps even more important than their absolute values, since good performance is attainable with a wide range of parameter settings (as long as the relationship between <it>&#955; </it>and <it>&#956; </it>is maintained).</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Optimal <it>&#956; </it>(Poisson parameter for the non-elite distribution) for each <it>&#955; </it>value (Poisson parameter for the elite distribution) in the <it>pmra </it>model</p>
               </caption>
               <text>
                  <p>Optimal <it>&#956; </it>(Poisson parameter for the non-elite distribution) for each <it>&#955; </it>value (Poisson parameter for the elite distribution) in the <it>pmra </it>model. Regression line shows a linear relationship between these two parameters, corresponding to the "ridge" in Figure 3.</p>
               </text>
               <graphic file="1471-2105-8-423-4"/>
            </fig>
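            <p>The regression step described above can be sketched with an ordinary least-squares fit. The helper name <it>fit_line </it>and the sample points are hypothetical; the points are generic placeholders and are not the values plotted in Figure 4.</p>

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, with R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Placeholder points for illustration only; NOT the data behind Figure 4.
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.1, 3.9, 6.2, 7.8, 10.1]
    slope, intercept, r_squared = fit_line(xs, ys)
    print(slope, intercept, r_squared)
    # Interpolating y for an x that was not measured directly.
    print(slope * 2.5 + intercept)
```

            <p>In the setting of this section, the <it>x </it>values would be the candidate <it>&#955; </it>settings and the <it>y </it>values the corresponding optimal <it>&#956;</it>; the fitted line then supplies an interpolated <it>&#956; </it>for any <it>&#955;</it>.</p>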
            <p>How good is related document search performance along this ridge? The answer is found in Figure <figr fid="F5">5</figr>. On the <it>x</it>-axis we plot values of <it>&#955;</it>; the <it>y</it>-axis shows P5 values for two conditions&#8211;optimal <it>&#956; </it>(for that <it>&#955;</it>), shown as squares, and interpolated <it>&#956; </it>based on the regression line in Figure <figr fid="F4">4</figr>, shown as diamonds. The performance of the globally-optimal setting (<it>&#955; </it>= 0.022, <it>&#956; </it>= 0.013, which yields P5 = 0.399) is shown as the dotted line. We see that across a wide range of parameter settings, P5 performance remains close to the global optimum. The Wilcoxon signed-rank test was applied to compare the performance at each setting with the globally-optimal setting: differences that are statistically significant (<it>p </it>&lt; 0.05) are shown as solid diamonds and squares. Only at the two ends of the <it>&#955; </it>range do differences become significant.</p>
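            <p>A paired comparison of this kind can be sketched with a small Wilcoxon signed-rank routine. The version below is an assumption-laden illustration, not the procedure used to produce Figure 5: it uses the normal approximation without tie or continuity corrections, and the inputs are presumed to be per-abstract P5 scores under two parameter settings. An established statistics library would normally be preferable in practice.</p>

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while len(values) > start:
        end = start
        # Extend the run while the next sorted value ties with this one.
        while (len(values) > end + 1
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def wilcoxon_signed_rank(xs, ys):
    """Two-sided p-value via the normal approximation; zero differences
    are discarded before ranking."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    if n == 0:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))
```

            <p>A setting would then be marked as significantly different from the global optimum when the returned value falls below 0.05.</p>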
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>P5 at optimal and interpolated values of <it>&#956; </it>for each <it>&#955; </it>in the <it>pmra </it>model</p>
               </caption>
               <text>
                  <p>P5 at optimal and interpolated values of <it>&#956; </it>for each <it>&#955; </it>in the <it>pmra </it>model. Squares represent optimal <it>&#956; </it>at each <it>&#955;</it>, corresponding to the squares in Figure 4. Diamonds represent interpolated <it>&#956; </it>at each <it>&#955;</it>, corresponding to the regression line in Figure 4. P5 of the globally optimal parameter setting is shown as the dotted line. The filled square and diamond represent points at which P5 is significantly lower than the globally optimal setting.</p>
               </text>
               <graphic file="1471-2105-8-423-5"/>
            </fig>
            <p>This finding suggests that the <it>pmra </it>model is relatively insensitive to parameter settings, so long as a particular relationship is maintained between <it>&#955; </it>and <it>&#956;</it>. Thus, it would be reasonable to apply our model to texts for which controlled-vocabulary resources do not exist.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>4 Conclusion</p>
         </st>
         <p>In most search applications, system input consists of a short query, which is a textual representation of the user's information need. In contrast, this work focuses on related document search, where given a document, the goal is to find other documents that may be of interest to the user&#8211;in our case, the specific task is to retrieve related MEDLINE abstracts. We present a novel probabilistic topic-based content similarity algorithm for accomplishing this, deployed in the PubMed search engine. Experiments on the TREC 2005 genomics track test collection show a small but statistically significant improvement over <it>bm25</it>, a competitive probabilistic retrieval model. Evidence suggests that the <it>pmra </it>model is able to effectively retrieve related articles, and that its integration into PubMed enriches the user experience.</p>
      </sec>
      <sec>
         <st>
            <p>5 Methods</p>
         </st>
         <sec>
            <st>
               <p>5.1 Test Collection</p>
            </st>
            <p>The test collection used in our experiments was developed from the TREC 2005 genomics track <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. The Text Retrieval Conferences (TRECs) are annual evaluations of information retrieval systems that draw dozens of participants from all over the world each year <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. Numerous "tracks" at TREC focus on different aspects of information retrieval, ranging from spam detection to question answering. The genomics track in 2005 focused on retrieval of MEDLINE abstracts in response to typical information needs of biologists and other biomedical researchers.</p>
            <p>The live MEDLINE database as deployed in PubMed is constantly evolving as new articles are added, making it unsuitable for controlled, reproducible experiments. Therefore, the TREC 2005 genomics track evaluation employed a ten-year subset of MEDLINE (1994&#8211;2003), which totals 4.6 million citations (approximately a third of the size of the entire database at the time it was collected in 2004). Each record is identified by a unique PMID and includes bibliographic information and abstract text (if available).</p>
            <p>One salient feature of the evaluation is its use of generic topic templates (GTTs) to capture users' information needs, instead of the typical free-text title, description, and narrative combinations used in other <it>ad hoc </it>retrieval tasks, e.g., <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. The GTTs consist of semantic types, such as genes and diseases, that are embedded in common genomics-related information needs, as determined from interviews with biologists. In total, five templates were developed, with ten fully-instantiated information needs for each; examples are shown in Table <tblr tid="T5">5</tblr>. The templates impose a level of organization on the information needs, but do not have a substantial impact on system performance since participants for the most part did not exploit the template structure, but instead treated the topics no differently than free-text queries.</p>
            <tbl id="T5">
               <title>
                  <p>Table 5</p>
               </title>
               <caption>
                  <p>Templates and sample instantiations used in the TREC 2005 genomics track evaluation.</p>
               </caption>
               <tblbdy cols="1">
                  <r>
                     <c ca="left">
                        <p>#1 <b>Information describing standard </b>[<b>methods or protocols</b>] <b>for doing some sort of experiment or procedure.</b></p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p><it>methods or protocols: </it>how to "open up" a cell through a process called "electroporation"</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>#2 <b>Information describing the role(s) of a </b>[<b>gene</b>] <b>involved in a </b>[<b>disease</b>].</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p><it>gene: </it>interferon-beta</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p><it>disease: </it>multiple sclerosis</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>#3 <b>Information describing the role of a </b>[<b>gene</b>] <b>in a specific </b>[<b>biological process</b>].</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p><it>gene: </it>nucleoside diphosphate kinase (NM23)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p><it>biological process: </it>tumor progression</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>#4 <b>Information describing interactions between two or more </b>[<b>genes</b>] <b>in the </b>[<b>function of an organ</b>] <b>or in a </b>[<b>disease</b>].</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p><it>genes: </it>CFTR and Sec61</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p><it>function of an organ: </it>degradation of CFTR</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p><it>disease: </it>cystic fibrosis</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>#5 <b>Information describing one or more </b>[<b>mutations</b>] <b>of a given </b>[<b>gene</b>] <b>and its </b>[<b>biological impact or role</b>].</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p><it>gene with mutation: </it>BRCA1 185delAG mutation</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p><it>biological impact: </it>role in ovarian cancer</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>In total, 32 groups submitted 59 runs to the TREC 2005 genomics track, consisting of both automatic runs and those with human intervention. Relevance judgments were provided by an undergraduate student and a Ph.D. researcher in biology. We adapted the judgments for our task by treating each relevant document as a test abstract&#8211;citations relevant to the same information need were said to be related to each other. In other words, we assume that if a user were examining a MEDLINE citation to address a particular information need, other relevant citations would also be of interest.</p>
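            <p>The adaptation of relevance judgments described above might be sketched as follows. The tuple layout of the input, the function name, and the identifiers in the demonstration are assumptions for illustration; this is not the official TREC judgment file format, and pooling related sets across topics for a citation that is relevant to more than one information need is likewise an assumption.</p>

```python
from collections import defaultdict


def related_sets(judgments):
    """Map each relevant PMID to the other PMIDs judged relevant to the
    same information need.

    `judgments` is an iterable of (topic_id, pmid, is_relevant) tuples;
    this layout is an assumption made for the example.
    """
    by_topic = defaultdict(set)
    for topic_id, pmid, is_relevant in judgments:
        if is_relevant:
            by_topic[topic_id].add(pmid)

    related = defaultdict(set)
    for pmids in by_topic.values():
        for pmid in pmids:
            # Assumption: a citation relevant to several topics pools them.
            related[pmid] |= pmids - {pmid}
    return dict(related)


if __name__ == "__main__":
    # Hypothetical topic and citation identifiers.
    sample = [("topic-1", "pmid-a", True),
              ("topic-1", "pmid-b", True),
              ("topic-1", "pmid-c", False),
              ("topic-2", "pmid-d", True)]
    print(related_sets(sample))
```

            <p>Each key of the returned mapping then serves as a test abstract, and its value is the set of citations counted as related when scoring a ranked list.</p>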
         </sec>
         <sec>
            <st>
               <p>5.2 Reranking Experiments</p>
            </st>
            <p>Recall from Section 2.1 that for computational expediency, our experiments were performed as reranking runs over results retrieved by <it>bm25 </it>with default parameters. We describe an experiment that examined the potential impact of this setup.</p>
            <p>In theory, both <it>bm25 </it>and <it>pmra </it>establish an ordering over <it>all </it>documents in a corpus with respect to a query. Reranking in the limit yields exactly the same results; thus, the substantive question is whether reranking the top hundred hits would yield the same results as searching over the entire corpus. We can examine this issue by tallying the original rank positions of the top five results after reranking&#8211;that is, if reranking promotes hits that are highly ranked in the original list to begin with, then we can conclude that hits in the lower ranked positions of the original list matter little. On the other hand, if the reranking brings up hits that are very far down in the original ranked list, it might cause us to wonder what other documents from lower-ranked positions are missed.</p>
            <p>We performed exactly this experiment with the optimal <it>pmra </it>run (<it>&#955; </it>= 0.022, <it>&#956; </it>= 0.013). For each test abstract, we tallied the original ranks of the top five results, e.g., hit 1 of <it>pmra </it>was promoted from hit 9 of the original ranked list, etc. We divided the original rank positions into ten bins of equal size and plotted a histogram of the bin frequencies. The results are shown by the bar graph in Figure <figr fid="F6">6</figr>; the line graph shows the corresponding cumulative distribution. We see, for example, that approximately 80% of the top five <it>pmra </it>results came from the top ten results in the original ranked list. That is, 80% of the time the <it>pmra </it>algorithm was merely reshuffling the top ten <it>bm25 </it>results&#8211;this is not unexpected, since <it>bm25 </it>already performs well and in many cases there is not much to be done in terms of improving the results. The cumulative distribution tops 95% at rank 31 and 99% at rank 67&#8211;which means that <it>pmra </it>is promoting hits below these ranks to the top five positions only five and one percent of the time, respectively. Thus, it is unlikely that our reranking setup resulted in different conclusions than if the retrieval had been performed on the entire corpus. This experiment supports the validity of our experimental design.</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Distribution of original ranks for reranked run: <it>pmra </it>(<it>&#955; </it>= 0.022, <it>&#956; </it>= 0.013)</p>
               </caption>
               <text>
                  <p>Distribution of original ranks for reranked run: <it>pmra </it>(<it>&#955; </it>= 0.022, <it>&#956; </it>= 0.013). The bar graph divides the original rank positions into ten bins and tallies the fraction of hits that were brought into the top five by <it>pmra</it>; for example, approximately 80% of the top five <it>pmra </it>results came from the top ten results in the original ranked list. The line graph shows the cumulative distribution.</p>
               </text>
               <graphic file="1471-2105-8-423-6"/>
            </fig>
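            <p>The tally behind Figure 6 can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the script used for the experiment: each run is represented as a pair of ranked lists of document identifiers (the original <it>bm25 </it>list and the reranked list), and the function name and the identifiers in the demonstration are hypothetical.</p>

```python
def original_rank_histogram(runs, top_k=5, bin_size=10, num_bins=10):
    """Fraction of reranked top-k hits whose original (1-based) rank falls
    into each bin, together with the cumulative distribution.

    `runs` is an iterable of (original_list, reranked_list) pairs, one per
    test abstract; every reranked document must occur in the original list.
    """
    counts = [0] * num_bins
    for original, reranked in runs:
        position = {doc: rank for rank, doc in enumerate(original, start=1)}
        for doc in reranked[:top_k]:
            bin_index = min((position[doc] - 1) // bin_size, num_bins - 1)
            counts[bin_index] += 1
    total = sum(counts)
    fractions = [count / total for count in counts]
    cumulative = []
    running = 0.0
    for fraction in fractions:
        running += fraction
        cumulative.append(running)
    return fractions, cumulative


if __name__ == "__main__":
    # Hypothetical identifiers: one run in which four of the top five
    # reranked hits come from original ranks 1-10 and one from rank 15.
    original = ["doc%d" % i for i in range(1, 101)]
    reranked = ["doc3", "doc1", "doc15", "doc2", "doc7"]
    fractions, cumulative = original_rank_histogram([(original, reranked)])
    print(fractions)
    print(cumulative)
```

            <p>The first list corresponds to the bar graph and the second to the line graph; with the default arguments the bins cover original ranks 1&#8211;10, 11&#8211;20, and so on up to 91&#8211;100.</p>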
         </sec>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>WJW developed the original <it>pmra </it>model. JL worked on subsequent refinements, including the parameter estimation method. JL ran the experiments. Both authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>For this work, JL was funded in part by the National Library of Medicine, where he was a visiting research scientist during the summer of 2006. WJW is supported by the Intramural Research Program of the NIH, National Library of Medicine. JL would also like to thank Esther and Kiri for their kind support.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Modeling Text Retrieval in Biomedicine</p>
            </title>
            <aug>
               <au>
                  <snm>Wilbur</snm>
                  <fnm>WJ</fnm>
               </au>
            </aug>
            <source>Medical Informatics: Knowledge Management and Data Mining in Biomedicine</source>
            <publisher>New York: Springer</publisher>
            <editor>Chen H, Fuller SS, Friedman C, Hersh W</editor>
            <pubdate>2005</pubdate>
            <fpage>277</fpage>
            <lpage>297</lpage>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Exploring the Effectiveness of Related Article Search in PubMed</p>
            </title>
            <aug>
               <au>
                  <snm>Lin</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>DiCuccio</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Grigoryan</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Wilbur</snm>
                  <fnm>WJ</fnm>
               </au>
            </aug>
            <source>Tech. Rep. LAMP-TR-145/CS-TR-4877/UMIACS-TR-2007-36/HCIL-2007-10</source>
            <publisher>University of Maryland, College Park, Maryland</publisher>
            <pubdate>2007</pubdate>
         </bibl>
         <bibl id="B3">
            <title>
               <p>The TREC Test Collections</p>
            </title>
            <aug>
               <au>
                  <snm>Harman</snm>
                  <fnm>DK</fnm>
               </au>
            </aug>
            <source>TREC: Experiment and Evaluation in Information Retrieval</source>
            <publisher>Cambridge, Massachusetts: MIT Press</publisher>
            <editor>Voorhees EM, Harman DK</editor>
            <pubdate>2005</pubdate>
            <fpage>21</fpage>
            <lpage>52</lpage>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Factors Determining the Performance of Indexing Systems</p>
            </title>
            <aug>
               <au>
                  <snm>Cleverdon</snm>
                  <fnm>CW</fnm>
               </au>
               <au>
                  <snm>Mills</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Keen</snm>
                  <fnm>EM</fnm>
               </au>
            </aug>
            <source>Two volumes, ASLIB Cranfield Research Project, Cranfield, England</source>
            <pubdate>1968</pubdate>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Okapi at TREC-3</p>
            </title>
            <aug>
               <au>
                  <snm>Robertson</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Walker</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Jones</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Hancock-Beaulieu</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Gatford</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Proceedings of the 3rd Text REtrieval Conference (TREC-3)</source>
            <pubdate>1994</pubdate>
         </bibl>
         <bibl id="B6">
            <title>
               <p>A Probabilistic Model of Information Retrieval: Development and Comparative Experiments (Parts 1 and 2)</p>
            </title>
            <aug>
               <au>
                  <snm>Sparck Jones</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Walker</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Robertson</snm>
                  <fnm>SE</fnm>
               </au>
            </aug>
            <source>Information Processing and Management</source>
            <pubdate>2000</pubdate>
            <volume>36</volume>
            <issue>6</issue>
            <fpage>779</fpage>
            <lpage>840</lpage>
         </bibl>
         <bibl id="B7">
            <title>
               <p>The Probability Ranking Principle in IR</p>
            </title>
            <aug>
               <au>
                  <snm>Robertson</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Journal of Documentation</source>
            <pubdate>1977</pubdate>
            <volume>33</volume>
            <issue>4</issue>
            <fpage>294</fpage>
            <lpage>304</lpage>
         </bibl>
         <bibl id="B8">
            <title>
               <p>A Probabilistic Approach to Automatic Keyword Indexing. Part I: On the Distribution of Specialty Words in a Technical Literature</p>
            </title>
            <aug>
               <au>
                  <snm>Harter</snm>
                  <fnm>SP</fnm>
               </au>
            </aug>
            <source>Journal of the American Society for Information Science</source>
            <pubdate>1975</pubdate>
            <volume>26</volume>
            <issue>4</issue>
            <fpage>197</fpage>
            <lpage>206</lpage>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Modelling Documents with Multiple Poisson Distributions</p>
            </title>
            <aug>
               <au>
                  <snm>Margulis</snm>
                  <fnm>EL</fnm>
               </au>
            </aug>
            <source>Information Processing and Management</source>
            <pubdate>1993</pubdate>
            <volume>29</volume>
            <issue>2</issue>
            <fpage>215</fpage>
            <lpage>227</lpage>
         </bibl>
         <bibl id="B10">
            <title>
               <p>A Vector Space Model for Information Retrieval</p>
            </title>
            <aug>
               <au>
                  <snm>Salton</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Communications of the ACM</source>
            <pubdate>1975</pubdate>
            <volume>18</volume>
            <issue>11</issue>
            <fpage>613</fpage>
            <lpage>620</lpage>
         </bibl>
         <bibl id="B11">
            <title>
               <p>A Markov Random Field Model for Term Dependencies</p>
            </title>
            <aug>
               <au>
                  <snm>Metzler</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Croft</snm>
                  <fnm>WB</fnm>
               </au>
            </aug>
            <source>Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2005)</source>
            <publisher>Salvador, Brazil</publisher>
            <pubdate>2005</pubdate>
            <fpage>472</fpage>
            <lpage>479</lpage>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Pivoted Document Length Normalization</p>
            </title>
            <aug>
               <au>
                  <snm>Singhal</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Buckley</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Mitra</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1996)</source>
            <publisher>Z&#252;rich, Switzerland</publisher>
            <pubdate>1996</pubdate>
            <fpage>21</fpage>
            <lpage>29</lpage>
         </bibl>
         <bibl id="B13">
            <title>
               <p>A Maximum Entropy Approach to Natural Language Processing</p>
            </title>
            <aug>
               <au>
                  <snm>Berger</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Pietra</snm>
                  <fnm>SD</fnm>
               </au>
               <au>
                  <snm>Pietra</snm>
                  <fnm>VD</fnm>
               </au>
            </aug>
            <source>Computational Linguistics</source>
            <pubdate>1996</pubdate>
            <volume>22</volume>
            <fpage>39</fpage>
            <lpage>71</lpage>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Statistical Language Models for Information Retrieval</p>
            </title>
            <aug>
               <au>
                  <snm>Zhai</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Tutorial Presentation at the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006)</source>
            <pubdate>2006</pubdate>
         </bibl>
         <bibl id="B15">
            <title>
               <p>TREC 2005 Genomics Track Overview</p>
            </title>
            <aug>
               <au>
                  <snm>Hersh</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Cohen</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Yang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Bhupatiraju</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Roberts</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Hearst</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Proceedings of the Fourteenth Text REtrieval Conference (TREC 2005)</source>
            <pubdate>2005</pubdate>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Using the Lemur Toolkit for IR</p>
            </title>
            <aug>
               <au>
                  <snm>Strohman</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Ogilvie</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Tutorial Presentation at the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006)</source>
            <publisher>Seattle, Washington</publisher>
            <pubdate>2006</pubdate>
         </bibl>
         <bibl id="B17">
            <title>
               <p>The Effectiveness of Document Neighboring in Search Enhancement</p>
            </title>
            <aug>
               <au>
                  <snm>Wilbur</snm>
                  <fnm>WJ</fnm>
               </au>
               <au>
                  <snm>Coffee</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Information Processing and Management</source>
            <pubdate>1994</pubdate>
            <volume>30</volume>
            <issue>2</issue>
            <fpage>253</fpage>
            <lpage>266</lpage>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Find-Similar: Similarity Browsing as a Search Tool</p>
            </title>
            <aug>
               <au>
                  <snm>Smucker</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Allan</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006)</source>
            <pubdate>2006</pubdate>
            <fpage>461</fpage>
            <lpage>468</lpage>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval</p>
            </title>
            <aug>
               <au>
                  <snm>Robertson</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Walker</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1994)</source>
            <publisher>Dublin, Ireland</publisher>
            <pubdate>1994</pubdate>
            <fpage>232</fpage>
            <lpage>241</lpage>
         </bibl>
         <bibl id="B20">
            <aug>
               <au>
                  <snm>Voorhees</snm>
                  <fnm>EM</fnm>
               </au>
               <au>
                  <snm>Harman</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>TREC: Experiments and Evaluation in Information Retrieval</source>
            <publisher>Cambridge, Massachusetts: MIT Press</publisher>
            <pubdate>2005</pubdate>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Overview of the Sixth Text REtrieval Conference (TREC-6)</p>
            </title>
            <aug>
               <au>
                  <snm>Voorhees</snm>
                  <fnm>EM</fnm>
               </au>
               <au>
                  <snm>Harman</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proceedings of the Sixth Text REtrieval Conference (TREC-6)</source>
            <pubdate>1997</pubdate>
         </bibl>
      </refgrp>
   </bm>
</art>
