<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-7-58</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>Identifying biological concepts from a protein-related corpus with a probabilistic topic model</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Zheng</snm>
               <fnm>Bin</fnm>
               <insr iid="I1"/>
               <email>zheng@musc.edu</email>
            </au>
            <au id="A2">
               <snm>McLean</snm>
               <mi>C</mi>
               <fnm>David</fnm>
               <suf>Jr</suf>
               <insr iid="I1"/>
               <email>mcleandc@musc.edu</email>
            </au>
            <au id="A3" ca="yes">
               <snm>Lu</snm>
               <fnm>Xinghua</fnm>
               <insr iid="I1"/>
               <email>lux@musc.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Biostatistics, Bioinformatics and Epidemiology, Medical University of South Carolina, Charleston, SC 29405, USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2006</pubdate>
         <volume>7</volume>
         <issue>1</issue>
         <fpage>58</fpage>
         <url>http://www.biomedcentral.com/1471-2105/7/58</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">16466569</pubid>
               <pubid idtype="doi">10.1186/1471-2105-7-58</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>14</day>
               <month>9</month>
               <year>2005</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>08</day>
               <month>2</month>
               <year>2006</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>08</day>
               <month>2</month>
               <year>2006</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2006</year>
         <collab>Zheng et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Biomedical literature, e.g., MEDLINE, contains a wealth of knowledge regarding functions of proteins. Major recurring biological concepts within such text corpora represent the domains of this body of knowledge. The goal of this research is to identify the major biological topics/concepts from a corpus of protein-related MEDLINE<sup>&#169; </sup>titles and abstracts by applying a probabilistic topic model.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>The latent Dirichlet allocation (LDA) model was applied to the corpus. Based on Bayesian model selection, 300 major topics were extracted from the corpus. The majority of the identified topics/concepts were found to be semantically coherent, and most represented biological objects or concepts. The identified topics/concepts were further mapped to the controlled vocabulary of the Gene Ontology (GO) terms based on mutual information.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>The major and recurring biological concepts within a collection of MEDLINE documents can be extracted by the LDA model. The identified topics/concepts provide parsimonious and semantically-enriched representation of the texts in a semantic space with reduced dimensionality and can be used to index text.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>An important task of bioinformatics research is to acquire and represent biomedical knowledge in computable form so that it can be efficiently stored, retrieved, and used for discovery of new knowledge. For example, the Gene Ontology (GO) Consortium <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> and the Gene Ontology Annotation (GOA) project <abbrgrp><abbr bid="B2">2</abbr></abbrgrp> are dedicated to the task of representing biological knowledge with the controlled vocabulary of GO terms. Knowledge of protein functions serves as a cornerstone of modern biomedical knowledge. Much of this knowledge is contained in the form of free text in the biomedical literature. A more compressed and accessible representation of the same knowledge is contained in bibliographic databases, e.g., MEDLINE. In addition to current manual annotation efforts, there is a need for automatic knowledge acquisition and representation, and a critical step in this process is to extract biological concepts from free text.</p>
         <p>The task of automatic knowledge acquisition from free text is usually addressed within the frameworks of natural language processing (NLP), information extraction (IE), and information retrieval (IR) techniques <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr></abbrgrp>, which have been widely applied in bioinformatics settings, as reviewed in <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>. A recent trend in text mining is to acquire deeper semantic information from text, e.g., semantic information has been used to cluster genes <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> and evaluate the functional coherence of a group of genes <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>. Extracting semantic information from free text requires the capability of effectively dealing with the uncertainties commonly associated with human language. To this end, probabilistic semantic analyses are promising approaches for handling such uncertainties and performing semantically enriched text mining.</p>
         <p>In this paper, we report extraction of semantic topics/concepts from a corpus of MEDLINE titles and abstracts using a probabilistic topic model, the LDA model <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>. The goal was to identify the major and recurring concepts that represent the major knowledge domains of protein functions. Furthermore, extraction of the semantic contents of a document provides a parsimonious and concise representation of that text. Such information can be used for efficient indexing, information retrieval, and protein annotation.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Representing semantic topics with a probabilistic topic model</p>
            </st>
            <p>In a scientific article, a scientist will refer to multiple real-world objects and/or concepts; thus a paper usually consists of multiple topics/subjects, e.g., a paper may discuss a protein located in <it>mitochondria </it>and involved in the cellular process of <it>apoptosis</it>. When discussing objects or concepts, the author will choose certain words to convey the semantic meaning. For instance, when discussing the topic <it>mitochondria</it>, words like 'electron,' 'cytochrome,' and 'ATP' are commonly used, while words like 'apoptosis,' 'programmed,' 'death,' and 'caspase' are commonly used to discuss the concept of <it>apoptosis</it>. Thus a document can be treated as a mixture of words from multiple topics. The LDA model represents this notion by explicitly encoding the multi-topicality of a document with a topic-composition variable and then simulating the "generation" of words by mixing words from the topics accordingly. Each topic is represented as a multinomial distribution over a vocabulary, i.e., a word-usage pattern. Figure <figr fid="F1">1</figr> shows how a topic can be represented as a word-usage pattern in a probabilistic topic model. Given a corpus of text documents, the LDA model is capable of extracting the topics by statistical inference as described in the Methods section.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Representing concepts with word distributions</p>
               </caption>
               <text>
                  <p><b>Representing concepts with word distributions</b>. Two hypothetical topics are depicted. The bar lengths indicate the word-usage preferences in the form of probabilities.</p>
               </text>
               <graphic file="1471-2105-7-58-1"/>
            </fig>
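            <p>The generative view described above can be illustrated with a minimal sketch. It assumes two toy topics whose word probabilities are made up for illustration, a symmetric Dirichlet hyperparameter of 0.5, and hypothetical helper names (<it>sample_dirichlet</it>, <it>generate_document</it>); it is not the implementation used in this study.</p>

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    """Generate (word, topic) pairs for one document.

    topics: dict mapping a topic name to a dict of word probabilities
    alpha:  Dirichlet hyperparameters, one per topic (illustrative values)
    """
    names = list(topics)
    theta = sample_dirichlet(alpha, rng)  # topic composition of this document
    doc = []
    for _ in range(n_words):
        t = rng.choices(names, weights=theta)[0]   # choose a topic for this word
        words = list(topics[t])
        probs = [topics[t][w] for w in words]
        w = rng.choices(words, weights=probs)[0]   # choose a word from that topic
        doc.append((w, t))
    return theta, doc


# Two toy topics; the probabilities are assumptions, not estimates from the corpus.
TOPICS = {
    "mitochondria": {"electron": 0.40, "cytochrome": 0.35, "atp": 0.25},
    "apoptosis": {"apoptosis": 0.40, "death": 0.30, "caspase": 0.30},
}

rng = random.Random(0)
theta, doc = generate_document(TOPICS, alpha=[0.5, 0.5], n_words=12, rng=rng)
print([round(x, 3) for x in theta])
print(doc)
```

            <p>Training an LDA model amounts to inverting this process: given only the observed words, the topic composition of each document and the word distribution of each topic are inferred.</p>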
         </sec>
         <sec>
            <st>
               <p>Training of LDA model</p>
            </st>
            <p>The LDA model was applied to extract the semantic topics from a corpus of MEDLINE titles and abstracts downloaded from the GOA project website as described in the Methods section. The training of an LDA model requires specification of the number of topics for the model, an issue of interest from both the semantic analysis and the statistical learning viewpoints. From a semantic analysis point of view, this is equivalent to determining the granularity of abstraction of the concepts that can be used to summarize the semantic contents of the corpus. From the statistical learning point of view, this is equivalent to selecting among models of different complexity. A Bayesian model selection framework was employed to determine the "optimal" number of topics based on the posterior probability of a model, <it>p</it>(<it>M </it>| <b>w</b>). To perform the Bayesian model selection, samples of the latent semantic topics, <b>z</b>, were collected for a model with a given number of topics, <it>T</it>, and the approximate posterior probabilities were calculated according to equation (7) and plotted (Figure <figr fid="F2">2</figr>). The model with 300 topics had the highest approximated marginal likelihood and was thus used for the analyses reported in this paper.</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Bayesian model selection</p>
               </caption>
               <text>
                  <p><b>Bayesian model selection</b>. The means of approximated evidence for different models are plotted; standard error bars are within the symbols.</p>
               </text>
               <graphic file="1471-2105-7-58-2"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Evaluating semantic topics</p>
            </st>
            <p>A trained LDA model returns estimated distributions of the following parameters and latent variables: (1) the word-usage distribution, <b>&#966;</b><sub><it>t</it></sub>, for each topic; (2) the latent topic labeling <it>z</it><sub><it>i </it></sub>for each word <it>w</it><sub><it>i</it></sub>; and (3) the topic-composition distribution <b>&#952;</b><sub><it>d </it></sub>for each document. The parameter vector <b>&#966;</b><sub><it>t </it></sub>is a distribution representing a word-usage pattern for the topic <it>t</it>. High probability words of each <b>&#966;</b><sub><it>t </it></sub>can be thought of as the words frequently used to discuss the topic. In Table <tblr tid="T1">1</tblr>, the 10 most commonly observed topics of the trained LDA model and their high probability words are listed. The topics are sorted in descending order according to the number of words assigned to them in the corpus. High probability words of these topics constitute clusters of words that coherently convey biological concepts. For example, topic # 51 reflects the concept of <it>ligand-activated receptors</it>, and topic # 156 is related to <it>serine/threonine kinase activity</it>. Because the LDA model attempts to capture the major topics that can be used to "generate" the data, the concepts extracted by this model should reflect the recurring themes of the corpus. Indeed, when multiple models with 300 topics were trained with different random-number seeds, similar major topics were extracted although the indices of the topics differed among the models. Thus, the topics listed in Table <tblr tid="T1">1</tblr> do reflect common biological themes in our corpus.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>The ten most common topics from a trained LDA model</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="center">
                        <p>
                           <b>Topic #</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p><b>Topic words</b></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>51</p>
                     </c>
                     <c ca="left">
                        <p>receptor coupl ligand agonist subtype pharmacolog antagonist orphan adrenerg desensit</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>156</p>
                     </c>
                     <c ca="left">
                        <p>kinas phosphoryl serin threonin pkc autophosphoryl casein akt catalyt ste20</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>136</p>
                     </c>
                     <c ca="left">
                        <p>cerevisia saccharomyc strain yeast plasmid multicopi lacz floccul auxotroph gal1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>67</p>
                     </c>
                     <c ca="left">
                        <p>Famili member belong multigen subfamily mrg Dalton cabp28k heterogen transmembran</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>154</p>
                     </c>
                     <c ca="left">
                        <p>patient syndrom diseas disord autosom inherit recess ref caus clinic</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>124</p>
                     </c>
                     <c ca="left">
                        <p>cdna librari clone probe screen isol lambda obtain oligonucleotid gtl1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>37</p>
                     </c>
                     <c ca="left">
                        <p>neuron axon migrat motor glial spinal cord neurit dendrite outgrowth</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>229</p>
                     </c>
                     <c ca="left">
                        <p>mutant defect doubl phenotyp fail rescu restor impair pleiotrop unable</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>112</p>
                     </c>
                     <c ca="left">
                        <p>exon intron genom kb flank region span upstream bp start</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>172</p>
                     </c>
                     <c ca="left">
                        <p>nuclear nucleu export cytoplasm nuclei pore ran hnrnp envelop import</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Inferring the semantic content of a text</p>
            </st>
            <p>The instantiated latent variables <b>z</b><sub><it>d </it></sub>indicate the semantic contents of the document. For the text in the training data set, the topic contents for each document were returned as the estimated latent variables <b>z</b><sub><it>d </it></sub>of the trained model. For a newly observed text, the topic contents can be inferred by invoking the sampling algorithm with the estimated parameters as described in the Methods section. Figure <figr fid="F3">3</figr> shows an example of a MEDLINE abstract, in which the topic assignments for the words were inferred using a trained LDA model. This abstract discusses a protein referred to as apoptosis inducing factor (AIF), a mitochondrial protein that induces apoptosis. In this figure, the inferred semantic topic for each word (excluding "stop" words) is shown as the superscript number next to it. The abstract is associated with the following GO terms: (1) GO:0008630, DNA damage response, signal transduction resulting in induction of apoptosis; (2) GO:0009055, electron carrier activity; (3) GO:0005739, mitochondrion; and (4) GO:0006309, DNA fragmentation during apoptosis. In Figure <figr fid="F3">3</figr>, two topics, # 73 and # 147, are the dominant topics of the abstract. Topic # 73 is related to the <it>mitochondrion </it>and topic # 147 reflects the concept of <it>apoptosis</it>. Interestingly, several words that can belong to multiple topics depending on context were found in the abstract, e.g., "space" and "outer." The LDA model has captured their common occurrence in the context of <it>mitochondrion </it>and correctly assigned these common words to this topic based on the context. With the inferred topics, this abstract can be readily indexed with these two major topics, which agree well with the human GO annotations of this abstract. 
Furthermore, a document can also be indexed as a vector containing the counts of the words in each topic or with the normalized estimated <m:math name="1471-2105-7-58-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mover accent="true"><m:mi>&#952;</m:mi><m:mo>^</m:mo></m:mover><m:mi>d</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWF4oqCgaqcamaaBaaaleaacqWGKbazaeqaaaaa@2FF6@</m:annotation></m:semantics></m:math>, which can be treated as a vector in the space spanned by the topics. Such a representation effectively projects the document from the high-dimensional vocabulary space onto the reduced-dimensionality topic space. Such information could be used to automatically index the text.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Semantic analysis for a MEDLINE abstract (PMID 9989411)</p>
               </caption>
               <text>
                  <p><b>Semantic analysis for a MEDLINE abstract (PMID 9989411)</b>. The topics associated with the words were inferred by the LDA model and are shown as the superscript number next to the words. The words from the topics # 73 and # 147 are highlighted with blue and red colors, respectively.</p>
               </text>
               <graphic file="1471-2105-7-58-3"/>
            </fig>
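            <p>Indexing a document by its topic contents can be sketched as follows. This is a simplified illustration, assuming that the per-word topic assignments are already available; the function name <it>topic_vector</it>, the toy word stems, and the topic indices are hypothetical, and the proportions are plain normalized counts without the prior smoothing that an estimated topic-composition distribution would normally include.</p>

```python
from collections import Counter


def topic_vector(assignments, n_topics):
    """Return (counts, proportions) over topics for one document.

    assignments: list of (word, topic_index) pairs, i.e. the inferred
                 topic label for each non-stop word of the document
    """
    counts = Counter(t for _, t in assignments)
    vec = [counts.get(t, 0) for t in range(n_topics)]
    total = sum(vec)
    # Unsmoothed normalization; a zero vector is returned for an empty document.
    props = [c / total for c in vec] if total else [0.0] * n_topics
    return vec, props


# Hypothetical assignments; indices 0 and 1 stand in for real topic numbers.
doc = [("mitochondri", 0), ("space", 0), ("outer", 0),
       ("apoptosi", 1), ("caspas", 1)]
counts, theta_hat = topic_vector(doc, n_topics=3)
print(counts)     # [3, 2, 0]
print(theta_hat)  # [0.6, 0.4, 0.0]
```

            <p>With 300 topics, each document is reduced to a 300-dimensional vector regardless of the vocabulary size, which is what makes such a representation attractive for indexing.</p>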
         </sec>
         <sec>
            <st>
               <p>Assessing biological relevance of topics</p>
            </st>
            <p>The LDA model simulates the "generation" of a corpus. By its generative nature, it will incorporate the topics needed to capture the common characteristics of the corpus. However, some common features may not necessarily be relevant to biology but merely reflect linguistic features of the corpus. To determine the biological relevance of the topics, we further inspected the high probability words and assigned to each topic a biological relevance score ranging from 0 (no biological relevance) to 5 (strong biological relevance). A histogram of the assigned biological relevance scores (Panel A of Figure <figr fid="F4">4</figr>) indicates that most topics/concepts extracted from this corpus were biologically relevant, with only a fraction receiving a score of zero.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Determining the biological relevance of the topics</p>
               </caption>
               <text>
                  <p><b>Determining the biological relevance of the topics</b>. <b><it>Panel A</it></b>. Histogram of human-assigned biological relevance scores. A score of 0 indicates no biological relevance, while scores of 1 through 5 indicate increasing biological relevance and coherence. <b><it>Panel B</it></b>. Relationship between the human-assigned biological relevance score and the topic-GO MI.</p>
               </text>
               <graphic file="1471-2105-7-58-4"/>
            </fig>
            <p>Each MEDLINE abstract from the GOA corpus was associated with one or more GO terms, providing an opportunity to study the relationship between the semantic topics extracted by the LDA model and the GO annotations. The correlation between a semantic topic and a GO annotation can be quantified by the mutual information (MI) between the latent topic and the annotated GO terms. MI is a symmetric, non-negative quantity that measures the relevance (amount of information) of one variable with respect to another, and it equals zero if and only if the variables are independent. Since GO terms are designed to represent biological objects/concepts, the topics highly relevant to biological objects/concepts should have high MI with some GO terms, while the topics irrelevant to biology should have low MI values for topic-GO association. Indeed, as shown in Figure <figr fid="F4">4</figr>, the topics rated as having low relevance have very low MI with any GO term, while topics with high relevance have the highest topic-GO MI (Panel B). However, there were some topics that were assigned high relevance scores but had low MI with GO terms. This disparity was likely due to the way the MI for a topic-GO association was calculated in this study: if a document was annotated with a GO term <it>g</it>, every word in the document was considered as annotated with that GO term. This method was adopted due to the lack of supervised training data specifying which words in a document were responsible for the GO annotations. MI calculated under this assumption is skewed for the relatively uncommon topics in the corpus. Nonetheless, the MI of a topic-GO association serves as a criterion for evaluating the biological relevance of a topic. When a topic had a high MI value for a topic-GO association, it usually reflected a coherent biological concept. Interestingly, low biological relevance did not mean that a topic was not a coherent semantic concept. 
For example, topics # 224 and # 227 (Table <tblr tid="T2">2</tblr>) consisted of common English words and therefore had the lowest MI with any GO term. However, these topics did contain words that constitute coherent semantic concepts, e.g., topic # 224 contains words related to the concept of <it>being unique.</it></p>
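            <p>The word-level assumption described above can be made concrete with a small sketch. It assumes one plausible formulation, in which MI is computed between two binary indicators over all word tokens (whether a word is assigned to the topic, and whether its document carries the GO term); the exact estimator used in this study may differ. The function name <it>topic_go_mi</it> and the toy corpus are hypothetical.</p>

```python
import math
from collections import Counter


def topic_go_mi(docs, topic, go_term):
    """MI (in nats) between two binary word-level indicators.

    docs: list of (assignments, go_terms) pairs, where assignments is a list
          of topic indices (one per word) and go_terms is the set of GO IDs
          annotated to the document.
    """
    joint = Counter()
    for assignments, go_terms in docs:
        y = go_term in go_terms  # every word inherits the document annotation
        for z in assignments:
            joint[(z == topic, y)] += 1
    n = sum(joint.values())
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        p_x = sum(v for (a, _), v in joint.items() if a == x) / n
        p_y = sum(v for (_, b), v in joint.items() if b == y) / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi


# Toy corpus: topic 7 is concentrated in documents annotated with GO:0005739.
related = [
    ([7, 7, 7, 2], {"GO:0005739"}),
    ([7, 7, 2, 2], {"GO:0005739"}),
    ([2, 2, 2, 2], {"GO:0006468"}),
    ([2, 2, 7, 2], {"GO:0006468"}),
]
# Toy corpus: topic 7 is equally frequent with and without the annotation.
independent = [([7, 2], {"GO:0005739"}), ([7, 2], {"GO:0006468"})]

mi_related = topic_go_mi(related, 7, "GO:0005739")
mi_independent = topic_go_mi(independent, 7, "GO:0005739")
print(mi_related, mi_independent)
```

            <p>Because every word of an annotated document counts toward the GO term, a topic that accounts for only a few words in the corpus contributes little probability mass to the joint distribution, which is consistent with the skew noted above for relatively uncommon topics.</p>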
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Examples of topic-GO associations</p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c ca="center">
                        <p>
                           <b>Topic #</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>GO ID</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>MI</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>GO Category</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>GO Term</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Most Frequent Topic Words</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>278</p>
                     </c>
                     <c ca="center">
                        <p>GO:0005730</p>
                     </c>
                     <c ca="center">
                        <p>0.001439</p>
                     </c>
                     <c ca="center">
                        <p>Component</p>
                     </c>
                     <c ca="left">
                        <p>nucleolus</p>
                     </c>
                     <c ca="left">
                        <p>ribosom rrna pre deplet process small nucleolar biogenesi accumul nucleolu</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>267</p>
                     </c>
                     <c ca="center">
                        <p>GO:0005681</p>
                     </c>
                     <c ca="center">
                        <p>0.001193</p>
                     </c>
                     <c ca="center">
                        <p>Component</p>
                     </c>
                     <c ca="left">
                        <p>spliceosome complex</p>
                     </c>
                     <c ca="left">
                        <p>splice altern pre snrnp mrna spliceosom u2 step sap snrna</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>105</p>
                     </c>
                     <c ca="center">
                        <p>GO:0005816</p>
                     </c>
                     <c ca="center">
                        <p>0.00119</p>
                     </c>
                     <c ca="center">
                        <p>Component</p>
                     </c>
                     <c ca="left">
                        <p>spindle pole body</p>
                     </c>
                     <c ca="left">
                        <p>microtubul spindl mitot tubulin kinetochor mitosi centrosom pole centromer bodi</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>236</p>
                     </c>
                     <c ca="center">
                        <p>GO:0006935</p>
                     </c>
                     <c ca="center">
                        <p>0.00186</p>
                     </c>
                     <c ca="center">
                        <p>Process</p>
                     </c>
                     <c ca="left">
                        <p>chemotaxis</p>
                     </c>
                     <c ca="left">
                        <p>lymphocyt macrophag chemokin monocyt neutrophil inflammatori leukocyt peripher mcp cd8</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>156</p>
                     </c>
                     <c ca="center">
                        <p>GO:0006468</p>
                     </c>
                     <c ca="center">
                        <p>0.001514</p>
                     </c>
                     <c ca="center">
                        <p>Process</p>
                     </c>
                     <c ca="left">
                        <p>protein amino acid phosphorylation</p>
                     </c>
                     <c ca="left">
                        <p>kinas phosphoryl serin threonin pkc autophosphoryl casein akt catalyt ste20</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>267</p>
                     </c>
                     <c ca="center">
                        <p>GO:0000398</p>
                     </c>
                     <c ca="center">
                        <p>0.001404</p>
                     </c>
                     <c ca="center">
                        <p>Process</p>
                     </c>
                     <c ca="left">
                        <p>nuclear mRNA splicing</p>
                     </c>
                     <c ca="left">
                        <p>splice altern pre snrnp mrna spliceosom u2 step sap snrna</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>156</p>
                     </c>
                     <c ca="center">
                        <p>GO:0004674</p>
                     </c>
                     <c ca="center">
                        <p>0.001148</p>
                     </c>
                     <c ca="center">
                        <p>Function</p>
                     </c>
                     <c ca="left">
                        <p>protein serine/threonine kinase activity</p>
                     </c>
                     <c ca="left">
                        <p>kinas phosphoryl serin threonin pkc autophosphoryl casein akt catalyt ste20</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>267</p>
                     </c>
                     <c ca="center">
                        <p>GO:0008248</p>
                     </c>
                     <c ca="center">
                        <p>0.001463</p>
                     </c>
                     <c ca="center">
                        <p>Function</p>
                     </c>
                     <c ca="left">
                        <p>pre-mRNA splicing factor activity</p>
                     </c>
                     <c ca="left">
                        <p>splice altern pre snrnp mrna spliceosom u2 step sap snrna</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>236</p>
                     </c>
                     <c ca="center">
                        <p>GO:0008009</p>
                     </c>
                     <c ca="center">
                        <p>0.001093</p>
                     </c>
                     <c ca="center">
                        <p>Function</p>
                     </c>
                     <c ca="left">
                        <p>chemokine activity</p>
                     </c>
                     <c ca="left">
                        <p>lymphocyt macrophag chemokin monocyt neutrophil inflammatori leukocyt peripher mcp cd8</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>224</p>
                     </c>
                     <c ca="center">
                        <p>GO:0015671</p>
                     </c>
                     <c ca="center">
                        <p>5.05E-06</p>
                     </c>
                     <c ca="center">
                        <p>Process</p>
                     </c>
                     <c ca="left">
                        <p>oxygen transport</p>
                     </c>
                     <c ca="left">
                        <p>ha uniqu characterist featur extens character typic possess unusu exhibit</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>227</p>
                     </c>
                     <c ca="center">
                        <p>GO:0015213</p>
                     </c>
                     <c ca="center">
                        <p>5.00E-06</p>
                     </c>
                     <c ca="center">
                        <p>Function</p>
                     </c>
                     <c ca="left">
                        <p>uridine transporter activity</p>
                     </c>
                     <c ca="left">
                        <p>function defin unknown perform wide thei tissu repres consist creat</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Associating topics with GO terms</p>
            </st>
            <p>Studying the correlation between the topics and the GO terms also allowed us to map topics to the controlled vocabulary of GO terms, laying a foundation for possible future automatic annotation/indexing of MEDLINE abstracts with GO terms. When annotating a gene product based on the biomedical literature, a human curator needs to extract and summarize the semantic concepts of the literature, find a GO term that is semantically close to the concepts, and assign that GO term to the gene product. To identify the potential matching GO terms for each topic, the MI values for all observed topic-GO associations were calculated. Then, for each topic <it>t</it>, the GO term with the highest MI value in each of the three GO categories was treated as the candidate GO term matching the topic. Table <tblr tid="T2">2</tblr> shows examples of associating the extracted semantic topics with the GO terms. The top 9 rows are topic-GO associations with high MI values, while the bottom 2 rows are examples of topic-GO associations with low MI. When the MI values for topic-GO associations were high, the definitions of the GO terms usually agreed well with the semantic concepts contained in the latent topics. Interestingly, the inference of the topics by the LDA model mimics the process by which a human curator identifies biological concepts from the texts, and determining the MI ("the strength") of a topic-GO association mimics the process of mapping the biological concepts to the GO terms. Thus, mapping latent topics to GO terms potentially provides a means to automatically annotate a protein with GO terms based on the semantic concepts contained in the associated literature.</p>
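            <p>The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: topics and GO terms are treated as binary indicator variables over documents, MI is estimated from the resulting 2x2 contingency table, and the names mutual_information, best_go_per_category, doc_topics, doc_gos and go_category are hypothetical. The MI estimator actually used in this study is the one defined where MI is introduced and may differ in detail.</p>

```python
import math


def mutual_information(n11, n10, n01, n00):
    """MI (in nats) of two binary variables from a 2x2 contingency table.

    n11: documents with both topic and GO term, n10: topic only,
    n01: GO term only, n00: neither.
    """
    total = n11 + n10 + n01 + n00
    # Each cell: (joint count, topic marginal, GO marginal).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for joint, row, col in cells:
        if joint == 0:
            continue  # the term 0 * log(0) is taken to be 0
        mi += (joint / total) * math.log(joint * total / (row * col))
    return mi


def best_go_per_category(doc_topics, doc_gos, go_category, topic):
    """Return {category: (go_id, mi)} with the highest-MI GO term per category.

    doc_topics: dict mapping document id to the set of topics found in it.
    doc_gos: dict mapping document id to the set of GO ids annotated to it.
    go_category: dict mapping GO id to its category name.
    """
    all_gos = sorted(set().union(*doc_gos.values()))
    best = {}
    for go in all_gos:
        n11 = n10 = n01 = n00 = 0
        for doc, topics in doc_topics.items():
            has_topic = topic in topics
            has_go = go in doc_gos.get(doc, ())
            if has_topic and has_go:
                n11 += 1
            elif has_topic:
                n10 += 1
            elif has_go:
                n01 += 1
            else:
                n00 += 1
        if n11 == 0:
            continue  # keep only observed topic-GO associations
        score = mutual_information(n11, n10, n01, n00)
        category = go_category[go]
        if category not in best or score > best[category][1]:
            best[category] = (go, score)
    return best
```

            <p>In this sketch, a GO term that never co-occurs with the topic is skipped, which mirrors the restriction to observed topic-GO associations; ties are resolved in favor of the first GO identifier in sorted order, which is an arbitrary choice.</p>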
         </sec>
         <sec>
            <st>
               <p>Clustering proteins according to their functional descriptions</p>
            </st>
            <p>In a topic that was strongly related to a specific biological object or process, i.e., when the MI of the topic-GO association was high, the names of the proteins involved in that process frequently appeared at the top of the word list for the topic. For example, topic # 156 in Table <tblr tid="T2">2</tblr> is related to the <it>threonine/serine phosphorylation </it>process, and the protein names 'pkc,' 'akt,' and 'ste20' were among the most frequent words of the topic, indicating that the LDA model was capable of clustering gene/protein names according to the concept of protein function. Interestingly, clustering of these protein names did not require them to co-occur within the same documents. The LDA model was capable of clustering the gene/protein names simply based on their associations with some common key words of the biological concepts. This finding could be used as a tool to cluster genes with similar functions from different organisms based on their associated literature. This finding also agrees with a previous study by Homayouni et al <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, in which proteins were represented as points in the vocabulary space based on their associated literature and were further projected onto a reduced-dimension semantic space constructed with the LSI technique. Proteins with similar functions formed clusters within the semantic space.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>Most biomedical knowledge is stored as free text in the biomedical literature, and the size of the biomedical literature is increasing rapidly. There is an urgent need for automatically acquiring and representing this body of knowledge in a computable form to facilitate the discovery of new knowledge, which requires the development of computational methods to extract knowledge from the text. The current state of the art of text mining approaches applied to the biomedical literature has been reported in several recent challenge evaluations, such as the KDD, the BioCreative, and the TREC <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B9">9</abbr><abbr bid="B16">16</abbr></abbrgrp>. However, most of these approaches are within the conventional NLP, IE, and IR framework, and the application of probabilistic or non-probabilistic semantic modeling of biomedical literature remains relatively sparse <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp>.</p>
         <p>In this paper, we report the extraction of a set of semantic topics from a corpus of protein-related MEDLINE titles and abstracts with the LDA model. The key advantages of applying an LDA model to perform statistical semantic analysis include, but are not necessarily limited to, the following: (1) the model is capable of extracting major recurring themes from a corpus of text in an unsupervised manner; (2) the assumption that a document is a mixture of topics naturally simulates real world text and allows modeling of text at finer granularity; and (3) it can effectively resolve many ambiguities commonly associated with natural language.</p>
         <sec>
            <st>
               <p>Recurring biological themes reflect knowledge domains</p>
            </st>
            <p>The LDA model identifies topics from a text corpus by capturing the covariance of the words and organizes the words that tend to co-occur into a structure that mimics a topic. The inference algorithm for the model is unsupervised, eliminating the need for expensive, manually annotated data. The generative nature of the LDA model ensures that the extracted topics/concepts reflect the recurring themes within the corpus. We used a well-annotated data set from the Uniprot database <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>; thus, the major topics identified from the corpus arguably reflect the major domains of our knowledge of proteins.</p>
            <p>We applied a Bayesian model selection approach to determine the "optimal" number of topics for the purpose of model fitting. Bayesian model selection favors the simplest model that explains the data well <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. With such a preference, many of the 300 topics in our results reflect the general themes of the corpus. However, the model is also capable of capturing strong co-occurrence patterns that correspond to highly specific biological objects/concepts, as demonstrated in Table <tblr tid="T2">2</tblr>. As more training data become available, especially as full electronic texts of the biomedical literature become available, Bayesian model selection can accommodate more complex models, thus simulating the data with finer granularity. One limitation of the LDA model is that it requires a specified number of topics in order to model the data. However, assuming that a corpus is generated from a fixed number of topics is a strong assumption that may not hold in the real world. To address this issue, recently developed nonparametric approaches, such as Dirichlet process based methods <abbrgrp><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr></abbrgrp>, may be more suitable because they model the data without a specified number of topics.</p>
            <p>In the LDA model, a topic is represented as a distribution reflecting the word-usage pattern. One key advantage of the LDA model is that the extracted topics correspond to real world objects or concepts that are readily understandable by people with domain knowledge. In comparison, another extensively studied semantic analysis approach, the latent semantic indexing (LSI) model <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B12">12</abbr><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr></abbrgrp>, cannot recover understandable semantic topics from text. The LSI model also captures the covariance of the words from a collection of text and identifies the major directions of the covariance space. It applies the singular value decomposition (SVD) approach to identify the orthogonal directions of the semantic space spanned by the word vectors of the documents and uses the major directions to represent the semantic space with a reduced rank. Thus, a document can be represented as a vector in a reduced-rank space spanned by a few major directions &#8211; a process of indexing the document with respect to semantic directions. However, because it restricts the semantic directions to be orthogonal to each other, the LSI model identifies directions that may not correspond to any human-understandable topics, thus remaining "latent."</p>
         </sec>
         <sec>
            <st>
               <p>Semantic analysis and automatic indexing</p>
            </st>
            <p>As shown in Figure <figr fid="F3">3</figr>, the LDA model can be used to extract the semantic contents of an abstract, indicating that the model should be useful for automatic document indexing and information retrieval. In comparison to conventional information retrieval by keyword indexing, semantic indexing by LSI has been demonstrated to be more accurate <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> because semantic indexing allows retrieval of documents whose semantic contents align well with the semantic meanings of the query terms, without requiring occurrence of the exact query terms in the documents. Although not yet tested on as large a scale as the LSI, the LDA model should have similar indexing power because the semantic concepts extracted by the LDA align well with human perception.</p>
            <p>We have shown that many of the topics extracted by the LDA model can be mapped to the controlled vocabulary of GO terms, potentially serving as a means of automatically annotating a protein-related corpus. Currently, most GO annotations are manually performed by PhD level biologists at different centers of the GO consortium. Although accurate and specific, manual annotation is labor-intensive and cannot be expected to keep up with the pace of growth in the biomedical literature. Automatic annotation of proteins based upon the biomedical literature is a growing and urgent task facing the bioinformatics community, and it motivated specific tasks in the recent competitive evaluations <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B9">9</abbr><abbr bid="B16">16</abbr></abbrgrp>. Our results indicate that it is possible to extract salient biological concepts from a large amount of biomedical literature and map the concepts to the controlled vocabulary. Although the mapping from the latent topics of the LDA model to the GO terms may not provide annotations as specific as manual annotations, automatic annotation based on the LDA should provide general and consistent descriptions of a protein.</p>
         </sec>
         <sec>
            <st>
               <p>Dealing with ambiguities of natural language</p>
            </st>
            <p>Human natural language is full of ambiguities that confound the results of contemporary NLP, IE, and IR techniques <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>. Most noticeably, the phenomena of polysemy and synonymy need to be effectively addressed during NLP, IE, and IR. The LDA effectively handles the uncertainties and ambiguities caused by polysemes and synonyms due to its probabilistic representation of the topics. The distributional representation of concepts allows synonyms to be grouped into a common topic, while a polyseme can participate in multiple concepts. Such a representation effectively captures the key relationship between words and semantic concepts: a concept is conveyed by the choice of words, and the sense of a word is dependent on context. The inference algorithm of the LDA model explicitly utilizes such relationships to infer the topic for a word, so that the semantic topics of synonyms and polysemes can be assigned based on the context of the text. This capability makes the LDA model a powerful tool to enhance the performance of other NLP, IE and IR techniques for text mining. The result shown in Figure <figr fid="F3">3</figr> serves as a good example of the capability of the LDA model to properly assign words to topics depending upon context. Note that the words "space" and "induce" are general words that fit into different semantic contexts, and the LDA algorithm correctly associated them with the concepts of <it>mitochondria </it>and <it>apoptosis</it>, respectively, based on the semantic context of the document.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>In summary, we extracted a set of major semantic concepts from a protein-related corpus of text words from MEDLINE titles and abstracts by applying the LDA model. The identified concepts are semantically coherent, and most of them are biologically relevant. The extracted biological topics reflect the major knowledge domains of current knowledge of protein function contained in the corpus. The semantic content of a document can be inferred from a text and used for automatically indexing the text. Future directions will be explored to extend the current approach or to develop new techniques for extracting biological concepts of finer granularity and combining semantic analyses with conventional NLP, IE, and IR techniques to map the topics to the controlled vocabulary.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Data set</p>
            </st>
            <p>The protein annotation data of the Uniprot database (Version 22, October 2004) were downloaded from the GOA project <abbrgrp><abbr bid="B2">2</abbr></abbrgrp> web site of the European Bioinformatics Institute. In this data set, each protein was annotated with one or more GO annotations. Many annotation entries contained references to PubMed identification (PMID) numbers; presumably, these annotations resulted from reading the literature indexed by the PMIDs. All the PMIDs and their associated GO terms were extracted from the Uniprot data set. The extracted data contained 6,565 unique GO accession numbers (GOID) and 25,005 unique PMIDs. The MEDLINE entries indexed by these PMIDs were downloaded from the National Center for Biotechnology Information (NCBI) using the Entrez E-utility service, and their titles and abstracts were extracted. These MEDLINE text data were preprocessed as follows: (1) common words from a standard English "stop words" list were removed; (2) words were stemmed using Porter's stemmer <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>; (3) words that appeared fewer than 5 times in the corpus were discarded. The processed data set is referred to as the GOA corpus and contained the preprocessed MEDLINE text words and associated GO annotations. After preprocessing, the vocabulary of the corpus consisted of 25,143 unique terms.</p>
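            <p>The three preprocessing steps can be illustrated with a minimal Python sketch. It is not the code used in this study: the tokenization rule, the short STOP_WORDS set, and the function name preprocess are assumptions made for illustration, and the stemmer is passed in as a function (the identity by default) so that an implementation of Porter's stemmer can be supplied by the reader.</p>

```python
import re
from collections import Counter

# A tiny illustrative stop list; a standard English list would be much longer.
STOP_WORDS = frozenset({"the", "of", "and", "a", "in", "to", "is", "by", "with"})


def preprocess(docs, stem=lambda word: word, stop_words=STOP_WORDS, min_count=5):
    """Return (processed_docs, vocabulary) for a list of raw text strings.

    Steps: (1) drop stop words, (2) stem each remaining word,
    (3) discard words occurring fewer than min_count times in the corpus.
    """
    tokenized = []
    for text in docs:
        # Assumed tokenization: lowercase runs of letters and digits.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        tokenized.append([stem(t) for t in tokens if t not in stop_words])

    # Corpus-wide frequency of each stemmed word.
    counts = Counter(t for doc in tokenized for t in doc)
    vocabulary = {t for t, c in counts.items() if c >= min_count}

    processed = [[t for t in doc if t in vocabulary] for doc in tokenized]
    return processed, sorted(vocabulary)
```

            <p>Because the frequency threshold is applied after stemming in this sketch, different surface forms that share a stem count towards the same total.</p>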
         </sec>
         <sec>
            <st>
               <p>LDA model</p>
            </st>
            <sec>
               <st>
                  <p>Model specification</p>
               </st>
               <p>The LDA model is a probabilistic topic model <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B26">26</abbr></abbrgrp>. It is a hierarchical generative model that simulates the process of writing a text. Let the corpus <it>C </it>= {<it>d</it><sub>1</sub>, <it>d</it><sub>2</sub>, ..., <it>d</it><sub><it>D</it></sub>} be a set of documents, where <it>D </it>denotes the number of documents in the corpus; let a document <it>d </it>= (<it>w</it><sub>1</sub>, <it>w</it><sub>2</sub>,..., <it>w</it><sub><it>Nd</it></sub>) consist of a sequence of words; and let <it>w </it>be a word that takes a value from the vocabulary {<it>v</it><sub>1</sub>, <it>v</it><sub>2</sub>, ..., <it>v</it><sub><it>V</it></sub>}. Let <it>T </it>be the number of topics of an LDA model and <it>V </it>be the size of the vocabulary of the corpus. The LDA model simulates the generation of a document with the following stochastic process:</p>
               <p>&#8226; For each document, sample a topic proportion vector <it>&#952; </it>= (<it>&#952;</it><sub>1</sub>,<it>&#952;</it><sub>2</sub>,...,<it>&#952;</it><sub><it>T</it></sub>)' from a Dirichlet distribution with parameter <it>&#945;</it>: <it>&#952; </it>~ <it>Dir</it>(<it>&#952; </it>| <it>&#945;</it>). This is equivalent to an author deciding what topics to include in the paper.</p>
               <p>&#8226; For each word in the document, sample a topic <it>z </it>according to a multinomial distribution governed by <it>&#952;</it>: <it>z </it>~ <it>Multi</it>(<it>z </it>| <it>&#952;</it>). This can be thought of as assigning a word to a topic.</p>
               <p>&#8226; Conditioning on <it>z</it>, sample a word <it>w </it>according to a multinomial distribution with parameter <it>&#966;</it><sub><it>z </it></sub>: <it>w </it>~ <it>Multi</it>(<it>w </it>| <it>&#966;</it><sub><it>z</it></sub>, <it>z</it>). This corresponds to picking words to represent the concept.</p>
               <p>&#8226; The parameter <it>&#966;</it><sub><it>t </it></sub>with <it>t </it>&#8712; {1,2,...,<it>T</it>}, is a <it>V</it>-dimension vector that defines the multinomial word distribution of a topic. It is distributed as Dirichlet with parameter <it>&#946;</it>: <it>&#966;</it><sub><it>t </it></sub>~ <it>Dir</it>(<it>&#966;</it><sub><it>t </it></sub>| <it>&#946;</it>).</p>
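               <p>The generative process listed above can be simulated directly, which may help in reading the notation. The following Python sketch is illustrative only and relies on assumptions that are not part of the model specification: symmetric Dirichlet parameters, a fixed document length doc_len in place of a document-specific length, and the hypothetical names sample_dirichlet and generate_corpus. Words and topics are represented by integer indices.</p>

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(num_docs, doc_len, T, V, alpha, beta, seed=0):
    """Simulate documents from the LDA generative process.

    Returns (corpus, phi): corpus is a list of documents, each a list of
    word indices in range(V); phi[t] is the word distribution of topic t.
    """
    rng = random.Random(seed)

    # One V-dimensional word distribution per topic: phi_t ~ Dir(beta).
    phi = [sample_dirichlet(beta, V, rng) for _ in range(T)]

    corpus = []
    for _ in range(num_docs):
        # Topic proportions of this document: theta ~ Dir(alpha).
        theta = sample_dirichlet(alpha, T, rng)
        words = []
        for _ in range(doc_len):
            # Topic for this word position: z ~ Multi(theta).
            z = rng.choices(range(T), weights=theta)[0]
            # Word given the topic: w ~ Multi(phi_z).
            w = rng.choices(range(V), weights=phi[z])[0]
            words.append(w)
        corpus.append(words)
    return corpus, phi
```

               <p>For example, generate_corpus(3, 10, 4, 20, alpha=1.0, beta=0.1) returns three documents of ten word indices each; with a small value of beta, each simulated topic tends to concentrate its probability on a few words.</p>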
               <p>The probabilistic directed acyclic graphical representation of the LDA model is shown in Figure <figr fid="F5">5</figr> in plate notation <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. In a probabilistic graph, nodes represent random variables and edges represent the probabilistic relationship, i.e., the conditional probability, between the variables. The shaded and un-shaded nodes represent the observed and unobserved variables, respectively. Each rectangular plate represents a replica of the data structure; the number at the bottom right of each plate indicates the number of replicates. In this graph, each document is associated with a topic composition variable <it>&#952; </it>and a total of <it>N</it><sub><it>d </it></sub>replicates of the topic variable <it>z </it>and the word <it>w</it>. The graph also shows that there are <it>T </it>topic word distributions.</p>
               <fig id="F5">
                  <title>
                     <p>Figure 5</p>
                  </title>
                  <caption>
                     <p>A directed acyclic graphical representation of the LDA model in plate notation</p>
                  </caption>
                  <text>
                     <p>A directed acyclic graphical representation of the LDA model in plate notation.</p>
                  </text>
                  <graphic file="1471-2105-7-58-5"/>
               </fig>
            </sec>
            <sec>
               <st>
                  <p>Statistical learning</p>
               </st>
               <p>Given the observed documents, the learning task is to infer the topic-composition <it>&#952;</it><sub><it>d </it></sub>for each document; the topic variable, <it>z</it><sub><it>i</it></sub>, for each word,<it>w</it><sub><it>i</it></sub>, within the document; and the word distribution <it>&#966;</it><sub><it>t </it></sub>for each topic <it>t</it>. The exact inference of these unobserved variables is intractable. A Markov chain Monte Carlo (MCMC) <abbrgrp><abbr bid="B28">28</abbr></abbrgrp> inference algorithm by Griffiths and Steyvers <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> was adopted to perform approximate inference. Let <b>z </b>denote a vector of the instances of all latent topic variables and <b>w </b>denote a vector of all the observed words of the corpus. The algorithm concentrates on the joint probability <it>p</it>(<b>w</b>, <b>z</b>) and applies Gibbs sampling to instantiate the latent topic variable for each word. Gibbs sampling is a technique to generate samples from a complex posterior distribution <it>p</it>(<b>z </b>| <b>w</b>) by iteratively sampling and updating each component variable <it>z</it><sub><it>i </it></sub>according to the conditional distribution <it>p</it>(<it>z</it><sub><it>i</it></sub>| <b>z</b><sub>-<it>i</it></sub>, <b>w</b>), where <b>z</b><sub>-<it>i </it></sub>denotes the current instantiation of all the latent topic variables except <it>z</it><sub><it>i</it></sub>, and <b>w </b>denotes the vector of all observed words of the corpus. The Gibbs sampling procedure follows these steps: (1) randomly initialize the latent variables <b>z</b>; (2) each element z<sub><it>i </it></sub>of <b>z </b>is iteratively sampled and updated; (3) repeat step (2) until the Markov chain converges to the target posterior distribution <it>p</it>(<b>z </b>| <b>w</b>) ("burn in"); and (4) samples of <b>z </b>are collected from the Markov chain. 
The conditional distribution <it>p</it>(<it>z</it><sub><it>i </it></sub>| <b>z</b><sub>-<it>i</it></sub>, <b>w</b>) is defined as follows:</p>
               <p>
                  <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-58-i2">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>p</m:mi>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mrow>
                                 <m:msub>
                                    <m:mi>z</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                                 <m:mo>=</m:mo>
                                 <m:mi>j</m:mi>
                                 <m:mrow>
                                    <m:mo>|</m:mo>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>z</m:mi>
                                          <m:mrow>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mi>i</m:mi>
                                          </m:mrow>
                                       </m:msub>
                                       <m:mo>,</m:mo>
                                       <m:mi>w</m:mi>
                                    </m:mrow>
                                 </m:mrow>
                              </m:mrow>
                              <m:mo>)</m:mo>
                           </m:mrow>
                           <m:mo>&#8733;</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:msubsup>
                                    <m:mi>n</m:mi>
                                    <m:mrow>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mi>i</m:mi>
                                       <m:mi>j</m:mi>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msub>
                                          <m:mi>w</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:msubsup>
                                 <m:mo>+</m:mo>
                                 <m:mi>&#946;</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:msubsup>
                                    <m:mi>n</m:mi>
                                    <m:mrow>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mi>i</m:mi>
                                       <m:mi>j</m:mi>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mo>.</m:mo>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:msubsup>
                                 <m:mo>+</m:mo>
                                 <m:mi>V</m:mi>
                                 <m:mi>&#946;</m:mi>
                              </m:mrow>
                           </m:mfrac>
                           <m:mo>&#215;</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:msubsup>
                                    <m:mi>n</m:mi>
                                    <m:mrow>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mi>i</m:mi>
                                       <m:mi>j</m:mi>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msub>
                                          <m:mi>d</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:msubsup>
                                 <m:mo>+</m:mo>
                                 <m:mi>&#945;</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:msubsup>
                                    <m:mi>n</m:mi>
                                    <m:mrow>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mi>i</m:mi>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msub>
                                          <m:mi>d</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:msubsup>
                                 <m:mo>+</m:mo>
                                 <m:mi>T</m:mi>
                                 <m:mi>&#945;</m:mi>
                              </m:mrow>
                           </m:mfrac>
                           <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mn>1</m:mn>
                              <m:mo>)</m:mo>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGWbaCdaqadaqaaiabdQha6naaBaaaleaacqWGPbqAaeqaaOGaeyypa0JaemOAaO2aaqqaaeaacqWG6bGEdaWgaaWcbaGaeyOeI0IaemyAaKgabeaakiabcYcaSiabdEha3bGaay5bSdaacaGLOaGaayzkaaGaeyyhIu7aaSaaaeaacqWGUbGBdaqhaaWcbaGaeyOeI0IaemyAaKMaemOAaOgabaGaeiikaGIaem4DaC3aaSbaaWqaaiabdMgaPbqabaWccqGGPaqkaaGccqGHRaWkiiGacqWFYoGyaeaacqWGUbGBdaqhaaWcbaGaeyOeI0IaemyAaKMaemOAaOgabaGaeiikaGIaeiOla4IaeiykaKcaaOGaey4kaSIaemOvayLae8NSdigaaiabgEna0oaalaaabaGaemOBa42aa0baaSqaaiabgkHiTiabdMgaPjabdQgaQbqaaiabcIcaOiabdsgaKnaaBaaameaacqWGPbqAaeqaaSGaeiykaKcaaOGaey4kaSIae8xSdegabaGaemOBa42aa0baaSqaaiabgkHiTiabdMgaPbqaaiabcIcaOiabc6caUiabcMcaPaaakiabgUcaRiabdsfaujab=f7aHbaacaWLjaGaaCzcamaabmaabaGaeGymaedacaGLOaGaayzkaaaaaa@72F4@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>In equation (1), <m:math name="1471-2105-7-58-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>n</m:mi><m:mrow><m:mo>&#8722;</m:mo><m:mi>i</m:mi><m:mi>j</m:mi></m:mrow><m:mrow><m:mo stretchy="false">(</m:mo><m:msub><m:mi>w</m:mi><m:mi>i</m:mi></m:msub><m:mo stretchy="false">)</m:mo></m:mrow></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGUbGBdaqhaaWcbaGaeyOeI0IaemyAaKMaemOAaOgabaGaeiikaGIaem4DaC3aaSbaaWqaaiabdMgaPbqabaWccqGGPaqkaaaaaa@369F@</m:annotation></m:semantics></m:math> denotes the count of the words in the corpus that are indexed by <it>w</it><sub><it>i </it></sub>and assigned to the topic <it>j</it>, excluding the word <it>w</it><sub><it>i</it></sub>; <m:math name="1471-2105-7-58-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>n</m:mi><m:mrow><m:mo>&#8722;</m:mo><m:mi>i</m:mi><m:mi>j</m:mi></m:mrow><m:mrow><m:mo stretchy="false">(</m:mo><m:mo>.</m:mo><m:mo stretchy="false">)</m:mo></m:mrow></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGUbGBdaqhaaWcbaGaeyOeI0IaemyAaKMaemOAaOgabaGaeiikaGIaeiOla4IaeiykaKcaaaaa@3479@</m:annotation></m:semantics></m:math> is the count of all words assigned to the topic <it>j</it>, excluding the word <it>w</it><sub><it>i</it></sub>; <m:math name="1471-2105-7-58-i5" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>n</m:mi><m:mrow><m:mo>&#8722;</m:mo><m:mi>i</m:mi><m:mi>j</m:mi></m:mrow><m:mrow><m:mo stretchy="false">(</m:mo><m:msub><m:mi>d</m:mi><m:mi>i</m:mi></m:msub><m:mo stretchy="false">)</m:mo></m:mrow></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGUbGBdaqhaaWcbaGaeyOeI0IaemyAaKMaemOAaOgabaGaeiikaGIaemizaq2aaSbaaWqaaiabdMgaPbqabaWccqGGPaqkaaaaaa@3679@</m:annotation></m:semantics></m:math> is the count of words assigned to the topic <it>j </it>in document <it>d</it><sub><it>i </it></sub>that contains topic variable <it>z</it><sub><it>i</it></sub>, excluding <it>w</it><sub><it>i</it></sub>; <m:math name="1471-2105-7-58-i6" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>n</m:mi><m:mrow><m:mo>&#8722;</m:mo><m:mi>i</m:mi></m:mrow><m:mrow><m:mo stretchy="false">(</m:mo><m:msub><m:mi>d</m:mi><m:mi>i</m:mi></m:msub><m:mo stretchy="false">)</m:mo></m:mrow></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGUbGBdaqhaaWcbaGaeyOeI0IaemyAaKgabaGaeiikaGIaemizaq2aaSbaaWqaaiabdMgaPbqabaWccqGGPaqkaaaaaa@351C@</m:annotation></m:semantics></m:math> stands for the count of all the words in that document excluding <it>w</it><sub><it>i</it></sub>; and <it>&#945;, &#946;, V </it>and <it>T </it>were defined previously. During training of the LDA model, the values for the corpus level parameters were set as follows: <it>&#945; </it>= 1, <it>&#946; </it>= 0.1.</p>
               <p>Equation (1) offers an intuitive explanation of how the inference algorithm determines the topic label <it>z</it><sub><it>i </it></sub>for the observed word <it>w</it><sub><it>i</it></sub>. The first term on the right side indicates the likelihood of observing word <it>w</it><sub><it>i </it></sub>if its topic <it>z</it><sub><it>i</it></sub><it> = j</it>, e.g., the likelihood of observing the word "death" if the topic is <it>apoptosis</it>. The second term specifies the likelihood that a word in the document belongs to topic <it>j</it>, based on the context of the document. In plain English, the second term would read: "the word <it>w</it><sub><it>i </it></sub>more likely belongs to topic <it>j </it>if many other words in the document belong to topic <it>j</it>." For example, the word "death" is more likely to belong to the topic <it>apoptosis </it>if there are other words in the document, such as "apoptosis," "programmed," and "cell," belonging to the same topic.</p>
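               <p>The two terms of equation (1) can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names, the nested-list count layout and the toy counts are assumptions made for illustration only, and all counts are assumed to already exclude the token being resampled.</p>

```python
import random


def topic_conditional(word, doc, n_wt, n_t, n_dt, n_d, alpha, beta):
    """Normalized p(z_i = j | z_-i, w) following equation (1).

    Hypothetical count layout (all counts exclude the current token):
      n_wt[word][j]  tokens of this word type assigned to topic j
      n_t[j]         all tokens assigned to topic j
      n_dt[doc][j]   tokens of this document assigned to topic j
      n_d[doc]       all tokens of this document
    """
    T = len(n_t)    # number of topics
    V = len(n_wt)   # vocabulary size
    weights = []
    for j in range(T):
        # First term: how strongly topic j favors this word.
        word_term = (n_wt[word][j] + beta) / (n_t[j] + V * beta)
        # Second term: how prevalent topic j is in this document.
        doc_term = (n_dt[doc][j] + alpha) / (n_d[doc] + T * alpha)
        weights.append(word_term * doc_term)
    total = sum(weights)
    return [w / total for w in weights]


def sample_topic(probs, rng=random):
    """Draw one topic index according to the conditional distribution."""
    return rng.choices(range(len(probs)), weights=probs)[0]


# Toy example: 3 word types, 2 topics, 1 document, alpha = 1 and beta = 0.1.
toy_n_wt = [[5, 0], [0, 4], [1, 1]]
toy_probs = topic_conditional(0, 0, toy_n_wt, [6, 5], [[3, 1]], [4],
                              alpha=1.0, beta=0.1)
```

               <p>In this toy example, word 0 has so far been assigned only to topic 0, and topic 0 also dominates the document, so both terms favor topic 0 and it receives most of the probability mass.</p>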
               <p>Once the vector of the latent topics <b>z </b>is instantiated by sampling, the parameters governing the posterior distribution of <it>&#952; </it>and <it>&#966; </it>can be estimated analytically as follows:</p>
               <p>
                  <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-7-58-i7">
                     <m:semantics>
                        <m:mrow>
                           <m:msubsup>
                              <m:mover accent="true">
                                 <m:mi>&#952;</m:mi>
                                 <m:mo>^</m:mo>
                              </m:mover>
                              <m:mi>j</m:mi>
                              <m:mrow>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>d</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                           </m:msubsup>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:msubsup>
                                    <m:mi>n</m:mi>
                                    <m:mi>j</m:mi>
                                    <m:mrow>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>d</m:mi>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:msubsup>
                                 <m:mo>+</m:mo>
                                 <m:mi>&#945;</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:msubsup>
                                    <m:mi>n</m:mi>
                                    <m:mo>.</m:mo>
                                    <m:mrow>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>d</m:mi>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:msubsup>
                                 <m:mo>+</m:mo>
                                 <m:mi>T</m:mi>
                                 <m:mi>&#945;</m:mi>
                              </m:mrow>
                           </m:mfrac>
                           <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mn>2</m:mn>
                              <m:mo>)</m:mo>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWF4oqCgaqcamaaDaaaleaacqWGQbGAaeaacqGGOaakcqWGKbazcqGGPaqkaaGccqGH9aqpdaWcaaqaaiabd6gaUnaaDaaaleaacqWGQbGAaeaacqGGOaakcqWGKbazcqGGPaqkaaGccqGHRaWkcqWFXoqyaeaacqWGUbGBdaqhaaWcbaGaemyAaKgabaGaeiikaGIaemizaqMaeiykaKcaaOGaey4kaSIaemivaqLae8xSdegaaiaaxMaacaWLjaWaaeWaaeaacqaIYaGmaiaawIcacaGLPaaaaaa@4A04@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>
                  <m:math name="1471-2105-7-58-i8" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msubsup>
                              <m:mover accent="true">
                                 <m:mi>&#966;</m:mi>
                                 <m:mo>^</m:mo>
                              </m:mover>
                              <m:mi>j</m:mi>
                              <m:mrow>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>v</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                           </m:msubsup>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:msubsup>
                                    <m:mi>n</m:mi>
                                    <m:mi>j</m:mi>
                                    <m:mrow>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>v</m:mi>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:msubsup>
                                 <m:mo>+</m:mo>
                                 <m:mi>&#946;</m:mi>
                              </m:mrow>
                              <m:mrow>
                                 <m:msubsup>
                                    <m:mi>n</m:mi>
                                    <m:mi>j</m:mi>
                                    <m:mrow>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mo>.</m:mo>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:msubsup>
                                 <m:mo>+</m:mo>
                                 <m:mi>V</m:mi>
                                 <m:mi>&#946;</m:mi>
                              </m:mrow>
                           </m:mfrac>
                           <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mn>3</m:mn>
                              <m:mo>)</m:mo>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFgpGzgaqcamaaDaaaleaacqWGQbGAaeaacqGGOaakcqWG2bGDcqGGPaqkaaGccqGH9aqpdaWcaaqaaiabd6gaUnaaDaaaleaacqWGQbGAaeaacqWG2bGDaaGccqGHRaWkcqWFYoGyaeaacqWGUbGBdaqhaaWcbaGaemOAaOgabaGaeiikaGIaeiOla4IaeiykaKcaaOGaey4kaSIaemOvayLae8NSdigaaiaaxMaacaWLjaWaaeWaaeaacqaIZaWmaiaawIcacaGLPaaaaaa@483C@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>where <m:math name="1471-2105-7-58-i9" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>n</m:mi><m:mi>j</m:mi><m:mrow><m:mo stretchy="false">(</m:mo><m:mi>d</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGUbGBdaqhaaWcbaGaemOAaOgabaGaemizaqgaaaaa@30EC@</m:annotation></m:semantics></m:math> is the number of words assigned to topic <it>j </it>in document <it>d</it>; <it>n.</it><sup>(<it>d</it>) </sup>is the total number of words in document <it>d</it>; <m:math name="1471-2105-7-58-i10" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>n</m:mi><m:mi>j</m:mi><m:mrow><m:mo stretchy="false">(</m:mo><m:mi>v</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGUbGBdaqhaaWcbaGaemOAaOgabaGaeiikaGIaemODayNaeiykaKcaaaaa@32C2@</m:annotation></m:semantics></m:math> is the number of times the word indexed by <it>v </it>is assigned to topic <it>j</it>; and <m:math name="1471-2105-7-58-i11" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>n</m:mi><m:mi>j</m:mi><m:mrow><m:mo stretchy="false">(</m:mo><m:mo>.</m:mo><m:mo stretchy="false">)</m:mo></m:mrow></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGUbGBdaqhaaWcbaGaemOAaOgabaGaeiikaGIaeiOla4IaeiykaKcaaaaa@3231@</m:annotation></m:semantics></m:math> is the total number of words assigned to topic <it>j</it>.</p>
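               <p>As a worked illustration of equations (2) and (3), the following Python sketch computes the smoothed estimates from one row of counts. The function names and the list-based count layout are hypothetical and are not taken from the software used in this study.</p>

```python
def estimate_theta(doc_topic_counts, alpha):
    """Equation (2): estimated topic proportions of one document.

    doc_topic_counts[j] is the number of words in the document
    that are assigned to topic j.
    """
    T = len(doc_topic_counts)
    n_doc = sum(doc_topic_counts)  # total number of words in the document
    return [(n + alpha) / (n_doc + T * alpha) for n in doc_topic_counts]


def estimate_phi(topic_word_counts, beta):
    """Equation (3): estimated word distribution of one topic.

    topic_word_counts[v] is the number of times the word indexed by v
    is assigned to the topic.
    """
    V = len(topic_word_counts)
    n_topic = sum(topic_word_counts)  # total number of words in the topic
    return [(n + beta) / (n_topic + V * beta) for n in topic_word_counts]


# Toy example: a 4-word document with 3 words in topic 0 and 1 in topic 1.
toy_theta = estimate_theta([3, 1], alpha=1.0)   # [4/6, 2/6]
```

               <p>Because the hyperparameters are added to every count, both estimates are proper probability distributions even when a topic or word has a zero count.</p>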
            </sec>
            <sec>
               <st>
                  <p>Inference for new data</p>
               </st>
               <p>A trained model can be used to infer the latent topic variables <b>z </b>and estimate <it>&#952;</it><sub><it>d </it></sub>for a newly observed document. This is achieved by sampling <b>z </b>from the posterior distribution with MCMC according to equation (1). During sampling, the first term of equation (1) is replaced with the previously learned <m:math name="1471-2105-7-58-i12" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>&#966;</m:mi><m:mo>^</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWFgpGzgaqcaaaa@2E7C@</m:annotation></m:semantics></m:math> from equation (3), and only the counts in the second term are updated.</p>
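               <p>The procedure for a new document can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the code used in this study: the function name, the number of sweeps, the fixed seed and the use of the final sample alone to estimate the topic proportions are all choices made for the example. The denominator of the second term of equation (1) is the same for every topic, so it is dropped before normalization.</p>

```python
import random


def infer_new_document(words, phi, alpha, sweeps=200, seed=0):
    """Sample topic labels for one new document with the learned phi held fixed.

    words  list of word indices of the new document
    phi    phi[j][v] = previously learned probability of word v under topic j
    Returns the final topic labels z and a smoothed estimate of theta.
    """
    rng = random.Random(seed)
    T = len(phi)
    z = [rng.randrange(T) for _ in words]   # random initial labels
    n_dt = [0] * T                          # per-topic counts in this document
    for topic in z:
        n_dt[topic] += 1
    for _ in range(sweeps):
        for i, v in enumerate(words):
            n_dt[z[i]] -= 1                 # exclude the current token
            # Fixed word term times the document term of equation (1).
            weights = [phi[j][v] * (n_dt[j] + alpha) for j in range(T)]
            z[i] = rng.choices(range(T), weights=weights)[0]
            n_dt[z[i]] += 1                 # only document counts change
    n = len(words)
    theta = [(n_dt[j] + alpha) / (n + T * alpha) for j in range(T)]
    return z, theta
```

               <p>In practice one would usually average the estimate over several samples or chains; a single final sample is used here only to keep the sketch short.</p>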
            </sec>
            <sec>
               <st>
                  <p>Model Selection</p>
               </st>
               <p>One objective of model training is to fit the data well while avoiding overfitting. From a statistical learning point of view, this is a model selection problem that can be addressed within a Bayesian model selection framework by selecting the optimal model <m:math name="1471-2105-7-58-i13" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>M</m:mi><m:mo>^</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGnbqtgaqcaaaa@2DDF@</m:annotation></m:semantics></m:math> that has the highest posterior probability conditional on the observed data <b>w</b>, as follows:</p>
               <p>
                  <m:math name="1471-2105-7-58-i14" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mover accent="true">
                              <m:mi>M</m:mi>
                              <m:mo>^</m:mo>
                           </m:mover>
                           <m:mo>=</m:mo>
                           <m:munder>
                              <m:mrow>
                                 <m:mtext>arg&#160;max</m:mtext>
                              </m:mrow>
                              <m:mi>M</m:mi>
                           </m:munder>
                           <m:mtext>&#160;</m:mtext>
                           <m:mi>p</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>M</m:mi>
                           <m:mo>|</m:mo>
                           <m:mi>w</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>,</m:mo>
                           <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mn>4</m:mn>
                              <m:mo>)</m:mo>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGnbqtgaqcaiabg2da9maaxababaGaeeyyaeMaeeOCaiNaee4zaCMaeeiiaaIaeeyBa0MaeeyyaeMaeeiEaGhaleaacqWGnbqtaeqaaOGaeeiiaaIaemiCaaNaeiikaGIaemyta0KaeiiFaWNaem4DaCNaeiykaKIaeiilaWIaaCzcaiaaxMaadaqadaqaaiabisda0aGaayjkaiaawMcaaaaa@45DB@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>
                  <m:math name="1471-2105-7-58-i15" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>p</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>M</m:mi>
                           <m:mo>|</m:mo>
                           <m:mi>w</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mi>p</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>w</m:mi>
                                 <m:mo>|</m:mo>
                                 <m:mi>M</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                                 <m:mi>p</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>M</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>p</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>w</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                           </m:mfrac>
                           <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mn>5</m:mn>
                              <m:mo>)</m:mo>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGWbaCcqGGOaakcqWGnbqtcqGG8baFcqWG3bWDcqGGPaqkcqGH9aqpdaWcaaqaaiabdchaWjabcIcaOiabdEha3jabcYha8jabd2eanjabcMcaPiabdchaWjabcIcaOiabd2eanjabcMcaPaqaaiabdchaWjabcIcaOiabdEha3jabcMcaPaaacaWLjaGaaCzcamaabmaabaGaeGynaudacaGLOaGaayzkaaaaaa@48C1@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>Assuming an uninformative prior distribution <it>p</it>(<it>M</it>) over the models, model selection was determined by the evidence (marginal likelihood) <it>p</it>(<b>w </b>| <it>M</it>), which can be calculated by integrating out the latent parameters and variables:</p>
               <p>
                  <m:math name="1471-2105-7-58-i16" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>p</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>w</m:mi>
                           <m:mo>|</m:mo>
                           <m:mi>M</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:mrow>
                                 <m:msub>
                                    <m:mo>&#8747;</m:mo>
                                    <m:mi>&#966;</m:mi>
                                 </m:msub>
                                 <m:mrow>
                                    <m:mstyle displaystyle="true">
                                       <m:mrow>
                                          <m:msub>
                                             <m:mo>&#8747;</m:mo>
                                             <m:mi>&#952;</m:mi>
                                          </m:msub>
                                          <m:mrow>
                                             <m:mstyle displaystyle="true">
                                                <m:munder>
                                                   <m:mo>&#8721;</m:mo>
                                                   <m:mi>z</m:mi>
                                                </m:munder>
                                                <m:mrow>
                                                   <m:mi>p</m:mi>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:mi>w</m:mi>
                                                   <m:mo>,</m:mo>
                                                   <m:mi>z</m:mi>
                                                   <m:mo>,</m:mo>
                                                   <m:mi>&#966;</m:mi>
                                                   <m:mo>,</m:mo>
                                                   <m:mi>&#952;</m:mi>
                                                   <m:mo>|</m:mo>
                                                   <m:mi>M</m:mi>
                                                   <m:mo stretchy="false">)</m:mo>
                                                   <m:mi>d</m:mi>
                                                   <m:mi>&#952;</m:mi>
                                                   <m:mi>d</m:mi>
                                                   <m:mi>&#966;</m:mi>
                                                   <m:mo>.</m:mo>
                                                   <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                                                   <m:mrow>
                                                      <m:mo>(</m:mo>
                                                      <m:mn>6</m:mn>
                                                      <m:mo>)</m:mo>
                                                   </m:mrow>
                                                </m:mrow>
                                             </m:mstyle>
                                          </m:mrow>
                                       </m:mrow>
                                    </m:mstyle>
                                 </m:mrow>
                              </m:mrow>
                           </m:mstyle>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGWbaCcqGGOaakcqWG3bWDcqGG8baFcqWGnbqtcqGGPaqkcqGH9aqpdaWdraqaamaapebabaWaaabuaeaacqWGWbaCcqGGOaakcqWG3bWDcqGGSaalcqWG6bGEcqGGSaaliiGacqWFgpGzcqWFSaalcqWF4oqCcqGG8baFcqWGnbqtcqGGPaqkcqWGKbazcqWF4oqCcqWGKbazcqWFgpGzcqWFUaGlcaWLjaGaaCzcamaabmaabaGaeGOnaydacaGLOaGaayzkaaaaleaacqWG6bGEaeqaniabggHiLdaaleaacqWF4oqCaeqaniabgUIiYdaaleaacqWFvpGAaeqaniabgUIiYdaaaa@5982@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>The summation and integration in equation (6) are intractable. Instead, a Monte Carlo approximation of this quantity was employed <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. With the parameters <it>&#945; </it>and <it>&#946; </it>fixed, two models <it>M</it><sub><it>l </it></sub>and <it>M</it><sub><it>k </it></sub>differ only in their numbers of topics, <it>T</it><sub><it>l </it></sub>and <it>T</it><sub><it>k</it></sub>. For a model with a given number of topics, <it>T</it>, the evidence <it>p</it>(<b>w </b>| <it>M</it>) was approximated as follows: 40 samples of latent variable vectors, {<b>z</b><sub>1</sub>, <b>z</b><sub>2</sub>, ..., <b>z</b><sub>40</sub>}, were collected from 4 randomly initialized Markov chains according to equation (1). Then, the conditional probability <it>p</it>(<b>w </b>| <b>z</b>, <it>M</it>) for each sample <b>z </b>was evaluated by analytically integrating out <it>&#966;</it>:</p>
               <p>
                  <m:math name="1471-2105-7-58-i17" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>p</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>w</m:mi>
                           <m:mo>|</m:mo>
                           <m:mi>z</m:mi>
                           <m:mo>,</m:mo>
                           <m:mi>M</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:msup>
                              <m:mrow>
                                 <m:mrow>
                                    <m:mo>(</m:mo>
                                    <m:mrow>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mi>&#915;</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>V</m:mi>
                                             <m:mi>&#946;</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mi>&#915;</m:mi>
                                             <m:msup>
                                                <m:mrow>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:mi>&#946;</m:mi>
                                                   <m:mo stretchy="false">)</m:mo>
                                                </m:mrow>
                                                <m:mi>V</m:mi>
                                             </m:msup>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                    <m:mo>)</m:mo>
                                 </m:mrow>
                              </m:mrow>
                              <m:mi>T</m:mi>
                           </m:msup>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8719;</m:mo>
                                 <m:mrow>
                                    <m:mi>j</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mi>T</m:mi>
                              </m:munderover>
                              <m:mrow>
                                 <m:mfrac>
                                    <m:mrow>
                                       <m:mstyle displaystyle="true">
                                          <m:msubsup>
                                             <m:mo>&#8719;</m:mo>
                                             <m:mrow>
                                                <m:mi>i</m:mi>
                                                <m:mo>=</m:mo>
                                                <m:mn>1</m:mn>
                                             </m:mrow>
                                             <m:mi>V</m:mi>
                                          </m:msubsup>
                                          <m:mrow>
                                             <m:mi>&#915;</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:msubsup>
                                                <m:mi>n</m:mi>
                                                <m:mi>j</m:mi>
                                                <m:mrow>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:mi>i</m:mi>
                                                   <m:mo stretchy="false">)</m:mo>
                                                </m:mrow>
                                             </m:msubsup>
                                             <m:mo>+</m:mo>
                                             <m:mi>&#946;</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                       </m:mstyle>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>&#915;</m:mi>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msubsup>
                                          <m:mi>n</m:mi>
                                          <m:mi>j</m:mi>
                                          <m:mrow>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mo>.</m:mo>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                       </m:msubsup>
                                       <m:mo>+</m:mo>
                                       <m:mi>V</m:mi>
                                       <m:mi>&#946;</m:mi>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:mfrac>
                              </m:mrow>
                           </m:mstyle>
                           <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mn>7</m:mn>
                              <m:mo>)</m:mo>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGWbaCcqGGOaakcqWG3bWDcqGG8baFcqWG6bGEcqGGSaalcqWGnbqtcqGGPaqkcqGH9aqpdaqadaqaamaalaaabaGaeu4KdCKaeiikaGIaemOvayfcciGae8NSdiMaeiykaKcabaGaeu4KdCKaeiikaGIae8NSdiMaeiykaKYaaWbaaSqabeaacqWGwbGvaaaaaaGccaGLOaGaayzkaaWaaWbaaSqabeaacqWGubavaaGcdaqeWbqaamaalaaabaWaaebmaeaacqqHtoWrcqGGOaakcqWGUbGBdaqhaaWcbaGaemOAaOgabaGaeiikaGIaemyAaKMaeiykaKcaaOGaey4kaSIae8NSdiMaeiykaKcaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWG2bGDa0Gaey4dIunaaOqaaiabfo5ahjabcIcaOiabd6gaUnaaDaaaleaacqWGQbGAaeaacqGGOaakcqGGUaGlcqGGPaqkaaGccqGHRaWkcqWGwbGvcqWFYoGycqGGPaqkaaaaleaacqWGQbGAcqGH9aqpcqaIXaqmaeaacqWGubava0Gaey4dIunakiaaxMaacaWLjaWaaeWaaeaacqaI3aWnaiaawIcacaGLPaaaaaa@6FB1@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>The evidence <it>p</it>(<b>w </b>| <it>M</it>) was approximated by the harmonic mean of the sample conditional probabilities <it>p</it>(<b>w </b>| <b>z</b>, <it>M</it>) <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. Selection among the models with different <it>T </it>was carried out based on the approximated evidence.</p>
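               <p>For a realistic corpus, the probabilities in equation (7) are far too small to represent directly, so a practical calculation works with logarithms. The Python sketch below shows one way to evaluate equation (7) and the harmonic mean in log space. It is an assumption-laden illustration, not the code used in this study: the function names and the topic-by-word count matrix layout are hypothetical.</p>

```python
import math


def log_p_w_given_z(n_tw, beta):
    """Logarithm of equation (7).

    n_tw[j][v] is the number of times the word indexed by v is
    assigned to topic j in one sample of z.
    """
    T = len(n_tw)
    V = len(n_tw[0])
    # Leading factor (Gamma(V*beta) / Gamma(beta)^V)^T, in log form.
    log_p = T * (math.lgamma(V * beta) - V * math.lgamma(beta))
    for row in n_tw:
        log_p += sum(math.lgamma(n + beta) for n in row)
        log_p -= math.lgamma(sum(row) + V * beta)
    return log_p


def log_harmonic_mean(log_values):
    """Logarithm of the harmonic mean of exp(log_values).

    Uses log H = log N - logsumexp(-log p_1, ..., -log p_N),
    with the maximum factored out for numerical stability.
    """
    negated = [-x for x in log_values]
    largest = max(negated)
    log_sum = largest + math.log(sum(math.exp(x - largest) for x in negated))
    return math.log(len(log_values)) - log_sum
```

               <p>Given one count matrix per sample of <b>z</b>, the approximate log evidence for a model is <it>log_harmonic_mean</it> applied to the list of <it>log_p_w_given_z</it> values, and models with different <it>T </it>can be compared on that quantity.</p>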
            </sec>
         </sec>
         <sec>
            <st>
               <p>Mutual information</p>
            </st>
            <p>MI is a symmetric, non-negative quantity that measures the amount of information one variable contains with respect to another variable, and it equals zero if and only if the variables are independent. The MI between a latent topic and a GO term was calculated as follows:</p>
            <p>
               <m:math name="1471-2105-7-58-i18" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:mi>I</m:mi>
                        <m:mo stretchy="false">(</m:mo>
                        <m:msub>
                           <m:mi>A</m:mi>
                           <m:mi>g</m:mi>
                        </m:msub>
                        <m:mo>,</m:mo>
                        <m:msub>
                           <m:mi>L</m:mi>
                           <m:mi>t</m:mi>
                        </m:msub>
                        <m:mo stretchy="false">)</m:mo>
                        <m:mo>=</m:mo>
                        <m:mstyle displaystyle="true">
                           <m:munder>
                              <m:mo>&#8721;</m:mo>
                              <m:mrow>
                                 <m:msub>
                                    <m:mi>A</m:mi>
                                    <m:mi>g</m:mi>
                                 </m:msub>
                                 <m:mo>,</m:mo>
                                 <m:msub>
                                    <m:mi>L</m:mi>
                                    <m:mi>t</m:mi>
                                 </m:msub>
                              </m:mrow>
                           </m:munder>
                           <m:mrow>
                              <m:mi>p</m:mi>
                              <m:mo stretchy="false">(</m:mo>
                              <m:msub>
                                 <m:mi>A</m:mi>
                                 <m:mi>g</m:mi>
                              </m:msub>
                              <m:mo>,</m:mo>
                              <m:msub>
                                 <m:mi>L</m:mi>
                                 <m:mi>t</m:mi>
                              </m:msub>
                              <m:mo stretchy="false">)</m:mo>
                              <m:mi>log</m:mi>
                              <m:mo>&#8289;</m:mo>
                              <m:mfrac>
                                 <m:mrow>
                                    <m:mi>p</m:mi>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:msub>
                                       <m:mi>A</m:mi>
                                       <m:mi>g</m:mi>
                                    </m:msub>
                                    <m:mo>,</m:mo>
                                    <m:msub>
                                       <m:mi>L</m:mi>
                                       <m:mi>t</m:mi>
                                    </m:msub>
                                    <m:mo stretchy="false">)</m:mo>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:mi>p</m:mi>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:msub>
                                       <m:mi>A</m:mi>
                                       <m:mi>g</m:mi>
                                    </m:msub>
                                    <m:mo stretchy="false">)</m:mo>
                                    <m:mi>p</m:mi>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:msub>
                                       <m:mi>L</m:mi>
                                       <m:mi>t</m:mi>
                                    </m:msub>
                                    <m:mo stretchy="false">)</m:mo>
                                 </m:mrow>
                              </m:mfrac>
                           </m:mrow>
                        </m:mstyle>
                        <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                        <m:mrow>
                           <m:mo>(</m:mo>
                           <m:mn>8</m:mn>
                           <m:mo>)</m:mo>
                        </m:mrow>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGjbqscqGGOaakcqWGbbqqdaWgaaWcbaGaem4zaCgabeaakiabcYcaSiabdYeamnaaBaaaleaacqWG0baDaeqaaOGaeiykaKIaeyypa0ZaaabuaeaacqWGWbaCcqGGOaakcqWGbbqqdaWgaaWcbaGaem4zaCgabeaakiabcYcaSiabdYeamnaaBaaaleaacqWG0baDaeqaaOGaeiykaKIagiiBaWMaei4Ba8Maei4zaC2aaSaaaeaacqWGWbaCcqGGOaakcqWGbbqqdaWgaaWcbaGaem4zaCMaeiilaWcabeaakiabcYcaSiabdYeamnaaBaaaleaacqWG0baDaeqaaOGaeiykaKcabaGaemiCaaNaeiikaGIaemyqae0aaSbaaSqaaiabdEgaNbqabaGccqGGPaqkcqWGWbaCcqGGOaakcqWGmbatdaWgaaWcbaGaemiDaqhabeaakiabcMcaPaaaaSqaaiabdgeabnaaBaaameaacqWGNbWzaeqaaSGaeiilaWIaemitaW0aaSbaaWqaaiabdsha0bqabaaaleqaniabggHiLdGccaWLjaGaaCzcamaabmaabaGaeGioaGdacaGLOaGaayzkaaaaaa@6655@</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
            <p>where <it>I</it>(<it>A</it><sub><it>g</it></sub><it>, L</it><sub><it>t</it></sub>) is the mutual information between annotating a word with GO term <it>g </it>and labeling the word with topic <it>t</it>; <it>A</it><sub><it>g </it></sub>and <it>L</it><sub><it>t </it></sub>are binary variables indicating whether a word is annotated with the GO term <it>g </it>and assigned to the topic <it>t</it>, respectively. The topic labeling of a word was determined according to the inferred latent variable samples <b>z</b>. We specified that each word within a given document was annotated with a GO term <it>g </it>if the document was annotated with the term <it>g</it>. Note that this is a strong assumption, which may skew the MI value for some uncommon topics. The joint and marginal probabilities in equation (8) were estimated empirically by counting the events.</p>
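            <p>A minimal sketch of the empirical estimate of equation (8), written in Python for illustration. The function name and the four count arguments are hypothetical; the natural logarithm is an assumption, since the base is not stated in the text, and cells with zero count are taken to contribute zero.</p>

```python
import math


def mutual_information(n11, n10, n01, n00):
    """MI (in nats) between binary A_g and L_t from a 2x2 table of word counts.

    n11: words annotated with g and assigned to topic t
    n10: words annotated with g, not assigned to t
    n01: words not annotated with g, assigned to t
    n00: words with neither
    """
    total = n11 + n10 + n01 + n00
    if total == 0:
        raise ValueError("empty count table")
    joint = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    # Marginals p(A_g) and p(L_t) estimated from the same counts.
    p_a = {1: (n11 + n10) / total, 0: (n01 + n00) / total}
    p_l = {1: (n11 + n01) / total, 0: (n10 + n00) / total}
    mi = 0.0
    for (a, l), count in joint.items():
        if count == 0:
            continue  # 0 * log 0 is taken as 0
        p_al = count / total
        mi += p_al * math.log(p_al / (p_a[a] * p_l[l]))
    return mi
```

            <p>Under this sketch, a table in which the two variables are independent gives an MI of zero, while a table in which annotation and topic assignment always coincide and each occurs for half of the words gives log 2.</p>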
         </sec>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>BZ performed the data collection, processing and model training experiments. DCM carried out the evaluation of the results. XL conceived and directed the study and implemented the LDA inference program.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>XL is partially supported by the Medical University of South Carolina cardiovascular COBRE grant from NIH/NCRR (5 P20 RR016434-04) and NIH/NLM 5T15LM007438-03. DCM is supported by the NLM training grant 5T15-LM007438-02. The authors would like to thank Drs. Mark Steyvers, Alan Aronson, Chengxiang Zhai, and anonymous reviewers for their discussions and suggestions.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Gene ontology: tool for the unification of biology. The Gene Ontology Consortium</p>
            </title>
            <aug>
               <au>
                  <snm>Ashburner</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Ball</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Blake</snm>
                  <fnm>JA</fnm>
               </au>
               <au>
                  <snm>Botstein</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Butler</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Cherry</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Davis</snm>
                  <fnm>AP</fnm>
               </au>
               <au>
                  <snm>Dolinski</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Dwight</snm>
                  <fnm>SS</fnm>
               </au>
               <au>
                  <snm>Eppig</snm>
                  <fnm>JT</fnm>
               </au>
               <au>
                  <snm>Harris</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Hill</snm>
                  <fnm>DP</fnm>
               </au>
               <au>
                  <snm>Issel-Tarver</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Kasarskis</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Lewis</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Matese</snm>
                  <fnm>JC</fnm>
               </au>
               <au>
                  <snm>Richardson</snm>
                  <fnm>JE</fnm>
               </au>
               <au>
                  <snm>Ringwald</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Rubin</snm>
                  <fnm>GM</fnm>
               </au>
               <au>
                  <snm>Sherlock</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2000</pubdate>
            <volume>25</volume>
            <issue>1</issue>
            <fpage>25</fpage>
            <lpage>29</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/75556</pubid>
                  <pubid idtype="pmpid" link="fulltext">10802651</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>The Gene Ontology Annotation (GOA) Database &#8211; an integrated resource of GO annotations to the UniProt Knowledgebase</p>
            </title>
            <aug>
               <au>
                  <snm>Camon</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Barrell</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Dimmer</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Apweiler</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>In Silico Biol</source>
            <pubdate>2004</pubdate>
            <volume>4</volume>
            <issue>1</issue>
            <fpage>5</fpage>
            <lpage>6</lpage>
            <xrefbib>
               <pubid idtype="pmpid">15089749</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Foundations of statistical natural language processing</p>
            </title>
            <aug>
               <au>
                  <snm>Manning</snm>
                  <fnm>CD</fnm>
               </au>
               <au>
                  <snm>Schutze</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <publisher>Cambridge, MA: MIT Press</publisher>
            <pubdate>1999</pubdate>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Speech and language processing</p>
            </title>
            <aug>
               <au>
                  <snm>Jurafsky</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>JH</fnm>
               </au>
            </aug>
            <publisher>Upper Saddle River, NJ: Prentice Hall</publisher>
            <pubdate>2000</pubdate>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Modern Information Retrieval</p>
            </title>
            <aug>
               <au>
                  <snm>Baeza-Yates</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Ribeiro-Neto</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <publisher>Pearson Education Limited and ACM Press</publisher>
            <pubdate>1999</pubdate>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Accomplishments and challenges in literature data mining for biology</p>
            </title>
            <aug>
               <au>
                  <snm>Hirschman</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Park</snm>
                  <fnm>JC</fnm>
               </au>
               <au>
                  <snm>Tsujii</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Wong</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Wu</snm>
                  <fnm>CH</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <issue>12</issue>
            <fpage>1553</fpage>
            <lpage>1561</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/18.12.1553</pubid>
                  <pubid idtype="pmpid" link="fulltext">12490438</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>TREC 2004 genomics track overview</p>
            </title>
            <aug>
               <au>
                  <snm>Hersh</snm>
                  <fnm>WR</fnm>
               </au>
               <au>
                  <snm>Bhuptiraju</snm>
                  <fnm>RT</fnm>
               </au>
               <au>
                  <snm>Ross</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Johnson</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Cohen</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Kraemer</snm>
                  <fnm>DF</fnm>
               </au>
            </aug>
            <source>Text Retrieval Conference (TREC) 2004</source>
            <pubdate>2004</pubdate>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Text-mining and information-retrieval services for molecular biology</p>
            </title>
            <aug>
               <au>
                  <snm>Krallinger</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Valencia</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <issue>7</issue>
            <fpage>224</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1175978</pubid>
                  <pubid idtype="pmpid" link="fulltext">15998455</pubid>
                  <pubid idtype="doi">10.1186/gb-2005-6-7-224</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Overview of BioCreAtIvE: critical assessment of information extraction for biology</p>
            </title>
            <aug>
               <au>
                  <snm>Hirschman</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Yeh</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Blaschke</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Valencia</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <issue>Suppl 1</issue>
            <fpage>S1</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1186/1471-2105-6-S1-S1</pubid>
                  <pubid idtype="pmpid" link="fulltext">15960821</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Gene clustering by latent semantic indexing of MEDLINE abstracts</p>
            </title>
            <aug>
               <au>
                  <snm>Homayouni</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Heinrich</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Wei</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Berry</snm>
                  <fnm>MW</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>1</issue>
            <fpage>104</fpage>
            <lpage>115</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bth464</pubid>
                  <pubid idtype="pmpid" link="fulltext">15308538</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>A literature-based method for assessing the functional coherence of a gene group</p>
            </title>
            <aug>
               <au>
                  <snm>Raychaudhuri</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Altman</snm>
                  <fnm>RB</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <issue>3</issue>
            <fpage>396</fpage>
            <lpage>401</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btg002</pubid>
                  <pubid idtype="pmpid" link="fulltext">12584126</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>A semantic analysis of the annotations of the human genome</p>
            </title>
            <aug>
               <au>
                  <snm>Khatri</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Done</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Rao</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Done</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Draghici</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>16</issue>
            <fpage>3416</fpage>
            <lpage>3421</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti538</pubid>
                  <pubid idtype="pmpid" link="fulltext">15955782</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Ontological analysis of gene expression data: current tools, limitations, and open problems</p>
            </title>
            <aug>
               <au>
                  <snm>Khatri</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Draghici</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <fpage>3587</fpage>
            <lpage>3595</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti565</pubid>
                  <pubid idtype="pmpid" link="fulltext">15994189</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Latent Dirichlet Allocation</p>
            </title>
            <aug>
               <au>
                  <snm>Blei</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Ng</snm>
                  <fnm>AY</fnm>
               </au>
               <au>
                  <snm>Jordan</snm>
                  <fnm>MI</fnm>
               </au>
            </aug>
            <source>Journal of Machine Learning Research</source>
            <pubdate>2003</pubdate>
            <volume>3</volume>
            <fpage>993</fpage>
            <lpage>1022</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1162/jmlr.2003.3.4-5.993</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Finding scientific topics</p>
            </title>
            <aug>
               <au>
                  <snm>Griffiths</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Steyvers</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2004</pubdate>
            <volume>101</volume>
            <issue>Suppl 1</issue>
            <fpage>5228</fpage>
            <lpage>5235</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">387300</pubid>
                  <pubid idtype="pmpid" link="fulltext">14872004</pubid>
                  <pubid idtype="doi">10.1073/pnas.0307752101</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup</p>
            </title>
            <aug>
               <au>
                  <snm>Yeh</snm>
                  <fnm>AS</fnm>
               </au>
               <au>
                  <snm>Hirschman</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Morgan</snm>
                  <fnm>AA</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <issue>Suppl 1</issue>
            <fpage>i331</fpage>
            <lpage>339</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btg1046</pubid>
                  <pubid idtype="pmpid" link="fulltext">12855478</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>The Universal Protein Resource (UniProt)</p>
            </title>
            <aug>
               <au>
                  <snm>Bairoch</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Apweiler</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Wu</snm>
                  <fnm>CH</fnm>
               </au>
               <au>
                  <snm>Barker</snm>
                  <fnm>WC</fnm>
               </au>
               <au>
                  <snm>Boeckmann</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Ferro</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Gasteiger</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Huang</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Lopez</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Magrane</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Natale</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>O'Donovan</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Redaschi</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Yeh</snm>
                  <fnm>LS</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2005</pubdate>
            <volume>33</volume>
            <issue>Database</issue>
            <fpage>D154</fpage>
            <lpage>159</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">540024</pubid>
                  <pubid idtype="pmpid" link="fulltext">15608167</pubid>
                  <pubid idtype="doi">10.1093/nar/gki070</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Information theory, inference and learning algorithms</p>
            </title>
            <aug>
               <au>
                  <snm>MacKay</snm>
                  <fnm>DJC</fnm>
               </au>
            </aug>
            <publisher>Cambridge, UK: Cambridge University Press</publisher>
            <pubdate>2003</pubdate>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Hierarchical Dirichlet Processes</p>
            </title>
            <aug>
               <au>
                  <snm>Teh</snm>
                  <fnm>YW</fnm>
               </au>
               <au>
                  <snm>Jordan</snm>
                  <fnm>MI</fnm>
               </au>
               <au>
                  <snm>Beal</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Blei</snm>
                  <fnm>DM</fnm>
               </au>
            </aug>
            <source>Advances in Neural Information Processing Systems (NIPS) 17: 2005</source>
            <pubdate>2005</pubdate>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Dirichlet enhanced latent semantic analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Yu</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Yu</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Tresp</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <source>Workshop on Artificial Intelligence and Statistics AISTATS 2005</source>
            <pubdate>2005</pubdate>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Variational methods for the Dirichlet process</p>
            </title>
            <aug>
               <au>
                  <snm>Blei</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Jordan</snm>
                  <fnm>MI</fnm>
               </au>
            </aug>
            <source>Proceedings of the 21st International Conference on Machine Learning (ICML): 2004</source>
            <pubdate>2004</pubdate>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Indexing by latent semantic analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Deerwester</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Dumais</snm>
                  <fnm>ST</fnm>
               </au>
               <au>
                  <snm>Landauer</snm>
                  <fnm>TK</fnm>
               </au>
               <au>
                  <snm>Furnas</snm>
                  <fnm>GW</fnm>
               </au>
               <au>
                  <snm>Harshman</snm>
                  <fnm>RA</fnm>
               </au>
            </aug>
            <source>J Am Soc Inf Sci</source>
            <pubdate>1990</pubdate>
            <volume>41</volume>
            <fpage>391</fpage>
            <lpage>407</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1002/(SICI)1097-4571(199009)41:6&lt;391::AID-ASI1>3.0.CO;2-9</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Matrices, vector spaces, and information retrieval</p>
            </title>
            <aug>
               <au>
                  <snm>Berry</snm>
                  <fnm>MW</fnm>
               </au>
               <au>
                  <snm>Drmac</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Jessup</snm>
                  <fnm>ER</fnm>
               </au>
            </aug>
            <source>SIAM Review</source>
            <pubdate>1999</pubdate>
            <volume>41</volume>
            <issue>2</issue>
            <fpage>335</fpage>
            <lpage>362</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1137/S0036144598347035</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>A Probabilistic Model for Latent Semantic Indexing</p>
            </title>
            <aug>
               <au>
                  <snm>Ding</snm>
                  <fnm>CHQ</fnm>
               </au>
            </aug>
            <source>J Am Soc Inf Sci Tech</source>
            <pubdate>2005</pubdate>
            <volume>56</volume>
         </bibl>
         <bibl id="B25">
            <title>
               <p>An algorithm for suffix stripping</p>
            </title>
            <aug>
               <au>
                  <snm>Porter</snm>
                  <fnm>MF</fnm>
               </au>
            </aug>
            <source>Program</source>
            <pubdate>1980</pubdate>
            <volume>14</volume>
            <issue>3</issue>
            <fpage>130</fpage>
            <lpage>137</lpage>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Probabilistic Latent Semantic Indexing</p>
            </title>
            <aug>
               <au>
                  <snm>Hofmann</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>The 22nd International Conference on Research and Development in Information Retrieval (SIGIR'99)</source>
            <pubdate>1999</pubdate>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Operations for learning with graphical models</p>
            </title>
            <aug>
               <au>
                  <snm>Buntine</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>Journal of Artificial Intelligence Research</source>
            <pubdate>1994</pubdate>
            <volume>3</volume>
            <fpage>993</fpage>
         </bibl>
         <bibl id="B28">
            <title>
               <p>An Introduction to MCMC for Machine Learning</p>
            </title>
            <aug>
               <au>
                  <snm>Andrieu</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>de Freitas</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Doucet</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Jordan</snm>
                  <fnm>MI</fnm>
               </au>
            </aug>
            <source>Machine Learning</source>
            <pubdate>2003</pubdate>
            <volume>50</volume>
            <issue>1&#8211;2</issue>
            <fpage>5</fpage>
            <lpage>43</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1023/A:1020281327116</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Bayes Factors</p>
            </title>
            <aug>
               <au>
                  <snm>Kass</snm>
                  <fnm>RE</fnm>
               </au>
               <au>
                  <snm>Raftery</snm>
                  <fnm>AE</fnm>
               </au>
            </aug>
            <source>J Am Stat Assoc</source>
            <pubdate>1995</pubdate>
            <volume>90</volume>
            <fpage>773</fpage>
            <lpage>795</lpage>
            <xrefbib>
               <pubid idtype="doi">10.2307/2291091</pubid>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
