<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-9-205</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Software</dochead>
      <bibl>
         <title>
            <p>GAPscreener: An automatic tool for screening human genetic association literature in PubMed using the support vector machine technique</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Yu</snm>
               <fnm>Wei</fnm>
               <insr iid="I1"/>
               <email>WYu@cdc.gov</email>
            </au>
            <au id="A2">
               <snm>Clyne</snm>
               <fnm>Melinda</fnm>
               <insr iid="I1"/>
               <email>MClyne@cdc.gov</email>
            </au>
            <au id="A3">
               <snm>Dolan</snm>
               <mi>M</mi>
               <fnm>Siobhan</fnm>
               <insr iid="I2"/>
               <email>siobhanmdolan@yahoo.com</email>
            </au>
            <au id="A4">
               <snm>Yesupriya</snm>
               <fnm>Ajay</fnm>
               <insr iid="I1"/>
               <email>AYesupriya@cdc.gov</email>
            </au>
            <au id="A5">
               <snm>Wulf</snm>
               <fnm>Anja</fnm>
               <insr iid="I1"/>
               <email>AWulf@cdc.gov</email>
            </au>
            <au id="A6">
               <snm>Liu</snm>
               <fnm>Tiebin</fnm>
               <insr iid="I1"/>
               <email>TLiu@cdc.gov</email>
            </au>
            <au id="A7">
               <snm>Khoury</snm>
               <mi>J</mi>
               <fnm>Muin</fnm>
               <insr iid="I1"/>
               <email>MKhoury@cdc.gov</email>
            </au>
            <au id="A8">
               <snm>Gwinn</snm>
               <fnm>Marta</fnm>
               <insr iid="I1"/>
               <email>MGwinn@cdc.gov</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>National Office of Public Health Genomics, Coordinating Center for Health Promotion, Centers for Disease Control and Prevention, Atlanta, GA, USA</p>
            </ins>
            <ins id="I2">
               <p>Albert Einstein College of Medicine/Montefiore Medical Center, Bronx, NY, USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2008</pubdate>
         <volume>9</volume>
         <issue>1</issue>
         <fpage>205</fpage>
         <url>http://www.biomedcentral.com/1471-2105/9/205</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">18430222</pubid>
               <pubid idtype="doi">10.1186/1471-2105-9-205</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>07</day>
               <month>12</month>
               <year>2007</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>22</day>
               <month>4</month>
               <year>2008</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>22</day>
               <month>4</month>
               <year>2008</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2008</year>
         <collab>Yu et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Synthesis of data from published human genetic association studies is a critical step in the translation of human genome discoveries into health applications. Although genetic association studies account for a substantial proportion of the abstracts in PubMed, identifying them with standard queries is not always accurate or efficient. Further automating the literature-screening process can reduce the burden of a labor-intensive and time-consuming traditional literature search. The Support Vector Machine (SVM), a well-established machine learning technique, has been successful in classifying text, including biomedical literature. The GAPscreener, a free SVM-based software tool, can be used to assist in screening PubMed abstracts for human genetic association studies.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>The data source for this research was the HuGE Navigator, formerly known as the HuGE Pub Lit database. Weighted SVM feature selection based on a keyword list obtained by the two-way z score method demonstrated the best screening performance, achieving 97.5% recall, 98.3% specificity and 31.9% precision in performance testing. Compared with the traditional screening process based on a complex PubMed query, the SVM tool reduced the number of abstracts that the database curator had to review individually by about 90%. The tool also ascertained 47 articles that were missed by the traditional literature screening process during the 4-week test period. We examined the literature on genetic associations with preterm birth as an example. Compared with the traditional, manual process, the GAPscreener both reduced effort and improved accuracy.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>GAPscreener is the first free SVM-based application available for screening the human genetic association literature in PubMed with high recall and specificity. The user-friendly graphical user interface makes this a practical, stand-alone application. The software can be downloaded at no charge.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>The peer-reviewed scientific literature is a major source of information for developing research hypotheses and creating new knowledge through synthesis of research findings <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. The information explosion in biomedical science has created a huge challenge for researchers who need to obtain useful information promptly and efficiently. Human genetic association studies epitomize this challenge because they have proliferated rapidly since completion of the Human Genome Project <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. Systematic review and meta-analysis have become important approaches for evaluating the robustness of such associations across different study platforms and populations <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. A key factor in the quality of a systematic review is complete capture of the relevant studies <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. Many databases that store genetic association information, including citations from PubMed, have been built and curated <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>. PubMed <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> is the largest publicly accessible biomedical literature database and is the main source for such activities. However, because of its large size and the complex syntax required for query formation, searching PubMed comprehensively and efficiently for genetic association studies is difficult. The necessarily labor-intensive screening and curation process makes the maintenance of such databases extremely challenging.</p>
         <p>Automatic literature classification is becoming increasingly attractive and has already demonstrated some successes in the biomedical literature <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp>. The support vector machine (SVM) method <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> is a powerful machine learning technique that has been used to solve classification problems <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr></abbrgrp>. An earlier report described a potential application of SVM methods to classify literature on human genome epidemiology <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. In this paper, we report a novel method for feature selection and show that using it to train the SVM model significantly improved its ability to classify reports of human genetic association studies. We implemented the method as a Java-based application named GAPscreener (<b>G</b>enetic <b>A</b>ssociation <b>P</b>ublication screener) that can be freely downloaded <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>.</p>
      </sec>
      <sec>
         <st>
            <p>Implementation</p>
         </st>
         <sec>
            <st>
               <p>SVM Model Generation</p>
            </st>
            <sec>
               <st>
                  <p>Data sources</p>
               </st>
               <p>To generate the training dataset for the SVM experiment, we used 10,000 randomly selected abstracts from articles published between 2001 and 2006 in PubMed as a background dataset. The positive dataset consisted of 10,000 randomly selected gene-disease association articles from the HuGE Navigator <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> (formerly known as the HuGE Pub Lit database <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>), a continuously updated database of studies relevant to human genome epidemiology sponsored by the National Office of Public Health Genomics. The inclusion and exclusion criteria for the positive dataset from the HuGE Pub Lit database have been reported <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>.</p>
            </sec>
            <sec>
               <st>
                  <p>PubMed abstract text retrieval</p>
               </st>
               <p>We developed a PubMed text extraction tool using the NCBI E-utility <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> to retrieve text content based on PubMed identification numbers (PMIDs). The text used for processing consisted of the title and the abstract, or the title alone if the abstract was not available. The text data were stored in a data structure for processing.</p>
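               <p>As an illustration only, the retrieval step can be sketched in Python; the authors' extraction tool is not shown in this article. The sketch builds a request URL for the NCBI EFetch E-utility and applies the title-plus-abstract rule described above. The function names build_efetch_url and text_for_processing are hypothetical, and no network request is made.</p>

```python
from urllib.parse import urlencode

# Base address of the NCBI EFetch E-utility.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmids):
    """Return an EFetch URL requesting PubMed records as XML for the given PMIDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(pmid) for pmid in pmids),
        "retmode": "xml",
    }
    return EFETCH_URL + "?" + urlencode(params)


def text_for_processing(title, abstract=None):
    """Join title and abstract, or fall back to the title alone."""
    if abstract and abstract.strip():
        return title.strip() + " " + abstract.strip()
    return title.strip()
```

               <p>In this sketch a record without an abstract contributes only its title, matching the rule stated above; parsing of the returned XML is left out.</p>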
            </sec>
            <sec>
               <st>
                  <p>Text processing and extraction of keywords</p>
               </st>
               <p>The abstract and title of each article were then processed with the text-processing tool we developed. A stemming technique was used to deal with morphologic word changes; for example, polymorph(isms) and polymorph(ic) were considered the same word. A stop word list of common English words, such as pronouns and articles, was used to reduce the number of words extracted.</p>
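               <p>A minimal Python sketch of this step follows. It assumes a crude suffix-stripping rule and a short stop word list, both invented for illustration; the stemmer and stop word list actually used are not specified in this article, and the names STOP_WORDS, SUFFIXES, stem and extract_words are hypothetical.</p>

```python
import re

# Illustrative stop words only; the authors' list is not reproduced in the article.
STOP_WORDS = {"a", "an", "the", "and", "of", "in", "it", "we", "they", "is", "are", "with"}

# Suffixes tried in order (longest first); a crude stand-in for a real stemmer.
SUFFIXES = ("isms", "ism", "ic", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least four leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: len(word) - len(suffix)]
    return word


def extract_words(text):
    """Lowercase, tokenize on letters, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]
```

               <p>With these rules, both "polymorphisms" and "polymorphic" reduce to "polymorph", so they are counted as the same word.</p>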
            </sec>
            <sec>
               <st>
                  <p>Significant keyword generation</p>
               </st>
               <p>We selected keywords by identifying statistically significant differences between the probability of their occurrence in the text (title and abstract) of human genetic association articles and the probability of their occurrence in all other articles. The sample sizes of both groups were large enough that the distribution of differences in probabilities could be approximated by a normal distribution. Thus, words with a z score greater than 1.96 or less than &#8722;1.96 (significance level of &#945; = 0.05) were chosen as feature keywords.</p>
               <p>The statistical formula <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> used for calculating the z score is given by:</p>
               <p>
                  <display-formula>
                     <m:math name="1471-2105-9-205-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:mtable>
                                 <m:mtr>
                                    <m:mtd>
                                       <m:mrow>
                                          <m:mtext>Z</m:mtext>
                                          <m:mo>=</m:mo>
                                          <m:mfrac>
                                             <m:mrow>
                                                <m:msub>
                                                   <m:mtext>p</m:mtext>
                                                   <m:mn>1</m:mn>
                                                </m:msub>
                                                <m:mo>&#8722;</m:mo>
                                                <m:msub>
                                                   <m:mtext>p</m:mtext>
                                                   <m:mn>2</m:mn>
                                                </m:msub>
                                             </m:mrow>
                                             <m:mrow>
                                                <m:msqrt>
                                                   <m:mrow>
                                                      <m:mtext>pq</m:mtext>
                                                      <m:mo stretchy="false">(</m:mo>
                                                      <m:mfrac>
                                                         <m:mn>1</m:mn>
                                                         <m:mrow>
                                                            <m:msub>
                                                               <m:mtext>n</m:mtext>
                                                               <m:mn>1</m:mn>
                                                            </m:msub>
                                                         </m:mrow>
                                                      </m:mfrac>
                                                      <m:mo>+</m:mo>
                                                      <m:mfrac>
                                                         <m:mn>1</m:mn>
                                                         <m:mrow>
                                                            <m:msub>
                                                               <m:mtext>n</m:mtext>
                                                               <m:mn>2</m:mn>
                                                            </m:msub>
                                                         </m:mrow>
                                                      </m:mfrac>
                                                      <m:mo stretchy="false">)</m:mo>
                                                   </m:mrow>
                                                </m:msqrt>
                                             </m:mrow>
                                          </m:mfrac>
                                          <m:mtext>&#160;if&#160;</m:mtext>
                                          <m:mo stretchy="false">(</m:mo>
                                          <m:msub>
                                             <m:mtext>n</m:mtext>
                                             <m:mn>1</m:mn>
                                          </m:msub>
                                          <m:mtext>pq</m:mtext>
                                          <m:mo>></m:mo>
                                          <m:mn>5</m:mn>
                                          <m:mo stretchy="false">)</m:mo>
                                          <m:mo>&amp;</m:mo>
                                          <m:mo stretchy="false">(</m:mo>
                                          <m:msub>
                                             <m:mtext>n</m:mtext>
                                             <m:mn>2</m:mn>
                                          </m:msub>
                                          <m:mtext>pq</m:mtext>
                                          <m:mo>></m:mo>
                                          <m:mn>5</m:mn>
                                          <m:mo stretchy="false">)</m:mo>
                                       </m:mrow>
                                    </m:mtd>
                                 </m:mtr>
                                 <m:mtr>
                                    <m:mtd>
                                       <m:mrow>
                                          <m:mtable>
                                             <m:mtr>
                                                <m:mtd>
                                                   <m:mrow>
                                                      <m:mtext>p</m:mtext>
                                                      <m:mo>=</m:mo>
                                                      <m:mfrac>
                                                         <m:mrow>
                                                            <m:msub>
                                                               <m:mtext>n</m:mtext>
                                                               <m:mn>1</m:mn>
                                                            </m:msub>
                                                            <m:msub>
                                                               <m:mtext>p</m:mtext>
                                                               <m:mn>1</m:mn>
                                                            </m:msub>
                                                            <m:mo>+</m:mo>
                                                            <m:msub>
                                                               <m:mtext>n</m:mtext>
                                                               <m:mn>2</m:mn>
                                                            </m:msub>
                                                            <m:msub>
                                                               <m:mtext>p</m:mtext>
                                                               <m:mn>2</m:mn>
                                                            </m:msub>
                                                         </m:mrow>
                                                         <m:mrow>
                                                            <m:msub>
                                                               <m:mtext>n</m:mtext>
                                                               <m:mn>1</m:mn>
                                                            </m:msub>
                                                            <m:mo>+</m:mo>
                                                            <m:msub>
                                                               <m:mtext>n</m:mtext>
                                                               <m:mn>2</m:mn>
                                                            </m:msub>
                                                         </m:mrow>
                                                      </m:mfrac>
                                                   </m:mrow>
                                                </m:mtd>
                                                <m:mtd>
                                                   <m:mrow>
                                                      <m:mtext>q</m:mtext>
                                                      <m:mo>=</m:mo>
                                                      <m:mn>1</m:mn>
                                                      <m:mo>&#8722;</m:mo>
                                                      <m:mtext>p</m:mtext>
                                                   </m:mrow>
                                                </m:mtd>
                                             </m:mtr>
                                          </m:mtable>
                                       </m:mrow>
                                    </m:mtd>
                                 </m:mtr>
                              </m:mtable>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeqabiqaaaqaaiabbQfaAjabg2da9KqbaoaalaaabaGaeeiCaa3aaSbaaeaacqaIXaqmaeqaaiabgkHiTiabbchaWnaaBaaabaGaeGOmaidabeaaaeaadaGcaaqaaiabbchaWjabbghaXjabcIcaOmaalaaabaGaeGymaedabaGaeeOBa42aaSbaaeaacqaIXaqmaeqaaaaacqGHRaWkdaWcaaqaaiabigdaXaqaaiabb6gaUnaaBaaabaGaeGOmaidabeaaaaGaeiykaKcabeaaaaGaeeiiaaIccqqGPbqAcqqGMbGzcqqGGaaicqGGOaakcqqGUbGBdaWgaaWcbaGaeGymaedabeaakiabbchaWjabbghaXjabg6da+iabiwda1iabcAcaMiabcIcaOiabb6gaUnaaBaaaleaacqaIYaGmaeqaaOGaeeiCaaNaeeyCaeNaeyOpa4JaeGynauJaeiykaKcabaqbaeqabeGaaaqaaiabbchaWjabg2da9KqbaoaalaaabaGaeeOBa42aaSbaaeaacqaIXaqmaeqaaiabbchaWnaaBaaabaGaeGymaedabeaacqGHRaWkcqqGUbGBdaWgaaqaaiabikdaYaqabaGaeeiCaa3aaSbaaeaacqaIYaGmaeqaaaqaaiabb6gaUnaaBaaabaGaeGymaedabeaacqGHRaWkcqqGUbGBdaWgaaqaaiabikdaYaqabaaaaaGcbaGaeeyCaeNaeyypa0JaeGymaeJaeyOeI0IaeeiCaahaaaaaaaa@7101@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>where:</p>
               <p>p<sub>1</sub> = probability of occurrence of the word in genetic association abstracts.</p>
               <p>p<sub>2</sub> = probability of occurrence of the word in non-genetic association abstracts.</p>
               <p>n<sub>1</sub> = total occurrences of the word in genetic association abstracts.</p>
               <p>n<sub>2</sub> = total occurrences of the word in non-genetic association abstracts.</p>
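               <p>The formula can be sketched in Python as follows. This is an illustration rather than the authors' code: the function names are hypothetical, the values of p1, p2, n1 and n2 are supplied by the caller as defined above, and returning None when the normal-approximation condition fails is an assumption of the sketch.</p>

```python
import math


def two_proportion_z(p1, p2, n1, n2):
    """Pooled two-proportion z score, following the formula in the text.

    Returns None when the normal-approximation condition
    (n1*p*q > 5 and n2*p*q > 5) does not hold.
    """
    p = (n1 * p1 + n2 * p2) / (n1 + n2)  # pooled proportion
    q = 1.0 - p
    if not (n1 * p * q > 5 and n2 * p * q > 5):
        return None
    return (p1 - p2) / math.sqrt(p * q * (1.0 / n1 + 1.0 / n2))


def is_significant(z, threshold=1.96):
    """True when z lies beyond the two-sided threshold (alpha = 0.05)."""
    return z is not None and abs(z) > threshold
```

               <p>For example, with illustrative inputs p1 = 0.30, p2 = 0.10 and n1 = n2 = 10,000, the pooled proportion is 0.20 and the z score is about 35.4, far beyond the 1.96 threshold.</p>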
            </sec>
            <sec>
               <st>
                  <p>Generating SVM input data</p>
               </st>
               <p>The statistically significant keywords are called feature keywords and were used to construct the SVM features. Each feature keyword was weighted according to its z score, normalized to values from -1 to +1. For the training and testing data sets, a script generated the SVM input in the sparse format used by LibSVM <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. The presence of each keyword was represented by its position on the feature keyword list, followed by a colon and the normalized z score; absent keywords were omitted, and features were separated by spaces, for example, 1:0.003589 30:-0.81189. In the training data set, the first column of the input data was set to the known outcome, i.e., 1 for positive, -1 for negative. In the test set, the first column of the input dataset was set to 0.</p>
               <p>Two sets of significant keywords were generated. One set contained only keywords with positive z scores above the threshold of 1.96 (the one-way weighted scheme); the other contained keywords with either positive z scores (greater than 1.96) or negative z scores (less than -1.96) (the two-way weighted scheme).</p>
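               <p>The input format described in this section can be sketched in Python as follows. The sketch is illustrative only: the function names are hypothetical, and dividing by the largest absolute z score is an assumed normalization, since the article does not state how z scores were scaled to the range -1 to +1.</p>

```python
def normalize_z_scores(z_scores):
    """Scale z scores into [-1, 1] by dividing by the largest absolute value.

    Assumption: the article does not state how normalization was done.
    """
    largest = max(abs(z) for z in z_scores)
    return [z / largest for z in z_scores]


def to_sparse_line(label, present_indices, weights):
    """Format one abstract as 'label index:weight ...' in LibSVM sparse format.

    present_indices are 1-based positions on the feature keyword list;
    weights[i - 1] is the normalized z score of keyword i.
    """
    parts = [str(label)]
    for index in sorted(set(present_indices)):
        parts.append("%d:%g" % (index, weights[index - 1]))
    return " ".join(parts)
```

               <p>Indices are written in ascending order, as the LibSVM sparse format expects. A label of 1 or -1 marks a training example, and 0 marks a test example, as described above.</p>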
            </sec>
            <sec>
               <st>
                  <p>SVM model training</p>
               </st>
               <p>We used LibSVM <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>, a freely available SVM software library, to train the SVM model. The accompanying utility, grid.py, was used to find optimal values of the penalty parameter C and of gamma for the radial basis function (RBF) kernel. The RBF kernel was chosen for its expected performance <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>.</p>
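               <p>To make the two tuned parameters concrete, the following Python sketch shows the RBF kernel on sparse vectors and an exponential grid of candidate (C, gamma) pairs. It is not LibSVM code. The function names are hypothetical, and the default exponent ranges are assumed to mirror the defaults of grid.py; the ranges actually searched are not reported in this article.</p>

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2) for sparse vectors stored as dicts."""
    squared_distance = 0.0
    for index in set(x) | set(y):
        diff = x.get(index, 0.0) - y.get(index, 0.0)
        squared_distance += diff * diff
    return math.exp(-gamma * squared_distance)


def parameter_grid(log2c=(-5, 15, 2), log2g=(3, -15, -2)):
    """List (C, gamma) pairs on an exponential grid.

    The default ranges are assumed to mirror grid.py's defaults; the
    ranges the authors searched are not given in the article.
    """
    c_begin, c_end, c_step = log2c
    g_begin, g_end, g_step = log2g
    c_values = [2.0 ** e for e in range(c_begin, c_end + 1, c_step)]
    g_values = [2.0 ** e for e in range(g_begin, g_end - 1, g_step)]
    return [(c, g) for c in c_values for g in g_values]
```

               <p>In a grid search, each candidate pair is typically scored by cross-validation on the training set, and the best-scoring pair is used to train the final model.</p>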
            </sec>
         </sec>
         <sec>
            <st>
               <p>Stand-alone Application Implementation</p>
            </st>
            <p>GAPscreener is a stand-alone application built with the Java programming language. Java Swing <abbrgrp><abbr bid="B24">24</abbr></abbrgrp> components were used to build the graphical user interface (GUI). The application incorporates the open-source LibSVM Java code for prediction, employing the SVM model we trained. Java-based Web services in the NCBI E-utility were used to query and retrieve PubMed records. EzInstall <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>, a freeware application, was used to package the application with a Java Runtime Environment (JRE) for automatic, self-contained installation.</p>
         </sec>
         <sec>
            <st>
               <p>Performance Evaluation</p>
            </st>
            <sec>
               <st>
                  <p>General performance evaluation</p>
               </st>
               <p>To evaluate the performance of the screening tool, we used a series of new test data (not included in the training set). The first test data set (92,253 negatives, 773 positives) consisted of selections from PubMed during five consecutive weeks (February 22, 2007 to March 22, 2007) according to the routine, traditional screening process used to build the HuGE Navigator <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. Positive or negative status assigned by the routine process was considered the gold standard. We used this data set to evaluate two keyword weighting schemes. A second data set (68,255 negatives, 597 positives), selected from PubMed during four subsequent weeks (April 5, 2007 to April 26, 2007), was used to evaluate false-positive results generated by the GAPscreener using the selected weighting scheme.</p>
               <p>Recall, specificity and precision were calculated from the test data to evaluate the performance of the application. The formulas for calculating these parameters are as follows:</p>
               <p>
                  <display-formula>
                     <m:math name="1471-2105-9-205-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:mtable>
                                 <m:mtr>
                                    <m:mtd>
                                       <m:mrow>
                                          <m:mtext>Recall</m:mtext>
                                          <m:mo>=</m:mo>
                                          <m:mfrac>
                                             <m:mrow>
                                                <m:mi>T</m:mi>
                                                <m:mi>P</m:mi>
                                             </m:mrow>
                                             <m:mrow>
                                                <m:mi>T</m:mi>
                                                <m:mi>P</m:mi>
                                                <m:mo>+</m:mo>
                                                <m:mi>F</m:mi>
                                                <m:mi>N</m:mi>
                                             </m:mrow>
                                          </m:mfrac>
                                       </m:mrow>
                                    </m:mtd>
                                 </m:mtr>
                                 <m:mtr>
                                    <m:mtd>
                                       <m:mrow>
                                          <m:mtext>Precision</m:mtext>
                                          <m:mo>=</m:mo>
                                          <m:mfrac>
                                             <m:mrow>
                                                <m:mi>T</m:mi>
                                                <m:mi>P</m:mi>
                                             </m:mrow>
                                             <m:mrow>
                                                <m:mi>T</m:mi>
                                                <m:mi>P</m:mi>
                                                <m:mo>+</m:mo>
                                                <m:mi>F</m:mi>
                                                <m:mi>P</m:mi>
                                             </m:mrow>
                                          </m:mfrac>
                                       </m:mrow>
                                    </m:mtd>
                                 </m:mtr>
                                 <m:mtr>
                                    <m:mtd>
                                       <m:mrow>
                                          <m:mtext>Specificity</m:mtext>
                                          <m:mo>=</m:mo>
                                          <m:mfrac>
                                             <m:mrow>
                                                <m:mi>T</m:mi>
                                                <m:mi>N</m:mi>
                                             </m:mrow>
                                             <m:mrow>
                                                <m:mi>T</m:mi>
                                                <m:mi>N</m:mi>
                                                <m:mo>+</m:mo>
                                                <m:mi>F</m:mi>
                                                <m:mi>P</m:mi>
                                             </m:mrow>
                                          </m:mfrac>
                                       </m:mrow>
                                    </m:mtd>
                                 </m:mtr>
                              </m:mtable>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeqabmqaaaqaaiGbckfasjabcwgaLjabdogaJjabdggaHjabdYgaSjabdYgaSjabg2da9KqbaoaalaaabaGaemivaqLaemiuaafabaGaemivaqLaemiuaaLaey4kaSIaemOrayKaemOta4eaaaGcbaGagiiuaaLaeiOCaiNaemyzauMaem4yamMaemyAaKMaem4CamNaem4Ba8MaemOBa4Maeyypa0tcfa4aaSaaaeaacqWGubavcqWGqbauaeaacqWGubavcqWGqbaucqGHRaWkcqWGgbGrcqWGqbauaaaakeaacqWGtbWucqWGWbaCcqWGLbqzcqWGJbWycqWGPbqAcqWGMbGzcqWGPbqAcqWGJbWycqWGPbqAcqWG0baDcqWG5bqEcqGH9aqpjuaGdaWcaaqaaiabdsfaujabd6eaobqaaiabdsfaujabd6eaojabgUcaRiabdAeagjabdcfaqbaaaaaaaa@6A27@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>where TP, TN, FP and FN represent the number of true positive, true negative, false positive and false negative results, respectively.</p>
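               <p>As an illustration only, the three measures can be computed directly from the four counts. The sketch below assumes the standard definitions (recall = TP/(TP + FN), precision = TP/(TP + FP), specificity = TN/(TN + FP)); the function name and the example counts are hypothetical and are not taken from the study data.</p>
               <p><![CDATA[
```python
def screening_metrics(tp, fp, tn, fn):
    """Return recall, precision and specificity from confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),        # share of true positives that were retrieved
        "precision": tp / (tp + fp),     # share of retrieved abstracts that are positive
        "specificity": tn / (tn + fp),   # share of true negatives correctly rejected
    }


# Hypothetical counts for one weekly batch (illustrative, not from the paper).
metrics = screening_metrics(tp=95, fp=190, tn=9710, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```
]]></p>
               <p>With a rare positive class, as in this hypothetical batch, precision can be low even when recall and specificity are both high, because the false positives are drawn from a much larger pool of negatives.</p>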
               <p>To compare the results of classification by the SVM tool with the gold standard, we used logistic regression (SAS Version 9.13, SAS Institute, Cary, NC). We fitted separate logistic regression models to the results of the one-way and two-way SVM schemes from the 5-week experiment (February 22, 2007 to March 28, 2007). Results from each model were used to generate receiver operating characteristic (ROC) curves and to calculate the area under the curve (AUC) with 95% confidence intervals. The AUCs of the ROC curves for the two models were compared using nonparametric methods <abbrgrp><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr></abbrgrp>.</p>
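               <p>For readers unfamiliar with the nonparametric view of the AUC, the following minimal sketch computes it as the Mann-Whitney statistic: the probability that a randomly chosen positive abstract receives a higher score than a randomly chosen negative one, with ties counted as one half. It is not the SAS procedure used in this study, and the function name and scores are hypothetical.</p>
               <p><![CDATA[
```python
def auc_mann_whitney(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5. This is the O(n*m) textbook form, fine for small sets.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores (illustrative only).
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.2, 0.4]
auc = auc_mann_whitney(positives, negatives)
print(auc)
```
]]></p>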
            </sec>
            <sec>
               <st>
                  <p>Domain-specific performance evaluation</p>
               </st>
               <p>A list of articles compiled independently by domain experts was used as the gold standard to evaluate the predictive accuracy of the application. A network of eight experts in the analysis of genetic associations with preterm birth performed a comprehensive literature search to build a knowledge base for systematic review and meta-analysis. The search was limited to articles published from January 1, 1990, to April 12, 2007. Complex queries compiled by a librarian were used to query PubMed and EMBASE <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. The complex queries consisted of sophisticated PubMed and EMBASE syntax filling more than four single-spaced pages. The results were manually reviewed by the domain experts.</p>
               <p>For comparison, we used the GAPscreener to screen all PubMed abstracts published during the same period of time in a two-step process. First, we compiled a broad PubMed query based on common terms related to preterm birth. The 42,585 PubMed abstracts returned by this query were then classified by the SVM tool.</p>
               <p>Query: Prematurity OR infant, premature OR infant, low birth weight OR labor, premature OR preterm labour OR premature birth OR preterm birth OR preterm infant OR preterm premature rupture OR preterm pregnancy outcome OR preterm delivery OR adverse outcomes of pregnancy OR obstetric labor, premature.</p>
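               <p>The broad first-step query could be submitted programmatically through the NCBI E-utilities ESearch service. The sketch below only builds the request URL and does not contact NCBI; how GAPscreener itself issues the request is not described here, so the helper name, the choice of Entrez date as the date type, and the retmax value are assumptions.</p>
               <p><![CDATA[
```python
from urllib.parse import urlencode, urlsplit, parse_qs

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Terms copied from the preterm birth query given in the text.
PRETERM_TERMS = [
    "Prematurity", "infant, premature", "infant, low birth weight",
    "labor, premature", "preterm labour", "premature birth",
    "preterm birth", "preterm infant", "preterm premature rupture",
    "preterm pregnancy outcome", "preterm delivery",
    "adverse outcomes of pregnancy", "obstetric labor, premature",
]


def build_esearch_url(terms, mindate, maxdate, retmax=100):
    """Build (but do not send) an ESearch request for a date-limited OR query."""
    params = {
        "db": "pubmed",
        "term": " OR ".join(terms),
        "datetype": "edat",      # assumption: filter on Entrez date
        "mindate": mindate,      # YYYY/MM/DD
        "maxdate": maxdate,
        "retmax": retmax,        # assumption: page size chosen for illustration
    }
    return ESEARCH + "?" + urlencode(params)


url = build_esearch_url(PRETERM_TERMS, "1990/01/01", "2007/04/12")
print(url)
```
]]></p>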
            </sec>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>SVM feature selection</p>
            </st>
            <p>We generated a list of significant keywords using the z score method, based on comparing their relative frequencies in 10,000 general PubMed abstracts and 10,000 gene-disease association abstracts included in the HuGE Pub Lit database. The one-way and two-way weighted schemes generated lists of 1,301 and 4,589 keywords, respectively. Normalized z scores between -1 and 1 were used as weighting parameters for each keyword.</p>
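            <p>The selection step can be sketched as follows. This is a minimal illustration, not the GAPscreener source: it assumes a two-proportion z statistic on the number of abstracts containing each keyword, and it assumes normalization by dividing by the largest absolute z score among the selected keywords. The function names and the example counts are hypothetical; only the 1.96 threshold and the one-way/two-way distinction come from the text.</p>
            <p><![CDATA[
```python
import math


def two_proportion_z(pos_hits, pos_total, neg_hits, neg_total):
    """z statistic for the difference in a keyword's document frequency."""
    p_pos = pos_hits / pos_total
    p_neg = neg_hits / neg_total
    pooled = (pos_hits + neg_hits) / (pos_total + neg_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / pos_total + 1 / neg_total))
    if se == 0:
        return 0.0
    return (p_pos - p_neg) / se


def select_keywords(doc_freq, pos_total, neg_total, threshold=1.96, two_way=True):
    """Map each selected keyword to a weight in [-1, 1].

    doc_freq maps keyword -> (abstracts containing it in the positive set,
                              abstracts containing it in the general set).
    """
    z = {w: two_proportion_z(p, pos_total, n, neg_total)
         for w, (p, n) in doc_freq.items()}
    if two_way:
        kept = {w: s for w, s in z.items() if abs(s) > threshold}
    else:
        kept = {w: s for w, s in z.items() if s > threshold}
    if not kept:
        return {}
    scale = max(abs(s) for s in kept.values())  # assumed normalization
    return {w: s / scale for w, s in kept.items()}


# Hypothetical document frequencies out of 10,000 abstracts per set.
counts = {
    "polymorphism": (3000, 200),
    "surgery": (100, 1500),
    "patients": (4000, 4050),
}
two_way_weights = select_keywords(counts, 10000, 10000, two_way=True)
one_way_weights = select_keywords(counts, 10000, 10000, two_way=False)
print(two_way_weights)
print(one_way_weights)
```
]]></p>
            <p>In this toy example the two-way scheme keeps a negatively weighted keyword ("surgery") that the one-way scheme discards, while a keyword that is equally common in both sets ("patients") is dropped by both.</p>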
            <p>The two-way weighted scheme (using keywords with positive and negative z scores) performed better than the one-way scheme in terms of recall, specificity and precision (Table <tblr tid="T1">1</tblr>). The AUC for the two-way scheme was significantly larger than for the one-way scheme (p &lt; 0.0001).</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Performance test results comparing SVM results with known classification in test set (data selected from PubMed during five consecutive weeks from Feb 22, 2007 to March 28, 2007)</p>
               </caption>
               <tblbdy cols="9">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Test Parameters</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>22-Feb-07</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>1-Mar-07</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>8-Mar-07</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>15-Mar-07</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>22-Mar-07</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>ROC area (95% CI)</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b><it>p </it>value</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>One Way</p>
                     </c>
                     <c ca="left">
                        <p>Recall</p>
                     </c>
                     <c ca="center">
                        <p>0.946</p>
                     </c>
                     <c ca="center">
                        <p>0.968</p>
                     </c>
                     <c ca="center">
                        <p>0.951</p>
                     </c>
                     <c ca="center">
                        <p>0.965</p>
                     </c>
                     <c ca="center">
                        <p>0.951</p>
                     </c>
                     <c ca="center">
                        <p>0.967</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Precision</p>
                     </c>
                     <c ca="center">
                        <p>0.345</p>
                     </c>
                     <c ca="center">
                        <p>0.297</p>
                     </c>
                     <c ca="center">
                        <p>0.265</p>
                     </c>
                     <c ca="center">
                        <p>0.298</p>
                     </c>
                     <c ca="center">
                        <p>0.265</p>
                     </c>
                     <c ca="center">
                        <p>(0.958&#8211;0.975)</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Specificity</p>
                     </c>
                     <c ca="center">
                        <p>0.981</p>
                     </c>
                     <c ca="center">
                        <p>0.981</p>
                     </c>
                     <c ca="center">
                        <p>0.980</p>
                     </c>
                     <c ca="center">
                        <p>0.981</p>
                     </c>
                     <c ca="center">
                        <p>0.980</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c cspan="8">
                        <hr/>
                     </c>
                     <c ca="center">
                        <p>&lt; 0.0001</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Two Way</p>
                     </c>
                     <c ca="left">
                        <p>Recall</p>
                     </c>
                     <c ca="center">
                        <p>0.946</p>
                     </c>
                     <c ca="center">
                        <p>0.992</p>
                     </c>
                     <c ca="center">
                        <p>0.967</p>
                     </c>
                     <c ca="center">
                        <p>0.977</p>
                     </c>
                     <c ca="center">
                        <p>0.993</p>
                     </c>
                     <c ca="center">
                        <p>0.982</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Precision</p>
                     </c>
                     <c ca="center">
                        <p>0.345</p>
                     </c>
                     <c ca="center">
                        <p>0.311</p>
                     </c>
                     <c ca="center">
                        <p>0.291</p>
                     </c>
                     <c ca="center">
                        <p>0.323</p>
                     </c>
                     <c ca="center">
                        <p>0.336</p>
                     </c>
                     <c ca="center">
                        <p>(0.976&#8211;0.987)</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Specificity</p>
                     </c>
                     <c ca="center">
                        <p>0.981</p>
                     </c>
                     <c ca="center">
                        <p>0.982</p>
                     </c>
                     <c ca="center">
                        <p>0.982</p>
                     </c>
                     <c ca="center">
                        <p>0.983</p>
                     </c>
                     <c ca="center">
                        <p>0.984</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>One-way: keywords with z scores greater than 1.96 were selected as featured keywords.</p>
                  <p>Two-way: keywords with z scores greater than 1.96 or less than -1.96 were selected as featured keywords.</p>
                  <p>AUC: area under the curve.</p>
                  <p>CI: confidence interval.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Using the SVM tool for HuGE Pub Lit database screening and curation</p>
            </st>
            <p>The routine screening process used to perform weekly updates of the HuGE Pub Lit database was based on a complex query that combined Medical Subject Headings (MeSH) terms and selected text words, followed by a labor-intensive, time-consuming manual review by a single curator (MC) <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. Because a previous evaluation had concluded that the recall of this process was about 80% <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, we re-evaluated the SVM false positives. During the 4-week evaluation period, the SVM picked up 47 positive articles that the traditional curation process had missed; however, the SVM itself missed 14 positive abstracts (Table <tblr tid="T2">2</tblr>).</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Results of the SVM method and previous method in screening PubMed for the HuGE Pub Lit database.</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>
                           <b>05-Apr-07</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>12-Apr-07</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>19-Apr-07</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>26-Apr-07</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Number of positive abstracts missed by the previous method*</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>22</p>
                     </c>
                     <c ca="center">
                        <p>17</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Number of positive abstracts missed by SVM</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Number of positive abstracts picked up by both methods</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>179</p>
                     </c>
                     <c ca="center">
                        <p>159</p>
                     </c>
                     <c ca="center">
                        <p>131</p>
                     </c>
                     <c ca="center">
                        <p>114</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Number of total positive abstracts</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>206</p>
                     </c>
                     <c ca="center">
                        <p>180</p>
                     </c>
                     <c ca="center">
                        <p>137</p>
                     </c>
                     <c ca="center">
                        <p>121</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>* True positives re-evaluated by the curator.</p>
               </tblfn>
            </tbl>
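            <p>The weekly counts in Table 2 can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers in the table; the variable names are illustrative. Because the reference set contains only abstracts found by at least one of the two methods, the pooled shares it prints are upper bounds on each method's recall over this set and are not the same quantity as recall estimates derived from other evaluations.</p>
            <p><![CDATA[
```python
# Counts copied from Table 2 (weeks of 05, 12, 19 and 26 April 2007).
missed_by_previous = [22, 17, 5, 3]
missed_by_svm = [5, 4, 1, 4]
found_by_both = [179, 159, 131, 114]
total_positive = [206, 180, 137, 121]

# Each weekly total should equal the sum of the three disjoint groups.
for both, prev_miss, svm_miss, total in zip(
        found_by_both, missed_by_previous, missed_by_svm, total_positive):
    assert both + prev_miss + svm_miss == total

all_positive = sum(total_positive)
svm_share = (all_positive - sum(missed_by_svm)) / all_positive
previous_share = (all_positive - sum(missed_by_previous)) / all_positive

print(sum(missed_by_previous), sum(missed_by_svm), all_positive)
print(round(svm_share, 3), round(previous_share, 3))
```
]]></p>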
            <p>The number of abstracts returned by the query is a crucial factor in determining the burden of curating the HuGE Navigator database. The ever-increasing number of genetic association studies &#8211; combined with curator fatigue &#8211; may also influence the quality of the database. Our 4-week experiment showed that using the GAPscreener reduced the number of abstracts requiring manual review approximately 8-fold (Table <tblr tid="T3">3</tblr>).</p>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Numbers of PubMed abstracts requiring manual review after screening by SVM method and previous method*.</p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>
                           <b>05-Apr-07</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>12-Apr-07</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>19-Apr-07</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>26-Apr-07</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Total</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>The SVM tool</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>521</p>
                     </c>
                     <c ca="center">
                        <p>397</p>
                     </c>
                     <c ca="center">
                        <p>458</p>
                     </c>
                     <c ca="center">
                        <p>400</p>
                     </c>
                     <c ca="center">
                        <p>1776</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>The previous method</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>4010</p>
                     </c>
                     <c ca="center">
                        <p>3013</p>
                     </c>
                     <c ca="center">
                        <p>3789</p>
                     </c>
                     <c ca="center">
                        <p>3382</p>
                     </c>
                     <c ca="center">
                        <p>14194</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Note: the number for the SVM tool was generated based on Entrez date; the number for the previous method was generated based on MeSH date.</p>
                  <p>* Previous method: the screening method described in reference 5.</p>
               </tblfn>
            </tbl>
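            <p>The approximately 8-fold reduction follows directly from the totals in Table 3, as the short check below shows. It uses only the table's numbers; the variable names are illustrative.</p>
            <p><![CDATA[
```python
# Abstracts requiring manual review per week, copied from Table 3.
svm_tool = [521, 397, 458, 400]
previous_method = [4010, 3013, 3789, 3382]

svm_total = sum(svm_tool)
previous_total = sum(previous_method)
fold_reduction = previous_total / svm_total

print(svm_total, previous_total, round(fold_reduction, 2))
```
]]></p>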
         </sec>
         <sec>
            <st>
               <p>Screening PubMed for genetic associations with preterm birth</p>
            </st>
            <p>We built this application not only for general screening of the PubMed literature on genetic associations but also as a tool that could be customized for searching genetic association literature in any specific domain. We used preterm birth as an example to evaluate the application's performance in this setting. An independent screening process performed by domain experts first identified 5,421 articles in PubMed and EMBASE using complex queries. After the experts reviewed each abstract manually, 49 articles were included in the knowledge base; all 49 were indexed in PubMed. In a parallel process, the GAPscreener performed the initial screening automatically with the preterm birth-specific query (see Method), identifying 531 articles. Of the 49 articles identified by the domain experts, 47 (96%) were among these 531. The GAPscreener missed two articles found by the traditional process but picked up six additional articles that the traditional process had missed (Figure <figr fid="F1">1</figr>).</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Results of traditional search method compared with use of GAPscreener (preterm birth example)</p>
               </caption>
               <text>
                  <p><b>Results of traditional search method compared with use of GAPscreener (preterm birth example)</b>. Both methods searched all PubMed abstracts entered from January 1, 1990 through April 12, 2007. Numbers indicate the number of PubMed abstracts processed at each stage.</p>
               </text>
               <graphic file="1471-2105-9-205-1"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Implementation of the user-friendly application</p>
            </st>
            <p>The GAPscreener includes all components of the screening process: PubMed record retrieval from NCBI, text content processing for keyword extraction, SVM input data formatting, and SVM output display and record export (Figure <figr fid="F2">2</figr>). A graphical user interface (GUI) provides a user-friendly environment (Figure <figr fid="F3">3</figr>). The application can be downloaded at no charge, and its self-installing package simplifies setup.</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Data flow scheme in GAPscreener's screening process</p>
               </caption>
               <text>
                  <p>
                     <b>Data flow scheme in GAPscreener's screening process.</b>
                  </p>
               </text>
               <graphic file="1471-2105-9-205-2"/>
            </fig>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Graphical user interface (GUI) of GAPscreener</p>
               </caption>
               <text>
                  <p>
                     <b>Graphical user interface (GUI) of GAPscreener.</b>
                  </p>
               </text>
               <graphic file="1471-2105-9-205-3"/>
            </fig>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>The number of published genetic associations has exploded during the past decade <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. Finding these associations in major online databases like PubMed is critical for establishing the knowledge base on genetic factors in specific diseases <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. Automated tools are needed to help scientists cope with the information overload. For 6 years, the HuGE Pub Lit database has continuously collected PubMed literature related to human genome epidemiology, providing a great opportunity to test machine learning techniques for automating the screening process <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. Compared with the existing, traditional screening process, the GAPscreener dramatically reduced the burden of manual review and substantially improved screening recall, from 80% to 97.5%.</p>
         <p>Feature selection is an important element of the support vector machine technique. Our weighted z score method performed better than a previously reported method based on the Term Frequency &#215; Inverse Document Frequency (TFIDF) weighting scheme <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. Representing statistical information for each keyword as a normalized z score (a value between -1 and 1) also performed better than the binary representation <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>.</p>
         <p>As we demonstrated in the example of preterm birth, a potentially important application of the GAPscreener is identifying genetic association literature in a specific domain (e.g., disease, gene, or pathway). This could be very useful to disease-specific networks or consortia, such as those that have banded together in a global HuGENet collaboration <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. The GAPscreener takes advantage of PubMed search capacity to narrow down the returned abstracts to a specific topic before applying the SVM technique.</p>
         <p>The GAPscreener could become a routine screening tool for researchers and database curators who maintain a local reference database. The tool can be downloaded at no charge, and the source code is available upon request. It can assist researchers with systematic reviews by identifying genetic association literature in PubMed in a user-friendly and sensitive way. To our knowledge, it is the first free application that uses SVM techniques to classify published literature related to human genetic association studies. A similar approach could be used to classify literature in other biomedical fields.</p>
         <p>Although the GAPscreener demonstrated high recall and specificity, it has many aspects that could be improved. For example, the two-way weighted z score scheme based on a threshold of &#177; 1.96 generated 4,589 keywords. The number of featured keywords influences the processing speed, which in this example averaged about 0.02 second per abstract. We are planning to experiment with shorter featured keyword lists to improve processing time without sacrificing recall.</p>
         <p>The keyword approach is only one of many ways to transform text into a feature vector. Use of controlled vocabularies can make "keywords" more meaningful and condense the list by reducing synonyms for a particular concept to a single term. The Unified Medical Language System (UMLS) sponsored by the National Library of Medicine provides a central repository for standard controlled vocabularies in the biomedical fields <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. MetaMap Transfer (MMTx) is a tool that maps free text to concepts in the UMLS Metathesaurus <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>. UMLS terms could be used during the selection of featured keywords.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>GAPscreener is the first free SVM-based application available for screening the human genetic association literature in PubMed. It uses a novel SVM weighted-feature selection scheme. A performance evaluation demonstrated high recall and specificity. The user-friendly graphical user interface makes this a practical, stand-alone application.</p>
      </sec>
      <sec>
         <st>
            <p>Competing interests</p>
         </st>
         <p>The authors declare that they have no competing interests.</p>
      </sec>
      <sec>
         <st>
            <p>Availability and requirements</p>
         </st>
         <p>Project home page:</p>
         <p>
            <url>http://www.hugenavigator.net/HuGENavigator/HNDescription/opensource_GAP.htm</url>
         </p>
         <p>Operating systems: Windows</p>
         <p>Programming language: Java</p>
         <p>Software packages: J2EE 1.4.</p>
         <p>License: GNU General Public License. This license allows the source code to be redistributed and/or modified under the terms of the GNU General Public License as published by the Free Software Foundation. The source code for the application is available at no charge.</p>
         <p>Any restrictions to use by non-academics: None</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>WY designed and implemented the infrastructure, wrote the source code, and drafted the manuscript. MC was involved in the data curation and evaluation tests. SD was involved in the test data preparation and evaluation. AY was involved in the data analysis and helped in manuscript preparation. AW participated in design of the system evaluation, data collection and analysis. TL performed the statistical design and data analysis. MG led the project, provided advice, and revised the draft manuscript. MJK oversaw the project and revised the draft manuscript. All authors read and approved the final document.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We thank Dr. Sham Navathe and his group at the Georgia Institute of Technology for useful discussions on support vector machines. Thanks also to Joseph Long for comments on the manuscript.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Literature mining for the biologist: from information retrieval to biological discovery</p>
            </title>
            <aug>
               <au>
                  <snm>Jensen</snm>
                  <fnm>LJ</fnm>
               </au>
               <au>
                  <snm>Saric</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Bork</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Nat Rev Genet</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>119</fpage>
            <lpage>129</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nrg1768</pubid>
                  <pubid idtype="pmpid" link="fulltext">16418747</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Realizing the promise of genomics in biomedical research</p>
            </title>
            <aug>
               <au>
                  <snm>Guttmacher</snm>
                  <fnm>AE</fnm>
               </au>
               <au>
                  <snm>Collins</snm>
                  <fnm>FS</fnm>
               </au>
            </aug>
            <source>JAMA</source>
            <pubdate>2005</pubdate>
            <volume>294</volume>
            <fpage>1399</fpage>
            <lpage>1402</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">16174701</pubid>
                  <pubid idtype="doi">10.1001/jama.294.11.1399</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>A road map for efficient and reliable human genome epidemiology</p>
            </title>
            <aug>
               <au>
                  <snm>Ioannidis</snm>
                  <fnm>JP</fnm>
               </au>
               <au>
                  <snm>Gwinn</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Little</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Higgins</snm>
                  <fnm>JP</fnm>
               </au>
               <au>
                  <snm>Bernstein</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Boffetta</snm>
                  <fnm>P</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2006</pubdate>
            <volume>38</volume>
            <fpage>3</fpage>
            <lpage>5</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/ng0106-3</pubid>
                  <pubid idtype="pmpid">16468121</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>HuGENet Handbook of Systematic Reviews</p>
            </title>
            <pubdate>2007</pubdate>
            <url>http://www.genesens.net/_intranet/doc_nouvelles/HuGE Review Handbook v11.pdf</url>
         </bibl>
         <bibl id="B5">
            <title>
               <p>A navigator for human genome epidemiology</p>
            </title>
            <aug>
               <au>
                  <snm>Yu</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Gwinn</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Clyne</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Yesupriya</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Khoury</snm>
                  <fnm>MJ</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2008</pubdate>
            <volume>40</volume>
            <fpage>124</fpage>
            <lpage>125</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">18227866</pubid>
                  <pubid idtype="doi">10.1038/ng0208-124</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Tracking the epidemiology of human genes in the literature: the HuGE Published Literature database</p>
            </title>
            <aug>
               <au>
                  <snm>Lin</snm>
                  <fnm>BK</fnm>
               </au>
               <au>
                  <snm>Clyne</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Walsh</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Gomez</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Yu</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Gwinn</snm>
                  <fnm>M</fnm>
               </au>
               <etal/>
            </aug>
            <source>Am J Epidemiol</source>
            <pubdate>2006</pubdate>
            <volume>164</volume>
            <fpage>1</fpage>
            <lpage>4</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/aje/kwj175</pubid>
                  <pubid idtype="pmpid" link="fulltext">16641305</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Systematic meta-analyses of Alzheimer disease genetic association studies: the AlzGene database</p>
            </title>
            <aug>
               <au>
                  <snm>Bertram</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>McQueen</snm>
                  <fnm>MB</fnm>
               </au>
               <au>
                  <snm>Mullin</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Blacker</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Tanzi</snm>
                  <fnm>RE</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2007</pubdate>
            <volume>39</volume>
            <fpage>17</fpage>
            <lpage>23</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">17192785</pubid>
                  <pubid idtype="doi">10.1038/ng1934</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>PubMed. Bethesda, MD: National Library of Medicine</p>
            </title>
            <pubdate>2006</pubdate>
            <url>http://www.ncbi.nlm.nih.gov/entrez</url>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Hairpins in bookstacks: information retrieval from biomedical text</p>
            </title>
            <aug>
               <au>
                  <snm>Shatkay</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Brief Bioinform</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>222</fpage>
            <lpage>238</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">16212771</pubid>
                  <pubid idtype="doi">10.1093/bib/6.3.222</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Investigation into biomedical literature classification using support vector machines</p>
            </title>
            <aug>
               <au>
                  <snm>Polavarapu</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Navathe</snm>
                  <fnm>SB</fnm>
               </au>
               <au>
                  <snm>Ramnarayanan</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>ul</snm>
                  <fnm>HA</fnm>
               </au>
               <au>
                  <snm>Sahay</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Liu</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>Proc IEEE Comput Syst Bioinform Conf</source>
            <pubdate>2005</pubdate>
            <fpage>366</fpage>
            <lpage>374</lpage>
            <xrefbib>
               <pubid idtype="pmpid">16447994</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>PreBIND and Textomy&#8211;mining the biomedical literature for protein-protein interactions using a support vector machine</p>
            </title>
            <aug>
               <au>
                  <snm>Donaldson</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>de</snm>
                  <fnm>BB</fnm>
               </au>
               <au>
                  <snm>Wolting</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Lay</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Tuekam</snm>
                  <fnm>B</fnm>
               </au>
               <etal/>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>4</volume>
            <fpage>11</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">153503</pubid>
                  <pubid idtype="pmpid" link="fulltext">12689350</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-4-11</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>The TREC 2004 genomics track categorization task: classifying full text biomedical documents</p>
            </title>
            <aug>
               <au>
                  <snm>Cohen</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Hersh</snm>
                  <fnm>WR</fnm>
               </au>
            </aug>
            <source>J Biomed Discov Collab</source>
            <pubdate>2006</pubdate>
            <volume>1</volume>
            <fpage>4</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">16722582</pubid>
                  <pubid idtype="pmcid">1440303</pubid>
                  <pubid idtype="doi">10.1186/1747-5333-1-4</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Support-vector networks</p>
            </title>
            <aug>
               <au>
                  <snm>Cortes</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Vapnik</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <source>Machine Learning</source>
            <pubdate>1995</pubdate>
            <volume>20</volume>
            <fpage>273</fpage>
            <lpage>297</lpage>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Substring selection for biomedical document classification</p>
            </title>
            <aug>
               <au>
                  <snm>Han</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Obradovic</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Hu</snm>
                  <fnm>ZZ</fnm>
               </au>
               <au>
                  <snm>Wu</snm>
                  <fnm>CH</fnm>
               </au>
               <au>
                  <snm>Vucetic</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>22</volume>
            <fpage>2136</fpage>
            <lpage>2142</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btl350</pubid>
                  <pubid idtype="pmpid" link="fulltext">16837530</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Training a support vector machine in the primal</p>
            </title>
            <aug>
               <au>
                  <snm>Chapelle</snm>
                  <fnm>O</fnm>
               </au>
            </aug>
            <source>Neural Comput</source>
            <pubdate>2007</pubdate>
            <volume>19</volume>
            <fpage>1155</fpage>
            <lpage>1178</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">17381263</pubid>
                  <pubid idtype="doi">10.1162/neco.2007.19.5.1155</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures</p>
            </title>
            <aug>
               <au>
                  <snm>Ng</snm>
                  <fnm>KL</fnm>
               </au>
               <au>
                  <snm>Mishra</snm>
                  <fnm>SK</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>23</volume>
            <fpage>1321</fpage>
            <lpage>1330</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btm026</pubid>
                  <pubid idtype="pmpid" link="fulltext">17267435</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>A novel approach using pharmacophore ensemble/support vector machine (PhE/SVM) for prediction of hERG liability</p>
            </title>
            <aug>
               <au>
                  <snm>Leong</snm>
                  <fnm>MK</fnm>
               </au>
            </aug>
            <source>Chem Res Toxicol</source>
            <pubdate>2007</pubdate>
            <volume>20</volume>
            <fpage>217</fpage>
            <lpage>226</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">17261034</pubid>
                  <pubid idtype="doi">10.1021/tx060230c</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Mining protein function from text using term-based support vector machines</p>
            </title>
            <aug>
               <au>
                  <snm>Rice</snm>
                  <fnm>SB</fnm>
               </au>
               <au>
                  <snm>Nenadic</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Stapley</snm>
                  <fnm>BJ</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <issue>Suppl 1</issue>
            <fpage>S22</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">15960835</pubid>
                  <pubid idtype="pmcid">1869015</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-6-S1-S22</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>GAPscreener</p>
            </title>
            <url>http://www.hugenavigator.net/HuGENavigator/HNDescription/opensource_GAP.htm</url>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Entrez Programming Utilities. Bethesda, MD: National Library of Medicine</p>
            </title>
            <pubdate>2006</pubdate>
            <url>http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html</url>
         </bibl>
         <bibl id="B21">
            <aug>
               <au>
                  <snm>Rosner</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Fundamentals of Biostatistics</source>
            <publisher>Boston: Duxbury Press</publisher>
            <edition>5</edition>
            <pubdate>2000</pubdate>
            <fpage>356</fpage>
            <lpage>359</lpage>
         </bibl>
         <bibl id="B22">
            <title>
               <p>A library for support vector machines</p>
            </title>
            <aug>
               <au>
                  <snm>Chang</snm>
                  <fnm>CC</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>CJ</fnm>
               </au>
            </aug>
            <pubdate>2001</pubdate>
            <url>http://www.csie.ntu.edu.tw/~cjlin/libsvm</url>
         </bibl>
         <bibl id="B23">
            <title>
               <p>A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods</p>
            </title>
            <aug>
               <au>
                  <snm>Lin</snm>
                  <fnm>HT</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>CJ</fnm>
               </au>
            </aug>
            <publisher>Technical report, Department of Computer Science, National Taiwan University</publisher>
            <pubdate>2003</pubdate>
            <url>http://www.csie.ntu.edu.tw/~cjlin/papers/tanh.pdf</url>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Java Swing</p>
            </title>
            <aug>
               <au>
                  <snm>Eckstein</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Loy</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Wood</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <publisher>O'Reilly &amp; Associates, Inc., Sebastopol, CA</publisher>
            <pubdate>1998</pubdate>
         </bibl>
         <bibl id="B25">
            <title>
               <p>EzInstall 5.2</p>
            </title>
            <url>http://www.download3000.com/download_500.html</url>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach</p>
            </title>
            <aug>
               <au>
                  <snm>DeLong</snm>
                  <fnm>ER</fnm>
               </au>
               <au>
                  <snm>DeLong</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Clarke-Pearson</snm>
                  <fnm>DL</fnm>
               </au>
            </aug>
            <source>Biometrics</source>
            <pubdate>1988</pubdate>
            <volume>44</volume>
            <fpage>837</fpage>
            <lpage>845</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid">3203132</pubid>
                  <pubid idtype="doi">10.2307/2531595</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <aug>
               <au>
                  <snm>Puri</snm>
                  <fnm>ML</fnm>
               </au>
               <au>
                  <snm>Sen</snm>
                  <fnm>PK</fnm>
               </au>
            </aug>
            <source>Nonparametric Methods in Multivariate Analysis</source>
            <publisher>Wiley</publisher>
            <pubdate>1971</pubdate>
         </bibl>
         <bibl id="B28">
            <title>
               <p>EMBASE Excerpta Medica</p>
            </title>
            <publisher>New York, NY: Elsevier</publisher>
            <pubdate>2005</pubdate>
            <url>http://www.elsevier.com/wps/find/bibliographicdatabasedescription.cws_home/523328/description</url>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Machine learning in automated text categorization</p>
            </title>
            <aug>
               <au>
                  <snm>Sebastiani</snm>
                  <fnm>F</fnm>
               </au>
            </aug>
            <source>ACM Computing Surveys</source>
            <pubdate>2002</pubdate>
            <volume>34</volume>
            <fpage>1</fpage>
            <lpage>47</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1145/505282.505283</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>A network of investigator networks in human genome epidemiology</p>
            </title>
            <aug>
               <au>
                  <snm>Ioannidis</snm>
                  <fnm>JP</fnm>
               </au>
               <au>
                  <snm>Bernstein</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Boffetta</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Danesh</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Dolan</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Hartge</snm>
                  <fnm>P</fnm>
               </au>
               <etal/>
            </aug>
            <source>Am J Epidemiol</source>
            <pubdate>2005</pubdate>
            <volume>162</volume>
            <fpage>302</fpage>
            <lpage>304</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/aje/kwi201</pubid>
                  <pubid idtype="pmpid" link="fulltext">16014777</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>The Unified Medical Language System</p>
            </title>
            <aug>
               <au>
                  <snm>Lindberg</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Humphreys</snm>
                  <fnm>BL</fnm>
               </au>
               <au>
                  <snm>McCray</snm>
                  <fnm>AT</fnm>
               </au>
            </aug>
            <source>Methods Inf Med</source>
            <pubdate>1993</pubdate>
            <volume>32</volume>
            <fpage>281</fpage>
            <lpage>291</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid">8412823</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program</p>
            </title>
            <aug>
               <au>
                  <snm>Aronson</snm>
                  <fnm>AR</fnm>
               </au>
            </aug>
            <source>Proc AMIA Symp</source>
            <pubdate>2001</pubdate>
            <fpage>17</fpage>
            <lpage>21</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid">11825149</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
