<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2007-8-6-r103</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Research</dochead>
      <bibl>
         <title>
            <p>Characterization and modeling of the <it>Haemophilus influenzae </it>core and supragenomes based on the complete genomic sequences of Rd and 12 clinical nontypeable strains</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Hogg</snm>
               <mi>S</mi>
               <fnm>Justin</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>jhogg@wpahs.org</email>
            </au>
            <au id="A2" ca="yes">
               <snm>Hu</snm>
               <mi>Z</mi>
               <fnm>Fen</fnm>
               <insr iid="I1"/>
               <email>fhu@wpahs.org</email>
            </au>
            <au id="A3">
               <snm>Janto</snm>
               <fnm>Benjamin</fnm>
               <insr iid="I1"/>
               <email>bjanto@wpahs.org</email>
            </au>
            <au id="A4">
               <snm>Boissy</snm>
               <fnm>Robert</fnm>
               <insr iid="I1"/>
               <email>rboissy@wpahs.org</email>
            </au>
            <au id="A5">
               <snm>Hayes</snm>
               <fnm>Jay</fnm>
               <insr iid="I1"/>
               <email>jhayes@wpahs.org</email>
            </au>
            <au id="A6">
               <snm>Keefe</snm>
               <fnm>Randy</fnm>
               <insr iid="I1"/>
               <email>rkeefe@wpahs.org</email>
            </au>
            <au id="A7">
               <snm>Post</snm>
               <fnm>J Christopher</fnm>
               <insr iid="I1"/>
               <email>cpost@wpahs.org</email>
            </au>
            <au id="A8" ca="yes">
               <snm>Ehrlich</snm>
               <mi>D</mi>
               <fnm>Garth</fnm>
               <insr iid="I1"/>
               <email>gehrlich@wpahs.org</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Allegheny General Hospital, Allegheny-Singer Research Institute, Center for Genomic Sciences, Pittsburgh, Pennsylvania 15212, USA</p>
            </ins>
            <ins id="I2">
               <p>Joint Carnegie Mellon University - University of Pittsburgh Ph.D. Program in Computational Biology, 3064 Biomedical Science Tower 3, 3501 Fifth Avenue, Pittsburgh, Pennsylvania 15260, USA</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2007</pubdate>
         <volume>8</volume>
         <issue>6</issue>
         <fpage>R103</fpage>
         <url>http://genomebiology.com/2007/8/6/R103</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17550610</pubid>
               <pubid idtype="doi">10.1186/gb-2007-8-6-r103</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>9</day>
               <month>2</month>
               <year>2007</year>
            </date>
         </rec>
         <revrec>
            <date>
               <day>17</day>
               <month>4</month>
               <year>2007</year>
            </date>
         </revrec>
         <acc>
            <date>
               <day>5</day>
               <month>6</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>05</day>
               <month>06</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>Hogg et al.; licensee BioMed Central Ltd.</collab>
         <note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <shorttitle>
         <p><it>H. influenzae </it>core- and supra-genome characterization</p>
      </shorttitle>
      <shortabs>
         <p>The genomes of nine nontypeable <it>H. influenzae </it>clinical isolates were sequenced and compared with a reference strain, allowing the characterization and modeling of the core and supragenomes of this organism.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>The distributed genome hypothesis (DGH) posits that chronic bacterial pathogens utilize polyclonal infection and reassortment of genic characters to ensure persistence in the face of adaptive host defenses. Studies based on random sequencing of multiple strain libraries suggested that free-living bacterial species possess a supragenome that is much larger than the genome of any single bacterium.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We derived high-depth genomic coverage of nine nontypeable <it>Haemophilus influenzae </it>(NTHi) clinical isolates, bringing to 13 the number of sequenced NTHi genomes. Clustering identified 2,786 genes, of which 1,461 were common to all strains, with each of the remaining 1,325 found in a subset of strains; the number of clusters ranged from 1,686 to 1,878 per strain. Genic differences of between 96 and 585 were identified per strain pair. Comparisons of each of the NTHi strains with the Rd strain revealed between 107 and 158 insertions and between 100 and 213 deletions per genome. The mean insertion and deletion sizes were 1,356 and 1,020 base pairs, respectively, with mean maximum insertions and deletions of 26,977 and 37,299 base pairs. This relatively large number of small rearrangements among strains is in keeping with what is known about the transformation mechanisms in this naturally competent pathogen.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>A finite supragenome model was developed to explain the distribution of genes among strains. The model predicts that the NTHi supragenome contains between 4,425 and 6,052 genes with most uncertainty regarding the number of rare genes, those that have a frequency of &lt;0.1 among strains; collectively, these results support the DGH.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010016">Molecular biology</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010014">Microbiology and parasitology</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p><it>Haemophilus influenzae </it>is a Gram-negative bacterium that colonizes the human nasopharynx and is also etiologically associated with a spectrum of acute and chronic diseases. There are six recognized capsular serotypes (a-f), but the majority of clinical strains are unencapsulated and are referred to as nontypeable <it>H. influenzae </it>(NTHi). The type b polysaccharide capsular variants (Hib) are associated with invasive disease, particularly meningitis; however, the introduction of a highly effective vaccine has nearly eliminated this pathogen from developed countries. Recent studies have demonstrated that the NTHi form biofilms on the respiratory mucosa of humans and other mammals and it has been hypothesized that this contributes to the chronicity of these infections <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>. They are the most frequently detected pathogens associated with both the acute and chronic forms of otitis media (OM) <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> and also are recognized as a seed pathogen in a wide range of chronic polymicrobial infections of the respiratory mucosa, including the cystic fibrosis lung, chronic obstructive pulmonary disease, tracheobronchitis, rhinosinusitis, and mastoiditis <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr></abbrgrp>.</p>
         <p>The NTHi are naturally transformable and their genomes demonstrate a high degree of plasticity among strains <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp>. Previous work from our laboratory has shown that approximately 10% of the genes possessed by each clinically isolated strain are novel with respect to the reference strain Rd KW20 and that the distribution of these genes among the strains is non-uniform <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. Polyclonal NTHi populations have been associated with chronic disease as well as with nasopharyngeal carriage <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B12">12</abbr></abbrgrp>, while other researchers have observed <it>in situ </it>horizontal gene transfer in diseased patients <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B13">13</abbr></abbrgrp>. The twin observations that the NTHi form biofilms during chronic infections and that these infections are often polyclonal suggest that multiple unique strains are co-localized within an environment demonstrated to support greatly elevated rates of horizontal gene transfer <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr></abbrgrp>. This circumstantial evidence suggests that a genetically diverse population may be important to the fitness of <it>H. influenzae </it>as a human pathogen and that continuous horizontal gene transfer among co-colonizing strains is the mechanism that generates the diversity observed in the population. It has been hypothesized that this generation of microbial diversity is the counterpoint to the adaptive immune response of the mammalian host <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>.</p>
         <p>The distributed genome hypothesis (DGH) states that the full complement of genes available to a pathogenic bacterial species exists in a 'supragenome' pool that is not contained by any particular strain, but is available through a genically diverse population of naturally transformable bacterial strains. The distributed genome is not a phenomenon confined to <it>H. influenzae</it>; comparative genomic studies in other bacterial pathogens, including pneumococcus and <it>Pseudomonas aeruginosa</it>, have demonstrated even greater degrees of genomic plasticity among clinical strains <abbrgrp><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr></abbrgrp>. Moreover, evolutionary studies have demonstrated that pneumococcus uses competence and transformation as a pathogenic mechanism <abbrgrp><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr></abbrgrp>.</p>
         <p>Testing of the DGH and its predictions will provide insight into clinically relevant problems, such as antibiotic resistance, chronic biofilm disease, and serotype-diverse species, which readily adapt to standard vaccinations. Further characterization of the <it>H. influenzae </it>supragenome is a prerequisite to addressing these issues. In this regard we have sequenced the genomes of 11 clinical NTHi isolates, 2 by standard clone-based Sanger sequencing and 9 using the new 454-based pyrosequencing technology. This dataset, combined with the published genomic sequences of Rd and 86-028NP, constitutes the largest set of genomic data collected for <it>H. influenzae </it>to date and is the first step towards a characterization of the full complement of genes that collectively define the <it>H. influenzae </it>supragenome. In this paper we present a global comparative analysis that characterizes the distribution of genetic diversity among the strains.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>DNA sequence data</p>
            </st>
            <p>Table <tblr tid="T1">1</tblr> lists the 12 <it>H. influenzae </it>clinical strains and the reference strain Rd, a largely non-pathogenic strain, used in the comparative genomic studies described herein, together with their NCBI locus tags, the location where the sequencing was performed, and their clinical origins. Nine of the clinical strains were sequenced using the novel pyrosequencing technology from 454 Life Sciences <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. The number of sequencing runs, the extent of genomic coverage, and the number of contigs resulting from first-pass and, in some cases, second-pass assemblies are tabulated (Table <tblr tid="T2">2</tblr>).</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Bacterial strains and sources used for whole genome sequencing, comparative genomics, and computation of the NTHi core and supragenomes</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="left">
                        <p>NTHi strain</p>
                     </c>
                     <c ca="left">
                        <p>NCBI locus tag prefix</p>
                     </c>
                     <c ca="left">
                        <p>Sequence source</p>
                     </c>
                     <c ca="left">
                        <p>Clinical source [reference]</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Rd KW20</p>
                     </c>
                     <c ca="left">
                        <p>HI</p>
                     </c>
                     <c ca="left">
                        <p>NCBI</p>
                     </c>
                     <c ca="left">
                        <p>Lab strain, formerly serotype d [32]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>86-028NP</p>
                     </c>
                     <c ca="left">
                        <p>NTHI</p>
                     </c>
                     <c ca="left">
                        <p>NCBI</p>
                     </c>
                     <c ca="left">
                        <p>NP isolate from COM patient [33]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>R2846</p>
                     </c>
                     <c ca="left">
                        <p>N/A</p>
                     </c>
                     <c ca="left">
                        <p>SBRI</p>
                     </c>
                     <c ca="left">
                        <p>OM isolate, St Louis [10,52]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>R2866</p>
                     </c>
                     <c ca="left">
                        <p>N/A</p>
                     </c>
                     <c ca="left">
                        <p>SBRI</p>
                     </c>
                     <c ca="left">
                        <p>Blood isolate (meningitis), Seattle [10,53]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>3655</p>
                     </c>
                     <c ca="left">
                        <p>CGSHi3655</p>
                     </c>
                     <c ca="left">
                        <p>CGS</p>
                     </c>
                     <c ca="left">
                        <p>AOM isolate, Missouri [54, from A. Ryan]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PittAA</p>
                     </c>
                     <c ca="left">
                        <p>CGSHiAA</p>
                     </c>
                     <c ca="left">
                        <p>CGS</p>
                     </c>
                     <c ca="left">
                        <p>OME isolate, Pittsburgh [11]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PittEE</p>
                     </c>
                     <c ca="left">
                        <p>CGSHiEE</p>
                     </c>
                     <c ca="left">
                        <p>CGS</p>
                     </c>
                     <c ca="left">
                        <p>OME isolate, Pittsburgh [11]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PittGG</p>
                     </c>
                     <c ca="left">
                        <p>CGSHiGG</p>
                     </c>
                     <c ca="left">
                        <p>CGS</p>
                     </c>
                     <c ca="left">
                        <p>Otorrhea isolate, Pittsburgh [11]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PittHH</p>
                     </c>
                     <c ca="left">
                        <p>CGSHiHH</p>
                     </c>
                     <c ca="left">
                        <p>CGS</p>
                     </c>
                     <c ca="left">
                        <p>OME isolate, Pittsburgh [11]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PittII</p>
                     </c>
                     <c ca="left">
                        <p>CGSHiII</p>
                     </c>
                     <c ca="left">
                        <p>CGS</p>
                     </c>
                     <c ca="left">
                        <p>Otorrhea isolate, Pittsburgh [11]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>R3021</p>
                     </c>
                     <c ca="left">
                        <p>CGSHiR3021</p>
                     </c>
                     <c ca="left">
                        <p>CGS</p>
                     </c>
                     <c ca="left">
                        <p>NP isolate [10]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>22.4-21</p>
                     </c>
                     <c ca="left">
                        <p>CGSHi22421</p>
                     </c>
                     <c ca="left">
                        <p>CGS</p>
                     </c>
                     <c ca="left">
                        <p>NP isolate, Michigan [12]*</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>22.1-21</p>
                     </c>
                     <c ca="left">
                        <p>CGSHi22121</p>
                     </c>
                     <c ca="left">
                        <p>CGS</p>
                     </c>
                     <c ca="left">
                        <p>NP isolate, Michigan [12]*</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>AOM, acute otitis media; CGS, Center for Genomic Sciences; COM, chronic otitis media; NP, nasopharyngeal; N/A, not available; OM, otitis media; OME, otitis media with effusion; SBRI, Seattle Biomedical Research Institute.</p>
               </tblfn>
            </tbl>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Sequencing data for the nine NTHi strains sequenced with 454 technology</p>
               </caption>
               <tblbdy cols="7">
                  <r>
                     <c ca="left">
                        <p><it>H. influenzae </it>strain</p>
                     </c>
                     <c ca="center">
                        <p>40&#215;70 plates sequenced</p>
                     </c>
                     <c ca="center">
                        <p>454 read coverage</p>
                     </c>
                     <c ca="center">
                        <p>No. of Newbler contigs</p>
                     </c>
                     <c ca="center">
                        <p>PCR gap closure?</p>
                     </c>
                     <c ca="center">
                        <p>4 kb clone library?</p>
                     </c>
                     <c ca="center">
                        <p>Final no. of contigs</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>3655</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>30&#215;</p>
                     </c>
                     <c ca="center">
                        <p>59</p>
                     </c>
                     <c ca="center">
                        <p>No</p>
                     </c>
                     <c ca="center">
                        <p>No</p>
                     </c>
                     <c ca="center">
                        <p>59</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PittAA</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>23&#215;</p>
                     </c>
                     <c ca="center">
                        <p>88</p>
                     </c>
                     <c ca="center">
                        <p>Yes</p>
                     </c>
                     <c ca="center">
                        <p>No</p>
                     </c>
                     <c ca="center">
                        <p>38</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PittEE</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>42&#215;</p>
                     </c>
                     <c ca="center">
                        <p>49</p>
                     </c>
                     <c ca="center">
                        <p>Yes</p>
                     </c>
                     <c ca="center">
                        <p>4&#215; cover</p>
                     </c>
                     <c ca="center">
                        <p>12</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PittGG</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>21&#215;</p>
                     </c>
                     <c ca="center">
                        <p>60</p>
                     </c>
                     <c ca="center">
                        <p>No</p>
                     </c>
                     <c ca="center">
                        <p>Yes*</p>
                     </c>
                     <c ca="center">
                        <p>60</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PittHH</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>48&#215;</p>
                     </c>
                     <c ca="center">
                        <p>73</p>
                     </c>
                     <c ca="center">
                        <p>No</p>
                     </c>
                     <c ca="center">
                        <p>No</p>
                     </c>
                     <c ca="center">
                        <p>73</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PittII</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>16&#215;</p>
                     </c>
                     <c ca="center">
                        <p>205</p>
                     </c>
                     <c ca="center">
                        <p>No</p>
                     </c>
                     <c ca="center">
                        <p>Yes</p>
                     </c>
                     <c ca="center">
                        <p>205</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>22.4-21</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>19&#215;</p>
                     </c>
                     <c ca="center">
                        <p>69</p>
                     </c>
                     <c ca="center">
                        <p>No</p>
                     </c>
                     <c ca="center">
                        <p>No</p>
                     </c>
                     <c ca="center">
                        <p>69</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>R3021</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>35&#215;</p>
                     </c>
                     <c ca="center">
                        <p>51</p>
                     </c>
                     <c ca="center">
                        <p>No</p>
                     </c>
                     <c ca="center">
                        <p>No</p>
                     </c>
                     <c ca="center">
                        <p>51</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>22.1-21</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>19&#215;</p>
                     </c>
                     <c ca="center">
                        <p>71</p>
                     </c>
                     <c ca="center">
                        <p>No</p>
                     </c>
                     <c ca="center">
                        <p>No</p>
                     </c>
                     <c ca="center">
                        <p>71</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>*Clone library not incorporated in present analysis.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Determination of gene clustering parameters</p>
            </st>
            <p>Gene clustering parameters for the grouping of homologs were determined empirically by minimizing the change in the number of clusters per change in the parameters (Figure <figr fid="F1">1</figr>). We hypothesize that this minimum point coincides with the best-estimate threshold for distinguishing true orthologs from functionally distinct homologs. Some functionally distinct homologs will be more similar than 70%, while some true orthologs will be more divergent than 70%, but as a single uniform criterion this threshold is the best compromise. Visual inspection indicated that most of the resulting clusters were reasonable. Mosaic genes were particularly difficult to cluster because of their high levels of rearrangement. In the remainder of the paper, genes in the same cluster are considered to be the same gene.</p>
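            <p>The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in this study: it assumes a one-dimensional sweep of a single threshold (the analysis here varied both identity and match length), the function name <it>flattest_interval</it> is hypothetical, and the sweep values are invented for illustration only.</p>

```python
def flattest_interval(thresholds, cluster_counts):
    """Return the (low, high) threshold pair between which the cluster count changes least.

    thresholds must be strictly increasing; cluster_counts[i] is the number of
    clusters obtained when clustering at thresholds[i].
    """
    if len(thresholds) != len(cluster_counts):
        raise ValueError("thresholds and cluster_counts must have the same length")
    if len(thresholds) in (0, 1):
        raise ValueError("at least two thresholds are needed")

    best_pair = None
    best_rate = None
    for i in range(len(thresholds) - 1):
        step = thresholds[i + 1] - thresholds[i]
        if not step > 0:
            raise ValueError("thresholds must be strictly increasing")
        # Absolute change in cluster count per unit change in the parameter.
        rate = abs(cluster_counts[i + 1] - cluster_counts[i]) / step
        if best_rate is None or best_rate > rate:
            best_rate = rate
            best_pair = (thresholds[i], thresholds[i + 1])
    return best_pair


# Hypothetical sweep (illustrative numbers, not data from this study).
sweep_thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]
sweep_counts = [2500, 2650, 2700, 2900, 3400]
print(flattest_interval(sweep_thresholds, sweep_counts))
```

            <p>For a two-parameter sweep the same idea would apply to each axis of the grid; the sketch returns an interval rather than a single value because a finite-difference rate is only defined between two sampled thresholds.</p>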
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>A plot of the total number of clusters as a function of clustering parameters shows an inflection point near 0.65 identity and 0.70 match length</p>
               </caption>
               <text>
                  <p>A plot of the total number of clusters as a function of clustering parameters shows an inflection point near 0.65 identity and 0.70 match length. The inflection, which minimizes the rate of change in the number of clusters per change in parameters, suggests a set of parameters that optimally segregates orthologs and paralogs.</p>
               </text>
               <graphic file="gb-2007-8-6-r103-1"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Enumeration of gene clusters and genic relationships among NTHi strains</p>
            </st>
            <p>We identified 2,786 gene clusters among the 13 strains (Table <tblr tid="T3">3</tblr>). Of these, 52% were found in every strain (core genes) and 19% were found in only a single strain (unique genes). The remaining 29% of genes were found in some combination of two or more strains, but not all (distributed genes; Figure <figr fid="F2">2</figr>). The number of clusters found per strain varied from 1,686 in PittEE to 1,878 in PittII (Table <tblr tid="T4">4</tblr>). All strains possessed some unique genes not seen in any of the other strains. A pair-wise comparison among all possible strain pairs determined that the mean number of genic differences between any two strains was 395, with a standard deviation of 94 (Figure <figr fid="F3">3</figr>). This analysis also identified minimal and maximal genic differences of 96 and 585, respectively, for the strain pairs R2866:PittII and R2866:PittAA. The number of coding sequences identified per genome by AMIgene did not correlate strongly with genome size. This is likely due to the presence of split open reading frames (ORFs) in the 454 sequenced genomes, as an analysis of the 4 completed genomes showed a linear relationship between gene number and genome size with an R<sup>2 </sup>= 0.910. In contrast, the correlation between total gene clusters and genome size is 0.86, implying that the number of distinct genes in a genome is linearly related to genome size.</p>
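            <p>The core, distributed, and unique categories and the pair-wise genic difference can be computed directly from a cluster presence/absence table. The Python sketch below illustrates the definitions only; the data structure (a mapping from cluster identifier to the set of strains carrying it), the function names, and the three-strain toy data are assumptions made for this example and are not taken from the analysis pipeline used here.</p>

```python
from itertools import combinations


def classify_clusters(presence, n_strains):
    """Count core, distributed, and unique clusters.

    presence maps a cluster id to the set of strains that carry it; every
    cluster is assumed to occur in at least one strain.
    """
    counts = {"core": 0, "distributed": 0, "unique": 0}
    for strains in presence.values():
        if len(strains) == n_strains:
            counts["core"] += 1          # present in every strain
        elif len(strains) == 1:
            counts["unique"] += 1        # present in exactly one strain
        else:
            counts["distributed"] += 1   # present in some, but not all
    return counts


def genic_differences(presence, strain_a, strain_b):
    """Number of clusters present in exactly one of the two strains."""
    return sum(
        1 for strains in presence.values()
        if (strain_a in strains) != (strain_b in strains)
    )


# Toy example with three hypothetical strains (not data from this study).
toy_presence = {
    "c1": {"A", "B", "C"},
    "c2": {"A", "B", "C"},
    "c3": {"A", "B"},
    "c4": {"C"},
    "c5": {"A"},
}
toy_strains = ["A", "B", "C"]

print(classify_clusters(toy_presence, len(toy_strains)))
for a, b in combinations(toy_strains, 2):
    print(a, b, genic_differences(toy_presence, a, b))
```

            <p>Applied to the counts in Table <tblr tid="T3">3</tblr>, the same bookkeeping gives 1,461 core plus 1,325 contingency clusters for the total of 2,786, with the 539 unique clusters forming a subset of the contingency clusters.</p>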
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Gene clustering results</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="left">
                        <p>Total gene clusters</p>
                     </c>
                     <c ca="right">
                        <p>2,786</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Core gene clusters</p>
                     </c>
                     <c ca="right">
                        <p>1,461</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Contingency clusters</p>
                     </c>
                     <c ca="right">
                        <p>1,325</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Unique clusters</p>
                     </c>
                     <c ca="right">
                        <p>539</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <tbl id="T4">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Gene identification and clustering results</p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c ca="left">
                        <p><it>H. influenzae </it>strain</p>
                     </c>
                     <c ca="center">
                        <p>Genome size (Mb)</p>
                     </c>
                     <c ca="right">
                        <p>No. of AMIgene CDSs found</p>
                     </c>
                     <c ca="right">
                        <p>Total gene clusters</p>
                     </c>
                     <c ca="right">
                        <p>Contingency gene clusters</p>
                     </c>
                     <c ca="right">
                        <p>Unique gene clusters</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Rd KW20</p>
                     </c>
                     <c ca="center">
                        <p>1.83</p>
                     </c>
                     <c ca="right">
                        <p>1,802</p>
                     </c>
                     <c ca="right">
                        <p>1,710</p>
                     </c>
                     <c ca="right">
                        <p>271</p>
                     </c>
                     <c ca="right">
                        <p>52</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>86-028NP</p>
                     </c>
                     <c ca="center">
                        <p>1.91</p>
                     </c>
                     <c ca="right">
                        <p>1,867</p>
                     </c>
                     <c ca="right">
                        <p>1,830</p>
                     </c>
                     <c ca="right">
                        <p>391</p>
                     </c>
                     <c ca="right">
                        <p>28</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>R2846</p>
                     </c>
                     <c ca="center">
                        <p>1.82</p>
                     </c>
                     <c ca="right">
                        <p>1,729</p>
                     </c>
                     <c ca="right">
                        <p>1,702</p>
                     </c>
                     <c ca="right">
                        <p>263</p>
                     </c>
                     <c ca="right">
                        <p>4</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>R2866</p>
                     </c>
                     <c ca="center">
                        <p>1.93</p>
                     </c>
                     <c ca="right">
                        <p>1,864</p>
                     </c>
                     <c ca="right">
                        <p>1,835</p>
                     </c>
                     <c ca="right">
                        <p>396</p>
                     </c>
                     <c ca="right">
                        <p>1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>3655</p>
                     </c>
                     <c ca="center">
                        <p>1.85</p>
                     </c>
                     <c ca="right">
                        <p>1,880</p>
                     </c>
                     <c ca="right">
                        <p>1,819</p>
                     </c>
                     <c ca="right">
                        <p>380</p>
                     </c>
                     <c ca="right">
                        <p>62</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PittAA</p>
                     </c>
                     <c ca="center">
                        <p>1.92</p>
                     </c>
                     <c ca="right">
                        <p>1,971</p>
                     </c>
                     <c ca="right">
                        <p>1,871</p>
                     </c>
                     <c ca="right">
                        <p>432</p>
                     </c>
                     <c ca="right">
                        <p>98</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PittEE</p>
                     </c>
                     <c ca="center">
                        <p>1.80</p>
                     </c>
                     <c ca="right">
                        <p>1,762</p>
                     </c>
                     <c ca="right">
                        <p>1,686</p>
                     </c>
                     <c ca="right">
                        <p>247</p>
                     </c>
                     <c ca="right">
                        <p>19</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PittGG</p>
                     </c>
                     <c ca="center">
                        <p>1.84</p>
                     </c>
                     <c ca="right">
                        <p>2,038</p>
                     </c>
                     <c ca="right">
                        <p>1,779</p>
                     </c>
                     <c ca="right">
                        <p>340</p>
                     </c>
                     <c ca="right">
                        <p>53</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PittHH</p>
                     </c>
                     <c ca="center">
                        <p>1.83</p>
                     </c>
                     <c ca="right">
                        <p>1,931</p>
                     </c>
                     <c ca="right">
                        <p>1,783</p>
                     </c>
                     <c ca="right">
                        <p>344</p>
                     </c>
                     <c ca="right">
                        <p>45</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PittII</p>
                     </c>
                     <c ca="center">
                        <p>1.92</p>
                     </c>
                     <c ca="right">
                        <p>2,245</p>
                     </c>
                     <c ca="right">
                        <p>1,878</p>
                     </c>
                     <c ca="right">
                        <p>439</p>
                     </c>
                     <c ca="right">
                        <p>26</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>22.4-21</p>
                     </c>
                     <c ca="center">
                        <p>1.84</p>
                     </c>
                     <c ca="right">
                        <p>2,264</p>
                     </c>
                     <c ca="right">
                        <p>1,796</p>
                     </c>
                     <c ca="right">
                        <p>357</p>
                     </c>
                     <c ca="right">
                        <p>86</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>R3021</p>
                     </c>
                     <c ca="center">
                        <p>1.89</p>
                     </c>
                     <c ca="right">
                        <p>2,075</p>
                     </c>
                     <c ca="right">
                        <p>1,844</p>
                     </c>
                     <c ca="right">
                        <p>405</p>
                     </c>
                     <c ca="right">
                        <p>55</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>22.1-21</p>
                     </c>
                     <c ca="center">
                        <p>1.85</p>
                     </c>
                     <c ca="right">
                        <p>2,181</p>
                     </c>
                     <c ca="right">
                        <p>1,781</p>
                     </c>
                     <c ca="right">
                        <p>342</p>
                     </c>
                     <c ca="right">
                        <p>10</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>A histogram of gene clusters observed in exactly <it>N </it>of 13 <it>H. influenzae </it>strains compared to the expected number of genes estimated by the supragenome model (trained on all 13 strains)</p>
               </caption>
               <text>
                  <p>A histogram of gene clusters observed in exactly <it>N </it>of 13 <it>H. influenzae </it>strains compared to the expected number of genes estimated by the supragenome model (trained on all 13 strains). Over 1,400 genes were observed in all 13 strains, indicating that there is a common core set of genes. Distributed genes appear in variable numbers of strains, from 1 to 12. Overall, the model fits the data well, though it underestimated the number of genes observed once and overestimated the number of genes observed twice.</p>
               </text>
               <graphic file="gb-2007-8-6-r103-2"/>
            </fig>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>A pairwise genic comparison of 12 NTHi strains of <it>H. influenzae </it>and the reference strain Rd KW20</p>
               </caption>
               <text>
                  <p>A pairwise genic comparison of 12 NTHi strains of <it>H. influenzae </it>and the reference strain Rd KW20. The comparison of two strains is found at the intersection of the row and column corresponding to the respective strains. Strains are compared based on the number of genes shared between the pair, the number of genes found in one strain but not the other, and the number of shared genes that are unique to that pair of strains. A typical pair of strains differs by 395 genes. Similar pairs of strains are shaded in yellow, while divergent strains are shaded orange.</p>
               </text>
               <graphic file="gb-2007-8-6-r103-3"/>
            </fig>
            <p>A dendrogram based on non-core genic differences (Figure <figr fid="F4">4a</figr>) demonstrates the diversity in the NTHi population. A typical strain differs from its nearest neighbor by more than 200 genes. The strains collected from otitis media with effusion (OME) patients at Children's Hospital in Pittsburgh (designated as Pitt strains) show that a genetically diverse population can be isolated contemporaneously from a single geographic location from patients with similar indications. In contrast, two pairs of strains, PittEE/R2846 and PittII/R2866, are relatively similar despite geographically distinct points of isolation. Interestingly, the laboratory strain Rd KW20 is not an outlier among the clinical strains. For comparison, a maximum likelihood tree was generated using sequences from seven multi-locus sequence typing (MLST) housekeeping genes for the same set of 13 strains (Figure <figr fid="F4">4b</figr>). The topologies of the two trees differ markedly, both in terms of pairwise groupings and overall structure.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Plotting of relationships among the sequenced NTHi strains by gene sharing and multi-locus sequence typing</p>
               </caption>
               <text>
                  <p>Plotting of relationships among the sequenced NTHi strains by gene sharing and multi-locus sequence typing. <b>(a) </b>A dendrogram based on genic differences among the 13 strains of <it>H. influenzae</it>. While several pairs of strains appear to be closely related, there is not a well-defined clade structure. The dendrogram was generated using the unweighted pair group method with arithmetic mean (UPGMA) [44-46]. The number on each branch corresponds to the number of genic differences from the previous branch point. <b>(b) </b>A dendrogram based on sequence alignments of the seven MLST loci. The tree was built using the maximum likelihood method implemented in fastDNAml. The number on each branch corresponds to the number of point mutations per kilobase from the previous branch point. The topologies of the genic and MLST-based trees are different. Most notably, strains PittEE and R2846 are closely related in the genic dendrogram, but are separated in the MLST dendrogram. In other instances, such as PittII and R2866, the strains are closely related in both trees.</p>
               </text>
               <graphic file="gb-2007-8-6-r103-4"/>
            </fig>
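            <p>For readers unfamiliar with UPGMA, the following is a minimal sketch of the procedure on a small distance matrix. It is not the implementation used to produce Figure 4a; the function name <it>upgma</it>, the input format and the four toy taxa with their distances are hypothetical. At each step the closest pair of clusters is merged, the new node is placed at half their distance, and distances to the merged cluster are averaged with weights equal to the cluster sizes.</p>

```python
from itertools import combinations

def upgma(names, dist):
    """Return the list of merges as (subtree, node height) pairs.

    `dist` maps a pair of names (a, b) to their distance. Subtrees are
    nested tuples; the last merge holds the full tree.
    """
    size = {name: 1 for name in names}  # cluster -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    merges = []
    while len(size) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(list(size), 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        new = (a, b)
        # Size-weighted average distance from the merged cluster to the rest.
        for c in size:
            if c != a and c != b:
                d[frozenset((new, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[new] = size.pop(a) + size.pop(b)
        merges.append((new, height))
    return merges

# Hypothetical toy distances (for example, counts of genic differences).
toy_dist = {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
}
merges = upgma(["A", "B", "C", "D"], toy_dist)
heights = [height for _, height in merges]
tree = merges[-1][0]
```

            <p>With these toy distances, A and B join first, C joins next, and D joins last. In practice an established library routine for average-linkage clustering would normally be preferred over a hand-written loop.</p>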
            <p>The numbers of new genes and core genes identified with the addition of each genome (as determined by incremental clustering of the 13 strains) both show an exponentially decaying trend (Figures <figr fid="F5">5</figr> and <figr fid="F6">6</figr>). Qualitative inspection suggests a diminishing return on new genes found in future sequences, though approximately 40 new gene clusters are expected in each of the next few genomes that are sequenced. The number of core genes appears to approach a horizontal asymptote near 1,450 genes. A quantitative analysis of these results is developed below in the section 'Mathematical development of a finite supragenome model'.</p>
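            <p>The incremental counting described here can be sketched as follows, assuming each genome is represented as a set of cluster identifiers. The function name and the toy genomes are hypothetical, and the sketch processes a single order of addition; it does not average over orderings or reproduce the modeling predictions shown in the figures.</p>

```python
def incremental_counts(genomes):
    """For each genome added in order, return (new clusters, core clusters so far)."""
    seen = set()   # every cluster observed in the genomes added so far
    core = None    # clusters present in all genomes added so far
    rows = []
    for clusters in genomes:
        new = clusters - seen
        seen |= clusters
        core = set(clusters) if core is None else core.intersection(clusters)
        rows.append((len(new), len(core)))
    return rows

# Hypothetical toy genomes, not taken from the study.
toy_genomes = [
    {"g1", "g2", "g3", "g4"},
    {"g1", "g2", "g3", "g5"},
    {"g1", "g2", "g6"},
]
rows = incremental_counts(toy_genomes)
```

            <p>Each added genome can only shrink the core set and can only contribute clusters not already seen, which is why both curves are expected to flatten as more genomes are included.</p>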
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>The expected number of total gene clusters and core gene clusters identified at the addition of each genome to the clustering dataset</p>
               </caption>
               <text>
                  <p>The expected number of total gene clusters and core gene clusters identified at the addition of each genome to the clustering dataset. Modeling predictions are based on the eight-strain training set (see 'Mathematical development of a finite supragenome model'). The number of genes observed in all strains levels off to an asymptote that corresponds to a core set of genes. The rate of increase in total genes decreases, but does not level off due to the discovery of rare genes.</p>
               </text>
               <graphic file="gb-2007-8-6-r103-5"/>
            </fig>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>The observed and expected number of new gene clusters found at the addition of each genome to the clustering dataset</p>
               </caption>
               <text>
                  <p>The observed and expected number of new gene clusters found at the addition of each genome to the clustering dataset. Modeling predictions are based on the eight-strain training set (see 'Mathematical development of a finite supragenome model').</p>
               </text>
               <graphic file="gb-2007-8-6-r103-6"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Whole genome alignments reinforce the great diversity observed among gene clusters</p>
            </st>
            <p>Whole genome alignments were generated between Rd and each of the 12 clinical strains to quantify genomic insertions and deletions independently of gene identification (Table <tblr tid="T5">5</tblr>). On average, each clinical strain had 127 genomic insertions (>90 base-pairs (bp) in length) that did not correspond to any Rd KW20 sequence. Similarly, each clinical strain contained, on average, 147 genomic deletions (>90 bp) relative to Rd KW20. The average total length of non-matching sequence between a clinical strain and Rd was 321 kb, approximately 18% of the genome. This quantity of non-matching sequence reasonably accounts for the average of 395 genic differences between strain pairs. Figure <figr fid="F7">7</figr> shows a genomic region in which two different forms of an insert, homologous to the plasmid ICEhin, have integrated into the same site of two different genomes, but which is wholly absent from the other strains in the alignment. Similarly, a 40 kb contiguous region in Rd shows extensive deletional diversity among seven of the clinical strains, with only two of the clinical strains demonstrating the same local genomic organization (Figure <figr fid="F8">8</figr>). Interestingly, the two strains that are similar in this region, PittAA and PittEE, are highly divergent overall (Figure <figr fid="F3">3</figr>). Genic diversity also exists on a smaller scale. Figure <figr fid="F9">9</figr> displays a 20 kb region from seven clinical strains that shows five different combinations of possession and loss of the <it>lic2C</it> gene, the NTHI0683 gene, and the <it>ureABCEFGH</it> operon.</p>
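            <p>The averages quoted above can be checked against the per-strain counts in Table 5 with a short calculation. The insertion and deletion counts below are copied from that table. The use of the Rd KW20 genome size from Table 4 (1.83 Mb) as the denominator for the 18% figure is an assumption, since the text does not state which genome size was used.</p>

```python
from statistics import mean

# Per-strain counts from Table 5, in column order (86-028 ... R3021).
insertions = [118, 107, 115, 139, 136, 136, 119, 124, 158, 131, 128, 118]
deletions = [120, 100, 106, 178, 129, 110, 158, 169, 213, 172, 156, 159]

mean_insertions = mean(insertions)  # about 127.4, reported as 127
mean_deletions = mean(deletions)    # 147.5, reported as 147

# Assumption: the 18% figure is relative to the Rd KW20 genome (1.83 Mb).
nonmatching_kb = 321
rd_genome_kb = 1830
nonmatching_pct = 100 * nonmatching_kb / rd_genome_kb  # about 17.5
```

            <p>Under that assumption the non-matching fraction comes to about 17.5%, consistent with the approximate value of 18% given in the text.</p>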
            <tbl id="T5">
               <title>
                  <p>Table 5</p>
               </title>
               <caption>
                  <p>Analysis of inserted and deleted sequence in 12 strains with respect to Rd KW20</p>
               </caption>
               <tblbdy cols="13">
                  <r>
                     <c ca="left">
                        <p>Reference: Rd KW20</p>
                     </c>
                     <c ca="right">
                        <p>86-028NP</p>
                     </c>
                     <c ca="right">
                        <p>R2846</p>
                     </c>
                     <c ca="right">
                        <p>R2866</p>
                     </c>
                     <c ca="right">
                        <p>3655</p>
                     </c>
                     <c ca="right">
                        <p>PittAA</p>
                     </c>
                     <c ca="right">
                        <p>PittEE</p>
                     </c>
                     <c ca="right">
                        <p>PittGG</p>
                     </c>
                     <c ca="right">
                        <p>PittHH</p>
                     </c>
                     <c ca="right">
                        <p>PittII</p>
                     </c>
                     <c ca="right">
                        <p>22.4-21</p>
                     </c>
                     <c ca="right">
                        <p>22.1-21</p>
                     </c>
                     <c ca="right">
                        <p>R3021</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="13">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Number of insertions</p>
                     </c>
                     <c ca="right">
                        <p>118</p>
                     </c>
                     <c ca="right">
                        <p>107</p>
                     </c>
                     <c ca="right">
                        <p>115</p>
                     </c>
                     <c ca="right">
                        <p>139</p>
                     </c>
                     <c ca="right">
                        <p>136</p>
                     </c>
                     <c ca="right">
                        <p>136</p>
                     </c>
                     <c ca="right">
                        <p>119</p>
                     </c>
                     <c ca="right">
                        <p>124</p>
                     </c>
                     <c ca="right">
                        <p>158</p>
                     </c>
                     <c ca="right">
                        <p>131</p>
                     </c>
                     <c ca="right">
                        <p>128</p>
                     </c>
                     <c ca="right">
                        <p>118</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Median insert length (bp)</p>
                     </c>
                     <c ca="right">
                        <p>310</p>
                     </c>
                     <c ca="right">
                        <p>250</p>
                     </c>
                     <c ca="right">
                        <p>315</p>
                     </c>
                     <c ca="right">
                        <p>191</p>
                     </c>
                     <c ca="right">
                        <p>360</p>
                     </c>
                     <c ca="right">
                        <p>290</p>
                     </c>
                     <c ca="right">
                        <p>192</p>
                     </c>
                     <c ca="right">
                        <p>237</p>
                     </c>
                     <c ca="right">
                        <p>167</p>
                     </c>
                     <c ca="right">
                        <p>179</p>
                     </c>
                     <c ca="right">
                        <p>215</p>
                     </c>
                     <c ca="right">
                        <p>260</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Mean insert length (bp)</p>
                     </c>
                     <c ca="right">
                        <p>2,076</p>
                     </c>
                     <c ca="right">
                        <p>1,199</p>
                     </c>
                     <c ca="right">
                        <p>2,041</p>
                     </c>
                     <c ca="right">
                        <p>1,248</p>
                     </c>
                     <c ca="right">
                        <p>1,245</p>
                     </c>
                     <c ca="right">
                        <p>961</p>
                     </c>
                     <c ca="right">
                        <p>1,419</p>
                     </c>
                     <c ca="right">
                        <p>1,408</p>
                     </c>
                     <c ca="right">
                        <p>879</p>
                     </c>
                     <c ca="right">
                        <p>1,274</p>
                     </c>
                     <c ca="right">
                        <p>959</p>
                     </c>
                     <c ca="right">
                        <p>1,869</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Max insert length (bp)</p>
                     </c>
                     <c ca="right">
                        <p>55,275</p>
                     </c>
                     <c ca="right">
                        <p>13,119</p>
                     </c>
                     <c ca="right">
                        <p>53,044</p>
                     </c>
                     <c ca="right">
                        <p>15,789</p>
                     </c>
                     <c ca="right">
                        <p>20,222</p>
                     </c>
                     <c ca="right">
                        <p>9,796</p>
                     </c>
                     <c ca="right">
                        <p>28,306</p>
                     </c>
                     <c ca="right">
                        <p>32,587</p>
                     </c>
                     <c ca="right">
                        <p>11,085</p>
                     </c>
                     <c ca="right">
                        <p>14,983</p>
                     </c>
                     <c ca="right">
                        <p>10,810</p>
                     </c>
                     <c ca="right">
                        <p>58,706</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Total insert length (bp)</p>
                     </c>
                     <c ca="right">
                        <p>244,946</p>
                     </c>
                     <c ca="right">
                        <p>128,290</p>
                     </c>
                     <c ca="right">
                        <p>234,704</p>
                     </c>
                     <c ca="right">
                        <p>173,459</p>
                     </c>
                     <c ca="right">
                        <p>169,310</p>
                     </c>
                     <c ca="right">
                        <p>130,683</p>
                     </c>
                     <c ca="right">
                        <p>168,840</p>
                     </c>
                     <c ca="right">
                        <p>174,636</p>
                     </c>
                     <c ca="right">
                        <p>138,906</p>
                     </c>
                     <c ca="right">
                        <p>166,923</p>
                     </c>
                     <c ca="right">
                        <p>122,721</p>
                     </c>
                     <c ca="right">
                        <p>220,535</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Number of deletions</p>
                     </c>
                     <c ca="right">
                        <p>120</p>
                     </c>
                     <c ca="right">
                        <p>100</p>
                     </c>
                     <c ca="right">
                        <p>106</p>
                     </c>
                     <c ca="right">
                        <p>178</p>
                     </c>
                     <c ca="right">
                        <p>129</p>
                     </c>
                     <c ca="right">
                        <p>110</p>
                     </c>
                     <c ca="right">
                        <p>158</p>
                     </c>
                     <c ca="right">
                        <p>169</p>
                     </c>
                     <c ca="right">
                        <p>213</p>
                     </c>
                     <c ca="right">
                        <p>172</p>
                     </c>
                     <c ca="right">
                        <p>156</p>
                     </c>
                     <c ca="right">
                        <p>159</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Median deleted length (bp)</p>
                     </c>
                     <c ca="right">
                        <p>276</p>
                     </c>
                     <c ca="right">
                        <p>268</p>
                     </c>
                     <c ca="right">
                        <p>359</p>
                     </c>
                     <c ca="right">
                        <p>274</p>
                     </c>
                     <c ca="right">
                        <p>288</p>
                     </c>
                     <c ca="right">
                        <p>264</p>
                     </c>
                     <c ca="right">
                        <p>195</p>
                     </c>
                     <c ca="right">
                        <p>205</p>
                     </c>
                     <c ca="right">
                        <p>246</p>
                     </c>
                     <c ca="right">
                        <p>317</p>
                     </c>
                     <c ca="right">
                        <p>357</p>
                     </c>
                     <c ca="right">
                        <p>340</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Mean deleted length (bp)</p>
                     </c>
                     <c ca="right">
                        <p>1,254</p>
                     </c>
                     <c ca="right">
                        <p>1,354</p>
                     </c>
                     <c ca="right">
                        <p>1,128</p>
                     </c>
                     <c ca="right">
                        <p>900</p>
                     </c>
                     <c ca="right">
                        <p>1,339</p>
                     </c>
                     <c ca="right">
                        <p>1,340</p>
                     </c>
                     <c ca="right">
                        <p>816</p>
                     </c>
                     <c ca="right">
                        <p>874</p>
                     </c>
                     <c ca="right">
                        <p>708</p>
                     </c>
                     <c ca="right">
                        <p>990</p>
                     </c>
                     <c ca="right">
                        <p>898</p>
                     </c>
                     <c ca="right">
                        <p>938</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Max deleted length (bp)</p>
                     </c>
                     <c ca="right">
                        <p>41,022</p>
                     </c>
                     <c ca="right">
                        <p>34,677</p>
                     </c>
                     <c ca="right">
                        <p>41,021</p>
                     </c>
                     <c ca="right">
                        <p>17,858</p>
                     </c>
                     <c ca="right">
                        <p>38,501</p>
                     </c>
                     <c ca="right">
                        <p>33,544</p>
                     </c>
                     <c ca="right">
                        <p>38,506</p>
                     </c>
                     <c ca="right">
                        <p>38,367</p>
                     </c>
                     <c ca="right">
                        <p>41,021</p>
                     </c>
                     <c ca="right">
                        <p>41,022</p>
                     </c>
                     <c ca="right">
                        <p>41,021</p>
                     </c>
                     <c ca="right">
                        <p>41,022</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Total deleted length (bp)</p>
                     </c>
                     <c ca="right">
                        <p>150,491</p>
                     </c>
                     <c ca="right">
                        <p>135,377</p>
                     </c>
                     <c ca="right">
                        <p>119,612</p>
                     </c>
                     <c ca="right">
                        <p>160,262</p>
                     </c>
                     <c ca="right">
                        <p>172,723</p>
                     </c>
                     <c ca="right">
                        <p>147,451</p>
                     </c>
                     <c ca="right">
                        <p>128,936</p>
                     </c>
                     <c ca="right">
                        <p>147,689</p>
                     </c>
                     <c ca="right">
                        <p>150,857</p>
                     </c>
                     <c ca="right">
                        <p>170,262</p>
                     </c>
                     <c ca="right">
                        <p>140,021</p>
                     </c>
                     <c ca="right">
                        <p>149,079</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>All results are quantified with respect to Rd KW20.</p>
               </tblfn>
            </tbl>
            <fig id="F7">
               <title>
                  <p>Figure 7</p>
               </title>
               <caption>
                  <p>A multi-sequence alignment using 86-028NP as a reference shows varying degrees of homology among 6 strains to a 50 kb region homologous to the plasmid ICEhin1056</p>
               </caption>
               <text>
                  <p>A multi-sequence alignment using 86-028NP as a reference shows varying degrees of homology among 6 strains to a 50 kb region homologous to the plasmid ICEhin1056. The plasmid is integrated in 86-028NP and is partially present in R2866, but absent from the other strains in the alignment. Sequences present in other strains without homology to 86-028NP are not shown.</p>
               </text>
               <graphic file="gb-2007-8-6-r103-7"/>
            </fig>
            <fig id="F8">
               <title>
                  <p>Figure 8</p>
               </title>
               <caption>
                  <p>A 40 kb region present in Rd KW20 shows two blocks of genomic variation among other strains</p>
               </caption>
               <text>
                  <p>A 40 kb region present in Rd KW20 shows two blocks of genomic variation among other strains. The upstream block is bounded on the right by a frame-shifted insertion sequence (IS) element (HI1018). The downstream block (HI1024-HI1032) includes genes with likely roles in sugar transport and metabolism. Rd is used as a reference for the alignment, and sequence present in other strains without homology to Rd is not shown.</p>
               </text>
               <graphic file="gb-2007-8-6-r103-8"/>
            </fig>
            <fig id="F9">
               <title>
                  <p>Figure 9</p>
               </title>
               <caption>
                  <p>A 20 kb region that demonstrates strain diversity at the level of an individual gene (lic2C), a pair of genes (NTHi0683/4), and a group of seven functionally related genes (urease system)</p>
               </caption>
               <text>
                  <p>A 20 kb region that demonstrates strain diversity at the level of an individual gene (lic2C), a pair of genes (NTHi0683/4), and a group of seven functionally related genes (urease system). 86-028NP is used as a reference for the alignment, and sequence present in other strains without homology to 86-028NP is not shown.</p>
               </text>
               <graphic file="gb-2007-8-6-r103-9"/>
            </fig>
            <p>Global genomic alignments of PittEE against R2846 and R2866 were performed (Figures <figr fid="F10">10</figr> and <figr fid="F11">11</figr>). PittEE and R2846 are very similar at the global level and this is reinforced by the gene cluster analysis, which revealed only 96 genic differences. In contrast, R2866 has a large inversion and several large insertions and deletions with respect to PittEE. This diversity at the global level corresponds to the 377 genic differences identified between these two strains by cluster analysis (Figure <figr fid="F3">3</figr>). Global alignments were not visualized for most strains since the ordering of the contigs had not been determined.</p>
            <fig id="F10">
               <title>
                  <p>Figure 10</p>
               </title>
               <caption>
                  <p>A global alignment of R2846 and PittEE as visualized by Mummerplot</p>
               </caption>
               <text>
                  <p>A global alignment of R2846 and PittEE as visualized by Mummerplot. A point is placed at coordinate (x,y) if position x in R2846 matches position y in PittEE. Green points indicate reverse complement matches. PittEE and R2846 are similar at the global level.</p>
               </text>
               <graphic file="gb-2007-8-6-r103-10"/>
            </fig>
            <fig id="F11">
               <title>
                  <p>Figure 11</p>
               </title>
               <caption>
                  <p>Global alignment of R2866 and PittEE shows a large inversion and several regions unique to each strain</p>
               </caption>
               <text>
                  <p>Global alignment of R2866 and PittEE shows a large inversion and several regions unique to each strain. The strains are similar across the majority of the genome; however, there is one large inversion as well as several regions unique to each strain.</p>
               </text>
               <graphic file="gb-2007-8-6-r103-11"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Codon usage analysis</p>
            </st>
            <p>The codon usage of each gene cluster was compared to the typical <it>H. influenzae </it>codon usage pattern by the epsilon score calculated by CodeSquare <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. A low epsilon score indicates that a gene's codon usage is similar to typical patterns of the organism, while a high score indicates atypical codon usage. Since the epsilon score is partially dependent on the length of a coding sequence, all scores were normalized by length. The average normalized score is 0, and low values continue to indicate typical codon usage. Figure <figr fid="F12">12</figr> is a scatter plot of the normalized epsilon scores versus the number of strains in which the gene was found. The range of normalized epsilon values is similar for core, distributed, and unique genes, though the median values are slightly higher for distributed and unique genes (Tables <tblr tid="T6">6</tblr> and <tblr tid="T7">7</tblr>). The Mann-Whitney U-test was employed to determine the significance of this difference. To eliminate any remaining length bias, only genes with lengths of 200-300 amino acids were analyzed. The median normalized epsilon value of core genes is significantly smaller than the medians of distributed and unique genes, which suggests that non-core genes are more likely than core genes to have foreign origins. Interestingly, there is no significant difference between distributed and unique genes, and most of these non-core genes display typical <it>H. influenzae </it>codon usage.</p>
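            <p>The rank-based comparison described above can be illustrated with a minimal sketch of the Mann-Whitney U statistic. It uses the normal approximation for the two-sided P value and omits the tie correction in the variance. The function names and the two small score lists are hypothetical and are not taken from the study data; the software actually used for the published P values is not stated here.</p>

```python
import math

def midranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks

def mann_whitney_u(group1, group2):
    """Return (U for group1, two-sided P value from the normal approximation)."""
    n1, n2 = len(group1), len(group2)
    ranks = midranks(list(group1) + list(group2))
    rank_sum1 = sum(ranks[:n1])
    u1 = rank_sum1 - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    # Variance without tie correction (an assumption made for brevity).
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return u1, p_two_sided

# Hypothetical normalized epsilon scores, for illustration only.
core_scores = [-0.9, -0.7, -0.6, -0.5, -0.4]
unique_scores = [-0.1, 0.0, 0.2, 0.3, 0.5]

if __name__ == "__main__":
    u_value, p_value = mann_whitney_u(core_scores, unique_scores)
    print(u_value, p_value)
```

            <p>For samples as small as the hypothetical lists above, an exact test would normally be preferred; the normal approximation is more appropriate for the hundreds of genes in each group.</p>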
            <tbl id="T6">
               <title>
                  <p>Table 6</p>
               </title>
               <caption>
                  <p>Codon usage comparisons of core, contingency and unique genes</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="left">
                        <p>Group 1</p>
                     </c>
                     <c ca="left">
                        <p>Group 2</p>
                     </c>
                     <c ca="center">
                        <p><it>P </it>value</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Core</p>
                     </c>
                     <c ca="left">
                        <p>Unique</p>
                     </c>
                     <c ca="center">
                        <p>5.34E-16</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Core</p>
                     </c>
                     <c ca="left">
                        <p>Distributed</p>
                     </c>
                     <c ca="center">
                        <p>4.95E-16</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Core</p>
                     </c>
                     <c ca="left">
                        <p>Non-core</p>
                     </c>
                     <c ca="center">
                        <p>6.55E-25</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Contingency</p>
                     </c>
                     <c ca="left">
                        <p>Unique</p>
                     </c>
                     <c ca="center">
                        <p>0.17</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Mann-Whitney U-test for significant differences in the median epsilon scores of each pair of gene groups. Only genes with a protein coding length of 200-300 amino acids were tested to minimize length bias. The median epsilon score of core genes differs significantly from that of each of the other gene groups, whereas contingency and unique genes do not differ significantly (<it>P </it>= 0.17).</p>
               </tblfn>
            </tbl>
            <tbl id="T7">
               <title>
                  <p>Table 7</p>
               </title>
               <caption>
                  <p>Codon usage comparison of core, contingency and unique genes</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="left">
                        <p>Group</p>
                     </c>
                     <c ca="center">
                        <p>Median epsilon</p>
                     </c>
                     <c ca="center">
                        <p>Median length (amino acids)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Core</p>
                     </c>
                     <c ca="center">
                        <p>-0.57</p>
                     </c>
                     <c ca="center">
                        <p>243</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Contingency</p>
                     </c>
                     <c ca="center">
                        <p>-0.01</p>
                     </c>
                     <c ca="center">
                        <p>252</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Unique</p>
                     </c>
                     <c ca="center">
                        <p>0.16</p>
                     </c>
                     <c ca="center">
                        <p>248</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Median epsilon scores and protein coding length for each category of genes (includes genes of all lengths).</p>
               </tblfn>
            </tbl>
            <fig id="F12">
               <title>
                  <p>Figure 12</p>
               </title>
               <caption>
                  <p>Codon usage of genes is quantified by a normalized epsilon score [26]</p>
               </caption>
               <text>
                  <p>Codon usage of genes is quantified by a normalized epsilon score [26]. Low epsilon scores indicate that a gene's codon usage is similar to the typical <it>H. influenzae </it>codon usage pattern. The range of epsilon scores is similar for all three classes of genes: unique, distributed and core. However, the median scores are significantly different among the classes. The observation that the distributions for non-core genes overlap with the core genes suggests that many of the non-core genes have been evolving in the same pool with the core genes.</p>
               </text>
               <graphic file="gb-2007-8-6-r103-12"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Phage homology analysis</p>
            </st>
            <p>Phage insertion is a common origin of genomic diversity. The influence of phage was quantified by a homology search between all gene clusters and the NCBI NT database. A gene cluster was said to be 'phage associated' if one of the top ten significant matches was annotated as a sequence of phage origin. Overall, 9.3% of gene clusters were phage associated. The distribution of these genes is not uniform among core and non-core genes. Only 0.3% of core genes were phage associated, while 14.6% and 25.8% of distributed and unique genes, respectively, were phage associated (Table <tblr tid="T8">8</tblr>).</p>
            <tbl id="T8">
               <title>
                  <p>Table 8</p>
               </title>
               <caption>
                  <p>Percentage of genes with probable phage origin per category</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="left">
                        <p>Category</p>
                     </c>
                     <c ca="center">
                        <p>Total genes</p>
                     </c>
                     <c ca="center">
                        <p>Phage derived</p>
                     </c>
                     <c ca="center">
                        <p>Percent phage</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Unique genes (1 strain)</p>
                     </c>
                     <c ca="center">
                        <p>539</p>
                     </c>
                     <c ca="center">
                        <p>139</p>
                     </c>
                     <c ca="center">
                        <p>25.8%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Distributed genes (2-12 strains)</p>
                     </c>
                     <c ca="center">
                        <p>786</p>
                     </c>
                     <c ca="center">
                        <p>115</p>
                     </c>
                     <c ca="center">
                        <p>14.6%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Core genes (all strains)</p>
                     </c>
                     <c ca="center">
                        <p>1,461</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>0.3%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Totals</p>
                     </c>
                     <c ca="center">
                        <p>2,786</p>
                     </c>
                     <c ca="center">
                        <p>258</p>
                     </c>
                     <c ca="center">
                        <p>9.26%</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
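            <p>The percentages in Table 8 follow directly from the listed counts. The short sketch below recomputes them; the dictionary layout and function names are illustrative assumptions, while the counts are transcribed from the table.</p>

```python
# Counts transcribed from Table 8: category -> (total gene clusters, phage associated).
TABLE_8 = {
    "unique": (539, 139),
    "distributed": (786, 115),
    "core": (1461, 4),
}

def percent_phage(total, phage):
    """Percentage of gene clusters in a category that are phage associated."""
    return 100.0 * phage / total

def summarize(table):
    """Return per-category percentages plus the pooled 'total' row."""
    summary = {name: percent_phage(total, phage) for name, (total, phage) in table.items()}
    all_total = sum(total for total, _ in table.values())
    all_phage = sum(phage for _, phage in table.values())
    summary["total"] = percent_phage(all_total, all_phage)
    return summary

if __name__ == "__main__":
    for name, value in summarize(TABLE_8).items():
        print(f"{name}: {value:.2f}%")
```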
         </sec>
         <sec>
            <st>
               <p>Development of a finite supragenome model</p>
            </st>
            <p>The comparative genomic data presented above support the DGH and reinforce the concept that, at the species level, there is an <it>H. influenzae </it>supragenome that is much larger than the genome of any single strain, and hence many strains must be sequenced to generate an accurate picture of the species supragenome. Among the questions we may ask about the supragenome, the most obvious is: how many strains must be sequenced to observe all (or nearly all) of the supragenome? The problem is similar to determining the read coverage necessary to sequence an entire individual genome using a random shotgun library approach. Lander-Waterman statistics provide an answer in the latter case by using the assumption that reads are independently and randomly sampled from the genome with equal probability. Previously, Tettelin <it>et al</it>. <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> developed a supragenome model for <it>S. agalactiae </it>that, like Lander-Waterman statistics, is based on the assumption that contingency genes are independently sampled from the supragenome with equal probability, except in the case of rare genes, which are modeled as unique events that appear only once in the entire global population. The model requires four parameters: the number of core genes, the number of contingency genes, the probability of finding a contingency gene, and the expected number of 'unique' genes found per strain. This model predicted that the supragenome of <it>S. agalactiae </it>is infinite in size (that is, the expected number of unique genes found in each strain is non-zero). While the model is an insightful attack on the problem, we question the assumption that contingency genes are sampled in the population with equal probability. It is important to compare the existing model against a new model that does not rely on this assumption.</p>
            <p>The supragenome is represented here by a generative model that emits genomes according to a set of probabilistic rules. The supragenome contains <it>N </it>genes that are modeled as Bernoulli random variables with 'success' probabilities that correspond to the population frequency of each gene. A genome is generated by observing the Bernoulli variables: a gene is present if the corresponding trial is a success and otherwise absent. Each gene variable is assumed to be independent of all other genes. This assumption is sometimes violated in real <it>H. influenzae </it>genomes. For example, genomic islands are sets of genes that are not independent. However, we proceed with this assumption since it significantly reduces the complexity of the model and is reasonable in many cases.</p>
            <p>The true population frequencies are, in general, unknown. Therefore, population frequencies are also treated in a probabilistic fashion. It is assumed that there are <it>K </it>discrete classes of genes. Each class <it>k </it>has an associated population frequency, &#956;<sub>k</sub>. All genes in class <it>k </it>will have population frequency &#956;<sub>k</sub>. Each of the <it>N </it>genes is assigned to a class according to a probability distribution given by the vector &#960;, where &#960;<sub>k </sub>is the probability that a gene is assigned to class <it>k</it>. Conceptually, &#960;<sub>k </sub>is the percentage of genes in the supragenome that have population frequency &#956;<sub>k</sub>. The assignment of a gene to a class is independent of all other gene assignments.</p>
            <p>The complete model is depicted in plate notation in Figure <figr fid="F13">13</figr>. 'Z' is the hidden class variable in which <it>z</it><sub>n </sub>corresponds to the class of gene <it>n</it>. 'X' is the observed gene variable, where <it>x</it><sub>n,s </sub>corresponds to the presence or absence of gene <it>n </it>in strain <it>s</it>. The outer plate represents the supragenome, while the inner plate represents instances of specific genomes. The model requires 2 &#215; <it>K </it>+ 2 parameters: <it>N</it>, <it>K</it>, a mixture coefficient &#960;<sub>k </sub>for each class, and a Bernoulli probability &#956;<sub>k </sub>for each class. The number of gene classes, <it>K</it>, and their associated Bernoulli probabilities, &#956;<sub>k</sub>, are fixed in advance. Care must be taken to choose classes that represent low and high population frequencies. Seven classes were selected for this study (<it>K </it>= 7) with associated probabilities &#956; = &lt;0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0&gt;. The class with probability 1.0 represents 'core' genes that appear in all strains.</p>
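            <p>The generative process described above can be sketched as a short simulation: each gene is assigned a class according to the mixture coefficients, and its presence in each strain is then drawn as an independent Bernoulli trial with the class probability. The class probabilities are those listed in the text. The mixture coefficients, the number of genes, and all function names below are hypothetical placeholders chosen for illustration, not fitted values from this study.</p>

```python
import random

# Bernoulli probabilities for the seven classes, as listed in the text.
MU = [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]

# Hypothetical mixture coefficients (must sum to 1); NOT the fitted values.
PI = [0.3, 0.1, 0.05, 0.05, 0.05, 0.05, 0.4]

def sample_supragenome(n_genes, pi, mu, n_strains, rng):
    """Draw a class z_n for each gene, then presence/absence x_{n,s} per strain."""
    classes = rng.choices(range(len(mu)), weights=pi, k=n_genes)
    # A gene of class z is present in a strain with probability mu[z].
    presence = [[mu[z] > rng.random() for _ in range(n_strains)] for z in classes]
    return classes, presence

def observed_gene_count(presence):
    """Number of genes seen in at least one of the sampled strains."""
    return sum(1 for row in presence if any(row))

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    classes, presence = sample_supragenome(1000, PI, MU, 13, rng)
    print(observed_gene_count(presence), "of", len(classes), "genes observed")
```

            <p>Because genes in the rarest class are present with probability 0.01, many of them remain unobserved in a sample of 13 strains, which is why the total number of genes must be estimated and cannot simply be counted.</p>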
            <fig id="F13">
               <title>
                  <p>Figure 13</p>
               </title>
               <caption>
                  <p>A plate diagram of the <it>H. influenzae </it>supragenome model</p>
               </caption>
               <text>
                  <p>A plate diagram of the <it>H. influenzae </it>supragenome model. Each node in the diagram represents a random variable, and the arrows indicate dependence between the variables. Independent, identically distributed (IID) nodes appear in boxes with an index listed in the corner.</p>
               </text>
               <graphic file="gb-2007-8-6-r103-13"/>
            </fig>
            <p>The remaining parameters, <it>N </it>and &#960;<sub>k</sub>, are selected under a maximum likelihood scheme. Suppose that |<it>S</it>| genomes have been sequenced and a particular gene from class <it>k </it>was observed in <it>n </it>of the |<it>S</it>| strains. The probability of this observation is given by a binomial probability since the count is a sum of independent Bernoulli variables. Conditional on the gene belonging to class <it>k</it>, the probability is given by:</p>
            <p>
               <display-formula>
                  <m:math name="gb-2007-8-6-r103-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>P</m:mi>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mrow>
                                 <m:mi>x</m:mi>
                                 <m:mo>=</m:mo>
                                 <m:mi>n</m:mi>
                                 <m:mo>|</m:mo>
                                 <m:mi>z</m:mi>
                                 <m:mo>=</m:mo>
                                 <m:mi>k</m:mi>
                                 <m:mo>,</m:mo>
                                 <m:msub>
                                    <m:mi>&#956;</m:mi>
                                    <m:mi>k</m:mi>
                                 </m:msub>
                              </m:mrow>
                              <m:mo>)</m:mo>
                           </m:mrow>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mrow>
                                    <m:mo>|</m:mo>
                                    <m:mi>S</m:mi>
                                    <m:mo>|</m:mo>
                                 </m:mrow>
                                 <m:mo>!</m:mo>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>n</m:mi>
                                 <m:mo>!</m:mo>
                                 <m:mrow>
                                    <m:mo>(</m:mo>
                                    <m:mrow>
                                       <m:mrow>
                                          <m:mo>|</m:mo>
                                          <m:mi>S</m:mi>
                                          <m:mo>|</m:mo>
                                       </m:mrow>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mi>n</m:mi>
                                    </m:mrow>
                                    <m:mo>)</m:mo>
                                 </m:mrow>
                                 <m:mo>!</m:mo>
                              </m:mrow>
                           </m:mfrac>
                           <m:msubsup>
                              <m:mi>&#956;</m:mi>
                              <m:mi>k</m:mi>
                              <m:mi>n</m:mi>
                           </m:msubsup>
                           <m:msup>
                              <m:mrow>
                                 <m:mrow>
                                    <m:mo>(</m:mo>
                                    <m:mrow>
                                       <m:mn>1</m:mn>
                                       <m:mo>&#8722;</m:mo>
                                       <m:msub>
                                          <m:mi>&#956;</m:mi>
                                          <m:mi>k</m:mi>
                                       </m:msub>
                                    </m:mrow>
                                    <m:mo>)</m:mo>
                                 </m:mrow>
                              </m:mrow>
                              <m:mrow>
                                 <m:mrow>
                                    <m:mo>|</m:mo>
                                    <m:mi>S</m:mi>
                                    <m:mo>|</m:mo>
                                 </m:mrow>
                                 <m:mo>&#8722;</m:mo>
                                 <m:mi>n</m:mi>
                              </m:mrow>
                           </m:msup>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaacaWGqbWaaeWaaeaacaWG4bGaeyypa0JaamOBaiaacYhacaWG6bGaeyypa0Jaam4AaiaacYcacqaH8oqBdaWgaaWcbaGaam4AaaqabaaakiaawIcacaGLPaaacqGH9aqpdaWcaaqaamaaemaabaGaam4uaaGaay5bSlaawIa7aiaacgcaaeaacaWGUbGaaiyiamaabmaabaWaaqWaaeaacaWGtbaacaGLhWUaayjcSdGaeyOeI0IaamOBaaGaayjkaiaawMcaaiaacgcaaaGaeqiVd02aa0baaSqaaiaadUgaaeaacaWGUbaaaOWaaeWaaeaacaaIXaGaeyOeI0IaeqiVd02aaSbaaSqaaiaadUgaaeqaaaGccaGLOaGaayzkaaWaaWbaaSqabeaadaabdaqaaiaadofaaiaawEa7caGLiWoacqGHsislcaWGUbaaaaaa@5F3B@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
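            <p>The binomial expression above can be evaluated directly. The sketch below does so for a gene of a given class; the function name is an illustrative assumption, and the choice of 13 strains in the example reflects Rd plus the 12 clinical strains named in the title.</p>

```python
import math

def binomial_pmf(n, n_strains, mu_k):
    """P(x = n | z = k, mu_k): a class-k gene is observed in n of |S| strains."""
    return math.comb(n_strains, n) * mu_k ** n * (1.0 - mu_k) ** (n_strains - n)

if __name__ == "__main__":
    n_strains = 13  # Rd plus 12 clinical strains
    for mu_k in (0.01, 0.5, 1.0):
        row = [binomial_pmf(n, n_strains, mu_k) for n in range(n_strains + 1)]
        print(mu_k, ["%.4f" % p for p in row])
```

            <p>For the core class (probability 1.0) all of the probability mass falls on observing the gene in every strain, whereas for the rarest class most of the mass falls on observing it in no strain at all.</p>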
            <p>However, we do not know the true gene class, so we must consider a mixture of binomial probabilities:</p>
            <p>
               <display-formula>
                  <m:math name="gb-2007-8-6-r103-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>P</m:mi>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mrow>
                                 <m:mi>x</m:mi>
                                 <m:mo>=</m:mo>
                                 <m:mi>n</m:mi>
                                 <m:mo>|</m:mo>
                                 <m:mover accent="true">
                                    <m:mi>&#960;</m:mi>
                                    <m:mo>&#8594;</m:mo>
                                 </m:mover>
                                 <m:mo>,</m:mo>
                                 <m:mover accent="true">
                                    <m:mi>&#956;</m:mi>
                                    <m:mo>&#8594;</m:mo>
                                 </m:mover>
                              </m:mrow>
                              <m:mo>)</m:mo>
                           </m:mrow>
                           <m:mo>=</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>k</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mi>K</m:mi>
                              </m:munderover>
                              <m:mrow>
                                 <m:mi>P</m:mi>
                                 <m:mrow>
                                    <m:mo>(</m:mo>
                                    <m:mrow>
                                       <m:mi>x</m:mi>
                                       <m:mo>=</m:mo>
                                       <m:mi>n</m:mi>
                                       <m:mo>|</m:mo>
                                       <m:mi>z</m:mi>
                                       <m:mo>=</m:mo>
                                       <m:mi>k</m:mi>
                                       <m:mo>,</m:mo>
                                       <m:msub>
                                          <m:mi>&#956;</m:mi>
                                          <m:mi>k</m:mi>
                                       </m:msub>
                                    </m:mrow>
                                    <m:mo>)</m:mo>
                                 </m:mrow>
                                 <m:mo>&#8901;</m:mo>
                                 <m:mi>P</m:mi>
                                 <m:mrow>
                                    <m:mo>(</m:mo>
                                    <m:mrow>
                                       <m:mi>z</m:mi>
                                       <m:mo>=</m:mo>
                                       <m:mi>k</m:mi>
                                       <m:mo>|</m:mo>
                                       <m:msub>
                                          <m:mi>&#960;</m:mi>
                                          <m:mi>k</m:mi>
                                       </m:msub>
                                    </m:mrow>
                                    <m:mo>)</m:mo>
                                 </m:mrow>
                              </m:mrow>
                           </m:mstyle>
                           <m:mo>=</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>k</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mi>K</m:mi>
                              </m:munderover>
                              <m:mrow>
                                 <m:msub>
                                    <m:mi>&#960;</m:mi>
                                    <m:mi>k</m:mi>
                                 </m:msub>
                                 <m:mfrac>
                                    <m:mrow>
                                       <m:mrow>
                                          <m:mo>|</m:mo>
                                          <m:mi>S</m:mi>
                                          <m:mo>|</m:mo>
                                       </m:mrow>
                                       <m:mo>!</m:mo>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>n</m:mi>
                                       <m:mo>!</m:mo>
                                       <m:mrow>
                                          <m:mo>(</m:mo>
                                          <m:mrow>
                                             <m:mrow>
                                                <m:mo>|</m:mo>
                                                <m:mi>S</m:mi>
                                                <m:mo>|</m:mo>
                                             </m:mrow>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mi>n</m:mi>
                                          </m:mrow>
                                          <m:mo>)</m:mo>
                                       </m:mrow>
                                       <m:mo>!</m:mo>
                                    </m:mrow>
                                 </m:mfrac>
                              </m:mrow>
                           </m:mstyle>
                           <m:msubsup>
                              <m:mi>&#956;</m:mi>
                              <m:mi>k</m:mi>
                              <m:mi>n</m:mi>
                           </m:msubsup>
                           <m:msup>
                              <m:mrow>
                                 <m:mrow>
                                    <m:mo>(</m:mo>
                                    <m:mrow>
                                       <m:mn>1</m:mn>
                                       <m:mo>&#8722;</m:mo>
                                       <m:msub>
                                          <m:mi>&#956;</m:mi>
                                          <m:mi>k</m:mi>
                                       </m:msub>
                                    </m:mrow>
                                    <m:mo>)</m:mo>
                                 </m:mrow>
                              </m:mrow>
                              <m:mrow>
                                 <m:mrow>
                                    <m:mo>|</m:mo>
                                    <m:mi>S</m:mi>
                                    <m:mo>|</m:mo>
                                 </m:mrow>
                                 <m:mo>&#8722;</m:mo>
                                 <m:mi>n</m:mi>
                              </m:mrow>
                           </m:msup>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaacaWGqbWaaeWaaeaacaWG4bGaeyypa0JaamOBaiaacYhacuaHapaCgaWcaiaacYcacuaH8oqBgaWcaaGaayjkaiaawMcaaiabg2da9maaqahabaGaamiuamaabmaabaGaamiEaiabg2da9iaad6gacaGG8bGaamOEaiabg2da9iaadUgacaGGSaGaeqiVd02aaSbaaSqaaiaadUgaaeqaaaGccaGLOaGaayzkaaGaeyyXICTaamiuamaabmaabaGaamOEaiabg2da9iaadUgacaGG8bGaeqiWda3aaSbaaSqaaiaadUgaaeqaaaGccaGLOaGaayzkaaaaleaacaWGRbGaeyypa0JaaGymaaqaaiaadUeaa0GaeyyeIuoakiabg2da9maaqahabaGaeqiWda3aaSbaaSqaaiaadUgaaeqaaOWaaSaaaeaadaabdaqaaiaadofaaiaawEa7caGLiWoacaGGHaaabaGaamOBaiaacgcadaqadaqaamaaemaabaGaam4uaaGaay5bSlaawIa7aiabgkHiTiaad6gaaiaawIcacaGLPaaacaGGHaaaaaWcbaGaam4Aaiabg2da9iaaigdaaeaacaWGlbaaniabggHiLdGccqaH8oqBdaqhaaWcbaGaam4Aaaqaaiaad6gaaaGcdaqadaqaaiaaigdacqGHsislcqaH8oqBdaWgaaWcbaGaam4AaaqabaaakiaawIcacaGLPaaadaahaaWcbeqaamaaemaabaGaam4uaaGaay5bSlaawIa7aiabgkHiTiaad6gaaaaaaa@84D9@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
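            <p>As a minimal sketch, the mixture above can be evaluated numerically as follows. The function names <it>binomial_pmf</it> and <it>mixture_pmf</it>, and the values chosen for &#960; and &#956;, are assumptions made for illustration only; they are not estimates or code from this study.</p>

```python
from math import comb


def binomial_pmf(n, num_strains, mu):
    """P(x = n | z = k, mu_k): a gene whose class has population
    frequency mu is found in exactly n of num_strains strains."""
    return comb(num_strains, n) * mu**n * (1.0 - mu) ** (num_strains - n)


def mixture_pmf(n, num_strains, pi, mu):
    """P(x = n | pi, mu): class-weighted sum of binomial probabilities."""
    return sum(p_k * binomial_pmf(n, num_strains, mu_k)
               for p_k, mu_k in zip(pi, mu))


# Hypothetical illustration values (assumptions, not fitted parameters).
NUM_STRAINS = 13          # |S|, the number of sequenced strains
PI = [0.5, 0.3, 0.2]      # class probabilities pi_k, summing to 1
MU = [0.1, 0.5, 1.0]      # class population frequencies mu_k

# Probability of seeing a gene in exactly n strains, for n = 0 .. |S|.
probs = [mixture_pmf(n, NUM_STRAINS, PI, MU) for n in range(NUM_STRAINS + 1)]
print(sum(probs))  # the probabilities over n should sum to about 1
```

            <p>Because each binomial term sums to one over <it>n</it>, and the &#960;<sub><it>k</it></sub> sum to one, the mixture is itself a probability distribution over <it>n </it>= 0, ..., |<it>S</it>|.</p>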
            <p>Now consider the complete set of genes. Let c = &lt;<it>c</it><sub>0</sub>, <it>c</it><sub>1</sub>, ..., <it>c</it><sub>|<it>S</it>|</sub>&gt;, where <it>c</it><sub><it>n</it></sub> is the number of genes that appear in exactly <it>n </it>of the |<it>S</it>| strains. The probability of the complete observation is given by a multinomial distribution:</p>
            <p>
               <display-formula>
                  <m:math name="gb-2007-8-6-r103-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mtable>
                              <m:mtr>
                                 <m:mtd>
                                    <m:mrow>
                                       <m:mi>P</m:mi>
                                       <m:mrow>
                                          <m:mo>(</m:mo>
                                          <m:mrow>
                                             <m:mover accent="true">
                                                <m:mi>c</m:mi>
                                                <m:mo>&#8594;</m:mo>
                                             </m:mover>
                                             <m:mo>|</m:mo>
                                             <m:mi>N</m:mi>
                                             <m:mo>,</m:mo>
                                             <m:mover accent="true">
                                                <m:mi>&#960;</m:mi>
                                                <m:mo>&#8594;</m:mo>
                                             </m:mover>
                                             <m:mo>,</m:mo>
                                             <m:mover accent="true">
                                                <m:mi>&#956;</m:mi>
                                                <m:mo>&#8594;</m:mo>
                                             </m:mover>
                                          </m:mrow>
                                          <m:mo>)</m:mo>
                                       </m:mrow>
                                       <m:mo>=</m:mo>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mi>N</m:mi>
                                             <m:mo>!</m:mo>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:msub>
                                                <m:mi>c</m:mi>
                                                <m:mn>0</m:mn>
                                             </m:msub>
                                             <m:mo>!</m:mo>
                                             <m:msub>
                                                <m:mi>c</m:mi>
                                                <m:mn>1</m:mn>
                                             </m:msub>
                                             <m:mo>!</m:mo>
                                             <m:mo>&#8943;</m:mo>
                                             <m:msub>
                                                <m:mi>c</m:mi>
                                                <m:mrow><m:mo>|</m:mo><m:mi>S</m:mi><m:mo>|</m:mo></m:mrow>
                                             </m:msub>
                                             <m:mo>!</m:mo>
                                          </m:mrow>
                                       </m:mfrac>
                                       <m:msup>
                                          <m:mrow>
                                             <m:mstyle displaystyle="true">
                                                <m:munderover>
                                                   <m:mo>&#8719;</m:mo>
                                                   <m:mrow>
                                                      <m:mi>n</m:mi>
                                                      <m:mo>=</m:mo>
                                                      <m:mn>0</m:mn>
                                                   </m:mrow>
                                                   <m:mrow>
                                                      <m:mrow>
                                                         <m:mo>|</m:mo>
                                                         <m:mi>S</m:mi>
                                                         <m:mo>|</m:mo>
                                                      </m:mrow>
                                                   </m:mrow>
                                                </m:munderover>
                                                <m:mrow>
                                                   <m:mi>P</m:mi>
                                                   <m:mrow>
                                                      <m:mo>(</m:mo>
                                                      <m:mrow>
                                                         <m:mi>x</m:mi>
                                                         <m:mo>=</m:mo>
                                                         <m:mi>n</m:mi>
                                                         <m:mo>|</m:mo>
                                                         <m:mover accent="true">
                                                            <m:mi>&#960;</m:mi>
                                                            <m:mo>&#8594;</m:mo>
                                                         </m:mover>
                                                         <m:mo>,</m:mo>
                                                         <m:mover accent="true">
                                                            <m:mi>&#956;</m:mi>
                                                            <m:mo>&#8594;</m:mo>
                                                         </m:mover>
                                                      </m:mrow>
                                                      <m:mo>)</m:mo>
                                                   </m:mrow>
                                                </m:mrow>
                                             </m:mstyle>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:msub>
                                                <m:mi>c</m:mi>
                                                <m:mi>n</m:mi>
                                             </m:msub>
                                          </m:mrow>
                                       </m:msup>
                                    </m:mrow>
                                 </m:mtd>
                              </m:mtr>
                              <m:mtr>
                                 <m:mtd>
                                    <m:mrow>
                                       <m:mo>=</m:mo>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mi>N</m:mi>
                                             <m:mo>!</m:mo>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:msub>
                                                <m:mi>c</m:mi>
                                                <m:mn>0</m:mn>
                                             </m:msub>
                                             <m:mo>!</m:mo>
                                             <m:msub>
                                                <m:mi>c</m:mi>
                                                <m:mn>1</m:mn>
                                             </m:msub>
                                             <m:mo>!</m:mo>
                                             <m:mo>&#8943;</m:mo>
                                             <m:msub>
                                                <m:mi>c</m:mi>
                                                <m:mrow><m:mo>|</m:mo><m:mi>S</m:mi><m:mo>|</m:mo></m:mrow>
                                             </m:msub>
                                             <m:mo>!</m:mo>
                                          </m:mrow>
                                       </m:mfrac>
                                       <m:msup>
                                          <m:mrow>
                                             <m:mstyle displaystyle="true">
                                                <m:munderover>
                                                   <m:mo>&#8719;</m:mo>
                                                   <m:mrow>
                                                      <m:mi>n</m:mi>
                                                      <m:mo>=</m:mo>
                                                      <m:mn>0</m:mn>
                                                   </m:mrow>
                                                   <m:mrow>
                                                      <m:mrow>
                                                         <m:mo>|</m:mo>
                                                         <m:mi>S</m:mi>
                                                         <m:mo>|</m:mo>
                                                      </m:mrow>
                                                   </m:mrow>
                                                </m:munderover>
                                                <m:mrow>
                                                   <m:mrow>
                                                      <m:mo>(</m:mo>
                                                      <m:mrow>
                                                         <m:mstyle displaystyle="true">
                                                            <m:munderover>
                                                               <m:mo>&#8721;</m:mo>
                                                               <m:mrow>
                                                                  <m:mi>k</m:mi>
                                                                  <m:mo>=</m:mo>
                                                                  <m:mn>1</m:mn>
                                                               </m:mrow>
                                                               <m:mi>K</m:mi>
                                                            </m:munderover>
                                                            <m:mrow>
                                                               <m:msub>
                                                                  <m:mi>&#960;</m:mi>
                                                                  <m:mi>k</m:mi>
                                                               </m:msub>
                                                               <m:mfrac>
                                                                  <m:mrow>
                                                                     <m:mrow>
                                                                        <m:mo>|</m:mo>
                                                                        <m:mi>S</m:mi>
                                                                        <m:mo>|</m:mo>
                                                                     </m:mrow>
                                                                     <m:mo>!</m:mo>
                                                                  </m:mrow>
                                                                  <m:mrow>
                                                                     <m:mi>n</m:mi>
                                                                     <m:mo>!</m:mo>
                                                                     <m:mrow>
                                                                        <m:mo>(</m:mo>
                                                                        <m:mrow>
                                                                           <m:mrow>
                                                                              <m:mo>|</m:mo>
                                                                              <m:mi>S</m:mi>
                                                                              <m:mo>|</m:mo>
                                                                           </m:mrow>
                                                                           <m:mo>&#8722;</m:mo>
                                                                           <m:mi>n</m:mi>
                                                                        </m:mrow>
                                                                        <m:mo>)</m:mo>
                                                                     </m:mrow>
                                                                     <m:mo>!</m:mo>
                                                                  </m:mrow>
                                                               </m:mfrac>
                                                            </m:mrow>
                                                         </m:mstyle>
                                                         <m:msubsup>
                                                            <m:mi>&#956;</m:mi>
                                                            <m:mi>k</m:mi>
                                                            <m:mi>n</m:mi>
                                                         </m:msubsup>
                                                         <m:msup>
                                                            <m:mrow>
                                                               <m:mrow>
                                                                  <m:mo>(</m:mo>
                                                                  <m:mrow>
                                                                     <m:mn>1</m:mn>
                                                                     <m:mo>&#8722;</m:mo>
                                                                     <m:msub>
                                                                        <m:mi>&#956;</m:mi>
                                                                        <m:mi>k</m:mi>
                                                                     </m:msub>
                                                                  </m:mrow>
                                                                  <m:mo>)</m:mo>
                                                               </m:mrow>
                                                            </m:mrow>
                                                            <m:mrow>
                                                               <m:mrow>
                                                                  <m:mo>|</m:mo>
                                                                  <m:mi>S</m:mi>
                                                                  <m:mo>|</m:mo>
                                                               </m:mrow>
                                                               <m:mo>&#8722;</m:mo>
                                                               <m:mi>n</m:mi>
                                                            </m:mrow>
                                                         </m:msup>
                                                      </m:mrow>
                                                      <m:mo>)</m:mo>
                                                   </m:mrow>
                                                </m:mrow>
                                             </m:mstyle>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:msub>
                                                <m:mi>c</m:mi>
                                                <m:mi>n</m:mi>
                                             </m:msub>
                                          </m:mrow>
                                       </m:msup>
                                    </m:mrow>
                                 </m:mtd>
                              </m:mtr>
                           </m:mtable>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8tuQ8FMI8Gi=hEeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqadeqadaaakeaafaqadeGabaaabaGaamiuamaabmaabaGabm4yayaalaGaaiiFaiaad6eacaGGSaGafqiWdaNbaSaacaGGSaGafqiVd0MbaSaaaiaawIcacaGLPaaacqGH9aqpdaWcaaqaaiaad6eacaGGHaaabaGaam4yamaaBaaaleaacaaIWaaabeaakiaacgcacaWGJbWaaSbaaSqaaiaaigdaaeqaaOGaaiyiaiabl+UimjaadogadaWgaaWcbaGaam4CaaqabaGccaGGHaaaamaarahabaGaamiCamaabmaabaGaamiEaiabg2da9iaad6gacaGG8bGafqiWdaNbaSaacaGGSaGafqiVd0MbaSaaaiaawIcacaGLPaaaaSqaaiaad6gacqGH9aqpcaaIWaaabaWaaqWaaeaacaWGtbaacaGLhWUaayjcSdaaniabg+GivdGcdaahaaWcbeqaaiaadoeadaWgaaadbaGaamOBaaqabaaaaaGcbaGaeyypa0ZaaSaaaeaacaWGobGaaiyiaaqaaiaadogadaWgaaWcbaGaaGimaaqabaGccaGGHaGaam4yamaaBaaaleaacaaIXaaabeaakiaacgcacqWIVlctcaWGJbWaaSbaaSqaaiaadohaaeqaaOGaaiyiaaaadaqeWbqaamaabmaabaWaaabCaeaacqaHapaCdaWgaaWcbaGaam4AaaqabaGcdaWcaaqaamaaemaabaGaam4uaaGaay5bSlaawIa7aiaacgcaaeaacaWGUbGaaiyiamaabmaabaWaaqWaaeaacaWGtbaacaGLhWUaayjcSdGaeyOeI0IaamOBaaGaayjkaiaawMcaaiaacgcaaaaaleaacaWGRbGaeyypa0JaaGymaaqaaiaadUeaa0GaeyyeIuoakiabeY7aTnaaDaaaleaacaWGRbaabaGaamOBaaaakmaabmaabaGaaGymaiabgkHiTiabeY7aTnaaBaaaleaacaWGRbaabeaaaOGaayjkaiaawMcaamaaCaaaleqabaWaaqWaaeaacaWGtbaacaGLhWUaayjcSdGaeyOeI0IaamOBaaaaaOGaayjkaiaawMcaaaWcbaGaamOBaiabg2da9iaaicdaaeaadaabdaqaaiaadofaaiaawEa7caGLiWoaa0Gaey4dIunakmaaCaaaleqabaGaam4qamaaBaaameaacaWGUbaabeaaaaaaaaaa@9EF7@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
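            <p>The multinomial probability above involves large factorials, so in practice it is convenient to evaluate it in log space. The following is a minimal sketch under stated assumptions: the function names, the toy counts, and the parameter values are hypothetical, and <it>c</it><sub>0</sub> is passed in explicitly, so that <it>N </it>is simply the sum of the counts.</p>

```python
from math import comb, lgamma, log


def mixture_pmf(n, num_strains, pi, mu):
    """P(x = n | pi, mu): class-weighted sum of binomial probabilities."""
    return sum(p_k * comb(num_strains, n) * mu_k**n
               * (1.0 - mu_k) ** (num_strains - n)
               for p_k, mu_k in zip(pi, mu))


def log_multinomial(counts, pi, mu):
    """log P(c | N, pi, mu) for counts = [c_0, ..., c_|S|], N = sum(counts)."""
    num_strains = len(counts) - 1
    total = sum(counts)
    log_p = lgamma(total + 1)            # log N!
    for n, c_n in enumerate(counts):
        log_p -= lgamma(c_n + 1)         # minus log c_n!
        if c_n:                          # empty bins contribute nothing
            # Raises ValueError if a non-empty bin has zero probability.
            log_p += c_n * log(mixture_pmf(n, num_strains, pi, mu))
    return log_p


# Toy example with |S| = 3 strains; all values are made up for illustration.
COUNTS = [5, 4, 3, 8]    # c_0 .. c_3
PI = [0.6, 0.4]          # hypothetical class probabilities
MU = [0.2, 0.9]          # hypothetical class population frequencies

toy_log_p = log_multinomial(COUNTS, PI, MU)
print(toy_log_p)
```

            <p>Using <it>lgamma</it>(<it>m </it>+ 1) in place of log <it>m</it>! avoids overflow for large counts.</p>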
            <p>The parameters <it>N </it>and &#960; can be determined by maximizing the log-likelihood of the observation c:</p>
            <p>
               <display-formula>
                  <m:math name="gb-2007-8-6-r103-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>log</m:mi>
                           <m:mo>&#8289;</m:mo>
                           <m:mi>P</m:mi>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mrow>
                                 <m:mover accent="true">
                                    <m:mi>c</m:mi>
                                    <m:mo>&#8594;</m:mo>
                                 </m:mover>
                                 <m:mo>|</m:mo>
                                 <m:mi>N</m:mi>
                                 <m:mo>,</m:mo>
                                 <m:mover accent="true">
                                    <m:mi>&#960;</m:mi>
                                    <m:mo>&#8594;</m:mo>
                                 </m:mover>
                                 <m:mo>,</m:mo>
                                 <m:mover accent="true">
                                    <m:mi>&#956;</m:mi>
                                    <m:mo>&#8594;</m:mo>
                                 </m:mover>
                              </m:mrow>
                              <m:mo>)</m:mo>
                           </m:mrow>
                           <m:mo>=</m:mo>
                           <m:mi>log</m:mi>
                           <m:mo>&#8289;</m:mo>
                           <m:mi>N</m:mi>
                           <m:mo>!</m:mo>
                           <m:mo>&#8722;</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>n</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>0</m:mn>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:mrow>
                                       <m:mo>|</m:mo>
                                       <m:mi>S</m:mi>
                                       <m:mo>|</m:mo>
                                    </m:mrow>
                                 </m:mrow>
                              </m:munderover>
                              <m:mrow>
                                 <m:mi>log</m:mi>
                                 <m:mo>&#8289;</m:mo>
                                 <m:mrow>
                                    <m:mo>(</m:mo>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>c</m:mi>
                                          <m:mi>n</m:mi>
                                       </m:msub>
                                       <m:mo>!</m:mo>
                                    </m:mrow>
                                    <m:mo>)</m:mo>
                                 </m:mrow>
                              </m:mrow>
                           </m:mstyle>
                           <m:mo>+</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>n</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>0</m:mn>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:mrow>
                                       <m:mo>|</m:mo>
                                       <m:mi>S</m:mi>
                                       <m:mo>|</m:mo>
                                    </m:mrow>
                                 </m:mrow>
                              </m:munderover>
                              <m:mrow>
                                 <m:msub>
                                    <m:mi>c</m:mi>
                                    <m:mi>n</m:mi>
                                 </m:msub>
                                 <m:mi>log</m:mi>
                                 <m:mo>&#8289;</m:mo>
                              </m:mrow>
                           </m:mstyle>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:munderover>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>k</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mi>K</m:mi>
                                    </m:munderover>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>&#960;</m:mi>
                                          <m:mi>k</m:mi>
                                       </m:msub>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mrow>
                                                <m:mo>|</m:mo>
                                                <m:mi>S</m:mi>
                                                <m:mo>|</m:mo>
                                             </m:mrow>
                                             <m:mo>!</m:mo>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mi>n</m:mi>
                                             <m:mo>!</m:mo>
                                             <m:mrow>
                                                <m:mo>(</m:mo>
                                                <m:mrow>
                                                   <m:mrow>
                                                      <m:mo>|</m:mo>
                                                      <m:mi>S</m:mi>
                                                      <m:mo>|</m:mo>
                                                   </m:mrow>
                                                   <m:mo>&#8722;</m:mo>
                                                   <m:mi>n</m:mi>
                                                </m:mrow>
                                                <m:mo>)</m:mo>
                                             </m:mrow>
                                             <m:mo>!</m:mo>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                 </m:mstyle>
                                 <m:msubsup>
                                    <m:mi>&#956;</m:mi>
                                    <m:mi>k</m:mi>
                                    <m:mi>n</m:mi>
                                 </m:msubsup>
                                 <m:msup>
                                    <m:mrow>
                                       <m:mrow>
                                          <m:mo>(</m:mo>
                                          <m:mrow>
                                             <m:mn>1</m:mn>
                                             <m:mo>&#8722;</m:mo>
                                             <m:msub>
                                                <m:mi>&#956;</m:mi>
                                                <m:mi>k</m:mi>
                                             </m:msub>
                                          </m:mrow>
                                          <m:mo>)</m:mo>
                                       </m:mrow>
                                    </m:mrow>
           