<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-8-246</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Software</dochead>
      <bibl>
         <title>
            <p>Simcluster: clustering enumeration gene expression data on the simplex space</p>
         </title>
         <aug>
            <au id="A1" ca="yes" ce="yes">
               <snm>V&#234;ncio</snm>
               <mi>ZN</mi>
               <fnm>Ricardo</fnm>
               <insr iid="I1"/>
               <email>rvencio@gmail.com</email>
            </au>
            <au id="A2" ce="yes">
               <snm>Varuzza</snm>
               <fnm>Leonardo</fnm>
               <insr iid="I2"/>
               <email>lvaruzza@vision.ime.usp.br</email>
            </au>
            <au id="A3">
               <snm>de B Pereira</snm>
               <mi>A</mi>
               <fnm>Carlos</fnm>
               <insr iid="I2"/>
               <email>cpereira@ime.usp.br</email>
            </au>
            <au id="A4">
               <snm>Brentani</snm>
               <fnm>Helena</fnm>
               <insr iid="I3"/>
               <email>helena@lbhc.hcancer.org.br</email>
            </au>
            <au id="A5">
               <snm>Shmulevich</snm>
               <fnm>Ilya</fnm>
               <insr iid="I1"/>
               <email>ishmulevich@systemsbiology.org</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Institute for Systems Biology, 1441 North 34th street, Seattle, WA 98103-8904, USA</p>
            </ins>
            <ins id="I2">
               <p>BIOINFO-USP &#8211; N&#250;cleo de Pesquisas em Bioinform&#225;tica, Universidade de S&#227;o Paulo, S&#227;o Paulo, Brazil</p>
            </ins>
            <ins id="I3">
               <p>Hospital do C&#226;ncer A. C. Camargo, S&#227;o Paulo, Brazil</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2007</pubdate>
         <volume>8</volume>
         <issue>1</issue>
         <fpage>246</fpage>
         <url>http://www.biomedcentral.com/1471-2105/8/246</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17625017</pubid>
               <pubid idtype="doi">10.1186/1471-2105-8-246</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>02</day>
               <month>3</month>
               <year>2007</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>11</day>
               <month>7</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>11</day>
               <month>7</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>V&#234;ncio et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Transcript enumeration methods such as SAGE, MPSS, and sequencing-by-synthesis EST "digital northern" are important high-throughput techniques for digital gene expression measurement. As with other counting or voting processes, these measurements constitute compositional data exhibiting properties particular to the simplex space, where the summation of the components is constrained. These properties are not present in regular Euclidean spaces, on which hybridization-based microarray data is often modeled. Therefore, pattern recognition methods commonly used for microarray data analysis may be non-informative for the data generated by transcript enumeration techniques, since they ignore certain fundamental properties of this space.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Here we present a software tool, Simcluster, designed to perform clustering analysis for data on the simplex space. We present Simcluster as a stand-alone command-line C package and as a user-friendly on-line tool. Both versions are available at: <url>http://xerad.systemsbiology.net/simcluster</url>.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Simcluster is designed in accordance with a well-established mathematical framework for compositional data analysis, which provides principled procedures for dealing with the simplex space, and is thus applicable in a number of contexts, including enumeration-based gene expression data.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Technologies for high-throughput measurement of transcriptional gene expression are mainly divided into two categories: those based on hybridization, such as all microarray-related technologies <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp> and those based on transcript enumeration, which include SAGE <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>, MPSS <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, and Digital Northern powered by traditional <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> or, recently developed, EST sequencing-by-synthesis (SBS) technologies <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>.</p>
         <p>Currently, transcript enumeration methods are relatively expensive and more time-consuming than methods based on hybridization. However, recent improvements in sequencing technology, powered by the "$1000 genome" effort <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>, promise to transform the transcript enumeration approach into a fast and accessible alternative <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr></abbrgrp>, paving the way for a systems-level absolute digital description of individualized samples <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>.</p>
         <p>Methods for finding differentially expressed genes have been developed specifically in the context of enumeration-based techniques of different sequencing scales such as EST <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>, SAGE <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> and MPSS <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. However, in spite of their differences, hybridization-based and enumeration-based data are typically analyzed using the same pattern recognition techniques, which are generally imported from the microarray analysis field.</p>
         <p>In the case of clustering analysis of gene profiles, the simple appropriation of practices from the microarray analysis field has been shown to lead to suboptimal performance <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. Cai and co-workers <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> provided an elegant clustering computational solution to group tag (rows in a usual expression matrix representation) profiles that takes into account the specificities of enumeration-based datasets. However, to the best of our knowledge, a solution for transcript enumeration libraries (columns in a usual expression matrix representation) is still needed. We report on a novel computational solution, called Simcluster, to support clustering analysis of transcript enumeration libraries.</p>
      </sec>
      <sec>
         <st>
            <p>Implementation</p>
         </st>
         <sec>
            <st>
               <p>Theory</p>
            </st>
            <p>Without loss of generality, we use the term "tag" to refer to the transcripts' representation, as usual in the SAGE field (this is equivalent to the term "signature" in MPSS analysis or "contigs" in EST analysis). The theoretical model used here to describe the transcript enumeration process is the usual model of uniform sampling of interchangeable colored balls from an infinite urn. Given the total number <it>n </it>of counted tags and the abundance vector <b><it>&#960; </it></b>of all transcripts, this model leads to a probabilistic description of the observed result: <b><it>x</it></b>|<b><it>&#960;</it></b>, <it>n </it>~ Multi(<b><it>&#960;</it></b>, <it>n</it>), i.e., the counts <b><it>x </it></b>follow a Multinomial distribution <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. It is also possible to model <b><it>x </it></b>as Poisson distributed <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>, since the Poisson distribution approximates the Multinomial. Regardless of the specificities of the theoretical probabilistic model, it is well known that, as with other counting or voting processes, the natural space for dealing with this kind of data is the simplex space. The unitary simplex space, having <it>d </it>dimensions, is defined as <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp>:</p>
            <p>
               <display-formula id="M1">
                  <m:math name="1471-2105-8-246-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>S</m:mi>
                              <m:mrow>
                                 <m:mi>d</m:mi>
                                 <m:mo>&#8722;</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:mo>{</m:mo>
                           <m:mi>&#960;</m:mi>
                           <m:mo>|</m:mo>
                           <m:mi>&#960;</m:mi>
                           <m:mo>&#8712;</m:mo>
                           <m:msubsup>
                              <m:mi>&#8477;</m:mi>
                              <m:mo>+</m:mo>
                              <m:mi>d</m:mi>
                           </m:msubsup>
                           <m:mo>,</m:mo>
                           <m:mi>&#960;</m:mi>
                           <m:msup>
                              <m:mn>1</m:mn>
                              <m:mo>&#8242;</m:mo>
                           </m:msup>
                           <m:mo>=</m:mo>
                           <m:mn>1</m:mn>
                           <m:mo>}</m:mo>
                        </m:mrow>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>where <b>1 </b>is a vector of ones. In the gene expression context, <it>d </it>is the number of unique tags observed. An example of a simplex vector is <b><it>p </it></b>= <inline-formula><m:math name="1471-2105-8-246-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi mathvariant="double-struck">E</m:mi><m:annotation encoding="MathType-MTEF"/></m:semantics></m:math></inline-formula>[<b><it>&#960;</it></b>|<b><it>x</it></b>], the posterior expectation of the abundance vector. Applying a standard Bayesian approach to <b><it>x</it></b>|<b><it>&#960;</it></b>, <it>n </it>with a Dirichlet prior density <b><it>&#960; </it></b>~ Dir(<b><it>&#945;</it></b>), one obtains the posterior density: <b><it>&#960;</it></b>|<b><it>x </it></b>~ Dir(<b><it>x </it></b>+ <b><it>&#945;</it></b>).</p>
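            <p>As a minimal sketch of this Bayesian step, the posterior mean can be computed directly from the counts. The Python fragment below assumes a symmetric Dirichlet prior with every component of <b><it>&#945; </it></b>equal to one (a uniform prior); the function name <it>posterior_mean</it> and the example counts are hypothetical and are not taken from Simcluster.</p>
            <p>
```python
def posterior_mean(counts, alpha=1.0):
    """Expected abundance E[pi | x] under a symmetric Dirichlet(alpha) prior.

    The posterior is Dir(x + alpha), whose mean is (x_i + alpha) / (n + d * alpha),
    where n is the total count and d the number of unique tags.
    """
    d = len(counts)
    total = sum(counts) + d * alpha
    return [(x + alpha) / total for x in counts]


# Hypothetical tag counts for one library with d = 4 unique tags.
counts = [120, 30, 0, 50]
p = posterior_mean(counts)
print(p)       # a simplex vector: strictly positive components
print(sum(p))  # sums to one, up to floating-point rounding
```
            </p>
            <p>Under this assumption, a tag with zero observed counts still receives a strictly positive expected abundance, which keeps logarithms of abundance ratios finite.</p>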
            <p>It is known that clustering analysis is inherently dependent on the choice of a distance measure between the considered objects. This, in turn, is connected to the structure of the underlying space. A metric &#916;, measuring the distance between two objects <it>a </it>and <it>b</it>, must respect the properties:</p>
            <p>(i) &#916;(<it>a, b</it>) = &#916;(<it>b, a</it>);</p>
            <p>(ii) &#916;(<it>a, b</it>) = 0 &#8660; <it>a </it>= <it>b</it>;</p>
            <p>(iii) &#916;(<it>a, c</it>) &#8804; &#916;(<it>a, b</it>) + &#916;(<it>b, c</it>).</p>
            <p>One may also consider additional reasonable properties such as:</p>
            <p>(iv) scale invariance &#916;(<it>xa, yb</it>) = &#916;(<it>a, b</it>), <it>x, y </it>&#8712; &#8477;<sub>+</sub>; and</p>
            <p>(v) translational invariance &#916;(<it>a </it>+ <it>t, b </it>+ <it>t</it>) = &#916;(<it>a, b</it>).</p>
            <p>These commonly required additional properties guarantee that distance measurements are not affected by the definition of arbitrary scale or measurement units and that more importance is given to the actual difference between the objects being measured rather than commonalities (more details can be found in the appendix Additional File <supplr sid="S1">1</supplr>).</p>
            <suppl id="S1">
               <title>
                  <p>Additional file 1</p>
               </title>
               <text>
                  <p><b>Appendix: Aitchisonean distance</b>. Contains an appendix with some background on the usage of the Aitchisonean distance.</p>
               </text>
               <file name="1471-2105-8-246-S1.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>Translations on the simplex space are defined by <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>:</p>
            <p>
               <display-formula id="M2">
                  <m:math name="1471-2105-8-246-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>p</m:mi>
                           <m:mo>&#8853;</m:mo>
                           <m:mi>t</m:mi>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>p</m:mi>
                                 <m:mo>&#8901;</m:mo>
                                 <m:mi>t</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                              <m:mrow>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>p</m:mi>
                                 <m:mo>&#8901;</m:mo>
                                 <m:mi>t</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                                 <m:msup>
                                    <m:mn>1</m:mn>
                                    <m:mo>&#8242;</m:mo>
                                 </m:msup>
                                 <m:mi/>
                              </m:mrow>
                           </m:mfrac>
                        </m:mrow>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>where &#183; is the usual Hadamard product and the division is vector-evaluated.</p>
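            <p>The translation defined above can be sketched in a few lines of Python: an element-wise product followed by a rescaling that returns the result to the unitary simplex. The function names <it>closure</it> and <it>perturb</it> and the example vectors are illustrative assumptions.</p>
            <p>
```python
def closure(v):
    """Rescale a vector of positive numbers so that its components sum to one."""
    total = sum(v)
    return [x / total for x in v]


def perturb(p, t):
    """Translation on the simplex: element-wise (Hadamard) product, then closure."""
    if len(p) != len(t):
        raise ValueError("p and t must have the same number of components")
    return closure([pi * ti for pi, ti in zip(p, t)])


p = [0.5, 0.3, 0.2]
t = [0.2, 0.2, 0.6]
print(perturb(p, t))
# Perturbing by the uniform composition returns p (up to floating-point rounding).
print(perturb(p, [1 / 3, 1 / 3, 1 / 3]))
```
            </p>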
            <p>Well-known distances, such as Euclidean, Manhattan, and correlation-based distances, do not satisfy all of the properties (i)-(v) if the measured objects belong to the simplex space, as is the case for transcript enumeration data. A possible metric that obeys (i)-(v) on the simplex space is the Aitchisonean distance <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>:</p>
            <p>
               <display-formula id="M3">
                  <m:math name="1471-2105-8-246-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>&#916;</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>p</m:mi>
                           <m:mo>,</m:mo>
                           <m:mi>q</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:msqrt>
                              <m:mrow>
                                 <m:mi>l</m:mi>
                                 <m:mi>n</m:mi>
                                 <m:mrow>
                                    <m:mo>(</m:mo>
                                    <m:mrow>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:msub>
                                                <m:mi>p</m:mi>
                                                <m:mrow>
                                                   <m:mo>&#8722;</m:mo>
                                                   <m:mi>d</m:mi>
                                                </m:mrow>
                                             </m:msub>
                                             <m:mo>/</m:mo>
                                             <m:msub>
                                                <m:mi>p</m:mi>
                                                <m:mi>d</m:mi>
                                             </m:msub>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:msub>
                                                <m:mi>q</m:mi>
                                                <m:mrow>
                                                   <m:mo>&#8722;</m:mo>
                                                   <m:mi>d</m:mi>
                                                </m:mrow>
                                             </m:msub>
                                             <m:mo>/</m:mo>
                                             <m:msub>
                                                <m:mi>q</m:mi>
                                                <m:mi>d</m:mi>
                                             </m:msub>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                    <m:mo>)</m:mo>
                                 </m:mrow>
                                 <m:msup>
                                    <m:mrow>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>I</m:mi>
                                       <m:mo>+</m:mo>
                                       <m:msup>
                                          <m:mn>1</m:mn>
                                          <m:mo>&#8242;</m:mo>
                                       </m:msup>
                                       <m:mo>&#215;</m:mo>
                                       <m:mn>1</m:mn>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mn>1</m:mn>
                                    </m:mrow>
                                 </m:msup>
                                 <m:mi>l</m:mi>
                                 <m:mi>n</m:mi>
                                 <m:msup>
                                    <m:mrow>
                                       <m:mrow>
                                          <m:mo>(</m:mo>
                                          <m:mrow>
                                             <m:mfrac>
                                                <m:mrow>
                                                   <m:msub>
                                                      <m:mi>p</m:mi>
                                                      <m:mrow>
                                                         <m:mo>&#8722;</m:mo>
                                                         <m:mi>d</m:mi>
                                                      </m:mrow>
                                                   </m:msub>
                                                   <m:mo>/</m:mo>
                                                   <m:msub>
                                                      <m:mi>p</m:mi>
                                                      <m:mi>d</m:mi>
                                                   </m:msub>
                                                </m:mrow>
                                                <m:mrow>
                                                   <m:msub>
                                                      <m:mi>q</m:mi>
                                                      <m:mrow>
                                                         <m:mo>&#8722;</m:mo>
                                                         <m:mi>d</m:mi>
                                                      </m:mrow>
                                                   </m:msub>
                                                   <m:mo>/</m:mo>
                                                   <m:msub>
                                                      <m:mi>q</m:mi>
                                                      <m:mi>d</m:mi>
                                                   </m:msub>
                                                </m:mrow>
                                             </m:mfrac>
                                          </m:mrow>
                                          <m:mo>)</m:mo>
                                       </m:mrow>
                                    </m:mrow>
                                    <m:mo>&#8242;</m:mo>
                                 </m:msup>
                              </m:mrow>
                           </m:msqrt>
                        </m:mrow>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>where <b><it>I </it></b>is the identity matrix, &#215; is the Kronecker product, the subscript -<it>d </it>denotes "excluding the <it>d</it><sup><it>th </it></sup>element", and elementary operations are vector-evaluated.</p>
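            <p>A hedged Python sketch of this distance follows. It uses the identity that the inverse of the identity matrix plus the all-ones matrix, both of size (<it>d </it>- 1), equals the identity minus the all-ones matrix divided by <it>d</it>, so the quadratic form reduces to the sum of squared log-ratios minus their squared sum divided by <it>d</it>. The function names and example vectors are assumptions made for illustration, and all components are assumed strictly positive.</p>
            <p>
```python
import math


def perturb(p, t):
    """Translation on the simplex: element-wise product rescaled to sum to one."""
    prod = [pi * ti for pi, ti in zip(p, t)]
    total = sum(prod)
    return [v / total for v in prod]


def aitchison_distance(p, q):
    """Aitchisonean distance between two vectors with strictly positive parts."""
    d = len(p)
    if len(q) != d:
        raise ValueError("p and q must have the same number of components")
    # Log-ratios against the last component, excluding that component itself.
    z = [math.log((p[i] / p[d - 1]) / (q[i] / q[d - 1])) for i in range(d - 1)]
    # z (I + J)^-1 z' with J the all-ones matrix, using (I + J)^-1 = I - J / d.
    quad = sum(zi * zi for zi in z) - sum(z) ** 2 / d
    return math.sqrt(max(quad, 0.0))  # guard against tiny negative rounding


p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
t = [0.1, 0.6, 0.3]

print(aitchison_distance(p, q))
# Scale invariance: only ratios between components enter the distance.
print(aitchison_distance([10 * v for v in p], [7 * v for v in q]))
# Translational invariance: perturbing both vectors by t changes nothing.
print(aitchison_distance(perturb(p, t), perturb(q, t)))
```
            </p>
            <p>The three printed values agree up to floating-point rounding, which illustrates properties (iv) and (v) for this pair of vectors.</p>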
            <p>Clustering procedures coherent with this theoretical background are suitable for transcript enumeration data.</p>
         </sec>
         <sec>
            <st>
               <p>Software design</p>
            </st>
            <p>In short, Simcluster's method can be described as the use of a Bayesian inference step (currently with a uniform prior) to obtain the expected abundance simplex vectors given the observed counts <inline-formula><m:math name="1471-2105-8-246-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi mathvariant="double-struck">E</m:mi></m:semantics></m:math></inline-formula>[<b><it>&#960;</it></b>|<b><it>x</it></b>], and the use of the Aitchisonean distance in the following algorithms: k-means, k-medoids and self-organizing maps (SOM) for partition clustering, PCA for inferring the number of variability sources present, and common variants of agglomerative hierarchical clustering.</p>
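            <p>The two steps described above can be illustrated with a short pure-Python sketch: a posterior mean under a uniform prior, followed by average linkage agglomerative clustering on Aitchisonean distances. This is not the package's C implementation; the function names, the library labels and the counts are hypothetical, and the naive search over cluster pairs is only suitable for small examples.</p>
            <p>
```python
import math
from itertools import combinations


def posterior_mean(counts, alpha=1.0):
    """Expected abundances under a symmetric Dirichlet(alpha) prior."""
    total = sum(counts) + alpha * len(counts)
    return [(x + alpha) / total for x in counts]


def aitchison_distance(p, q):
    """Aitchisonean distance between two strictly positive vectors."""
    d = len(p)
    z = [math.log(p[i] * q[d - 1] / (p[d - 1] * q[i])) for i in range(d - 1)]
    return math.sqrt(max(sum(v * v for v in z) - sum(z) ** 2 / d, 0.0))


def average_linkage(libraries):
    """Cluster named count vectors; return the merges as (left, right, height)."""
    comps = {name: posterior_mean(x) for name, x in libraries.items()}
    dist = {frozenset(pair): aitchison_distance(comps[pair[0]], comps[pair[1]])
            for pair in combinations(comps, 2)}

    def linkage(pair):
        # Mean pairwise distance between the members of two clusters.
        a, b = pair
        total = sum(dist[frozenset((i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [(name,) for name in comps]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical libraries (columns of an expression matrix) with four tags each.
libraries = {
    "A1": [100, 50, 10, 5],
    "A2": [210, 95, 22, 9],
    "B1": [10, 60, 100, 40],
    "B2": [25, 115, 190, 85],
}
for left, right, height in average_linkage(libraries):
    print(left, right, round(height, 3))
```
            </p>
            <p>In this toy example the libraries differ in sequencing depth, yet the first two merges pair A1 with A2 and B1 with B2, because only ratios between tag abundances enter the distance.</p>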
            <p>Currently, the Simcluster package comprises: Simtree, for hierarchical clustering; Simpart, for partition clustering; Simpca, for Principal Component Analysis (PCA); and several utilities such as TreeDraw, a program to draw hierarchical clustering dendrograms with user-defined colored leaves. Simcluster's modularity allows relatively simple extension and addition of new modules or algorithms. Broader coverage of algorithms and validity assessment methods <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> is envisioned for future updates. Simcluster can be used, modified and distributed under the terms of the GPL license <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. The software was implemented in C for improved performance and memory usage, assuring that even large datasets can be processed on a regular desktop PC (Additional File <supplr sid="S2">2</supplr>).</p>
            <suppl id="S2">
               <title>
                  <p>Additional file 2</p>
               </title>
               <text>
                  <p><b>Stand-alone command-line Simcluster version 0.8.14</b>. Simcluster version used to create the results of this work. Note that this version is distributed for compatibility purposes only and users should always obtain the latest version at the project's website: <url>http://xerad.systemsbiology.net/simcluster</url>.</p>
               </text>
               <file name="1471-2105-8-246-S2.zip">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>To increase source code reuse, established libraries were used: Cluster 3 <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> for clustering, GNU Scientific library <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> for PCA, Cairo <abbrgrp><abbr bid="B24">24</abbr></abbrgrp> and a modification of TreeDraw X <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> for colored dendrogram drawing. The input data set can be a matrix of transcript counts or general simplex vectors. Some auxiliary shell and Perl scripts are available to: automatically download data from the GEO database <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>, convert GEO files to Simcluster input format, and filter out low-count tags.</p>
            <p>The Linux-based installation and compilation is facilitated by a configuration script that detects all the prerequisites for Simcluster compilation. Missing libraries are automatically downloaded from the Simcluster website and compiled by the Simcluster compilation process.</p>
            <p>To broaden usability, a user-friendly web interface was developed and is made available at <url>http://xerad.systemsbiology.net/simcluster_web/</url>. Figure <figr fid="F1">1</figr> shows a screenshot of an analysis session using Simcluster's web-based interface.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Screenshot of an analysis session using Simcluster's web-based interface</p>
               </caption>
               <text>
                  <p><b>Screenshot of an analysis session using Simcluster's web-based interface</b>. Simcluster's on-line version was designed to be a user-friendly interface for the command-line version. The screenshot shown is an illustration of an interactive session using the example data provided.</p>
               </text>
               <graphic file="1471-2105-8-246-1"/>
            </fig>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results and Discussion</p>
         </st>
         <p>We agree with Dougherty and Brun <abbrgrp><abbr bid="B27">27</abbr><abbr bid="B28">28</abbr></abbrgrp> that "validation" of clustering results is a heuristic process, even though there are some interesting efforts to objectively incorporate biological knowledge in this process using Gene Ontology, especially when one is clustering gene expression profiles <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp>. However, to illustrate the usefulness of our software, we collected several examples in which the performance of Simcluster can be considered as qualitatively superior to some traditional approaches imported from the microarray analysis field. These examples include EST, SAGE and MPSS datasets, and are available on the project's webpage <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. Among these, we describe here a simulated enumeration dataset built from real microarray data, for which we can define the ground truth and check results against it in a relatively objective way. Of course, a comprehensive study with simulated data, consisting of comparisons of clustering algorithms, distance metrics, and distributions generating the random point sets, would be necessary to properly evaluate any clustering algorithm. This should be the subject of future work. The objective of this example is to show that Simcluster is able to reconstruct the clustering result obtained for an Affymetrix microarray dataset when the input is a simulated transcript enumeration dataset, built to mimic the real microarray biological data.</p>
         <p>The data used to create the virtual transcript enumeration data was obtained from the Innate Immunity Systems Biology project <abbrgrp><abbr bid="B32">32</abbr></abbrgrp> and is provided as Additional File <supplr sid="S3">3</supplr>. This data is a set of Affymetrix experiments of mouse macrophages stimulated by different Toll-like receptor agonists (LPS, PIC, CPG, R848, PAM) during a time-course (0, 20, 40, 60, 80 and 120 minutes). A detailed description and the biological significance of this dataset are presented elsewhere <abbrgrp><abbr bid="B32">32</abbr><abbr bid="B33">33</abbr></abbrgrp>.</p>
         <suppl id="S3">
            <title>
               <p>Additional file 3</p>
            </title>
            <text>
               <p><b>Simulation data, results and scripts</b>. Contains the script that generated the virtual transcript enumeration data, the dataset used as the basis for the analysis, the results from it, and the conclusions for all tested sample sizes <it>n </it>from 100,000 to 100,000,000.</p>
            </text>
            <file name="1471-2105-8-246-S3.zip">
               <p>Click here for file</p>
            </file>
         </suppl>
         <p>A clustering analysis of this data is shown in Figure <figr fid="F2">2</figr>. This pattern was obtained using the most common type of clustering analysis in the microarray field: Euclidean distance with average linkage agglomerative hierarchical clustering, implemented with R <abbrgrp><abbr bid="B34">34</abbr></abbrgrp> routines that are available in Additional File <supplr sid="S3">3</supplr>. This clustering pattern is considered the "gold standard" for the purpose of this simulation.</p>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Clustering analysis of the Affymetrix dataset</p>
            </caption>
            <text>
               <p><b>Clustering analysis of the Affymetrix dataset</b>. Data produced by the Innate Immunity Systems Biology project [32,33] and available as Additional File <supplr sid="S3">3</supplr>. This data is a set of Affymetrix experiments of mouse macrophages stimulated by different Toll-like receptor agonists (LPS, PIC, CPG, R848, PAM) during a time-course (0, 20, 40, 60, 80 and 120 minutes). Method: Euclidean distance with average linkage agglomerative hierarchical clustering.</p>
            </text>
            <graphic file="1471-2105-8-246-2"/>
         </fig>
         <p>The virtual experiment consists of four steps: creating a transcriptome in which the relative abundance of genes is defined by the Affymetrix data; sampling from it a random number of tags, at different orders of magnitude; enumerating the sampled transcripts; and applying some common clustering procedures along with Simcluster. It is easier to understand the concept of the virtual transcriptome by following a particular case. Consider the sample labeled LPS-120, measured 120 minutes after the LPS stimulus; its Affymetrix expression levels are shown in Table <tblr tid="T1">1</tblr>.</p>
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>Affymetrix expression levels</p>
            </caption>
            <tblbdy cols="4">
               <r>
                  <c ca="left">
                     <p>Probesets</p>
                  </c>
                  <c ca="left">
                     <p>Representative ID</p>
                  </c>
                  <c ca="left">
                     <p>Gene Symbol</p>
                  </c>
                  <c ca="left">
                     <p>Intensity (sorted)</p>
                  </c>
               </r>
               <r>
                  <c cspan="4">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>1457375_at</p>
                  </c>
                  <c ca="left">
                     <p>BG094499</p>
                  </c>
                  <c ca="left">
                     <p>Transcribed locus</p>
                  </c>
                  <c ca="left">
                     <p>1.94760</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>1452109_at</p>
                  </c>
                  <c ca="left">
                     <p>BG973910</p>
                  </c>
                  <c ca="left">
                     <p>interleukin 17 receptor E</p>
                  </c>
                  <c ca="left">
                     <p>2.14522</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>...</p>
                  </c>
                  <c ca="left">
                     <p>...</p>
                  </c>
                  <c ca="left">
                     <p>...</p>
                  </c>
                  <c ca="left">
                     <p>...</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>M12481_3_at</p>
                  </c>
                  <c ca="left">
                     <p>AFFX-b-ActinMur</p>
                  </c>
                  <c ca="left">
                     <p>actin beta cytoplasmic</p>
                  </c>
                  <c ca="left">
                     <p>36191.41765</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>1436996_x_at</p>
                  </c>
                  <c ca="left">
                     <p>AV066625</p>
                  </c>
                  <c ca="left">
                     <p>P lysozyme structural</p>
                  </c>
                  <c ca="left">
                     <p>43458.17590</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
         <p>The virtual number of available tags for each probeset is defined as proportional to its measured intensity, using 10,000 as a scaling constant, an arbitrary number large enough to ensure that finite population issues are negligible. Actual examples are: 19,476 for BG094499; 21,452 for BG973910; and so on until 361,914,176 for actin; and 434,581,759 for AV066625. The total number of available tags is <it>T </it>= 126,971,909,452, which is much greater than the typical number of sequenced tags and is in accordance with the "infinite urn" model.</p>
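         <p>The scaling step can be illustrated with a minimal Python sketch using only the four probesets listed in Table 1. It is not the script distributed in Additional File 3; the function names are hypothetical, and the rounding rule (round half to even) is an assumption, since the text states only that tag counts are proportional to intensity.</p>
         <p><![CDATA[
```python
from decimal import Decimal, ROUND_HALF_EVEN

SCALE = Decimal(10000)  # scaling constant stated in the text

# Intensities of the four probesets shown in Table 1 (sample LPS-120).
# Kept as strings so that Decimal sees the printed digits exactly.
TABLE1_INTENSITIES = {
    "BG094499": "1.94760",
    "BG973910": "2.14522",
    "AFFX-b-ActinMur": "36191.41765",
    "AV066625": "43458.17590",
}


def virtual_tags(intensity):
    """Return the virtual number of available tags for one intensity.

    Assumption: the scaled value is rounded half-to-even.
    """
    scaled = Decimal(intensity) * SCALE
    return int(scaled.quantize(Decimal(1), rounding=ROUND_HALF_EVEN))


def relative_abundance(tag_count, total):
    """Fraction of the virtual transcriptome occupied by one transcript."""
    return tag_count / total


tags = {name: virtual_tags(value) for name, value in TABLE1_INTENSITIES.items()}
T = 126971909452  # total over all probesets, as reported in the text

for name, count in tags.items():
    print(name, count, relative_abundance(count, T))
```
]]></p>
         <p>Under these assumptions the four tag counts agree with the examples quoted above, and each relative abundance is simply the tag count divided by <it>T</it>.</p>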
         <p>The total number of virtually sequenced tags <it>N </it>for each sample is simulated from a Poisson distribution, <it>N </it>~ Poisson(<it>n</it>), to create a realistic virtual sequencing library. All generated data and results are available as Additional File <supplr sid="S3">3</supplr>. For example, the actual simulation for <it>n </it>= 1,000,000 virtually sequenced tags assigned <it>N </it>= 1,001,794 to the LPS-120 library; <it>N </it>= 998,382 to the CPG-40 library; and so on. The same process is repeated for increasing <it>n </it>from 100,000 to 100,000,000. Since <it>n </it>&#8810; <it>T </it>for all <it>n </it>considered, multinomial sampling is used and its mean is taken for each library, according to the assumed "infinite urn" model. The results for the largest simulation are shown in Figures <figr fid="F3">3</figr>, <figr fid="F4">4</figr>, <figr fid="F5">5</figr> and <figr fid="F6">6</figr>, and individual results for each <it>n </it>are available as Additional File <supplr sid="S3">3</supplr>.</p>
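         <p>The library-size and sampling steps can be sketched in the same way. The following Python fragment is only an illustration under stated assumptions: it uses a toy transcriptome made of the four Table 1 entries rather than the full dataset, the function names are hypothetical, and the Poisson sampler (counting unit-rate exponential arrivals) is one simple exact method chosen here, not necessarily the one used in the original scripts.</p>
         <p><![CDATA[
```python
import random
from collections import Counter


def sample_poisson(mean, rng):
    """Draw N ~ Poisson(mean) by counting unit-rate exponential arrivals.

    Exact, but O(mean) in time, so only practical for modest means.
    """
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def expected_counts(abundances, library_size):
    """Mean of a multinomial with library_size draws: N * p_i per transcript."""
    total = sum(abundances.values())
    return {name: library_size * tags / total for name, tags in abundances.items()}


def sample_counts(abundances, library_size, rng):
    """One multinomial draw, i.e. sampling with replacement ("infinite urn")."""
    names = list(abundances)
    weights = [abundances[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=library_size))


# Toy transcriptome: only the four Table 1 entries, not the full dataset.
abundances = {
    "BG094499": 19476,
    "BG973910": 21452,
    "AFFX-b-ActinMur": 361914176,
    "AV066625": 434581759,
}

rng = random.Random(2007)  # fixed seed so the sketch is reproducible
n = 100000                 # smallest nominal library size considered
N = sample_poisson(n, rng)
means = expected_counts(abundances, N)
draw = sample_counts(abundances, N, rng)
print(N, means["AV066625"], draw["AV066625"])
```
]]></p>
         <p>Here <it>expected_counts</it> corresponds to taking the mean of the multinomial for a library of size <it>N</it>, while <it>sample_counts</it> shows what a single random draw would look like.</p>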
         <fig id="F3">
            <title>
               <p>Figure 3</p>
            </title>
            <caption>
               <p>Simcluster's clustering of simulated data based on Affymetrix expression levels</p>
            </caption>
            <text>
               <p><b>Simcluster's clustering of simulated data based on Affymetrix expression levels</b>. Transcript enumeration data produced by the simulation of a virtual transcriptome according to the Affymetrix expression levels. Sample size <it>n </it>= 100,000,000. Method: Simcluster's average linkage agglomerative hierarchical clustering.</p>
            </text>
            <graphic file="1471-2105-8-246-3"/>
         </fig>
         <fig id="F4">
            <title>
               <p>Figure 4</p>
            </title>
            <caption>
               <p>Clustering of simulated data using Euclidean distance</p>
            </caption>
            <text>
               <p><b>Clustering of simulated data using Euclidean distance</b>. Transcript enumeration data produced by the simulation of a virtual transcriptome according to the Affymetrix expression levels. Sample size <it>n </it>= 100,000,000. Method: Euclidean distance with average linkage agglomerative hierarchical clustering.</p>
            </text>
            <graphic file="1471-2105-8-246-4"/>
         </fig>
         <fig id="F5">
            <title>
               <p>Figure 5</p>
            </title>
            <caption>
               <p>Clustering of simulated data using correlation distance</p>
            </caption>
            <text>
               <p><b>Clustering of simulated data using correlation distance</b>. Transcript enumeration data produced by the simulation of a virtual transcriptome according to the Affymetrix expression levels. Sample size <it>n </it>= 100,000,000. Method: correlation-based distance with average linkage agglomerative hierarchical clustering.</p>
            </text>
            <graphic file="1471-2105-8-246-5"/>
         </fig>
         <fig id="F6">
            <title>
               <p>Figure 6</p>
            </title>
            <caption>
               <p>Clustering of simulated data using cosine distance</p>
            </caption>
            <text>
               <p><b>Clustering of simulated data using cosine distance</b>. Transcript enumeration data produced by the simulation of a virtual transcriptome according to the Affymetrix expression levels. Sample size <it>n </it>= 100,000,000. Method: cosine distance with average linkage agglomerative hierarchical clustering.</p>
            </text>
            <graphic file="1471-2105-8-246-6"/>
         </fig>
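         <p>For reference, the three baseline dissimilarities compared in Figures 4, 5 and 6 can be sketched as follows. The definitions used here (one minus the Pearson correlation, and one minus the cosine of the angle between count vectors) are common conventions and are assumptions of this sketch; whether the counts were normalized before computing the Euclidean distance is not stated in the text. The two toy libraries are hypothetical.</p>
         <p><![CDATA[
```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_distance(x, y):
    """One minus the cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def correlation_distance(x, y):
    """One minus the Pearson correlation (cosine distance after centering)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    return cosine_distance(centered_x, centered_y)


# Two hypothetical libraries with identical proportions but different depth.
library_a = [12, 30, 58, 900]
library_b = [10 * count for count in library_a]

print(euclidean_distance(library_a, library_b))
print(cosine_distance(library_a, library_b))
print(correlation_distance(library_a, library_b))
```
]]></p>
         <p>In this toy case the Euclidean distance between raw counts is large even though both libraries have the same composition, whereas the cosine and correlation distances are numerically zero.</p>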
         <p>The clustering results obtained by Simcluster converge to the same structure obtained by analyzing the Affymetrix data as the number of virtually sequenced tags increases. Moreover, Simcluster's results are not only compatible with the usual microarray analysis of the Affymetrix data, but are also more biologically meaningful than the results obtained by applying the usual microarray analysis techniques to the virtual sequencing data. As in the original microarray analysis, the Simcluster result groups the different stimuli together and places consecutive time-points close to each other.</p>
         <p>Although this kind of analysis certainly does not provide a proof, the above result indicates that the theoretical framework is adequate for enumeration-based data, as expected. Additional examples and discussions can be found on the project's website <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>We developed a software tool, called Simcluster, for clustering libraries of enumeration-based data. It is important to note that Simcluster is built in accordance with a well-established mathematical framework for compositional data analysis, which provides principled procedures for dealing with the simplex space, and is thus applicable in contexts other than transcript enumeration.</p>
      </sec>
      <sec>
         <st>
            <p>Availability and requirements</p>
         </st>
         <p>&#8226; Project Name: Simcluster</p>
         <p>&#8226; Project Home Page: <url>http://xerad.systemsbiology.net/simcluster</url></p>
         <p>&#8226; Operating Systems: Linux for the stand-alone version and platform independent for the web-based tool.</p>
         <p>&#8226; Programming Languages: C for the stand-alone version and C, Perl and HTML for the web-based tool.</p>
         <p>&#8226; Other requirements: some GNU/GPL or GNU/LGPL libraries distributed together with the main package.</p>
         <p>&#8226; License: GNU General Public License 2.0</p>
      </sec>
      <sec>
         <st>
            <p>List of abbreviations</p>
         </st>
         <p>EST &#8211; Expressed Sequence Tag</p>
         <p>SAGE &#8211; Serial Analysis of Gene Expression</p>
         <p>MPSS &#8211; Massively Parallel Signature Sequencing</p>
         <p>SBS &#8211; Sequencing-By-Synthesis</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>RZNV proposed and conducted the study. LV wrote the software and helped to interpret the results. CABP indicated the compositional analysis literature and is LV's PhD thesis advisor. HB provided biological insight for result interpretation. IS supervised the study. RZNV and IS wrote the manuscript. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We thank Dr. Jared Roach (ISB) and Dr. Jo&#227;o C. Barata (USP) for constructive discussions, and Dr. Alistair Rust (ISB) for help with the web server. LV is supported by CAPES. CABP is partially supported by CNPq. This work is partially supported by NIH/NIAID grants U19-AI057266 and U54-AI54253 and NIH/NIGMS P50-GMO-76547.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Quantitative monitoring of gene expression patterns with a complementary DNA microarray</p>
            </title>
            <aug>
               <au>
                  <snm>Schena</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Shalon</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Davis</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1995</pubdate>
            <volume>270</volume>
            <issue>5235</issue>
            <fpage>467</fpage>
            <lpage>470</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.270.5235.467</pubid>
                  <pubid idtype="pmpid" link="fulltext">7569999</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Multiplexed biochemical assays with biological chips</p>
            </title>
            <aug>
               <au>
                  <snm>Fodor</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Rava</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Huang</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Pease</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Holmes</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Adams</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>1993</pubdate>
            <volume>364</volume>
            <fpage>555</fpage>
            <lpage>556</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/364555a0</pubid>
                  <pubid idtype="pmpid" link="fulltext">7687751</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Serial analysis of gene expression</p>
            </title>
            <aug>
               <au>
                  <snm>Velculescu</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Vogelstein</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Kinzler</snm>
                  <fnm>K</fnm>
               </au>
               <etal/>
            </aug>
            <source>Science</source>
            <pubdate>1995</pubdate>
            <volume>270</volume>
            <issue>5235</issue>
            <fpage>484</fpage>
            <lpage>487</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.270.5235.484</pubid>
                  <pubid idtype="pmpid" link="fulltext">7570003</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays</p>
            </title>
            <aug>
               <au>
                  <snm>Brenner</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Johnson</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Bridgham</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Golda</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Lloyd</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Johnson</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Luo</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>McCurdy</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Foy</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Ewan</snm>
                  <fnm>M</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nature Biotechnology</source>
            <pubdate>2000</pubdate>
            <volume>18</volume>
            <fpage>630</fpage>
            <lpage>634</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/76469</pubid>
                  <pubid idtype="pmpid" link="fulltext">10835600</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Large scale cDNA sequencing for analysis of quantitative and qualitative aspects of gene expression</p>
            </title>
            <aug>
               <au>
                  <snm>Okubo</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Hori</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Matoba</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Niiyama</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Fukushima</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Kojima</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Matsubara</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>Nature Genetics</source>
            <pubdate>1992</pubdate>
            <volume>2</volume>
            <fpage>173</fpage>
            <lpage>179</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/ng1192-173</pubid>
                  <pubid idtype="pmpid" link="fulltext">1345164</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Analysis of the prostate cancer cell line LNCaP transcriptome using a sequencing-by-synthesis approach</p>
            </title>
            <aug>
               <au>
                  <snm>Bainbridge</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Warren</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Hirst</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Romanuik</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Zeng</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Go</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Delaney</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Griffith</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Hickenbotham</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Magrini</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Mardis</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Sadar</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Siddiqui</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Marra</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Jones</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>BMC Genomics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>246</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1592491</pubid>
                  <pubid idtype="pmpid" link="fulltext">17010196</pubid>
                  <pubid idtype="doi">10.1186/1471-2164-7-246</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Gene sequencing. The race for the $1000 genome</p>
            </title>
            <aug>
               <au>
                  <snm>Service</snm>
                  <fnm>RF</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2006</pubdate>
            <volume>311</volume>
            <issue>5767</issue>
            <fpage>1544</fpage>
            <lpage>1546</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.311.5767.1544</pubid>
                  <pubid idtype="pmpid" link="fulltext">16543431</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Genome sequencing in microfabricated high-density picolitre reactors</p>
            </title>
            <aug>
               <au>
                  <snm>Margulies</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Egholm</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Altman</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Attiya</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Bader</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Bemben</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Berka</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Braverman</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>Z</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nature</source>
            <pubdate>2005</pubdate>
            <volume>437</volume>
            <fpage>376</fpage>
            <lpage>380</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1464427</pubid>
                  <pubid idtype="pmpid" link="fulltext">16056220</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Four-color DNA sequencing by synthesis on a chip using photocleavable fluorescent nucleotides</p>
            </title>
            <aug>
               <au>
                  <snm>Seo</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Bai</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Kim</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Meng</snm>
                  <fnm>Q</fnm>
               </au>
               <au>
                  <snm>Shi</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ruparel</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Turro</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Ju</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2005</pubdate>
            <volume>102</volume>
            <issue>17</issue>
            <fpage>5926</fpage>
            <lpage>5931</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1073/pnas.0501965102</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Sequence information can be obtained from single DNA molecules</p>
            </title>
            <aug>
               <au>
                  <snm>Braslavsky</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Hebert</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Kartalov</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Quake</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2003</pubdate>
            <volume>100</volume>
            <issue>7</issue>
            <fpage>3960</fpage>
            <lpage>3964</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">153030</pubid>
                  <pubid idtype="pmpid" link="fulltext">12651960</pubid>
                  <pubid idtype="doi">10.1073/pnas.0230489100</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Systems Biology and New Technologies Enable Predictive and Preventative Medicine</p>
            </title>
            <aug>
               <au>
                  <snm>Hood</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Heath</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Phelps</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2004</pubdate>
            <volume>306</volume>
            <issue>5696</issue>
            <fpage>640</fpage>
            <lpage>643</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.1104635</pubid>
                  <pubid idtype="pmpid" link="fulltext">15499008</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>The significance of digital gene expression profiles</p>
            </title>
            <aug>
               <au>
                  <snm>Audic</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Claverie</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>1997</pubdate>
            <volume>7</volume>
            <fpage>986</fpage>
            <lpage>989</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9331369</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Bayesian model accounting for within-class biological variability in Serial Analysis of Gene Expression (SAGE)</p>
            </title>
            <aug>
               <au>
                  <snm>Vencio</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Brentani</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Patrao</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Pereira</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <fpage>119</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">517707</pubid>
                  <pubid idtype="pmpid" link="fulltext">15339345</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-5-119</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Statistical analysis of MPSS measurements: Application to the study of LPS-activated macrophage gene expression</p>
            </title>
            <aug>
               <au>
                  <snm>Stolovitzky</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Kundaje</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Held</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Duggar</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Haudenschild</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Zhou</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Vasicek</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Aderem</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Roach</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Proceedings of the National Academy of Sciences</source>
            <pubdate>2005</pubdate>
            <volume>102</volume>
            <issue>5</issue>
            <fpage>1402</fpage>
            <lpage>1407</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1073/pnas.0406555102</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Clustering analysis of SAGE data using a Poisson approach</p>
            </title>
            <aug>
               <au>
                  <snm>Cai</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Huang</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Blackshaw</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Liu</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Cepko</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Wong</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <issue>7</issue>
            <fpage>R51</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">463327</pubid>
                  <pubid idtype="pmpid" link="fulltext">15239836</pubid>
                  <pubid idtype="doi">10.1186/gb-2004-5-7-r51</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Statistical Methods in Serial Analysis of Gene Expression (SAGE)</p>
            </title>
            <aug>
               <au>
                  <snm>Vencio</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Brentani</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Computational and Statistical Approaches to Genomics</source>
            <publisher>New York City, New York: Springer</publisher>
            <editor>Zhang W, Shmulevich I</editor>
            <edition>2</edition>
            <pubdate>2006</pubdate>
            <fpage>209</fpage>
            <lpage>233</lpage>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Modeling Sage data with a truncated gamma-Poisson model</p>
            </title>
            <aug>
               <au>
                  <snm>Thygesen</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Zwinderman</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>157</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1479844</pubid>
                  <pubid idtype="pmpid" link="fulltext">16549008</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-7-157</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <aug>
               <au>
                  <snm>Aitchison</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>The Statistical Analysis of Compositional Data. Monographs on Statistics and Applied Probability</source>
            <publisher>London: Chapman and Hall</publisher>
            <pubdate>1986</pubdate>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Simplicial inference</p>
            </title>
            <aug>
               <au>
                  <snm>Aitchison</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Algebraic Methods in Statistics and Probability, no. 287 in Contemporary Mathematics Series</source>
            <publisher>Providence, Rhode Island: American Mathematical Society</publisher>
            <editor>Viana M, Richards D</editor>
            <pubdate>2001</pubdate>
            <fpage>1</fpage>
            <lpage>22</lpage>
         </bibl>
         <bibl id="B20">
            <title>
               <p>An integrated tool for microarray data clustering and cluster validity assessment</p>
            </title>
            <aug>
               <au>
                  <snm>Bolshakova</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Azuaje</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Cunningham</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>4</issue>
            <fpage>451</fpage>
            <lpage>455</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti190</pubid>
                  <pubid idtype="pmpid" link="fulltext">15608048</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>GNU General Public License</p>
            </title>
            <url>http://www.gnu.org/licenses/gpl.txt</url>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Open source clustering software</p>
            </title>
            <aug>
               <au>
                  <snm>de Hoon</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Imoto</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Nolan</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Miyano</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <fpage>1453</fpage>
            <lpage>1454</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bth078</pubid>
                  <pubid idtype="pmpid" link="fulltext">14871861</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>GNU Scientific Library</p>
            </title>
            <url>http://www.gnu.org/software/gsl</url>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Cairo Graphics</p>
            </title>
            <url>http://cairographics.org</url>
         </bibl>
         <bibl id="B25">
            <title>
               <p>TreeView: an application to display phylogenetic trees on personal computers</p>
            </title>
            <aug>
               <au>
                  <snm>Page</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Computer Applications in the Biosciences</source>
            <pubdate>1996</pubdate>
            <volume>12</volume>
            <issue>4</issue>
            <fpage>357</fpage>
            <lpage>358</lpage>
            <xrefbib>
               <pubid idtype="pmpid">8902363</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Gene Expression Omnibus database</p>
            </title>
            <url>http://www.ncbi.nlm.nih.gov/geo</url>
         </bibl>
         <bibl id="B27">
            <title>
               <p>A probabilistic theory of clustering</p>
            </title>
            <aug>
               <au>
                  <snm>Dougherty</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Brun</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Pattern Recognition</source>
            <pubdate>2004</pubdate>
            <volume>37</volume>
            <issue>5</issue>
            <fpage>917</fpage>
            <lpage>925</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/j.patcog.2003.10.003</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Model-based evaluation of clustering validation measures</p>
            </title>
            <aug>
               <au>
                  <snm>Brun</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Sima</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Hua</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Lowey</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Carroll</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Suh</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Dougherty</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Pattern Recognition</source>
            <pubdate>2007</pubdate>
            <volume>40</volume>
            <issue>3</issue>
            <fpage>807</fpage>
            <lpage>824</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/j.patcog.2006.06.026</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes</p>
            </title>
            <aug>
               <au>
                  <snm>Datta</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Datta</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>397</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1590054</pubid>
                  <pubid idtype="pmpid" link="fulltext">16945146</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-7-397</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Metric for measuring the effectiveness of clustering of DNA microarray expression</p>
            </title>
            <aug>
               <au>
                  <snm>Loganantharaj</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Cheepala</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Clifford</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <issue>Suppl 2</issue>
            <fpage>S5</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1683560</pubid>
                  <pubid idtype="pmpid" link="fulltext">17118148</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-7-S2-S5</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Simcluster Home Page</p>
            </title>
            <url>http://xerad.systemsbiology.net/simcluster</url>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Innate Immunity Systems Biology</p>
            </title>
            <url>http://www.innateimmunity-systemsbiology.org</url>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Systems biology approaches identify ATF3 as a negative regulator of Toll-like receptor 4</p>
            </title>
            <aug>
               <au>
                  <snm>Gilchrist</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Thorsson</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Rust</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Korb</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Kennedy</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Hai</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Bolouri</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Aderem</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2006</pubdate>
            <volume>441</volume>
            <fpage>173</fpage>
            <lpage>178</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature04768</pubid>
                  <pubid idtype="pmpid" link="fulltext">16688168</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B34">
            <title>
               <p>The R Project for Statistical Computing</p>
            </title>
            <url>http://www.r-project.org</url>
         </bibl>
      </refgrp>
   </bm>
</art>
