<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-6-289</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Software</dochead>
      <bibl>
         <title>
            <p>GenClust: A genetic algorithm for clustering gene expression data</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Di Ges&#250;</snm>
               <fnm>Vito</fnm>
               <insr iid="I1"/>
               <email>digesu@math.unipa.it</email>
            </au>
            <au id="A2" ca="yes">
               <snm>Giancarlo</snm>
               <fnm>Raffaele</fnm>
               <insr iid="I1"/>
               <email>raffaele@math.unipa.it</email>
            </au>
            <au id="A3">
               <snm>Lo Bosco</snm>
               <fnm>Giosu&#233;</fnm>
               <insr iid="I1"/>
               <email>lobosco@math.unipa.it</email>
            </au>
            <au id="A4">
               <snm>Raimondi</snm>
               <fnm>Alessandra</fnm>
               <insr iid="I1"/>
               <email>Aleworld@email.it</email>
            </au>
            <au id="A5">
               <snm>Scaturro</snm>
               <fnm>Davide</fnm>
               <insr iid="I1"/>
               <email>scatdav@simail.it</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Dipartimento di Matematica ed Applicazioni, Universit&#225; di Palermo, Via Archirafi 34, 90123 Palermo, Italy</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2005</pubdate>
         <volume>6</volume>
         <issue>1</issue>
         <fpage>289</fpage>
         <url>http://www.biomedcentral.com/1471-2105/6/289</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">16336639</pubid>
               <pubid idtype="doi">10.1186/1471-2105-6-289</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>18</day>
               <month>5</month>
               <year>2005</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>07</day>
               <month>12</month>
               <year>2005</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>07</day>
               <month>12</month>
               <year>2005</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2005</year>
         <collab>Di Ges&#250; et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Clustering is a key step in the analysis of gene expression data: many classical clustering algorithms are used for the task, and more innovative ones have been designed and validated for it. Despite the widespread use of artificial intelligence techniques in bioinformatics and, more generally, in data analysis, very few clustering algorithms are based on the genetic paradigm, even though that paradigm has great potential for finding good heuristic solutions to a difficult optimization problem such as clustering.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p><it>GenClust </it>is a new genetic algorithm for clustering gene expression data. It has two key features: (a) a novel coding of the search space that is simple, compact and easy to update; (b) it can be used naturally in conjunction with data driven internal validation methods. We have experimented with the FOM methodology, specifically conceived for validating clusters of gene expression data. The validity of <it>GenClust </it>has been assessed experimentally on real data sets, both with the use of validation measures and in comparison with other algorithms, i.e., <it>Average Link, Cast, Click </it>and <it>K-means</it>.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Experiments show that none of the algorithms we have used is markedly superior to the others across data sets and validation measures; i.e., in many cases the observed differences between the worst and best performing algorithm may be statistically insignificant and they could be considered equivalent. However, there are cases in which an algorithm may be better than others and therefore worthwhile. In particular, experiments for <it>GenClust </it>show that, although simple in its data representation, it converges very rapidly to a local optimum and that its ability to identify meaningful clusters is comparable, and sometimes superior, to that of more sophisticated algorithms. In addition, it is well suited for use in conjunction with data driven internal validation measures and, in particular, the FOM methodology.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>In recent years, the advent of high density arrays of oligonucleotides and cDNAs has had a deep impact on biological and medical research. Indeed, the new technology enables the acquisition of data that is proving to be fundamental in many areas of the biological sciences, ranging from the understanding of complex biological systems to clinical diagnosis (see for instance the Stanford Microarray Database <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>).</p>
         <p>Due to the large number of genes involved in each experiment, cluster analysis is a very useful exploratory technique that aims to identify genes exhibiting similar expression patterns, which may highlight groups of functionally related genes. This leads, in turn, to two well established and rich research areas. One deals with the design of new clustering algorithms and the other with the design of new validation techniques that should assess the biological relevance of the clustering solutions found. Despite the vast amount of knowledge available in those two areas <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>, gene expression data provide unique challenges, in particular with respect to internal validation criteria. Indeed, such criteria must predict how many clusters are really present in a data set, an already difficult task, made even harder by the fact that the estimate must be sensitive enough to capture the inherent biological structure of functionally related genes. As a consequence, a new and very active area of research for cluster analysis has flourished <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp>. Techniques in artificial intelligence find wide application in bioinformatics and, more generally, in data analysis <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. Although clustering plays a central role in these areas, very few clustering algorithms based on the genetic paradigm are available <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>, even though such a powerful paradigm <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> has great potential for tackling a difficult optimization problem such as clustering, in particular for high dimensional gene expression data.</p>
         <p>Here we present a genetic algorithm, referred to as <it>GenClust</it>, for clustering gene expression data and show experimentally that it is competitive both with classical algorithms, such as <it>K-means </it><abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, and with more innovative, state-of-the-art ones, such as <it>Click </it><abbrgrp><abbr bid="B17">17</abbr></abbrgrp> and <it>Cast </it><abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. Moreover, the algorithm is well suited for use in conjunction with data driven internal validation methodologies <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp> and in particular FOM, which has received great attention in the specialized literature <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>. Finally, we mention that <it>GenClust </it>is a generic clustering algorithm that can also be used in other data analysis tasks, e.g., sample classification, exactly like the other algorithms used in our study.</p>
      </sec>
      <sec>
         <st>
            <p>Implementation</p>
         </st>
         <sec>
            <st>
               <p>Clustering as an optimization problem</p>
            </st>
            <p>Let <it>X </it>= {<it>x</it><sub>1</sub>, <it>x</it><sub>2</sub>, ..., <it>x</it><sub><it>n</it></sub>} be a set of elements, where each element is a <it>d</it>-dimensional vector. In our case, each gene is an element <it>x </it>&#8712; <it>X</it>, and the <it>j</it>-th component of <it>x </it>is the value of its expression level under experimental condition <it>j</it>, 1 &#8804; <it>j </it>&#8804; <it>d</it>. Given a subset <it>Y </it>= {<it>y</it><sub>1</sub>, <it>y</it><sub>2</sub>, ..., <it>y</it><sub><it>m</it></sub>} of <it>X</it>, let <it>c</it>(<it>Y</it>) denote the centroid of <it>Y </it>and let its variance be</p>
            <p>
               <m:math name="1471-2105-6-289-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:mi>V</m:mi>
                        <m:mi>A</m:mi>
                        <m:mi>R</m:mi>
                        <m:mo stretchy="false">(</m:mo>
                        <m:mi>Y</m:mi>
                        <m:mo stretchy="false">)</m:mo>
                        <m:mo>=</m:mo>
                        <m:mfrac>
                           <m:mn>1</m:mn>
                           <m:mi>m</m:mi>
                        </m:mfrac>
                        <m:mstyle displaystyle="true">
                           <m:munderover>
                              <m:mo>&#8721;</m:mo>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mo>=</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                              <m:mi>m</m:mi>
                           </m:munderover>
                           <m:mrow>
                              <m:mstyle displaystyle="true">
                                 <m:munderover>
                                    <m:mo>&#8721;</m:mo>
                                    <m:mrow>
                                       <m:mi>j</m:mi>
                                       <m:mo>=</m:mo>
                                       <m:mn>1</m:mn>
                                    </m:mrow>
                                    <m:mi>d</m:mi>
                                 </m:munderover>
                                 <m:mrow>
                                    <m:msup>
                                       <m:mrow>
                                          <m:mo stretchy="false">(</m:mo>
                                          <m:msub>
                                             <m:mi>y</m:mi>
                                             <m:mrow>
                                                <m:mi>i</m:mi>
                                                <m:mo>,</m:mo>
                                                <m:mi>j</m:mi>
                                             </m:mrow>
                                          </m:msub>
                                          <m:mo>&#8722;</m:mo>
                                          <m:mi>c</m:mi>
                                          <m:msub>
                                             <m:mrow>
                                                <m:mo stretchy="false">(</m:mo>
                                                <m:mi>Y</m:mi>
                                                <m:mo stretchy="false">)</m:mo>
                                             </m:mrow>
                                             <m:mi>j</m:mi>
                                          </m:msub>
                                          <m:mo stretchy="false">)</m:mo>
                                       </m:mrow>
                                       <m:mn>2</m:mn>
                                    </m:msup>
                                    <m:mo>.</m:mo>
                                 </m:mrow>
                              </m:mstyle>
                           </m:mrow>
                        </m:mstyle>
                        <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                        <m:mo stretchy="false">(</m:mo>
                        <m:mn>1</m:mn>
                        <m:mo stretchy="false">)</m:mo>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGwbGvcqWGbbqqcqWGsbGucqGGOaakcqWGzbqwcqGGPaqkcqGH9aqpdaWcaaqaaiabigdaXaqaaiabd2gaTbaadaaeWbqaamaaqahabaGaeiikaGIaemyEaK3aaSbaaSqaaiabdMgaPjabcYcaSiabdQgaQbqabaGccqGHsislcqWGJbWycqGGOaakcqWGzbqwcqGGPaqkdaWgaaWcbaGaemOAaOgabeaakiabcMcaPmaaCaaaleqabaGaeGOmaidaaOGaeiOla4caleaacqWGQbGAcqGH9aqpcqaIXaqmaeaacqWGKbaza0GaeyyeIuoaaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabd2gaTbqdcqGHris5aOGaaCzcaiaaxMaacqGGOaakcqaIXaqmcqGGPaqkaaa@5801@</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
            <p>Given an integer <it>k</it>, we are interested in finding a partition <m:math name="1471-2105-6-289-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi>P</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacaWFqbaaaa@396B@</m:annotation></m:semantics></m:math> of <it>X </it>into <it>k </it>classes <it>C</it><sub>0</sub>, <it>C</it><sub>1</sub>, ..., <it>C</it><sub><it>k</it>-1 </sub>so that the total internal variance</p>
            <p>
               <m:math name="1471-2105-6-289-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:mi>V</m:mi>
                        <m:mi>A</m:mi>
                        <m:mi>R</m:mi>
                        <m:mo stretchy="false">(</m:mo>
                        <m:mi>P</m:mi>
                        <m:mo stretchy="false">)</m:mo>
                        <m:mo>=</m:mo>
                        <m:mstyle displaystyle="true">
                           <m:munderover>
                              <m:mo>&#8721;</m:mo>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mo>=</m:mo>
                                 <m:mn>0</m:mn>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>k</m:mi>
                                 <m:mo>&#8722;</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                           </m:munderover>
                           <m:mrow>
                              <m:mi>V</m:mi>
                              <m:mi>A</m:mi>
                              <m:mi>R</m:mi>
                              <m:mo stretchy="false">(</m:mo>
                              <m:msub>
                                 <m:mi>C</m:mi>
                                 <m:mi>i</m:mi>
                              </m:msub>
                              <m:mo stretchy="false">)</m:mo>
                           </m:mrow>
                        </m:mstyle>
                        <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                        <m:mo stretchy="false">(</m:mo>
                        <m:mn>2</m:mn>
                        <m:mo stretchy="false">)</m:mo>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaacqWGwbGvcqWGbbqqcqWGsbGucqGGOaakimaacaWFqbGaeiykaKIaeyypa0ZaaabCaeaacqWGwbGvcqWGbbqqcqWGsbGucqGGOaakcqWGdbWqdaWgaaWcbaGaemyAaKgabeaakiabcMcaPaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaem4AaSMaeyOeI0IaeGymaedaniabggHiLdGccaWLjaGaaCzcaiabcIcaOiabikdaYiabcMcaPaaa@5410@</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
            <p>is minimized. <it>GenClust </it>provides a feasible solution to the posed optimization problem, and experiments show its convergence to a local optimum.</p>
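            <p>As an illustration only, equations (1) and (2) can be sketched in Python as follows. The function names <it>centroid</it>, <it>var </it>and <it>total_internal_variance </it>are hypothetical and are not taken from the <it>GenClust </it>code. Equation (1) is undefined for an empty class; the sketch assumes that such a class contributes a variance of 0.</p>
            <p><![CDATA[
```python
from typing import List, Sequence

Vector = Sequence[float]


def centroid(Y: Sequence[Vector]) -> List[float]:
    """Component-wise mean c(Y) of a non-empty collection of d-dimensional vectors."""
    m = len(Y)
    d = len(Y[0])
    return [sum(y[j] for y in Y) / m for j in range(d)]


def var(Y: Sequence[Vector]) -> float:
    """VAR(Y) as in equation (1): mean squared Euclidean distance to the centroid."""
    if not Y:
        # Assumption of this sketch: an empty class contributes no variance.
        return 0.0
    c = centroid(Y)
    return sum((y[j] - c[j]) ** 2 for y in Y for j in range(len(c))) / len(Y)


def total_internal_variance(X: Sequence[Vector], labels: Sequence[int], k: int) -> float:
    """VAR(P) as in equation (2): sum of VAR(C_i) over the k classes C_0, ..., C_{k-1}."""
    classes: List[List[Vector]] = [[] for _ in range(k)]
    for x, lam in zip(X, labels):
        classes[lam].append(x)
    return sum(var(C) for C in classes)
```
]]></p>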
         </sec>
         <sec>
            <st>
               <p>The algorithm GenClust</p>
            </st>
            <p><it>GenClust </it>proceeds in stages, producing a sequence of partitions <m:math name="1471-2105-6-289-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>P</m:mi><m:mi>i</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacaWFqbWaaSbaaSqaaiabdMgaPbqabaaaaa@3AF2@</m:annotation></m:semantics></m:math>, each consisting of <it>k </it>classes, until a halting condition is met. Let <it>&#945; </it>= (<it>x</it>, <it>&#955;</it>) be an <it>individual</it>, <it>x </it>&#8712; <it>X </it>and 0 &#8804; <it>&#955; </it>&lt;<it>k</it>. A partition <m:math name="1471-2105-6-289-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>P</m:mi><m:mi>i</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacaWFqbWaaSbaaSqaaiabdMgaPbqabaaaaa@3AF2@</m:annotation></m:semantics></m:math> is best seen as a collection of individuals arranged in any order, i.e., a population. Only at the end, <it>GenClust </it>assembles elements according to cluster number. Following the evolutionary computational paradigm, a population evolves by means of genetic operators, i.e., cross-over, mutation and selection, resulting in a random walk in cluster space, where the fitness function gives a drift to the process towards a local optimum.</p>
            <p>The internal data representation and coding are crucial to <it>GenClust</it>. The elements of <it>X </it>are stored in an <it>n </it>&#215; <it>d </it>matrix, and the row <it>r</it>(<it>x</it>), corresponding to <it>x</it>, is the internal name of <it>x</it>. We also keep the inverse mapping <it>r</it><sup>-1</sup>(<it>i</it>) = <it>x</it>, 0 &#8804; <it>i </it>&lt;<it>n</it>. A partition <m:math name="1471-2105-6-289-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi>P</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacaWFqbaaaa@396B@</m:annotation></m:semantics></m:math> of <it>X </it>is encoded with a list of <it>n </it>32-bit strings, each representing an individual (<it>x</it>, <it>&#955;</it>). That individual is encoded, one-to-many, by arbitrarily choosing a string <it>s </it>from a set of 32-bit strings, as follows. The least significant 8 bits of <it>s </it>give a "representation" of <it>&#955; </it>and the remaining ones a "representation" of <it>r</it>(<it>x</it>). If <it>i </it>= <it>r</it>(<it>x</it>) is in [0, <it>n </it>- 2], the binary encoding of any integer in <m:math name="1471-2105-6-289-i5" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mrow><m:mo>[</m:mo><m:mrow><m:mi>i</m:mi><m:mo>&#8727;</m:mo><m:mrow><m:mo>&#8970;</m:mo><m:mrow><m:mfrac><m:mrow><m:msup><m:mn>2</m:mn><m:mrow><m:mn>24</m:mn></m:mrow></m:msup></m:mrow><m:mi>n</m:mi></m:mfrac></m:mrow><m:mo>&#8971;</m:mo></m:mrow><m:mo>,</m:mo><m:mo stretchy="false">(</m:mo><m:mi>i</m:mi><m:mo>+</m:mo><m:mn>1</m:mn><m:mo stretchy="false">)</m:mo><m:mo>&#8727;</m:mo><m:mrow><m:mo>&#8970;</m:mo><m:mrow><m:mfrac><m:mrow><m:msup><m:mn>2</m:mn><m:mrow><m:mn>24</m:mn></m:mrow></m:msup></m:mrow><m:mi>n</m:mi></m:mfrac></m:mrow><m:mo>&#8971;</m:mo></m:mrow><m:mo>&#8722;</m:mo><m:mn>1</m:mn></m:mrow><m:mo>]</m:mo></m:mrow></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaWadaqaaiabdMgaPjabgEHiQmaagmaabaWaaSaaaeaacqaIYaGmdaahaaWcbeqaaiabikdaYiabisda0aaaaOqaaiabd6gaUbaaaiaawcp+caGL7JpacqGGSaalcqGGOaakcqWGPbqAcqGHRaWkcqaIXaqmcqGGPaqkcqGHxiIkdaGbdaqaamaalaaabaGaeGOmaiZaaWbaaSqabeaacqaIYaGmcqaI0aanaaaakeaacqWGUbGBaaaacaGLWJVaay5+4dGaeyOeI0IaeGymaedacaGLBbGaayzxaaaaaa@4CB1@</m:annotation></m:semantics></m:math> will do. Otherwise, the binary encoding of any integer in <m:math name="1471-2105-6-289-i6" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mrow><m:mo>[</m:mo><m:mrow><m:mi>n</m:mi><m:mrow><m:mo>&#8970;</m:mo><m:mrow><m:mfrac><m:mrow><m:msup><m:mn>2</m:mn><m:mrow><m:mn>24</m:mn></m:mrow></m:msup></m:mrow><m:mi>n</m:mi></m:mfrac></m:mrow><m:mo>&#8971;</m:mo></m:mrow><m:mo>,</m:mo><m:msup><m:mn>2</m:mn><m:mrow><m:mn>24</m:mn></m:mrow></m:msup><m:mo>&#8722;</m:mo><m:mn>1</m:mn></m:mrow><m:mo>]</m:mo></m:mrow></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaWadaqaaiabd6gaUnaagmaabaWaaSaaaeaacqaIYaGmdaahaaWcbeqaaiabikdaYiabisda0aaaaOqaaiabd6gaUbaaaiaawcp+caGL7JpacqGGSaalcqaIYaGmdaahaaWcbeqaaiabikdaYiabisda0aaakiabgkHiTiabigdaXaGaay5waiaaw2faaaaa@3F71@</m:annotation></m:semantics></m:math> will do. Analogous rules apply to <it>&#955;</it>, except that 2<sup>24 </sup>and <it>n </it>are replaced by 2<sup>8 </sup>and <it>k</it>, respectively. Given any 32-bit string, we can recover in a constant number of operations the unique (<it>r</it>(<it>x</it>), <it>&#955;</it>) of which it can be an encoding, and therefore (<it>x</it>, <it>&#955;</it>) (via the inverse mapping <it>r</it><sup>-1</sup>). The straightforward details are omitted. In what follows, <it>D</it>(<it>s</it>) returns (<it>r</it>(<it>x</it>), <it>&#955;</it>), with <it>D</it><sub>1</sub>(<it>s</it>) = <it>r</it>(<it>x</it>) and <it>D</it><sub>2</sub>(<it>s</it>) = <it>&#955;</it>, <it>x </it>&#8712; <it>X </it>and 0 &#8804; <it>&#955; </it>&lt;<it>k</it>. The chosen encoding is compact, easy to handle, and allows up to 2<sup>8 </sup>= 256 classes and data sets of up to 2<sup>24 </sup>= 16,777,216 elements, values adequate for real applications.</p>
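            <p>Since the details of the coding are omitted above, the following Python fragment is only one possible reading of it, not the actual <it>GenClust </it>implementation. All names (<it>encode</it>, <it>decode </it>and the two helpers) are hypothetical. The sketch assumes that the last row, <it>n </it>- 1, owns every 24-bit value from (<it>n </it>- 1)&#8727;&#8970;2<sup>24</sup>/<it>n</it>&#8971; up to 2<sup>24 </sup>- 1, and likewise for the last class, so that every 32-bit string decodes to exactly one pair; it also assumes that "arbitrarily choosing" a string means drawing it uniformly at random from the admissible range.</p>
            <p><![CDATA[
```python
import random

ROW_BITS = 24    # most significant bits: "representation" of r(x)
LABEL_BITS = 8   # least significant bits: "representation" of lambda


def _encode_field(value, count, bits, rng):
    """Pick any code in the sub-range of [0, 2**bits - 1] owned by `value`."""
    width = (1 << bits) // count           # floor(2**bits / count)
    lo = value * width
    if value < count - 1:
        hi = (value + 1) * width - 1
    else:
        # Assumption of this sketch: the last value owns the remaining codes.
        hi = (1 << bits) - 1
    return rng.randint(lo, hi)


def _decode_field(code, count, bits):
    """Recover the value owning `code` with a constant number of operations."""
    width = (1 << bits) // count
    return min(code // width, count - 1)


def encode(row, label, n, k, rng=random):
    """One of the many 32-bit strings representing the pair (r(x), lambda).

    Requires 0 <= row < n <= 2**24 and 0 <= label < k <= 2**8.
    """
    high = _encode_field(row, n, ROW_BITS, rng)
    low = _encode_field(label, k, LABEL_BITS, rng)
    return (high << LABEL_BITS) | low


def decode(s, n, k):
    """D(s) = (D_1(s), D_2(s)) = (r(x), lambda) for any 32-bit string s."""
    row = _decode_field(s >> LABEL_BITS, n, ROW_BITS)
    label = _decode_field(s & ((1 << LABEL_BITS) - 1), k, LABEL_BITS)
    return row, label
```
]]></p>
            <p>Under these assumptions, decoding never fails: any bit pattern, including one produced by recombining or mutating valid strings, maps back to a pair with 0 &#8804; <it>r</it>(<it>x</it>) &lt;<it>n </it>and 0 &#8804; <it>&#955; </it>&lt;<it>k</it>.</p>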
            <p>The initial partition <m:math name="1471-2105-6-289-i7" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>P</m:mi><m:mn>0</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacaWFqbWaaSbaaSqaaiabicdaWaqabaaaaa@3A85@</m:annotation></m:semantics></m:math> can be computed by either randomly partitioning the elements of <it>X </it>into <it>k </it>classes or by using a user specified partition of the elements of <it>X</it>, such as the one produced by yet another clustering algorithm.</p>
            <p>The heart of <it>GenClust </it>is the transition in cluster space from <m:math name="1471-2105-6-289-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>P</m:mi><m:mi>i</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacaWFqbWaaSbaaSqaaiabdMgaPbqabaaaaa@3AF2@</m:annotation></m:semantics></m:math> to <m:math name="1471-2105-6-289-i8" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>P</m:mi><m:mrow><m:mi>i</m:mi><m:mo>+</m:mo><m:mn>1</m:mn></m:mrow></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacaWFqbWaaSbaaSqaaiabdMgaPjabgUcaRiabigdaXaqabaaaaa@3CC4@</m:annotation></m:semantics></m:math>, <it>i </it>&#8805; 0. This is accomplished by a proper manipulation of the 32-bit strings in the list <it>L</it><sub><it>i </it></sub>= (<it>s</it><sub>0</sub>, <it>s</it><sub>1</sub>, ..., <it>s</it><sub><it>n</it>-1</sub>) encoding <m:math name="1471-2105-6-289-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>P</m:mi><m:mi>i</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacaWFqbWaaSbaaSqaaiabdMgaPbqabaaaaa@3AF2@</m:annotation></m:semantics></m:math>. Assume that <it>L</it><sub><it>i </it></sub>is sorted according to the internal representation of the elements; i.e., <it>D</it><sub>1</sub>(<it>s</it><sub><it>p</it></sub>) &lt;<it>D</it><sub>1</sub>(<it>s</it><sub><it>j</it></sub>), <it>p </it>&lt;<it>j</it>. The following steps are applied in order.</p>
            <sec>
               <st>
                  <p>Cross-over</p>
               </st>
               <p>The objective is to produce a list <it>L</it><sub><it>temp </it></sub>of new binary strings by properly recombining the ones in <it>L</it><sub><it>i</it></sub>. For each string <it>s</it><sub><it>j</it></sub>, 0 &#8804; <it>j </it>&lt;<it>n</it>, the standard one point cross-over operation is performed <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, with probability 0.9. The second string is chosen at random from the ones in <it>L</it><sub><it>i </it></sub>- {<it>s</it><sub><it>j</it></sub>}. The cross-over operation generates two new strings that are appended to <it>L</it><sub><it>temp</it></sub>. At the end, <it>L</it><sub><it>temp </it></sub>is a list of <it>m </it>&#8804; <it>n </it>32-bit strings. Notice that, because of the encoding and decoding process we are using, each recombined string still represents a pair (<it>r</it>(<it>x</it>), <it>&#955;</it>), with 0 &#8804; <it>r</it>(<it>x</it>) &lt;<it>n </it>and 0 &#8804; <it>&#955; </it>&lt;<it>k</it>.</p>
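               <p>A minimal Python sketch of this step is given below, under stated assumptions; it is not the <it>GenClust </it>source. The names <it>one_point_crossover </it>and <it>crossover_step </it>are hypothetical, the cut point is assumed to be drawn uniformly among the 31 interior bit positions, and the second parent is selected by list position. The sketch appends both offspring of every performed cross-over, so its output can hold up to 2<it>n </it>strings; it does not try to reproduce the bound <it>m </it>&#8804; <it>n </it>stated above.</p>
               <p><![CDATA[
```python
import random

WORD_BITS = 32


def one_point_crossover(a, b, point):
    """Swap the `point` least significant bits of two 32-bit strings."""
    low_mask = (1 << point) - 1
    high_mask = ((1 << WORD_BITS) - 1) ^ low_mask
    child1 = (a & high_mask) | (b & low_mask)
    child2 = (b & high_mask) | (a & low_mask)
    return child1, child2


def crossover_step(L, p_cross=0.9, rng=random):
    """Build L_temp from the current list L of 32-bit strings."""
    L_temp = []
    n = len(L)
    for j in range(n):
        # Each string s_j takes part in a cross-over with probability p_cross.
        if n < 2 or rng.random() >= p_cross:
            continue
        # Second parent: a position chosen at random among all positions except j.
        p = rng.randrange(n - 1)
        if p >= j:
            p += 1
        # Assumption: cut point uniform over the interior positions 1..31.
        point = rng.randint(1, WORD_BITS - 1)
        L_temp.extend(one_point_crossover(L[j], L[p], point))
    return L_temp
```
]]></p>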
            </sec>
            <sec>
               <st>
                  <p>First selection</p>
               </st>
               <p>Notice that while each string in <it>L</it><sub><it>i </it></sub>corresponds to exactly one element <it>x </it>&#8712; <it>X </it>and vice versa, that is no longer true for the concatenated list <it>L</it><sub><it>i </it></sub>&#9675; <it>L</it><sub><it>temp</it></sub>, in which several strings may decode to the same element. We eliminate duplicates by keeping, for each <it>j </it>= 0, ..., <it>n </it>- 1, only the rightmost string <it>s </it>in <it>L</it><sub><it>i </it></sub>&#9675; <it>L</it><sub><it>temp </it></sub>such that <it>D</it><sub>1</sub>(<it>s</it>) = <it>j</it>. Denote the result by <it>L'</it>.</p>
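               <p>The duplicate elimination can be sketched in Python as follows. This is an illustration rather than the authors' code: <it>first_selection </it>is a hypothetical name, and the decoder <it>D</it><sub>1 </sub>is passed in as a function <it>d1</it>, assumed to return a row in 0, ..., <it>n </it>- 1 for every string. Because every row already occurs in <it>L</it><sub><it>i</it></sub>, each slot of the result is filled, and the result comes out ordered by row.</p>
               <p><![CDATA[
```python
from typing import Callable, List, Optional, Sequence


def first_selection(
    L_i: Sequence[int],
    L_temp: Sequence[int],
    n: int,
    d1: Callable[[int], int],
) -> List[Optional[int]]:
    """Keep, for each row j, the rightmost string s of L_i followed by L_temp with d1(s) == j."""
    rightmost: List[Optional[int]] = [None] * n
    for s in list(L_i) + list(L_temp):
        # A later occurrence of the same row overwrites an earlier one,
        # so only the rightmost string for each row survives.
        rightmost[d1(s)] = s
    return rightmost
```
]]></p>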
            </sec>
            <sec>
               <st>
                  <p>One-bit mutation</p>
               </st>
               <p><it>L' </it>is an encoding of a partition related to <m:math name="1471-2105-6-289-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>P</m:mi><m:mi>i</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacaWFqbWaaSbaaSqaaiabdMgaPbqabaaaaa@3AF2@</m:annotation></m:semantics></m:math>. In order to climb out of local minima, it is perturbed as follows. For <it>j </it>= 0, ..., <it>n </it>- 1, a one-bit mutation is applied to <m:math name="1471-2105-6-289-i9" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>s</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mi>j</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGZbWCgaqbamaaBaaaleaacqWGQbGAaeqaaaaa@2FB2@</m:annotation></m:semantics></m:math> &#8712; <it>L' </it>with probability 0.01, resulting in a string <it>s</it>. There are several possible outcomes. The mutation is silent, i.e., <it>D</it>(<m:math name="1471-2105-6-289-i9" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>s</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mi>j</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGZbWCgaqbamaaBaaaleaacqWGQbGAaeqaaaaa@2FB2@</m:annotation></m:semantics></m:math>) = <it>D</it>(<it>s</it>). No action is taken. It affects the cluster membership of <it>D</it><sub>1</sub>(<m:math name="1471-2105-6-289-i9" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>s</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mi>j</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGZbWCgaqbamaaBaaaleaacqWGQbGAaeqaaaaa@2FB2@</m:annotation></m:semantics></m:math>), i.e., <it>D</it><sub>1</sub>(<m:math name="1471-2105-6-289-i9" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>s</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mi>j</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGZbWCgaqbamaaBaaaleaacqWGQbGAaeqaaaaa@2FB2@</m:annotation></m:semantics></m:math>) = <it>D</it><sub>1</sub>(<it>s</it>) but <it>D</it><sub>2</sub>(<m:math name="1471-2105-6-289-i9" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>s</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mi>j</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGZbWCgaqbamaaBaaaleaacqWGQbGAaeqaaaaa@2FB2@</m:annotation></m:semantics></m:math>) &#8800; <it>D</it><sub>2</sub>(<it>s</it>), or it causes a collision, i.e., there exists an <m:math name="1471-2105-6-289-i10" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>s</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mi>p</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGZbWCgaqbamaaBaaaleaacqWGWbaCaeqaaaaa@2FBE@</m:annotation></m:semantics></m:math> in <it>L'</it>, <it>p </it>&#8800; <it>j</it>, such that <it>D</it><sub>1</sub>(<it>s</it>) = <it>D</it><sub>1</sub>(<m:math name="1471-2105-6-289-i10" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>s</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mi>p</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGZbWCgaqbamaaBaaaleaacqWGWbaCaeqaaaaa@2FBE@</m:annotation></m:semantics></m:math>). Then, <it>s </it>replaces <m:math name="1471-2105-6-289-i9" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>s</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mi>j</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGZbWCgaqbamaaBaaaleaacqWGQbGAaeqaaaaa@2FB2@</m:annotation></m:semantics></m:math>.</p>
            </sec>
            <sec>
               <st>
                  <p>Second selection</p>
               </st>
               <p>We now have two lists <it>L</it><sub><it>i </it></sub>and <it>L' </it>of <it>n </it>32-bit strings, representing the encoding of <m:math name="1471-2105-6-289-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>P</m:mi><m:mi>i</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacaWFqbWaaSbaaSqaaiabdMgaPbqabaaaaa@3AF2@</m:annotation></m:semantics></m:math> and <m:math name="1471-2105-6-289-i11" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>P</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mi>i</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaaceWFqbGbauaadaWgaaWcbaGaemyAaKgabeaaaaa@3AFE@</m:annotation></m:semantics></m:math> where this latter one is possibly a new partition. Let <it>L' </it>be sorted according to the internal representation of the elements, i.e., <it>D</it><sub>1</sub>(<m:math name="1471-2105-6-289-i10" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>s</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mi>p</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGZbWCgaqbamaaBaaaleaacqWGWbaCaeqaaaaa@2FBE@</m:annotation></m:semantics></m:math>) &lt;<it>D</it><sub>1</sub>(<m:math name="1471-2105-6-289-i9" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>s</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mi>j</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGZbWCgaqbamaaBaaaleaacqWGQbGAaeqaaaaa@2FB2@</m:annotation></m:semantics></m:math>), <it>p </it>&lt;<it>j</it>. The encoding <it>L</it><sub><it>i</it>+1 </sub>= {<it>c</it><sub>0</sub>, ..., <it>c</it><sub><it>n</it>-1</sub>} of <m:math name="1471-2105-6-289-i8" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>P</m:mi><m:mrow><m:mi>i</m:mi><m:mo>+</m:mo><m:mn>1</m:mn></m:mrow></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacaWFqbWaaSbaaSqaaiabdMgaPjabgUcaRiabigdaXaqabaaaaa@3CC4@</m:annotation></m:semantics></m:math> is obtained via the following selection process:</p>
               <p>
                  <m:math name="1471-2105-6-289-i12" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>c</m:mi>
                              <m:mi>r</m:mi>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:mrow>
                              <m:mo>{</m:mo>
                              <m:mrow>
                                 <m:mtable columnalign="left">
                                    <m:mtr columnalign="left">
                                       <m:mtd columnalign="left">
                                          <m:mrow>
                                             <m:msub>
                                                <m:msup>
                                                   <m:mi>s</m:mi>
                                                   <m:mo>&#8242;</m:mo>
                                                </m:msup>
                                                <m:mi>r</m:mi>
                                             </m:msub>
                                          </m:mrow>
                                       </m:mtd>
                                       <m:mtd columnalign="left">
                                          <m:mrow>
                                             <m:mi>i</m:mi>
                                             <m:mi>f</m:mi>
                                             <m:mtext>&#8201;</m:mtext>
                                             <m:mi>f</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>D</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:msub>
                                                <m:msup>
                                                   <m:mi>s</m:mi>
                                                   <m:mo>&#8242;</m:mo>
                                                </m:msup>
                                                <m:mi>r</m:mi>
                                             </m:msub>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo>&lt;</m:mo>
                                             <m:mi>f</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>D</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:msub>
                                                <m:mi>s</m:mi>
                                                <m:mi>r</m:mi>
                                             </m:msub>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                       </m:mtd>
                                    </m:mtr>
                                    <m:mtr columnalign="left">
                                       <m:mtd columnalign="left">
                                          <m:mrow>
                                             <m:msub>
                                                <m:mi>s</m:mi>
                                                <m:mi>r</m:mi>
                                             </m:msub>
                                          </m:mrow>
                                       </m:mtd>
                                       <m:mtd columnalign="left">
                                          <m:mrow>
                                             <m:mtext>otherwise</m:mtext>
                                          </m:mrow>
                                       </m:mtd>
                                    </m:mtr>
                                 </m:mtable>
                              </m:mrow>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGJbWydaWgaaWcbaGaemOCaihabeaakiabg2da9maaceaabaqbaeaabiGaaaqaaiqbdohaZzaafaWaaSbaaSqaaiabdkhaYbqabaaakeaacqWGPbqAcqWGMbGzcaaMc8UaemOzayMaeiikaGIaemiraqKaeiikaGIafm4CamNbauaadaWgaaWcbaGaemOCaihabeaakiabcMcaPiabcMcaPiabgYda8iabdAgaMjabcIcaOiabdseaejabcIcaOiabdohaZnaaBaaaleaacqWGYbGCaeqaaOGaeiykaKIaeiykaKcabaGaem4Cam3aaSbaaSqaaiabdkhaYbqabaaakeaacqqGVbWBcqqG0baDcqqGObaAcqqGLbqzcqqGYbGCcqqG3bWDcqqGPbqAcqqGZbWCcqqGLbqzaaaacaGL7baaaaa@5B75@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>for <it>r </it>= 0, ..., <it>n </it>- 1, where</p>
               <p>
                  <m:math name="1471-2105-6-289-i13" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>f</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>x</m:mi>
                           <m:mo>,</m:mo>
                           <m:mi>&#955;</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:msqrt>
                              <m:mrow>
                                 <m:mfrac>
                                    <m:mn>1</m:mn>
                                    <m:mi>d</m:mi>
                                 </m:mfrac>
                                 <m:mstyle displaystyle="true">
                                    <m:munderover>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>j</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                       <m:mi>d</m:mi>
                                    </m:munderover>
                                    <m:mrow>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:msup>
                                                <m:mrow>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:msub>
                                                      <m:mi>x</m:mi>
                                                      <m:mi>j</m:mi>
                                                   </m:msub>
                                                   <m:mo>&#8722;</m:mo>
                                                   <m:mi>c</m:mi>
                                                   <m:msub>
                                                      <m:mrow>
                                                         <m:mo stretchy="false">(</m:mo>
                                                         <m:msub>
                                                            <m:mi>C</m:mi>
                                                            <m:mi>&#955;</m:mi>
                                                         </m:msub>
                                                         <m:mo stretchy="false">)</m:mo>
                                                      </m:mrow>
                                                      <m:mi>j</m:mi>
                                                   </m:msub>
                                                   <m:mo stretchy="false">)</m:mo>
                                                </m:mrow>
                                                <m:mn>2</m:mn>
                                             </m:msup>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mi>max</m:mi>
                                             <m:mo>&#8289;</m:mo>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:msub>
                                                <m:mi>x</m:mi>
                                                <m:mi>j</m:mi>
                                             </m:msub>
                                             <m:mo>,</m:mo>
                                             <m:mi>c</m:mi>
                                             <m:msub>
                                                <m:mrow>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:msub>
                                                      <m:mi>C</m:mi>
                                                      <m:mi>&#955;</m:mi>
                                                   </m:msub>
                                                   <m:mo stretchy="false">)</m:mo>
                                                </m:mrow>
                                                <m:mi>j</m:mi>
                                             </m:msub>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:msup>
                                                <m:mo stretchy="false">)</m:mo>
                                                <m:mn>2</m:mn>
                                             </m:msup>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                           </m:msqrt>
                           <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mn>3</m:mn>
                           <m:mo stretchy="false">)</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGMbGzcqGGOaakcqGGOaakcqWG4baEcqGGSaalcqaH7oaBcqGGPaqkcqGGPaqkcqGH9aqpdaGcaaqaamaalaaabaGaeGymaedabaGaemizaqgaamaaqahabaWaaSaaaeaacqGGOaakcqWG4baEdaWgaaWcbaGaemOAaOgabeaakiabgkHiTiabdogaJjabcIcaOiabdoeadnaaBaaaleaacqaH7oaBaeqaaOGaeiykaKYaaSbaaSqaaiabdQgaQbqabaGccqGGPaqkdaahaaWcbeqaaiabikdaYaaaaOqaaiGbc2gaTjabcggaHjabcIha4jabcIcaOiabdIha4naaBaaaleaacqWGQbGAaeqaaOGaeiilaWIaem4yamMaeiikaGIaem4qam0aaSbaaSqaaiabeU7aSbqabaGccqGGPaqkdaWgaaWcbaGaemOAaOgabeaakiabcMcaPiabcMcaPmaaCaaaleqabaGaeGOmaidaaaaaaeaacqWGQbGAcqGH9aqpcqaIXaqmaeaacqWGKbaza0GaeyyeIuoaaSqabaGccaWLjaGaaCzcaiabcIcaOiabiodaZiabcMcaPaaa@6570@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>is the <it>fitness function </it>of individual (<it>x</it>, <it>&#955;</it>) in a generic partition <m:math name="1471-2105-6-289-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi>P</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacaWFqbaaaa@396B@</m:annotation></m:semantics></m:math>, and <it>C</it><sub><it>&#955; </it></sub>is cluster number <it>&#955; </it>in that partition. That is, <it>f</it>(<it>D</it>(<m:math name="1471-2105-6-289-i14" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>s</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mi>r</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacuWGZbWCgaqbamaaBaaaleaacqWGYbGCaeqaaaaa@2FC2@</m:annotation></m:semantics></m:math>)) refers to the partition encoded by <it>L' </it>and <it>f</it>(<it>D</it>(<it>s</it><sub><it>r</it></sub>)) to the one encoded by <it>L</it><sub><it>i</it></sub>.</p>
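               <p>The selection rule and the fitness function of equation (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the <it>GenClust </it>implementation: it works on decoded individuals (<it>x</it>, <it>&#955;</it>) rather than on 32-bit strings, it assumes that both lists are aligned so that position <it>r </it>refers to the same element, and it takes <it>c</it>(<it>C</it><sub><it>&#955;</it></sub>) to be the coordinate-wise mean of cluster <it>&#955;</it>. The names <it>centroids</it>, <it>fitness </it>and <it>second_selection </it>are hypothetical.</p>
               <p><![CDATA[
```python
import math
from typing import List, Sequence, Tuple

# A decoded individual: (expression profile x, cluster label lambda).
Individual = Tuple[Sequence[float], int]


def centroids(individuals: Sequence[Individual], k: int) -> List[List[float]]:
    """Assumed meaning of c(C_lambda): coordinate-wise mean of each cluster."""
    d = len(individuals[0][0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for x, lam in individuals:
        counts[lam] += 1
        for j in range(d):
            sums[lam][j] += x[j]
    # An empty cluster gets a zero centroid (a choice made for this sketch).
    return [[s / counts[lam] if counts[lam] else 0.0 for s in sums[lam]]
            for lam in range(k)]


def fitness(x: Sequence[float], centroid: Sequence[float]) -> float:
    """Equation (3): normalised distance of x from its cluster centroid."""
    total = 0.0
    for xj, cj in zip(x, centroid):
        m = max(xj, cj)
        if m == 0:
            # The formula is undefined here; the sketch refuses to guess.
            raise ValueError("max(x_j, c_j) is zero")
        total += (xj - cj) ** 2 / m ** 2
    return math.sqrt(total / len(x))


def second_selection(current: Sequence[Individual],
                     candidate: Sequence[Individual],
                     k: int) -> List[Individual]:
    """Keep s'_r when it is strictly fitter (smaller f) than s_r."""
    cur_c = centroids(current, k)      # centroids of the partition encoded by L_i
    cand_c = centroids(candidate, k)   # centroids of the partition encoded by L'
    selected = []
    for s_r, s_prime_r in zip(current, candidate):
        f_old = fitness(s_r[0], cur_c[s_r[1]])
        f_new = fitness(s_prime_r[0], cand_c[s_prime_r[1]])
        selected.append(s_prime_r if f_new < f_old else s_r)
    return selected
```
]]></p>
               <p>Note that each fitness value is computed with respect to the centroids of its own partition, which mirrors the remark above that the two sides of the comparison refer to different partitions.</p>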
               <p>There are several types of halting criteria that can be used for <it>GenClust</it>. We have considered one in which the algorithm is given a user-specified number of iterations, i.e., number of partitions <m:math name="1471-2105-6-289-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>P</m:mi><m:mi>i</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacaWFqbWaaSbaaSqaaiabdMgaPbqabaaaaa@3AF2@</m:annotation></m:semantics></m:math> to produce. At each iteration, apart from the current partition, it also keeps track of the partition corresponding to the best internal variance seen over the iterations performed so far. Another user-specified parameter indicates whether, at the end of the iterations, the algorithm must output the last partition or the one corresponding to the minimum internal variance seen during its execution. We refer to those partitions as <m:math name="1471-2105-6-289-i15" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>P</m:mi><m:mrow><m:mi>l</m:mi><m:mi>a</m:mi><m:mi>s</m:mi><m:mi>t</m:mi></m:mrow></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacaWFqbWaaSbaaSqaaiabdYgaSjabdggaHjabdohaZjabdsha0bqabaaaaa@3F23@</m:annotation></m:semantics></m:math> and <m:math name="1471-2105-6-289-i16" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>P</m:mi><m:mrow><m:mi>b</m:mi><m:mi>e</m:mi><m:mi>s</m:mi><m:mi>t</m:mi></m:mrow></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacaWFqbWaaSbaaSqaaiabdkgaIjabdwgaLjabdohaZjabdsha0bqabaaaaa@3F17@</m:annotation></m:semantics></m:math>, respectively. The rationale behind the described mode of operation is to allow <it>GenClust </it>to climb out of local optima. Since the number of iterations must be determined experimentally, the algorithm outputs also two auxiliary files: <it>variance</it>, reporting the values of internal variance, and <it>best</it>, internal variance for each iteration. This point is related to the convergence of <it>GenClust </it>to a local optimum and is discussed in the Experiments subsection.</p>
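               <p>The halting criterion and the bookkeeping of the best partition can be sketched as follows. This is a minimal illustration under assumptions, not the actual program: <it>step </it>stands for one full iteration of the algorithm, <it>internal_variance </it>for the internal variance of a partition, and the two returned logs are one possible reading of the contents of the two auxiliary files. All names are hypothetical, and counting the initial partition as a candidate for the best one is also an assumption.</p>
               <p><![CDATA[
```python
from typing import Callable, List, Tuple, TypeVar

Partition = TypeVar("Partition")


def run_iterations(initial: Partition,
                   step: Callable[[Partition], Partition],
                   internal_variance: Callable[[Partition], float],
                   n_iterations: int,
                   return_best: bool) -> Tuple[Partition, List[float], List[float]]:
    """Run a fixed number of iterations, remembering the best partition seen."""
    current = initial
    best = initial
    best_var = internal_variance(initial)
    variance_log: List[float] = []   # internal variance of the current partition
    best_log: List[float] = []       # minimum internal variance seen so far
    for _ in range(n_iterations):
        current = step(current)      # may get worse: this lets the search escape local optima
        var = internal_variance(current)
        if var < best_var:
            best, best_var = current, var
        variance_log.append(var)
        best_log.append(best_var)
    chosen = best if return_best else current   # P_best or P_last
    return chosen, variance_log, best_log
```
]]></p>
               <p>Inspecting the two logs after a run is one way to judge experimentally whether the chosen number of iterations was large enough.</p>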
               <p>We point out that the inherent freedom of the one-to-many mapping of individuals to binary strings, which we have used, provides enough flexibility for <it>GenClust </it>to work on a single partition and let it evolve. This contrasts with other clustering algorithms based on the genetic paradigm, which typically maintain a family of partitions at each stage <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp> and therefore incur a higher computational cost when going from one iteration to the next.</p>
               <p>Since <it>GenClust </it>takes the number <it>k </it>of clusters as input, it must be used in conjunction with a methodology that guides the estimation of the true number of clusters in a data set and also evaluates the quality of clustering solutions. We have chosen FOM for our experiments, since it has had great impact on the scientific literature in this area. Valid alternatives are described in <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>, where additional references to the literature are also given. Data reduction techniques, such as filtering <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> and principal component analysis, may also be of help in those circumstances.</p>
            </sec>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <sec>
            <st>
               <p>Experimental methodology</p>
            </st>
            <p>We have chosen data sets for which a biologically meaningful partition into classes is known in the literature, e.g., biologically distinct functional classes. We refer to that partition as the <it>true solution</it>. We have also chosen a suite of algorithms (<it>Average Link </it>among the <it>Hierarchical Methods </it><abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, <it>K-means </it><abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, <it>Cast </it>and <it>Click</it>) against which we compare the performance of <it>GenClust</it>, established by means of external and internal criteria. The external criteria measure how well a clustering solution computed by an algorithm agrees with the <it>true solution </it>for a given data set. Among the many available <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, we have chosen the adjusted Rand index <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>, a flexible index that allows comparison among partitions with different numbers of classes and is also recommended in the statistics and classification literature <abbrgrp><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr></abbrgrp>. When the true solution is not known, the internal criteria must give a reliable indication of how well a partitioning solution produced by an algorithm captures the inherent separation of the data into clusters, i.e., how many clusters are really present in the data. We have chosen FOM for our experiments.</p>
            <sec>
               <st>
                  <p>Data sets</p>
               </st>
               <p><b>RCNS</b>. The data set was obtained by reverse transcription-coupled PCR to study the expression levels of 112 genes during rat central nervous system development over 9 time points <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. That results in a 112<it>x</it>9 data matrix. It was studied by Wen et al. <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> to obtain a division of the genes into 6 classes, four of which are composed of biologically functionally related genes. This division is assumed to be the <it>true solution</it>. Before the analysis, Wen et al. performed two transformations on the data for each gene: (a) each row is divided by its maximum value; (b) to capture the temporal nature of the data, the difference between the values of each pair of consecutive time points is added as an extra data point, giving 8 additional values per gene. Therefore, the final data set consists of a 112<it>x</it>17 data matrix, which is the input to our algorithms. We point out that the second transformation has the effect of enhancing the similarity between genes with closely parallel, but offset, expression patterns.</p>
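               <p>The two transformations can be illustrated with a short Python sketch. It is not the code used by Wen et al.: the function name <it>wen_transform </it>is hypothetical, and the sketch assumes that the differences are taken on the already scaled values, as later time point minus earlier time point.</p>
               <p><![CDATA[
```python
from typing import List, Sequence


def wen_transform(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each row by its maximum, then append consecutive differences."""
    transformed = []
    for row in matrix:
        peak = max(row)
        if peak == 0:
            # Scaling is undefined for such a row; the sketch does not guess.
            raise ValueError("row maximum is zero")
        scaled = [v / peak for v in row]                          # transformation (a)
        diffs = [b - a for a, b in zip(scaled, scaled[1:])]       # transformation (b)
        transformed.append(scaled + diffs)
    return transformed
```
]]></p>
               <p>A row with 9 time points yields 9 scaled values plus 8 differences, which accounts for the 17 columns of the final matrix.</p>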
               <p><b>YCC</b>. The data set is part of that studied by Spellman et al. <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> and has been used by Sharan et al. for validation of their clustering algorithm <it>Click</it>. The complete data set contains the expression levels of roughly 6000 yeast ORFs over 79 conditions. The analysis by Spellman et al. identified 800 genes that are cell cycle regulated. In order to demonstrate the validity of <it>Click</it>, Sharan et al. extracted 698 out of those 800 genes, over 72 conditions, by eliminating all genes that had at least three missing entries. Additional details on that "extraction process" can be found in <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. The resulting 698<it>x</it>72 data matrix is standardized (i.e., for each row, the entries are scaled so that the mean is zero and the variance is one) and used for our experiments. The <it>true solution </it>is given by the partition of the 698 extracted genes according to the five functional classes they belong to in the classification by Spellman et al.</p>
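               <p>Row standardization, as described above, can be sketched as follows. The function name <it>standardize_rows </it>is hypothetical, and two details are assumptions of this sketch since the text does not specify them: the variance is the population variance (division by the number of entries), and rows are assumed to contain no missing entries.</p>
               <p><![CDATA[
```python
import math
from typing import List, Sequence


def standardize_rows(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale every row to mean zero and (population) variance one."""
    result = []
    for row in matrix:
        n = len(row)
        mean = sum(row) / n
        variance = sum((v - mean) ** 2 for v in row) / n
        if variance == 0:
            # A constant row cannot be scaled to unit variance.
            raise ValueError("row has zero variance")
        sd = math.sqrt(variance)
        result.append([(v - mean) / sd for v in row])
    return result
```
]]></p>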
               <p><b>RYCC</b>. This data set originates from the one by Cho et al. <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> for the study of yeast cell cycle regulated genes and has been created and used by Ka Yee Yeung for her study of FOM in her doctoral dissertation <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. Ka Yee Yeung extracted 384 genes from the yeast cell cycle data set in Cho et al. to obtain a 384<it>x</it>17 expression data matrix. The details of the extraction process are in <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. That matrix is then standardized as in Tamayo et al. <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. That is, the data matrix is divided into two contiguous pieces and each piece is standardized separately. We use that standardized data set for our experiments and assume as the <it>true solution </it>the same as in the dissertation by Ka Yee Yeung. Note that each gene in the <b>RYCC </b>data set also appears in the <b>YCC </b>data set. However, the dimensionality of the two data sets is quite different, and this may cause algorithms to behave differently. Moreover, <b>RYCC </b>is also useful for a qualitative comparison of our results with the ones in the doctoral dissertation by Ka Yee Yeung.</p>
               <p><b>PBM</b>. The data set was used by Hartuv et al. <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> to test their clustering algorithm. It contains 2329 cDNAs with a fingerprint of 139 oligos. This gives a 2329<it>x</it>139 data matrix. Each row corresponds to a gene, but different rows may correspond to the same gene. The true solution consists of a division of the rows into 18 classes, i.e., the cDNAs in the data set originate from 18 genes.</p>
               <p><b>RPBM</b>. Since FOM was too time-consuming to run to completion on the data set by Hartuv et al., we have reduced the data in order to get an indication of the number of clusters in the data set. We have randomly picked 10% of the cDNAs in each of the 18 original classes. Whenever that 10% amounts to less than one cDNA, we have retained the entire class. The result is a 235<it>x</it>139 data matrix, and the <it>true solution </it>is readily obtained from that of <b>PBM</b>. Data sets are provided as supplementary material <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>.</p>
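               <p>The reduction step can be sketched as a per-class random sample. This is an illustration only: the function name <it>reduce_by_class</it>, the fixed seed and the use of rounding down for the per-class quota are assumptions of the sketch, since the rounding rule actually used is not stated, so the sketch is not claimed to reproduce the 235 rows reported above.</p>
               <p><![CDATA[
```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def reduce_by_class(labels: Sequence[int], percent: int = 10, seed: int = 0) -> List[int]:
    """Return the row indices kept after sampling `percent`% of every class."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)          # fixed seed only to make the sketch repeatable
    kept: List[int] = []
    for label in sorted(by_class):
        members = by_class[label]
        quota = len(members) * percent // 100   # rounding down is an assumption
        if quota < 1:
            kept.extend(members)                # class too small: retain it entirely
        else:
            kept.extend(rng.sample(members, quota))
    return sorted(kept)
```
]]></p>
               <p>The true solution of the reduced data set is then obtained by reading off the original class label of each retained row.</p>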
            </sec>
            <sec>
               <st>
                  <p>Algorithms</p>
               </st>
               <p>Among the hierarchical methods, we have implemented <it>Average Link</it>. Following prior work <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp>, a dendrogram is built bottom-up until one obtains <it>k </it>subtrees, for a user-specified parameter <it>k</it>. Then, <it>k </it>clusters are obtained by assuming that the genes at the leaves of each subtree form a distinct cluster. We have also implemented <it>GenClust </it>and <it>K-means</it>. Both algorithms take as input a parameter <it>k </it>and return <it>k </it>clusters. They can either start with a randomly generated initial partition of the genes into <it>k </it>classes, or they can take as input a user-specified partition of the elements, for instance the output of yet another clustering algorithm. For our experiments, we have chosen the output of <it>Average Link </it>in this second case. In what follows, the type of initial partition chosen for those two algorithms appears as a suffix, e.g., <it>K-means-Random </it>means that the initial partition has been generated at random. Moreover, since <it>GenClust </it>can output one of two partitions, i.e., <m:math name="1471-2105-6-289-i15" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>P</m:mi><m:mrow><m:mi>l</m:mi><m:mi>a</m:mi><m:mi>s</m:mi><m:mi>t</m:mi></m:mrow></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacaWFqbWaaSbaaSqaaiabdYgaSjabdggaHjabdohaZjabdsha0bqabaaaaa@3F23@</m:annotation></m:semantics></m:math> or <m:math name="1471-2105-6-289-i16" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>P</m:mi><m:mrow><m:mi>b</m:mi><m:mi>e</m:mi><m:mi>s</m:mi><m:mi>t</m:mi></m:mrow></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacaWFqbWaaSbaaSqaaiabdkgaIjabdwgaLjabdohaZjabdsha0bqabaaaaa@3F17@</m:annotation></m:semantics></m:math>, we also add the appropriate suffix. So, <it>GenClust-Random-last </it>takes as input a random partition and returns the last partition produced during its execution. We also used an implementation of <it>Cast </it>that was made available to us by Ka Yee Yeung and that is well suited for the FOM methodology. Finally, we have used the version of <it>Click </it>available with the <it>Expander </it>software system <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>.</p>
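               <p>As an illustration only, the bottom-up construction used for <it>Average Link </it>can be sketched as follows. This is a minimal sketch and not the implementation used in our experiments: the function name, the use of Euclidean distance and the toy data are assumptions made for the example. Clusters are merged until <it>k </it>of them remain, which corresponds to stopping the dendrogram construction at <it>k </it>subtrees.</p>
               <p><![CDATA[
```python
from itertools import combinations
from math import dist


def average_link(points, k):
    """Merge clusters bottom-up until k remain; return lists of point indices.

    Assumption: dissimilarity is the Euclidean distance between profiles.
    """
    clusters = [[i] for i in range(len(points))]

    def avg_distance(a, b):
        # Average pairwise distance between the members of clusters a and b.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # Pick the pair of clusters with the smallest average distance.
        p, q = min(
            combinations(range(len(clusters)), 2),
            key=lambda pq: avg_distance(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[p] = clusters[p] + clusters[q]  # p is smaller than q, so q is still valid
        del clusters[q]
    return clusters


# Toy expression profiles (hypothetical data): two tight pairs and one outlier.
genes = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0), (9.0, 0.0)]
result = average_link(genes, 3)
print(sorted(sorted(c) for c in result))
```
]]></p>
               <p>The naive search over all pairs of clusters keeps the sketch short; it is far too slow for data sets of the size considered here.</p>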
            </sec>
            <sec>
               <st>
                  <p>Validation criteria</p>
               </st>
               <p>The adjusted Rand index measures the level of agreement between two partitions, not necessarily containing the same number of classes. Qualitatively, it takes value zero when the agreement between the partitions is no better than that expected by chance, value one when the two partitions are identical, and negative values when the agreement is lower than that expected by chance. Those statements can be put on a more formal ground.</p>
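               <p>For concreteness, the index can be computed from pair counts as in the following minimal sketch. It assumes the usual pair-counting form of the adjusted Rand index; the function name and the toy labelings are ours and are not part of the software described here.</p>
               <p><![CDATA[
```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index of two labelings of the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two labelings of the same items, at least two items")
    n = len(labels_a)
    # Pairs of items placed together in both partitions, in A, and in B.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)  # expected agreement under chance
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions consist of one class
    return (pairs_both - expected) / (max_index - expected)


# Same grouping with renamed classes: perfect agreement.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Partitions with a different number of classes can still be compared.
partial = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
print(same, partial)
```
]]></p>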
               <p>2-norm FOM, which is the internal measure used for our experiments, is a measure of the predictive power of a clustering algorithm. It should display the following properties. For a given clustering algorithm, it must have a low value in correspondence with the number of clusters that are really present in the data. Moreover, when comparing clustering algorithms for a given number of clusters <it>k</it>, the lower the value of 2-norm FOM for a given algorithm, the better its predictive power. Experiments by Ka Yee Yeung et al. show that the FOM family and its associated validation methodology satisfy those properties with a good degree of accuracy. Indeed, Ka Yee Yeung et al. give experimental evidence of some degree of anti-correlation between FOM and adjusted Rand index, in particular when the number of clusters is small. Since it is a rather novel measure, we provide a formal definition.</p>
               <p>For a given data set, let <it>R </it>denote the raw data matrix, e.g., the data matrix without standardization for our data sets. Assume that <it>R </it>has dimension <it>nxm</it>, i.e., each row corresponds to a gene and each column corresponds to an experimental condition. Assume that a clustering algorithm is given the raw matrix <it>R </it>with column <it>e </it>excluded. Assume also that, with that reduced data set, the algorithm produces <it>k </it>clusters <it>C</it><sub>0</sub>, ..., <it>C</it><sub><it>k</it>-1</sub>. Let <it>R</it>(<it>g</it>, <it>e</it>) be the expression level of gene <it>g </it>under condition <it>e </it>and <it>m</it><sub><it>i</it></sub>(<it>e</it>) be the average expression level under condition <it>e </it>of the genes in cluster <it>C</it><sub><it>i</it></sub>. The 2-norm FOM with respect to <it>k </it>clusters and condition <it>e </it>is defined as:</p>
               <p>
                  <m:math name="1471-2105-6-289-i17" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mtext>FOM</m:mtext>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>e</m:mi>
                           <m:mo>,</m:mo>
                           <m:mi>k</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:msqrt>
                              <m:mrow>
                                 <m:mfrac>
                                    <m:mn>1</m:mn>
                                    <m:mi>n</m:mi>
                                 </m:mfrac>
                                 <m:mstyle displaystyle="true">
                                    <m:munderover>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>i</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mn>0</m:mn>
                                       </m:mrow>
                                       <m:mrow>
                                          <m:mi>k</m:mi>
                                          <m:mo>&#8722;</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                    </m:munderover>
                                    <m:mrow>
                                       <m:mstyle displaystyle="true">
                                          <m:munder>
                                             <m:mo>&#8721;</m:mo>
                                             <m:mrow>
                                                <m:mi>x</m:mi>
                                                <m:mo>&#8712;</m:mo>
                                                <m:msub>
                                                   <m:mi>C</m:mi>
                                                   <m:mi>i</m:mi>
                                                </m:msub>
                                             </m:mrow>
                                          </m:munder>
                                          <m:mrow>
                                             <m:msup>
                                                <m:mrow>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:mi>R</m:mi>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:mi>x</m:mi>
                                                   <m:mo>,</m:mo>
                                                   <m:mi>e</m:mi>
                                                   <m:mo stretchy="false">)</m:mo>
                                                   <m:mo>&#8722;</m:mo>
                                                   <m:msub>
                                                      <m:mi>m</m:mi>
                                                      <m:mi>i</m:mi>
                                                   </m:msub>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:mi>e</m:mi>
                                                   <m:mo stretchy="false">)</m:mo>
                                                   <m:mo stretchy="false">)</m:mo>
                                                </m:mrow>
                                                <m:mn>2</m:mn>
                                             </m:msup>
                                          </m:mrow>
                                       </m:mstyle>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                           </m:msqrt>
                           <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mn>4</m:mn>
                           <m:mo stretchy="false">)</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqqGgbGrcqqGpbWtcqqGnbqtcqGGOaakcqWGLbqzcqGGSaalcqWGRbWAcqGGPaqkcqGH9aqpdaGcaaqaamaalaaabaGaeGymaedabaGaemOBa4gaamaaqahabaWaaabuaeaacqGGOaakcqWGsbGucqGGOaakcqWG4baEcqGGSaalcqWGLbqzcqGGPaqkcqGHsislcqWGTbqBdaWgaaWcbaGaemyAaKgabeaakiabcIcaOiabdwgaLjabcMcaPiabcMcaPmaaCaaaleqabaGaeGOmaidaaaqaaiabdIha4jabgIGiolabdoeadnaaBaaameaacqWGPbqAaeqaaaWcbeqdcqGHris5aaWcbaGaemyAaKMaeyypa0JaeGimaadabaGaem4AaSMaeyOeI0IaeGymaedaniabggHiLdaaleqaaOGaaCzcaiaaxMaacqGGOaakcqaI0aancqGGPaqkaaa@5D8D@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>Notice that FOM(<it>e</it>, <it>k</it>) is essentially a root mean square deviation. The aggregate 2-norm FOM for <it>k </it>clusters is then:</p>
               <p>
                  <m:math name="1471-2105-6-289-i18" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mtext>FOM</m:mtext>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>k</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>e</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mi>m</m:mi>
                              </m:munderover>
                              <m:mrow>
                                 <m:mtext>FOM</m:mtext>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mi>e</m:mi>
                                 <m:mo>,</m:mo>
                                 <m:mi>k</m:mi>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                           </m:mstyle>
                           <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mn>5</m:mn>
                           <m:mo stretchy="false">)</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqqGgbGrcqqGpbWtcqqGnbqtcqGGOaakcqWGRbWAcqGGPaqkcqGH9aqpdaaeWbqaaiabbAeagjabb+eapjabb2eanjabcIcaOiabbwgaLjabcYcaSiabbUgaRjabcMcaPaWcbaGaemyzauMaeyypa0JaeGymaedabaGaemyBa0ganiabggHiLdGccaWLjaGaaCzcaiabcIcaOiabiwda1iabcMcaPaaa@479D@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>A few remarks are in order. Both formulae (4) and (5) can be used to measure the predictive power of an algorithm. The first gives us more flexibility, since we can pick any condition, while the second gives us a total estimate over all conditions. Following the literature, we use (5) in our experiments. Moreover, since the experimental studies conducted by Ka Yee Yeung et al. show that FOM(<it>k</it>) behaves as a decreasing function of <it>k</it>, an adjustment factor has been introduced to properly compare clustering solutions with different numbers of clusters. A theoretical analysis by Ka Yee Yeung et al. provides the following adjustment factor:</p>
               <p>
                  <m:math name="1471-2105-6-289-i19" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msqrt>
                              <m:mrow>
                                 <m:mfrac>
                                    <m:mrow>
                                       <m:mi>n</m:mi>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mi>k</m:mi>
                                    </m:mrow>
                                    <m:mi>n</m:mi>
                                 </m:mfrac>
                              </m:mrow>
                           </m:msqrt>
                           <m:mo>.</m:mo>
                           <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mn>6</m:mn>
                           <m:mo stretchy="false">)</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaGcaaqaamaalaaabaGaemOBa4MaeyOeI0Iaem4AaSgabaGaemOBa4gaaaWcbeaakiabc6caUiaaxMaacaWLjaGaeiikaGIaeGOnayJaeiykaKcaaa@36CD@</m:annotation>
                     </m:semantics>
                  </m:math>
               </p>
               <p>When (4) is divided by (6), we refer to (4) and (5) as <it>adjusted </it>FOMs. We use the adjusted aggregate FOM for our experiments and, for brevity, we refer to it simply as FOM.</p>
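               <p>The following minimal sketch shows how formulae (4), (5) and (6) fit together. It is an illustration under stated assumptions, not the code used for our experiments: the raw matrix is a list of rows, <it>cluster_fn </it>stands for any clustering algorithm returning lists of row indices, and <it>toy_cluster </it>is a hypothetical stand-in used only to make the example runnable.</p>
               <p><![CDATA[
```python
from math import sqrt


def fom_for_condition(raw, clusters, e):
    """Formula (4): RMS deviation of left-out column e from the cluster means."""
    n = len(raw)
    total = 0.0
    for members in clusters:
        if not members:
            continue
        mean_e = sum(raw[g][e] for g in members) / len(members)
        total += sum((raw[g][e] - mean_e) ** 2 for g in members)
    return sqrt(total / n)


def adjusted_aggregate_fom(raw, k, cluster_fn):
    """Formula (5), with each term of formula (4) divided by the factor (6)."""
    n, m = len(raw), len(raw[0])
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < n")
    factor = sqrt((n - k) / n)
    total = 0.0
    for e in range(m):
        reduced = [row[:e] + row[e + 1:] for row in raw]  # drop condition e
        clusters = cluster_fn(reduced, k)                 # lists of row indices
        total += fom_for_condition(raw, clusters, e) / factor
    return total


def toy_cluster(rows, k):
    # Stand-in clustering (an assumption): sort genes by total level, cut into groups.
    order = sorted(range(len(rows)), key=lambda g: sum(rows[g]))
    size = -(-len(rows) // k)  # ceiling division
    return [order[i:i + size] for i in range(0, len(order), size)]


raw = [[1.0, 2.0, 1.0], [3.0, 2.0, 3.0], [7.0, 8.0, 9.0], [9.0, 8.0, 7.0]]
value = adjusted_aggregate_fom(raw, 2, toy_cluster)
print(value)
```
]]></p>
               <p>Note that the algorithm is run once per left-out condition, which is why the methodology becomes expensive on large data sets.</p>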
            </sec>
            <sec>
               <st>
                  <p>Experimental setup</p>
               </st>
               <p>All of the experiments were performed on a PC with 1 GB of main memory and a 3.2 GHz AMD Athlon 64 processor. For the randomized algorithms, i.e., <it>Cast, GenClust-Random, K-means-Random</it>, we executed five runs to measure the variability of the validation measures with respect to the various solutions found by the algorithms. We find that only <it>K-means-Random </it>and <it>GenClust-Random-best </it>display a non-negligible variation from run to run, but for the adjusted Rand index only. For those algorithms and that particular index, we report the minimum and the maximum value obtained over the runs, while we give the results of a single run in all other cases.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Experiments</p>
            </st>
            <p>We now analyze the performance of <it>GenClust </it>with respect to the choice of the initial partition, the two partitions it can output, and the performance of the other algorithms.</p>
            <sec>
               <st>
                  <p>Convergence to a local optimum of internal variance</p>
               </st>
               <p>For each of the chosen data sets, we have run <it>GenClust-Random-last </it>for 500 iterations; i.e., it has produced 500 partitions. The value of <it>k </it>has been set equal to the number of classes in the true solution for each data set. The results are reported in Figure <figr fid="F1">1</figr>. As is evident, such a convergence indeed takes place with a good degree of accuracy. It is also worth noting that for <b>RCNS, YCC </b>and <b>RYCC</b>, the convergence is rather fast, i.e., 100 iterations. For the remaining two data sets, it is somewhat slower and, for one of them, less pronounced. The same conclusions apply to <it>GenClust-Random-best.</it></p>
               <fig id="F1">
                  <title>
                     <p>Figure 1</p>
                  </title>
                  <caption>
                     <p>Convergence of GenClust</p>
                  </caption>
                  <text>
                     <p><b>Convergence of GenClust</b>. Experimental convergence of <it>GenClust </it>on each of the five data sets. The <it>x</it>-coordinate gives the number of iterations and the <it>y</it>-coordinate the value of the total internal variance (2). For each data set, the experiment was performed by asking the algorithm to return a clustering solution with a number of clusters equal to the number of classes in the true solution.</p>
                  </text>
                  <graphic file="1471-2105-6-289-1"/>
               </fig>
            </sec>
            <sec>
               <st>
                  <p>GenClust and the best and last partition</p>
               </st>
               <p>The discussion here refers to the data available at <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> (Figures <figr fid="F1">1</figr> and <figr fid="F2">2</figr>), summarizing the experiments we conducted for <it>GenClust-Random-best </it>and <it>GenClust-AvLink-best</it>. The latter algorithm is practically indistinguishable from <it>AvLink</it>. Indeed, it is not surprising that <it>GenClust-AvLink-best </it>retains the main characteristics of the initial partition given by <it>AvLink</it>, which, in our experiments, often provides an initial partition to <it>GenClust-AvLink-best </it>with the best variance. This fact seems to indicate that the partition corresponding to the best variance should not be requested as output from <it>GenClust </it>if the initial partition is given by another clustering algorithm. <it>GenClust-Random-best </it>seems to be related to <it>K-means-Random</it>. Indeed, the relation is quite strong for FOM. As for the adjusted Rand index, the minimum values of the two algorithms are in many circumstances quite close. Such a relation is less pronounced for the maximum values, where sometimes one of the two algorithms dominates the other. There is, however, one important difference between the two algorithms: <it>GenClust-Random-best </it>is much faster than <it>K-means-Random</it>, e.g., four times faster on the <b>PBM </b>data set. The relation between the two algorithms seems to have the following justification. Starting from a random partition, <it>K-means-Random </it>tries to minimize the internal variance and, in practice, it aims at a good local optimum. <it>GenClust-Random-best </it>performs essentially the same task by keeping track of the partition corresponding to the best variance seen during its execution. Based on those considerations, from now on we discuss only <it>GenClust-Random-last </it>and <it>GenClust-AvLink-last </it>and, for brevity, drop the suffix <it>last.</it></p>
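               <p>The difference between the <it>last </it>and <it>best </it>outputs can be illustrated with the following minimal sketch. It is not the <it>GenClust </it>code: <it>step </it>is a hypothetical placeholder for one iteration of the search, and the internal variance is assumed here to be the sum of squared deviations from the cluster means, computed on one-dimensional toy values.</p>
               <p><![CDATA[
```python
def internal_variance(values, partition):
    """Assumed criterion: sum of squared deviations from each cluster mean."""
    total = 0.0
    for cluster in partition:
        mean = sum(values[i] for i in cluster) / len(cluster)
        total += sum((values[i] - mean) ** 2 for i in cluster)
    return total


def run_keeping_best(initial, step, variance, iterations):
    """Return (last partition, lowest-variance partition seen)."""
    last = best = initial
    best_variance = variance(initial)
    for _ in range(iterations):
        last = step(last)             # hypothetical: one iteration of the search
        current = variance(last)
        if current < best_variance:   # remember the best partition seen so far
            best, best_variance = last, current
    return last, best


# Toy run (hypothetical data): the search is scripted to visit two partitions.
values = [0.0, 1.0, 10.0, 11.0]
scripted = iter([[[0, 1], [2, 3]], [[0, 2], [1, 3]]])
last, best = run_keeping_best(
    [[0, 3], [1, 2]],
    lambda partition: next(scripted),
    lambda partition: internal_variance(values, partition),
    2,
)
print(last, best)
```
]]></p>
               <p>In this toy run the two outputs differ, because the second scripted partition has a larger variance than the first.</p>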
               <fig id="F2">
                  <title>
                     <p>Figure 2</p>
                  </title>
                  <caption>
                     <p>Adjusted Rand Index</p>
                  </caption>
                  <text>
                     <p><b>Adjusted Rand Index</b>. Experiments for adjusted Rand index. For each data set and each algorithm, the index is displayed as a function of the number of clusters.</p>
                  </text>
                  <graphic file="1471-2105-6-289-2"/>
               </fig>
            </sec>
            <sec>
               <st>
                  <p>A synopsis of GenClust performance for external and internal criteria</p>
               </st>
               <p>The values of interest are the adjusted Rand index and FOM. They have been computed requiring all algorithms, except <it>Click</it>, to produce a number of clusters equal to the number of classes of the <it>true solution </it>in each data set. The results are reported in Tables <tblr tid="T1">1</tblr>, <tblr tid="T2">2</tblr>, <tblr tid="T3">3</tblr>, <tblr tid="T4">4</tblr>, <tblr tid="T5">5</tblr>. Table <tblr tid="T6">6</tblr> refers to <it>Click</it>, used in an unsupervised fashion, and to the adjusted Rand index only. Indeed, <it>Click </it>does not lend itself to adaptation with the FOM methodology. Data has been given to <it>Click</it>, which has returned a partition. Since <it>Click </it>leaves elements unclustered, we have grouped all of those singletons together in one class in order to compute the adjusted Rand index. The number of classes in Table <tblr tid="T6">6</tblr> accounts for that unification.</p>
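               <p>The unification of unclustered elements can be sketched as follows. The representation is an assumption made for the example: one label per element, with <it>None </it>marking an element left unclustered. It does not reflect the actual output format of <it>Click</it>.</p>
               <p><![CDATA[
```python
def merge_unclustered(labels, merged_label="unclustered"):
    """Give every unclustered element (marked None) the same class label."""
    return [merged_label if label is None else label for label in labels]


def count_classes(labels):
    """Number of distinct classes, counting the merged class once."""
    return len(set(labels))


# Hypothetical output: two clusters (0 and 1) and two unclustered elements.
raw_labels = [0, None, 1, None, 0]
merged = merge_unclustered(raw_labels)
print(merged, count_classes(merged))
```
]]></p>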
               <tbl id="T1">
                  <title>
                     <p>Table 1</p>
                  </title>
                  <caption>
                     <p>RCNS Data Set. Performance of the algorithms at the number of classes (six) of the <it>true solution </it>for the RCNS rat data set.</p>
                  </caption>
                  <tblbdy cols="3">
                     <r>
                        <c ca="center">
                           <p>
                              <it>Method</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>AdjustedRand</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>FOM</it>
                           </p>
                        </c>
                     </r>
                     <r>
                        <c cspan="3">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>GenClust random</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.168</p>
                        </c>
                        <c ca="center">
                           <p>3.89</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Min kmeans-random</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.144</p>
                        </c>
                        <c ca="center">
                           <p>3.81</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Max kmeans-random</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.258</p>
                        </c>
                        <c ca="center">
                           <p>3.81</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Cast</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.12</p>
                        </c>
                        <c ca="center">
                           <p>3.98</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Kmeans-Avlink</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.167</p>
                        </c>
                        <c ca="center">
                           <p>3.71</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Avlink</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.19</p>
                        </c>
                        <c ca="center">
                           <p>4.05</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>GenClust-Avlink</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.161</p>
                        </c>
                        <c ca="center">
                           <p>4.07</p>
                        </c>
                     </r>
                  </tblbdy>
               </tbl>
               <tbl id="T2">
                  <title>
                     <p>Table 2</p>
                  </title>
                  <caption>
                     <p>YCC. Performance of the algorithms at the number of classes (five) of the <it>true solution </it>for the YCC data set.</p>
                  </caption>
                  <tblbdy cols="3">
                     <r>
                        <c ca="center">
                           <p>
                              <it>Method</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>AdjustedRand</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>FOM</it>
                           </p>
                        </c>
                     </r>
                     <r>
                        <c cspan="3">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>GenClust random</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.47</p>
                        </c>
                        <c ca="center">
                           <p>57.05</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Min kmeans-random</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.44</p>
                        </c>
                        <c ca="center">
                           <p>57.05</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Max kmeans-random</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.49</p>
                        </c>
                        <c ca="center">
                           <p>57.05</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Cast</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.529</p>
                        </c>
                        <c ca="center">
                           <p>56.66</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Kmeans-Avlink</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.508</p>
                        </c>
                        <c ca="center">
                           <p>57.36</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Avlink</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.559</p>
                        </c>
                        <c ca="center">
                           <p>58.78</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>GenClust-Avlink</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.518</p>
                        </c>
                        <c ca="center">
                           <p>57.21</p>
                        </c>
                     </r>
                  </tblbdy>
               </tbl>
               <tbl id="T3">
                  <title>
                     <p>Table 3</p>
                  </title>
                  <caption>
                     <p>RYCC. Performance of the algorithms at the number of classes (five) of the <it>true solution </it>for the RYCC data set.</p>
                  </caption>
                  <tblbdy cols="3">
                     <r>
                        <c ca="center">
                           <p>
                              <it>Method</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>AdjustedRand</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>FOM</it>
                           </p>
                        </c>
                     </r>
                     <r>
                        <c cspan="3">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>GenClust random</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.446</p>
                        </c>
                        <c ca="center">
                           <p>10.60</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Min kmeans-random</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.359</p>
                        </c>
                        <c ca="center">
                           <p>10.69</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Max kmeans-random</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.49</p>
                        </c>
                        <c ca="center">
                           <p>10.69</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Cast</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.49</p>
                        </c>
                        <c ca="center">
                           <p>10.84</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Kmeans-Avlink</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.469</p>
                        </c>
                        <c ca="center">
                           <p>10.73</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Avlink</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.46</p>
                        </c>
                        <c ca="center">
                           <p>11.50</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>GenClust-Avlink</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.518</p>
                        </c>
                        <c ca="center">
                           <p>10.804</p>
                        </c>
                     </r>
                  </tblbdy>
               </tbl>
               <tbl id="T4">
                  <title>
                     <p>Table 4</p>
                  </title>
                  <caption>
                     <p>PBM. Performance of the algorithms at the number of classes (eighteen) of the <it>true solution </it>for the PBM data set.</p>
                  </caption>
                  <tblbdy cols="2">
                     <r>
                        <c ca="center">
                           <p>
                              <it>Method</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>AdjustedRand</it>
                           </p>
                        </c>
                     </r>
                     <r>
                        <c cspan="2">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>GenClust random</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.51</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Min kmeans-random</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.37</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Max kmeans-random</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.429</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Cast</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.528</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Kmeans-Avlink</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.58</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Avlink</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.18</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>GenClust-Avlink</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.51</p>
                        </c>
                     </r>
                  </tblbdy>
               </tbl>
               <tbl id="T5">
                  <title>
                     <p>Table 5</p>
                  </title>
                  <caption>
                     <p>RPBM. Performance of the algorithms at the number of classes (eighteen) of the <it>true solution </it>for the RPBM data set.</p>
                  </caption>
                  <tblbdy cols="3">
                     <r>
                        <c ca="center">
                           <p>
                              <it>Method</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>AdjustedRand</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>FOM</it>
                           </p>
                        </c>
                     </r>
                     <r>
                        <c cspan="3">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>GenClust random</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.509</p>
                        </c>
                        <c ca="center">
                           <p>57.49</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Min kmeans-random</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.378</p>
                        </c>
                        <c ca="center">
                           <p>55.73</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Max kmeans-random</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.51</p>
                        </c>
                        <c ca="center">
                           <p>55.73</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Cast</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.679</p>
                        </c>
                        <c ca="center">
                           <p>50.21</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Kmeans-Avlink</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.618</p>
                        </c>
                        <c ca="center">
                           <p>59.49</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>Avlink</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.517</p>
                        </c>
                        <c ca="center">
                           <p>62.27</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <it>GenClust-Avlink</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>0.80</p>
                        </c>
                        <c ca="center">
                           <p>59.33</p>
                        </c>
                     </r>
                  </tblbdy>
               </tbl>
               <tbl id="T6">
                  <title>
                     <p>Table 6</p>
                  </title>
                  <caption>
                     <p>Adjusted Rand Index for Click. Performance of Click on the various data sets. The results in the clusters column give the number of clusters returned by Click, in addition to one class consisting of all the unclustered elements.</p>
                  </caption>
                  <tblbdy cols="3">
                     <r>
                        <c ca="center">
                           <p>
                              <it>Dataset</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>Clusters</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>AdjustedRand</it>
                           </p>
                        </c>
                     </r>
                     <r>
                        <c cspan="3">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <b>RCNS</b>
                           </p>
                        </c>
                        <c ca="center">
                           <p>3 + 1</p>
                        </c>
                        <c ca="center">
                           <p>0.183</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <b>PBM</b>
                           </p>
                        </c>
                        <c ca="center">
                           <p>18 + 1</p>
                        </c>
                        <c ca="center">
                           <p>0.767</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <b>RPBM</b>
                           </p>
                        </c>
                        <c ca="center">
                           <p>6 + 1</p>
                        </c>
                        <c ca="center">
                           <p>0.658</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <b>YCC</b>
                           </p>
                        </c>
                        <c ca="center">
                           <p>7 + 1</p>
                        </c>
                        <c ca="center">
                           <p>0.510</p>
                        </c>
                     </r>
                     <r>
                        <c ca="center">
                           <p>
                              <b>RYCC</b>
                           </p>
                        </c>
                        <c ca="center">
                           <p>6 + 1</p>
                        </c>
                        <c ca="center">
                           <p>0.479</p>
                        </c>
                     </r>
                  </tblbdy>
               </tbl>
               <p>The first striking conclusion is that no algorithm is markedly superior to the others on all indexes and all data sets. Indeed, in many cases the observed differences between the worst and best performing algorithms may be statistically insignificant, and the algorithms could be considered equivalent. However, there are cases in which one algorithm performs better than the others and is therefore worth considering.</p>
               <p>Based on the synopsis, it appears that <it>GenClust-AvLink </it>is to be preferred to <it>GenClust-Random</it>. Moreover, <it>GenClust-AvLink </it>seems to take better advantage of the output of <it>Average Link </it>than <it>K-means </it>does. It also appears that <it>GenClust-AvLink </it>is competitive in comparison both with classic algorithms, i.e., <it>Average Link </it>and <it>K-means</it>, and with more recent state-of-the-art ones, such as <it>Cast </it>and <it>Click</it>. The following sections present a detailed description of our experiments.</p>
            </sec>
            <sec>
               <st>
                  <p>External criteria</p>
               </st>
               <p>This discussion refers to Figure <figr fid="F2">2</figr>. We recall from the literature that, for any given data set, a good algorithm must display a high value of the Adjusted Rand Index for clustering solutions whose number of clusters is close to the number of classes in the <it>true solution</it>.</p>
               <p>With that criterion in mind, we see that, with the exception of the <b>RCNS </b>data set, <it>GenClust </it>is better with an initial partition provided by <it>Average Link</it>, in particular around the number of clusters in the <it>true solution </it>of each of the corresponding data sets.</p>
               <p>Moreover, on the <b>YCC, RYCC </b>and <b>RPBM </b>data sets, <it>GenClust </it>seems to take better advantage than <it>K-means </it>of the initial knowledge of the partition produced by <it>Average Link</it>.</p>
               <p>When compared with all of the methods, <it>GenClust-AvLink </it>has a performance at least as good, and sometimes better, on three of the data sets, i.e., <b>YCC, RYCC </b>and <b>RPBM, </b>around the number of classes in the true solution of each data set.</p>
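               <p>As an illustration of the external criterion discussed above, the following minimal Python sketch computes the Adjusted Rand Index between a true solution and a clustering solution. It assumes the standard Hubert and Arabie formulation of the index; the function name and the representation of a partition as a list of labels are hypothetical choices made for this example and are not part of the GenClust software.</p>

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, predicted_labels):
    """Adjusted Rand Index between two partitions given as label lists.

    Assumption: the Hubert and Arabie form of the index is used.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both partitions must label the same items")
    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        # Fewer than two items: the partitions trivially agree.
        return 1.0

    # Contingency table: number of items in each (class, cluster) pair.
    contingency = Counter(zip(true_labels, predicted_labels))
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(predicted_labels)

    # Pairs of items placed together in both, in the classes, in the clusters.
    pairs_together = sum(comb(c, 2) for c in contingency.values())
    pairs_in_classes = sum(comb(c, 2) for c in class_sizes.values())
    pairs_in_clusters = sum(comb(c, 2) for c in cluster_sizes.values())

    expected = pairs_in_classes * pairs_in_clusters / total_pairs
    maximum = (pairs_in_classes + pairs_in_clusters) / 2
    if maximum == expected:
        # Degenerate case (for instance both partitions have one block).
        return 1.0
    return (pairs_together - expected) / (maximum - expected)
```

               <p>With this definition, two partitions that are identical up to a renaming of the clusters score 1, partitions that agree no more than expected by chance score close to 0, and negative values are possible.</p>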
            </sec>
            <sec>
               <st>
                  <p>Internal criteria</p>
               </st>
               <p>This discussion refers to Figure <figr fid="F3">3</figr>. We recall from the literature <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp> that the FOM methodology captures the intrinsic structure in the data by exhibiting a very characteristic steep decline as the number of clusters grows and approaches the number of clusters in the <it>true solution</it>. For our data sets, we find that all partitional algorithms exhibit excellent predictive power on the <b>RCNS, YCC </b>and <b>RYCC </b>data sets. In particular, the curve of each algorithm indicates that the number of clusters really present in the data is close to, or exactly at, the number of classes in the true solution of each data set. Moreover, when the <it>GenClust </it>curves are excluded from the FOM diagrams, the results are essentially analogous to the ones reported in <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp> for the same algorithms on essentially the same data sets. Since in <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp> it is concluded that <it>K-means </it>and <it>Cast </it>have excellent predictive power, we can draw the same conclusion for both versions of <it>GenClust</it>. As for the <b>RPBM </b>data set, we see that no algorithm exhibits any noticeable decline as the number of clusters grows. This may be a limitation of the FOM methodology, which displays some anti-correlation with the adjusted Rand index only for data sets with a small number of clusters in the <it>true solution</it>, as shown by Ka Yee Yeung et al. In fact, the internal validation of <b>PBM </b>and <b>RPBM </b>attempted here may indicate both a computational and a sensitivity limitation of the FOM methodology on data sets with relatively large numbers of conditions and genes and a large number of clusters. 
Indeed, the external validation measure on both data sets shows that <it>GenClust </it>recovers a substantial part of the true solution at a number of clusters reasonably close to 18. In general, any algorithm, such as <it>GenClust </it>and <it>K-means</it>, will be limited by the power of the validation methodology associated with it. Valid alternatives to FOM are given in <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>. In particular, Monti et al. provide a good presentation of those alternatives. Unfortunately, those data-driven measures may display the same computational limitations as FOM. Principal component analysis, a widely used dimensionality reduction technique for clustering, may be of great help in reducing the computational demand of data-driven validation measures. However, its application to gene expression data is not entirely straightforward. This point is investigated experimentally in Ka Yee Yeung's doctoral dissertation, where different strategies are proposed and compared. In those circumstances, it is also advisable to filter the data set, for instance with the <it>GeneCluster </it>software package <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>, leaving out genes that do not display any significant changes. That may result in a substantial reduction of the data set, as shown by Ka Yee Yeung et al. in the analysis of the Barrett Esophagus data set.</p>
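               <p>To make the internal criterion concrete, the following minimal Python sketch computes an aggregate FOM in the leave-one-condition-out form described in <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp>: each condition is hidden in turn, the genes are clustered on the remaining conditions, and the root mean square deviation of the hidden values from their cluster means is accumulated. The function name, the list-of-rows data layout and the interface of the clustering callable are assumptions made for this example; it is not the implementation used in our experiments.</p>

```python
from math import sqrt


def figure_of_merit(data, cluster, k):
    """Aggregate FOM of a clustering procedure at k clusters.

    data: list of gene rows, each a list with one value per condition.
    cluster: hypothetical callable(rows, k) returning one label per row.
    """
    n_genes = len(data)
    n_conditions = len(data[0])
    total = 0.0
    for e in range(n_conditions):
        # Cluster the genes with condition e hidden.
        reduced = [row[:e] + row[e + 1:] for row in data]
        labels = cluster(reduced, k)

        # Group the hidden values by the cluster assigned to each gene.
        held_out = {}
        for row, label in zip(data, labels):
            held_out.setdefault(label, []).append(row[e])

        # Squared deviation of each hidden value from its cluster mean.
        squared_error = 0.0
        for values in held_out.values():
            mean = sum(values) / len(values)
            squared_error += sum((v - mean) ** 2 for v in values)

        # Root mean square deviation for condition e, summed over conditions.
        total += sqrt(squared_error / n_genes)
    return total
```

               <p>Evaluating this function for increasing values of k and plotting the results gives a curve of the kind discussed above, where lower values indicate that the clusters predict the hidden condition better.</p>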
               <fig id="F3">
                  <title>
                     <p>Figure 3</p>
                  </title>
                  <caption>
                     <p>FOM</p>
                  </caption>
                  <text>
                     <p><b>FOM</b>. Experiments for FOM. The index is displayed as a function of the number of clusters.</p>
                  </text>
                  <graphic file="1471-2105-6-289-3"/>
               </fig>
            </sec>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>We have presented a very simple genetic algorithm for clustering gene expression data, i.e., <it>GenClust</it>, and we have evaluated its performance on real data sets in comparison with other algorithms, both classic and more recent state-of-the-art ones, using both external and internal validation criteria. The study shows that none of the chosen algorithms is clearly superior to the others in terms of ability to identify classes of truly functionally related genes in the given data sets. However, <it>GenClust </it>seems to be competitive with all of the implemented algorithms and well suited for use in conjunction with data-driven internal validation measures, as the experiments with FOM indicate.</p>
      </sec>
      <sec>
         <st>
            <p>Availability and requirements</p>
         </st>
         <p>- <b>Project Name: </b><it>GenClust</it></p>
         <p>- <b>Project Home Page: </b><url>http://www.math.unipa.it/~lobosco/genclust/</url></p>
         <p>- <b>Operating Systems: </b>Windows XP, Mac OSX, Linux Operating Systems (see details at <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>).</p>
         <p>- <b>Programming Languages: </b>Standard ANSI C. Compilation tested on Microsoft Visual C++ 6, Pelles C for Windows-version 3.00.4, and various gcc versions (see <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>).</p>
         <p>- <b>Other Requirements: </b>None</p>
         <p>- <b>License: </b>GNU GPL</p>
         <p>- <b>Any restriction to use by non-academics: </b>reference to paper</p>
      </sec>
      <sec>
         <st>
            <p>Abbreviations</p>
         </st>
         <p><b>FOM: </b>Figure of Merit</p>
         <p><b>PBM: </b>Peripheral Blood Monocytes</p>
         <p><b>RPBM: </b>Reduced Peripheral Blood Monocytes</p>
         <p><b>RCNS: </b>Rat Central Nervous System</p>
         <p><b>RYCC: </b>Reduced Yeast Cell Cycle</p>
         <p><b>YCC: </b>Yeast Cell Cycle</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>All authors participated in the design of the evaluation of <it>GenClust</it>. The initial design and engineering of the algorithm are due to G. Lo Bosco and V. Di Ges&#250;. D. Scaturro and A. Raimondi implemented part of the software needed for the comparative analysis of <it>GenClust</it>. R. Giancarlo coordinated the research and wrote the report. All authors have read and approved the manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>This work is partially supported by the Italian Ministry of Scientific Research, FIRB Project "Bioinformatica per la Genomica e la Proteomica", PRIN Project "Metodi Combinatori ed Algoritmici per la Scoperta di Patterns in Biosequenze" and PRIN Project "Acquisizione di Immagini TreD a Basso Costo".</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Stanford Microarray DataBase</p>
            </title>
            <url>http://genome-www5.stanford.edu/</url>
         </bibl>
         <bibl id="B2">
            <aug>
               <au>
                  <snm>Everitt</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Cluster Analysis</source>
            <publisher>London: Edward Arnold</publisher>
            <pubdate>1993</pubdate>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Cluster analysis and mathematical programming</p>
            </title>
            <aug>
               <au>
                  <snm>Hansen</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Jaumard</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Mathematical Programming</source>
            <pubdate>1997</pubdate>
            <volume>79</volume>
            <fpage>191</fpage>
            <lpage>215</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S0025-5610(97)00059-2</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <aug>
               <au>
                  <snm>Hartigan</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Clustering Algorithms</source>
            <publisher>John Wiley and Sons</publisher>
            <pubdate>1975</pubdate>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Data clustering: a Review</p>
            </title>
            <aug>
               <au>
                  <snm>Jain</snm>
                  <fnm>AK</fnm>
               </au>
               <au>
                  <snm>Murty</snm>
                  <fnm>MN</fnm>
               </au>
               <au>
                  <snm>Flynn</snm>
                  <fnm>PJ</fnm>
               </au>
            </aug>
            <source>ACM Computing Surveys</source>
            <pubdate>1999</pubdate>
            <volume>31</volume>
            <issue>3</issue>
            <fpage>264</fpage>
            <lpage>323</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1145/331499.331504</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <aug>
               <au>
                  <snm>Mirkin</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Mathematical Classification and Clustering</source>
            <publisher>Kluwer Academic Publisher</publisher>
            <pubdate>1996</pubdate>
         </bibl>
         <bibl id="B7">
            <aug>
               <au>
                  <snm>Rice</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Mathematical Statistics and Data Analysis</source>
            <publisher>Wadsworth</publisher>
            <pubdate>1996</pubdate>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Scoring clustering solutions by their biological relevance</p>
            </title>
            <aug>
               <au>
                  <snm>Gat-Viks</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Sharan</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Shamir</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <fpage>2381</fpage>
            <lpage>2389</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btg330</pubid>
                  <pubid idtype="pmpid" link="fulltext">14668221</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Monti</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Tamayo</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Mesirov</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Golub</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Machine Learning</source>
            <pubdate>2003</pubdate>
            <volume>52</volume>
            <fpage>91</fpage>
            <lpage>118</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1023/A:1023949509487</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Algorithmic approaches to clustering gene expression data</p>
            </title>
            <aug>
               <au>
                  <snm>Shamir</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Sharan</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Current Topics in Computational Biology</source>
            <publisher>Cambridge, MA: MIT Press</publisher>
            <editor>Jiang T, Smith T, Xu Y, Zhang M</editor>
            <pubdate>2003</pubdate>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Cluster analysis of gene expression data</p>
            </title>
            <aug>
               <au>
                  <snm>Yeung</snm>
                  <fnm>KY</fnm>
               </au>
            </aug>
            <source>PhD thesis</source>
            <publisher>University of Washington</publisher>
            <pubdate>2001</pubdate>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Validating clustering for gene expression data</p>
            </title>
            <aug>
               <au>
                  <snm>Yeung</snm>
                  <fnm>KY</fnm>
               </au>
               <au>
                  <snm>Haynor</snm>
                  <fnm>DR</fnm>
               </au>
               <au>
                  <snm>Ruzzo</snm>
                  <fnm>WL</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2001</pubdate>
            <volume>17</volume>
            <fpage>309</fpage>
            <lpage>318</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/17.4.309</pubid>
                  <pubid idtype="pmpid" link="fulltext">11301299</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <aug>
               <au>
                  <snm>Witten</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>Data mining: practical machine learning tools and techniques with Java implementations</source>
            <publisher>San Diego, CA: Academic Press</publisher>
            <pubdate>2000</pubdate>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Genetic clustering for automatic evolution of clusters and application to image classification</p>
            </title>
            <aug>
               <au>
                  <snm>Bandyopadhyay</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Maulik</snm>
                  <fnm>U</fnm>
               </au>
            </aug>
            <source>Pattern Recognition</source>
            <pubdate>2002</pubdate>
            <volume>35</volume>
            <issue>6</issue>
            <fpage>1197</fpage>
            <lpage>1208</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S0031-3203(01)00108-X</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>In search of optimal clusters using genetic algorithms</p>
            </title>
            <aug>
               <au>
                  <snm>Murthy</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Chowdhury</snm>
                  <fnm>N</fnm>
               </au>
            </aug>
            <source>Pattern Recognition Letters</source>
            <pubdate>1996</pubdate>
            <volume>17</volume>
            <issue>8</issue>
            <fpage>825</fpage>
            <lpage>832</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/0167-8655(96)00043-8</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <aug>
               <au>
                  <snm>Goldberg</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Genetic Algorithms in Search, Optimization and Machine Learning</source>
            <publisher>Reading, MA: Addison-Wesley</publisher>
            <pubdate>1989</pubdate>
         </bibl>
         <bibl id="B17">
            <title>
               <p>CLICK and EXPANDER: a system for clustering and visualizing gene expression data</p>
            </title>
            <aug>
               <au>
                  <snm>Sharan</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Maron-Katz</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Shamir</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <fpage>1787</fpage>
            <lpage>1799</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btg232</pubid>
                  <pubid idtype="pmpid" link="fulltext">14512350</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Clustering of gene expression patterns</p>
            </title>
            <aug>
               <au>
                  <snm>Ben-Dor</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Shamir</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Yakhini</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>Journal of Computational Biology</source>
            <pubdate>1999</pubdate>
            <volume>6</volume>
            <fpage>281</fpage>
            <lpage>297</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1089/106652799318274</pubid>
                  <pubid idtype="pmpid" link="fulltext">10582567</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>ISI Essential Science Indicators</p>
            </title>
            <url>http://www.esi-topics.com/fbp/fbp-december2002.html</url>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation</p>
            </title>
            <aug>
               <au>
                  <snm>Tamayo</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Slonim</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Mesirov</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Zhu</snm>
                  <fnm>Q</fnm>
               </au>
               <au>
                  <snm>Kitareewan</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Dmitrovsky</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Lander</snm>
                  <fnm>ES</fnm>
               </au>
               <au>
                  <snm>Golub</snm>
                  <fnm>TR</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>1999</pubdate>
            <volume>96</volume>
            <fpage>2907</fpage>
            <lpage>2912</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1073/pnas.96.6.2907</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Comparing partitions</p>
            </title>
            <aug>
               <au>
                  <snm>Hubert</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Arabie</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Journal of Classification</source>
            <pubdate>1985</pubdate>
            <volume>2</volume>
            <fpage>193</fpage>
            <lpage>218</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1007/BF01908075</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>An examination of procedures for determining the number of clusters in a data set</p>
            </title>
            <aug>
               <au>
                  <snm>Milligan</snm>
                  <fnm>GW</fnm>
               </au>
               <au>
                  <snm>Cooper</snm>
                  <fnm>MC</fnm>
               </au>
            </aug>
            <source>Psychometrika</source>
            <pubdate>1985</pubdate>
            <volume>50</volume>
            <fpage>159</fpage>
            <lpage>179</lpage>
         </bibl>
         <bibl id="B23">
            <title>
               <p>A study of the comparability of external criteria for hierarchical cluster analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Milligan</snm>
                  <fnm>GW</fnm>
               </au>
               <au>
                  <snm>Cooper</snm>
                  <fnm>MC</fnm>
               </au>
            </aug>
            <source>Multivariate Behavioral Research</source>
            <pubdate>1986</pubdate>
            <volume>21</volume>
            <fpage>441</fpage>
            <lpage>458</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1207/s15327906mbr2104_5</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Developmental kinetics of GAD family mRNAs parallel neurogenesis in the rat spinal cord</p>
            </title>
            <aug>
               <au>
                  <snm>Somogyi</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Wen</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Ma</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Barker</snm>
                  <fnm>JL</fnm>
               </au>
            </aug>
            <source>J Neurosci</source>
            <pubdate>1995</pubdate>
            <volume>15</volume>
            <fpage>2575</fpage>
            <lpage>2591</lpage>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Large-scale temporal gene expression mapping of central nervous system development</p>
            </title>
            <aug>
               <au>
                  <snm>Wen</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Fuhrman</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Michaels</snm>
                  <fnm>GS</fnm>
               </au>
               <au>
                  <snm>Carr</snm>
                  <fnm>DB</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Barker</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Somogyi</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>1998</pubdate>
            <volume>95</volume>
            <fpage>334</fpage>
            <lpage>339</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1073/pnas.95.1.334</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization</p>
            </title>
            <aug>
               <au>
                  <snm>Spellman</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Sherlock</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>MQ</fnm>
               </au>
               <au>
                  <snm>Iyer</snm>
                  <fnm>VR</fnm>
               </au>
               <au>
                  <snm>Anders</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Eisen</snm>
                  <fnm>MB</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>PO</fnm>
               </au>
               <au>
                  <snm>Botstein</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Futcher</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Mol Biol Cell</source>
            <pubdate>1998</pubdate>
            <volume>9</volume>
            <fpage>3273</fpage>
            <lpage>3297</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">25624</pubid>
                  <pubid idtype="pmpid" link="fulltext">9843569</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>A genome-wide transcriptional analysis of the mitotic cell cycle</p>
            </title>
            <aug>
               <au>
                  <snm>Cho</snm>
                  <fnm>RJ</fnm>
               </au>
               <au>
                  <snm>Campbell</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Winzeler</snm>
                  <fnm>EA</fnm>
               </au>
               <au>
                  <snm>Steinmetz</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Conway</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Wodicka</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Wolfsberg</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Gabrielian</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Landsman</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Lockhart</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Davis</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Molecular Cell</source>
            <pubdate>1998</pubdate>
            <volume>2</volume>
            <fpage>65</fpage>
            <lpage>73</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S1097-2765(00)80114-8</pubid>
                  <pubid idtype="pmpid" link="fulltext">9702192</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Ka Yee Yeung Web Page for FOM</p>
            </title>
            <url>http://faculty.washington.edu/kayee/cluster/</url>
         </bibl>
         <bibl id="B29">
            <title>
               <p>An algorithm for clustering of cDNAs for gene expression analysis using short oligonucleotide fingerprints</p>
            </title>
            <aug>
               <au>
                  <snm>Hartuv</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Schmitt</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Lange</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Meier-Ewert</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Lehrach</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Shamir</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Genomics</source>
            <pubdate>2000</pubdate>
            <volume>66</volume>
            <fpage>249</fpage>
            <lpage>256</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/geno.2000.6187</pubid>
                  <pubid idtype="pmpid" link="fulltext">10873379</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>GenClust Supplementary Material WebPage</p>
            </title>
            <url>http://www.math.unipa.it/~lobosco/genclust/</url>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Expander Home Page</p>
            </title>
            <url>http://www.cs.tau.ac.il/~rshamir/expander/expander.html</url>
         </bibl>
      </refgrp>
   </bm>
</art>
