<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-6-237</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Software</dochead>
      <bibl>
         <title>
            <p>Measuring similarities between transcription factor binding sites</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Kielbasa</snm>
               <mi>M</mi>
               <fnm>Szymon</fnm>
               <insr iid="I1"/>
               <email>s.kielbasa@biologie.hu-berlin.de</email>
            </au>
            <au id="A2">
               <snm>Gonze</snm>
               <fnm>Didier</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>d.gonze@biologie.hu-berlin.de</email>
            </au>
            <au id="A3">
               <snm>Herzel</snm>
               <fnm>Hanspeter</fnm>
               <insr iid="I1"/>
               <email>h.herzel@biologie.hu-berlin.de</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Institute for Theoretical Biology, Humboldt University, Invalidenstra&#223;e 43, D-10115 Berlin, Germany</p>
            </ins>
            <ins id="I2">
               <p>Unit&#233; de Chronobiologie Th&#233;orique, Universit&#233; Libre de Bruxelles, CP 231, Campus Plaine, Bvd du Triomphe, B-1050 Bruxelles, Belgium</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2005</pubdate>
         <volume>6</volume>
         <issue>1</issue>
         <fpage>237</fpage>
         <url>http://www.biomedcentral.com/1471-2105/6/237</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">16191190</pubid>
               <pubid idtype="doi">10.1186/1471-2105-6-237</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>22</day>
               <month>11</month>
               <year>2004</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>28</day>
               <month>9</month>
               <year>2005</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>28</day>
               <month>9</month>
               <year>2005</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2005</year>
         <collab>Kielbasa et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Collections of transcription factor binding profiles (Transfac, Jaspar) are essential to identify regulatory elements in DNA sequences. Subsets of highly similar profiles complicate large-scale analysis of transcription factor binding sites.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We propose to identify and group similar profiles using two independent similarity measures: <it>&#967;</it><sup>2 </sup>distances between position frequency matrices (PFMs) and correlation coefficients between position weight matrix (PWM) scores.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>We show that these measures complement each other and allow Jaspar and Transfac matrices to be associated. Clusters of highly similar matrices are identified and can be used to optimise the search for regulatory elements. Moreover, the application of the measures is illustrated by assigning E-box matrices of a SELEX experiment and of experimentally characterised binding sites of circadian clock genes to the Myc-Max cluster.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>In order to dissect the complex machinery of transcriptional control, computational tools are widely used <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. Candidate binding sites of known transcription factors are located by consensus sequence search or by binding scores calculated from position weight matrices (PWMs) <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. These matrices are derived from position frequency matrices (PFMs) obtained by aligning binding sites for a given transcription factor. PFMs contain the observed nucleotide frequencies at each position of the alignment. A popular collection of eukaryotic PFMs is given by the Transfac database <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. Furthermore, an open-access database, Jaspar <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, has been compiled recently.</p>
         <p>On-line tools are available to calculate high-scoring binding sites on the basis of these matrix collections <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>. For a given transcription factor these programs predict many binding sites (on average one every 1000 bp), implying a large excess of false positives <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. The situation is even worse if hundreds of different binding profiles are studied in parallel, which leads to multiple testing issues. Often these predictions overlap as a result of similarities between transcription factor binding profiles.</p>
         <p>First steps to overcome the flood of false positive signals are accurate predictions of promoter regions and enhancers <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr></abbrgrp>. Phylogenetic footprinting <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>, correlation with gene expression data <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp> or analysis of cooperative binding of multiple transcription factors <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> can reduce the number of false positives by at least an order of magnitude. Another helpful strategy is the <it>a priori </it>reduction of the number of matrices to be considered. However, a user-defined preselection of a few matrices is highly subjective and might hide novel interactions of several transcription factors. Therefore, in this paper we combine two objective criteria to measure similarities of transcription factor binding site profiles. These measures make it possible to construct groups of similar profiles. Representative matrices of the groups may then be chosen to constitute a reduced and unbiased list of independent profiles for searching binding sites.</p>
         <p>Similarities in the collections of matrices may arise from several sources:</p>
         <p>1. Identical transcription factors are represented by different matrices. This occurs, for example, because of the distinct nomenclature in Transfac and Jaspar (the TATA-binding protein is referred to as TATA in Transfac and as TBP in Jaspar) or because matrices were obtained with different methods (see for example Transfac matrices SRF_01 and SRF_Q6) or stringency criteria (see for example AP1_Q2 and AP1_Q6).</p>
         <p>2. Factors within one family are represented by similar matrices due to the conserved structure of DNA-binding domains <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. For example, both ATF and CREB matrices belong to the same bZIP family and recognise the TGACGT consensus sequence.</p>
         <p>3. There might be so far undetected similarities of different transcription factor binding sites. Such similarities can point to a possible cross-talk between different regulatory pathways (see our discussion of E-box binding sites below).</p>
         <p>4. It might be difficult to distinguish matrices for which only a few binding sites are known.</p>
         <p>In order to identify similar matrices we combine two similarity measures. The first one is based on the <it>&#967;</it><sup>2 </sup>distance of position frequencies of PFMs. The other utilizes scores from the corresponding position weight matrices (PWMs) &#8211; we calculate for a given pair of binding profiles the scores along a test DNA sequence and take the corresponding Pearson correlation coefficient as a similarity measure. Although related similarity measures have already been studied individually <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr></abbrgrp>, our combined approach applied to the Transfac matrices reveals that the two selected measures capture different properties of the matrices and therefore complement each other. Moreover, since for many matrices only a few experimentally verified binding sites are available, we take these small sample sizes into account in both measures. The application of the measures is illustrated by mapping CLOCK-BMAL1 binding sites of circadian clock genes to the Myc-Max family.</p>
      </sec>
      <sec>
         <st>
            <p>Implementation</p>
         </st>
         <sec>
            <st>
               <p>Databases</p>
            </st>
            <p>A commonly used database of experimentally verified transcription factor binding sites is Transfac <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. The release from May 2004 provides 694 position frequency matrices (PFMs) covering vertebrates, plants, insects and fungi. Recently, the publicly available Jaspar database <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> was compiled with 108 PFMs associated mainly with vertebrates. For our large-scale statistical analysis we discarded all matrices with inconsistencies, for example matrices for which the number of sites aligned to construct the matrix (sample size) could not be determined. Furthermore, we excluded rather poor matrices with a length below 6 bases or a sample size below 5. After these consistency checks and filtering steps we arrived at 637 different matrices for Transfac and 103 matrices for Jaspar. All the matrices can be characterized by their length, the sample size, and the information content <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> (Tab. <tblr tid="T1">1</tblr>).</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Properties of Transfac and Jaspar matrices: We removed matrices for which the sample size was normalized to 100 and no information about the actual number of samples was available, as well as matrices of length below 6 or sample size below 5.</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="center">
                        <p>Property</p>
                     </c>
                     <c ca="center">
                        <p>Transfac</p>
                     </c>
                     <c ca="center">
                        <p>Jaspar</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Number of original matrices</p>
                     </c>
                     <c ca="center">
                        <p>694</p>
                     </c>
                     <c ca="center">
                        <p>108</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Number of matrices after filtering</p>
                     </c>
                     <c ca="center">
                        <p>637</p>
                     </c>
                     <c ca="center">
                        <p>103</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Min length</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Max length</p>
                     </c>
                     <c ca="center">
                        <p>30</p>
                     </c>
                     <c ca="center">
                        <p>30</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Median length</p>
                     </c>
                     <c ca="center">
                        <p>12</p>
                     </c>
                     <c ca="center">
                        <p>11</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Min sample size</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Max sample size</p>
                     </c>
                     <c ca="center">
                        <p>389</p>
                     </c>
                     <c ca="center">
                        <p>389</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Median sample size</p>
                     </c>
                     <c ca="center">
                        <p>18</p>
                     </c>
                     <c ca="center">
                        <p>23</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Min information content</p>
                     </c>
                     <c ca="center">
                        <p>3.6</p>
                     </c>
                     <c ca="center">
                        <p>5.7</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Max information content</p>
                     </c>
                     <c ca="center">
                        <p>44.3</p>
                     </c>
                     <c ca="center">
                        <p>26.2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Median information content</p>
                     </c>
                     <c ca="center">
                        <p>12.8</p>
                     </c>
                     <c ca="center">
                        <p>11.6</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
         </sec>
         <sec>
            <st>
               <p><it>&#967;</it><sup>2 </sup>distance <it>D </it>between position frequency matrices</p>
            </st>
            <p>For each possible overlap (of at least 6 bases) of two PFMs we count the number of corresponding columns that differ significantly. This task can be addressed by the homogeneity test using the <it>&#967;</it><sup>2 </sup>measure with 3 degrees of freedom. The application of PFMs for the characterization of binding sites implies that the nucleotide positions are regarded as independent. Even though statistical dependencies between positions are known <abbrgrp><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr></abbrgrp>, the assumption of independent positions is a rather good approximation <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B26">26</abbr></abbrgrp>. In the following we denote by <it>f</it><sub><it>b,i </it></sub>and <it>g</it><sub><it>b,i </it></sub>the entries of the overlapping parts of the two frequency matrices to be compared. The index <it>i </it>refers to the base position along the matrices and <it>b </it>enumerates the four nucleotides A, C, G and T. The <it>&#967;</it><sup>2 </sup>distance at the position <it>i </it>is then given by:</p>
            <p>
               <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-6-237-i1">
                  <m:semantics>
                     <m:mrow>
                        <m:msup>
                           <m:mi>&#967;</m:mi>
                           <m:mn>2</m:mn>
                        </m:msup>
                        <m:mo>=</m:mo>
                        <m:mstyle displaystyle="true">
                           <m:munder>
                              <m:mo>&#8721;</m:mo>
                              <m:mrow>
                                 <m:mi>b</m:mi>
                                 <m:mo>=</m:mo>
                                 <m:mtext>A</m:mtext>
                                 <m:mo>,</m:mo>
                                 <m:mtext>C</m:mtext>
                                 <m:mo>,</m:mo>
                                 <m:mtext>G</m:mtext>
                                 <m:mo>,</m:mo>
                                 <m:mtext>T</m:mtext>
                              </m:mrow>
                           </m:munder>
                           <m:mrow>
                              <m:mfrac>
                                 <m:mrow>
                                    <m:msup>
                                       <m:mrow>
                                          <m:mo stretchy="false">(</m:mo>
                                          <m:msub>
                                             <m:mi>N</m:mi>
                                             <m:mrow>
                                                <m:mi>g</m:mi>
                                                <m:mo>,</m:mo>
                                                <m:mi>i</m:mi>
                                             </m:mrow>
                                          </m:msub>
                                          <m:msub>
                                             <m:mi>f</m:mi>
                                             <m:mrow>
                                                <m:mi>b</m:mi>
                                                <m:mo>,</m:mo>
                                                <m:mi>i</m:mi>
                                             </m:mrow>
                                          </m:msub>
                                          <m:mo>&#8722;</m:mo>
                                          <m:msub>
                                             <m:mi>N</m:mi>
                                             <m:mrow>
                                                <m:mi>f</m:mi>
                                                <m:mo>,</m:mo>
                                                <m:mi>i</m:mi>
                                             </m:mrow>
                                          </m:msub>
                                          <m:msub>
                                             <m:mi>g</m:mi>
                                             <m:mrow>
                                                <m:mi>b</m:mi>
                                                <m:mo>,</m:mo>
                                                <m:mi>i</m:mi>
                                             </m:mrow>
                                          </m:msub>
                                          <m:mo stretchy="false">)</m:mo>
                                       </m:mrow>
                                       <m:mn>2</m:mn>
                                    </m:msup>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:msub>
                                       <m:mi>N</m:mi>
                                       <m:mrow>
                                          <m:mi>f</m:mi>
                                          <m:mo>,</m:mo>
                                          <m:mi>i</m:mi>
                                       </m:mrow>
                                    </m:msub>
                                    <m:msub>
                                       <m:mi>N</m:mi>
                                       <m:mrow>
                                          <m:mi>g</m:mi>
                                          <m:mo>,</m:mo>
                                          <m:mi>i</m:mi>
                                       </m:mrow>
                                    </m:msub>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:msub>
                                       <m:mi>f</m:mi>
                                       <m:mrow>
                                          <m:mi>b</m:mi>
                                          <m:mo>,</m:mo>
                                          <m:mi>i</m:mi>
                                       </m:mrow>
                                    </m:msub>
                                    <m:mo>+</m:mo>
                                    <m:msub>
                                       <m:mi>g</m:mi>
                                       <m:mrow>
                                          <m:mi>b</m:mi>
                                          <m:mo>,</m:mo>
                                          <m:mi>i</m:mi>
                                       </m:mrow>
                                    </m:msub>
                                    <m:mo stretchy="false">)</m:mo>
                                 </m:mrow>
                              </m:mfrac>
                           </m:mrow>
                        </m:mstyle>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqaHhpWydaahaaWcbeqaaiabikdaYaaakiabg2da9maaqafabaWaaSaaaeaacqGGOaakcqWGobGtdaWgaaWcbaGaem4zaCMaeiilaWIaemyAaKgabeaakiabdAgaMnaaBaaaleaacqWGIbGycqGGSaalcqWGPbqAaeqaaOGaeyOeI0IaemOta40aaSbaaSqaaiabdAgaMjabcYcaSiabdMgaPbqabaGccqWGNbWzdaWgaaWcbaGaemOyaiMaeiilaWIaemyAaKgabeaakiabcMcaPmaaCaaaleqabaGaeGOmaidaaaGcbaGaemOta40aaSbaaSqaaiabdAgaMjabcYcaSiabdMgaPbqabaGccqWGobGtdaWgaaWcbaGaem4zaCMaeiilaWIaemyAaKgabeaakiabcIcaOiabdAgaMnaaBaaaleaacqWGIbGycqGGSaalcqWGPbqAaeqaaOGaey4kaSIaem4zaC2aaSbaaSqaaiabdkgaIjabcYcaSiabdMgaPbqabaGccqGGPaqkaaaaleaacqWGIbGycqGH9aqpcqqGbbqqcqGGSaalcqqGdbWqcqGGSaalcqqGhbWrcqGGSaalcqqGubavaeqaniabggHiLdaaaa@6A6E@</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
            <p>where <it>N</it><sub><it>f,i </it></sub>= &#8721;<sub><it>b</it></sub><it>f</it><sub><it>b,i </it></sub>and <it>N</it><sub><it>g,i </it></sub>= &#8721;<sub><it>b</it></sub><it>g</it><sub><it>b,i </it></sub>are the sample sizes of the matrix columns at position <it>i</it>. If <it>&#967;</it><sup>2 </sup>exceeds the threshold of <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-6-237-i2"><m:semantics><m:mrow><m:msubsup><m:mi>&#967;</m:mi><m:mrow><m:mtext>th</m:mtext></m:mrow><m:mn>2</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqaHhpWydaqhaaWcbaGaeeiDaqNaeeiAaGgabaGaeGOmaidaaaaa@3248@</m:annotation></m:semantics></m:math> (<it>p </it>= 0.05) = 7.81, the null hypothesis that the base counts in both columns are from the same distribution is rejected at the significance level of 0.05. To keep the analysis simple we count only the number of significantly different positions. The example in Fig. <figr fid="F1">1</figr> shows that for an appropriate alignment (with shift = 3) of the two matrices all <it>&#967;</it><sup>2</sup>-values are below the <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-6-237-i2"><m:semantics><m:mrow><m:msubsup><m:mi>&#967;</m:mi><m:mrow><m:mtext>th</m:mtext></m:mrow><m:mn>2</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqaHhpWydaqhaaWcbaGaeeiDaqNaeeiAaGgabaGaeGOmaidaaaaa@3248@</m:annotation></m:semantics></m:math> threshold and hence no column appears to be different. Although the counts in some columns look quite different, the limited sample size allows no statistically significant discrimination.</p>
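            <p>The column test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names <it>column_chi2</it> and <it>columns_differ</it> are hypothetical, each column is assumed to be a list of four base counts in the order A, C, G, T, and a base that is absent from both columns is assumed to contribute zero to the sum.</p>

```python
CHI2_THRESHOLD = 7.81  # critical value for 3 degrees of freedom at p = 0.05 (from the text)


def column_chi2(f_col, g_col):
    """Chi-square homogeneity statistic for two columns of base counts (A, C, G, T).

    Hypothetical helper: implements the sum over b of
    (N_g * f_b - N_f * g_b)^2 / (N_f * N_g * (f_b + g_b)).
    """
    n_f = sum(f_col)
    n_g = sum(g_col)
    if n_f == 0 or n_g == 0:
        raise ValueError("each column needs at least one observed site")
    chi2 = 0.0
    for f_b, g_b in zip(f_col, g_col):
        if f_b + g_b == 0:
            continue  # assumption: a base absent from both columns contributes zero
        chi2 += (n_g * f_b - n_f * g_b) ** 2 / (n_f * n_g * (f_b + g_b))
    return chi2


def columns_differ(f_col, g_col, threshold=CHI2_THRESHOLD):
    """True if the null hypothesis of a common distribution is rejected."""
    return column_chi2(f_col, g_col) > threshold


# Same proportions, different sample sizes
small = column_chi2([3, 1, 0, 0], [1, 3, 0, 0])
large = column_chi2([30, 10, 0, 0], [10, 30, 0, 0])
print(small, columns_differ([3, 1, 0, 0], [1, 3, 0, 0]))
print(large, columns_differ([30, 10, 0, 0], [10, 30, 0, 0]))
```

            <p>In this toy example the two small columns (4 sites each) give a statistic of 2.0 and are not distinguished, whereas the same proportions with 40 sites each give 20.0 and are judged different. This mirrors the point made in the text that limited sample sizes prevent significant discrimination.</p>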
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>CREB versus ATF matrices: The distance <it>D </it>is computed for each possible alignment between the two matrices</p>
               </caption>
               <text>
                  <p>CREB versus ATF matrices: The distance <it>D </it>is computed for each possible alignment between the two matrices. For each aligned column, we calculate the <it>&#967;</it><sup>2 </sup>score. <it>D </it>is then the number of <it>&#967;</it><sup>2 </sup>values which exceed the threshold <graphic file="1471-2105-6-237-i2.gif"/> = 7.81. For shift = 0, the two matrices are not properly aligned, <it>D </it>= 7. For shift = 3, the two matrices are properly aligned, <it>D </it>= 0.</p>
               </text>
               <graphic file="1471-2105-6-237-1"/>
            </fig>
            <p>Obviously, the number of significantly different columns depends on the relative position of both matrices. In our algorithm we study all possible alignments that have a minimum overlap of 6 bases and contain at least 75% of the information content of each matrix. We calculate the minimal number of different positions among these alignments. We call this number <it>D </it>and interpret it as the distance between the compared matrices. Fig. <figr fid="F1">1</figr> illustrates that for a correct alignment of the ATF and CREB matrices a distance <it>D </it>= 0 is obtained, whereas other alignments lead to significantly different columns.</p>
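            <p>The search over alignments can be sketched as follows. This is a simplified illustration, not the authors' code: the names <it>distance_d</it> and <it>pure_column</it> are hypothetical, matrices are assumed to be lists of non-empty columns of base counts (A, C, G, T), only ungapped shifts are considered, and the additional requirement that an alignment retains at least 75% of the information content of each matrix is deliberately omitted.</p>

```python
def column_chi2(f_col, g_col):
    """Chi-square homogeneity statistic for two non-empty columns of base counts."""
    n_f, n_g = sum(f_col), sum(g_col)
    return sum(
        (n_g * f_b - n_f * g_b) ** 2 / (n_f * n_g * (f_b + g_b))
        for f_b, g_b in zip(f_col, g_col)
        if f_b + g_b > 0  # assumption: bases absent from both columns contribute zero
    )


def distance_d(f, g, min_overlap=6, threshold=7.81):
    """Return (D, shift): the minimal number of significantly different
    columns over all ungapped alignments, and the shift that achieves it.

    Column i of f is aligned with column i - shift of g.
    The information-content constraint from the text is NOT applied here.
    Returns None if no alignment reaches the minimum overlap.
    """
    best = None
    for shift in range(min_overlap - len(g), len(f) - min_overlap + 1):
        start = max(0, shift)
        stop = min(len(f), shift + len(g))
        if min_overlap > stop - start:
            continue
        differing = sum(
            1
            for i in range(start, stop)
            if column_chi2(f[i], g[i - shift]) > threshold
        )
        if best is None or best[0] > differing:
            best = (differing, shift)
    return best


def pure_column(base, count=20):
    """Toy column in which all sites show the same base (illustration only)."""
    return [count if b == base else 0 for b in "ACGT"]


f = [pure_column(b) for b in "AACGTTGC"]  # toy 8-column matrix
g = f[2:]                                 # the same profile, trimmed by two columns
print(distance_d(f, g))                   # (0, 2)
```

            <p>For the toy matrices the unshifted alignment yields six differing columns, while a shift of 2 recovers the correct alignment with <it>D </it>= 0, analogous to the behaviour shown in Fig. <figr fid="F1">1</figr>.</p>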
            <p>An advantage of the distance measure we use in comparison to earlier studies <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B17">17</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp> is the emphasis on the limited sample size of many matrices. Only a few binding profiles, such as that of the Sp1 factor, are characterized by hundreds of experimentally verified sites. The more common sample size is around 15&#8211;20 (see Tab. <tblr tid="T1">1</tblr>) and, thus, it is much more difficult to distinguish matrices. The <it>&#967;</it><sup>2 </sup>measure leading to the distance <it>D </it>takes into account the limited sample size in a statistically well-defined manner. The proposed measure could be generalized by allowing gaps, using the sum of scores or by taking the number of possible shifts into account. Since we studied only rather strong similarities in this paper, our simple discrete threshold <it>D </it>&#8804; 1 was sufficient.</p>
         </sec>
         <sec>
            <st>
               <p>Correlation <it>C </it>between position weight matrix scores</p>
            </st>
            <p>The information on experimentally verified binding sites stored in PFMs can be exploited to predict novel sites. For this purpose position weight matrices (PWMs) can be constructed from the counts <it>f</it><sub><it>b,i </it></sub>in the following manner <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B27">27</abbr></abbrgrp>. First, the probability <it>p</it><sub><it>b,i </it></sub>of a base <it>b </it>at a given position <it>i </it>is given by:</p>
            <p>
               <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-6-237-i3">
                  <m:semantics>
                     <m:mrow>
                        <m:msub>
                           <m:mi>p</m:mi>
                           <m:mrow>
                              <m:mi>b</m:mi>
                              <m:mo>,</m:mo>
                              <m:mi>i</m:mi>
                           </m:mrow>
                        </m:msub>
                        <m:mo>=</m:mo>
                        <m:mfrac>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>f</m:mi>
                                 <m:mrow>
                                    <m:mi>b</m:mi>
                                    <m:mo>,</m:mo>
                                    <m:mi>i</m:mi>
                                 </m:mrow>
                              </m:msub>
                              <m:mo>+</m:mo>
                              <m:msub>
                                 <m:mi>s</m:mi>
                                 <m:mi>b</m:mi>
                              </m:msub>
                           </m:mrow>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>N</m:mi>
                                 <m:mi>i</m:mi>
                              </m:msub>
                              <m:mo>+</m:mo>
                              <m:mstyle displaystyle="true">
                                 <m:msub>
                                    <m:mo>&#8721;</m:mo>
                                    <m:mrow>
                                       <m:msup>
                                          <m:mi>b</m:mi>
                                          <m:mo>&#8242;</m:mo>
                                       </m:msup>
                                       <m:mo>=</m:mo>
                                       <m:mtext>A</m:mtext>
                                       <m:mo>,</m:mo>
                                       <m:mtext>C</m:mtext>
                                       <m:mo>,</m:mo>
                                       <m:mtext>G</m:mtext>
                                       <m:mo>,</m:mo>
                                       <m:mtext>T</m:mtext>
                                    </m:mrow>
                                 </m:msub>
                                 <m:mrow>
                                    <m:msub>
                                       <m:mi>s</m:mi>
                                       <m:msup>
                                          <m:mi>b</m:mi>
                                          <m:mo>&#8242;</m:mo>
                                       </m:msup>
                                    </m:msub>
                                 </m:mrow>
                              </m:mstyle>
                           </m:mrow>
                        </m:mfrac>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGWbaCdaWgaaWcbaGaemOyaiMaeiilaWIaemyAaKgabeaakiabg2da9maalaaabaGaemOzay2aaSbaaSqaaiabdkgaIjabcYcaSiabdMgaPbqabaGccqGHRaWkcqWGZbWCdaWgaaWcbaGaemOyaigabeaaaOqaaiabd6eaonaaBaaaleaacqWGPbqAaeqaaOGaey4kaSYaaabeaeaacqWGZbWCdaWgaaWcbaGafmOyaiMbauaaaeqaaaqaaiqbdkgaIzaafaGaeyypa0JaeeyqaeKaeiilaWIaee4qamKaeiilaWIaee4raCKaeiilaWIaeeivaqfabeqdcqGHris5aaaaaaa@4D8D@</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
            <p>where <it>N</it><sub><it>i </it></sub>= &#8721;<sub><it>b' </it></sub><it>f</it><sub><it>b',i </it></sub>denotes the sample size at the position <it>i </it>leading to the relative frequency <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-6-237-i4"><m:semantics><m:mrow><m:mfrac><m:mrow><m:msub><m:mi>f</m:mi><m:mrow><m:mi>b</m:mi><m:mo>,</m:mo><m:mi>i</m:mi></m:mrow></m:msub></m:mrow><m:mrow><m:msub><m:mi>N</m:mi><m:mi>i</m:mi></m:msub></m:mrow></m:mfrac></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabdAgaMnaaBaaaleaacqWGIbGycqGGSaalcqWGPbqAaeqaaaGcbaGaemOta40aaSbaaSqaaiabdMgaPbqabaaaaaaa@347B@</m:annotation></m:semantics></m:math>. This estimator is modified using pseudo-counts <it>s</it><sub><it>b</it></sub>. As suggested earlier <abbrgrp><abbr bid="B28">28</abbr></abbrgrp> we choose <it>s</it><sub><it>b </it></sub>= <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-6-237-i5"><m:semantics><m:mrow><m:mfrac><m:mrow><m:msqrt><m:mrow><m:msub><m:mi>N</m:mi><m:mi>i</m:mi></m:msub></m:mrow></m:msqrt></m:mrow><m:mn>4</m:mn></m:mfrac></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaamaakaaabaGaemOta40aaSbaaSqaaiabdMgaPbqabaaabeaaaOqaaiabisda0aaaaaa@3078@</m:annotation></m:semantics></m:math>, i.e. the pseudo-count is proportional to the standard deviation of the counted frequencies. Such a choice of relatively large pseudo-counts has a pronounced effect on PWMs with a small sample size. Due to the pseudo-counts the estimated probabilities are strictly positive even if zeros appear in the PFM. From the estimated probabilities <it>p</it><sub><it>b,i </it></sub>we obtain the weights <it>w</it><sub><it>b,i </it></sub>as follows:</p>
            <p>
               <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-6-237-i6">
                  <m:semantics>
                     <m:mrow>
                        <m:msub>
                           <m:mi>w</m:mi>
                           <m:mrow>
                              <m:mi>b</m:mi>
                              <m:mo>,</m:mo>
                              <m:mi>i</m:mi>
                           </m:mrow>
                        </m:msub>
                        <m:mo>=</m:mo>
                        <m:msub>
                           <m:mrow>
                              <m:mi>log</m:mi>
                              <m:mo>&#8289;</m:mo>
                           </m:mrow>
                           <m:mn>2</m:mn>
                        </m:msub>
                        <m:mfrac>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>p</m:mi>
                                 <m:mrow>
                                    <m:mi>b</m:mi>
                                    <m:mo>,</m:mo>
                                    <m:mi>i</m:mi>
                                 </m:mrow>
                              </m:msub>
                           </m:mrow>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>r</m:mi>
                                 <m:mi>b</m:mi>
                              </m:msub>
                           </m:mrow>
                        </m:mfrac>
                        <m:mo>,</m:mo>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWG3bWDdaWgaaWcbaGaemOyaiMaeiilaWIaemyAaKgabeaakiabg2da9iGbcYgaSjabc+gaVjabcEgaNnaaBaaaleaacqaIYaGmaeqaaOWaaSaaaeaacqWGWbaCdaWgaaWcbaGaemOyaiMaeiilaWIaemyAaKgabeaaaOqaaiabdkhaYnaaBaaaleaacqWGIbGyaeqaaaaakiabcYcaSaaa@4134@</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
            <p>where <it>r</it><sub><it>b </it></sub>is the <it>a priori </it>probability of finding base <it>b </it>in the DNA sequence. Consequently, the weights <it>w</it><sub><it>b,i </it></sub>represent log-likelihood ratios for finding base <it>b </it>at position <it>i</it>. Finally, the score <it>S</it><sub><it>k </it></sub>at position <it>k </it>of a test DNA sequence is the sum of the weights corresponding to the bases observed in the DNA sequence at consecutive positions starting from position <it>k</it>. The sum <it>S</it><sub><it>k </it></sub>is computed for each position <it>k </it>of the matrix along the DNA sequence. High positive scores <it>S</it><sub><it>k </it></sub>indicate locations in the test DNA sequence with strong binding affinities whereas zero or negative scores are found elsewhere (Fig. <figr fid="F2">2</figr>).</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Comparison of ATF and CREB matrices: Correlation <it>C </it>of ATF and CREB scores along a test DNA sequence</p>
               </caption>
               <text>
                  <p>Comparison of ATF and CREB matrices: Correlation <it>C </it>of ATF and CREB scores along a test DNA sequence. Left: first 30 scores for ATF (solid line) and CREB (dashed line). Right: scores for ATF versus scores for CREB. Only the first 200 scores are plotted, but the full length of the test DNA sequence is 10000 bases. Upper (shift = 0): the matrices are not properly aligned (<it>C </it>= 0.068). Lower (shift = 3): the matrices ATF and CREB are properly aligned and both reveal a binding site at position 20 (<it>C </it>= 0.881).</p>
               </text>
               <graphic file="1471-2105-6-237-2"/>
            </fig>
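            <p>The conversion from counts to weights and the scoring of a test sequence described above can be sketched as follows. This is a minimal illustration, not the implementation used for the results in this paper: the function names pwm_from_pfm and scores, the representation of a matrix as a list of per-position dictionaries, and the uniform default background are assumptions made for the example, and every column is assumed to contain at least one count.</p>

```python
import math

BASES = "ACGT"


def pwm_from_pfm(pfm, background=None):
    """Convert a position frequency matrix (PFM) into a weight matrix (PWM).

    pfm: list of columns, each a dict mapping a base to its count f_{b,i}.
    background: dict of a priori base probabilities r_b (uniform if None;
    the uniform default is an assumption of this sketch).
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    pwm = []
    for column in pfm:
        n_i = sum(column[b] for b in BASES)   # sample size N_i
        s_b = math.sqrt(n_i) / 4.0            # pseudo-count, identical for all bases
        total = n_i + 4 * s_b                 # N_i plus the sum of the pseudo-counts
        pwm.append({
            b: math.log2(((column[b] + s_b) / total) / background[b])
            for b in BASES
        })
    return pwm


def scores(pwm, sequence):
    """Return the score S_k for every start position k where the matrix fits."""
    width = len(pwm)
    return [
        sum(pwm[i][sequence[k + i]] for i in range(width))
        for k in range(len(sequence) - width + 1)
    ]
```

            <p>Because the pseudo-counts are strictly positive for non-empty columns, the logarithm is always defined, even for bases that never occur at a position of the PFM.</p>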
            <p>This widely used technique of score calculation leads immediately to the second similarity measure (similar in spirit to the method used in <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>, but modified to take into account the sample sizes of compared matrices). For two given matrices <it>f </it>and <it>g </it>we can directly obtain the corresponding scores <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-6-237-i7"><m:semantics><m:mrow><m:msubsup><m:mi>S</m:mi><m:mi>k</m:mi><m:mi>f</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGtbWudaqhaaWcbaGaem4AaSgabaGaemOzaygaaaaa@30BC@</m:annotation></m:semantics></m:math> and <m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2105-6-237-i8"><m:semantics><m:mrow><m:msubsup><m:mi>S</m:mi><m:mi>k</m:mi><m:mi>g</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGtbWudaqhaaWcbaGaem4AaSgabaGaemOzaygaaaaa@30BC@</m:annotation></m:semantics></m:math> along all positions <it>k </it>in a given test DNA sequence. If the weight matrices are highly similar we expect positive peaks at nearly the same positions, i.e. a prediction of nearly the same set of binding sites. In order to quantify the similarity of both matrices we calculate the Pearson correlation coefficient of the two score vectors along a test sequence. Here we also consider all possible relative shifts between two PWMs (with a minimum overlap of 6 bases) and then take the maximum correlation coefficient as the similarity measure <it>C </it>of the two matrices. We have found that the correlation coefficients do not depend strongly on the value of the pseudo-counts and reflect mainly the relevant rare peaks.</p>
            <p>In this paper we take as the test DNA sequence a random sequence with equidistributed bases. For specific applications it might be appropriate to use other test sequences such as upstream regions of the genes of interest.</p>
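            <p>The maximisation of the correlation over relative shifts can be sketched as follows. This is an illustration under stated assumptions rather than the implementation behind the reported values: the function names, the convention that a positive shift places the second matrix to the right of the first, and the handling of constant score vectors are choices made for the example.</p>

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long score vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant vector counts as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def max_shifted_correlation(scores_f, scores_g, len_f, len_g, min_overlap=6):
    """Return (C, shift): the largest correlation over all admissible shifts.

    scores_f, scores_g: score vectors of the two matrices along one sequence.
    len_f, len_g: widths of the two matrices.
    A shift is admissible if the matrices overlap in at least min_overlap columns.
    """
    best_c, best_shift = -1.0, None
    for shift in range(-len_g + 1, len_f):
        # number of columns shared when g starts `shift` positions after f
        overlap = min(len_f, shift + len_g) - max(0, shift)
        # pair S^f_k with S^g_(k + shift) wherever both scores exist
        k_start = max(0, -shift)
        k_end = min(len(scores_f), len(scores_g) - shift)
        if overlap >= min_overlap and k_end - k_start >= 2:
            c = pearson(scores_f[k_start:k_end],
                        scores_g[k_start + shift:k_end + shift])
            if c > best_c:
                best_c, best_shift = c, shift
    return best_c, best_shift
```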
         </sec>
         <sec>
            <st>
               <p>Sensitivity and specificity</p>
            </st>
            <p>Sensitivity and specificity of different methods for measuring similarities of profiles recognized by transcription factors were assessed as follows: since large sets of experimentally verified similar matrix pairs are not available, artificial sets were prepared. A representative initial matrix (either ATF or CREB) was resampled to construct a set of matrices. On average we sampled the initial matrix 18 times (which corresponds to the median sample size of Transfac matrices). In order to study varying sample sizes, the number of samples for each generated matrix was randomly chosen from the range 13 to 21. All the matrices generated this way should be classified as similar to each other. A set with matrices dissimilar to each other was prepared by random shuffling of the contents of the initial matrix. The nucleotide counts at each position were randomly reordered, as was the order of the positions. Additionally, we take into account different lengths of the matrices. Both sets were extended with random columns and the number of added columns was chosen randomly from zero to half of the length of the initial matrix. In the analysis, sensitivity was defined as the fraction of resampled matrices which were correctly identified as similar matrices. Specificity was defined as the fraction of random matrices which were identified as dissimilar. Six methods quantifying similarity of profiles were compared. The <it>D </it>(chi2th) and <it>C </it>(corr) functions were calculated as introduced above. Another score was defined as the sum of the <it>&#967;</it><sup>2 </sup>values obtained for each pair of compared columns (chi2sum). Three other methods (introduced in <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B17">17</abbr><abbr bid="B20">20</abbr></abbrgrp>) calculate a total sum over all compared columns of: Euclidean distance (ned), column-column correlation (ccc) and scalar product of columns (sp).</p>
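            <p>The construction of the two artificial sets can be sketched as follows. This is a simplified illustration, not the procedure used to produce the reported curves: the function names are hypothetical, each column is resampled independently from its own base frequencies (an assumption of this sketch), and the extension with random columns is omitted.</p>

```python
import random

BASES = "ACGT"


def resample_matrix(pfm, rng, n_min=13, n_max=21):
    """Draw a new count matrix from the column frequencies of pfm.

    One sample size is drawn per generated matrix from n_min..n_max.
    Assumption of this sketch: columns are resampled independently.
    """
    n = rng.randint(n_min, n_max)
    resampled = []
    for column in pfm:
        weights = [column[b] for b in BASES]
        draws = rng.choices(BASES, weights=weights, k=n)
        resampled.append({b: draws.count(b) for b in BASES})
    return resampled


def shuffle_matrix(pfm, rng):
    """Reorder the counts within each column and the order of the columns."""
    shuffled = []
    for column in pfm:
        counts = [column[b] for b in BASES]
        rng.shuffle(counts)
        shuffled.append(dict(zip(BASES, counts)))
    rng.shuffle(shuffled)
    return shuffled
```

            <p>Sensitivity is then the fraction of matrices from the first function that a given measure calls similar to the initial matrix, and specificity the fraction of matrices from the second function that it calls dissimilar.</p>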
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <p>In this paper two similarity measures of matrices are studied. The first quantifies for a given pair of matrices the number of significantly different columns <it>D</it>. The other represents the correlation <it>C </it>of the binding-site scores along a DNA sequence for each of the given matrices.</p>
         <sec>
            <st>
               <p>Comparison of both similarity measures</p>
            </st>
            <p>For the Transfac library we analyze whether the pairs of matrices with small distances <it>D </it>and high correlation coefficients <it>C </it>coincide, i.e. for which matrices the two measures give consistent results. Fig. <figr fid="F3">3</figr> shows histograms of correlation coefficients <it>C </it>for matrices with distances <it>D </it>= 0, 1, 2. It turns out that there are many pairs of matrices with <it>D </it>= 0 and large values of <it>C </it>(see the right peak in the upper panel of Fig. <figr fid="F3">3</figr>). For such matrices the differences between their columns are negligible and predicted binding sites are essentially identical.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Combinations of both measures: Histograms of the correlation <it>C </it>of the score vectors obtained for different values of the distance <it>D </it>(number of significantly different columns according to the <it>&#967;</it><sup>2 </sup>test)</p>
               </caption>
               <text>
                  <p>Combinations of both measures: Histograms of the correlation <it>C </it>of the score vectors obtained for different values of the distance <it>D </it>(number of significantly different columns according to the <it>&#967;</it><sup>2 </sup>test). These data have been calculated for the Transfac matrices.</p>
               </text>
               <graphic file="1471-2105-6-237-3"/>
            </fig>
            <p>There are, however, also many pairs of matrices with <it>D </it>= 0 and relatively small correlation coefficients <it>C </it>(see the left peak in the upper panel of Fig. <figr fid="F3">3</figr>). These pairs refer mainly to matrices with a low information content and/or small sample size. In such cases the differences between columns are not statistically significant (many Ns in both consensus sequences) but their scores along a test DNA sequence correlate only weakly. For example, matrices V$STAT4_01 and V$MEF2_01 (see Transfac) are characterised by sample sizes <it>N </it>= 6, <it>N </it>= 5 respectively and have a distance <it>D </it>= 0 but a correlation <it>C </it>= 0.20.</p>
            <p>There are also cases with a high correlation coefficient but with a distance <it>D </it>> 2. Such a situation appears for large matrices for which only a part is informative. For example, matrices V$GR_01 and V$PR_01 (see Transfac) have a length of 27, but only six positions constitute the core sequence (TGTTCT). Among the other positions three are significantly different, leading to a distance <it>D </it>= 3, but these differences affect the correlation <it>C </it>only weakly (<it>C </it>= 0.92).</p>
            <p>Several alternative measures have been proposed. We assessed the sensitivity and the specificity of these measures, as described in Methods. The results of the comparison are presented in the supplemental Fig. <figr fid="F4">4</figr>. Both our correlation measure and the column-to-column similarity give (for an appropriate threshold) a high specificity and sensitivity. However, in some cases, as illustrated above, adding a second criterion is useful to discard pairs involving large matrices for which only a part is informative. The <it>D </it>measure defined here can be used for this purpose. The two measures introduced here quantify different properties and complement each other. Although other measures could have been chosen, the advantage of using the correlation <it>C </it>is its implicit normalisation (the results do not depend much on the length and the sample size of the matrices) and the advantage of the distance <it>D </it>is its easy interpretation (number of different columns). Therefore, in the following, we focus on the most similar matrices based on the distance <it>D </it>and correlation <it>C </it>measures.</p>
         </sec>
         <sec>
            <st>
               <p>Clusters of similar matrices</p>
            </st>
            <p>Here we study the matrices of both Jaspar and Transfac databases. We consider pairs of matrices for which <it>D </it>&#8804; 1 and <it>C </it>&#8805; 0.8 as highly similar. These stringent thresholds were chosen to identify the most obvious similarities and they imply that the matrices are almost indistinguishable from a statistical point of view and that their scores along DNA sequences are strongly correlated. We verified that for all these pairs of matrices both similarity measures select the same relative shift of the corresponding matrices.</p>
            <p>Fig. <figr fid="F4">4</figr> shows an overview of all such matrices. Even though details of these clusters are only readable in the supplementary material (Fig. <figr fid="F1">1</figr>) the graph reveals interesting properties: The connecting lines visualizing high similarity join Jaspar matrices (ellipses) with Transfac matrices (boxes) in many cases. Consequently, our technique allows an automatic "alignment" of these collections of matrices. This is not a trivial task since the naming conventions used in the databases are different, and thus finding matrices corresponding to each other requires expert knowledge. We find that 84 matrices from Jaspar have counterparts in Transfac with <it>D </it>&#8804; 1 and <it>C </it>&#8805; 0.8. Another 16 matrices have somewhat smaller similarities <it>D </it>&#8804; 3 and <it>C </it>&#8805; 0.6. Only the Jaspar matrices P_HMG-1, P_HMG-IY and V_Ghlf have no obvious "partners" in Transfac. A complete list of Transfac-Jaspar matrix pairs with high similarities is provided in the supplementary material (Tab. <tblr tid="T1">1</tblr>). Lists for other thresholds or other sets of matrices can be calculated through our web interface <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Graph showing similar matrices: Transfac matrices are indicated in white boxes, Jaspar matrices are indicated in gray ellipses</p>
               </caption>
               <text>
                  <p>Graph showing similar matrices: Transfac matrices are indicated in white boxes, Jaspar matrices are indicated in gray ellipses. An edge is drawn between two matrices when <it>D </it>&#8804; 1 and <it>C </it>&#8805; 0.8. An enlarged version of this figure is available in the supplementary material (Fig. S1).</p>
               </text>
               <graphic file="1471-2105-6-237-4"/>
            </fig>
            <p>In addition to the edges between Transfac and Jaspar matrices there are many clusters containing multiple Transfac or Jaspar matrices. These clusters reflect pronounced similarities in the matrix collections. There are, for example, matrices of the same transcription factor with different degrees of stringency (see for instance AP1 matrices). Moreover, different transcription factors of certain families have almost identical binding motifs (see for example Myc-Max, USF and ARNT). A complete list of all clusters is provided in the supplementary material (Tab. S2). An interesting collection of structural classes of transcription factors has been compiled recently by Sandelin and Wasserman <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. Consistent with their results we also find clusters of the ETS family (see cluster 2 in Tab. S2, also enlarged in Fig. <figr fid="F5">5b</figr>), bHLH transcription factors (cluster 15), and REL family (cluster 5).</p>
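            <p>One way to obtain clusters from pairwise results is to draw an edge for every pair that passes both thresholds and to collect the connected components of the resulting graph. The sketch below assumes this reading of "cluster"; the function name, the input format and the matrix names in the usage example are hypothetical and do not reproduce any values from this study.</p>

```python
def similarity_clusters(pairs, max_d=1, min_c=0.8):
    """Group matrices into connected components of the similarity graph.

    pairs: iterable of (name_a, name_b, D, C) tuples.
    An edge is kept when D is at most max_d and C is at least min_c.
    Returns a list of clusters, each a sorted list of matrix names.
    """
    neighbours = {}
    for a, b, d, c in pairs:
        if max_d >= d and c >= min_c:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)

    clusters, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        # depth-first traversal collecting everything reachable from start
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Usage with made-up names and values (for illustration only)
example_pairs = [
    ("M1", "M2", 0, 0.95),
    ("M2", "M3", 1, 0.85),
    ("M3", "M4", 3, 0.90),  # rejected: too many differing columns
    ("M5", "M6", 0, 0.50),  # rejected: correlation too low
]
example_clusters = similarity_clusters(example_pairs)
```

            <p>Matrices that pass the thresholds with no partner do not appear in the output, so singletons are implicitly treated as their own representatives.</p>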
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Clusters of similar matrices: Transcription factor families (a) GATA and (b) ETS</p>
               </caption>
               <text>
                  <p>Clusters of similar matrices: Transcription factor families (a) GATA and (b) ETS.</p>
               </text>
               <graphic file="1471-2105-6-237-5"/>
            </fig>
            <p>In Fig. <figr fid="F5">5</figr> we present enlargements of two selected clusters representing the GATA (panel a) and ETS (panel b) transcription factor families. The high similarity of these matrices cannot be directly noticed by inspection of names or consensus sequences. Furthermore, subgroups might be detected using our statistical approach. For example, the GATA cluster reveals that the Jaspar matrix has particularly high similarity to the Transfac entries GATA1_02, GATA3_01 and GATA6_01, but lower similarity to other members of the GATA class. The clusters visualized in Fig. <figr fid="F4">4</figr> and Fig. <figr fid="F5">5</figr> can be exploited to reduce the number of matrices. Highly similar matrices match a DNA sequence either both or not at all. Therefore, one could construct "consensus matrices" as in <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> or one might select representative matrices in each cluster. In this way the number of overlapping predictions in the search for transcription factor binding sites can be decreased <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>Mapping of novel matrices to databases</p>
            </st>
            <p>A careful inspection of the clusters found automatically by our similarity analysis might reveal unexpected similarities pointing to possible cross-talk between different signaling cascades at the level of transcriptional regulation. As an example we discuss the regulation of circadian clock genes and cell cycle control <abbrgrp><abbr bid="B30">30</abbr><abbr bid="B31">31</abbr></abbrgrp>. In both processes bHLH transcription factors bind as dimers to E-boxes. The corresponding Myc-Max cluster appeared already in Fig. <figr fid="F4">4</figr> (the largest cluster). In the mammalian circadian clock the CLOCK-BMAL1 dimer regulates clock genes such as <it>Per1</it>, <it>Per2</it>, <it>Per3</it>, <it>Cry1 </it>and <it>Cry2</it>. We found no matrix in Transfac or Jaspar describing explicitly the binding sites of CLOCK-BMAL1. Consequently, we constructed such matrices ourselves in two different ways. On the one hand, we collected 9 experimentally verified binding sites from 7 different clock genes <abbrgrp><abbr bid="B32">32</abbr><abbr bid="B33">33</abbr><abbr bid="B34">34</abbr><abbr bid="B35">35</abbr><abbr bid="B36">36</abbr></abbrgrp>. On the other hand, we took from a SELEX experiment 10 sequences with high affinities to the CLOCK-BMAL1 dimer <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>.</p>
            <p>Both matrices are visualized in Fig. <figr fid="F6">6a</figr>. Details of the matrix construction are given in the supplementary material (Tab. S3). Both matrices contain the E-box consensus motif CACGTG but differ in the flanking regions.</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Mapping of CLOCK-BMAL1 matrices: (a) CLOCK-BMAL1 matrices based on experimentally characterised binding sites of clock genes and from a SELEX study (see Tab. S3 of the supplementary material for the list of these binding sites)</p>
               </caption>
               <text>
                  <p>Mapping of CLOCK-BMAL1 matrices: (a) CLOCK-BMAL1 matrices based on experimentally characterised binding sites of clock genes and from a SELEX study (see Tab. S3 of the supplementary material for the list of these binding sites). (b) Mapping of CLOCK-BMAL1 matrices on E-box matrices. These matrices have been selected from the Transfac database and include MYC, MAX, ARNT, MYOD, USF, TAL1/E47 (see [35] for a review on E-box transcription factors). An edge is drawn when <it>D </it>&#8804; 1 and <it>C </it>&#8805; 0.8.</p>
               </text>
               <graphic file="1471-2105-6-237-6"/>
            </fig>
            <p>Fig. <figr fid="F6">6b</figr> shows that these novel matrices have highly similar counterparts in Transfac (NMYC, MYC, USF). Consequently, cross-talk of the circadian clock with cell cycle regulation and tumorigenesis can be expected at the level of transcriptional control. Indeed, the success of chronotherapies and recent detailed studies on cross-talk underline the interdependence of circadian rhythms and tumor growth <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>. Also in the process of liver regeneration a pronounced effect of the circadian clock on cell cycle control has been found <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. This example illustrates that a careful SELEX experiment combined with a mapping of the resulting matrix to known matrices can reveal possible functions of the corresponding transcription factor.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>Understanding gene regulation in higher eukaryotes is still challenging and current computational algorithms suffer from a large number of false positive predictions <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B40">40</abbr></abbrgrp>. In particular, mutually dependent position frequency matrices in databases such as Transfac or Jaspar lead to predictions of binding sites which overlap, which may be misinterpreted as a cluster of binding sites. Consequently, a careful pre-selection of matrices is essential. On the one hand, expert knowledge can be used to select a subset of candidate matrices for the analysis of upstream regions. Such a selection is, however, subjective and novel combinations of transcription factor binding sites might be missed. On the other hand, for large scale computational studies, it is useful to have an automatic tool to detect similar matrices. Therefore, we introduce in this paper a method combining two independent similarity measures to compare position frequency matrices. This approach can be used to quantify the similarity of matrices, to map the entries of different databases, and to cluster matrices.</p>
         <p>The first similarity measure used in our approach is based on a <it>&#967;</it><sup>2 </sup>test. In contrast to earlier approaches based on normalized frequencies <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B17">17</abbr><abbr bid="B20">20</abbr></abbrgrp> we take into account the small sample size of many matrices. We count the number of significantly different matrix columns which defines the distance <it>D</it>. In this paper we focus on highly similar matrices with <it>D </it>&#8804; 1. In forthcoming studies the <it>&#967;</it><sup>2 </sup>measure might be taken directly to calculate distances of matrices in more detail.</p>
         <p>The second measure is related to the primary application of position weight matrices &#8211; the prediction of binding sites in uncharacterized DNA sequences. We calculate for two matrices of interest the scores along a test DNA sequence and derive the Pearson correlation coefficient <it>C </it>of these vectors. Thus large values of <it>C </it>indicate that both matrices predict essentially the same binding sites. In this paper we take a 10000 bp long random sequence with equiprobable and independent bases as the test DNA sequence. However, the measure can be easily adapted also to other test sequences such as sets of promoter regions.</p>
         <p>Our combined similarity measure was first used to map the Jaspar matrices to the Transfac database automatically. Then, requiring rather strong similarity (<it>D </it>&#8804; 1, <it>C </it>&#8805; 0.8) we identified similar matrices present in these databases and constructed clusters of almost indistinguishable matrices. By choosing only one representative matrix for each cluster it is possible to construct smaller sets of matrices as input of binding site prediction algorithms. Consequently, this approach decreases the number of overlapping binding site predictions. Moreover, such a reduced set constitutes a better input for methods predicting close occurrences of different binding sites (e.g. <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>). In order to eliminate false signals further, approaches such as phylogenetic footprinting <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>, transcriptional profiling <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>, ChIP on chip experiments <abbrgrp><abbr bid="B41">41</abbr><abbr bid="B42">42</abbr></abbrgrp> or modeling cis-regulatory modules need to be combined with a preselection of independent matrices. Our combined technique can be used to predict cross-talk at the level of transcriptional control. As an illustration we discuss the cluster of E-box binding bHLH transcription factors. Since circadian clock genes are regulated by a binding site quite similar to the Myc-Max motif, a strong interdependence of circadian regulation and cell cycle control is expected and has indeed been known empirically for decades in connection with chronotherapies or liver regeneration.</p>
         <p>Finally we use the similarity measures to assign newly derived matrices to known factors. To illustrate this application, we map an E-box matrix obtained from SELEX experiments with the CLOCK-BMAL1 dimer to the Myc-Max cluster. Thus the possible function of poorly characterized transcription factors can be predicted using affinity measurements combined with a comparison of the resulting matrix to database matrices.</p>
      </sec>
      <sec>
         <st>
            <p>Availability</p>
         </st>
         <p>The method is available through a web interface at <url>http://wmcompare.gene-groups.net/</url>.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>SK, DG and HH designed the study. SK and DG were involved in programming and SK set up the web interface. SK, DG and HH interpreted the results and drafted the manuscript. All authors read and approved the final manuscript.</p>
         <suppl id="S1">
            <title>
               <p>Additional File 1</p>
            </title>
            <text>
               <p>Correspondence between Jaspar and Transfac matrices: For each Jaspar matrix similar (<it>D </it>&#8804; 1 and <it>C </it>&#8805; 0.8) Transfac matrices are listed. 84 Jaspar matrices have at least one corresponding Transfac matrix.</p>
            </text>
            <file name="1471-2105-6-237-S1.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S2">
            <title>
               <p>Additional File 2</p>
            </title>
            <text>
               <p>Clusters of similar (<it>D </it>&#8804; 1 and <it>C </it>&#8805; 0.8) Jaspar and Transfac matrices.</p>
            </text>
            <file name="1471-2105-6-237-S2.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S3">
            <title>
               <p>Additional File 3</p>
            </title>
            <text>
               <p>Binding sites for Clock-Bmal1: Experimentally characterized binding sites for Clock-Bmal1 in clock genes and in selected sequences (SELEX experiment).</p>
            </text>
            <file name="1471-2105-6-237-S3.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S4">
            <title>
               <p>Additional File 4</p>
            </title>
            <text>
               <p>Comparison of different measures: specificity and sensitivity are determined as described in the "Methods" section of the paper for various thresholds of the different similarity measures. Specificity is defined as the fraction of resampled matrices (TP on y-axis) found as similar. Sensitivity is defined as the fraction of randomized matrices (TP on x-axis) found as dissimilar. Curves: "corr": correlation of scores along a DNA sequence, i.e. our score <it>C </it>(thresholds = 0.99, 0.95, 0.9, 0.8, 0.7...); "chi2th": our chi2 measure <it>D </it>(thresholds = 0, 1..8); "chi2sum": sum of column chi2 distances; "ned": normalized Euclidean distance; "ccc": column-column correlation; "sp": column scalar product.</p>
            </text>
            <file name="1471-2105-6-237-S4.eps">
               <p>Click here for file</p>
            </file>
         </suppl>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>The authors thank N. Bl&#252;thgen, M. Swat and M. Futschik for discussions and critical reading of the manuscript. SzMK is supported by the German Federal Ministry of Education and Research (BMBF) and the German Research Foundation (DFG). DG is Charg&#233; de Recherches du Fonds National Belge de la Recherche Scientifique.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Applied bioinformatics for the identification of regulatory elements</p>
            </title>
            <aug>
               <au>
                  <snm>Wasserman</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Sandelin</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Nat Rev Genet</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <issue>4</issue>
            <fpage>276</fpage>
            <lpage>87</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15131651</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>DNA binding sites: representation and discovery</p>
            </title>
            <aug>
               <au>
                  <snm>Stormo</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2000</pubdate>
            <volume>16</volume>
            <fpage>16</fpage>
            <lpage>23</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10812473</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>TRANSFAC: transcriptional regulation, from patterns to profiles</p>
            </title>
            <aug>
               <au>
                  <snm>Matys</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Fricke</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Geffers</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Gossling</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Haubrock</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Hehl</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Hornischer</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Karas</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Kel</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Kel-Margoulis</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Kloos</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Land</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Lewicki-Potapov</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Michael</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Munch</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Reuter</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Rotert</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Saxel</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Scheer</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Thiele</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Wingender</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <fpage>374</fpage>
            <lpage>8</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">165555</pubid>
                  <pubid idtype="pmpid" link="fulltext">12520026</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>JASPAR: an open-access database for eukaryotic transcription factor binding profiles</p>
            </title>
            <aug>
               <au>
                  <snm>Sandelin</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Alkema</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Engstrom</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Wasserman</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Lenhard</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <issue>Database</issue>
            <fpage>D91</fpage>
            <lpage>4</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">308747</pubid>
                  <pubid idtype="pmpid" link="fulltext">14681366</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data</p>
            </title>
            <aug>
               <au>
                  <snm>Quandt</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Frech</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Karas</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Wingender</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Werner</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1995</pubdate>
            <volume>23</volume>
            <issue>23</issue>
            <fpage>4878</fpage>
            <lpage>84</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">307478</pubid>
                  <pubid idtype="pmpid">8532532</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>MATCH: A tool for searching transcription factor binding sites in DNA sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Kel</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Gossling</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Reuter</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Cheremushkin</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Kel-Margoulis</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Wingender</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <issue>13</issue>
            <fpage>3576</fpage>
            <lpage>9</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">169193</pubid>
                  <pubid idtype="pmpid" link="fulltext">12824369</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Detection of functional DNA motifs via statistical over-representation</p>
            </title>
            <aug>
               <au>
                  <snm>Frith</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Fu</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Yu</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Hansen</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Weng</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <issue>4</issue>
            <fpage>1372</fpage>
            <lpage>81</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">390287</pubid>
                  <pubid idtype="pmpid" link="fulltext">14988425</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Computer modeling of promoter organization as a tool to study transcriptional coregulation</p>
            </title>
            <aug>
               <au>
                  <snm>Werner</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Fessele</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Maier</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Nelson</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>FASEB J</source>
            <pubdate>2003</pubdate>
            <volume>17</volume>
            <issue>10</issue>
            <fpage>1228</fpage>
            <lpage>37</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12832287</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Stochastic segment models of eukaryotic promoter regions</p>
            </title>
            <aug>
               <au>
                  <snm>Ohler</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Stemmer</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Harbeck</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Niemann</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Pac Symp Biocomput</source>
            <pubdate>2000</pubdate>
            <fpage>380</fpage>
            <lpage>91</lpage>
            <xrefbib>
               <pubid idtype="pmpid">10902186</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Computational identification of promoters and first exons in the human genome</p>
            </title>
            <aug>
               <au>
                  <snm>Davuluri</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Grosse</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2001</pubdate>
            <volume>29</volume>
            <issue>4</issue>
            <fpage>412</fpage>
            <lpage>7</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11726928</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Human-mouse genome comparisons to locate regulatory sites</p>
            </title>
            <aug>
               <au>
                  <snm>Wasserman</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Palumbo</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Thompson</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Fickett</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Lawrence</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2000</pubdate>
            <volume>26</volume>
            <issue>2</issue>
            <fpage>225</fpage>
            <lpage>8</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11017083</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Annotating regulatory DNA based on man-mouse genomic comparison</p>
            </title>
            <aug>
               <au>
                  <snm>Dieterich</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Cusack</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Rateitschak</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Krause</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Vingron</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <issue>Suppl 2</issue>
            <fpage>S84</fpage>
            <lpage>90</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12385988</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Comparative genomics: genome-wide analysis in metazoan eukaryotes</p>
            </title>
            <aug>
               <au>
                  <snm>Ureta-Vidal</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Ettwiller</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Birney</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Nat Rev Genet</source>
            <pubdate>2003</pubdate>
            <volume>4</volume>
            <issue>4</issue>
            <fpage>251</fpage>
            <lpage>62</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12671656</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Regulatory element detection using correlation with expression</p>
            </title>
            <aug>
               <au>
                  <snm>Bussemaker</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Siggia</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2001</pubdate>
            <volume>27</volume>
            <issue>2</issue>
            <fpage>167</fpage>
            <lpage>71</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11175784</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae</p>
            </title>
            <aug>
               <au>
                  <snm>Hughes</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Estep</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Tavazoie</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Church</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2000</pubdate>
            <volume>296</volume>
            <issue>5</issue>
            <fpage>1205</fpage>
            <lpage>14</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10698627</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Cluster-Buster: Finding dense clusters of motifs in DNA sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Frith</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Weng</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <issue>13</issue>
            <fpage>3666</fpage>
            <lpage>8</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">168947</pubid>
                  <pubid idtype="pmpid" link="fulltext">12824389</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics</p>
            </title>
            <aug>
               <au>
                  <snm>Sandelin</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Wasserman</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2004</pubdate>
            <volume>338</volume>
            <issue>2</issue>
            <fpage>207</fpage>
            <lpage>15</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15066426</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>CARRIE web service: automated transcriptional regulatory network inference and interactive analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Haverty</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Frith</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Weng</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <issue>Web Server</issue>
            <fpage>W213</fpage>
            <lpage>6</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">441540</pubid>
                  <pubid idtype="pmpid" link="fulltext">15215383</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Predicting transcription factor synergism</p>
            </title>
            <aug>
               <au>
                  <snm>Hannenhalli</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Levy</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <issue>19</issue>
            <fpage>4278</fpage>
            <lpage>84</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">140535</pubid>
                  <pubid idtype="pmpid" link="fulltext">12364607</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Searching databases of conserved sequence regions by aligning protein multiple-alignments</p>
            </title>
            <aug>
               <au>
                  <snm>Pietrokovski</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1996</pubdate>
            <volume>24</volume>
            <issue>19</issue>
            <fpage>3836</fpage>
            <lpage>45</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">146152</pubid>
                  <pubid idtype="pmpid" link="fulltext">8871566</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Similarity of position frequency matrices for transcription factor binding sites</p>
            </title>
            <aug>
               <au>
                  <snm>Schones</snm>
                  <fnm>DE</fnm>
               </au>
               <au>
                  <snm>Sumazin</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>MQ</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>3</issue>
            <fpage>307</fpage>
            <lpage>313</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15319260</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Information content of binding sites on nucleotide sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Schneider</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Stormo</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Gold</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Ehrenfeucht</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1986</pubdate>
            <volume>188</volume>
            <issue>3</issue>
            <fpage>415</fpage>
            <lpage>31</lpage>
            <xrefbib>
               <pubid idtype="pmpid">3525846</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>A non-parametric model for transcription factor binding sites</p>
            </title>
            <aug>
               <au>
                  <snm>King</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Roth</snm>
                  <fnm>F</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <issue>19</issue>
            <fpage>e116</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">206482</pubid>
                  <pubid idtype="pmpid" link="fulltext">14500844</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Modeling within-motif dependence for transcription factor binding site predictions</p>
            </title>
            <aug>
               <au>
                  <snm>Zhou</snm>
                  <fnm>Q</fnm>
               </au>
               <au>
                  <snm>Liu</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <issue>6</issue>
            <fpage>909</fpage>
            <lpage>16</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">14751969</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Modeling dependencies in protein-DNA binding sites</p>
            </title>
            <aug>
               <au>
                  <snm>Barash</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Elidan</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Friedman</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Kaplan</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Proceedings of the seventh annual international conference on Computational molecular biology</source>
            <publisher>ACM Press New York, NY, USA</publisher>
            <pubdate>2003</pubdate>
            <fpage>28</fpage>
            <lpage>37</lpage>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Additivity in protein-DNA interactions: how good an approximation is it?</p>
            </title>
            <aug>
               <au>
                  <snm>Benos</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Bulyk</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Stormo</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <issue>20</issue>
            <fpage>4442</fpage>
            <lpage>51</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">137142</pubid>
                  <pubid idtype="pmpid" link="fulltext">12384591</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Identifying DNA and protein patterns with statistically significant alignments of multiple sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Hertz</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Stormo</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>1999</pubdate>
            <volume>15</volume>
            <issue>7&#8211;8</issue>
            <fpage>563</fpage>
            <lpage>77</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10487864</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Quantitative discrimination of MEF2 sites</p>
            </title>
            <aug>
               <au>
                  <snm>Fickett</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Mol Cell Biol</source>
            <pubdate>1996</pubdate>
            <volume>16</volume>
            <fpage>437</fpage>
            <lpage>41</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">231020</pubid>
                  <pubid idtype="pmpid" link="fulltext">8524326</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>wmCompare</p>
            </title>
            <url>http://wmcompare.gene-groups.net/</url>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Timing the cell cycle</p>
            </title>
            <aug>
               <au>
                  <snm>Cardone</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Sassone-Corsi</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Nat Cell Biol</source>
            <pubdate>2003</pubdate>
            <volume>5</volume>
            <issue>10</issue>
            <fpage>859</fpage>
            <lpage>61</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">14523398</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Control mechanism of the circadian clock for timing of cell division in vivo</p>
            </title>
            <aug>
               <au>
                  <snm>Matsuo</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Yamaguchi</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Mitsui</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Emi</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Shimoda</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Okamura</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2003</pubdate>
            <volume>302</volume>
            <issue>5643</issue>
            <fpage>255</fpage>
            <lpage>9</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12934012</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Closing the circadian loop: CLOCK-induced transcription of its own inhibitors per and tim</p>
            </title>
            <aug>
               <au>
                  <snm>Darlington</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Wager-Smith</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Ceriani</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Staknis</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Gekakis</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Steeves</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Weitz</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Takahashi</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Kay</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1998</pubdate>
            <volume>280</volume>
            <issue>5369</issue>
            <fpage>1599</fpage>
            <lpage>603</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9616122</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Role of the CLOCK protein in the mammalian circadian mechanism</p>
            </title>
            <aug>
               <au>
                  <snm>Gekakis</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Staknis</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Nguyen</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Davis</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Wilsbacher</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>King</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Takahashi</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Weitz</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1998</pubdate>
            <volume>280</volume>
            <issue>5369</issue>
            <fpage>1564</fpage>
            <lpage>9</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9616112</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B34">
            <title>
               <p>The rat arylalkylamine N-acetyltransferase E-box: differential use in a master vs. a slave oscillator</p>
            </title>
            <aug>
               <au>
                  <snm>Chen</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Baler</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Brain Res Mol Brain Res</source>
            <pubdate>2000</pubdate>
            <volume>81</volume>
            <issue>1&#8211;2</issue>
            <fpage>43</fpage>
            <lpage>50</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11000477</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Circadian Transcription. Thinking outside the E-Box</p>
            </title>
            <aug>
               <au>
                  <snm>Munoz</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Brewer</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Baler</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>J Biol Chem</source>
            <pubdate>2002</pubdate>
            <volume>277</volume>
            <issue>39</issue>
            <fpage>36009</fpage>
            <lpage>17</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12130638</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>E-box function in a period gene repressed by light</p>
            </title>
            <aug>
               <au>
                  <snm>Vallone</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Gondi</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Whitmore</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Foulkes</snm>
                  <fnm>N</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2004</pubdate>
            <volume>101</volume>
            <issue>12</issue>
            <fpage>4106</fpage>
            <lpage>11</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">384702</pubid>
                  <pubid idtype="pmpid" link="fulltext">15024110</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>The basic-helix-loop-helix-PAS orphan MOP3 forms transcriptionally active complexes with circadian and hypoxia factors</p>
            </title>
            <aug>
               <au>
                  <snm>Hogenesch</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Gu</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Jain</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Bradfield</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>1998</pubdate>
            <volume>95</volume>
            <issue>10</issue>
            <fpage>5474</fpage>
            <lpage>9</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">20401</pubid>
                  <pubid idtype="pmpid" link="fulltext">9576906</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B38">
            <title>
               <p>Host circadian clock as a control point in tumor progression</p>
            </title>
            <aug>
               <au>
                  <snm>Filipski</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>King</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Granda</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Mormont</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Liu</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Claustrat</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Hastings</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Levi</snm>
                  <fnm>F</fnm>
               </au>
            </aug>
            <source>J Natl Cancer Inst</source>
            <pubdate>2002</pubdate>
            <volume>94</volume>
            <issue>9</issue>
            <fpage>690</fpage>
            <lpage>7</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11983758</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B39">
            <title>
               <p>Circadian rhythms. Liver regeneration clocks on</p>
            </title>
            <aug>
               <au>
                  <snm>Schibler</snm>
                  <fnm>U</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2003</pubdate>
            <volume>302</volume>
            <issue>5643</issue>
            <fpage>234</fpage>
            <lpage>5</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">14551421</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B40">
            <title>
               <p>Prediction of Cis-Regulatory Elements of Coregulated Genes</p>
            </title>
            <aug>
               <au>
                  <snm>Kielbasa</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Bl&#252;thgen</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Sers</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Sch&#228;fer</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Herzel</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Genome Informatics</source>
            <pubdate>2004</pubdate>
            <volume>15</volume>
            <fpage>117</fpage>
            <lpage>124</lpage>
         </bibl>
         <bibl id="B41">
            <title>
               <p>Genome-wide location and function of DNA binding proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Ren</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Robert</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Wyrick</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Aparicio</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Jennings</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Simon</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Zeitlinger</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Schreiber</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Hannett</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Kanin</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Volkert</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Wilson</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Bell</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Young</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2000</pubdate>
            <volume>290</volume>
            <issue>5500</issue>
            <fpage>2306</fpage>
            <lpage>9</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11125145</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B42">
            <title>
               <p>Distribution of NF-kappaB-binding sites across human chromosome 22</p>
            </title>
            <aug>
               <au>
                  <snm>Martone</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Euskirchen</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Bertone</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Hartman</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Royce</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Luscombe</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Rinn</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Nelson</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Gerstein</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Weissman</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Snyder</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2003</pubdate>
            <volume>100</volume>
            <issue>21</issue>
            <fpage>12247</fpage>
            <lpage>52</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">218744</pubid>
                  <pubid idtype="pmpid" link="fulltext">14527995</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
