<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-4-38</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Software</dochead>
      <bibl>
         <title>
            <p>Genome-wide prediction, display and refinement of binding sites with information theory-based models</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Gadiraju</snm>
               <fnm>Sashidhar</fnm>
               <insr iid="I1"/>
               <insr iid="I3"/>
               <email>sashidharg@yahoo.com</email>
            </au>
            <au id="A2">
               <snm>Vyhlidal</snm>
               <mi>A</mi>
               <fnm>Carrie</fnm>
               <insr iid="I2"/>
               <email>cvyhlidal@cmh.edu</email>
            </au>
            <au id="A3">
               <snm>Leeder</snm>
               <fnm>J Steven</fnm>
               <insr iid="I2"/>
               <email>sleeder@cmh.edu</email>
            </au>
            <au id="A4" ca="yes">
               <snm>Rogan</snm>
               <mi>K</mi>
               <fnm>Peter</fnm>
               <insr iid="I1"/>
               <insr iid="I3"/>
               <email>progan@cmh.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Laboratory of Human Molecular Genetics, Children's Mercy Hospital and Clinics, School of Medicine</p>
            </ins>
            <ins id="I2">
               <p>Section of Developmental and Experimental Pharmacology and Therapeutics, Children's Mercy Hospital and Clinics, School of Medicine</p>
            </ins>
            <ins id="I3">
               <p>School of Interdisciplinary Computer Science and Engineering, University of Missouri-Kansas City, Kansas City MO 64108 USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2003</pubdate>
         <volume>4</volume>
         <issue>1</issue>
         <fpage>38</fpage>
         <url>http://www.biomedcentral.com/1471-2105/4/38</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">12962546</pubid>
               <pubid idtype="doi">10.1186/1471-2105-4-38</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>07</day>
               <month>5</month>
               <year>2003</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>08</day>
               <month>9</month>
               <year>2003</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>08</day>
               <month>9</month>
               <year>2003</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2003</year>
         <collab>Gadiraju et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.</collab>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>We present <it>Delila-Genome</it>, a software system for identification, visualization and analysis of protein binding sites in complete genome sequences. Binding sites are predicted by scanning genomic sequences with information theory-based (or user-defined) weight matrices. Matrices are refined by adding experimentally-defined binding sites to published binding sites. <it>Delila-Genome </it>was used to examine the accuracy of individual information contents of binding sites detected with refined matrices as a measure of the strengths of the corresponding protein-nucleic acid interactions. The software can then be used to predict novel sites by rescanning the genome with the refined matrices.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Parameters for genome scans are entered using a Java-based GUI and backend scripts in Perl. Multi-processor CPU load-sharing minimized the average response time for scans of different chromosomes. Scans of human genome assemblies required 4&#8211;6 hours for transcription factor binding sites and 10&#8211;19 hours for splice sites, respectively, on 24- and 3-node Mosix and Beowulf clusters. Individual binding sites are displayed either as high-resolution sequence walkers or in low-resolution custom tracks in the UCSC genome browser. For large datasets, we applied a data reduction strategy that limited displays of binding sites exceeding a threshold information content to specific chromosomal regions within or adjacent to genes. An HTML document is produced listing binding sites ranked by binding site strength or chromosomal location, hyperlinked to the UCSC custom track, other annotation databases and binding site sequences. Post-genome scan tools parse binding site annotations of selected chromosome intervals and compare the results of genome scans using different weight matrices. Comparisons of multiple genome scans can display binding sites that are unique to each scan and identify sites with significantly altered binding strengths.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusions</p>
               </st>
               <p><it>Delila-Genome </it>was used to scan the human genome sequence with information weight matrices of transcription factor binding sites, including PXR/RXR&#945;, AHR and NF-&#954;B p50/p65, and matrices for RNA binding sites including splice donor, acceptor, and SC35 recognition sites. Comparisons of genome scans with the original and refined PXR/RXR&#945; information weight matrices indicate that the refined model more accurately predicts the strengths of known binding sites and is more sensitive for detection of novel binding sites.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="bmc" subtype="user_supplied_xml" id="endnote"/>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>We describe a system to identify and display significant non-coding genomic sequences that are important for transcriptional regulation and post-transcriptional mRNA processing. Our system builds on <it>Delila </it><abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, a series of programs designed to scan sets of sequence fragments (or small genomes, i.e., bacterial genomes) for potential binding sites. The regulatory sequences that are bound by proteins are detected by the tools provided with the <it>Delila </it>system, which defines binding sites according to Shannon information theory <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>.</p>
         <p>Information content is the number of binary choices needed to describe a sequence pattern and has units of bits <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. In the analysis of nucleic acid binding sites, functional site sequences are aligned and the frequencies of nucleotides at each position are used to calculate the individual information weight matrix, <it>R</it><sub><it>i</it></sub>(<it>b, l</it>) <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> of each base <it>b </it>at position <it>l</it>. Computation of binding site <it>R</it><sub><it>i</it></sub>(<it>b, l</it>) information weight matrices based upon published and laboratory-derived sites is a prerequisite to detecting and visualizing predicted binding sites with <it>Delila-Genome</it>. The procedures and software used to derive these matrices have been previously described <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr></abbrgrp> for different types of protein binding sites <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>. Such a matrix is used to scan the genome and evaluate the individual information content (<it>R</it><sub><it>i</it></sub>, in bits) of potential binding sites. Functional binding sites have values > 0 bits and the consensus sequence has the maximum <it>R</it><sub><it>i </it></sub>value. A single bit difference in <it>R</it><sub><it>i </it></sub>value corresponds to at least a two-fold difference in binding site strength. Changes in information content resulting from mutations correspond to observed phenotypes both in vitro and in vivo <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B11">11</abbr></abbrgrp>; by contrast, non-deleterious polymorphisms result in nominal changes in <it>R</it><sub><it>i </it></sub>values. Therefore, scans with information weight matrices can be used to measure the relative strengths of potential binding sites throughout the genome.</p>
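         <p>The relationship between aligned sites, the weight matrix and the <it>R</it><sub><it>i </it></sub>value of a single site can be illustrated with a minimal Python sketch. It assumes the general form of the weight, 2 + log<sub>2</sub> <it>f</it>(<it>b, l</it>) bits, and omits the small-sample correction used by the <it>Delila </it>programs. The pseudocount, the function names and the example sites are illustrative assumptions, not part of <it>Delila-Genome</it>.</p>

```python
import math

BASES = "ACGT"

def weight_matrix(sites, pseudocount=0.5):
    """Return one dict per position l, mapping base b to a weight in bits."""
    n = len(sites)
    matrix = []
    for l in range(len(sites[0])):
        column = [site[l] for site in sites]
        weights = {}
        for b in BASES:
            # The pseudocount is an assumption of this sketch; it avoids log2(0).
            f = (column.count(b) + pseudocount) / (n + 4 * pseudocount)
            # 2 bits is the maximum uncertainty for four equiprobable bases.
            # The small-sample correction is deliberately left out.
            weights[b] = 2.0 + math.log2(f)
        matrix.append(weights)
    return matrix

def individual_information(matrix, site):
    """Sum the weights of the bases in one site to obtain R_i in bits."""
    return sum(matrix[l][b] for l, b in enumerate(site))

# Hypothetical aligned sites, used only to exercise the functions.
sites = ["TGACCT", "TGACCC", "TGAACT", "TGACCT"]
matrix = weight_matrix(sites)
consensus = "".join(max(BASES, key=lambda b: col[b]) for col in matrix)
print(consensus, round(individual_information(matrix, consensus), 2))
```

         <p>In this sketch the consensus sequence receives the largest <it>R</it><sub><it>i </it></sub>value, and a sequence unrelated to the aligned sites receives a negative one.</p>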
         <p>Scans of eukaryotic genomes <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp> often require longer execution times and generate considerably larger outputs than prokaryotic genome scans <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> due to increased genome sizes and the quantities of sites detected. The development of <it>Delila-Genome </it>was motivated by the need to streamline the detection and display of sets of the most relevant binding sites in eukaryotic genomic or heterogeneous nuclear RNA sequences. Visual juxtaposition of these results with other genomic annotation facilitates the prediction and interpretation of binding sites. In order to limit the presentation of weak binding sites (with lower than average information content, i.e., &lt;&lt;R<sub>sequence</sub>), which can be densely distributed in both expressed and non-expressed genomic intervals, we developed visualization tools in <it>Delila-Genome </it>to mine relevant binding sites in gene-rich regions and to display clusters of sites with their respective information contents. Details of individual binding sites can also be presented at high resolution as sequence walkers <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, which depict contributions of each nucleotide to the overall information content of the site.</p>
         <p>The number and <it>R</it><sub><it>i </it></sub>values of the sites that define the information weight matrix, <it>R</it><sub><it>i</it></sub>(<it>b, l</it>), dictate which binding sites are predicted and the corresponding strengths of these sites found in genome scans. Models based on small numbers of proven binding sites may fail to detect valid binding sites and tend to predict <it>R</it><sub><it>i </it></sub>inaccurately. Iterative selection of functional binding sites has been used to optimize <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B16">16</abbr></abbrgrp> and to introduce bias <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> into the frequencies of each nucleotide in computing the information theory-based weight matrices of binding sites. Significant differences between information weight matrices have been determined from their respective evolutionary distance metrics (for example, see <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>). <it>Delila-Genome </it>monitors the effects of model alterations by comparing the genome scan results for pairs of information weight matrices. Although the primary application is to compare sets of binding sites with successive versions of the same weight matrix, other potential applications include determining the locations of overlapping binding sites recognized by different proteins and comparisons of binding sites detected with information models of orthologous proteins from different species.</p>
         <p><it>Delila-Genome </it>has been optimized to compute the locations of prospective transcription factor and splicing recognition sites by information theory-based analyses of recent human genome draft and finished sequences. We describe this software system, measure its performance, and illustrate the results of genome scans using visualization and post-genomic analytic tools which monitor the effects of matrix refinement on genome-wide identification of binding sites.</p>
      </sec>
      <sec>
         <st>
            <p>Implementation</p>
         </st>
         <p>The <it>Delila-Genome </it>system has a client-server architecture comprising three functional modules: (A) the <it>Delila-Genome </it>Front End, (B) the <it>Delila-Genome </it>Server and (C) Post-genomic scan analysis tools (Figure <figr fid="F1">1</figr>). The front end is a graphical interface that takes user input to set parameters for scanning the genome sequence and processing the results. It interacts with the system tools, and while it currently does not have a WWW interface like the UCSC genome browser, it is available as an installable module. The server is the actual engine of the system where all the tools are hosted and all the computations are performed. For multiprocessor servers, a load balancing feature has been written for the Scyld operating system (for Beowulf clusters) using the 'mpprun' utility. This feature is not supported in operating systems such as Mosix, where load balancing is done automatically based on CPU utilization. We now describe each of these modules and their respective interactions and dependencies.</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Architecture of the <it>Delila-Genome </it>system</p>
            </caption>
            <text>
               <p><b>Architecture of the <it>Delila-Genome </it>system. </b>Server programs are shown on the right side of the schema and client programs are shown on the left side. A Java-based GUI application (<it>Delgenfront</it>) runs on a desktop client and prompts for entry of a series of parameters (server, results directory, genome draft, email address) and the location of a ribl file or entry of a weight matrix. These data are sent to a Linux server which runs the <it>scan </it>and <it>promotsite </it>programs to display predicted binding sites. The <it>scan </it>and <it>promotsite </it>jobs may be submitted individually or sequentially. Since <it>scan </it>operates on <it>Delila </it>books, scripts have been provided to automate downloading the genome drafts from UCSC and building <it>Delila </it>books from them (documented in the package: Readme.txt). The <it>genvis </it>program uses the results of previous chromosome or genome analyses with <it>scan </it>and <it>promotsite </it>to generate BED and HTML files of predicted binding sites within a user-defined genomic interval. Upon opening the HTML page, the user uploads the BED file to the corresponding version of the UCSC genome browser, which then displays the custom binding site track of the interval containing the site juxtaposed with other genome annotations. The HTML page is also hyperlinked to the binding site sequence (which can be used to generate a sequence walker using the <it>autolist </it>script), details of the binding site location, and the GenBank and SOURCE entries of the transcript associated with the site. Results obtained with different information matrices can be compared with the <it>scandiff </it>program, which generates BED files for binding sites found with each of the matrices and summary output indicating these differences. While <it>promotsite </it>takes input parameters in a file, all other <it>Delila-Genome </it>programs have command line options to specify the required and optional parameters and most support an '-h' switch that displays these options.</p>
            </text>
            <graphic file="1471-2105-4-38-1"/>
         </fig>
         <sec>
            <st>
               <p>Delila-Genome Front End</p>
            </st>
            <sec>
               <st>
                  <p>Submission of the genome scan</p>
               </st>
               <p>A front end was developed for submission of the genome-wide or chromosomal scans and for tailoring the output to filter and view the most relevant results. A Java-based GUI tool (developed with Java Swing technology) enables submission of scans to the server. Besides the <it>Delila </it>books containing chromosomal sequences, the only required input file is the <it>R</it><sub><it>i</it></sub>(<it>b, l</it>) information weight matrix (ribl) of the protein binding site. This file is output by the <it>ri </it>program, and the procedure for generating this file has been described <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. In order to assess the degree to which the computed information depends on these weights, an option is provided to modify this matrix by uploading a file containing these weights or entering them as integers on a Java form. Parameters are requested for the <it>Delila scan </it>program <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>, which performs the genome scan, and for <it>promotsite </it>(see below: <b>Delila-Genome Server</b>), a program that produces files for displaying binding sites within or adjacent to genes. The user selects the program to execute on the server and then either fills in the parameters required by the selected program or lets the front end retrieve the default parameters from the server. The front end also displays all of the genome assembly versions installed on the server (at our institution: human genome versions April, 2003, November, 2002, and October, 2000). The front end validates the parameters before submission. Java socket programming is used to connect to the server.</p>
            </sec>
            <sec>
               <st>
                  <p>Visualization</p>
               </st>
               <p>To present the most relevant results from the scans, <it>Delila-Genome </it>uses JavaScript to produce an HTML page listing binding sites within or adjacent to expressed loci in the human genome sequence. The user can view these binding sites at <it>low resolution </it>(relative to genes and other sites) or at <it>high resolution </it>(at the nucleotide level). Figure <figr fid="F2">2</figr> shows an HTML page with corresponding high and low resolution links associated with each binding site. Binding sites are selected based on their proximity to the 5' termini of transcripts mapped onto the human genome draft at the UCSC Genome Browser database <url>http://genome.ucsc.edu</url>. The coordinates of mapped transcripts are read from each chromosome-specific mRNA annotation table (files: chrXX_mrna.txt), which is downloaded from the UCSC Genome Browser annotation database into the chromosome-specific directories containing the corresponding genomic sequences. Currently, the genome contains numerous expressed sequences that have not been definitively established as genes in public databases. By defining binding sites in the context of such mRNAs mapped onto the genome sequence, it may be possible to annotate regulatory or other features in otherwise poorly-characterized, expressed coding sequences.</p>
               <fig id="F2">
                  <title>
                     <p>Figure 2</p>
                  </title>
                  <caption>
                     <p>Screen shot of results generated by <it>Delila-Genome </it>visualization tools</p>
                  </caption>
                  <text>
                     <p><b>Screen shot of results generated by <it>Delila-Genome </it>visualization tools. </b>This example shows predicted PXR/RXR&#945; binding sites at the zeta-crystallin locus. Genome-wide HTML and BED files have been generated by the <it>promotsite </it>program. Sites in the HTML page are ordered by information content. Hyperlinked pages (arrows from <it>Delila-Genome </it>HTML page) reveal details about binding sites and annotations of the gene associated with the binding site. Panels indicate: (A) <it>Delila-Genome </it>HTML page for viewing sorted binding sites with associated genes; (B) UCSC browser custom track detail for specific binding site; (C) Sequence of binding site; (D) Sequence walker of the binding site (computed on the server and displayed on client running X-windows); (E) GenBank entry for mRNA accession number associated with binding site; (F) Stanford SOURCE database entry providing current information about gene template of GenBank mRNA accession; (G) UCSC browser for viewing sites in the gene associated with the GenBank accession.</p>
                  </text>
                  <graphic file="1471-2105-4-38-2"/>
               </fig>
            </sec>
            <sec>
               <st>
                  <p>Low resolution tools</p>
               </st>
               <p>The server generates a list of predicted binding sites as a BED-formatted file <url>http://genome.ucsc.edu/goldenPath/help/customTrack.html</url> which is uploaded to the appropriate human genome draft browser at the inception of the session. The name assigned to a site is a concatenation of the GenBank accession number associated with the site (described below: Delila-Genome server, <it>promotsite</it>), the name of the <it>R</it><sub><it>i</it></sub>(<it>b, l</it>) matrix, i.e., the type of site, and the strength of the site in bits. Each site is represented as a color-shaded block in the custom track of the UCSC browser. The <it>score </it>field of the BED file controls the degree of shading of the site, with the strongest sites being the most opaque and the weakest being the most transparent. The score used in <it>Delila-Genome </it>BED files is a linear scaling of the <it>R</it><sub><it>i </it></sub>value. The start and end coordinates of a site correspond to the thick- and thin-ends of the BED features, respectively, so that its orientation can be visualized at high magnification. The <it>scandiff </it>program generates BED files for different categories of output, each of which has a unique color coding. The <it>genvis </it>Perl tool selects genes with sites either within user-defined chromosomal intervals or sorted by information content from input BED files and generates HTML pages hyperlinked to the UCSC genome browser custom track. The user can either retrieve the BED files from the server and upload them to the genome browser locally, or connect to the server using X terminal software and upload them from the server to the genome browser.</p>
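               <p>As an illustration of the BED records described above, the following Python sketch formats one site as a six-column BED line. The mapping of <it>R</it><sub><it>i </it></sub>onto the 0 to 1000 score range, the underscore separator in the name, the coordinates and the accession number are assumptions made for this example; the thick- and thin-end encoding of orientation is not reproduced.</p>

```python
def bed_line(chrom, start, end, accession, matrix_name, ri, strand, ri_max):
    """Format one binding site as a six-column BED record."""
    # Linear scaling of R_i onto the 0-1000 BED score range (assumed mapping).
    score = max(0, min(1000, round(1000 * ri / ri_max)))
    # Name: accession, matrix name and strength in bits (separator is assumed).
    name = f"{accession}_{matrix_name}_{ri:.1f}"
    return "\t".join(str(x) for x in (chrom, start, end, name, score, strand))

# The useScore attribute asks the UCSC browser to shade features by score.
track_header = 'track name="PXR_sites" description="Predicted binding sites" useScore=1'

# Hypothetical site: the coordinates and the accession are placeholders.
line = bed_line("chr1", 74880000, 74880016, "NM_000000", "PXR", 12.4, "+", ri_max=20.0)
print(track_header)
print(line)
```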
               <p>By navigating the other hyperlinks on the HTML page, one can view (i) the DNA sequence of a binding site (Fig. <figr fid="F2">2C</figr>), (ii) detailed characteristics of the binding site on the UCSC genome browser custom track (Fig. <figr fid="F2">2B</figr>), (iii) GenBank (Fig. <figr fid="F2">2E</figr>) and Stanford SOURCE (Fig. <figr fid="F2">2F</figr>) relational data describing the mRNA associated with this site, and (iv) all binding sites adjacent to the accession number on the UCSC browser within a user-defined window size (Fig. <figr fid="F2">2G</figr>).</p>
            </sec>
            <sec>
               <st>
                  <p>High resolution tools</p>
               </st>
               <p>The contributions of each nucleotide (in bits) to the overall individual information content of a single binding site can be viewed at high resolution using sequence walkers (<abbrgrp><abbr bid="B5">5</abbr></abbrgrp>; shown in Figure <figr fid="F2">2D</figr>). A walker graphically represents the weight of each nucleotide at each position in a single possible binding site, with the height of the nucleotide indicating how well the bases match the individual information weight matrix.</p>
               <p>To display a sequence walker, the DNA sequence containing the binding site (retrieved through a hyperlink on the HTML page) should be stored in the user's autolister directory on the server (or a Linux/Unix client running <it>Delila</it>). The <it>Delila atchange </it>script is configured to display the sequence walker by running the <it>Delila-Genome autolist </it>script, which scans the downloaded sequence for binding sites, executes the <it>lister </it>program to generate a PostScript image of the sequence walker, and pops up the image in a new X-window with <it>ghostview</it>. Longer sequences may also be retrieved, permitting walkers from multiple, adjacent binding sites and the genomic context of the binding site to be visualized.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>The Delila-Genome Server</p>
            </st>
            <p>The first step in building <it>Delila-Genome </it>was to port the <it>Delila </it>individual information programs to the Linux platform. The <it>Delila </it>software library is distributed by the National Cancer Institute as binaries for the Sun Sparc system. Source code written in Pascal was translated to C using <it>p2c </it>and debugged.</p>
            <p>The main components of the server are the <it>scan </it>(from <it>Delila</it>), <it>promotsite, scandiff </it>and <it>genvis </it>programs. The server module generally runs directly on top of the <it>Delila </it>system; however, it can be run using a reduced set of <it>Delila </it>binaries. Besides the <it>scan </it>program, the only <it>Delila </it>programs required by <it>Delila-Genome </it>are <it>lister, mkdb</it>, and <it>dbbk </it>(for displaying sequence walkers). The <it>Delila-Genome </it>server programs are described below.</p>
            <p><it>Scan </it>evaluates the strength (in bits) of each binding site and reports those sites whose strength (<it>R</it><sub><it>i</it></sub>) lies within a user-defined range <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. The parameters for <it>scan </it>are defined in the front-end Java program. The minimum threshold <it>R</it><sub><it>i </it></sub>value (<it>R</it><sub><it>i, minimum</it></sub>) is set at or above zero bits. Genome scans with an <it>R</it><sub><it>i</it></sub>(<it>b, l</it>) matrix derived from a limited number of binding sites, n = 50, can significantly contribute to Type 1 errors (false positive detection of weak binding sites). To reduce this source of error, <it>R</it><sub><it>i, minimum </it></sub>is generally set to the <it>R</it><sub><it>i </it></sub>value of the weakest binding site used to compute the weight matrix. Alternatively, sites whose Z scores or probabilities of the binding strengths fall within a user-defined range may be selected. The user also specifies which portion of the individual information matrix is scanned and which strand to evaluate (positive, negative or both). <it>Scan </it>can output <it>data </it>(locations and strengths of sites), <it>scanfeatures </it>(features for display with <it>lister</it>) and <it>scaninst </it>(instructions for extracting sites as <it>Delila </it>book files) files for each chromosome; however, only the <it>data </it>file is required as input to the <it>promotsite </it>program. Each record in the <it>data </it>file contains the <it>R</it><sub><it>i </it></sub>value of a predicted binding site, its coordinates, the Z score of this <it>R</it><sub><it>i </it></sub>value and the corresponding probability. <it>Scan </it>has numerous other features, the details of which are presented in <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. The Z score for user-defined matrices is based upon the mean of the distribution of scores derived from these matrices. The mean is determined first by simulating a set of binding sites based upon this weight matrix (with the <it>ridi </it>program <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>) and then computing <it>R</it><sub><it>sequence </it></sub>from a book of sequences containing these sites with the <it>encode</it>, <it>dalvec </it>and <it>rseq </it>programs (e.g., <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>).</p>
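            <p>The thresholding and strand selection described above can be sketched in Python as follows. This is not the <it>Delila scan </it>program: the toy three-position matrix, the function names, the zero-based coordinates and the Z score computed as (<it>R</it><sub><it>i </it></sub>minus mean) divided by the standard deviation are assumptions made for illustration.</p>

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string containing only A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def scan(sequence, matrix, ri_minimum=0.0, mean=None, sd=None):
    """Return (position, strand, ri, z) for windows whose R_i is at least ri_minimum."""
    width = len(matrix)
    hits = []
    for strand, seq in (("+", sequence), ("-", reverse_complement(sequence))):
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            ri = sum(matrix[l][b] for l, b in enumerate(window))
            if ri >= ri_minimum:
                # Z score relative to the distribution of R_i for the model (assumed form).
                z = (ri - mean) / sd if mean is not None and sd else None
                # Report minus-strand hits in plus-strand, zero-based coordinates.
                pos = start if strand == "+" else len(seq) - start - width
                hits.append((pos, strand, round(ri, 2), z))
    return hits

# Toy matrix that favours the sequence ACG; the weights are invented.
toy = [
    {"A": 2.0, "C": -3.0, "G": -3.0, "T": -3.0},
    {"A": -3.0, "C": 2.0, "G": -3.0, "T": -3.0},
    {"A": -3.0, "C": -3.0, "G": 2.0, "T": -3.0},
]
hits = scan("TTACGTT", toy, ri_minimum=0.0, mean=4.0, sd=1.0)
print(hits)
```

            <p>Raising <it>ri_minimum </it>in this sketch to the strength of the weakest site used to build the model mirrors the strategy described above for limiting false positive detection of weak sites.</p>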
            <p><it>Promotsite </it>was developed to filter the output produced by <it>scan</it>, since these results may contain very large numbers of potential binding sites (>>10<sup>6</sup>), many of which are distant from expressed sequences. <it>Promotsite </it>prunes the <it>data </it>file produced by <it>scan </it>and reports only relevant sites which are within or adjacent to expressed genomic templates. The user defines a search window either upstream or downstream (or both) relative to the beginning genomic coordinate (often the transcriptional initiation site) of each gene. The upstream and downstream window lengths may be specified independently. <it>Promotsite </it>modifies the <it>data </it>file format produced by <it>scan </it>so that the associated GenBank accession number is appended to the record containing the binding site (psdataop file). Typical analyses of splice sites within human coding regions selected sites up to 1 Mb downstream of the transcription initiation site in order to ensure that even the longest genes would be encompassed by these searches. We have limited the analyses of promoters to a 10 kb interval upstream (in some cases, downstream) of the transcription initiation site. However, these parameters should be set (and subsequently optimized) based upon previous experimental or published binding site studies for specific factors. For example, to comprehensively detect insulator elements bound by the protein CTCF, this window has been specified bi-directionally and increased in length (to 50 kb; not shown).</p>
            <p>Since a site may, in some instances, fall within the search window of multiple mRNAs, the mRNA whose start position is closest to the binding site coordinate is assigned to be the associated mRNA for that site. The list of reported binding sites may also be pruned based on a range of chromosomal coordinates and by specifying particular chromosomes. <it>Promotsite </it>also defines a parameter known as the <it>paralog distance</it>. Since the same mRNA sequence may be mapped based upon its similarity to multiple genomic locations, paralogous genes on the same chromosome designated with the same mRNA accession number are distinguished from large genes containing multiple widely-dispersed exons by defining a parameter for the minimum distance between paralogous loci. Binding sites separated by less than the <it>paralog distance </it>are labeled with the same GenBank accession number and are considered part of the same gene, whereas sites exceeding this distance are assumed to be derived from different genes that are similar to the same GenBank accession. Typically, we set the paralog distance to 10<sup>5 </sup>or 10<sup>6 </sup>bp, depending upon the lengths and density of genes or gene families thought to contain relevant binding sites. Using the associated mRNA for each site, <it>promotsite </it>creates a BED-formatted file that can be uploaded as a custom track on the UCSC human genome browser <url>http://genome.ucsc.edu</url>.</p>
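            <p>The window filter and nearest-mRNA assignment can be sketched as follows. This Python example is a simplified illustration rather than the <it>promotsite </it>implementation: the function name, the record layouts and the accession labels are hypothetical, and the paralog distance is not modeled.</p>

```python
def assign_sites(sites, transcripts, upstream=10000, downstream=0):
    """Keep sites inside a window around a transcript start and label each
    with the accession of the nearest transcript start.

    sites: list of (coordinate, ri)
    transcripts: list of (accession, start, strand)
    """
    kept = []
    for coord, ri in sites:
        best = None
        for accession, tx_start, strand in transcripts:
            # Signed offset: negative values lie upstream of the transcript start.
            offset = (coord - tx_start) if strand == "+" else (tx_start - coord)
            if downstream >= offset >= -upstream:
                distance = abs(offset)
                if best is None or best[0] > distance:
                    best = (distance, accession)
        if best is not None:
            kept.append((coord, ri, best[1]))
    return kept

# Hypothetical transcripts and sites on one chromosome.
transcripts = [("NM_A", 50000, "+"), ("NM_B", 58000, "+")]
sites = [(45000, 8.1), (56500, 5.0), (10000, 9.9), (49000, 7.2)]
print(assign_sites(sites, transcripts))
```

            <p>With a 10 kb upstream window, the site at 49000 falls within the windows of both transcripts and is assigned to the closer one, while the site at 10000 is discarded.</p>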
            <p>The execution time of <it>scan </it>depends on the length of the chromosome and the nucleotide length, <it>l</it>, of the <it>R</it><sub><it>i</it></sub>(<it>b, l</it>) weight matrix that defines the binding site. For hardware platforms with multiple computational nodes, the server can distribute <it>scan </it>and <it>promotsite </it>runs for each chromosome between these nodes so that the execution time over the whole genome is minimized. As <it>l </it>is constant over the whole genome, this load-balancing is based upon the length of each chromosome. Since execution times are generally several hours, the server informs the user of job completion by email.</p>
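            <p>One way to balance chromosome scans by length is a greedy longest-first assignment to the least-loaded node. The source does not state which scheduling rule the server uses, so the Python sketch below is only an assumed heuristic; the function name and the chromosome lengths (rounded, in Mb) are illustrative.</p>

```python
import heapq

def balance(chrom_lengths, n_nodes):
    """Assign chromosomes, longest first, to whichever node has the least work."""
    # Heap entries are (total length assigned, node id, chromosomes assigned).
    heap = [(0, node, []) for node in range(n_nodes)]
    heapq.heapify(heap)
    for chrom, length in sorted(chrom_lengths.items(), key=lambda kv: -kv[1]):
        load, node, assigned = heapq.heappop(heap)
        assigned.append(chrom)
        heapq.heappush(heap, (load + length, node, assigned))
    return {node: (load, assigned) for load, node, assigned in heap}

# Illustrative lengths only; real values depend on the genome assembly.
lengths = {"chr1": 245, "chr2": 243, "chr3": 199, "chr4": 191}
plan = balance(lengths, n_nodes=2)
print(plan)
```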
            <p>Relevant binding sites identified with <it>promotsite </it>or <it>scandiff </it>(see below) can be viewed with the <it>genvis </it>program. Like these programs, <it>genvis </it>uses JavaScript to generate HTML pages that display the binding site list extracted from the BED files. Since <it>promotsite </it>and <it>scandiff </it>may, in some instances, produce too many sites to upload to the browser, <it>genvis </it>offers several options to select subsets of binding sites from a chromosome or genome scan. Groups of sites may be extracted by writing subsets of the BED files specified by genomic strand, by chromosomal coordinates, or by a list of accession numbers corresponding to mRNAs mapped onto the genome sequence.</p>
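            <p>The kind of subsetting described above can be sketched as a simple filter over BED lines. The example assumes the standard six leading BED fields (chrom, chromStart, chromEnd, name, score, strand) and further assumes that the name field carries the mRNA accession; neither the function name nor this field layout is taken from <it>genvis</it>.</p>

```python
def select_sites(bed_lines, strand=None, start=None, end=None, accessions=None):
    """Yield the BED lines that pass every filter that was supplied.

    Assumptions: whitespace-delimited BED with at least six fields, and the
    name field (column 4) holding the accession of the associated mRNA.
    """
    for line in bed_lines:
        if line.startswith(("track", "browser", "#")) or not line.strip():
            continue                                  # skip header and blank lines
        f = line.split()
        chrom_start, chrom_end, name, site_strand = int(f[1]), int(f[2]), f[3], f[5]
        if strand is not None and site_strand != strand:
            continue
        if start is not None and start > chrom_start:
            continue                                  # site begins before the window
        if end is not None and chrom_end > end:
            continue                                  # site ends after the window
        if accessions is not None and name not in accessions:
            continue
        yield line
```

            <p>Filters left as <it>None </it>are ignored, so the same function covers selection by strand, by coordinate range, by accession list, or by any combination of these.</p>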
         </sec>
         <sec>
            <st>
               <p>Post-genome scan analysis</p>
            </st>
            <p>Inaccuracies in the genome draft coordinates of splice junction recognition sites motivated the development of an automated strategy to select correctly localized splice sites. Information weight matrices were iteratively recomputed from the set of sites with positive <it>R</it><sub><it>i </it></sub>values <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. More recently, we have built models of transcription factor binding sites by cyclical refinement of weight matrices based on published data from established regulated gene targets, supplemented with binding sites in these genes that were predicted by information theory and experimentally validated <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>. With <it>Delila-Genome</it>, potential novel binding sites can be identified, verified in the laboratory and included in subsequent refinements of the weight matrix.</p>
            <p>Previous approaches for comparing information weight matrices have involved determining the Euclidean or positional distances between related <it>R</it><sub><it>i</it></sub>(<it>b, l</it>) matrices <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B20">20</abbr></abbrgrp>. Comparisons of the results of successive genome scans offer an alternative approach for monitoring the progress of weight matrix refinement. The <it>scandiff </it>program computes model-to-model changes in information at experimentally-proven and predicted binding sites by scanning the same genome sequence with two different information weight matrices. This enables the user to monitor genome-wide sensitivity and specificity of binding site prediction. The psdataop output file generated by <it>promotsite </it>is the input to the <it>scandiff </it>program. The output files generated by <it>scandiff </it>categorize binding sites into those identified by only one of the two matrices (models A and B; columns A-B and B-A; Table <tblr tid="T1">1</tblr>) and those detected with both weight matrices, which may differ in information content (columns A &#8745; B in Table <tblr tid="T1">1</tblr>). <it>Scandiff </it>can display differences in binding strength at the same coordinate based upon thresholds for absolute changes in <it>R</it><sub><it>i </it></sub>(&#916;<it>R</it><sub><it>i</it></sub>), changes in their respective Z scores (&#916;Z), or distinct confidence intervals computed from each of the <it>R</it><sub><it>i</it></sub>(<it>b, l</it>) matrices <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>.</p>
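            <p>The categories reported by <it>scandiff </it>can be illustrated with a minimal sketch that compares two scans represented as mappings from site coordinate to <it>R</it><sub><it>i </it></sub>value. Only the &#916;<it>R</it><sub><it>i </it></sub>criterion is shown; the function name, the dictionary representation and the example values used to exercise it are assumptions and do not reflect the psdataop file format.</p>

```python
def compare_scans(ri_a, ri_b, delta_ri_threshold):
    """Categorize sites from two scans of the same sequence.

    ri_a, ri_b: dicts mapping a site coordinate to its R_i value (bits)
    under model A and model B. Sites found with both models are split into
    significant (S) and insignificant (I) by the absolute change in R_i.
    """
    coords_a, coords_b = set(ri_a), set(ri_b)
    significant, insignificant = [], []
    for coord in sorted(coords_a.intersection(coords_b)):
        if abs(ri_a[coord] - ri_b[coord]) > delta_ri_threshold:
            significant.append(coord)
        else:
            insignificant.append(coord)
    return {
        "A-B": sorted(coords_a - coords_b),   # found with model A only
        "B-A": sorted(coords_b - coords_a),   # found with model B only
        "S": significant,
        "I": insignificant,
    }
```

            <p>By construction, the sum of the S and I categories equals the number of sites common to both scans and is independent of the threshold, which is consistent with the constant S + I totals within each model pair in Table <tblr tid="T1">1</tblr>.</p>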
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Total binding site counts based on genome scans of promoters with PXR/RXR&#945; information weight matrices</p>
               </caption>
               <tblbdy cols="13">
                  <r>
                     <c cspan="2" ca="center">
                        <p>
                           <b>Models Compared</b>
                        </p>
                     </c>
                     <c cspan="11" ca="center">
                        <p>
                           <b>Numbers of sites in each category</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="11">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2" ca="center">
                        <p>
                           <b>Unique sites</b>
                        </p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>
                           <b>Z scores</b>
                        </p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>
                           <b>
                              <it>R</it>
                           </b>
                           <sub>
                              <it>i</it>
                           </sub>
                        </p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>
                           <b>Confidence intervals</b>
                           <sup>+</sup>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="13">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>A</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>B</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>A-B *</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>B-A^</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Threshold (&#916;Z)</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>(A &#8745; B) <it>S</it></b>
                           <sup>~</sup>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>(A &#8745; B) <it>I</it></b>
                           <sup>@</sup>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Threshold (&#916;<it>R</it><sub><it>i</it></sub>, bits)</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>(A &#8745; B) <it>S</it></b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>(A &#8745; B) <it>I</it></b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Threshold (&#177; S.D.)</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>(A &#8745; B) <it>S</it></b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>(A &#8745; B) <it>I</it></b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="13">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>11758</p>
                     </c>
                     <c ca="center">
                        <p>45219</p>
                     </c>
                     <c ca="center">
                        <p>0.5</p>
                     </c>
                     <c ca="center">
                        <p>27945</p>
                     </c>
                     <c ca="center">
                        <p>44302</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>29378</p>
                     </c>
                     <c ca="center">
                        <p>42869</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>30982</p>
                     </c>
                     <c ca="center">
                        <p>41265</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.75</p>
                     </c>
                     <c ca="center">
                        <p>7492</p>
                     </c>
                     <c ca="center">
                        <p>64755</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>9080</p>
                     </c>
                     <c ca="center">
                        <p>63167</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>26931</p>
                     </c>
                     <c ca="center">
                        <p>45316</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>1.0</p>
                     </c>
                     <c ca="center">
                        <p>589</p>
                     </c>
                     <c ca="center">
                        <p>71658</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>2293</p>
                     </c>
                     <c ca="center">
                        <p>69954</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>23625</p>
                     </c>
                     <c ca="center">
                        <p>48622</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>17065</p>
                     </c>
                     <c ca="center">
                        <p>157922</p>
                     </c>
                     <c ca="center">
                        <p>0.5</p>
                     </c>
                     <c ca="center">
                        <p>90459</p>
                     </c>
                     <c ca="center">
                        <p>9942</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>54426</p>
                     </c>
                     <c ca="center">
                        <p>45975</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>55431</p>
                     </c>
                     <c ca="center">
                        <p>44970</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.75</p>
                     </c>
                     <c ca="center">
                        <p>73309</p>
                     </c>
                     <c ca="center">
                        <p>27092</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>26038</p>
                     </c>
                     <c ca="center">
                        <p>74363</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>45069</p>
                     </c>
                     <c ca="center">
                        <p>55332</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>1.0</p>
                     </c>
                     <c ca="center">
                        <p>48657</p>
                     </c>
                     <c ca="center">
                        <p>51744</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>11044</p>
                     </c>
                     <c ca="center">
                        <p>89357</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>37822</p>
                     </c>
                     <c ca="center">
                        <p>62579</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>61906</p>
                     </c>
                     <c ca="center">
                        <p>148894</p>
                     </c>
                     <c ca="center">
                        <p>0.5</p>
                     </c>
                     <c ca="center">
                        <p>54586</p>
                     </c>
                     <c ca="center">
                        <p>141831</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>93585</p>
                     </c>
                     <c ca="center">
                        <p>102832</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>104397</p>
                     </c>
                     <c ca="center">
                        <p>92020</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.75</p>
                     </c>
                     <c ca="center">
                        <p>17891</p>
                     </c>
                     <c ca="center">
                        <p>178526</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>33843</p>
                     </c>
                     <c ca="center">
                        <p>162574</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>80088</p>
                     </c>
                     <c ca="center">
                        <p>116329</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>1.0</p>
                     </c>
                     <c ca="center">
                        <p>5044</p>
                     </c>
                     <c ca="center">
                        <p>191373</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>11069</p>
                     </c>
                     <c ca="center">
                        <p>185348</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>68846</p>
                     </c>
                     <c ca="center">
                        <p>127571</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p><sup>+ </sup>Standard error computation for individual <it>R</it><sub><it>i </it></sub>values is based on the derivation given in reference 11; *sites found with model A, but not with model B; ^sites found with model B, but not with model A; <sup>~</sup>number of sites with differences in <it>R</it><sub><it>i </it></sub>values exceeding threshold Z scores; <sup>@</sup>number of sites with differences in <it>R</it><sub><it>i </it></sub>values less than threshold.</p>
               </tblfn>
            </tbl>
            <p>The criterion for measuring changes in binding site strength is dictated by the stage of model refinement (see below). Absolute comparisons of <it>R</it><sub><it>i </it></sub>values are less meaningful at early stages of refinement, since the addition of experimentally-defined binding sites to an information model can substantially alter the distribution of <it>R</it><sub><it>i </it></sub>values of the binding sites that underlie these weight matrices. At early stages of refinement, the information models are based on fewer binding sites, resulting in larger confidence intervals for individual <it>R</it><sub><it>i </it></sub>values. Comparisons of <it>R</it><sub><it>i </it></sub>values based upon the sizes of confidence intervals are therefore not as reliable a measure of significant change in information as changes in their respective Z scores.</p>
            <p>Upon model convergence, the proportion of sites in successive models with significant differences in information content, <it>S</it> / [<it>S+I</it>] (<it>S</it> = significant, <it>I</it> = insignificant), should be quite small for confidence intervals of 3 S.D. The proportion of sites common to both models relative to discordant sites found in only one model, [<it>S+I</it>] / ([<it>A-B</it>] + [<it>B-A</it>]), should stabilize as successive versions of the information weight matrix are refined.</p>
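            <p>For reference, both proportions can be computed directly from the counts in Table <tblr tid="T1">1</tblr>. The short sketch below uses the confidence-interval columns at the 3 S.D. threshold and reads the second proportion as [<it>S+I</it>] divided by the sum of [<it>A-B</it>] and [<it>B-A</it>]; the function and variable names are illustrative only.</p>

```python
# Counts copied from Table 1 (confidence-interval threshold of 3 S.D.)
TABLE1 = {
    ("1", "2"): {"A-B": 11758, "B-A": 45219, "S": 23625, "I": 48622},
    ("2", "3"): {"A-B": 17065, "B-A": 157922, "S": 37822, "I": 62579},
    ("3", "4"): {"A-B": 61906, "B-A": 148894, "S": 68846, "I": 127571},
}


def convergence_ratios(counts):
    """Return (S / (S + I), (S + I) / ((A-B) + (B-A))) for one model pair."""
    common = counts["S"] + counts["I"]             # sites found with both models
    discordant = counts["A-B"] + counts["B-A"]     # sites found with one model only
    return counts["S"] / common, common / discordant


for models, counts in TABLE1.items():
    frac_significant, common_to_discordant = convergence_ratios(counts)
    print(models, round(frac_significant, 3), round(common_to_discordant, 3))
```

            <p>Tracking these two numbers across successive model pairs gives a compact summary of how far a refinement series is from the convergence behavior described above.</p>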
            <p><it>Scandiff </it>generates BED-formatted files and data files, similar in format to those produced by <it>promotsite</it>, from the identified and categorized binding sites. We used the following color shading convention for the different types of binding sites: sites with significant changes in <it>R</it><sub><it>i </it></sub>are shaded gray; sites identified only by scanning with the first matrix are shaded brown; and sites found only with the second matrix are shaded blue. An example of this output is shown in Figure <figr fid="F3">3</figr>, which indicates the results for PXR/RXR&#945; models 1 and 2 in the vicinity of the <it>CYP3A4 </it>gene.</p>
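            <p>How the colors are encoded in the <it>scandiff </it>output is not specified here. One way to obtain per-site colors in a UCSC custom track, shown as an assumption rather than as the mechanism <it>scandiff </it>uses, is the nine-field BED layout with an itemRgb column and a track line that sets itemRgb="On". The RGB values and category names below are arbitrary choices for gray, brown and blue.</p>

```python
# Arbitrary RGB triplets standing in for the gray / brown / blue convention
COLORS = {
    "significant": "128,128,128",   # gray: found with both matrices, R_i changed
    "only_A": "165,42,42",          # brown: found only with the first matrix
    "only_B": "0,0,255",            # blue: found only with the second matrix
}

TRACK_LINE = 'track name="scan_comparison" itemRgb="On"'


def bed_line(chrom, start, end, name, strand, category):
    """Format one nine-field BED record colored by site category.

    thickStart and thickEnd are set to the site boundaries; score is 0.
    """
    fields = [chrom, start, end, name, 0, strand, start, end, COLORS[category]]
    return "\t".join(str(f) for f in fields)
```

            <p>Writing the track line followed by one such record per site yields a file that can be uploaded as a single color-coded custom track.</p>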
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Screen shot of UCSC Genome Browser indicating binding sites found in genome scans using different information weight matrices</p>
               </caption>
               <text>
                  <p><b>Screen shot of UCSC Genome Browser indicating binding sites found in genome scans using different information weight matrices. </b>Binding sites in the promoter of the <it>CYP3A4 </it>gene found with PXR/RXR&#945; weight matrices are indicated by color-coded custom tracks. Sites uniquely identified with the weight matrices from Models 1 and 2 are indicated with brown and blue tracks, respectively. The gray track shows binding sites with significantly different binding strengths that were identified by scanning with both of the matrices. The custom tracks were generated by the <it>scandiff </it>program and uploaded to the Genome Browser.</p>
               </text>
               <graphic file="1471-2105-4-38-3"/>
            </fig>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results and Discussion</p>
         </st>
         <p>We tested the <it>Delila-Genome </it>system by scanning the human genome draft sequence (November, 2002) with information weight matrices developed from human transcription factor binding sites (PXR/RXR&#945; [pregnane-X receptor], NF-&#954;B [p50/p65 heterodimer], and AHR [aryl hydrocarbon receptor]) and with models of sites required for post-transcriptional processing of heterogeneous nuclear RNA (donor and acceptor splice sites, and the SR protein, SC35). All binding site sequences were derived from published studies and, in some instances (PXR/RXR&#945;, NF-&#954;B), supplemented by binding sites validated in our laboratory <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>. The information weight matrices were derived with the <it>Delila </it>system using previously established procedures <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>.</p>
         <sec>
            <st>
               <p>Performance metrics</p>
            </st>
            <p>Table <tblr tid="T2">2</tblr> indicates the execution times of complete genome scans for various types of binding sites on two different Linux hardware platforms: a Beowulf cluster of three dual 1.1 GHz CPU nodes running the Scyld operating system and a Mosix cluster of 24 single-processor 500 MHz nodes. Due to limitations in disk storage, the Scyld Beowulf cluster was used only for genome scans with the PXR/RXR&#945; matrices. The execution times given in Table <tblr tid="T2">2</tblr> represent the combined results of running both <it>scan </it>on the genome sequence and <it>promotsite </it>on the results of the <it>scan </it>program. The execution time for both programs depends upon the length of the binding site, the <it>R</it><sub><it>sequence </it></sub>of the weight matrix, and <it>R</it><sub><it>i</it>,<it>minimum </it></sub>(specified by the user). The length of the site contributes to the CPU time, and the last two factors contribute to the I/O access time. The table shows that for successive models of PXR/RXR&#945;, <it>R</it><sub><it>sequence </it></sub>decreases and, consequently, the number of predicted sites increases. Additional novel sites that are predicted by information analysis and validated by laboratory testing are introduced with each successive model. These additional sites account for the decrease in <it>R</it><sub><it>sequence </it></sub>and the increase in the number of predicted sites in the genome. <it>R</it><sub><it>sequence </it></sub>decreases from 17 bits to 14.9 bits from models 2 to 3, and there is a steep rise (more than a 2-fold increase) in the number of sites. With the addition of (somewhat weaker) binding sites to model 3, the resultant matrix is less biased towards the consensus sequence, resulting in a large genome-wide increase in predicted sites. The median execution times for all PXR/RXR&#945; models were approximately 6.5 hrs on the Mosix cluster and 3.5 hrs on the Scyld cluster, despite a 3.5-fold increase in the number of sites from models 1 to 4. The effect of increased I/O access time on the total execution time is evident in the case of the SR protein SC35 site (which has a low <it>R</it><sub><it>sequence </it></sub>value of 3.64 bits), where the run time is 19 hours due to a 76-fold increase in the quantity of sites predicted compared with the scan of PXR/RXR&#945; Model 4.</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Performance metrics for genome scans</p>
               </caption>
               <tblbdy cols="12">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2" ca="center">
                        <p>
                           <b>Execution time (hrs) *</b>
                        </p>
                     </c>
                     <c cspan="4" ca="center">
                        <p>
                           <b>Number of sites found^</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Site</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Length</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Weight matrix version</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Num. sites in Model</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>
                              <it>R</it>
                           </b>
                           <sub><it>i</it>,<it>min</it></sub>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>
                              <it>R</it>
                           </b>
                           <sub>
                              <it>seq</it>
                           </sub>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Mosix</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Scyld</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b><it>R</it><sub><it>i</it></sub>&#8805;<it>R</it></b>
                           <sub><it>i</it>,<it>min</it></sub>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b><it>R</it><sub><it>i</it></sub>&#8805;<it>R</it></b>
                           <sub>
                              <it>seq</it>
                           </sub>
                        </p>
                     </c>
                     <c ca="center">
                        <p>Unique Promoters with <b><it>R</it><sub><it>i</it></sub>&#8805;<it>R</it></b><sub><it>seq</it></sub></p>
                     </c>
                     <c ca="center">
                        <p>Promoters with multiple sites (%) <b><it>R</it><sub><it>i</it></sub>&#8805;<it>R</it></b><sub><it>seq</it></sub></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="12">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PXR</p>
                     </c>
                     <c ca="center">
                        <p>23</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>15</p>
                     </c>
                     <c ca="center">
                        <p>7.1</p>
                     </c>
                     <c ca="center">
                        <p>17.1</p>
                     </c>
                     <c ca="center">
                        <p>6.5</p>
                     </c>
                     <c ca="center">
                        <p>4.3</p>
                     </c>
                     <c ca="center">
                        <p>3.48e5</p>
                     </c>
                     <c ca="center">
                        <p>218</p>
                     </c>
                     <c ca="center">
                        <p>200</p>
                     </c>
                     <c ca="center">
                        <p>8.3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PXR</p>
                     </c>
                     <c ca="center">
                        <p>23</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>19</p>
                     </c>
                     <c ca="center">
                        <p>7.1</p>
                     </c>
                     <c ca="center">
                        <p>17.0</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>3.5</p>
                     </c>
                     <c ca="center">
                        <p>4.97e5</p>
                     </c>
                     <c ca="center">
                        <p>391</p>
                     </c>
                     <c ca="center">
                        <p>365</p>
                     </c>
                     <c ca="center">
                        <p>6.6</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PXR</p>
                     </c>
                     <c ca="center">
                        <p>23</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>32</p>
                     </c>
                     <c ca="center">
                        <p>7.1</p>
                     </c>
                     <c ca="center">
                        <p>14.9</p>
                     </c>
                     <c ca="center">
                        <p>7.1</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>1.10e6</p>
                     </c>
                     <c ca="center">
                        <p>3393</p>
                     </c>
                     <c ca="center">
                        <p>3036</p>
                     </c>
                     <c ca="center">
                        <p>10.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PXR</p>
                     </c>
                     <c ca="center">
                        <p>23</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>48</p>
                     </c>
                     <c ca="center">
                        <p>7.1</p>
                     </c>
                     <c ca="center">
                        <p>14.4</p>
                     </c>
                     <c ca="center">
                        <p>6.8</p>
                     </c>
                     <c ca="center">
                        <p>3.8</p>
                     </c>
                     <c ca="center">
                        <p>1.44e6</p>
                     </c>
                     <c ca="center">
                        <p>7694</p>
                     </c>
                     <c ca="center">
                        <p>6439</p>
                     </c>
                     <c ca="center">
                        <p>16.3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>NF-&#954;B</p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>75</p>
                     </c>
                     <c ca="center">
                        <p>2.6</p>
                     </c>
                     <c ca="center">
                        <p>10.9</p>
                     </c>
                     <c ca="center">
                        <p>5.8</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>1.16e7</p>
                     </c>
                     <c ca="center">
                        <p>74050</p>
                     </c>
                     <c ca="center">
                        <p>33340</p>
                     </c>
                     <c ca="center">
                        <p>54.9</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>AHR</p>
                     </c>
                     <c ca="center">
                        <p>17</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>30</p>
                     </c>
                     <c ca="center">
                        <p>2.8</p>
                     </c>
                     <c ca="center">
                        <p>9.4</p>
                     </c>
                     <c ca="center">
                        <p>6.3</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>1.20e7</p>
                     </c>
                     <c ca="center">
                        <p>42487</p>
                     </c>
                     <c ca="center">
                        <p>24764</p>
                     </c>
                     <c ca="center">
                        <p>41.7</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Acc</p>
                     </c>
                     <c ca="center">
                        <p>28</p>
                     </c>
                     <c ca="center">
                        <p>12</p>
                     </c>
                     <c ca="center">
                        <p>1.08e5</p>
                     </c>
                     <c ca="center">
                        <p>2.4</p>
                     </c>
                     <c ca="center">
                        <p>7.4</p>
                     </c>
                     <c ca="center">
                        <p>14.5</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>4.87e7</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Don</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>1.11e5</p>
                     </c>
                     <c ca="center">
                        <p>2.4</p>
                     </c>
                     <c ca="center">
                        <p>6.7</p>
                     </c>
                     <c ca="center">
                        <p>10.5</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>4.85e7</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>SC35</p>
                     </c>
                     <c ca="center">
                        <p>8</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>30</p>
                     </c>
                     <c ca="center">
                        <p>0.4</p>
                     </c>
                     <c ca="center">
                        <p>3.6</p>
                     </c>
                     <c ca="center">
                        <p>19</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>1.07e8</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Abbreviations. Site: Binding site information matrix; PXR: PXR/RXR&#945;; NF-&#954;B: NF-&#954;B p50/p65 subunits; Acc: Splice Acceptor; Don: Splice Donor; Length: Length of the site in nucleotides; <it>R</it><sub><it>i</it>,<it>min </it></sub>: <it>R</it><sub><it>i</it>,<it>minimum </it></sub>(in bits); <it>R</it><sub><it>seq </it></sub>: <it>R</it><sub><it>sequence </it></sub>(in bits). * Total runtime for both the <it>scan </it>and <it>promotsite </it>programs. ^ Results of information analysis with the PXR/RXR&#945;, NF-&#954;B and AHR matrices of promoter regions (10 kb upstream of the transcription initiation site) for all transcripts mapped in the reference genome sequence. Complete gene sequences (from the transcription initiation site to the terminal sequence of the 3' UTR) were analyzed with the Acc, Don and SC35 matrices.</p>
               </tblfn>
            </tbl>
            <p>The splice acceptor and donor scans required a modification of the published genome sequence. In the original genome drafts, a very large number of binding sites (>>10<sup>8</sup>) were initially found. Many of these sites were composed of long runs of undefined polynucleotides (i.e. &#8805; N<sub>(10)</sub>) in heterochromatin and in gaps in the draft sequence. The <it>Delila </it>program defaults to adenine in these cases, and for splice acceptor sites these substitutions generated sites composed of polyadenine, which itself has an <it>R</it><sub><it>i </it></sub>value exceeding the user-defined threshold (2.4 bits; <it>R</it><sub><it>i, minimum</it></sub>). The output of these scans exceeded our available disk storage. To reduce the quantity of false positive sites, we generated and substituted random nucleotides for every sequence of undefined polynucleotides &#8805; 10 bp in length. Our previous studies have shown that sequence randomization produces fewer than 2% of binding sites with <it>R</it><sub><it>i </it></sub>values above zero bits <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, and none above the minimum <it>R</it><sub><it>i </it></sub>threshold value <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. The genome scans of the substituted genome sequences with splice donor and acceptor <it>R</it><sub><it>i</it></sub>(<it>b, l</it>) weight matrices were completed in 10.5 and 14.5 hours, respectively.</p>
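            <p>The substitution of undefined runs can be sketched in a few lines of Python. This is a minimal illustration and not the code used by <it>Delila-Genome</it>: the function name <it>randomize_undefined_runs</it>, the fixed random seed and the toy sequence are assumptions made for the example. Every run of at least 10 consecutive N characters is replaced with randomly chosen nucleotides, and shorter runs are left unchanged.</p>

```python
import random
import re


def randomize_undefined_runs(sequence, min_run=10, seed=0):
    """Replace each run of min_run or more N characters with random bases."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    run_pattern = re.compile("[Nn]{%d,}" % min_run)

    def random_bases(match):
        # Draw one random nucleotide for every undefined position in the run.
        return "".join(rng.choice("ACGT") for _ in range(len(match.group(0))))

    return run_pattern.sub(random_bases, sequence)


# Toy sequence: a 12-base undefined run (replaced) and a 3-base run (kept).
example = "ACGT" + "N" * 12 + "TTNNNGG"
masked = randomize_undefined_runs(example)
print(masked)
```

            <p>In this sketch the sequence length is preserved, so genome coordinates reported by a later scan would still refer to the original assembly.</p>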
         </sec>
         <sec>
            <st>
               <p>Visualization of binding sites in subgenomic intervals</p>
            </st>
            <p>We have found that uploads of large BED files of binding sites to the remote UCSC genome browser can be time-consuming and sometimes fail. The BED file for all binding sites found with PXR/RXR&#945; Model 4, for example, is ~ 30 MB and required 5&#8211;10 minutes to upload. Furthermore, the large numbers of sites found with some information weight matrices (e.g. splice donor and acceptor sites; 254 MB for acceptor sites on chromosome 1 alone) produce BED file sizes exceeding browser/server limits. We therefore created and viewed subsets of binding sites for genomic regions of specific interest with the <it>genvis </it>tool.</p>
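            <p>The idea of restricting a large BED file to a region of interest can be illustrated as follows. The sketch is not the <it>genvis </it>implementation: it assumes only the standard leading BED columns (chromosome, start, end), and the function name <it>subset_bed </it>and the example records are hypothetical.</p>

```python
def subset_bed(lines, chrom, start, end):
    """Return BED records on chrom lying entirely within [start, end)."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(("track", "browser")):
            kept.append(line)  # keep custom-track header lines
            continue
        fields = line.split()
        if 3 > len(fields):
            continue  # skip blank or malformed lines
        site_start, site_end = int(fields[1]), int(fields[2])
        if fields[0] == chrom and site_start >= start and end >= site_end:
            kept.append(line)
    return kept


# Hypothetical records: chrom, chromStart, chromEnd, name, score, strand.
records = [
    "track name=example_sites",
    "chr1\t1000\t1023\tsite_a\t700\t+",
    "chr1\t500000\t500023\tsite_b\t350\t-",
    "chr2\t1500\t1523\tsite_c\t900\t+",
]
subset = subset_bed(records, "chr1", 0, 10000)
print("\n".join(subset))
```

            <p>Only the track header and the first chromosome 1 record fall inside the requested window, so the resulting file is small enough to upload as a custom track.</p>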
            <p>Figure <figr fid="F2">2</figr> depicts the HTML page generated by <it>genvis</it>, containing a partial list of binding sites on chromosome 1 for the PXR/RXR&#945; model 4 <it>R</it><sub><it>i</it></sub>(<it>b, l</it>) weight matrix. The websites linked to this page are also shown (but have been resized or truncated) to reflect only the important details of each. When the HTML page is initially loaded, a window for the UCSC browser pops up. The BED file is uploaded using a button in this window upon selecting the appropriate version of the genome draft at the UCSC website. When the genome browser target links (entries in the <it>R</it><sub><it>i</it></sub>, Seq and UCSC Browser columns) are activated, the genome browser displays the information based on this uploaded file.</p>
            <p>The second row of the HTML table in Figure <figr fid="F2">2</figr> corresponds to the binding site associated with the GenBank Accession L13278. This is a strong binding site (<it>R</it><sub><it>i </it></sub>value of ~ 20.1 bits) which is hyperlinked to the custom track detail in the genome browser. This track detail page indicates the size of the site and the orientation of the recognition sequence on the draft genome sequence. The user can obtain the DNA sequence of the site either from the <it>Seq </it>cell in the HTML table or from the corresponding custom track detail. The pop-up sequence walker indicates the relative contributions of each nucleotide in the site <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>.</p>
            <p>The linked GenBank and SOURCE database entries indicate that accession L13278 encodes the zeta-crystallin/quinone reductase gene. We selected this example to illustrate that <it>Delila-Genome </it>can potentially be used to discover novel transcriptional regulatory targets, since this gene has not previously been demonstrated to be regulated by PXR/RXR&#945;. The SOURCE entry is based on a dynamic collection and compilation of gene data from many scientific databases associated with the GenBank accession, whereas the GenBank entry, in some instances, is not curated and is guaranteed only to contain the corresponding sequence. The SOURCE entry also provides other information such as the aliases for the gene name, the LocusLink designation and the expression profile.</p>
            <p>The UCSC genome browser entry displays the binding site custom track and sequences in the proximity of the associated GenBank accession. The coordinates delineate a display window concordant with the search window defined in <it>promotsite </it>for generating the list of binding sites given in the HTML page. In Figure <figr fid="F2">2</figr>, the predicted site is 1112 bp upstream of L13278 and ~ 7.2 kb upstream of an as yet uncharacterized gene corresponding to both AK098237 and BC009514. Although we cannot exclude the possibility that this site regulates the gene encoded by AK098237/BC009514, its closer proximity to the zeta-crystallin gene and the common orientation of both the site and gene on the antisense strand suggest that this site may function as a transcriptional enhancer element for the zeta-crystallin gene. There are no other predicted binding sites in the vicinity of this gene.</p>
         </sec>
         <sec>
            <st>
               <p>Comparison of genome scans produced from successive transcription factor information weight matrices</p>
            </st>
            <p>The results of genome scans with successive refinements of PXR/RXR&#945; information weight matrices were compared using <it>scandiff</it>. The refinement procedure was validated by detecting binding sites in well-established PXR/RXR&#945; target genes. Initial models based on published sites were used to scan target genes that were known to be induced by PXR/RXR&#945; binding, but where additional sites had not been previously identified. Sites detected in these scans were assayed for binding to PXR/RXR&#945; and those found to bind were incorporated in subsequent rounds of refinement.</p>
            <p>The <it>genvis </it>program was used to display <it>scandiff </it>results for <it>CYP3A4</it>, which is a single gene known to be regulated by PXR/RXR&#945; (Figure <figr fid="F3">3</figr>). BED-custom tracks of this gene for scans of the initial and second PXR/RXR&#945; models (1 and 2) are indicated. Both information models recognize experimentally-verified binding sites <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr></abbrgrp>: a strong, potential proximal enhancer binding site (custom track M18907_pxr_R17) 204 bp upstream of the transcription initiation site and a cluster of distal enhancer elements 7.2&#8211;7.8 kb upstream. Model 1 identifies a 7 bit site (AF182273_pxr_R7) in the first intron, which is absent from the scan with model 2. Model 2, however, identifies an additional site (M18907_pxr_R7) within the distal enhancer cluster, which is consistent with the possibility that model 2 more specifically recognizes promoter binding sites. Similar results were obtained confirming detection of experimentally-defined binding sites in the promoters of other PXR/RXR&#945; regulated genes (<it>CYP3A7</it>, <it>CYP2B6</it>; results not shown).</p>
            <p><it>Scandiff </it>also produces a summary statistics file which can be used to monitor the progress of information theory-based model refinement. The following example indicates how the results of complete genome scans with four successive PXR/RXR&#945; <it>R</it><sub><it>i</it></sub>(<it>b, l</it>) matrices can be interpreted from these summaries (each successive model is based on increasing numbers of experimentally validated binding sites; Table <tblr tid="T1">1</tblr>). The tables indicate the differences in the number of predicted binding sites in each category of these models. By selecting high thresholds for either &#916;<it>R</it><sub><it>i </it></sub>values, &#916; Z scores or confidence intervals, it is possible to identify binding sites with the most significant model-to-model changes. The following analysis is based on changes in information content of at least 3 bits (&#916;<it>R</it><sub><it>i</it></sub>), Z score differences of &#8805; 1, and confidence intervals &#8805; 3 standard deviations, i.e. 95%.</p>
            <p>Newly identified sites (B-A) predicted with model 2 are 3.8 fold more abundant than those found only with model 1. Scanning the genome with model 3 (vs. model 2) resulted in an even more disproportionate distribution of unique sites (9.2 fold). This trend continues in model 4, but the fraction of novel binding sites is decreased (2.4 fold). The findings indicate that increasing the diversity of the sequences underlying the matrix affects which binding sites are found in the genome scan. It is apparent that the PXR/RXR&#945; weight matrix has not converged, since large numbers of novel sites continue to be found with successive information models.</p>
            <p>Only a modest fraction of sites (<it>S/ [S+I]; </it>S = significant, I = insignificant) exhibit the largest significant changes in binding site strength (&#916;<it>R</it><sub><it>i </it></sub>&#8805; 3 bits; ranging from 3&#8211;11%), regardless of which pair of scans is analyzed. Most changes in information content are &#8804; 2 bits. As &#916;<it>R</it><sub><it>i </it></sub>values give no indication of the strengths of the sites that have changed (only the magnitude of those changes), we also cataloged significant changes by comparing the Z scores of the same binding sites found by successive models. The most stringent test (&#916; Z &#8805; 1) revealed that the transition from model 2 to model 3 produced the largest proportion of significant changes (48% of sites; n = 48,657), in comparison with more modest changes in Z score from models 1 to 2 (0.8%) and models 3 to 4 (2.5%). We interpret these results to indicate that model 3 may have altered the strengths of binding sites at outlying <it>R</it><sub><it>i </it></sub>values to a greater extent than the transitions either from models 1 to 2 or from models 3 to 4.</p>
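            <p>The categories used in these comparisons can be made concrete with a short Python sketch. It is an illustration only and does not reproduce the <it>scandiff </it>code or its output format: the function name <it>compare_scans</it>, the dictionary layout and the example sites are assumptions, and only the 3 bit threshold on the change in <it>R</it><sub><it>i </it></sub>is taken from the text. Z score and confidence interval tests are omitted.</p>

```python
def compare_scans(scan_a, scan_b, min_delta_ri=3.0):
    """Summarize differences between two scans keyed by site location."""
    keys_a, keys_b = set(scan_a), set(scan_b)
    only_a = keys_a.difference(keys_b)  # sites found only by the earlier model
    only_b = keys_b.difference(keys_a)  # sites found only by the later model (B-A)
    common = keys_a.intersection(keys_b)
    # A shared site is "significant" if its Ri changed by at least min_delta_ri.
    significant = [k for k in common if abs(scan_b[k] - scan_a[k]) >= min_delta_ri]
    return {
        "only_a": len(only_a),
        "only_b": len(only_b),
        "common": len(common),
        "significant": len(significant),
        "fraction_significant": len(significant) / len(common) if common else 0.0,
    }


# Hypothetical sites: (chromosome, coordinate, strand) mapped to Ri in bits.
model_a = {("chr1", 100, "+"): 7.0, ("chr1", 900, "-"): 12.0, ("chr2", 50, "+"): 8.0}
model_b = {("chr1", 100, "+"): 10.5, ("chr1", 900, "-"): 12.5,
           ("chr3", 10, "+"): 9.0, ("chr3", 70, "-"): 7.5}
summary = compare_scans(model_a, model_b)
print(summary)
```

            <p>In this toy comparison the later model has two unique sites and the earlier model one, and one of the two shared sites changes by at least 3 bits, giving S/[S+I] = 0.5.</p>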
            <p>Binding sites that are added to the models in subsequent rounds of experimental refinement have increasingly diverse sequences, resulting in lower values of <it>R</it><sub><it>sequence</it></sub>; the refined models therefore detect additional predicted sites. Shorter binding sites, such as those recognized by AHR, with lower <it>R</it><sub><it>sequence </it></sub>values, are predicted to be even more abundant. The vast majority of the newly detected binding sites are considered "weak" (<it>R</it><sub><it>i </it></sub>&lt;&lt;<it>R</it><sub><it>sequence</it></sub>; Table <tblr tid="T2">2</tblr>). The lower threshold <it>R</it><sub><it>i </it></sub>value of binding sites reported by <it>scan </it>is typically set to the strength of the weakest binding site used to define the information weight matrix. The confidence intervals on binding sites with low <it>R</it><sub><it>i </it></sub>values are still fairly large [see Appendix to reference 11], and some of these sites may turn out to have <it>R</it><sub><it>i </it></sub>&lt; 0 bits. In any case, the affinities for sites with low <it>R</it><sub><it>i </it></sub>values, especially those ~ <it>R</it><sub><it>i, minimum</it></sub>, are likely to be negligible and may not be detectable experimentally <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. Nevertheless, the increased sequence diversity introduced by these refinement procedures augments the dynamic range of site binding strengths found with later versions of refined models. The increased sequence diversity affects the frequencies of the nucleotides underlying the weight matrix and can significantly alter the information contents of predicted "strong" sites <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>.</p>
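            <p>The effect of sequence diversity on <it>R</it><sub><it>sequence </it></sub>can be illustrated with a toy calculation. The sketch below assumes the uncorrected weight 2 + log2 f(b, l) for nucleotide b at position l and takes <it>R</it><sub><it>sequence </it></sub>as the mean <it>R</it><sub><it>i </it></sub>of the sites that define the matrix; it omits the small-sample correction applied by the <it>Delila </it>programs, and the four-nucleotide sites and function names are hypothetical.</p>

```python
import math
from collections import Counter


def weight_matrix(sites):
    """Per-position weights 2 + log2 f(b, l) from aligned, equal-length sites."""
    n = len(sites)
    return [
        {base: 2.0 + math.log2(count / n) for base, count in Counter(column).items()}
        for column in zip(*sites)
    ]


def individual_information(site, matrix):
    """Ri of one site: the sum of its per-position weights (in bits)."""
    # Only valid for bases observed in the model; unseen bases raise KeyError.
    return sum(weights[base] for base, weights in zip(site, matrix))


def r_sequence(sites):
    """Rsequence as the mean Ri of the sites that define the matrix."""
    matrix = weight_matrix(sites)
    return sum(individual_information(site, matrix) for site in sites) / len(sites)


initial_sites = ["ACGT", "ACGT", "ACGT"]
refined_sites = initial_sites + ["ACGA"]  # one additional, divergent site
print(r_sequence(initial_sites))
print(round(r_sequence(refined_sites), 2))
```

            <p>Adding the single divergent site lowers the toy <it>R</it><sub><it>sequence </it></sub>from 8.0 to about 7.19 bits, which is the direction of change described above for successive refinements.</p>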
            <p>Additional gene promoters are found with successive PXR/RXR&#945; models (Table <tblr tid="T1">1</tblr>). In each pairwise comparison of information models, novel binding sites detected by the later model substantially outnumbered unique sites found only by the earlier model (by 4 to 11.2 fold). Nevertheless, it is encouraging that the number of genes containing these binding sites does not increase in proportion to the number of binding sites, which suggests that the subsequent models are predicting additional sites in the same genes. This is not surprising, since multiple PXR/RXR&#945; enhancer binding elements with "moderate-to-strong" <it>R</it><sub><it>i </it></sub>values have been documented in known targets of this transcription factor, including several <it>CYP3A </it>gene family members. We examined the distributions of such sites in genome scans of promoters with the different PXR/RXR&#945; weight matrices.</p>
            <p>The "moderate-to-strong" binding sites in the genome-wide promoter scans (<it>R</it><sub><it>i </it></sub>><it>R</it><sub><it>sequence</it></sub>; Table <tblr tid="T2">2</tblr>) are a small percentage of all sites detected (0.06 % in Model 1, increasing to 0.5 % in Model 4). The refinement procedure may improve the sensitivity of detecting such sites. PXR/RXR&#945; models 1 and 2 actually detect <it>fewer </it>of these sites in gene promoters (and genes) than the numbers of genes that exhibit changes in expression by microarray studies <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr></abbrgrp>, suggesting that these models predict fewer binding sites, and consequently fewer target genes than expected. In subsequent models, increasingly higher frequencies of multiplex sites are found in the same promoters (8% in Model 1 versus 16% in Model 4). This degree of redundancy (in Model 4) substantially exceeds the expected frequency of promoters with multiple binding sites, and the information required to find these sites in the genome (<it>R</it><sub><it>frequency</it></sub>~ 4 bits). We also find that multiplex binding sites within promoters recognized by transcription factors with smaller footprints are considerably more frequent (NF-&#954;B p50/p65 and AHR), as expected from their lower <it>R</it><sub><it>sequence </it></sub>values.</p>
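            <p>The percentages of "moderate-to-strong" sites quoted above follow directly from the counts in Table 2. The short Python check below uses the tabulated counts for the four PXR/RXR&#945; models; the variable names are illustrative only.</p>

```python
# Counts for the PXR/RXR-alpha models taken from Table 2:
# model number -> (sites with Ri >= Ri,min, sites with Ri >= Rsequence)
table2_counts = {
    1: (3.48e5, 218),
    2: (4.97e5, 391),
    3: (1.10e6, 3393),
    4: (1.44e6, 7694),
}

# Percentage of all detected sites that reach or exceed Rsequence.
strong_site_percent = {
    model: 100.0 * strong / total
    for model, (total, strong) in table2_counts.items()
}

for model, percent in strong_site_percent.items():
    print("Model %d: %.2f%% of sites have Ri >= Rsequence" % (model, percent))
```

            <p>The values for models 1 and 4 round to 0.06% and 0.5%, in agreement with the figures given in the text.</p>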
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusions</p>
         </st>
         <p><it>Delila-Genome </it>can be used to scan eukaryotic genomes with information theory-based models for transcription factor and post-transcriptional protein binding sites and to display the most relevant sites. Complete scans of human genome draft sequences with information-weight matrices of transcription factor binding sites (PXR/RXR&#945;, AHR and NF-&#954;B p50/p65) and sequences required for mRNA splicing (donor, acceptor, and SC35 splicing enhancer protein binding sites) were completed within several hours on small Linux clusters. Binding sites can be visualized at either high or low sequence resolution juxtaposed with other genome annotation. The software can also be used to compare the distributions of predicted sites in multiple or successive binding site models. Refinement of successive binding site models should enable more accurate and specific predictions of site strength, which in turn, may facilitate discovery of novel regulatory gene targets and assist in the prediction of mRNA splicing patterns.</p>
      </sec>
      <sec>
         <st>
            <p>Availability and Requirements</p>
         </st>
         <p>&#8226; <b>Project Name: </b>Delila-Genome</p>
         <p>&#8226; <b>Project Home Page: </b><url>http://www.sice.umkc.edu/~roganp/Information/delgen.html</url></p>
         <p>&#8226; <b>Operating System(s):</b></p>
         <p>Server &#8211; Linux; can be ported to Unix/Solaris with little or no modification.</p>
         <p>Client [Front end] &#8211; Any system with JRE (Java Runtime Environment) 1.4 or higher installed</p>
         <p>&#8226; <b>Programming Language:</b></p>
         <p>Server &#8211; Perl, Pascal, C/C++, Bash shell scripts, Javascript</p>
         <p>Client [Front end] &#8211; Java</p>
         <p>&#8226; <b>Other requirements: </b>Individual information program package (for details, see <url>http://www.lecb.ncifcrf.gov/~toms/walker/iipp.html</url>)</p>
         <p>&#8226; <b>License</b>: Delila-Genome is deposited at <url>http://www.bioinformatics.org</url> under GNU GPL. The Individual Information programs are available from the National Cancer Institute via transfer agreements (see <url>http://www.lecb.ncifcrf.gov/~toms/contacts.html</url>). Linux binaries and the source code of the <it>Delila </it>programs are available to NCI-authorized users from the authors.</p>
         <p>&#8226; <b>Any restrictions to use by non-academics: </b>None</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>PKR developed and implemented the model refinement procedures and designed the <it>Delila-Genome </it>system. SG implemented the <it>Delila-Genome </it>architecture and wrote the code. SG and PKR have tested the system. PKR and JSL refined the AHR, NF-&#954;B, and PXR/RXR&#945; information models (PXR/RXR&#945; with CAV); PKR developed and refined the splice donor, acceptor and SC35 models. CAV and JSL validated the predicted PXR/RXR&#945; binding sites in the laboratory. All authors have approved the manuscript.</p>
      </sec>
      <sec>
         <st>
            <p>Descriptions of additional data files</p>
         </st>
         <p>A package of <it>Delila-Genome </it>software and documentation and <it>Delila </it>books of the human genome sequence assembly (April 2003) are available at <url>http://www.sice.umkc.edu/~roganp/Information/delgen.html</url>. Examples of HTML pages produced by <it>Delila-Genome </it>with corresponding BED custom tracks can also be downloaded from this website.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>This work was sponsored by grant ES 10855 from the National Institute of Environmental Health Sciences. We are grateful to Tom Schneider and Joan Knoll for their valuable comments on the manuscript. We thank Information Services at the University of Missouri-Kansas City for access to the Mosix cluster.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>A design for computer nucleic-acid-sequence storage, retrieval, and manipulation</p>
            </title>
            <aug>
               <au>
                  <snm>Schneider</snm>
                  <fnm>TD</fnm>
               </au>
               <au>
                  <snm>Stormo</snm>
                  <fnm>GD</fnm>
               </au>
               <au>
                  <snm>Haemer</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Gold</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1982</pubdate>
            <volume>10</volume>
            <fpage>3013</fpage>
            <lpage>3024</lpage>
            <xrefbib>
               <pubid idtype="pmpid">7099972</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>A mathematical theory of communication</p>
            </title>
            <aug>
               <au>
                  <snm>Shannon</snm>
                  <fnm>CE</fnm>
               </au>
            </aug>
            <source>Bell System Technical Journal</source>
            <pubdate>1948</pubdate>
            <volume>27</volume>
            <fpage>379</fpage>
            <lpage>423 and 623-656</lpage>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Sequence logos, machine/channel capacity, Maxwell's demon, and molecular computers: a review of the theory of molecular machines</p>
            </title>
            <aug>
               <au>
                  <snm>Schneider</snm>
                  <fnm>TD</fnm>
               </au>
            </aug>
            <source>Nanotechnology</source>
            <pubdate>1994</pubdate>
            <volume>5</volume>
            <fpage>1</fpage>
            <lpage>18</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1088/0957-4484/5/1/001</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Information content of individual genetic sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Schneider</snm>
                  <fnm>TD</fnm>
               </au>
            </aug>
            <source>J Theor Biol</source>
            <pubdate>1997</pubdate>
            <volume>189</volume>
            <fpage>427</fpage>
            <lpage>441</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jtbi.1997.0540</pubid>
                  <pubid idtype="pmpid" link="fulltext">9446751</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Sequence walkers: a graphical method to display how binding proteins interact with DNA or RNA sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Schneider</snm>
                  <fnm>TD</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1997</pubdate>
            <volume>25</volume>
            <fpage>4408</fpage>
            <lpage>4415</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/25.21.4408</pubid>
                  <pubid idtype="pmpid" link="fulltext">9336476</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Information analysis of human splice site mutations</p>
            </title>
            <aug>
               <au>
                  <snm>Rogan</snm>
                  <fnm>PK</fnm>
               </au>
               <au>
                  <snm>Faux</snm>
                  <fnm>BM</fnm>
               </au>
               <au>
                  <snm>Schneider</snm>
                  <fnm>TD</fnm>
               </au>
            </aug>
            <source>Hum Mutat</source>
            <pubdate>1998</pubdate>
            <volume>12</volume>
            <fpage>153</fpage>
            <lpage>171</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/(SICI)1098-1004(1998)12:3&lt;153::AID-HUMU3>3.3.CO;2-O</pubid>
                  <pubid idtype="pmpid" link="fulltext">9711873</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Modeling differential binding of NF-kB p50 to a CYP2D6 promotor variant by information theory [abstract]</p>
            </title>
            <aug>
               <au>
                  <snm>Hurwitz</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Svojanovsky</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Leeder</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Rogan</snm>
                  <fnm>PK</fnm>
               </au>
            </aug>
            <source>American Journal of Human Genetics</source>
            <pubdate>2001</pubdate>
            <volume>69</volume>
            <fpage>s476</fpage>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Modeling splice site and transcription factor binding site variation by information theory [abstract]</p>
            </title>
            <aug>
               <au>
                  <snm>Rogan</snm>
                  <fnm>PK</fnm>
               </au>
               <au>
                  <snm>Svojanovsky</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Hurwitz</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Schneider</snm>
                  <fnm>TD</fnm>
               </au>
               <au>
                  <snm>Leeder</snm>
                  <fnm>JS</fnm>
               </au>
            </aug>
            <source>American Journal of Human Genetics</source>
            <pubdate>2002</pubdate>
            <volume>71</volume>
            <fpage>s333</fpage>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Modeling PXR/RXR&#945; Binding Using Information Theory [abstract]</p>
            </title>
            <aug>
               <au>
                  <snm>Vyhlidal</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Rogan</snm>
                  <fnm>PK</fnm>
               </au>
               <au>
                  <snm>Leeder</snm>
                  <fnm>JS</fnm>
               </au>
            </aug>
            <source>7th Annual Meeting of the International Society for Study of Xenobiotics</source>
            <pubdate>2002</pubdate>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Anatomy of Escherichia coli ribosome binding sites</p>
            </title>
            <aug>
               <au>
                  <snm>Shultzaberger</snm>
                  <fnm>RK</fnm>
               </au>
               <au>
                  <snm>Bucheimer</snm>
                  <fnm>RE</fnm>
               </au>
               <au>
                  <snm>Rudd</snm>
                  <fnm>KE</fnm>
               </au>
               <au>
                  <snm>Schneider</snm>
                  <fnm>TD</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2001</pubdate>
            <volume>313</volume>
            <fpage>215</fpage>
            <lpage>228</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.2001.5040</pubid>
                  <pubid idtype="pmpid" link="fulltext">11601857</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Information theory-based analysis of CYP2C19, CYP2D6 and CYP3A5 splicing mutations</p>
            </title>
            <aug>
               <au>
                  <snm>Rogan</snm>
                  <fnm>PK</fnm>
               </au>
               <au>
                  <snm>Svojanovsky</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Leeder</snm>
                  <fnm>JS</fnm>
               </au>
            </aug>
            <source>Pharmacogenetics</source>
            <pubdate>2003</pubdate>
            <volume>13</volume>
            <fpage>207</fpage>
            <lpage>218</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1097/00008571-200304000-00005</pubid>
                  <pubid idtype="pmpid" link="fulltext">12668917</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Information analysis of Fis binding sites</p>
            </title>
            <aug>
               <au>
                  <snm>Hengen</snm>
                  <fnm>PN</fnm>
               </au>
               <au>
                  <snm>Bartram</snm>
                  <fnm>SL</fnm>
               </au>
               <au>
                  <snm>Stewart</snm>
                  <fnm>LE</fnm>
               </au>
               <au>
                  <snm>Schneider</snm>
                  <fnm>TD</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1997</pubdate>
            <volume>25</volume>
            <fpage>4994</fpage>
            <lpage>5002</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/25.24.4994</pubid>
                  <pubid idtype="pmpid" link="fulltext">9396807</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>OxyR and SoxRS regulation of fur</p>
            </title>
            <aug>
               <au>
                  <snm>Zheng</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Doan</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Schneider</snm>
                  <fnm>TD</fnm>
               </au>
               <au>
                  <snm>Storz</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>J Bacteriol</source>
            <pubdate>1999</pubdate>
            <volume>181</volume>
            <fpage>4639</fpage>
            <lpage>4643</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10419964</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome</p>
            </title>
            <aug>
               <au>
                  <snm>Berman</snm>
                  <fnm>BP</fnm>
               </au>
               <au>
                  <snm>Nibu</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Pfeiffer</snm>
                  <fnm>BD</fnm>
               </au>
               <au>
                  <snm>Tomancak</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Celniker</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Levine</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Rubin</snm>
                  <fnm>GM</fnm>
               </au>
               <au>
                  <snm>Eisen</snm>
                  <fnm>MB</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2002</pubdate>
            <volume>99</volume>
            <fpage>757</fpage>
            <lpage>762</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1073/pnas.231608898</pubid>
                  <pubid idtype="pmpid" link="fulltext">11805330</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>SCORE: a computational approach to the identification of cis-regulatory modules and target genes in whole-genome sequence data. Site clustering over random expectation</p>
            </title>
            <aug>
               <au>
                  <snm>Rebeiz</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Reeves</snm>
                  <fnm>NL</fnm>
               </au>
               <au>
                  <snm>Posakony</snm>
                  <fnm>JW</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2002</pubdate>
            <volume>99</volume>
            <fpage>9888</fpage>
            <lpage>9893</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1073/pnas.152320899</pubid>
                  <pubid idtype="pmpid" link="fulltext">12107285</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Characterization of human RNA splice signals by iterative functional selection of splice sites</p>
            </title>
            <aug>
               <au>
                  <snm>Lund</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Tange</snm>
                  <fnm>TO</fnm>
               </au>
               <au>
                  <snm>Dyhr-Mikkelsen</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Hansen</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Kjems</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>RNA</source>
            <pubdate>2000</pubdate>
            <volume>6</volume>
            <fpage>528</fpage>
            <lpage>544</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1017/S1355838200992033</pubid>
                  <pubid idtype="pmpid" link="fulltext">10786844</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Using sequence logos and information analysis of Lrp DNA binding sites to investigate discrepancies between natural selection and SELEX</p>
            </title>
            <aug>
               <au>
                  <snm>Shultzaberger</snm>
                  <fnm>RK</fnm>
               </au>
               <au>
                  <snm>Schneider</snm>
                  <fnm>TD</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1999</pubdate>
            <volume>27</volume>
            <fpage>882</fpage>
            <lpage>887</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/27.3.882</pubid>
                  <pubid idtype="pmpid" link="fulltext">9889287</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Delila programs documentation</p>
            </title>
            <aug>
               <au>
                  <snm>Schneider</snm>
                  <fnm>TD</fnm>
               </au>
            </aug>
            <source>http://www.lecb.ncifcrf.gov/~toms/delila/delilaprograms.html</source>
            <pubdate>2003</pubdate>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites</p>
            </title>
            <aug>
               <au>
                  <snm>Stephens</snm>
                  <fnm>RM</fnm>
               </au>
               <au>
                  <snm>Schneider</snm>
                  <fnm>TD</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1992</pubdate>
            <volume>228</volume>
            <fpage>1124</fpage>
            <lpage>1136</lpage>
            <xrefbib>
               <pubid idtype="pmpid">1474582</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Measuring molecular information</p>
            </title>
            <aug>
               <au>
                  <snm>Schneider</snm>
                  <fnm>TD</fnm>
               </au>
            </aug>
            <source>J Theor Biol</source>
            <pubdate>1999</pubdate>
            <volume>201</volume>
            <fpage>87</fpage>
            <lpage>92</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jtbi.1999.1012</pubid>
                  <pubid idtype="pmpid" link="fulltext">10534438</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Monitoring expression of genes involved in drug metabolism and toxicology using DNA microarrays</p>
            </title>
            <aug>
               <au>
                  <snm>Gerhold</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Lu</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Xu</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Austin</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Caskey</snm>
                  <fnm>CT</fnm>
               </au>
               <au>
                  <snm>Rushmore</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Physiol Genomics</source>
            <pubdate>2001</pubdate>
            <volume>5</volume>
            <fpage>161</fpage>
            <lpage>170</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11328961</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Rifampin is a selective, pleiotropic inducer of drug metabolism genes in human hepatocytes: studies with cDNA and oligonucleotide expression arrays</p>
            </title>
            <aug>
               <au>
                  <snm>Rae</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Johnson</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Lippman</snm>
                  <fnm>ME</fnm>
               </au>
               <au>
                  <snm>Flockhart</snm>
                  <fnm>DA</fnm>
               </au>
            </aug>
            <source>J Pharmacol Exp Ther</source>
            <pubdate>2001</pubdate>
            <volume>299</volume>
            <fpage>849</fpage>
            <lpage>857</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11714868</pubid>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
