<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1751-0473-2-2</ui>
   <ji>1751-0473</ji>
   <fm>
      <dochead>Software review</dochead>
      <bibl>
         <title>
            <p>HitKeeper, a generic software package for hit list management</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Hau</snm>
               <fnm>J&#246;rg</fnm>
               <insr iid="I1"/>
               <email>joerg.hau@rdls.nestle.com</email>
            </au>
            <au id="A2">
               <snm>Muller</snm>
               <fnm>Michael</fnm>
               <insr iid="I2"/>
               <email>michael.muller@gmail.com</email>
            </au>
            <au id="A3" ca="yes">
               <snm>Pagni</snm>
               <fnm>Marco</fnm>
               <insr iid="I3"/>
               <email>marco.pagni@isb-sib.ch</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Nestl&#233; Research Center, Department of BioAnalytical Science, PO Box 44, CH-1000 Lausanne 26, Switzerland</p>
            </ins>
            <ins id="I2">
               <p>EPFL Database Laboratory, CH-1015 Lausanne, Switzerland</p>
            </ins>
            <ins id="I3">
               <p>Swiss Institute of Bioinformatics, Vital-IT group, CH-1015 Lausanne, Switzerland</p>
            </ins>
         </insg>
         <source>Source Code for Biology and Medicine</source>
         <issn>1751-0473</issn>
         <pubdate>2007</pubdate>
         <volume>2</volume>
         <issue>1</issue>
         <fpage>2</fpage>
         <url>http://www.scfbm.org/content/2/1/2</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17391514</pubid>
               <pubid idtype="doi">10.1186/1751-0473-2-2</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>08</day>
               <month>2</month>
               <year>2007</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>28</day>
               <month>3</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>28</day>
               <month>3</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>Hau et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>The automated annotation of biological sequences (protein, DNA) relies on the computation of hits (predicted features) on the sequences using various algorithms. Public databases of biological sequences provide a wealth of biological "knowledge", for example manually validated annotations (features) that are located on the sequences, but mining the sequence annotations and especially the predicted and curated features requires dedicated tools. Due to the heterogeneity and diversity of the biological information, it is difficult to handle redundancy, frequent updates, taxonomic information and "private" data together with computational algorithms in a common workflow.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We present <it>HitKeeper</it>, a software package that controls the fully automatic handling of multiple biological databases and of hit list calculations on a large scale. The software implements an asynchronous update system that introduces updates and computes hits as soon as new data become available. A query interface enables the user to search sequences by specifying constraints, such as retrieving sequences that contain specific motifs, or a defined arrangement of motifs ("metamotifs"), or filtering based on the taxonomic classification of a sequence.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>The software provides a generic and modular framework to handle the redundancy and incremental updates of biological databases, and an original query language. It is published under the terms and conditions of version 2 of the GNU General Public License and is available at <url>http://hitkeeper.sourceforge.net</url>.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>The automated annotation of protein or DNA sequences is performed using a rather heterogeneous collection of motif predictors, which include regular expressions, generalized profiles, hidden Markov models and neural networks. Since the search for hits by a motif on a sequence is expensive in terms of processing time, the lists of hits obtained by comparing collections of motifs with collections of sequences are usually stored and distributed as dedicated databases. InterPro <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> is a canonical example of such a public resource that, by definition, covers only publicly available sequences and motifs.</p>
         <p>However, biological research is often carried out using sequences and motifs that are derived from public, as well as private, sources. There is a clear need to incorporate both sources of data into the same workflow; however, since the software used to generate, manage and keep the public data up-to-date is usually not available, it is difficult to reproduce and maintain similar hit lists locally.</p>
         <p>Further issues that complicate matters are the update frequency of public databases, which can lead to an almost continuous data flow, and the redundancy between different databases. For example, the same protein sequence may appear in different entries, in different databases, or in different releases of the same database. Since most computations are CPU-expensive, repeating the same computation should be avoided.</p>
         <p>To satisfy these requirements and to simplify the in-house management of data from very different sources in various formats, we have developed <it>HitKeeper</it>. It is a software package for the fully automatic handling of multiple sequence and motif databases, as well as classification (taxonomy) information, on a large scale. In addition, <it>HitKeeper </it>implements an elaborate and original query syntax to retrieve information. The distribution provides the core programs, a number of test scripts, and a manual with detailed instructions for the set-up of a pilot installation. Since the software architecture is designed to be customizable and extensible, it should be relatively easy for a user with some proficiency in the Perl programming language to introduce new data types and algorithms into the system.</p>
      </sec>
      <sec>
         <st>
            <p>Implementation</p>
         </st>
         <sec>
            <st>
               <p>Program architecture</p>
            </st>
            <p>The following description is focused on the essential features and algorithms of <it>HitKeeper</it>. The installation and a number of technical details are explained in-depth in the <it>Reference Manual</it>, which is part of the distribution package.</p>
            <p><it>HitKeeper </it>is a collection of scripts that interact as concurrent clients with a relational database management system (RDBMS). The software is mostly written in Perl and was developed and tested under various flavours of Linux and Mac OS X using MySQL as the RDBMS; it is currently being extended to other RDBMS.</p>
            <p>Fig. <figr fid="F1">1</figr> is a schematic representation of the logical organization of <it>HitKeeper</it>. Starting from the abstract concept of "data", the structure is built around three <it>kinds </it>whose properties are reflected in the organization of the software and that are hardcoded in the application: <it>seq</it>, biological sequences; <it>mot</it>, motifs for predicting hits on the sequences, and <it>cla</it>, hierarchical classification (currently limited to taxonomy).</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>HitKeeper ontology</p>
               </caption>
               <text>
                  <p><b>HitKeeper ontology. </b>Logical organization of the data and software. See text for discussion.</p>
               </text>
               <graphic file="1751-0473-2-2-1"/>
            </fig>
            <p>For each <it>kind</it>, <it>HitKeeper </it>allows multiple <it>types </it>of data to be handled. As an example, <it>seq </it>allows multiple types of sequence data, such as "pep" (for peptide) and "nuc" (for nucleotide). Similarly, <it>mot </it>may comprise the type "pattern" as well as "profile" or "HMM". All types, and all computation algorithms between them (<it>e.g</it>. which program is used to run a pattern search on a peptide sequence), are defined in a central configuration file. Besides some general parameters (database server, etc.), this file also holds the list of the modules and external programs that are used (a) for parsing the flat files, (b) for performing the actual computations, and (c) for dispatching and/or mirroring to any external computing elements. Custom modules written in Perl can be added, either derived from the existing modules or written "from scratch". Parsing of the input data is based on <it>lazy parsing </it>that extracts only the relevant information. This minimizes the amount of maintenance that might be induced by format changes in the source data. Custom modules can also be provided for the mirroring of the databases to external computing elements (<it>e.g</it>. formatting for a BLAST server).</p>
            <p>Five distinct clients are available. Three of them, <it>HKLoader</it>, <it>HKUpdater </it>and <it>HKPublisher</it>, are used for RDBMS housekeeping and control of the data flow. They operate concurrently in the background, similar to a system daemon. The two other scripts, <it>HKReader </it>and <it>HKAdmin</it>, are used to interact with the RDBMS. While the former is solely intended for querying the system, the latter also allows the administration of <it>HitKeeper</it>; as an example, the <it>HitKeeper </it>administrator defines interactively which database (UniProt, Prosite, etc.) is actually parsed and used for the calculations. Both scripts provide the functionality of a command-line tool, and the interactivity of a "shell-like" environment; they accept input from STDIN and can thus be controlled through other scripts and pipes. This allows automation and enables performing tasks in batch mode, either directly from the command line or by reading commands from a file.</p>
         </sec>
         <sec>
            <st>
               <p>Data lifecycle and computations</p>
            </st>
            <p>As mentioned above, <it>HitKeeper </it>reads three kinds of input data. Each is associated with a "pipeline" where several versions of a database, such as weekly releases, can coexist. However, only the version with the status 'current' is in the production stage and can be queried. Fig. <figr fid="F2">2</figr> illustrates how the <it>seq </it>and <it>mot </it>pipelines are synchronized with respect to the incremental updates of the hit list.</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Schematic representation of the sequence and motif pipelines</p>
               </caption>
               <text>
                  <p><b>Schematic representation of the sequence and motif pipelines. </b>Several successive versions of a given source database usually coexist at different stages in a pipeline. The databases are processed by three scripts running simultaneously, in a manner similar to a system daemon: <it>HKLoader </it>watches the source data files for changes (using the date/time stamp). This script is responsible for parsing and converting the raw data, detecting redundancy, and transferring the "clean" data into the SQL database. <it>HKUpdater </it>updates the hit list. Once a motif database enters the <it>prepare </it>state, the new motifs are computed against the sequences that are in the <it>current </it>state. Similarly, when a sequence database enters the <it>prepare </it>state, the new sequences are computed against the motifs that are in the <it>current </it>state. The two computational tasks, sequences-vs-motifs and motifs-vs-sequences, are never executed simultaneously &#8211; this keeps the two pipelines synchronized. Once the calculations are done, <it>HKPublisher </it>becomes responsible for the deployment of the databases to external computing elements (<it>e.g</it>. a BLAST server) and the database flagged as <it>ready </it>is promoted to <it>current </it>("in production"): all subsequent queries are now applied to this database. Previous versions can be kept as archives or deleted to reclaim space.</p>
               </text>
               <graphic file="1751-0473-2-2-2"/>
            </fig>
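            <p>As an illustration only, the lifecycle summarised in Fig. 2 can be sketched as a small state machine. The sketch below is written in Python and is not part of the <it>HitKeeper </it>distribution, which is written in Perl. The state names <it>prepare</it>, <it>ready </it>and <it>current </it>are taken from the figure legend; the state "archived", the class and function names, and the release labels are assumptions made for this example. A single lock stands in for whatever mechanism keeps the two computation directions from running at the same time.</p>
            <p><![CDATA[
```python
import threading
from dataclasses import dataclass


@dataclass
class DbVersion:
    name: str      # database label, e.g. "sw" or "pat"
    kind: str      # "seq" or "mot"
    release: str   # hypothetical release label
    # 'prepare', 'ready' and 'current' follow the figure legend;
    # 'archived' is an assumed label for a superseded version.
    state: str = "prepare"


# One lock shared by both directions, so that sequences-vs-motifs and
# motifs-vs-sequences are never computed at the same time.
compute_lock = threading.Lock()


def update_hits(new_db, partners, compute):
    """Compute a database in 'prepare' against partners that are 'current'."""
    if new_db.state != "prepare":
        raise ValueError("only a database in 'prepare' can be computed")
    with compute_lock:
        for partner in partners:
            if partner.state == "current" and partner.kind != new_db.kind:
                compute(new_db, partner)
    new_db.state = "ready"


def publish(ready_db, pipeline):
    """Promote a 'ready' version to 'current' and retire the previous one."""
    if ready_db.state != "ready":
        raise ValueError("only a database in 'ready' can be published")
    for old in pipeline:
        if old.name == ready_db.name and old.state == "current":
            old.state = "archived"
    ready_db.state = "current"


# Usage: a new sequence release replaces the one in production.
old_sw = DbVersion("sw", "seq", "2007_01", state="current")
new_sw = DbVersion("sw", "seq", "2007_02")
prosite = DbVersion("pat", "mot", "20.0", state="current")

calls = []
update_hits(new_sw, [prosite, old_sw],
            lambda a, b: calls.append((a.release, b.name)))
publish(new_sw, [old_sw, new_sw])
print(calls, new_sw.state, old_sw.state)
```
]]></p>
            <p>In this sketch the new release is computed only against the motif database in production, and queries would see it only after the call to the hypothetical publish function.</p>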
            <p>Computations are set up on a per-database basis, so that <it>all </it>entries in a given sequence database are expected to be calculated against <it>all </it>entries in a motif database. However, not all <it>databases </it>are necessarily calculated against each other: the software uses a "subscription" model, defining which database pairs are to be calculated. In this way, it is possible to set up calculations as needed and to adjust the allocation of computing resources.</p>
            <p>All hit list computations are performed by calling external software, <it>i.e</it>. they are not hardcoded in <it>HitKeeper</it>. A simple implementation of a pattern-matching algorithm is provided and can be used as a template for custom extensions.</p>
            <p>If a sequence or motif database is updated, repeating the same computations for sequences or motifs that have not changed should be avoided. This is the purpose of the incremental update algorithm in <it>HitKeeper</it>. The algorithm is identical for sequence and motif database updates. Complications arise from the optional subscriptions and from the handling of redundancy; a typical case handled by our algorithm is outlined in Fig. <figr fid="F3">3</figr>.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Example of redundancy management</p>
               </caption>
               <text>
                  <p><b>Example of redundancy management. </b>This example [9] uses four motif (MA ... MD) and four sequence databases (SA ... SD). The upper part of the figure corresponds to all data that are currently in production. The table on the upper left represents the different databases and some of their individual entries (horizontal and vertical lines). A yellow rectangle symbolizes a subscription for computation. The small table on the right-hand side represents five "non-redundant" sequences ("Snr", arranged in columns) and three motifs ("Mnr", in rows). The computations between individual sequences and motifs are symbolised with black crosses; these do not necessarily signal the <it>presence </it>of a match, but indicate that the necessary calculations have been performed. - The bottom part of the figure shows a new version of sequence database SB that is being prepared to replace the current version. Sequence s0 will be deleted from database SB, sequence s5 will be inserted, s3 and s4 are new to SB but already present in other databases. The computations that must be performed are indicated by the red crosses. Note that there is no need to compute s4 against m3, since it was already present in SD which, in turn, is already subscribed to MD. - The same principle applies for updating a database of motifs.</p>
               </text>
               <graphic file="1751-0473-2-2-3"/>
            </fig>
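            <p>The principle behind Fig. 3 can be illustrated with plain set operations. The following Python sketch is not the algorithm implemented in <it>HitKeeper</it>; it only assumes that every database is a set of non-redundant identifiers and that a subscription is a pair of database names. The database contents and the function names are invented for this example, although they are chosen so that s4 versus m3 is skipped for the reason given in the figure legend.</p>
            <p><![CDATA[
```python
from itertools import product

# Hypothetical contents, loosely modelled on the situation in Fig. 3.
# Identifiers stand for non-redundant sequences and motifs.
seq_dbs = {
    "SA": {"s3"},
    "SB": {"s0", "s1", "s2"},   # version currently in production
    "SD": {"s4"},
}
mot_dbs = {"MB": {"m1", "m2"}, "MD": {"m3"}}
subscriptions = {("SA", "MB"), ("SB", "MB"), ("SB", "MD"), ("SD", "MD")}


def computed_pairs(seq_dbs, mot_dbs, subscriptions):
    """All (sequence, motif) pairs covered by subscriptions in production."""
    done = set()
    for seq_db, mot_db in subscriptions:
        done.update(product(seq_dbs[seq_db], mot_dbs[mot_db]))
    return done


def pairs_for_update(db_name, new_entries, seq_dbs, mot_dbs, subscriptions):
    """Pairs still to be computed before the new version can go live."""
    done = computed_pairs(seq_dbs, mot_dbs, subscriptions)
    required = set()
    for seq_db, mot_db in subscriptions:
        if seq_db == db_name:
            required.update(product(new_entries, mot_dbs[mot_db]))
    return required - done


# New version of SB: s0 is dropped, s5 is new, s3 and s4 exist elsewhere.
new_sb = {"s1", "s2", "s3", "s4", "s5"}
todo = pairs_for_update("SB", new_sb, seq_dbs, mot_dbs, subscriptions)
for pair in sorted(todo):
    print(pair)
```
]]></p>
            <p>With these invented data, six of the fifteen pairs required by the subscriptions of SB remain to be computed; the pair (s4, m3) is not among them because it is already covered by the subscription of SD to MD.</p>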
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <sec>
            <st>
               <p>Installation, validation and scalability</p>
            </st>
            <p>The prerequisites for the installation of <it>HitKeeper </it>are the availability of a MySQL server and a few Perl modules from CPAN; in our experience, the presence of the system administrator is preferable at this stage. The deployment of a <it>HitKeeper </it>installation as such is essentially performed through a shell script within a few seconds.</p>
            <p>The validation of a <it>HitKeeper </it>installation concerns in particular the incremental updates and the query mechanism. Two tests are provided in the distribution and implemented as shell scripts, thus emulating commands as they would be typed by the user instead of querying the RDBMS directly. They should be run as "operational qualification tests" and will verify the correct behaviour of the parser, computation engine, incremental update, and query mechanism.</p>
            <p>Historically, <it>HitKeeper </it>was developed as the "back end" of the <it>MyHits </it>web site <abbrgrp><abbr bid="B2">2</abbr></abbrgrp> as depicted in Fig. <figr fid="F4">4</figr>. <it>MyHits </it>has been in production status since 2003 and currently handles more than 7 million non-redundant sequences with weekly updates from a number of major databases, and 21 million hits on these (Table <tblr tid="T1">1</tblr>). Thus, <it>HitKeeper </it>can be considered to be robust and scalable.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Schematic structure of the MyHits webserver</p>
               </caption>
               <text>
                  <p><b>Schematic structure of the MyHits webserver. </b>The tasks provided by <it>HitKeeper </it>are shown in blue. Services that provide infrastructure (MySQL, Apache) are displayed in green, and computing services in pink. The different tasks are distributed over different hosts, and synchronization of data is controlled by <it>HitKeeper</it>.</p>
               </text>
               <graphic file="1751-0473-2-2-4"/>
            </fig>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Turnover of sequence data</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>db versions</p>
                     </c>
                     <c ca="center">
                        <p>total entries</p>
                     </c>
                     <c ca="center">
                        <p>sequences</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>current</p>
                     </c>
                     <c ca="center">
                        <p>39</p>
                     </c>
                     <c ca="center">
                        <p>~ 7 &#183; 10<sup>6</sup></p>
                     </c>
                     <c ca="center">
                        <p>~ 5.7 &#183; 10<sup>6</sup></p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>total over 9 months</p>
                     </c>
                     <c ca="center">
                        <p>545</p>
                     </c>
                     <c ca="center">
                        <p>~ 122 &#183; 10<sup>6</sup></p>
                     </c>
                     <c ca="center">
                        <p>~ 8.3 &#183; 10<sup>6</sup></p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>ratio</p>
                     </c>
                     <c ca="center">
                        <p>0.07</p>
                     </c>
                     <c ca="center">
                        <p>0.06</p>
                     </c>
                     <c ca="center">
                        <p>0.69</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Numbers of protein database versions, entries and sequences in <it>HitKeeper </it>behind the <it>MyHits </it>website, counted at the end of a nine-month period and accumulated over the same time. The ratios indicate frequent updates of the database source files (545 individual releases for a final count of 39 database versions), frequent modifications of the annotations for the complete entries, but a much lower rate of sequence changes (69% remain unchanged over time).</p>
               </tblfn>
            </tbl>
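            <p>The ratio row of Table 1 is the "current" row divided by the "total over 9 months" row, column by column. The short Python calculation below reproduces it from the rounded values printed in the table; the variable names are chosen for this example only.</p>
            <p><![CDATA[
```python
# Rounded values as printed in Table 1.
current = {"db versions": 39, "total entries": 7e6, "sequences": 5.7e6}
cumulative = {"db versions": 545, "total entries": 122e6, "sequences": 8.3e6}

# Ratio of what is in production to everything seen over nine months.
ratios = {key: round(current[key] / cumulative[key], 2) for key in current}

for key, value in ratios.items():
    print(f"{key}: {value}")
```
]]></p>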
         </sec>
         <sec>
            <st>
               <p>Queries</p>
            </st>
            <p><it>HitKeeper </it>implements an elaborate and original query syntax to retrieve information. Besides support for logical operators (OR, AND, NOT), <it>HitKeeper </it>allows sequences to be retrieved with logical constraints on the arrangement of the motifs found along the sequence. An expression that specifies such a particular "motif of motifs" is called a <it>metamotif</it>. Metamotif queries are expressed in a grammar that is specific to <it>HitKeeper</it>, yet human readable. This grammar is parsed and then compiled into an SQL query. The metamotif query language was inspired by our experience with <it>mmsearch </it><abbrgrp><abbr bid="B3">3</abbr></abbrgrp> and <it>twofeat </it><abbrgrp><abbr bid="B4">4</abbr></abbrgrp>.</p>
            <p>While presenting the full query capabilities of <it>HitKeeper </it>is beyond the scope of this paper, some typical examples of the query language are given below. The setup of the following example dataset is described fully in the Reference Manual; hits can be calculated overnight on a standard Linux workstation. We make use of three common databases: the Swiss-Prot protein sequences (designated with sw hereafter) <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, the Prosite patterns (pat) <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, and the NCBI taxonomy data (taxid) <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>. An additional database of "virtual" motifs is automatically derived from the Swiss-Prot "FT" lines (feature) with a script that produces more than 800 of these 'virtual motifs'. The computation yields two 'hit' lists (Swiss-Prot <it>vs </it>Prosite, and Swiss-Prot <it>vs </it>the ft motifs), and a single 'hat' list (<it>i.e</it>. match between sequence and classification): Swiss-Prot <it>vs </it>NCBI taxonomy. The latter is used for filtering by taxonomy.</p>
            <p>The original text entries can be retrieved using alternative but unique designations. As an example, the Prosite entry for the pattern with id CORNICHON can be retrieved using its name, ID, or accession number:</p>
            <p>&#160;&#160;&#160;mot_fetch_entry pat:CORNICHON</p>
            <p>&#160;&#160;&#160;mot_fetch_entry PS01340</p>
            <p>Queries can be "stored" using a query identifier, indicated by -ref=... in the examples below. These identifiers are used to repeat, refine or even string together queries. The following example will refer to all bird sequences from Swiss-Prot:</p>
            <p>&#160;&#160;&#160;hat_query cla_parent=Aves seq_source=sw -ref=$BIRDSEQ</p>
            <p>Re-using the same query, count the sequences and the species, then retrieve the non-redundant sequences of all birds in Swiss-Prot:</p>
            <p>&#160;&#160;&#160;query_stat $BIRDSEQ</p>
            <p>&#160;&#160;&#160;seq_fetch_nr $BIRDSEQ</p>
            <p>Since the taxonomy data are available, it is easy to list all species covered by that query:</p>
            <p>&#160;&#160;&#160;cla_fetch_desc $BIRDSEQ</p>
            <p>A list of all matches of Prosite patterns against these sequences can be obtained as follows:</p>
            <p>&#160;&#160;&#160;hit_query seq_name=$BIRDSEQ mot_source=pat</p>
            <p><it>HitKeeper </it>also has the capability to perform negative matching, such as finding all bird sequences with <it>no </it>match by any Prosite pattern:</p>
            <p>&#160;&#160;&#160;hit_query seq_source=sw mot_source=pat -ref=$PROSITE</p>
            <p>&#160;&#160;&#160;seq_query seq_name=$BIRDSEQ &#160;not_seq_name=$PROSITE</p>
            <p>A simple example is to retrieve all existing hits for the protein VAV_RAT [Swiss-Prot:<ext-link ext-link-type="sprot" ext-link-id="P54100">P54100</ext-link>]:</p>
            <p>&#160;&#160;&#160;hit_query seq_name=sw:VAV_RAT</p>
            <p>Note that these include DOMAIN sh2 (one hit) and DOMAIN sh3 (two hits) in a particular arrangement. To search for all sequences that show a similar arrangement, a metamotif query with the <it>followed by </it>operator <it>~~ </it>is used:</p>
            <p>&#160;&#160;&#160;mom_query (DOMAIN_sh3~~ DOMAIN_sh2~~ DOMAIN_sh3)</p>
            <p>At the time of writing, there are about 22 proteins meeting this criterion in Swiss-Prot.</p>
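            <p>To make the idea of a <it>followed by </it>constraint concrete, the Python sketch below tests whether the hits on one sequence contain a given chain of motifs in order. It is an illustration and not the way <it>HitKeeper </it>evaluates metamotifs, which are compiled into SQL. It assumes that "A followed by B" means that the hit by B starts after the hit by A ends and that other hits may lie in between; the exact semantics of the <it>~~ </it>operator are defined in the Reference Manual. The coordinates and sequence names are invented.</p>
            <p><![CDATA[
```python
from typing import NamedTuple


class Hit(NamedTuple):
    motif: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive


def matches_chain(hits, chain):
    """True if the hits contain the motifs of `chain` in order, no overlap."""
    position = 0  # end coordinate of the previously matched hit
    for motif in chain:
        candidates = [h for h in hits
                      if h.motif == motif and h.start > position]
        if not candidates:
            return False
        # Taking the earliest-ending candidate leaves the most room
        # for the remaining motifs of the chain.
        position = min(h.end for h in candidates)
    return True


chain = ["DOMAIN_sh3", "DOMAIN_sh2", "DOMAIN_sh3"]

# Invented coordinates: sh3, then sh2, then sh3.
example_a = [Hit("DOMAIN_sh3", 10, 60),
             Hit("DOMAIN_sh2", 80, 170),
             Hit("DOMAIN_sh3", 190, 250)]

# Invented coordinates: sh2 first, then two sh3 domains.
example_b = [Hit("DOMAIN_sh2", 10, 100),
             Hit("DOMAIN_sh3", 120, 180),
             Hit("DOMAIN_sh3", 200, 260)]

print(matches_chain(example_a, chain))
print(matches_chain(example_b, chain))
```
]]></p>
            <p>Under these assumptions the first invented sequence satisfies the chain and the second does not, because no sh2 hit starts after its first sh3 hit.</p>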
            <p>As another example, Prosite has a pattern pat:THIOREDOXIN that targets the active site of the thioredoxin domain [Prosite:PS00194]. In Swiss-Prot, the thioredoxin domain itself is annotated and was extracted from the FT line as ft:DOMAIN_Thioredoxin. The hit by the pattern is usually present within the domain annotation, but not always. In addition, some domain annotations do <it>not </it>include an active site that the pattern would match. The analysis is not straightforward since many proteins have multiple hits with the pattern and domain annotations. The following commands were used to obtain the counts as shown in Table <tblr tid="T2">2</tblr>:</p>
            <p>First, a hit query for hits by either of the two motifs pat:THIOREDOXIN or ft:DOMAIN_Thioredoxin is performed and is saved under the query identifier $all_hit. The comma between the two motif names is the <it>OR </it>operator:</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Match counts</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="center">
                        <p>#sequences</p>
                     </c>
                     <c ca="center">
                        <p>#motifs</p>
                     </c>
                     <c ca="center">
                        <p>#hits</p>
                     </c>
                     <c ca="center">
                        <p>condition</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>374</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>737</p>
                     </c>
                     <c ca="center">
                        <p>
                           <monospace>X OR Y</monospace>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>231</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>542</p>
                     </c>
                     <c ca="center">
                        <p><monospace>X /&lt;/ Y</monospace> metamotif</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>81</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>81</p>
                     </c>
                     <c ca="center">
                        <p>
                           <monospace>X NOT IN (X /&lt;/ Y)</monospace>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>68</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>72</p>
                     </c>
                     <c ca="center">
                        <p>
                           <monospace>Y NOT IN (X /&lt;/ Y)</monospace>
                        </p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Counts documenting the occurrence of a match by a THIOREDOXIN pattern (X), and by an annotated Thioredoxin domain (Y) in the "FT" lines of Swiss-Prot. The metamotif detects only those matches where X is located within Y. The numbers may change with a different release of Swiss-Prot.</p>
               </tblfn>
            </tbl>
            <p>&#160;&#160;&#160;hit_query -ref=$all_hit &#160;mot_name=pat:THIOREDOXIN,ft:DOMAIN_Thioredoxin</p>
            <p>Next, a search for hits is carried out where the motif pat:THIOREDOXIN is 'embedded' in the motif ft:DOMAIN_Thioredoxin. Since there are proteins with multiple thioredoxin domains, a metamotif with the <it>is included </it>operator /&lt;/ was used to associate the pattern and the annotation. As this is a metamotif query, mom_query is used instead of hit_query:</p>
            <p>&#160;&#160;&#160;mom_query (pat:THIOREDOXIN) /&lt;/ (ft:DOMAIN_Thioredoxin) &#160;&#160;-ref=$mom_hit</p>
            <p>In a third step, hits by the motif pat:THIOREDOXIN that are not among those matching the metamotif are identified:</p>
            <p>&#160;&#160;&#160;hit_query mot_name=pat:THIOREDOXIN &#160;not_hit_list=$mom_hit -ref=$pat_not_mom</p>
            <p>and the last dataset consists of hits with the motif ft:DOMAIN_Thioredoxin, but that are not in the metamotif hit list:</p>
            <p>&#160;&#160;&#160;hit_query mot_name=ft:DOMAIN_Thioredoxin &#160;&#160;not_hit_list=$mom_hit -ref=$ft_not_mom</p>
            <p>Finally, the results of all four queries are reported:</p>
            <p>&#160;&#160;&#160;query_stat $all_hit $mom_hit $ft_not_mom &#160;$pat_not_mom</p>
            <p>The results are summarised in Table <tblr tid="T2">2</tblr>: 81 pattern matches are not included in any domain match, whereas 72 annotated domains contain no corresponding pattern match. This last example typically executes in only a few seconds.</p>
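            <p>The containment test behind this comparison can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation used by <it>HitKeeper</it>: the Hit record, the helper names and the coordinates are hypothetical, and the toy data are invented for the example.</p>

```python
from collections import namedtuple

# Hypothetical hit record (an assumption, not HitKeeper's schema):
# a motif match on a sequence, with 1-based inclusive coordinates.
Hit = namedtuple("Hit", ["seq", "motif", "start", "end"])


def is_included(x, y):
    """True when hit x lies entirely within hit y on the same sequence."""
    return x.seq == y.seq and x.start >= y.start and y.end >= x.end


def partition(x_hits, y_hits):
    """Split X and Y hits by whether they take part in an 'X included in Y' pair."""
    pairs = [(x, y) for x in x_hits for y in y_hits if is_included(x, y)]
    x_in = {x for x, _ in pairs}
    y_in = {y for _, y in pairs}
    x_only = [x for x in x_hits if x not in x_in]
    y_only = [y for y in y_hits if y not in y_in]
    return pairs, x_only, y_only


# Toy data, invented for illustration only.
pattern_hits = [
    Hit("P1", "pat:THIOREDOXIN", 30, 48),
    Hit("P2", "pat:THIOREDOXIN", 10, 28),
    Hit("P3", "pat:THIOREDOXIN", 200, 218),
]
domain_hits = [
    Hit("P1", "ft:DOMAIN_Thioredoxin", 5, 110),
    Hit("P3", "ft:DOMAIN_Thioredoxin", 1, 100),
    Hit("P4", "ft:DOMAIN_Thioredoxin", 1, 100),
]

pairs, pat_not_mom, ft_not_mom = partition(pattern_hits, domain_hits)
print(len(pairs), len(pat_not_mom), len(ft_not_mom))
```

            <p>In this sketch, the three lists play the roles of the metamotif hits, the pattern matches outside any domain, and the domains without an included pattern match.</p>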
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p><it>HitKeeper </it>provides a generic, modular and extensible framework to handle the redundancy and incremental updates of biological databases and calculations between them. It allows any user to manage their own "private" collections of protein sequences and motifs, in addition to the public ones. <it>HitKeeper </it>implements an elaborate query syntax to retrieve information. These queries let the user constrain protein searches, for example by retrieving sequences that contain specific motifs or a defined arrangement of motifs ("metamotifs"), or by selecting sequences according to their classification.</p>
         <p>While it is not a "ready-to-use" annotation software, the system is designed to be modular, extensible and scalable. New data formats can easily be incorporated by writing custom parsers. The command-line interface of <it>HitKeeper </it>allows straightforward integration and interaction with standard tools in the Unix environment, such as scripting, piping, etc.</p>
         <p><it>HitKeeper </it>is used in production as the "back-end" of the <it>MyHits </it>web site. Hence it is actively maintained; bug fixes and new functionalities are added to the distribution on a regular basis.</p>
      </sec>
      <sec>
         <st>
            <p>Availability and requirements</p>
         </st>
         <p>Project name: HitKeeper</p>
         <p>Project home page: <url>http://hitkeeper.sourceforge.net</url></p>
         <p>Operating system: Linux, Mac OS X, Solaris</p>
         <p>Programming language: Perl, bash, SQL</p>
         <p>Other requirements: MySQL 4.1 or higher, a few Perl modules from CPAN</p>
         <p>License: GNU General Public License version 2</p>
         <p>Any restriction to use by non-academics: None</p>
      </sec>
      <sec>
         <st>
            <p>Competing interests</p>
         </st>
         <p>The authors declare that they have no competing interests.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>MP had the original idea and implemented most of the software. MM investigated the incremental update algorithm and the query language <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. JH developed the setup and testing procedures and wrote the documentation. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We thank the many people who have contributed, be it with suggestions, testing, discussions or code, to the development of <it>HitKeeper</it>: Vassilios Ioannidis, Laurent Falquet, Lorenzo "Luli" Cerutti, Heinz Stockinger, Monique Zahn-Zabal, Brian Stevenson, Dmitry Kuznetsov, Christelle Vangenot, Fabio Porto and Victor Jongeneel. Funding to pay the publication charges was provided by the Swiss Institute of Bioinformatics. MP acknowledges financial support from EMBRACE. The EMBRACE project is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health", contract number LHSG-CT-2004-512092.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>InterPro, progress and status in 2005</p>
            </title>
            <aug>
               <au>
                  <snm>Mulder</snm>
                  <fnm>NJ</fnm>
               </au>
               <au>
                  <snm>Apweiler</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Attwood</snm>
                  <fnm>TK</fnm>
               </au>
               <au>
                  <snm>Bairoch</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bateman</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Binns</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Bradley</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Bork</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Bucher</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Cerutti</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Copley</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Courcelle</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Das</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Durbin</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Fleischmann</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Gough</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Haft</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Harte</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Hulo</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Kahn</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Kanapin</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Krestyaninova</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Lonsdale</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Lopez</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Letunic</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Madera</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Maslen</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>McDowall</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Mitchell</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Nikolskaya</snm>
                  <fnm>AN</fnm>
               </au>
               <au>
                  <snm>Orchard</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Pagni</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Ponting</snm>
                  <fnm>CP</fnm>
               </au>
               <au>
                  <snm>Quevillon</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Selengut</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Sigrist</snm>
                  <fnm>CJA</fnm>
               </au>
               <au>
                  <snm>Silventoinen</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Studholme</snm>
                  <fnm>DJ</fnm>
               </au>
               <au>
                  <snm>Vaughan</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Wu</snm>
                  <fnm>CH</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2005</pubdate>
            <volume>33</volume>
            <fpage>D201</fpage>
            <lpage>205</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">540060</pubid>
                  <pubid idtype="pmpid" link="fulltext">15608177</pubid>
                  <pubid idtype="doi">10.1093/nar/gki106</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>MyHits: a new interactive resource for protein annotation and domain identification</p>
            </title>
            <aug>
               <au>
                  <snm>Pagni</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Ioannidis</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Cerutti</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Zahn-Zabal</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Jongeneel</snm>
                  <fnm>CV</fnm>
               </au>
               <au>
                  <snm>Falquet</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <fpage>W332</fpage>
            <lpage>335</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">441617</pubid>
                  <pubid idtype="pmpid" link="fulltext">15215405</pubid>
                  <pubid idtype="doi">10.1093/nar/gkh479</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>mmsearch: a motif arrangement language and search program</p>
            </title>
            <aug>
               <au>
                  <snm>Junier</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Pagni</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Bucher</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2001</pubdate>
            <volume>17</volume>
            <fpage>1234</fpage>
            <lpage>1235</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/17.12.1234</pubid>
                  <pubid idtype="pmpid" link="fulltext">11751236</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>EMBOSS: the European Molecular Biology Open Software Suite</p>
            </title>
            <aug>
               <au>
                  <snm>Rice</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Longden</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Bleasby</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Trends Genet</source>
            <pubdate>2000</pubdate>
            <volume>16</volume>
            <issue>6</issue>
            <fpage>276</fpage>
            <lpage>7</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0168-9525(00)02024-2</pubid>
                  <pubid idtype="pmpid" link="fulltext">10827456</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>The Universal Protein Resource (UniProt): an expanding universe of protein information</p>
            </title>
            <aug>
               <au>
                  <snm>Wu</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Apweiler</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Bairoch</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Natale</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Barker</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Boeckmann</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Ferro</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Gasteiger</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Huang</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Lopez</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Magrane</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Mazumder</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>O'Donovan</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Redaschi</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Suzek</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2006</pubdate>
            <volume>34</volume>
            <issue>Database issue</issue>
            <fpage>D187</fpage>
            <lpage>91</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1347523</pubid>
                  <pubid idtype="pmpid" link="fulltext">16381842</pubid>
                  <pubid idtype="doi">10.1093/nar/gkj161</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>The PROSITE database</p>
            </title>
            <aug>
               <au>
                  <snm>Hulo</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Bairoch</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bulliard</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Cerutti</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>De Castro</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Langendijk-Genevaux</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Pagni</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Sigrist</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2006</pubdate>
            <volume>34</volume>
            <issue>Database issue</issue>
            <fpage>D227</fpage>
            <lpage>30</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1347426</pubid>
                  <pubid idtype="pmpid" link="fulltext">16381852</pubid>
                  <pubid idtype="doi">10.1093/nar/gkj063</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Database resources of the National Center for Biotechnology Information</p>
            </title>
            <aug>
               <au>
                  <snm>Wheeler</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Chappey</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Lash</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Leipe</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Schuler</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Tatusova</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Rapp</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2000</pubdate>
            <volume>28</volume>
            <fpage>10</fpage>
            <lpage>4</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">102437</pubid>
                  <pubid idtype="pmpid" link="fulltext">10592169</pubid>
                  <pubid idtype="doi">10.1093/nar/28.1.10</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>GenBank</p>
            </title>
            <aug>
               <au>
                  <snm>Benson</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Karsch-Mizrachi</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Ostell</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Rapp</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Wheeler</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2000</pubdate>
            <volume>28</volume>
            <fpage>15</fpage>
            <lpage>8</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">102453</pubid>
                  <pubid idtype="pmpid" link="fulltext">10592170</pubid>
                  <pubid idtype="doi">10.1093/nar/28.1.15</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Analysis, design and implementation of improved queries on an integrated biological database</p>
            </title>
            <aug>
               <au>
                  <snm>Muller</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Master's thesis</source>
            <pubdate>2005</pubdate>
         </bibl>
      </refgrp>
   </bm>
</art>
