<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-8-172</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Database</dochead>
      <bibl>
         <title>
            <p>The CRISPRdb database and tools to display CRISPRs and to generate dictionaries of spacers and repeats</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Grissa</snm>
               <fnm>Ibtissem</fnm>
               <insr iid="I1"/>
               <email>ibtissem.grissa@igmors.u-psud.fr</email>
            </au>
            <au id="A2">
               <snm>Vergnaud</snm>
               <fnm>Gilles</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>gilles.vergnaud@igmors.u-psud.fr</email>
            </au>
            <au id="A3" ca="yes">
               <snm>Pourcel</snm>
               <fnm>Christine</fnm>
               <insr iid="I1"/>
               <email>christine.pourcel@igmors.u-psud.fr</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Univ Paris-Sud, Institut de G&#233;n&#233;tique et Microbiologie, UMR 8621, Orsay, F-91405, France; CNRS, Orsay, F-91405, France</p>
            </ins>
            <ins id="I2">
               <p>Centre d'Etudes du Bouchet, 5 rue Lavoisier, 91710 Vert le Petit, France</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2007</pubdate>
         <volume>8</volume>
         <issue>1</issue>
         <fpage>172</fpage>
         <url>http://www.biomedcentral.com/1471-2105/8/172</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17521438</pubid>
               <pubid idtype="doi">10.1186/1471-2105-8-172</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>05</day>
               <month>1</month>
               <year>2007</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>23</day>
               <month>5</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>23</day>
               <month>5</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>Grissa et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>In Archaea and Bacteria, the repeated elements called CRISPRs, for "clustered regularly interspaced short palindromic repeats", are believed to participate in the defence against viruses. Short sequences called spacers are stored in between the repeated elements. In the current model, motifs comprising spacers and repeats may target invading DNA and lead to its degradation through a proposed mechanism similar to RNA interference. Analysis of intra-species polymorphism shows that new motifs (one spacer and one repeated element) are added in a polarised fashion. Although their principal characteristics have been described, much remains to be discovered about the way CRISPRs are created and evolve. As new genome sequences become available, automated scanning tools are needed to make CRISPR-related information available and to facilitate additional investigations.</p>
            </sec>
            <sec>
               <st>
                  <p>Description</p>
               </st>
               <p>We have produced a program, CRISPRFinder, which identifies CRISPRs and extracts the repeated and unique sequences. Using this software, we have constructed a database that is automatically updated monthly from newly released genome sequences. Additional tools were created to align flanking sequences in search of similarities between different loci and to build dictionaries of unique sequences. To date, almost six hundred CRISPRs have been identified in 475 published genomes. Two Archaea out of thirty-seven and about half of Bacteria do not possess a CRISPR. Fine analysis of repeated sequences strongly supports the current view that new motifs are added at one end of the CRISPR adjacent to the putative promoter.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>It is hoped that the availability of a public database that is regularly updated and can be queried on the web will help in further dissecting and understanding the evolution of CRISPR structures and their flanking sequences. Subsequent analyses of intra-species CRISPR polymorphism will be facilitated by CRISPRFinder and the dictionary creator. CRISPRdb is accessible at <url>http://crispr.u-psud.fr/crispr</url>.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="bmc" subtype="user_supplied_xml" id="endnote"/>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Clustered regularly interspaced short palindromic repeats (CRISPRs) have been described in a wide range of prokaryotes, including the majority of Archaea and many Bacteria. They consist of a succession of 24&#8211;47 bp repeated sequences (often called direct repeats or DR) separated by unique sequences of a similar length (spacers) <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>. <it>Bona fide </it>CRISPRs possess at one end a partial DR and at the other end, after the last DR, a sequence of about 200 bp called the leader <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. The origin of the spacers is still largely unknown but several recent studies identified some of them as fragments of foreign DNA, mostly of viral origin <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>. Analysis of a large number of <it>Yersinia pestis </it>isolates has shown that these elements are sequentially added in a polarised fashion next to the leader <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. This suggestion was further confirmed by observations in <it>Sulfolobus solfataricus </it>and in <it>Streptococcus thermophilus </it><abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr></abbrgrp>. A cluster of genes called <it>cas </it>(CRISPR-associated) is often found in the vicinity of CRISPRs <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. When several CRISPRs with the same DR are present, only one is associated with <it>cas </it>genes. The exact number of <it>cas </it>genes is not known and apparently varies from one strain to another. However, a core of 4 genes is regularly identified, which appears to encode proteins involved in DNA modification and repair <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. 
Phylogenetic studies performed on the CAS proteins suggest that CRISPRs are acquired by horizontal transfer <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>. This is consistent with their presence on megaplasmids <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. CRISPRs are non-coding regions but different observations suggest that they are transcribed into small RNAs (smRNA) possibly from the leader acting as a promoter, and that they might play a role as siRNA (small interfering RNA) to block the entry of foreign sequences <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B14">14</abbr></abbrgrp>.</p>
         <p>In order to gain further insight into the organisation and behaviour of CRISPR loci it is necessary to perform extensive analyses of the available sequenced genomes. Several studies have been performed, the most extensive being that made on 370 prokaryotic genomes <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. However, these studies are static and, considering the number of ongoing sequencing projects, they are rapidly becoming obsolete. The TIGRFAM database <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> provides information on CAS-associated CRISPR loci but it is not dedicated to CRISPR identification and will not report CRISPR structures devoid of neighbouring <it>cas </it>genes.</p>
         <p>For the algorithmic detection of CRISPR patterns, several methods were empirically applied previously, making use of REPuter <abbrgrp><abbr bid="B13">13</abbr><abbr bid="B16">16</abbr></abbrgrp>, PatScan <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B17">17</abbr></abbrgrp>, TRF <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B18">18</abbr></abbrgrp>, LUNA <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> and PYGRAM <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>. These programs are designed to find repeats in general and not specifically CRISPR patterns, so they may provide the CRISPR location but do not accurately define the consensus DR. The output of such tools requires significant manual filtering to eliminate background, and post-processing to define the consensus DR and the spacers. Recently, a CRISPR-dedicated software tool called PILER-CR was described <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. PILER-CR is based on an elegant algorithm that consists mainly in producing piles meeting the CRISPR properties from local alignments of the query sequence to itself. The software tool has the advantage of running rapidly but it sometimes misidentifies the DR boundaries and omits the truncated DR.</p>
         <p>Finally, using the available programs, "short" or "quite short" CRISPRs (defined as containing fewer than three, three or seven spacers <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B12">12</abbr><abbr bid="B19">19</abbr></abbrgrp>) are not considered.</p>
         <p>Since future insights into the evolution of CRISPRs may result from the investigation of these very small CRISPRs, some of which may be newly emerging structures, it is important to facilitate access to this enlarged, but much more difficult to define, group.</p>
         <p>We have developed tools to identify CRISPRs, select DRs and store spacers in dictionaries, and a database which can be queried online at <url>http://crispr.u-psud.fr/crispr</url>. The CRISPRdb is automatically updated; in the May 2007 version, 475 published microbial genomes have been processed.</p>
      </sec>
      <sec>
         <st>
            <p>Construction and content</p>
         </st>
         <sec>
            <st>
               <p>Database and software design and implementation</p>
            </st>
            <p>CRISPRdb and associated web services are implemented in Perl version 5.8.8 <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> and take advantage of some BioPerl <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> modules for manipulating sequences. They run on an Apache 2.0 web server <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> with a Linux operating system (Debian Sarge 3.1) <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. The core application consists of two main programs: CRISPRFinder to detect CRISPRs and extract them from a genomic sequence, and Database Tools for downloading prokaryotic genomes from the NCBI ftp site <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>, saving CRISPRs and making updates.</p>
            <p>The first program is a full command line tool written in-house in Perl. It is used to process published genome sequences and feed the CRISPR database. It can also be run interactively through the web interface for submission and analysis of users' sequence data <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>.</p>
            <p>The second program is a set of Perl scripts. Downloading of genomic sequences, CRISPR detection and motif extraction are fully automated.</p>
            <p>A web resource is built on top of these programs via PHP <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> and Perl CGI scripts. This preserves platform independence across multiple operating systems and allows the user to interact with the different CRISPR tools without computer programming or (shell) scripting skills.</p>
         </sec>
         <sec>
            <st>
               <p>The CRISPRs database (CRISPRdb)</p>
            </st>
            <p>CRISPRdb is a relational database implemented using MySQL 4.1 <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. It utilizes the CRISPRFinder program to identify putative CRISPRs and additional tests to further screen for the smallest CRISPRs in a polyphasic approach. Indeed, the CRISPRFinder program is designed to admit the largest possible number of candidate CRISPRs, especially the shortest ones containing one or two spacers. The main idea of the program is to first find possible CRISPR locations in a genomic sequence and then check whether these regions contain a cluster that possesses the characteristics of an "obvious" CRISPR, i.e. containing at least three repeats. Finding possible CRISPR locations is achieved using the Vmatch package to detect maximal repeats <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>, that is, repeats that cannot be extended in either direction without incurring a mismatch <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B30">30</abbr></abbrgrp>. Reported matches must be 23 to 55 bp long with one possible mismatch, and the gap size between two instances of a repeat must be within 25 to 60 bp. The maximal repeats are clustered according to their position in the genome. In each "cluster", the maximal repeat which is the most frequent in the genome being processed is selected and "blasted" against the cluster. Such a maximal repeat is a candidate DR sequence, and when additional candidate DRs are identified, a score is computed to select the DR resulting in the minimum number of mismatches towards its boundaries. This step is probably instrumental in achieving a very precise identification of the proper DR consensus compared to other programs. The related matches are then extracted and tested as putative DRs of a CRISPR, so that the first or the last match is allowed to be degenerate, with a maximum number of errors equal to half the match length. This allows the efficient identification of the first, often truncated, DR. 
The other matches must be globally conserved to at least 80%. Finally, two filters are added to check the structure of the CRISPR candidates. The first one eliminates clusters in which the spacer lengths are not within the range of 0.6&#215; to 2.5&#215; the DR length. In addition, CRISPR candidates with more than 60% similarity between spacers (or between DR and spacer) are considered to be tandem repeats and are eliminated by the second filter. The selected criteria described above imply that the minimal structure of a putative CRISPR detected by CRISPRFinder should consist of at least two successive direct repeats (one spacer) with a maximum of one mismatch. CRISPRs of more than 2 spacers with three or more perfect repeats are considered "confirmed CRISPR" whereas the shorter CRISPRs are considered "questionable".</p>
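            <p>The two structural filters and the confirmed/questionable rule described above can be sketched as follows. This is a minimal illustration in Python, not the CRISPRFinder source (which is written in Perl). The function names are hypothetical, and the similarity score uses the standard-library difflib module as a stand-in, since the measure used by the program is not specified here.</p>
            <p><![CDATA[
```python
from difflib import SequenceMatcher

DR_MIN, DR_MAX = 23, 55      # allowed DR length in bp (from the text)
SPACER_MIN_RATIO = 0.6       # spacer length lower bound, as a multiple of DR length
SPACER_MAX_RATIO = 2.5       # spacer length upper bound, as a multiple of DR length
MAX_SIMILARITY = 0.6         # above this, the cluster is treated as a tandem repeat


def similarity(a, b):
    """Stand-in similarity score in [0, 1]; an assumption, not the tool's measure."""
    return SequenceMatcher(None, a, b).ratio()


def passes_filters(dr, spacers):
    """Return True if a candidate cluster survives both structural filters."""
    if not (DR_MIN <= len(dr) <= DR_MAX) or not spacers:
        return False
    # First filter: every spacer length must lie within 0.6x to 2.5x the DR length.
    low, high = SPACER_MIN_RATIO * len(dr), SPACER_MAX_RATIO * len(dr)
    if any(not (low <= len(s) <= high) for s in spacers):
        return False
    # Second filter: spacers resembling each other, or the DR, suggest a tandem repeat.
    for i, s in enumerate(spacers):
        if similarity(dr, s) > MAX_SIMILARITY:
            return False
        if any(similarity(s, t) > MAX_SIMILARITY for t in spacers[i + 1:]):
            return False
    return True


def classify(n_spacers, n_perfect_repeats):
    """Label a cluster that passed the filters, following the rule quoted in the text."""
    if n_spacers > 2 and n_perfect_repeats >= 3:
        return "confirmed"
    return "questionable"
```
]]></p>
            <p>In this sketch a cluster would first be tested with passes_filters and, if retained, labelled with classify; the thresholds are kept as named constants so that they can be adjusted if CRISPRs outside the current ranges are described.</p>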
            <p>Currently, CRISPRdb is composed of 5 tables (Figure <figr fid="F1">1</figr>). For storage in CRISPRdb (Figure <figr fid="F2">2</figr>), several additional tests are applied to the questionable CRISPRs in order to validate as many of them as possible. First, their DR is compared to previously identified DRs (for example, CRISPR NC_006155_4 in <it>Yersinia pseudotuberculosis </it>IP 32953, with 2 spacers, has the same DR as CRISPRs NC_006155_6 and NC_006155_7 in the same genome, comprising respectively 4 and 16 motifs; CRISPR NC_003272_3 in <it>Nostoc </it>sp. PCC 7120, with only one spacer, has the same DR as the CRISPR NC_007413_19 of <it>Anabaena variabilis </it>ATCC 29413, comprising 33 spacers). Then, a second filter is added to discard some of the non-significant short CRISPRs, consisting in a restriction on the allowed spacer length when the corresponding DR has no classical flanking nucleotides such as GTTT or GAAC.</p>
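            <p>The first of these tests, the comparison of a questionable DR with DRs that are already confirmed, can be illustrated by the following Python sketch. It is a simplification under stated assumptions: the database performs the comparison by BLAST, whereas the sketch uses exact matching, and testing the reverse complement as well is an assumption added here. The function names are hypothetical.</p>
            <p><![CDATA[
```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def has_known_dr(candidate_dr, confirmed_drs):
    """True if the candidate DR, or its reverse complement, is already confirmed.

    Exact matching is a simplification; the database uses a BLAST comparison.
    """
    known = {dr.upper() for dr in confirmed_drs}
    candidate = candidate_dr.upper()
    return candidate in known or reverse_complement(candidate) in known
```
]]></p>
            <p>A questionable cluster for which has_known_dr returns True would be promoted; the others would go on to the flanking-nucleotide and spacer-length test.</p>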
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>An entity-relationship diagram for the CRISPR database</p>
               </caption>
               <text>
                  <p><b>An entity-relationship diagram for the CRISPR database</b>. The downloaded data are represented in the yellow box: on the left the taxonomy report information and on the right the "GenomeInfo" report information about species replicons (chromosome or plasmid). The pink box represents tables related to the CRISPR clusters: a table for the cluster locus, a table for the DR consensus and a table for the spacers.</p>
               </text>
               <graphic file="1471-2105-8-172-1"/>
            </fig>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>The database construction: from genomes to CRISPRs</p>
               </caption>
               <text>
                  <p><b>The database construction: from genomes to CRISPRs</b>. The first step consists in downloading prokaryotic genomes which are then submitted to the CRISPRFinder program. The detected clusters are divided into two groups: confirmed CRISPRs (>=3 DRs) are stored in the database; small questionable clusters (2 or 3 DRs) are analyzed by blasting their conserved region (DR) against the approved DRs; clusters with already identified DRs are added to the CRISPR database. Remaining questionable CRISPRs are analysed for classical flanking nucleotides and for spacer length compared to the DR length. Clusters that do not fit these criteria are deleted; the remainder are kept as questionable. Some sequences can be discarded manually by the database curator. Colour code: programs are shown in blue, confirmed CRISPRs are in pink and questionable ones are in grey.</p>
               </text>
               <graphic file="1471-2105-8-172-2"/>
            </fig>
            <p>Admitting small CRISPR-like structures into the database leads to a large amount of questionable data. Therefore a colour code is used to differentiate the "confirmed CRISPR" shown in pink from the questionable structures shown in grey. However, and importantly, each time the database is updated and new genomes are processed, DRs from all questionable structures are rechecked against the updated DR database.</p>
            <p>CRISPR loci are identified from finished microbial genome sequences (as listed by the Genomes OnLine Database <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> and accessed from GenBank) and stored in the database. This procedure is repeated monthly to update the database.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Utility</p>
         </st>
         <sec>
            <st>
               <p>CRISPRdb: construction and content</p>
            </st>
            <p>Figure <figr fid="F3">3</figr> details some of the pages which can be viewed when browsing the database <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>. The home page (extract, top left) displays an alphabetical list of the Bacteria and Archaea strains for which a genome sequence is published, and a colour code indicates whether a CRISPR has been detected or not: species without a CRISPR are coloured in yellow, and species having at least one CRISPR are coloured in pink. The list can also be sorted according to taxonomic order, or according to database processing date. This last option makes it easy to quickly browse the latest entries. The page which appears after selecting a genome (step 1) indicates how many CRISPRs have been found and on which replicon (chromosome or plasmid) they are located. In the following page (step 2) the CRISPR id is indicated together with its position on the genome, the number of spacers and the consensus DR sequence. Querying a CRISPR locus (step 3) leads to a page containing detailed characteristics together with sequence retrieval tools: the DR consensus is shown in yellow, the spacers are shown in different colours, together with their position in the genome, the flanking sequences and the whole CRISPR locus sequence (using the flanking sequence button). Flanking sequences are displayed with flexible positions that may be modified from the 100 bp default value. Spacers can be automatically compared to public sequence databases using blastn. From this page one can access a flanking sequence CLUSTALW multiple alignment tool (FlankAlign) which is used for defining the presence of a leader and searching for homologous sequences in other genomes.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Screenshots of the CRISPRs web-service</p>
               </caption>
               <text>
                  <p><b>Screenshots of the CRISPRs web-service</b>. 1. The opening page of the prokaryotic strains: strains in pink have at least one CRISPR, strains in grey have only questionable CRISPRs and strains in yellow have no CRISPR. 2. General information on the CRISPR clusters and their location. 3. Detailed information on the clusters: DRs are in yellow, spacers are in random colours. 4. Link to the spacers fasta file.</p>
               </text>
               <graphic file="1471-2105-8-172-3"/>
            </fig>
            <p>Furthermore, the ability to download pre-calculated files (such as a summary of selected CRISPR properties or a list of spacers in FASTA format, step 4) makes the tool very flexible, as the output can be analysed with other bioinformatics resources.</p>
         </sec>
         <sec>
            <st>
               <p>The CRISPR utilities page <abbrgrp><abbr bid="B33">33</abbr></abbrgrp></p>
            </st>
            <p>This page provides a global overview of CRISPRs present in the database, focusing on DRs and spacers (Figure <figr fid="F4">4</figr>). Firstly, all identified DRs are listed with their size expressed in base-pairs (bp), and the number of occurrences in the database of DRs with similar sequences is indicated, as shown on the left panel of Figure <figr fid="F4">4</figr>. Selected DRs can be aligned using CLUSTALW and a dendrogram is produced (Figure <figr fid="F4">4</figr> right panels).</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>The DR comparison tool</p>
               </caption>
               <text>
                  <p><b>The DR comparison tool</b>. Screenshot from the Utilities page showing the list of DRs with an alignment example.</p>
               </text>
               <graphic file="1471-2105-8-172-4"/>
            </fig>
            <p>Secondly, a list of spacers encountered more than once provides an easy way to identify, for instance, the relatively rare occurrences of internal duplications within a CRISPR. A BLAST (blastn) search can be run using selected spacers against public sequence databases (GenBank, EMBL, DDBJ, PDB) with a cutoff of 0.1 for the E-value and a matching length of at least 70% of the queried spacer size. Thirdly, this page provides a classification of CRISPRs according to the number of motifs. The CRISPR id provides the related strain name on mouse-up and links to the page describing the CRISPR properties. Links are also provided to the corresponding pre-computed lists of DRs and spacers which can be downloaded as text files.</p>
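            <p>The two blastn criteria quoted above (E-value cutoff of 0.1 and matching length of at least 70% of the spacer size) amount to a simple filter on each hit, sketched here in Python. The function name and arguments are hypothetical, and treating the E-value cutoff as inclusive is an assumption.</p>
            <p><![CDATA[
```python
EVALUE_CUTOFF = 0.1   # maximum E-value of a retained hit (from the text)


def keep_hit(evalue, match_length, spacer_length):
    """Return True if a blastn hit satisfies both criteria.

    match_length and spacer_length are in bp; the 70% rule is written with
    integers (10 * match >= 7 * spacer) to avoid floating-point rounding.
    """
    long_enough = 10 * match_length >= 7 * spacer_length
    return evalue <= EVALUE_CUTOFF and long_enough
```
]]></p>
            <p>For a 32 bp spacer, for instance, a hit must cover at least 23 bp to be retained, since 22 bp is below 70% of 32 bp.</p>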
         </sec>
         <sec>
            <st>
               <p>The BLAST CRISPRs page <abbrgrp><abbr bid="B34">34</abbr></abbrgrp></p>
            </st>
            <p>This page is useful when attempting to validate a questionable CRISPR. From this page, a candidate DR region (or spacer) can be compared to all DRs (or spacers) characterised so far from clear-cut CRISPR structures present in the database.</p>
         </sec>
         <sec>
            <st>
               <p>The Spacers Dictionary Creator page <abbrgrp><abbr bid="B35">35</abbr></abbrgrp></p>
            </st>
            <p>The analysis of CRISPRs in different strains of a species has shown that polymorphism exists in the number and nature of spacers <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B36">36</abbr><abbr bid="B37">37</abbr></abbrgrp>. This can be used to assess the degree of polymorphism inside the species, thus providing additional information for epidemiological analyses. For this reason, it is important to be able to extract spacers from a sequence, and to store them in a database that can be queried when new sequences are produced. Upon submitting CRISPR sequences to the Spacers Dictionary Creator page, spacers are extracted and stored in an Excel file, either predefined or newly created. When a spacer is already present in the dictionary, its number appears in the output, whereas a new spacer will be given a new number and will be added to the Excel file.</p>
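            <p>The numbering behaviour described above can be sketched in a few lines of Python. This is an illustration only: the dictionary is held in memory rather than in an Excel file, the function name is hypothetical, and numbering from 1 in order of first appearance is an assumption.</p>
            <p><![CDATA[
```python
def update_dictionary(dictionary, spacers):
    """Return the number of each spacer, adding unseen spacers to the dictionary.

    dictionary maps a spacer sequence to its number and is modified in place.
    """
    numbers = []
    for spacer in spacers:
        key = spacer.upper()                        # compare sequences case-insensitively
        if key not in dictionary:
            dictionary[key] = len(dictionary) + 1   # next free number
        numbers.append(dictionary[key])
    return numbers
```
]]></p>
            <p>Submitting the spacers of a second strain against the same dictionary returns the existing numbers for shared spacers and new numbers for novel ones, so that each strain can be summarised as a list of spacer numbers.</p>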
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <sec>
            <st>
               <p>Sensitivity and selectivity of CRISPRFinder</p>
            </st>
            <p>To build the CRISPRdb we have used a new program, CRISPRFinder, specifically created to identify CRISPRs. We checked that all the CRISPRs described in the literature were detected with CRISPRFinder and, in addition, we found that CRISPRFinder performs better than other CRISPR finding tools, in particular in defining the DR boundaries and in identifying short CRISPRs. Among available programs, we found that PILER-CR is the most efficient. However, in the chromosome of <it>Aquifex aeolicus </it>VF5 (NC_000918) for instance, PILER-CR (default parameters: minarray 2, mincons 0.7, minid 0.85) detects 9 CRISPRs, three of which have misidentified DR boundaries and three of which are missing the truncated DR. In addition, one CRISPR locus is missed because only CRISPRs of at least three repeats are detected (CRISPR NC_000918_6, with one spacer, in the CRISPRdb was not detected by PILER-CR although it has the same DR as CRISPRs NC_000918_1, NC_000918_2, NC_000918_3 and NC_000918_10 containing respectively 5, 4, 3 and 3 repeats). Furthermore, CRISPRFinder is capable of detecting CRISPRs whose DRs contain multiple differences, such as NC_009009_1 and NC_009009_2 in <it>Streptococcus sanguinis</it>. Using the default parameters of PILER-CR no CRISPR was detected in this bacterium. When parameters were changed, only some of the CRISPRs were found. It will be interesting in the future to check whether these exceptional CRISPRs and <it>cas </it>genes are functional. Conversely, CRISPRFinder occasionally fails to exclude some false positives. We manually analysed all the CRISPRs identified in the current version of the database and eliminated a few false positive structures, principally tandem repeats with a low internal conservation. We estimate these cases to be less than 1% of "confirmed" CRISPRs.</p>
         </sec>
         <sec>
            <st>
               <p>Characteristics of CRISPRs</p>
            </st>
            <p>CRISPRdb has been constructed using public domain genome sequences (unpublished sequences can be submitted to CRISPRFinder to detect CRISPRs and extract the spacers). Sixty-three percent (63%) of the structures qualifying as CRISPRs using the defined parameters possess 4 or fewer spacers. The majority of these are classified as questionable. Their confirmation or exclusion as <it>bona fide </it>CRISPR structures will require additional evidence, such as the presence of a DR already described in a CRISPR, the presence of <it>cas </it>genes in the vicinity or the search for polymorphism within multiple isolates from the same species.</p>
            <p>We have chosen to restrict the definition of CRISPRs to comprise DRs 23 to 55 bp-long and spacers 0.6 to 2.5 times the DR size because these sizes are in excess of the range of previously described CRISPRs. These parameters do not exclude CRISPRs also containing a subset of much larger spacers, as can be seen in <it>Methanopyrus kandleri </it>with spacers 51 to 72 bp-long. There are no clear rules defining the limits of a DR or a spacer and we might be missing currently unknown CRISPRs with characteristics outside of the range currently covered, even if the present rules were deduced from the published investigation by various means of more than 300 genomes. Should such CRISPRs be observed in the future, the database, as designed, can be easily adjusted.</p>
            <p>Wide differences are observed among the CRISPRs, in the DR sequence, its size and the size of the spacers. Table <tblr tid="T1">1</tblr> summarizes the size distribution observed for DRs. Interestingly, in both Archaea and Bacteria, three well-separated size classes are observed: small DRs (24&#8211;25 base-pairs), medium-size (28&#8211;30 bp) and large (36&#8211;37 bp). The small DR group is more represented in Archaea (42% versus less than 2% for this size class in Bacteria) and curiously it is also where the differences between DR and spacer size are the largest. In <it>Pyrobaculum aerophilum </it>7 CRISPRs have 24 or 25 bp-long DRs whereas the spacer sizes range from 38 to 53 bp. The longest spacers were observed in <it>Methanopyrus kandleri</it>, which possesses 5 CRISPRs with DRs 35 or 36 bp-long and spacers 51 to 72 bp-long, as previously mentioned. In contrast, a remarkably constant spacer length is observed in some bacteria. In <it>Geobacter sulfurreducens </it>a single CRISPR with a 29 bp DR possesses one hundred and thirty-eight 32 bp-long spacers and three 33 bp-long spacers. A similar situation is observed in <it>Mycoplasma mobile </it>and in <it>Treponema denticola</it>. The longest DR presently found is 47 bp-long in the CRISPR of <it>Bacteroides fragilis</it>. The associated spacers are either 29 or 30 bp-long. This suggests that the precise mechanism producing spacers differs from one bacterium or archaeon to another although a common set of CAS proteins is generally associated with all the CRISPRs. The largest CRISPR locus was found in <it>Verminephrobacter eiseniae</it>, consisting of 245 repeats on one side and 45 repeats on the other side of an IS element (NC_008786_2 and NC_008786_3). The DR is 28 bp-long and the average spacer length is 32 bp. The longest CRISPR previously described was NC_003869_3 from <it>Thermoanaerobacter tengcongensis </it>MB4 with 217 repeats.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Summary of the characteristics and number of CRISPRs.</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="left">
                        <p>
                           <b>DR length</b>
                        </p>
                     </c>
                     <c cspan="3" ca="left">
                        <p>
                           <b>Number of CRISPRs (percentage %)</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Bacteria</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Archaea</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Total</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>47</p>
                     </c>
                     <c ca="left">
                        <p>1 (&lt;1)</p>
                     </c>
                     <c ca="left">
                        <p>0</p>
                     </c>
                     <c ca="left">
                        <p>1 (&lt;1)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>38</p>
                     </c>
                     <c ca="left">
                        <p>3 (&lt;1)</p>
                     </c>
                     <c ca="left">
                        <p>0</p>
                     </c>
                     <c ca="left">
                        <p>3 (&lt;1)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>37</p>
                     </c>
                     <c ca="left">
                        <p>55 (14.7)</p>
                     </c>
                     <c ca="left">
                        <p>14 (8.9)</p>
                     </c>
                     <c ca="left">
                        <p>69 (13)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>36</p>
                     </c>
                     <c ca="left">
                        <p>69 (18.4)</p>
                     </c>
                     <c ca="left">
                        <p>9 (5.7)</p>
                     </c>
                     <c ca="left">
                        <p>78 (14.7)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>35</p>
                     </c>
                     <c ca="left">
                        <p>10 (2.7)</p>
                     </c>
                     <c ca="left">
                        <p>1 (&lt;1)</p>
                     </c>
                     <c ca="left">
                        <p>11 (2.6)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>34</p>
                     </c>
                     <c ca="left">
                        <p>1 (&lt;1)</p>
                     </c>
                     <c ca="left">
                        <p>0</p>
                     </c>
                     <c ca="left">
                        <p>1 (&lt;1)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>33</p>
                     </c>
                     <c ca="left">
                        <p>4 (1)</p>
                     </c>
                     <c ca="left">
                        <p>0</p>
                     </c>
                     <c ca="left">
                        <p>4 (&lt;1)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>32</p>
                     </c>
                     <c ca="left">
                        <p>31 (8.3)</p>
                     </c>
                     <c ca="left">
                        <p>1 (&lt;1)</p>
                     </c>
                     <c ca="left">
                        <p>32 (6)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>31</p>
                     </c>
                     <c ca="left">
                        <p>6 (1.6)</p>
                     </c>
                     <c ca="left">
                        <p>2 (1.3)</p>
                     </c>
                     <c ca="left">
                        <p>8 (1.5)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>30</p>
                     </c>
                     <c ca="left">
                        <p>51 (13.6)</p>
                     </c>
                     <c ca="left">
                        <p>46 (29.1)</p>
                     </c>
                     <c ca="left">
                        <p>97 (18.2)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>29</p>
                     </c>
                     <c ca="left">
                        <p>68 (18.1)</p>
                     </c>
                     <c ca="left">
                        <p>9 (5.7)</p>
                     </c>
                     <c ca="left">
                        <p>77 (14.4)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>28</p>
                     </c>
                     <c ca="left">
                        <p>67 (17.9)</p>
                     </c>
                     <c ca="left">
                        <p>7 (4.4)</p>
                     </c>
                     <c ca="left">
                        <p>74 (13.9)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>27</p>
                     </c>
                     <c ca="left">
                        <p>2 (0.53)</p>
                     </c>
                     <c ca="left">
                        <p>2 (1.7)</p>
                     </c>
                     <c ca="left">
                        <p>4 (&lt;1)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>26</p>
                     </c>
                     <c ca="left">
                        <p>1 (0.27)</p>
                     </c>
                     <c ca="left">
                        <p>0</p>
                     </c>
                     <c ca="left">
                        <p>1 (&lt;1)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>25</p>
                     </c>
                     <c ca="left">
                        <p>6 (1.6)</p>
                     </c>
                     <c ca="left">
                        <p>37 (23.4)</p>
                     </c>
                     <c ca="left">
                        <p>43 (8)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>24</p>
                     </c>
                     <c ca="left">
                        <p>0</p>
                     </c>
                     <c ca="left">
                        <p>30 (19)</p>
                     </c>
                     <c ca="left">
                        <p>30 (5.7)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Total Number</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>375</p>
                     </c>
                     <c ca="left">
                        <p>158</p>
                     </c>
                     <c ca="left">
                        <p>533</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Mean Length</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>32</p>
                     </c>
                     <c ca="left">
                        <p>32</p>
                     </c>
                     <c ca="left">
                        <p>32</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Only confirmed CRISPRs are counted, and only one strain per species is included. The first column shows the DR length. The second and third columns show the number of CRISPRs having the corresponding DR length in Bacteria and Archaea respectively, with the percentage of CRISPRs having this DR length in parentheses. In the last column, the two populations of CRISPRs are merged. The last two rows give, respectively, the total number of CRISPRs in each category and the mean DR length.</p>
               </tblfn>
            </tbl>
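            <p>As a rough illustration of how the size classes relate to Table 1, the following Python sketch recomputes the share of each class from the counts in the table. The class boundaries are those given in the text; the names DR_COUNTS, SIZE_CLASSES and class_share are hypothetical and are not part of the CRISPRdb software.</p>

```python
# Counts copied from Table 1: DR length (bp) -> (Bacteria, Archaea).
DR_COUNTS = {
    47: (1, 0), 38: (3, 0), 37: (55, 14), 36: (69, 9), 35: (10, 1),
    34: (1, 0), 33: (4, 0), 32: (31, 1), 31: (6, 2), 30: (51, 46),
    29: (68, 9), 28: (67, 7), 27: (2, 2), 26: (1, 0), 25: (6, 37),
    24: (0, 30),
}

# Size classes as delimited in the text (inclusive bounds, in bp).
SIZE_CLASSES = {"small": (24, 25), "medium": (28, 30), "large": (36, 37)}

# Column indexes into the (Bacteria, Archaea) pairs; illustrative names.
BACTERIA, ARCHAEA = 0, 1


def class_share(counts, bounds, column):
    """Return the percentage of CRISPRs whose DR length lies within bounds."""
    low, high = bounds
    total = sum(pair[column] for pair in counts.values())
    in_class = sum(
        pair[column]
        for length, pair in counts.items()
        if length in range(low, high + 1)
    )
    return 100.0 * in_class / total


if __name__ == "__main__":
    for name, bounds in SIZE_CLASSES.items():
        bacteria = class_share(DR_COUNTS, bounds, BACTERIA)
        archaea = class_share(DR_COUNTS, bounds, ARCHAEA)
        print(f"{name}: Bacteria {bacteria:.1f}%, Archaea {archaea:.1f}%")
```

            <p>With these counts the small class accounts for about 42.4% of archaeal CRISPRs and 1.6% of bacterial CRISPRs, in line with the figures quoted in the text.</p>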
            <p>Mojica and colleagues <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> observed the existence of terminal and inner inverted repeats in the DR sequence, and Jansen and colleagues <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> further suggested that the secondary structure might play an essential biological role. A protein binding on one side of the repeat and producing an opening of the opposite side of the DNA structure was described in <it>Sulfolobus solfataricus </it><abbrgrp><abbr bid="B38">38</abbr></abbrgrp> and might be used in the processing of small RNAs <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. A future development of our work will be the analysis of all the DRs in search of a common secondary structure that might help in understanding the role of the DR.</p>
            <p>Within a species, several strains can share a set of spacers, but within a given CRISPR the spacers are generally unique, except in a few cases where duplications of one to 7 motifs (a DR and a spacer) were observed <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>. Duplications appear to be more frequent in Archaea, as described in detail by Lillestol <it>et al</it>. <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>.</p>
            <p>It is important to note that the absence of CRISPRs in one strain does not imply that CRISPRs are absent from all the members of the corresponding species. However, in some species or genera no CRISPR has been identified yet although a number of strains have been fully sequenced. This is the case, for example, in <it>Staphylococcus aureus </it>and <it>Burkholderia sp</it>.</p>
         </sec>
         <sec>
            <st>
               <p>Multiplication of CRISPR</p>
            </st>
            <p>It is believed that CRISPRs and the associated <it>cas </it>genes can be horizontally transferred between bacteria of different species and possibly between Archaea and Bacteria. This is strongly suggested by comparison of CAS protein sequences, but it does not explain how several CRISPRs with a similar DR can be present in a single genome when only one of them is associated with <it>cas </it>genes. The small CRISPRs are particularly interesting in this respect, as they may help elucidate how a new CRISPR is created and how new motifs are inserted into an existing CRISPR. For example in <it>Clostridium tetani </it>among eight CRISPRs possessing 1 to 33 motifs, seven are clustered between positions 1570766 and 1595950 (spanning 25,184 bp); five of these have exactly the same DR and two have a derivative (6 different nucleotides out of 30). The leaders of the seven clustered CRISPRs align over about 150 bp with 80% similarity, <it>cas </it>genes are present once, between CRISPR 5 and CRISPR 6, and no spacer is shared. It is therefore most likely that, starting from an ancestral complete locus of CRISPR and <it>cas </it>genes, new CRISPRs have been created not by duplication of the complete complex but rather by the insertion of a minimal structure comprising a leader sequence, a DR and no spacer, which then grows by adding new motifs. The absence of common spacers, even when several CRISPRs are present in a single bacterium or archaeon, also suggests that gene conversion is not a significant process for new motif acquisition.</p>
         </sec>
         <sec>
            <st>
               <p>The CRISPR intra-species polymorphism: insight into the mechanism of acquisition of new motifs</p>
            </st>
            <p>We developed the spacer dictionary tool to facilitate the extraction of spacers and their analysis, principally for phylogenetic studies. To illustrate how this tool works, we provide a demonstrator based on the sequences of five <it>Y. pestis </it>genomes. An initial dictionary was first created from the 26 published spacers, named using the alphabet from "a" to "z" <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. The CRISPRs of new alleles, such as those obtained by sequencing the locus in a collection of diverse strains, can be submitted to the dictionary tool in FASTA format. Spacers that are not already present in the dictionary are given a number and added sequentially to the dictionary. Each allele is then coded as the succession of the names of its spacers.</p>
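            <p>The dictionary update described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the code of the CRISPRdb tool: the function name encode_allele and the toy sequences are hypothetical, numbering is assumed to start at 1, spacers are compared as exact uppercase strings, and FASTA parsing and reverse-complement matching are left out.</p>

```python
def encode_allele(spacers, dictionary):
    """Code one allele as a list of spacer names, extending the dictionary.

    spacers: spacer sequences in their order along the CRISPR.
    dictionary: maps spacer sequence to its name; it is updated in place.
    """
    # Numbers already given to previously added spacers (letters are skipped).
    used = [int(name) for name in dictionary.values() if name.isdigit()]
    next_number = max(used, default=0) + 1

    code = []
    for sequence in spacers:
        key = sequence.upper()
        if key not in dictionary:
            # Unknown spacer: give it the next number and store it.
            dictionary[key] = str(next_number)
            next_number += 1
        code.append(dictionary[key])
    return code


if __name__ == "__main__":
    # Toy sequences for illustration only, not real Y. pestis spacers.
    known = {"ACGTACGT": "a", "TTGACCAA": "b"}
    print(encode_allele(["ACGTACGT", "GGGTTTCC", "TTGACCAA"], known))
    print(encode_allele(["GGGTTTCC", "CCCCAAAA"], known))
```

            <p>In this toy run the first allele is coded as a, 1, b and the second as 1, 2: a spacer first seen in one allele keeps the same name in every allele submitted afterwards.</p>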
            <p>In our previous study of three CRISPRs in 180 <it>Y. pestis </it>isolates, most of which were genetically very similar, we described the polymorphism at each locus due to differing numbers of motifs <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. Our observations suggested that one or several motifs could be lost by precise deletion between 2 DRs whereas new motifs were added precisely at the level of the last DR flanking the leader. A similar suggestion was made based upon observations in <it>S. solfataricus </it>P1 and in <it>S. thermophilus </it><abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr></abbrgrp>. This mechanism is further supported by the structure of some CRISPRs in which a first series of motifs containing a particular DR is followed, up to the last motif near the leader, by motifs whose DR differs at one or two nucleotides. For example in the CRISPR NC_005085_3 of <it>Chromobacterium violaceum</it>, 13 motifs with DR "GTGTTCCCCACG<b>TG</b>CGTGGGGATGAACCG" are followed by 6 motifs with DR "GTGTTCCCCACG<b>CC</b>CGTGGGGATGAACCG" (differing bases in bold). Another interesting example is found in <it>Carboxydothermus hydrogenoformans </it>where two CRISPRs, NC_007503_3 and NC_007503_4 (59 and 84 spacers respectively), share the same 30 bp DR, although in one of them the last 13 repeats adjacent to the leader have a modified DR. The first three bases of the DR are absent whereas the three bases AAC are added at the other end to produce a modified DR (Figure <figr fid="F5">5</figr>). This suggests that at some point the last DR plus 3 bases of a newly added spacer were duplicated to create a new DR, which then served as a template for subsequent duplications. Alternatively, the AAC addition could be the result of some stuttering, since the initial DR ends with AAAAC (and the modified DR with AAAACAAC). 
These observations are in favour of the model of polarised sequential insertion of new motifs by duplication of the last DR and insertion of a new spacer <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B10">10</abbr></abbrgrp>, rather than random insertion by homologous recombination as proposed by Makarova <it>et al</it>. <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. If the newly copied last DR contains a mutation that is compatible with CRISPR metabolism, then this mutation will be copied in all subsequent motif acquisitions.</p>
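            <p>For readers who wish to locate such DR variants themselves, the short Python sketch below lists the positions at which the two <it>Chromobacterium violaceum </it>DRs quoted above differ. The function name dr_differences is hypothetical, and the comparison assumes two DRs of equal length with no insertion or deletion.</p>

```python
# The two DR variants of CRISPR NC_005085_3 quoted in the text.
DR_FIRST = "GTGTTCCCCACGTGCGTGGGGATGAACCG"  # DR of the first 13 motifs
DR_LAST = "GTGTTCCCCACGCCCGTGGGGATGAACCG"   # DR of the 6 following motifs


def dr_differences(dr_a, dr_b):
    """Return (position, base_a, base_b) for each mismatch, positions from 1."""
    if len(dr_a) != len(dr_b):
        raise ValueError("this simple comparison needs DRs of equal length")
    return [
        (position, base_a, base_b)
        for position, (base_a, base_b) in enumerate(zip(dr_a, dr_b), start=1)
        if base_a != base_b
    ]


if __name__ == "__main__":
    for position, base_a, base_b in dr_differences(DR_FIRST, DR_LAST):
        print(f"position {position}: {base_a} replaced by {base_b}")
```

            <p>For these two 29 bp DRs the mismatches fall at the adjacent positions 13 and 14, the bases shown in bold above. Under the polarised insertion model, every motif added after the mutation arose would be expected to carry the variant DR.</p>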
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>The first and last 17 motifs of CRISPR NC_007503_3 from <it>Carboxydothermus hydrogenoformans </it>Z-2901</p>
               </caption>
               <text>
                  <p><b>The first and last 17 motifs of CRISPR NC_007503_3 from <it>Carboxydothermus hydrogenoformans </it>Z-2901</b>. The DRs shared by the two CRISPR loci NC_007503_3 and NC_007503_4 are shown in yellow and the variant DR observed only in NC_007503_3 is in red. CRISPR units (DR + spacer) are numbered on the left and spacer lengths are indicated on the right.</p>
               </text>
               <graphic file="1471-2105-8-172-5"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Future developments</p>
            </st>
            <p>Further development of our software will include new parameters to analyse genomes for which only questionable structures were detected. An additional aspect will be the identification of minimal CRISPR structures, devoid of spacers and comprising only a DR and a leader.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>The software and database described here are exclusively devoted to the identification and analysis of CRISPR structures, <it>i.e</it>. the succession of motifs made up of DRs and spacers. A database for <it>cas </it>gene identification has been developed by TIGR <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. We have added a link to this web page so that users can search for <it>cas </it>genes in the vicinity of a CRISPR.</p>
         <p>CRISPRs are fascinating structures, which conceal complex biological mechanisms accounting for their transfer, evolution and behaviour. They have probably played an important role in the evolution of Archaea and Bacteria by providing a defence mechanism against foreign DNA. A lot remains to be discovered, and this requires the ability to rapidly investigate newly sequenced genomes and to browse easily across many different species. The CRISPRdb and associated web service provide the tools necessary to decipher the organisation of these structures. Several studies have shown that when an origin can be found for a spacer, it is most frequently a virus or a plasmid sequence. Thus the spacer database will serve as a repository of sequences of probable viral or plasmid origin. Finally, the intra-species polymorphism of CRISPRs and their mode of evolution (organised acquisition and loss of motifs) make them interesting tools for epidemiological studies. A given spacer could be added twice independently into a CRISPR, which could hamper its use for phylogenetic studies. However, the polarized addition of motifs and the limited number of recombination events ensure that the order of the spacers should be preserved. In <it>Y. pestis </it>we believe that CRISPRs could be used to investigate ancient DNAs (Vergnaud et al. in press).</p>
      </sec>
      <sec>
         <st>
            <p>Availability and requirements</p>
         </st>
         <p>The resource described here is accessible with no restrictions, except for the request to cite the site <abbrgrp><abbr bid="B32">32</abbr></abbrgrp> (see Creative Commons license on the site).</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>GV and CP designed the study. IG developed the programs and database, and ran initial tests. Additional tests were done by IG, GV and CP together with collaborators. CP, GV and IG wrote the manuscript. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We thank Bernard Labedan and Olivier Lespinet for their valuable comments. We thank the reviewers for their constructive analysis of and comments on the manuscript.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Unusual nucleotide arrangement with repeated sequences in the <it>Escherichia coli</it> K-12 chromosome</p>
            </title>
            <aug>
               <au>
                  <snm>Nakata</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Amemura</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Makino</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>J Bacteriol</source>
            <pubdate>1989</pubdate>
            <volume>171</volume>
            <issue>6</issue>
            <fpage>3553</fpage>
            <lpage>3556</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">210085</pubid>
                  <pubid idtype="pmpid" link="fulltext">2656660</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Nature of DNA polymorphism in the direct repeat cluster of <it>Mycobacterium tuberculosis</it>; application for strain differentiation by a novel typing method</p>
            </title>
            <aug>
               <au>
                  <snm>Groenen</snm>
                  <fnm>PM</fnm>
               </au>
               <au>
                  <snm>Bunschoten</snm>
                  <fnm>AE</fnm>
               </au>
               <au>
                  <snm>van Soolingen</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>van Embden</snm>
                  <fnm>JD</fnm>
               </au>
            </aug>
            <source>Mol Microbiol</source>
            <pubdate>1993</pubdate>
            <volume>10</volume>
            <issue>5</issue>
            <fpage>1057</fpage>
            <lpage>1065</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1111/j.1365-2958.1993.tb00976.x</pubid>
                  <pubid idtype="pmpid">7934856</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Long stretches of short tandem repeats are present in the largest replicons of the Archaea <it>Haloferax mediterranei</it> and <it>Haloferax volcanii</it> and could be involved in replicon partitioning</p>
            </title>
            <aug>
               <au>
                  <snm>Mojica</snm>
                  <fnm>FJ</fnm>
               </au>
               <au>
                  <snm>Ferrer</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Juez</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Rodriguez-Valera</snm>
                  <fnm>F</fnm>
               </au>
            </aug>
            <source>Mol Microbiol</source>
            <pubdate>1995</pubdate>
            <volume>17</volume>
            <issue>1</issue>
            <fpage>85</fpage>
            <lpage>93</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1111/j.1365-2958.1995.mmi_17010085.x</pubid>
                  <pubid idtype="pmpid">7476211</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Biological significance of a family of regularly spaced repeats in the genomes of Archaea, Bacteria and mitochondria</p>
            </title>
            <aug>
               <au>
                  <snm>Mojica</snm>
                  <fnm>FJ</fnm>
               </au>
               <au>
                  <snm>Diez-Villasenor</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Soria</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Juez</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Mol Microbiol</source>
            <pubdate>2000</pubdate>
            <volume>36</volume>
            <issue>1</issue>
            <fpage>244</fpage>
            <lpage>246</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1046/j.1365-2958.2000.01838.x</pubid>
                  <pubid idtype="pmpid" link="fulltext">10760181</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Identification of genes that are associated with DNA repeats in prokaryotes</p>
            </title>
            <aug>
               <au>
                  <snm>Jansen</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Embden</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Gaastra</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Schouls</snm>
                  <fnm>LM</fnm>
               </au>
            </aug>
            <source>Mol Microbiol</source>
            <pubdate>2002</pubdate>
            <volume>43</volume>
            <issue>6</issue>
            <fpage>1565</fpage>
            <lpage>1575</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1046/j.1365-2958.2002.02839.x</pubid>
                  <pubid idtype="pmpid" link="fulltext">11952905</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Clustered regularly interspaced short palindrome repeats (CRISPRs) have spacers of extrachromosomal origin</p>
            </title>
            <aug>
               <au>
                  <snm>Bolotin</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Quinquis</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Sorokin</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Ehrlich</snm>
                  <fnm>SD</fnm>
               </au>
            </aug>
            <source>Microbiology</source>
            <pubdate>2005</pubdate>
            <volume>151</volume>
            <issue>Pt 8</issue>
            <fpage>2551</fpage>
            <lpage>2561</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1099/mic.0.28048-0</pubid>
                  <pubid idtype="pmpid" link="fulltext">16079334</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements</p>
            </title>
            <aug>
               <au>
                  <snm>Mojica</snm>
                  <fnm>FJ</fnm>
               </au>
               <au>
                  <snm>Diez-Villasenor</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Garcia-Martinez</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Soria</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>J Mol Evol</source>
            <pubdate>2005</pubdate>
            <volume>60</volume>
            <issue>2</issue>
            <fpage>174</fpage>
            <lpage>182</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/s00239-004-0046-3</pubid>
                  <pubid idtype="pmpid" link="fulltext">15791728</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>CRISPR elements in <it>Yersinia pestis</it> acquire new repeats by preferential uptake of bacteriophage DNA, and provide additional tools for evolutionary studies</p>
            </title>
            <aug>
               <au>
                  <snm>Pourcel</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Salvignol</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Vergnaud</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Microbiology</source>
            <pubdate>2005</pubdate>
            <volume>151</volume>
            <issue>Pt 3</issue>
            <fpage>653</fpage>
            <lpage>663</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1099/mic.0.27437-0</pubid>
                  <pubid idtype="pmpid" link="fulltext">15758212</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>CRISPR provides acquired resistance against viruses in prokaryotes</p>
            </title>
            <aug>
               <au>
                  <snm>Barrangou</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Fremaux</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Deveau</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Richards</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Boyaval</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Moineau</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Romero</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Horvath</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2007</pubdate>
            <volume>315</volume>
            <fpage>1709</fpage>
            <lpage>1712</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.1138140</pubid>
                  <pubid idtype="pmpid" link="fulltext">17379808</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>A putative viral defence mechanism in archaeal cells</p>
            </title>
            <aug>
               <au>
                  <snm>Lillestol</snm>
                  <fnm>RK</fnm>
               </au>
               <au>
                  <snm>Redder</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Garrett</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Brugger</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>Archaea</source>
            <pubdate>2006</pubdate>
            <volume>2</volume>
            <issue>1</issue>
            <fpage>59</fpage>
            <lpage>72</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">16877322</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action</p>
            </title>
            <aug>
               <au>
                  <snm>Makarova</snm>
                  <fnm>KS</fnm>
               </au>
               <au>
                  <snm>Grishin</snm>
                  <fnm>NV</fnm>
               </au>
               <au>
                  <snm>Shabalina</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>Wolf</snm>
                  <fnm>YI</fnm>
               </au>
               <au>
                  <snm>Koonin</snm>
                  <fnm>EV</fnm>
               </au>
            </aug>
            <source>Biol Direct</source>
            <pubdate>2006</pubdate>
            <volume>1</volume>
            <fpage>7</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1462988</pubid>
                  <pubid idtype="pmpid" link="fulltext">16545108</pubid>
                  <pubid idtype="doi">10.1186/1745-6150-1-7</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>The repetitive DNA elements called CRISPRs and their associated genes: evidence of horizontal transfer among prokaryotes</p>
            </title>
            <aug>
               <au>
                  <snm>Godde</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Bickerton</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>J Mol Evol</source>
            <pubdate>2006</pubdate>
            <volume>62</volume>
            <issue>6</issue>
            <fpage>718</fpage>
            <lpage>729</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/s00239-005-0223-z</pubid>
                  <pubid idtype="pmpid" link="fulltext">16612537</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>A Guild of 45 CRISPR-Associated (Cas) Protein Families and Multiple CRISPR/Cas Subtypes Exist in Prokaryotic Genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Haft</snm>
                  <fnm>DH</fnm>
               </au>
               <au>
                  <snm>Selengut</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Mongodin</snm>
                  <fnm>EF</fnm>
               </au>
               <au>
                  <snm>Nelson</snm>
                  <fnm>KE</fnm>
               </au>
            </aug>
            <source>PLoS Comput Biol</source>
            <pubdate>2005</pubdate>
            <volume>1</volume>
            <issue>6</issue>
            <fpage>e60</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1282333</pubid>
                  <pubid idtype="pmpid" link="fulltext">16292354</pubid>
                  <pubid idtype="doi">10.1371/journal.pcbi.0010060</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Identification of 86 candidates for small non-messenger RNAs from the archaeon <it>Archaeoglobus fulgidus</it></p>
            </title>
            <aug>
               <au>
                  <snm>Tang</snm>
                  <fnm>TH</fnm>
               </au>
               <au>
                  <snm>Bachellerie</snm>
                  <fnm>JP</fnm>
               </au>
               <au>
                  <snm>Rozhdestvensky</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Bortolin</snm>
                  <fnm>ML</fnm>
               </au>
               <au>
                  <snm>Huber</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Drungowski</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Elge</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Brosius</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Huttenhofer</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2002</pubdate>
            <volume>99</volume>
            <issue>11</issue>
            <fpage>7536</fpage>
            <lpage>7541</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">124276</pubid>
                  <pubid idtype="pmpid" link="fulltext">12032318</pubid>
                  <pubid idtype="doi">10.1073/pnas.112047299</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>The TIGRFAM page</p>
            </title>
            <url>http://www.tigr.org/TIGRFAMs/</url>
         </bibl>
         <bibl id="B16">
            <title>
               <p>REPuter: the manifold applications of repeat analysis on a genomic scale</p>
            </title>
            <aug>
               <au>
                  <snm>Kurtz</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Choudhuri</snm>
                  <fnm>JV</fnm>
               </au>
               <au>
                  <snm>Ohlebusch</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Schleiermacher</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Stoye</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Giegerich</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2001</pubdate>
            <volume>29</volume>
            <issue>22</issue>
            <fpage>4633</fpage>
            <lpage>4642</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">92531</pubid>
                  <pubid idtype="pmpid" link="fulltext">11713313</pubid>
                  <pubid idtype="doi">10.1093/nar/29.22.4633</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Identification of a novel family of sequence repeats among prokaryotes</p>
            </title>
            <aug>
               <au>
                  <snm>Jansen</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>van Embden</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Gaastra</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Schouls</snm>
                  <fnm>LM</fnm>
               </au>
            </aug>
            <source>Omics</source>
            <pubdate>2002</pubdate>
            <volume>6</volume>
            <issue>1</issue>
            <fpage>23</fpage>
            <lpage>33</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1089/15362310252780816</pubid>
                  <pubid idtype="pmpid">11883425</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Tandem repeats finder: a program to analyze DNA sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Benson</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1999</pubdate>
            <volume>27</volume>
            <issue>2</issue>
            <fpage>573</fpage>
            <lpage>580</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">148217</pubid>
                  <pubid idtype="pmpid" link="fulltext">9862982</pubid>
                  <pubid idtype="doi">10.1093/nar/27.2.573</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Browsing repeats in genomes: Pygram and an application to non-coding region analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Durand</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Mahe</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Valin</snm>
                  <fnm>AS</fnm>
               </au>
               <au>
                  <snm>Nicolas</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>477</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1635066</pubid>
                  <pubid idtype="pmpid" link="fulltext">17067389</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-7-477</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>PILER-CR: fast and accurate identification of CRISPR repeats</p>
            </title>
            <aug>
               <au>
                  <snm>Edgar</snm>
                  <fnm>RC</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <fpage>18</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1790904</pubid>
                  <pubid idtype="pmpid" link="fulltext">17239253</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-8-18</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>The Perl directory</p>
            </title>
            <url>http://www.perl.org/</url>
         </bibl>
         <bibl id="B22">
            <title>
               <p>BioPerl</p>
            </title>
            <url>http://www.bioperl.org/</url>
         </bibl>
         <bibl id="B23">
            <title>
               <p>The Apache Software Foundation</p>
            </title>
            <url>http://www.apache.org/</url>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Debian</p>
            </title>
            <url>http://www.debian.org/</url>
         </bibl>
         <bibl id="B25">
            <title>
               <p>The NCBI ftp site for Bacterial and Archaeal genome sequences</p>
            </title>
            <url>ftp://ftp.ncbi.nih.gov/genomes/Bacteria</url>
         </bibl>
         <bibl id="B26">
            <title>
               <p>The CRISPRFinder</p>
            </title>
            <url>http://crispr.u-psud.fr/Server/CRISPRfinder.php</url>
         </bibl>
         <bibl id="B27">
            <title>
               <p>PHP</p>
            </title>
            <url>http://www.php.net/</url>
         </bibl>
         <bibl id="B28">
            <title>
               <p>MySQL</p>
            </title>
            <url>http://www.mysql.com/</url>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Vmatch</p>
            </title>
            <url>http://www.vmatch.de/</url>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Replacing suffix trees with enhanced suffix arrays</p>
            </title>
            <aug>
               <au>
                  <snm>Abouelhoda</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Kurtz</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ohlebusch</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Journal of Discrete Algorithms</source>
            <pubdate>2004</pubdate>
            <volume>2</volume>
            <fpage>53</fpage>
            <lpage>86</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S1570-8667(03)00065-0</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide</p>
            </title>
            <aug>
               <au>
                  <snm>Liolios</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Tavernarakis</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Hugenholtz</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Kyrpides</snm>
                  <fnm>NC</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2006</pubdate>
            <volume>34</volume>
            <issue>Database issue</issue>
            <fpage>D332</fpage>
            <lpage>D334</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1347507</pubid>
                  <pubid idtype="pmpid" link="fulltext">16381880</pubid>
                  <pubid idtype="doi">10.1093/nar/gkj145</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>The CRISPR database</p>
            </title>
            <url>http://crispr.u-psud.fr</url>
         </bibl>
         <bibl id="B33">
            <title>
               <p>CRISPRUtilities</p>
            </title>
            <url>http://crispr.u-psud.fr/crispr/CRISPRUtilitiesPage.html</url>
         </bibl>
         <bibl id="B34">
            <title>
               <p>BLAST CRISPRs</p>
            </title>
            <url>http://crispr.u-psud.fr/crispr/BLAST/CRISPRsBlast.php</url>
         </bibl>
         <bibl id="B35">
            <title>
               <p>The CRISPR spacers dictionary</p>
            </title>
            <url>http://crispr.u-psud.fr/crispr/MultipleAnalysis/CRISPRdetector.php</url>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Comparative genotyping of <it>Campylobacter jejuni</it> by amplified fragment length polymorphism, multilocus sequence typing, and short repeat sequencing: strain diversity, host range, and recombination</p>
            </title>
            <aug>
               <au>
                  <snm>Schouls</snm>
                  <fnm>LM</fnm>
               </au>
               <au>
                  <snm>Reulen</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Duim</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Wagenaar</snm>
                  <fnm>JA</fnm>
               </au>
               <au>
                  <snm>Willems</snm>
                  <fnm>RJ</fnm>
               </au>
               <au>
                  <snm>Dingle</snm>
                  <fnm>KE</fnm>
               </au>
               <au>
                  <snm>Colles</snm>
                  <fnm>FM</fnm>
               </au>
               <au>
                  <snm>Van Embden</snm>
                  <fnm>JD</fnm>
               </au>
            </aug>
            <source>J Clin Microbiol</source>
            <pubdate>2003</pubdate>
            <volume>41</volume>
            <issue>1</issue>
            <fpage>15</fpage>
            <lpage>26</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">149617</pubid>
                  <pubid idtype="pmpid" link="fulltext">12517820</pubid>
                  <pubid idtype="doi">10.1128/JCM.41.1.15-26.2003</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>Rapid molecular genetic subtyping of serotype M1 group A <it>Streptococcus</it> strains</p>
            </title>
            <aug>
               <au>
                  <snm>Hoe</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Nakashima</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Grigsby</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Pan</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Dou</snm>
                  <fnm>SJ</fnm>
               </au>
               <au>
                  <snm>Naidich</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Garcia</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Kahn</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Bergmire-Sweat</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Musser</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Emerg Infect Dis</source>
            <pubdate>1999</pubdate>
            <volume>5</volume>
            <issue>2</issue>
            <fpage>254</fpage>
            <lpage>263</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">10221878</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B38">
            <title>
               <p>Genus-specific protein binding to the large clusters of DNA repeats (short regularly spaced repeats) present in <it>Sulfolobus</it> genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Peng</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Brugger</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Shen</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>She</snm>
                  <fnm>Q</fnm>
               </au>
               <au>
                  <snm>Garrett</snm>
                  <fnm>RA</fnm>
               </au>
            </aug>
            <source>J Bacteriol</source>
            <pubdate>2003</pubdate>
            <volume>185</volume>
            <issue>8</issue>
            <fpage>2410</fpage>
            <lpage>2417</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">152625</pubid>
                  <pubid idtype="pmpid" link="fulltext">12670964</pubid>
                  <pubid idtype="doi">10.1128/JB.185.8.2410-2417.2003</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
