<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2007-8-4-r61</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Research</dochead>
      <bibl>
         <title>
            <p>Evolutionary conservation of sequence and secondary structures in CRISPR repeats</p>
         </title>
         <aug>
            <au id="A1" ca="yes" ce="yes">
               <snm>Kunin</snm>
               <fnm>Victor</fnm>
               <insr iid="I1"/>
               <email>vkunin@lbl.gov</email>
            </au>
            <au id="A2" ce="yes">
               <snm>Sorek</snm>
               <fnm>Rotem</fnm>
               <insr iid="I1"/>
               <email>rsorek@lbl.gov</email>
            </au>
            <au id="A3">
               <snm>Hugenholtz</snm>
               <fnm>Philip</fnm>
               <insr iid="I1"/>
               <email>phugenholtz@lbl.gov</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>DOE Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA 94598, USA</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2007</pubdate>
         <volume>8</volume>
         <issue>4</issue>
         <fpage>R61</fpage>
         <url>http://genomebiology.com/2007/8/4/R61</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17442114</pubid>
               <pubid idtype="doi">10.1186/gb-2007-8-4-r61</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>9</day>
               <month>10</month>
               <year>2006</year>
            </date>
         </rec>
         <revrec>
            <date>
               <day>24</day>
               <month>1</month>
               <year>2007</year>
            </date>
         </revrec>
         <acc>
            <date>
               <day>18</day>
               <month>04</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>18</day>
               <month>04</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>Kunin et al.; licensee BioMed Central Ltd.</collab>
         <note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <shorttitle>
         <p>Clustered regularly interspaced short palindromic repeat</p>
      </shorttitle>
      <shortabs>
         <p>Categorisation and structural analysis of clustered regularly interspaced short palindromic repeat (CRISPR) sequences from 195 microbial genomes show that repeats from diverse organisms can be grouped based on sequence similarity, and that some groups have pronounced secondary structures with compensatory base changes.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Clustered regularly interspaced short palindromic repeats (CRISPRs) are a novel class of direct repeats, separated by unique spacer sequences of similar length, that are present in approximately 40% of bacterial and most archaeal genomes analyzed to date. More than 40 gene families, called CRISPR-associated sequences (CASs), appear in conjunction with these repeats and are thought to be involved in the propagation and functioning of CRISPRs. It was recently shown that CRISPRs provide acquired resistance against viruses in prokaryotes.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Here we analyze CRISPR repeats identified in 195 microbial genomes and show that they can be organized into multiple clusters based on sequence similarity. Some of the clusters present stable, highly conserved RNA secondary structures, while others lack detectable structures. Stable secondary structures exhibit multiple compensatory base changes in the stem region, indicating evolutionary and functional conservation.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>We show that the repeat-based classification corresponds to, and expands upon, a previously reported CAS gene-based classification, including specific relationships between CRISPR and CAS subtypes.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010014">Microbiology and parasitology</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Clustered regularly interspaced short palindromic repeats (CRISPRs) are repetitive structures in Bacteria and Archaea composed of exact repeat sequences 24 to 48 bases long (herein called repeats) separated by unique spacers of similar length (herein called spacers) <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>. The CRISPR sequences appear to be among the most rapidly evolving elements in the genome, to the point that closely related species and strains, sometimes more than 99% identical at the DNA level, differ in their CRISPR composition <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>.</p>
         <p>Up to 45 gene families, called CRISPR-associated sequences (CASs), appear in conjunction with these repeats and are hypothesized to be responsible for CRISPR propagation and functioning <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>. It has been proposed that CASs can be divided into seven or eight subtypes, according to their operon organization and gene phylogeny <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>. Phylogenetic analysis additionally indicates that CASs have undergone extensive horizontal gene transfer, as very similar CAS genes are found in distantly related organisms <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>. CRISPRs and CASs have been found on mobile genetic elements, such as plasmids, <it>skin </it>mobile elements, and even prophages, suggesting a possible distribution mechanism for the system <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>.</p>
         <p>CRISPRs have been suggested to play roles in replicon partitioning <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, DNA repair <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, regulation <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> and chromosomal rearrangement <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. It was recently reported that the spacers are often highly similar to fragments of extrachromosomal DNA, such as phage or plasmid DNA <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B12">12</abbr></abbrgrp>. It was suggested that the CRISPR/CAS system participates in an antiviral response, probably by an RNA interference-like mechanism. The proposed mechanism for this CRISPR function involves sampling and maintaining a record of invasive DNA elements, and inhibition of gene functions necessary for invasion <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. Indeed, it was recently shown that CRISPRs provide acquired resistance against viruses in prokaryotes <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>.</p>
         <p>Despite in-depth analyses of CASs, the nature of the repeat sequences has not been examined closely. This is presumably because repeats, as short DNA sequences, have less comparative potential than protein-coding genes. Previous studies have noted only that repeats are highly variable, and do not appear to be similar between organisms <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B7">7</abbr></abbrgrp>. However, we show that repeats from diverse organisms can be grouped into clusters based on sequence similarity, and that some clusters have pronounced secondary structures with compensatory base changes. We further show that there is a clear correspondence between CAS subtypes and repeat clusters. Our findings have important implications for CRISPR function and diversity.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>To obtain a set of CRISPR arrays we employed the PILER-CR program <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> on 439 currently available bacterial and archaeal genomes in IMG version 1.50 <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. We found 561 arrays, ranging in size from 3 to 220 repeats, in 195 genomes (44% of the genomes tested). These results are in agreement with the results of Godde <it>et al</it>. <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>, who found CRISPR arrays in 40% of the genomes they tested. Overall, our set of CRISPRs contained 561 repeat sequences (as repeats are generally identical within an array) and 13,372 spacers.</p>
         <p>Repeats were first noticed to be palindromic by Mojica <it>et al</it>. <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, a feature that was subsequently incorporated into the acronym CRISPR <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. We hypothesized that the palindromic signature might be indicative of a functional RNA secondary structure within the repeat. This hypothesis is supported by the experimental demonstration that CRISPRs are transcribed and processed into non-messenger RNAs in several Archaea <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>, indicating that they are active through an RNA intermediate.</p>
         <p>To assess the possibility that CRISPR repeats form stable RNA secondary structures, we used the RNAfold software <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> (see Materials and methods) to predict the intramolecular RNA structure for each of the repeats in our set. This software reports a folding score that reflects the stability of each secondary structure. We compared the stability of the predicted secondary structures of repeats and spacers to that of similarly sized sequences selected randomly from bacterial genomes (Figure <figr fid="F1">1a</figr>). We found that the folding-score distribution of repeats deviates from the scores for random sequences, indicating a tendency of repeats to form stable secondary structures.</p>
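         <p>The random background used in this comparison can be sketched as follows. This is a minimal illustration, assuming one random genomic window is drawn per repeat or spacer length; the function name, the fixed seed and the toy genome string are hypothetical and are not taken from the analysis itself. Folding the sampled windows would still require a structure predictor such as RNAfold, which is not called here.</p>
         <p><![CDATA[
```python
import random


def sample_length_matched(genome, lengths, seed=0):
    """Draw one random window from `genome` for each requested length.

    `genome` is a plain string; `lengths` would hold the lengths of the
    repeats (or spacers) whose folding scores are being compared.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    windows = []
    for n in lengths:
        if n > len(genome):
            raise ValueError("window longer than genome")
        start = rng.randrange(len(genome) - n + 1)
        windows.append(genome[start:start + n])
    return windows


# Toy input only: a made-up sequence standing in for a bacterial genome.
toy_genome = "ACGTTGCA" * 40
background = sample_length_matched(toy_genome, [24, 36, 48])
```
]]></p>
         <p>Matching window lengths to the query lengths matters because folding scores generally depend on sequence length, so an unmatched background would not be comparable.</p>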
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Distributions of folding scores of <b>(a) </b>all CRISPR repeats and all spacers, as compared to random sequences and <b>(b) </b>individual repeat clusters</p>
            </caption>
            <text>
               <p>Distributions of folding scores of <b>(a) </b>all CRISPR repeats and all spacers, as compared to random sequences and <b>(b) </b>individual repeat clusters. X-axis, negative folding scores; Y-axis, fraction (percent) of total.</p>
            </text>
            <graphic file="gb-2007-8-4-r61-1"/>
         </fig>
         <p>The trimodal pattern of the RNA folding distribution for CRISPR repeats (Figure <figr fid="F1">1a</figr>) suggests that they are not homogeneous, and that a large subset form stable secondary structures, in contrast to spacers and random sequences. To identify repeat subtypes we first attempted to align each of the 561 repeats in our set to all other repeats using the Smith-Waterman algorithm <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>. The sequence similarity results were then clustered using the MCL algorithm <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> (see Materials and methods). This procedure generated 33 clusters, 12 of which contained 10 or more members, with the largest cluster (cluster 1) containing 94 repeat sequences. Some clusters contained repeats from organisms as distantly related as Archaea and Bacteria, supporting the inference that CRISPR/CAS systems can be horizontally transferred between microorganisms <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>.</p>
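         <p>The all-against-all comparison step can be sketched as below. This is an illustrative implementation of Smith-Waterman local alignment scoring with a linear gap penalty; the scoring values (match 2, mismatch -1, gap -1) and the function names are assumptions chosen for the example, not the parameters used in this study, and the subsequent MCL clustering is not reproduced.</p>
         <p><![CDATA[
```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def all_vs_all(repeats):
    """Score every unordered pair of repeats.

    `repeats` maps a name to a sequence; the result maps (name_i, name_j)
    to the pairwise score, which could then be passed to a graph
    clustering tool.
    """
    names = sorted(repeats)
    scores = {}
    for x, name_i in enumerate(names):
        for name_j in names[x + 1:]:
            scores[(name_i, name_j)] = smith_waterman_score(
                repeats[name_i], repeats[name_j]
            )
    return scores
```
]]></p>
         <p>With these assumed parameters, two identical four-base sequences score 8, and two sequences that share no base score 0.</p>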
         <p>As an independent measure of the validity of the clustering, we examined the RNA stability scores in each of the MCL-defined clusters (note that RNA stability was not taken into account in the clustering procedure). As seen in Figure <figr fid="F1">1b</figr>, clusters 2 and 3 comprise repeats with consistently high folding scores, indicating pronounced secondary structure. By contrast, clusters 1, 6, 7, 9, 10 and 11 contain repeats with consistently poor folding scores. Clusters 4, 5, 8 and 12 show intermediate folding scores, suggesting they have weaker secondary structures. Together, these groups explain the trimodal distribution observed in Figure <figr fid="F1">1a</figr>. The homogeneity of RNA structure stability scores within each cluster, along with the dramatic difference in scores between clusters, suggests that our clustering method is valid.</p>
         <p>To further explore the observation that repeats form stable RNA secondary structures, we examined sequence alignments of the repeat clusters. CRISPR repeats are generally considered to be highly dissimilar to each other <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>, except for similar repeats in strains of the same species or in closely related species <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. However, repeats within the clusters we generated, although often containing sequences from vastly different phylogenetic groups, were generally more similar to each other and hence alignable. Figure <figr fid="F2">2a</figr> presents a multiple alignment of a subset of the repeats in cluster 3. A highly stable stem-loop structure was consistently predicted for repeats in this cluster by RNAfold <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> (Figure <figr fid="F1">1b</figr>). Notably, substitutions in the predicted stem structure are consistently accompanied by compensatory changes that preserve the base pairing (Figure <figr fid="F2">2a</figr>). This mutational pattern, together with the presence of G:U base pairs (Figure <figr fid="F2">2a</figr>), is typical of conserved RNA secondary structures and highlights the importance of the stem-loop in the repeats for the functionality of CRISPRs.</p>
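         <p>The notion of a compensatory change can be made concrete with a small sketch. It assumes that the two positions forming a stem base pair are already known, and it classifies a variant relative to a reference repeat; the function names, category labels and toy sequences are hypothetical and are not taken from the alignment in Figure 2.</p>
         <p><![CDATA[
```python
# Watson-Crick pairs plus the G:U wobble pair, written for RNA.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(x, y):
    """True if bases x and y can form a canonical or wobble pair."""
    return (x, y) in PAIRS


def classify_stem_change(ref, var, i, j):
    """Compare stem positions i and j (0-based) of a variant with a reference.

    Returns "unchanged", "compensatory" (both bases changed, pairing kept),
    "single-preserving" (one base changed, pairing kept, e.g. via G:U),
    or "disruptive" (pairing lost).
    """
    ref_pair = (ref[i], ref[j])
    var_pair = (var[i], var[j])
    if var_pair == ref_pair:
        return "unchanged"
    if not can_pair(*var_pair):
        return "disruptive"
    both_changed = var[i] != ref[i] and var[j] != ref[j]
    return "compensatory" if both_changed else "single-preserving"


# Toy hairpin: stem at positions 0-2 and 7-9, loop at positions 3-6.
reference = "GCCUUCGGGC"
print(classify_stem_change(reference, "GUCUUCGGAC", 1, 8))  # C:G replaced by U:A
print(classify_stem_change(reference, "GCUUUCGGGC", 2, 7))  # C:G replaced by U:G
```
]]></p>
         <p>In this toy case the first variant is reported as compensatory and the second as single-preserving, the latter relying on a G:U wobble pair.</p>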
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Evidence for secondary structure in cluster 3</p>
            </caption>
            <text>
               <p>Evidence for secondary structure in cluster 3. <b>(a) </b>Multiple alignment of a subset (for clarity) of repeats in cluster 3. Numbers 1 to 7 and 7 to 1 indicate the residues involved in stem base-pairing; some compensatory mutations in the stem are highlighted with circles. Note G:U base pairing at position 5 in <it>Xanthomonas oryzae </it>and relaxed conservation of loop residues, typical of RNA secondary structure in which the structure is functional rather than the sequence. <b>(b) </b>Sequence logo for all repeats in cluster 3. <b>(c) </b>Predicted secondary structure of <it>Syntrophus aciditrophicus </it>repeat using RNAfold. Stem positions are numbered in accordance with the alignment.</p>
            </text>
            <graphic file="gb-2007-8-4-r61-2"/>
         </fig>
         <p>A summary of the repeat similarity space is presented in Figure <figr fid="F3">3</figr>. As with cluster 3 (Figure <figr fid="F2">2</figr>), repeats in other clusters with high and intermediate folding scores also form stem-loop structures (Figure <figr fid="F3">3</figr>) and display compensatory mutations, suggesting stable structures. While the stem-loop motif is seen in all of these clusters, the actual sequence, the length of the stem, its position relative to the unstructured region, and the size of the unstructured sequence vary between clusters. For example, while the stem in cluster 4 is typically 5 bp long and is found in the middle of the repeat, the stem in cluster 3 is typically 7 bp long, and is found towards the 5' end of the repeat (Figures <figr fid="F2">2</figr> and <figr fid="F3">3</figr>). The difference in calculated folding scores between clusters with high and intermediate scores is likely to be due to the stem length and the frequency of GC as opposed to AT base pairings. Consistent with previous reports <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>, many repeat clusters have a conserved 3' terminus of GAAA(C/G), possibly acting as a binding site for one of the conserved CAS proteins.</p>
         <fig id="F3">
            <title>
               <p>Figure 3</p>
            </title>
            <caption>
               <p>The sequence similarity space of CRISPR repeats visualized with the BioLayout (Java) program [26]</p>
            </caption>
            <text>
               <p>The sequence similarity space of CRISPR repeats visualized with the BioLayout (Java) program [26]. Dots denote individual repeat sequences; connecting lines represent Smith-Waterman similarities, such that closer dots represent more similar sequences. Dot colors denote cluster association as derived from MCL clustering. The 12 largest clusters are indicated by circles together with their sequence logos, coarse phylogenetic composition, and sample secondary structures where applicable.</p>
            </text>
            <graphic file="gb-2007-8-4-r61-3"/>
         </fig>
         <p>Two recent studies identified between 20 and 45 gene families of CASs <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>. Based on the tendency of CAS genes to appear together, Haft <it>et al</it>. <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> defined eight CAS subtypes (named Ecoli, Ypest, Nmeni, Dvulg, Tneap, Hmari, Apern and Mtube). We sought to determine whether our CRISPR repeat clusters corresponded to particular CAS subtypes. For this, we searched 20 kb of sequence flanking each side of the repeat array for CAS genes using the TIGRFAM hidden Markov models (HMMs) for the 45 CAS families defined by Haft <it>et al</it>. <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>.</p>
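         <p>The definition of the search region can be sketched as follows. This is a simplified illustration, assuming 0-based half-open coordinates on a linear contig and windows clipped at the contig ends; wrap-around on circular chromosomes is ignored, and the function names and gene tuples are hypothetical. The HMM search itself is not shown.</p>
         <p><![CDATA[
```python
def flanking_windows(array_start, array_end, contig_length, flank=20_000):
    """Return (upstream, downstream) windows around a CRISPR array.

    Each window is a 0-based half-open (start, end) tuple, clipped so that
    it never extends beyond the contig.
    """
    upstream = (max(0, array_start - flank), array_start)
    downstream = (array_end, min(contig_length, array_end + flank))
    return upstream, downstream


def genes_in_windows(genes, windows):
    """Return names of genes lying entirely inside any of the windows.

    `genes` is an iterable of (name, start, end) tuples.
    """
    hits = []
    for name, start, end in genes:
        if any(start >= w_start and w_end >= end for w_start, w_end in windows):
            hits.append(name)
    return hits


# Toy example: an array at 5,000-7,000 on a 100,000-base contig.
windows = flanking_windows(5_000, 7_000, 100_000)
candidates = genes_in_windows(
    [("geneA", 1_000, 2_000), ("geneB", 40_000, 41_000), ("geneC", 8_000, 9_000)],
    windows,
)
```
]]></p>
         <p>In the toy example the upstream window is clipped at the start of the contig, and only the two genes within 20 kb of the array are retained as candidates for the HMM search.</p>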
         <p>We found that the Ecoli CAS subtype genes appear exclusively in the proximity of structured repeat cluster 2, and, similarly, the Dvulg and Ypest CAS subtypes correspond strictly to our structured clusters 3 and 4, respectively (Table <tblr tid="T1">1</tblr> and Table S1 in Additional data file 1). Presumably, specific and different sets of genes are needed in order to recognize, bind and process the different repeat types. Despite the overall pronounced correspondence between the CAS subtypes and repeat clusters, particularly for structured clusters, there are notable exceptions. For example, the reported frequent co-occurrence of the Mtube subtype with other CAS subtypes <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> is consistent with its promiscuous association with numerous repeat clusters (Table <tblr tid="T1">1</tblr>). Another interesting exception is the co-occurrence of the Tneap and Apern subtypes in the <it>Thermococcus kodakaraensis </it>genome with cluster 6, which is apparently due to a fusion of the Tneap and Apern subtypes (Figure S1 and Table S1 in Additional data file 1). This genome contains three CRISPR arrays, all with identical repeat sequences classified as cluster 6 (Table S1 in Additional data file 1). In some cases the CAS subtype for one or more repeat cluster members differs from the consensus for that cluster (Table S1 in Additional data file 1), suggesting that the association between CRISPR repeat subtypes and CAS subtypes is somewhat flexible.</p>
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>Occurrence of CAS subtypes in the proximity (&#177; 20 kb) of the 12 largest repeat clusters</p>
            </caption>
            <tblbdy cols="13">
               <r>
                  <c>
                     <p/>
                  </c>
                  <c cspan="12" ca="center">
                     <p>Repeat cluster</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c cspan="12">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>CAS subtype</p>
                  </c>
                  <c ca="center">
                     <p>1</p>
                  </c>
                  <c ca="center">
                     <p>2</p>
                  </c>
                  <c ca="center">
                     <p>3</p>
                  </c>
                  <c ca="center">
                     <p>4</p>
                  </c>
                  <c ca="center">
                     <p>5</p>
                  </c>
                  <c ca="center">
                     <p>6</p>
                  </c>
                  <c ca="center">
                     <p>7</p>
                  </c>
                  <c ca="center">
                     <p>8</p>
                  </c>
                  <c ca="center">
                     <p>9</p>
                  </c>
                  <c ca="center">
                     <p>10</p>
                  </c>
                  <c ca="center">
                     <p>11</p>
                  </c>
                  <c ca="center">
                     <p>12</p>
                  </c>
               </r>
               <r>
                  <c cspan="13">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Ecoli</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>X</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Ypest</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>X</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Nmeni</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>X</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Dvulg</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>X</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Tneap</p>
                  </c>
                  <c ca="center">
                     <p>X</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>X</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Hmari</p>
                  </c>
                  <c ca="center">
                     <p>X</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>X</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Apern</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>F</p>
                  </c>
                  <c ca="center">
                     <p>X</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>X</p>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Mtube</p>
                  </c>
                  <c ca="center">
                     <p>X</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>X</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>X</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>X</p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p>CAS subtypes are as defined in [5]. Associations are indicated by an X. An instance of a putative fusion between two CAS subtypes is indicated by an F.</p>
            </tblfn>
         </tbl>
         <p>We also identified a repeat cluster (cluster 5) that is not associated with any of the recognized CAS subtypes. We found that it is associated with most of the core CASs (cas1-4 and cas6), but lacks any of the additional type-defining genes. Cluster 5 occurs exclusively in genomes that contain other CRISPR repeat subtypes and it is possible that it employs at least part of their CAS machinery.</p>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>This study shows that CRISPR repeats are not structurally homogeneous and can be divided into distinct types based on sequence similarity and ability to form stable secondary structures. This explains why previous attempts to align all repeats resulted in a poorly defined consensus sequence <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. We observed compensatory base changes in the stems of the structured repeat clusters, including G:U base pairs, indicating that the CRISPR system likely functions through an RNA intermediate.</p>
         <p>Some clusters, such as clusters 2, 3 and 4, are discrete in the sequence similarity space, whereas the boundaries of others, such as clusters 1, 6 and 7, were not clearly defined. The discrete clusters were generally composed of structure-forming repeats, and the less well-defined clusters were composed of unstructured repeats. This may be a reflection of the greater evolutionary constraints on the stem structure.</p>
         <p>The inference of stem-loop formation within individual CRISPR repeats is in contrast to the speculation that pairs of repeats form duplexes and are subsequently cleaved to release spacers <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. Such hypothesized duplexing would be unlikely to require the ubiquitous presence of the less conserved interior nucleotides, which would form a loop in the single repeat folding model (Figure <figr fid="F2">2</figr>) and an unpaired bulge in the duplex repeat folding model. A CRISPR array in <it>Sulfolobus </it>is transcribed and processed into 60-nucleotide-long non-messenger RNAs, a size consistent with a single repeat-spacer unit <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B21">21</abbr></abbrgrp>, supporting the argument that transcribed spacers remain associated with their repeats. The repeats may serve to mediate contact between the spacer-targeted foreign RNA or DNA and CAS-encoded proteins. A stem-loop structure of some repeats may have evolved to facilitate recognition <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> by RNA-binding CAS-encoded proteins, although unstructured <it>Sulfolobus </it>repeats (in cluster 7; Figure <figr fid="F3">3</figr>) have been shown to bind via a sequence-specific interaction to a genus-specific protein <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. This may partly explain the sequence conservation observed in unstructured repeats.</p>
         <p>A previous report suggested that spacer regions contribute to the formation of secondary structures in CRISPR arrays <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. However, we could not detect a significant deviation of spacer secondary structures from random sequences (Figure <figr fid="F1">1</figr>), indicating that spacers are unlikely to be selected based on their secondary structure. In fact, the spacers appear to have slightly weaker structures than random sequences. This is probably due to the AT richness of spacers (46% GC) relative to average bacterial genomic sequences (53% GC), as AT base pairs form less stable structures than GC pairs. The lower spacer GC content is consistent with a proposed viral origin of spacer sequences <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>, as viruses are, on average, 7% lower in GC content than bacteria.</p>
         <p>Previous attempts to classify CRISPR/CAS systems were based on CAS gene content and phylogeny (mostly of cas1) <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>. We add a further dimension to this classification by showing that the repeat sequence itself is also a classifying feature. This can be advantageous in instances where CRISPR arrays occur in the absence of CAS genes. For example, <it>Thermoplasma acidophilum </it>contains a CRISPR array but lacks CAS genes <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, so it cannot be classified based on CASs. Our clustering indicates that the <it>T</it>. <it>acidophilum </it>repeat belongs to (euryarchaeal) cluster 6 (Figure <figr fid="F3">3</figr>; Table S1 in Additional data file 1). In some instances, the repeat classification was able to provide higher resolution than the existing CAS classification. For example, the Nmeni subtype was reported to have an optional gene <it>csn2 </it><abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. Our clustering divides this subtype into three clusters (10, 16 and 22). The <it>csn2 </it>gene is invariably present in one cluster (cluster 10) and absent in the other two. The finding of a repeat cluster (cluster 5) that cannot be readily resolved by associated CAS genes (see Results) further demonstrates the power of CRISPR-based classification.</p>
         <p>The significant differences between CRISPR/CAS subtypes, both in CRISPR repeat sequence and structure and in CAS gene content and phylogeny, raise the possibility that the subtypes also differ functionally. This hypothesis may be supported by the observations that several CRISPR/CAS subtypes are frequently found in the same genome and that at least four functions have been hypothesized for these elements (host cell defense <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>, regulation <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, chromosomal segregation <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> and rearrangement <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>). The study of CRISPRs is in its infancy, and their mode of action and function are still highly speculative. Our results provide another step toward a comprehensive understanding of these intriguing elements.</p>
      </sec>
      <sec>
         <st>
            <p>Materials and methods</p>
         </st>
         <sec>
            <st>
               <p>Identification of CRISPR arrays</p>
            </st>
            <p>All genome sequences available through the IMG database version 1.50 <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> were analyzed for CRISPR arrays using the PILER-CR program <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>Delineation of repeat clusters</p>
            </st>
            <p>Pairwise similarities between repeats were calculated using an in-house implementation of the Smith-Waterman algorithm <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>. For each repeat pair, the best-scoring similarity of the two possible orientations was retained, and only scores &gt;7 were used for further analysis. Clustering of pairwise similarities was performed using the MCL program with default parameters <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. Multiple alignments were performed using MUSCLE <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>, and the alignments were manually curated, including removal of outliers. Sequence logos for each cluster were generated using WebLogo <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. The similarity space of repeats was visualized using BioLayout (Java) <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. The sequences of the repeats, the assignments to clusters and the multiple alignments are provided as Additional data file 1.</p>
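            <p>The orientation-aware scoring and thresholding step can be sketched as follows. This is a minimal illustration only, not the in-house implementation: the scoring parameters (match +1, mismatch -1, linear gap -1) and all function names are assumptions, as the values actually used are not stated here.</p>
            <p>
```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq.upper()))


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score with a linear gap penalty (assumed parameters)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def best_orientation_score(a, b):
    """Score b against a in both orientations and keep the better score."""
    return max(smith_waterman_score(a, b),
               smith_waterman_score(a, reverse_complement(b)))


def similarity_edges(repeats, min_score=7):
    """Yield (id1, id2, score) for repeat pairs scoring above min_score."""
    ids = sorted(repeats)
    for x, id1 in enumerate(ids):
        for id2 in ids[x + 1:]:
            score = best_orientation_score(repeats[id1], repeats[id2])
            if score > min_score:
                yield id1, id2, score
```
            </p>
            <p>The resulting list of weighted pairs is the kind of input a graph clustering program could take; the exact input format used for MCL is not specified in this section.</p>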
         </sec>
         <sec>
            <st>
               <p>Determining orientation of repeats</p>
            </st>
            <p>The PILER-CR program provides an arbitrary orientation for the repeats. To determine the correct orientation, we compared each repeat to the ones found experimentally to be transcribed into RNA <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B21">21</abbr></abbrgrp>, assuming that the transcribed direction is the 'correct' direction. The direction most similar to the transcribed repeats (using Smith-Waterman similarity scores <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>) was selected as the correct one. We also used the GAAA(C/G) signature at the end of some repeats in cases where the Smith-Waterman similarity scores were ambiguous. It is possible, therefore, that some repeats may be presented in the wrong orientation.</p>
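            <p>A minimal sketch of this orientation rule is given below. It is illustrative only: the similarity function is passed in as a parameter, 'ambiguous' is interpreted here as an exact tie between the two orientations, and the function names are hypothetical. None of these details is taken from the original analysis.</p>
            <p>
```python
import re

# GAAA followed by C or G at the very end of the repeat.
SIGNATURE = re.compile(r"GAAA[CG]$")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq.upper()))


def orient_repeat(repeat, references, score_fn):
    """Pick the orientation most similar to the transcribed reference repeats.

    score_fn(a, b) is any pairwise similarity function (assumption);
    ties are resolved with the terminal GAAA(C/G) signature.
    """
    fwd = repeat.upper()
    rev = reverse_complement(fwd)
    fwd_score = max(score_fn(fwd, ref) for ref in references)
    rev_score = max(score_fn(rev, ref) for ref in references)
    if fwd_score != rev_score:
        return fwd if fwd_score > rev_score else rev
    # Ambiguous scores: fall back on the signature at the repeat end.
    if SIGNATURE.search(rev) and not SIGNATURE.search(fwd):
        return rev
    return fwd
```
            </p>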
         </sec>
         <sec>
            <st>
               <p>Determination of repeat secondary structures</p>
            </st>
            <p>Structural predictions were performed using the Vienna RNA Package <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> downloaded from the Vienna Package server <abbrgrp><abbr bid="B28">28</abbr><abbr bid="B29">29</abbr></abbrgrp>. Folding scores for all repeats or individual repeat clusters were divided into bins of 2 score units and plotted as percentages. Random sequence strings with the same length distribution as repeats were generated from the analyzed genomes. The average GC contents were calculated for archaeal, bacterial and viral genomes in the IMG database, version 1.50, and the average GC content was calculated for all spacers in all genomes.</p>
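            <p>The binning of folding scores and the GC content calculation can be sketched as follows, assuming the folding scores are already available as a list of numbers. The function names and the choice to label each bin by its lower edge are assumptions made for illustration.</p>
            <p>
```python
import math
from collections import Counter


def gc_content(seq):
    """Fraction of G and C among unambiguous (A, C, G, T/U) positions."""
    seq = seq.upper()
    gc = sum(seq.count(b) for b in "GC")
    total = gc + sum(seq.count(b) for b in "ATU")
    return gc / total if total else 0.0


def bin_scores(scores, width=2.0):
    """Map each bin's lower edge to the percentage of scores in that bin.

    A bin covers the half-open interval from edge up to edge + width.
    """
    counts = Counter(math.floor(s / width) * width for s in scores)
    n = len(scores)
    return {edge: 100.0 * c / n for edge, c in sorted(counts.items())}
```
            </p>
            <p>Applying the same binning to repeats and to length-matched random sequences gives two percentage distributions that can be plotted on a common axis.</p>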
         </sec>
         <sec>
            <st>
               <p>CAS gene identification</p>
            </st>
            <p>The HMMs for CAS genes described in <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> were obtained from the TIGRFAM database, version 6.0 <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. To identify CAS genes, all coding sequences within 20 kb of the identified CRISPR arrays were searched with the CAS HMMs using hmmpfam <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> with the thresholds of an e-value &lt;0.001 and a positive score.</p>
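            <p>The filtering criteria above can be expressed as a short sketch. The record layout, the function names and the convention of measuring distance between the nearest ends of the gene and the array are assumptions; parsing of hmmpfam output is not shown.</p>
            <p>
```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One HMM hit to a coding sequence (hypothetical record layout)."""
    gene_id: str
    start: int
    end: int
    model: str
    evalue: float
    score: float


def interval_gap(start1, end1, start2, end2):
    """Distance in bp between two intervals; 0 if they overlap."""
    return max(0, start2 - end1, start1 - end2)


def cas_candidates(hits, array_start, array_end, window=20000, max_evalue=0.001):
    """Keep hits near the array with e-value below max_evalue and a positive score."""
    kept = []
    for hit in hits:
        near = window >= interval_gap(hit.start, hit.end, array_start, array_end)
        significant = max_evalue > hit.evalue and hit.score > 0
        if near and significant:
            kept.append(hit)
    return kept
```
            </p>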
         </sec>
      </sec>
      <sec>
         <st>
            <p>Additional data files</p>
         </st>
         <p>The following additional data are available with the online version of this paper. Additional data file <supplr sid="S1">1</supplr> contains several files showing alignments of clusters 1-12, the arrangement of the CAS cassette in the <it>Thermococcus kodakaraensis </it>genome, and CAS genes in the neighborhood of CRISPR arrays as predicted by TIGRFAM, as well as an index of organisms used in the study, a sequence fasta file containing all repeats, and a description of automatic assignment of repeats to clusters with MCL. Some files may be Mac-formatted.</p>
         <suppl id="S1">
            <title>
               <p>Additional data file 1</p>
            </title>
            <caption>
               <p>Alignments of clusters 1-12, the arrangement of the CAS cassette in the <it>Thermococcus kodakaraensis </it>genome, CAS genes in the neighborhood of CRISPR arrays as predicted by TIGRFAM, an index of organisms used in the study, a sequence fasta file containing all repeats, and a description of automatic assignment of repeats to clusters with MCL</p>
            </caption>
            <text>
               <p>Readme.txt contains a description of the files in the archive. Alignments is a directory containing manually curated fasta alignments of clusters 1-12. FigureS1.png contains a figure showing the arrangement of the CAS cassette in the <it>Thermococcus kodakaraensis </it>genome. Chromosomal coordinates are given at the top of the figure. A CRISPR array is shown to the left of the figure as red vertical lines (1 line = 5 repeats). Core CAS genes are shown in black, Apern subtype genes are shown in blue and Tneap subtype genes in red as predicted by TIGRFAM analysis (see Materials and methods). TableS1.xls is an Excel-formatted table containing CAS genes in the neighborhood of CRISPR arrays, as predicted by TIGRFAM (see Materials and methods). Core and type-specific genes are indicated, and each genome is given with both its full name and an IMG accession code. IMG gene OIDs are given for each protein. Organisms.index is a table containing an index of organisms used in the study. Repeats.fasta is a sequence fasta file containing all repeats. Repeats.mcl describes automatic assignment of repeats to clusters with MCL. Each line contains a cluster number followed by space-separated member repeats. Some files may be Mac-formatted.</p>
            </text>
            <file name="gb-2007-8-4-r61-S1.gz">
               <p>Click here for file</p>
            </file>
         </suppl>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We thank two anonymous reviewers for their detailed and informative feedback on this manuscript. This work was performed under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program, and by the University of California, Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48, Lawrence Berkeley National Laboratory under contract No. DE-AC02-05CH11231 and Los Alamos National Laboratory under contract No. DE-AC02-06NA25396.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Long stretches of short tandem repeats are present in the largest replicons of the Archaea <it>Haloferax mediterranei </it>and <it>Haloferax volcanii </it>and could be involved in replicon partitioning.</p>
            </title>
            <aug>
               <au>
                  <snm>Mojica</snm>
                  <fnm>FJ</fnm>
               </au>
               <au>
                  <snm>Ferrer</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Juez</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Rodriguez-Valera</snm>
                  <fnm>F</fnm>
               </au>
            </aug>
            <source>Mol Microbiol</source>
            <pubdate>1995</pubdate>
            <volume>17</volume>
            <fpage>85</fpage>
            <lpage>93</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1111/j.1365-2958.1995.mmi_17010085.x</pubid>
                  <pubid idtype="pmpid">7476211</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Identification of genes that are associated with DNA repeats in prokaryotes.</p>
            </title>
            <aug>
               <au>
                  <snm>Jansen</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Embden</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Gaastra</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Schouls</snm>
                  <fnm>LM</fnm>
               </au>
            </aug>
            <source>Mol Microbiol</source>
            <pubdate>2002</pubdate>
            <volume>43</volume>
            <fpage>1565</fpage>
            <lpage>1575</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1046/j.1365-2958.2002.02839.x</pubid>
                  <pubid idtype="pmpid" link="fulltext">11952905</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>CRISPR elements in <it>Yersinia pestis </it>acquire new repeats by preferential uptake of bacteriophage DNA, and provide additional tools for evolutionary studies.</p>
            </title>
            <aug>
               <au>
                  <snm>Pourcel</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Salvignol</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Vergnaud</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Microbiology</source>
            <pubdate>2005</pubdate>
            <volume>151</volume>
            <fpage>653</fpage>
            <lpage>663</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1099/mic.0.27437-0</pubid>
                  <pubid idtype="pmpid" link="fulltext">15758212</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Complete sequence and comparative genome analysis of the dairy bacterium <it>Streptococcus thermophilus</it>.</p>
            </title>
            <aug>
               <au>
                  <snm>Bolotin</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Quinquis</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Renault</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Sorokin</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Ehrlich</snm>
                  <fnm>SD</fnm>
               </au>
               <au>
                  <snm>Kulakauskas</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Lapidus</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Goltsman</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Mazur</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Pusch</snm>
                  <fnm>GD</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nat Biotechnol</source>
            <pubdate>2004</pubdate>
            <volume>22</volume>
            <fpage>1554</fpage>
            <lpage>1558</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nbt1034</pubid>
                  <pubid idtype="pmpid" link="fulltext">15543133</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>A guild of 45 CRISPR-associated (Cas) protein families and multiple CRISPR/Cas subtypes exist in prokaryotic genomes.</p>
            </title>
            <aug>
               <au>
                  <snm>Haft</snm>
                  <fnm>DH</fnm>
               </au>
               <au>
                  <snm>Selengut</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Mongodin</snm>
                  <fnm>EF</fnm>
               </au>
               <au>
                  <snm>Nelson</snm>
                  <fnm>KE</fnm>
               </au>
            </aug>
            <source>PLoS Comput Biol</source>
            <pubdate>2005</pubdate>
            <volume>1</volume>
            <fpage>e60</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1282333</pubid>
                  <pubid idtype="pmpid" link="fulltext">16292354</pubid>
                  <pubid idtype="doi">10.1371/journal.pcbi.0010060</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action.</p>
            </title>
            <aug>
               <au>
                  <snm>Makarova</snm>
                  <fnm>KS</fnm>
               </au>
               <au>
                  <snm>Grishin</snm>
                  <fnm>NV</fnm>
               </au>
               <au>
                  <snm>Shabalina</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>Wolf</snm>
                  <fnm>YI</fnm>
               </au>
               <au>
                  <snm>Koonin</snm>
                  <fnm>EV</fnm>
               </au>
            </aug>
            <source>Biol Direct</source>
            <pubdate>2006</pubdate>
            <volume>1</volume>
            <fpage>7</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1462988</pubid>
                  <pubid idtype="pmpid" link="fulltext">16545108</pubid>
                  <pubid idtype="doi">10.1186/1745-6150-1-7</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>The repetitive DNA elements called CRISPRs and their associated genes: evidence of horizontal transfer among prokaryotes.</p>
            </title>
            <aug>
               <au>
                  <snm>Godde</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Bickerton</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>J Mol Evol</source>
            <pubdate>2006</pubdate>
            <volume>62</volume>
            <fpage>718</fpage>
            <lpage>729</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/s00239-005-0223-z</pubid>
                  <pubid idtype="pmpid" link="fulltext">16612537</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>The multidrug-resistant human pathogen <it>Clostridium difficile </it>has a highly mobile, mosaic genome.</p>
            </title>
            <aug>
               <au>
                  <snm>Sebaihia</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Wren</snm>
                  <fnm>BW</fnm>
               </au>
               <au>
                  <snm>Mullany</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Fairweather</snm>
                  <fnm>NF</fnm>
               </au>
               <au>
                  <snm>Minton</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Stabler</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Thomson</snm>
                  <fnm>NR</fnm>
               </au>
               <au>
                  <snm>Roberts</snm>
                  <fnm>AP</fnm>
               </au>
               <au>
                  <snm>Cerdeno-Tarraga</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>H</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2006</pubdate>
            <volume>38</volume>
            <fpage>779</fpage>
            <lpage>786</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/ng1830</pubid>
                  <pubid idtype="pmpid" link="fulltext">16804543</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Genomic comparison of archaeal conjugative plasmids from <it>Sulfolobus</it>.</p>
            </title>
            <aug>
               <au>
                  <snm>Greve</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Jensen</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Brugger</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Zillig</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Garrett</snm>
                  <fnm>RA</fnm>
               </au>
            </aug>
            <source>Archaea</source>
            <pubdate>2004</pubdate>
            <volume>1</volume>
            <fpage>231</fpage>
            <lpage>239</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15810432</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>A DNA repair system specific for thermophilic Archaea and bacteria predicted by genomic context analysis.</p>
            </title>
            <aug>
               <au>
                  <snm>Makarova</snm>
                  <fnm>KS</fnm>
               </au>
               <au>
                  <snm>Aravind</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Grishin</snm>
                  <fnm>NV</fnm>
               </au>
               <au>
                  <snm>Rogozin</snm>
                  <fnm>IB</fnm>
               </au>
               <au>
                  <snm>Koonin</snm>
                  <fnm>EV</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>482</fpage>
            <lpage>496</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">99818</pubid>
                  <pubid idtype="pmpid" link="fulltext">11788711</pubid>
                  <pubid idtype="doi">10.1093/nar/30.2.482</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Chromosome evolution in the Thermotogales: large-scale inversions and strain diversification of CRISPR sequences.</p>
            </title>
            <aug>
               <au>
                  <snm>DeBoy</snm>
                  <fnm>RT</fnm>
               </au>
               <au>
                  <snm>Mongodin</snm>
                  <fnm>EF</fnm>
               </au>
               <au>
                  <snm>Emerson</snm>
                  <fnm>JB</fnm>
               </au>
               <au>
                  <snm>Nelson</snm>
                  <fnm>KE</fnm>
               </au>
            </aug>
            <source>J Bacteriol</source>
            <pubdate>2006</pubdate>
            <volume>188</volume>
            <fpage>2364</fpage>
            <lpage>2374</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1428405</pubid>
                  <pubid idtype="pmpid" link="fulltext">16547022</pubid>
                  <pubid idtype="doi">10.1128/JB.188.7.2364-2374.2006</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements.</p>
            </title>
            <aug>
               <au>
                  <snm>Mojica</snm>
                  <fnm>FJ</fnm>
               </au>
               <au>
                  <snm>Diez-Villasenor</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Garcia-Martinez</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Soria</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>J Mol Evol</source>
            <pubdate>2005</pubdate>
            <volume>60</volume>
            <fpage>174</fpage>
            <lpage>182</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/s00239-004-0046-3</pubid>
                  <pubid idtype="pmpid" link="fulltext">15791728</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>CRISPR provides acquired resistance against viruses in prokaryotes.</p>
            </title>
            <aug>
               <au>
                  <snm>Barrangou</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Fremaux</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Deveau</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Richards</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Boyaval</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Moineau</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Romero</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Horvath</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2007</pubdate>
            <volume>315</volume>
            <fpage>1709</fpage>
            <lpage>1712</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.1138140</pubid>
                  <pubid idtype="pmpid" link="fulltext">17379808</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>PILER-CR: Fast and accurate identification of CRISPR repeats.</p>
            </title>
            <aug>
               <au>
                  <snm>Edgar</snm>
                  <fnm>RC</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <fpage>18</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1790904</pubid>
                  <pubid idtype="pmpid" link="fulltext">17239253</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-8-18</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>An experimental metagenome data management and analysis system.</p>
            </title>
            <aug>
               <au>
                  <snm>Markowitz</snm>
                  <fnm>VM</fnm>
               </au>
               <au>
                  <snm>Ivanova</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Palaniappan</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Szeto</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Korzeniewski</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Lykidis</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Anderson</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Mavrommatis</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Kunin</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Garcia Martin</snm>
                  <fnm>H</fnm>
               </au>
               <etal/>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>22</volume>
            <fpage>e359</fpage>
            <lpage>367</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btl217</pubid>
                  <pubid idtype="pmpid" link="fulltext">16873494</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Biological significance of a family of regularly spaced repeats in the genomes of Archaea, Bacteria and mitochondria.</p>
            </title>
            <aug>
               <au>
                  <snm>Mojica</snm>
                  <fnm>FJ</fnm>
               </au>
               <au>
                  <snm>Diez-Villasenor</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Soria</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Juez</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Mol Microbiol</source>
            <pubdate>2000</pubdate>
            <volume>36</volume>
            <fpage>244</fpage>
            <lpage>246</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1046/j.1365-2958.2000.01838.x</pubid>
                  <pubid idtype="pmpid" link="fulltext">10760181</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Identification of 86 candidates for small non-messenger RNAs from the archaeon <it>Archaeoglobus fulgidus</it>.</p>
            </title>
            <aug>
               <au>
                  <snm>Tang</snm>
                  <fnm>TH</fnm>
               </au>
               <au>
                  <snm>Bachellerie</snm>
                  <fnm>JP</fnm>
               </au>
               <au>
                  <snm>Rozhdestvensky</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Bortolin</snm>
                  <fnm>ML</fnm>
               </au>
               <au>
                  <snm>Huber</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Drungowski</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Elge</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Brosius</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Huttenhofer</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2002</pubdate>
            <volume>99</volume>
            <fpage>7536</fpage>
            <lpage>7541</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">124276</pubid>
                  <pubid idtype="pmpid" link="fulltext">12032318</pubid>
                  <pubid idtype="doi">10.1073/pnas.112047299</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Fast folding and comparison of RNA secondary structures.</p>
            </title>
            <aug>
               <au>
                  <snm>Hofacker</snm>
                  <fnm>IL</fnm>
               </au>
               <au>
                  <snm>Fontana</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Stadler</snm>
                  <fnm>PF</fnm>
               </au>
               <au>
                  <snm>Bonhoeffer</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Tacker</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Schuster</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Monatsh Chem</source>
            <pubdate>1994</pubdate>
            <volume>125</volume>
            <fpage>167</fpage>
            <lpage>188</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1007/BF00818163</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Identification of common molecular subsequences.</p>
            </title>
            <aug>
               <au>
                  <snm>Smith</snm>
                  <fnm>TF</fnm>
               </au>
               <au>
                  <snm>Waterman</snm>
                  <fnm>MS</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1981</pubdate>
            <volume>147</volume>
            <fpage>195</fpage>
            <lpage>197</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/0022-2836(81)90087-5</pubid>
                  <pubid idtype="pmpid" link="fulltext">7265238</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Graph clustering by flow simulation.</p>
            </title>
            <aug>
               <au>
                  <snm>Van Dongen</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>PhD thesis</source>
            <publisher>University of Utrecht</publisher>
            <pubdate>2000</pubdate>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Identification of novel non-coding RNAs as potential antisense regulators in the archaeon <it>Sulfolobus solfataricus</it>.</p>
            </title>
            <aug>
               <au>
                  <snm>Tang</snm>
                  <fnm>TH</fnm>
               </au>
               <au>
                  <snm>Polacek</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Zywicki</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Huber</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Brugger</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Garrett</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Bachellerie</snm>
                  <fnm>JP</fnm>
               </au>
               <au>
                  <snm>Huttenhofer</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Mol Microbiol</source>
            <pubdate>2005</pubdate>
            <volume>55</volume>
            <fpage>469</fpage>
            <lpage>481</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1111/j.1365-2958.2004.04428.x</pubid>
                  <pubid idtype="pmpid" link="fulltext">15659164</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>RNA-protein complexes.</p>
            </title>
            <aug>
               <au>
                  <snm>Cusack</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Curr Opin Struct Biol</source>
            <pubdate>1999</pubdate>
            <volume>9</volume>
            <fpage>66</fpage>
            <lpage>73</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0959-440X(99)80009-8</pubid>
                  <pubid idtype="pmpid" link="fulltext">10400475</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Genus-specific protein binding to the large clusters of DNA repeats (short regularly spaced repeats) present in <it>Sulfolobus</it> genomes.</p>
            </title>
            <aug>
               <au>
                  <snm>Peng</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Brugger</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Shen</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>She</snm>
                  <fnm>Q</fnm>
               </au>
               <au>
                  <snm>Garrett</snm>
                  <fnm>RA</fnm>
               </au>
            </aug>
            <source>J Bacteriol</source>
            <pubdate>2003</pubdate>
            <volume>185</volume>
            <fpage>2410</fpage>
            <lpage>2417</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">152625</pubid>
                  <pubid idtype="pmpid" link="fulltext">12670964</pubid>
                  <pubid idtype="doi">10.1128/JB.185.8.2410-2417.2003</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>MUSCLE: multiple sequence alignment with high accuracy and high throughput.</p>
            </title>
            <aug>
               <au>
                  <snm>Edgar</snm>
                  <fnm>RC</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <fpage>1792</fpage>
            <lpage>1797</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">390337</pubid>
                  <pubid idtype="pmpid" link="fulltext">15034147</pubid>
                  <pubid idtype="doi">10.1093/nar/gkh340</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>WebLogo: a sequence logo generator.</p>
            </title>
            <aug>
               <au>
                  <snm>Crooks</snm>
                  <fnm>GE</fnm>
               </au>
               <au>
                  <snm>Hon</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Chandonia</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Brenner</snm>
                  <fnm>SE</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2004</pubdate>
            <volume>14</volume>
            <fpage>1188</fpage>
            <lpage>1190</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">419797</pubid>
                  <pubid idtype="pmpid" link="fulltext">15173120</pubid>
                  <pubid idtype="doi">10.1101/gr.849004</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>BioLayout(Java): versatile network visualisation of structural and functional relationships.</p>
            </title>
            <aug>
               <au>
                  <snm>Goldovsky</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Cases</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Enright</snm>
                  <fnm>AJ</fnm>
               </au>
               <au>
                  <snm>Ouzounis</snm>
                  <fnm>CA</fnm>
               </au>
            </aug>
            <source>Appl Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>4</volume>
            <fpage>71</fpage>
            <lpage>74</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.2165/00822942-200504010-00009</pubid>
                  <pubid idtype="pmpid">16000016</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure.</p>
            </title>
            <aug>
               <au>
                  <snm>Mathews</snm>
                  <fnm>DH</fnm>
               </au>
               <au>
                  <snm>Sabina</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Zuker</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Turner</snm>
                  <fnm>DH</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1999</pubdate>
            <volume>288</volume>
            <fpage>911</fpage>
            <lpage>940</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1999.2700</pubid>
                  <pubid idtype="pmpid" link="fulltext">10329189</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>RNA Vienna Package</p>
            </title>
            <url>http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi</url>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Vienna RNA secondary structure server.</p>
            </title>
            <aug>
               <au>
                  <snm>Hofacker</snm>
                  <fnm>IL</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <fpage>3429</fpage>
            <lpage>3431</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">169005</pubid>
                  <pubid idtype="pmpid" link="fulltext">12824340</pubid>
                  <pubid idtype="doi">10.1093/nar/gkg599</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>TIGRFAMs Home Page</p>
            </title>
            <url>http://www.tigr.org/TIGRFAMs/</url>
         </bibl>
         <bibl id="B31">
            <title>
               <p>HMMER</p>
            </title>
            <url>http://hmmer.janelia.org/</url>
         </bibl>
      </refgrp>
   </bm>
</art>
