<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-5-152</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Methodology article</dochead>
      <bibl>
         <title>
            <p>Incidence of "quasi-ditags" in catalogs generated by Serial Analysis of Gene Expression (SAGE)</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Anisimov</snm>
               <mi>V</mi>
               <fnm>Sergey</fnm>
               <insr iid="I1"/>
               <email>Sergey.Anisimov@mphy.lu.se</email>
            </au>
            <au id="A2">
               <snm>Sharov</snm>
               <mi>A</mi>
               <fnm>Alexei</fnm>
               <insr iid="I2"/>
               <email>sharoval@grc.nia.nih.gov</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Section for Neuronal Survival, Wallenberg Neuroscience Center, Lund University, 221 84 Lund, Sweden</p>
            </ins>
            <ins id="I2">
               <p>Laboratory of Genetics, National Institute on Aging, National Institutes of Health, Baltimore, MD, 21224, USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2004</pubdate>
         <volume>5</volume>
         <issue>1</issue>
         <fpage>152</fpage>
         <url>http://www.biomedcentral.com/1471-2105/5/152</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">15491492</pubid>
               <pubid idtype="doi">10.1186/1471-2105-5-152</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>27</day>
               <month>5</month>
               <year>2004</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>18</day>
               <month>10</month>
               <year>2004</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>18</day>
               <month>10</month>
               <year>2004</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2004</year>
         <collab>Anisimov and Sharov; licensee BioMed Central Ltd.</collab>
         <note>This is an open-access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Serial Analysis of Gene Expression (SAGE) is a functional genomic technique that quantitatively analyzes the cellular transcriptome. The analysis of SAGE libraries relies on the identification of ditags from sequencing files; however, the software used to examine SAGE libraries cannot distinguish authentic ditags from false ones ("quasi-ditags").</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We provide examples of quasi-ditags that originate from cloning and sequencing artifacts (i.e. genomic contamination or random combinations of nucleotides) that are included in SAGE libraries. We have employed a mathematical model to predict the frequency of quasi-ditags in random nucleotide sequences, and our data show that clones containing less than or equal to 2 ditags (which include chromosomal cloning artifacts) should be excluded from the analysis of SAGE catalogs.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusions</p>
               </st>
               <p>Cloning and sequencing artifacts contaminating SAGE libraries could be eliminated using a simple pre-screening procedure to increase the reliability of the data.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Serial Analysis of Gene Expression (SAGE) is a rapid method to study mRNA transcripts in cell populations <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. Two major principles underlie SAGE: (1) short expressed sequence tags (ESTs) are sufficient to identify individual gene products, and (2) multiple tags can be concatenated and identified by sequence analysis <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>. With the ever-expanding sequence information available in public databases, identification of gene transcripts with SAGE tags has greatly facilitated transcriptome comparisons and gene identification <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>.</p>
         <p>SAGE data are usually analyzed with software packages like "SAGE300" or "SAGE2000". The majority of SAGE libraries use <it>NlaIII </it>or <it>Sau3A </it>(<it>SalAI</it>) as anchoring enzymes (AE) to create SAGE tags. Both of these enzymes have 4-bp palindromic recognition sequences (CATG for <it>NlaIII </it>and GATC for <it>Sau3A</it>) that flank individual ditags within concatemers. A major component of the software analysis is the identification of anchoring enzyme recognition sequences (AERS) that flank target sequences (SAGE ditags). After finding the first AE recognition sequence, the software continues reading the sequence until it finds the next one. The software then compares the distance between these recognition sequences with predicted ditag lengths (20&#8211;24 bp in the case of <it>NlaIII </it>or <it>Sau3A</it>), and ditags that are too short (&lt;20 bp) or too long (>24 bp) are excluded. However, if the length of the AERS-flanked sequence satisfies the size criteria, it is identified as a ditag. This algorithm relies on the assumption that <it>all </it>sequences have correctly organized ditag concatemers; however, the cloning efficiency of SAGE rarely reaches 100%. In this report, we show that up to 5% of ditags from some SAGE libraries should be omitted from the final analysis. These false ditags (termed "quasi-ditags") result from genomic contaminants and apparently random combinations of nucleotides generated by cloning or sequencing errors. Using a mathematical model to simulate the frequency of quasi-ditags in DNA, we propose a method to exclude quasi-ditags from SAGE catalogs.</p>
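         <p>The scanning procedure described above can be sketched as follows. This is a minimal illustration in Python, not the code of the SAGE software packages themselves; the function name <it>find_ditags </it>is hypothetical, and the sketch assumes that ditag length is counted as the number of bases lying strictly between two consecutive recognition sites.</p>
         <p><![CDATA[
```python
AERS = "CATG"  # NlaIII recognition sequence, as given in the text
MIN_LEN, MAX_LEN = 20, 24  # accepted ditag lengths in bp, as given in the text


def find_ditags(sequence, aers=AERS, min_len=MIN_LEN, max_len=MAX_LEN):
    """Return AERS-flanked stretches whose length lies in [min_len, max_len].

    Assumption: the length of a candidate ditag is the number of bases
    strictly between two consecutive recognition sites (sites not counted).
    """
    sequence = sequence.upper()
    ditags = []
    previous_end = None  # index just past the previous recognition site
    start = sequence.find(aers)
    while start != -1:
        if previous_end is not None:
            candidate = sequence[previous_end:start]
            # Too-short and too-long stretches are skipped, as in the text.
            if min_len <= len(candidate) <= max_len:
                ditags.append(candidate)
        previous_end = start + len(aers)
        start = sequence.find(aers, previous_end)
    return ditags
```
]]></p>
         <p>Note that the scan accepts any stretch of suitable length, whatever its origin; nothing in it tests whether the surrounding sequence is an organized concatemer, which is the gap that quasi-ditags exploit.</p>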
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>From twelve independent SAGE libraries, we analyzed numerous clones lacking organized ditag concatemers that would be excluded by SAGE software packages, including clones lacking inserts, clones with inserts containing bacterial or rodent genomic sequences, and clones with unidentifiable sequences (Figure <figr fid="F1">1</figr>). Depending on the quality of the SAGE library, clones like those in Figure <figr fid="F1">1A,1B,1C</figr> can represent up to 50% of the total number of clones sequenced <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, but generally range from 2&#8211;20%. A more typical example is taken from our R1 ES cell and AMH-II SAGE libraries <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>, which contained 5,988 and 4,478 clones, respectively. The cloning efficiency was ~79% and ~76% (4,714 and 3,413 clones with inserts, respectively). Among these, the R1 ES SAGE library contained 411 clones with only 1 ditag and 167 clones with only 2 ditags, and the AMH-II library contained 305 and 194 such clones, respectively.</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Raw SAGE sequence data showing cloning and potential sequencing artifacts excluded by SAGE software</p>
            </caption>
            <text>
               <p>Raw SAGE sequence data showing cloning and potential sequencing artifacts excluded by SAGE software. (A) Clone with a fragment of <it>E. coli </it>genomic DNA. Italics denote <it>E. coli </it>sequence (AE000256). (B) Clone containing a fragment of rodent genomic DNA. Italics denote <it>M. musculus </it>sequence (AI894042). (C) Clone with an unidentifiable insert, which lacks a normal SAGE concatemer. pZErO-1 sequences are underlined; anchoring enzyme recognition sites (AERS<sub>(CATG)</sub>) are shown in bold.</p>
            </text>
            <graphic file="1471-2105-5-152-1"/>
         </fig>
         <p>During our sequence analysis of the clones that had produced the fewest ditags (1&#8211;2 per clone), we identified a subset of sequences (up to 40%) that contain ditags that may be false. Importantly, some of these "ditags" matched bacterial genomic sequences (Figure <figr fid="F2">2A</figr>), while others seemed to represent random combinations of nucleotides. Figure <figr fid="F2">2B</figr> shows an example of a clone that contains a single ditag sequence embedded within a sequence of unidentifiable origin. Because most of this sequence is not composed of concatenated ditags, this embedded ditag may represent a quasi-ditag, which should be excluded from further analysis. These two examples, among others, suggest that some inserts in pZErO-1 contain sequences that mimic SAGE ditags purely by chance.</p>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Raw SAGE sequence data showing cloning and potential sequencing artifacts not excluded by SAGE software</p>
            </caption>
            <text>
               <p>Raw SAGE sequence data showing cloning and potential sequencing artifacts not excluded by SAGE software. (A) Clone with a fragment of <it>E. coli </it>genomic DNA. Italics denote <it>E. coli </it>sequence (AE000307). (B) Clone with an unidentifiable insert, which lacks a normal SAGE concatemer. Sequences like (A-B) represent quasi-ditags that should have been removed. pZErO-1 sequence is underlined; anchoring enzyme recognition sites (AERS<sub>(CATG)</sub>) are shown in bold, and potential SAGE tags are shown by dotted underlines.</p>
            </text>
            <graphic file="1471-2105-5-152-2"/>
         </fig>
         <p>To predict the potential frequency of randomly occurring quasi-ditags, we employed a stochastic model and tested it against both computer-generated random sequences and true genomic DNA sequences. Random sequences were generated and analyzed with a Visual Basic program designed to mimic SAGE software analysis of ditags. The simulated sequences varied in length from 600 to 1200 nucleotides, which corresponds to the average sequence lengths generated by automated sequence analyzers. One million random sequence strings were generated for each of L = 600, 700, 800, 900, 1000, 1100, and 1200 nucleotides. Table <tblr tid="T1">1</tblr> shows the expected frequencies of quasi-ditags according to the model (equation (5)) and the observed frequencies from the computer simulations. The line plots of the expected (model) and observed (computer simulation) quasi-ditag frequencies are almost identical (Figures <figr fid="F3">3</figr> and <figr fid="F4">4</figr>). Fragmented <it>Saccharomyces cerevisiae </it>genomic DNA <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>, which lacks SAGE ditag concatemers, was also employed for <it>in vivo / in silico </it>model validation, and a number of quasi-ditags were detected in these fragments (Figures <figr fid="F4">4</figr> and <figr fid="F5">5</figr>). Quasi-ditags were somewhat less frequent in the <it>Saccharomyces cerevisiae </it>genomic fragments than in the computer-generated sequences, potentially due to the presence of nucleotide repeats and unequal frequencies of individual nucleotides in the yeast genomic sequences. These data nevertheless support our hypothesis that quasi-ditags can be generated randomly from potential sequencing errors or from genomic contaminants.
This analysis furthermore underscores the limited extent that quasi-ditags occur: the distribution of expected number of quasi-ditags per clone is clearly bimodal, with peaks at 1 and 2 ditags (<it>Q<sub>1 </sub></it>and <it>Q<sub>2</sub></it>, respectively). At the same time, the frequency of occurrence for three quasi-ditags (<it>Q<sub>3</sub></it>) is extremely low (0.01% for L = 600 to 0.02% for L = 1200), such that the value of P<sup>3 </sup><sub>(20&#8211;24) </sub>effectively converges to zero for the majority of the SAGE catalogs (i.e. &lt;3000&#8211;5000 clones) (Figure <figr fid="F4">4</figr>). Accordingly, the clones that include ditag concatemers of higher length should lack quasi-ditags.</p>
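         <p>A simulation of this kind can be sketched in a few lines. The version below is written in Python rather than Visual Basic and is not the program used in this study; the function names, the random seed, and the convention that a quasi-ditag is a pair of consecutive CATG sites separated by 20&#8211;24 bases are assumptions made for illustration. Because of these assumptions, its output should not be expected to reproduce the tabulated values exactly.</p>
         <p><![CDATA[
```python
import random

AERS = "CATG"  # NlaIII recognition sequence


def count_quasi_ditags(sequence, aers=AERS, min_len=20, max_len=24):
    """Count consecutive AERS pairs separated by min_len..max_len bases."""
    count = 0
    previous_end = None  # index just past the previous recognition site
    start = sequence.find(aers)
    while start != -1:
        if previous_end is not None and min_len <= start - previous_end <= max_len:
            count += 1
        previous_end = start + len(aers)
        start = sequence.find(aers, previous_end)
    return count


def simulate(length, n_sequences, seed=0):
    """Return {number of quasi-ditags: fraction of random sequences}.

    Assumption: all four nucleotides are equally likely at every position.
    """
    rng = random.Random(seed)
    tally = {}
    for _ in range(n_sequences):
        sequence = "".join(rng.choices("ACGT", k=length))
        q = count_quasi_ditags(sequence)
        tally[q] = tally.get(q, 0) + 1
    return {q: n / n_sequences for q, n in sorted(tally.items())}


if __name__ == "__main__":
    # A small run for illustration; the study used 1,000,000 sequences per L.
    for q, fraction in simulate(length=600, n_sequences=2000).items():
        print(q, fraction)
```
]]></p>
         <p>Summing the fractions for one or more quasi-ditags gives a quantity comparable in kind to the frequencies reported in Table <tblr tid="T1">1</tblr>.</p>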
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>Probability of finding one or more "quasi-ditags" in a nucleotide sequence of a given length (P<sub>(20&#8211;24)</sub>)</p>
            </caption>
            <tblbdy cols="5">
               <r>
                  <c ca="center">
                     <p>Sequence Length (L)</p>
                  </c>
                  <c cspan="3" ca="center">
                     <p>Frequency of &#8805; 1 quasi-ditags in sequence</p>
                  </c>
                  <c ca="center">
                     <p>
                        <it>S. cerevisiae </it>chromosome
                        <sup>3</sup>
                     </p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c cspan="3">
                     <hr/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>Mathematical model</p>
                  </c>
                  <c ca="center">
                     <p>Computer simulation <sup>1</sup></p>
                  </c>
                  <c ca="center">
                     <p>In vivo <sup>2 </sup>simulation (<it>S. cerevisiae</it>)</p>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c cspan="5">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="right">
                     <p>600 bp</p>
                  </c>
                  <c ca="center">
                     <p>0.039392</p>
                  </c>
                  <c ca="center">
                     <p>0.039858</p>
                  </c>
                  <c ca="center">
                     <p>0.016666</p>
                  </c>
                  <c ca="left">
                     <p>IV [NC_001136]</p>
                  </c>
               </r>
               <r>
                  <c ca="right">
                     <p>700 bp</p>
                  </c>
                  <c ca="center">
                     <p>0.046231</p>
                  </c>
                  <c ca="center">
                     <p>0.046655</p>
                  </c>
                  <c ca="center">
                     <p>0.026666</p>
                  </c>
                  <c ca="left">
                     <p>X [NC_001142]</p>
                  </c>
               </r>
               <r>
                  <c ca="right">
                     <p>800 bp</p>
                  </c>
                  <c ca="center">
                     <p>0.053070</p>
                  </c>
                  <c ca="center">
                     <p>0.053383</p>
                  </c>
                  <c ca="center">
                     <p>0.036666</p>
                  </c>
                  <c ca="left">
                     <p>XIV [NC_001146]</p>
                  </c>
               </r>
               <r>
                  <c ca="right">
                     <p>900 bp</p>
                  </c>
                  <c ca="center">
                     <p>0.059909</p>
                  </c>
                  <c ca="center">
                     <p>0.060051</p>
                  </c>
                  <c ca="center">
                     <p>0.040000</p>
                  </c>
                  <c ca="left">
                     <p>VIII [NC_001140]</p>
                  </c>
               </r>
               <r>
                  <c ca="right">
                     <p>1,000 bp</p>
                  </c>
                  <c ca="center">
                     <p>0.066748</p>
                  </c>
                  <c ca="center">
                     <p>0.066743</p>
                  </c>
                  <c ca="center">
                     <p>0.046666</p>
                  </c>
                  <c ca="left">
                     <p>V [NC_001137]</p>
                  </c>
               </r>
               <r>
                  <c ca="right">
                     <p>1,100 bp</p>
                  </c>
                  <c ca="center">
                     <p>0.073587</p>
                  </c>
                  <c ca="center">
                     <p>0.073225</p>
                  </c>
                  <c ca="center">
                     <p>0.066666</p>
                  </c>
                  <c ca="left">
                     <p>IX [NC_001141]</p>
                  </c>
               </r>
               <r>
                  <c ca="right">
                     <p>1,200 bp</p>
                  </c>
                  <c ca="center">
                     <p>0.080426</p>
                  </c>
                  <c ca="center">
                     <p>0.079793</p>
                  </c>
                  <c ca="center">
                     <p>0.070000</p>
                  </c>
                  <c ca="left">
                     <p>XI [NC_001143]</p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p><sup>1 </sup>For the computer simulation, 1,000,000 files, each consisting of a random combination of A, C, G and T nucleotides of the selected length that imitates a sequence read, were searched for SAGE "quasi-ditags".</p>
               <p><sup>2 </sup>For <it>in vivo / in silico </it>simulations, 300 sequences were created for each L value by fragmentation of randomly selected chromosomes of <it>Saccharomyces cerevisiae</it>. Larger samplings (900&#8211;1,400 sequences) were created and tested for selected sequence lengths and did not change the results significantly.</p>
               <p><sup>3 </sup>GenBank database accession numbers are given in brackets.</p>
            </tblfn>
         </tbl>
         <fig id="F3">
            <title>
               <p>Figure 3</p>
            </title>
            <caption>
               <p>Probability (p(k)) of finding k AERS<sub>(CATG) </sub>in a random sequence for L = 600 and L = 1200 bp</p>
            </caption>
            <text>
               <p>Probability (p(k)) of finding k AERS<sub>(CATG) </sub>in a random sequence for L = 600 and L = 1200 bp. Dotted lines represent p(k) mean values. L, sequence length; Model, mathematical modeling; CompSim, computer simulation (1,000,000 simulations).</p>
            </text>
            <graphic file="1471-2105-5-152-3"/>
         </fig>
         <fig id="F4">
            <title>
               <p>Figure 4</p>
            </title>
            <caption>
               <p>Probability of finding various numbers of quasi-ditags (<it>Q<sub>N</sub></it>) in the same nucleotide sequence of a given length (L = 1200 bp)</p>
            </caption>
            <text>
               <p>Probability of finding various numbers of quasi-ditags (<it>Q</it><sub><it>N</it></sub>) in the same nucleotide sequence of a given length (L = 1200 bp). L, sequence length; Model, mathematical modeling; CompSim, computer simulation (1,000,000 simulations); In vivo Sim, fragments of <it>S. cerevisiae </it>chromosome (300 simulations).</p>
            </text>
            <graphic file="1471-2105-5-152-4"/>
         </fig>
         <fig id="F5">
            <title>
               <p>Figure 5</p>
            </title>
            <caption>
               <p>Probability of finding one or more quasi-ditags in a nucleotide sequence of a given length (L = 600 to 1200 bp)</p>
            </caption>
            <text>
               <p>Probability of finding one or more quasi-ditags in a nucleotide sequence of a given length (L = 600 to 1200 bp). Model, mathematical modeling; CompSim, computer simulation (1,000,000 simulations for each L value); In vivo Sim, fragments of <it>S. cerevisiae </it>chromosomes (300 simulations for each L value). Dotted line represents the trendline for In vivo Sim.</p>
            </text>
            <graphic file="1471-2105-5-152-5"/>
         </fig>
         <p>Clones containing only one or two ditags/quasi-ditags, however, could be excluded from SAGE analyses without adversely affecting the data set (Figure <figr fid="F6">6</figr>). As an example, we excluded sequences from clones that produced 1&#8211;2 total ditags from the AMH-II and R1 ES cell libraries. This reduced the total number of tags by 1.06% for ES R1 and 1.94% for AMH-II, but it effectively removed all contaminating bacterial sequences and improved the data reliability. However, the total AMH-I library (2,365 clones, ~78% cloning efficiency; <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>) had a larger proportion of ditags excluded as being too long (>24 bp), as indicated by a lower tag-per-clone ratio (average insert size of 12.2 tags/clone vs. 22.6 in the AMH-II library) despite the same average sequence length, suggesting a higher proportion of quasi-ditags. Analysis of the AMH-I SAGE library revealed 353 and 52 clones that contained just 1 or 2 ditags, respectively. Exclusion of these sequences decreased the total number of tags by 5.21% (calculated after duplicate dimer exclusion), and proved critical to our subsequent quantitative SAGE comparisons <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. Failure to remove these quasi-ditag sequences decreased the quantitative reproducibility (R values) between the AMH-I and AMH-II SAGE libraries, showing that quasi-ditags can adversely affect the reliability of SAGE libraries.</p>
         <fig id="F6">
            <title>
               <p>Figure 6</p>
            </title>
            <caption>
               <p>Frequency distribution of the number of ditags in SAGE output</p>
            </caption>
            <text>
               <p>Frequency distribution of the number of ditags in SAGE output. The probability of finding various numbers of ditags in the clone sequence is plotted as a function of the number of total ditags per clone. Model, mathematical modeling; CompSim, computer simulation (1,000,000 simulations); In vivo Sim, fragments of <it>S. cerevisiae </it>chromosome (300 simulations); AMH-I, -II, ES R1, actual SAGE data (sequences from 3 randomly chosen 96-well plates). Sequence length (L) = 800 bp for Model, CompSim and In vivo Sim; average sequence length &#8776; 800 bp for all three SAGE libraries.</p>
            </text>
            <graphic file="1471-2105-5-152-6"/>
         </fig>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>SAGE is an important tool of modern molecular biology widely used in a number of applications. We hypothesized that actual SAGE catalogs could be contaminated by false ditags ("quasi-ditags") of various origins. Although SAGE software packages are designed to ignore sequences that lack 20&#8211;24 bp sequences flanked by two anchoring enzyme recognition sites, they do not exclude quasi-ditags originating from genomic contaminants or unknown sequences that may arise as cloning or sequencing artifacts (Figure <figr fid="F2">2</figr>). Negative controls (self-ligated vector) do not produce any colonies after Zeocin selection and cannot account for the appearance of background clones and quasi-ditags in Zeocin-resistant bacteria. Since some quasi-ditags, however, originate directly from <it>E. coli</it>, we suggest that one probable source of these contaminating tags is recombination events that occur in <it>E. coli</it>. Indeed, such a mechanism has already been documented <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> and has led to the development of Stbl2 bacteria that are mcrA<sup>-</sup>/mcrBC<sup>-</sup>hsdRMS<sup>-</sup>mrr<sup>-</sup>. Since pZErO-1 was not transformed into recombination-deficient bacteria (DH10B), large-scale amplification of this plasmid within bacteria would be expected to lead to some random recombination and to the generation of quasi-ditags (e.g. Figure <figr fid="F2">2A</figr>).</p>
         <p>Some of the ditags derived from the clones that had produced the fewest ditags (1&#8211;2 per clone) do not match genomic sequences and thus might originate from sequencing errors. We therefore propose a model that provides a mathematical basis for this hypothesis. The mathematical model presented in the manuscript is an attempt to predict the frequency distribution of quasi-ditags in random sequences. The phenomenon itself is rather complex, and no simple model would capture it in full. We believe, however, that we have selected a reasonable level of model complexity that captures the major pattern of the frequency distribution.</p>
         <p>Using the computer simulation, we show that random combinations of nucleotides can indeed be recognized by SAGE software as valid SAGE ditags. We also demonstrate that quasi-ditags may constitute a non-negligible proportion of SAGE catalogs. Our model, which simulates the frequency of quasi-ditags in DNA (equations (1&#8211;6)), suggests that single or double ditags may represent quasi-ditags; however, the results of the <it>in silico </it>experiments show that the probability of finding more than two quasi-ditags in the same sequence converges effectively to zero (Table <tblr tid="T1">1</tblr> and Figure <figr fid="F4">4</figr>). Based on these findings, we suggest that additional steps be performed with SAGE libraries. We recommend removing clones with sequences containing &#8804; 2 ditags at a pre-processing step ("clean-up"). The removal of clones containing 1 or 2 ditags can effectively remove bacterial genomic sequences and potential sequencing artifacts from SAGE libraries. The overall number of SAGE tags excluded by this additional step (authentic and quasi-ditags) is usually low, and generally does not exceed 1.0&#8211;1.8% of the total number of sequenced SAGE tags <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr></abbrgrp>; however, the frequency of potential quasi-ditags could be high (>5%) in some SAGE libraries. In the AMH-I library, for example, the fraction of clones lacking appropriate ditag concatemers was >20%. In these instances, quasi-ditags contribute significantly to the final SAGE tag count, and should be removed.</p>
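         <p>The recommended clean-up step amounts to a simple filter on the number of ditags per clone. The following Python sketch illustrates it under stated assumptions: the function names and the data layout (a dictionary mapping each clone identifier to its list of ditags) are hypothetical and are not part of any existing SAGE software package.</p>
         <p><![CDATA[
```python
def clean_up(ditags_by_clone, min_ditags=3):
    """Split clones into those kept and those removed by the clean-up step.

    ditags_by_clone: dict mapping a clone identifier to its list of ditags
    (a hypothetical layout chosen for this example).
    Clones with fewer than min_ditags ditags (i.e. 2 or fewer by default)
    are removed.
    """
    kept, removed = {}, {}
    for clone_id, ditags in ditags_by_clone.items():
        target = kept if len(ditags) >= min_ditags else removed
        target[clone_id] = ditags
    return kept, removed


def fraction_of_ditags_removed(kept, removed):
    """Fraction of all ditags that belonged to the removed clones."""
    n_kept = sum(len(d) for d in kept.values())
    n_removed = sum(len(d) for d in removed.values())
    total = n_kept + n_removed
    return n_removed / total if total else 0.0
```
]]></p>
         <p>Reporting the value returned by the second function alongside the cloning efficiency would let readers judge how much of a catalog was affected by the clean-up.</p>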
         <p>The chart in Figure <figr fid="F6">6</figr> plots values for ditag distribution from both the model-based simulations (L = 800 bp) and actual clones from the SAGE libraries that had sequences of the same mean length (L &#8776; 800 bp). The expected maximum frequency of 1&#8211;2 quasi-ditags in the plotted model data approximated the observed frequency of clones with 1&#8211;2 total ditags detected in the pool of the actual SAGE clones. In contrast, the frequency of occurrence of three or more quasi-ditags predicted by the model is extremely low, demonstrating a divergence between the distributions of expected quasi-ditags and valid SAGE ditags at higher numbers of ditags per clone. Note that, owing to the gel-purification of concatemers, the majority of clones in the representative samples belong to the clusters of higher ditag numbers (AMH-II and ES R1 libraries, 13&#8211;26 total ditags; AMH-I library, 4&#8211;11 total ditags).</p>
         <p>Comparing the observed frequencies of actual SAGE clones that produce 1&#8211;2 total ditags with the expected quasi-ditag frequencies for sequences of a given length might be indicative of the possible contribution of cloning and sequencing artifact-derived quasi-ditags (Figure <figr fid="F6">6</figr>). The possible contribution of quasi-ditags to the final tag yield in SAGE libraries cannot be accurately predicted in advance, but a failure to report the cloning efficiency and the number of clones with 1 or 2 ditags precludes an evaluation of potential false tags present in SAGE catalogs. Current SAGE protocols do not ensure 100% accurate size fractionation of concatemers: some of the smallest concatemers could therefore be cloned and sequenced. We recognize that some authentic tags (representing valid, but extremely short inserts that were not removed during gel-purification of concatemers) will be excluded by removing all clones containing only 1 or 2 ditags. Nevertheless, we suggest that any potential loss of authentic ditags in the clean-up procedure is negligible compared to the advantage of having more reliable SAGE results.</p>
         <p>SAGE protocols are technologically extremely complex, and every possible means should be employed to ensure the qualitative and quantitative accuracy of catalogs at both the experimental and analytical steps. Evaluation of the cloning efficiency and precision (e.g. with RAST-PCR <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>) and of sequencing accuracy is therefore essential at the stage preceding large-scale sequencing of the clones. Nonetheless, introduction of a simple pre-processing step eliminating false ditags would further improve the accuracy of the method, resulting in its wider application.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusions</p>
         </st>
         <p>We have hypothesized that actual SAGE catalogs could be contaminated by false ditags (termed "quasi-ditags") of various origins and employed a mathematical model to predict the frequency of quasi-ditags in random nucleotide sequences. Cloning and sequencing artifacts contaminating SAGE libraries could be eliminated using a simple pre-screening procedure to increase the reliability of the data.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>SAGE</p>
            </st>
            <p>Serial analysis of gene expression (SAGE) was performed according to the original protocol <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> with minor modifications <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B12">12</abbr></abbrgrp>. Human (PC3) and mouse (P19, R1, D3, EG-1, MEF) cells and tissues (adult and old heart) have been employed for construction of SAGE libraries and sequence analysis to illustrate the "clean-up" process. SAGE tags were generated with <it>NlaIII </it>and <it>BsmFI </it>restriction enzymes (New England Biolabs, Beverly, MA, USA). Sequencing was performed by Perkin-Elmer Applied Biosystems / Celera Genomics (Foster City, CA, USA) and Agencourt Bioscience Corporation (Beverly, MA, USA).</p>
         </sec>
         <sec>
            <st>
               <p>Stochastic model</p>
            </st>
            <p>Anchoring enzyme recognition sites (AERS) are 4 bp long. Assuming for simplicity that all 4 nucleotide bases (A, T, C, and G) have equal frequencies, the probability that a random combination of 4 nucleotides matches the AERS is 4<sup>-4 </sup>= 1/256. In a sequence of length <it>L</it>, the expected number of AERS (e.g. CATG for the <it>NlaIII </it>anchoring enzyme) is approximately <it>L</it>/256. Thus, the probability of finding <it>k </it>CATG sites in a random sequence of length <it>L </it>is determined by the Poisson distribution:</p>
            <p>
               <graphic file="1471-2105-5-152-i1.gif"/>
            </p>
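            <p>As a minimal sketch of equation (1), assuming only what is stated above (a Poisson distribution with mean L/256), the probabilities p(k) could be tabulated as follows. The function names are illustrative and are not part of the authors' Visual Basic software.</p>

```python
import math

AERS_LENGTH = 4  # an AERS such as CATG is 4 bp long


def expected_aers(seq_len):
    """Expected number of AERS in a random sequence: L / 4**4 = L / 256."""
    return seq_len / 4 ** AERS_LENGTH


def poisson_pk(k, seq_len):
    """Poisson probability of exactly k AERS in a sequence of length seq_len."""
    lam = expected_aers(seq_len)
    return math.exp(-lam) * lam ** k / math.factorial(k)


if __name__ == "__main__":
    # Tabulate p(k) for a hypothetical sequence length of 1024 bp.
    for k in range(6):
        print(k, round(poisson_pk(k, 1024), 4))
```

            <p>Summing p(k) over all k gives 1, which is a quick check that the mean has been set correctly.</p>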
            <p>If two CATG sequences (AERS<sub>(CATG)</sub>) are located within the sequence of length <it>L</it>, then the probability that they are separated by a 20&#8211;24 bp distance (<it>P</it><sub>(20&#8211;24)</sub>) is approximately:</p>
            <p>
               <graphic file="1471-2105-5-152-i2.gif"/>
            </p>
            <p>where 10 is the number of possible relative positions of two AERS<sub>(CATG) </sub>that yield a quasi-ditag and 24 is the mean distance from the center of one SAGE tag to the end of the sequence that does not leave enough space for another tag to form a quasi-ditag.</p>
            <p>If more than two AERS are present in the sequence, then there is a chance that an additional AERS will appear within the quasi-ditag formed by the first two AERS. The probability that no additional AERS appears within the quasi-ditag is approximately:</p>
            <p>
               <graphic file="1471-2105-5-152-i3.gif"/>
            </p>
            <p>where 30 is the average length of a nucleotide string outside of the ditag.</p>
            <p>If the total number of AERS<sub>(CATG) </sub>equals <it>k</it>, then the number of possible AERS pairs is:</p>
            <p>
               <graphic file="1471-2105-5-152-i4.gif"/>
            </p>
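            <p>For illustration, and assuming the equation above is the usual count of unordered pairs among k sites, k(k - 1)/2, the count can be cross-checked by explicit enumeration. The helper name is hypothetical.</p>

```python
import math
from itertools import combinations


def aers_pair_count(k):
    """Number of unordered pairs among k AERS: k * (k - 1) / 2."""
    return math.comb(k, 2)


# Cross-check against explicit enumeration for a hypothetical k = 5.
sites = range(5)
assert aers_pair_count(5) == len(list(combinations(sites, 2)))
```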
            <p>Taken together, the probability of at least one quasi-ditag in a sequence that has exactly <it>k </it>AERS<sub>(CATG) </sub>is:</p>
            <p>
               <graphic file="1471-2105-5-152-i5.gif"/>
            </p>
            <p>Then the probability (<it>Q</it><sub>1</sub>) of finding at least one quasi-ditag in a sequence of given length <it>L </it>is:</p>
            <p>
               <graphic file="1471-2105-5-152-i6.gif"/>
            </p>
            <p>where p(k) is given by equation (1).</p>
            <p>There is also a probability that more than one quasi-ditag exists within the sequence. In some cases the same AERS<sub>(CATG) </sub>could serve as a portion of two neighboring quasi-ditags (...CATG-(N)<sub>20&#8211;24</sub>-CATG-(N)<sub>20&#8211;24</sub>-CATG...). In other cases, two or more quasi-ditags can be located independently in the sequence. If a sequence with <it>k </it>tags already has one quasi-ditag bounded by two tags, then the other (<it>k</it>-2) tags may form additional quasi-ditags. The probability that additional quasi-ditags exist, given that one quasi-ditag is already present, is approximately q(k-2). Then the total probability that any random sequence has at least two quasi-ditags is:</p>
            <p>
               <graphic file="1471-2105-5-152-i7.gif"/>
            </p>
            <p>In the same way,</p>
            <p>
               <graphic file="1471-2105-5-152-i8.gif"/>
            </p>
            <p>and so on.</p>
            <p>The probability that a random sequence has exactly <it>n </it>quasi-ditags is:</p>
            <p><it>R</it><sub><it>n </it></sub>= <it>Q</it><sub><it>n </it></sub>- <it>Q</it><sub><it>n </it>+ 1</sub> &#160;&#160;&#160; (9).</p>
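            <p>Equation (9) is a simple differencing of the "at least n" probabilities. A minimal sketch follows; the input values are hypothetical placeholders rather than results from this study, and treating the probability beyond the last supplied value as zero is an assumption of the sketch.</p>

```python
def exactly_n(q_at_least):
    """Convert "at least n" probabilities into "exactly n" probabilities.

    q_at_least[i] is the probability of at least (i + 1) quasi-ditags.
    The probability of at least 0 is 1 by definition; the probability beyond
    the end of the list is treated as 0 (an assumption of this sketch).
    Returns a list whose item n is the probability of exactly n quasi-ditags.
    """
    q = [1.0] + list(q_at_least) + [0.0]
    return [q[n] - q[n + 1] for n in range(len(q) - 1)]


if __name__ == "__main__":
    # Hypothetical placeholder values, not taken from the paper.
    print(exactly_n([0.5, 0.2, 0.05]))
```

            <p>Because the differences telescope, the returned probabilities sum to 1 under the stated truncation.</p>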
         </sec>
         <sec>
            <st>
               <p>Software and analysis</p>
            </st>
            <p>A random nucleotide generator (for L = 600&#8211;1200) and an analysis program that mimics the "SAGE300" or "SAGE2000" software algorithms were written in Visual Basic and are available upon request. Genomic DNA sequences of <it>Saccharomyces cerevisiae</it>, which lack SAGE ditag concatemers, were also employed for <it>in vivo / in silico </it>model validation. Randomly selected <it>S. cerevisiae </it>chromosomes were downloaded from GenBank, fragmented to create a minimum of 300 sequences (L = 600&#8211;1200) and searched for quasi-ditags using "SAGE2000" software (available at the SAGE website <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>). The frequency distribution of the number of ditags was analyzed in raw sequences from 3 randomly chosen 96-well plates from the AMH-I, AMH-II and ES R1 SAGE libraries (285 sequences for each library) using the same software.</p>
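            <p>The simulation described above can be approximated by the following Python sketch. It is not the authors' Visual Basic program and does not reproduce the "SAGE2000" algorithm; it assumes that a quasi-ditag is any pair of adjacent CATG sites separated by 20 to 24 bases, following the CATG-(N)20-24-CATG pattern given in the stochastic model. All function names, the seed and the default arguments are illustrative.</p>

```python
import random

AERS = "CATG"
MIN_GAP, MAX_GAP = 20, 24  # bases between two AERS: CATG-(N)20-24-CATG


def random_sequence(length, rng):
    """Random nucleotide sequence with equal base frequencies."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def aers_positions(seq):
    """Start positions of every CATG occurrence in seq."""
    positions = []
    start = seq.find(AERS)
    while start != -1:
        positions.append(start)
        start = seq.find(AERS, start + 1)
    return positions


def count_quasi_ditags(seq):
    """Count adjacent CATG pairs separated by 20-24 bases.

    Only neighbouring sites are paired, so no additional AERS can lie inside
    a counted quasi-ditag. One site may bound two neighbouring quasi-ditags.
    """
    pos = aers_positions(seq)
    count = 0
    for left, right in zip(pos, pos[1:]):
        gap = right - (left + len(AERS))
        if gap in range(MIN_GAP, MAX_GAP + 1):
            count += 1
    return count


def simulate(n_sequences=300, min_len=600, max_len=1200, seed=0):
    """Histogram {quasi-ditags per sequence: number of sequences}."""
    rng = random.Random(seed)
    histogram = {}
    for _ in range(n_sequences):
        seq = random_sequence(rng.randint(min_len, max_len), rng)
        n = count_quasi_ditags(seq)
        histogram[n] = histogram.get(n, 0) + 1
    return histogram


if __name__ == "__main__":
    print(sorted(simulate().items()))
```

            <p>Dividing each histogram count by the number of simulated sequences gives empirical frequencies that can be compared with the model's probabilities of exactly n quasi-ditags.</p>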
         </sec>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>SVA developed the hypothesis and the overall plan, and performed SAGE, computer simulations, and analysis of yeast genome fragments. AAS developed and implemented the mathematical model predicting the appearance of "quasi-ditags" in random sequences of given length. Both authors contributed to the writing and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We would like to thank Dr. Paul Pullen (NIA/NIH, USA) for writing the code for the computer simulation software and Dr. Kenneth Boheler (NIA/NIH, USA) for valuable help in preparing this manuscript.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Serial Analysis of Gene Expression</p>
            </title>
            <aug>
               <au>
                  <snm>Velculescu</snm>
                  <fnm>VE</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Vogelstein</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Kinzler</snm>
                  <fnm>KW</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1995</pubdate>
            <volume>270</volume>
            <fpage>484</fpage>
            <lpage>487</lpage>
            <xrefbib>
               <pubid idtype="pmpid">7570003</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Analysis of human transcriptomes</p>
            </title>
            <aug>
               <au>
                  <snm>Velculescu</snm>
                  <fnm>VE</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>SL</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Lash</snm>
                  <fnm>AE</fnm>
               </au>
               <au>
                  <snm>Yu</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Rago</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Lal</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Beaudry</snm>
                  <fnm>GA</fnm>
               </au>
               <au>
                  <snm>Ciriello</snm>
                  <fnm>KM</fnm>
               </au>
               <au>
                  <snm>Cook</snm>
                  <fnm>BP</fnm>
               </au>
               <au>
                  <snm>Dufault</snm>
                  <fnm>MR</fnm>
               </au>
               <au>
                  <snm>Ferguson</snm>
                  <fnm>AT</fnm>
               </au>
               <au>
                  <snm>Gao</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>He</snm>
                  <fnm>TC</fnm>
               </au>
               <au>
                  <snm>Hermeking</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Hiraldo</snm>
                  <fnm>SK</fnm>
               </au>
               <au>
                  <snm>Hwang</snm>
                  <fnm>PM</fnm>
               </au>
               <au>
                  <snm>Lopez</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Luderer</snm>
                  <fnm>HF</fnm>
               </au>
               <au>
                  <snm>Mathews</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Petroziello</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Polyak</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Zawel</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Zhou</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Haluska</snm>
                  <fnm>FG</fnm>
               </au>
               <au>
                  <snm>Jen</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Sukumar</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Landes</snm>
                  <fnm>GM</fnm>
               </au>
               <au>
                  <snm>Riggins</snm>
                  <fnm>GJ</fnm>
               </au>
               <au>
                  <snm>Vogelstein</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Kinzler</snm>
                  <fnm>KW</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>1999</pubdate>
            <volume>23</volume>
            <fpage>387</fpage>
            <lpage>388</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/70487</pubid>
                  <pubid idtype="pmpid" link="fulltext">10581018</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>The new role of SAGE in gene discovery</p>
            </title>
            <aug>
               <au>
                  <snm>Boheler</snm>
                  <fnm>KR</fnm>
               </au>
               <au>
                  <snm>Stern</snm>
                  <fnm>MD</fnm>
               </au>
            </aug>
            <source>Trends Biotechnol</source>
            <pubdate>2003</pubdate>
            <volume>21</volume>
            <fpage>55</fpage>
            <lpage>57</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0167-7799(02)00031-8</pubid>
                  <pubid idtype="pmpid" link="fulltext">12573851</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Blue-white selection step enhances the yield of SAGE concatemers</p>
            </title>
            <aug>
               <au>
                  <snm>Angelastro</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Ryu</snm>
                  <fnm>EJ</fnm>
               </au>
               <au>
                  <snm>Torocsik</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Fiske</snm>
                  <fnm>BK</fnm>
               </au>
               <au>
                  <snm>Greene</snm>
                  <fnm>LA</fnm>
               </au>
            </aug>
            <source>Biotechniques</source>
            <pubdate>2002</pubdate>
            <volume>32</volume>
            <fpage>484</fpage>
            <lpage>486</lpage>
            <xrefbib>
               <pubid idtype="pmpid">11911650</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>SAGE identification of gene transcripts with abundances unique to pluripotent mouse R1 embryonic stem cells</p>
            </title>
            <aug>
               <au>
                  <snm>Anisimov</snm>
                  <fnm>SV</fnm>
               </au>
               <au>
                  <snm>Tarasov</snm>
                  <fnm>KV</fnm>
               </au>
               <au>
                  <snm>Tweedie</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Stern</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Wobus</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Boheler</snm>
                  <fnm>KR</fnm>
               </au>
            </aug>
            <source>Genomics</source>
            <pubdate>2002</pubdate>
            <volume>79</volume>
            <fpage>169</fpage>
            <lpage>176</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/geno.2002.6687</pubid>
                  <pubid idtype="pmpid" link="fulltext">11829487</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>A quantitative and validated SAGE transcriptome reference for adult mouse heart</p>
            </title>
            <aug>
               <au>
                  <snm>Anisimov</snm>
                  <fnm>SV</fnm>
               </au>
               <au>
                  <snm>Tarasov</snm>
                  <fnm>KV</fnm>
               </au>
               <au>
                  <snm>Stern</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Lakatta</snm>
                  <fnm>EG</fnm>
               </au>
               <au>
                  <snm>Boheler</snm>
                  <fnm>KR</fnm>
               </au>
            </aug>
            <source>Genomics</source>
            <pubdate>2002</pubdate>
            <volume>80</volume>
            <fpage>213</fpage>
            <lpage>222</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/geno.2002.6821</pubid>
                  <pubid idtype="pmpid" link="fulltext">12160735</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Complete genomes in WWW Entrez: data representation and analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Tatusova</snm>
                  <fnm>TA</fnm>
               </au>
               <au>
                  <snm>Karsch-Mizrachi</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Ostell</snm>
                  <fnm>JA</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>1999</pubdate>
            <volume>15</volume>
            <fpage>536</fpage>
            <lpage>543</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/15.7.536</pubid>
                  <pubid idtype="pmpid" link="fulltext">10487861</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Expression and functional characterization of the cardiac muscle ryanodine receptor Ca(2+) release channel in Chinese hamster ovary cells</p>
            </title>
            <aug>
               <au>
                  <snm>Bhat</snm>
                  <fnm>MB</fnm>
               </au>
               <au>
                  <snm>Hayek</snm>
                  <fnm>SM</fnm>
               </au>
               <au>
                  <snm>Zhao</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Zang</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Takeshima</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Wier</snm>
                  <fnm>WG</fnm>
               </au>
               <au>
                  <snm>Ma</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Biophys J</source>
            <pubdate>1999</pubdate>
            <volume>77</volume>
            <fpage>808</fpage>
            <lpage>816</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10423427</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>SAGE Identification of Differentiation Responsive Genes in P19 Embryonic Cells Induced to Form Cardiomyocytes in vitro</p>
            </title>
            <aug>
               <au>
                  <snm>Anisimov</snm>
                  <fnm>SV</fnm>
               </au>
               <au>
                  <snm>Tarasov</snm>
                  <fnm>KV</fnm>
               </au>
               <au>
                  <snm>Riordon</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Wobus</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Boheler</snm>
                  <fnm>KR</fnm>
               </au>
            </aug>
            <source>Mech Devel</source>
            <pubdate>2002</pubdate>
            <volume>117</volume>
            <fpage>25</fpage>
            <lpage>74</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S0925-4773(02)00177-6</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Identification of targets of JNK2 signaling involved in regulation of human tumor cell growth</p>
            </title>
            <aug>
               <au>
                  <snm>Potapova</snm>
                  <fnm>OU</fnm>
               </au>
               <au>
                  <snm>Anisimov</snm>
                  <fnm>SV</fnm>
               </au>
               <au>
                  <snm>Gorospe</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Dougherty</snm>
                  <fnm>RH</fnm>
               </au>
               <au>
                  <snm>Gaarde</snm>
                  <fnm>WA</fnm>
               </au>
               <au>
                  <snm>Boheler</snm>
                  <fnm>KR</fnm>
               </au>
               <au>
                  <snm>Holbrook</snm>
                  <fnm>NJ</fnm>
               </au>
            </aug>
            <source>Cancer Research</source>
            <pubdate>2002</pubdate>
            <volume>62</volume>
            <fpage>3257</fpage>
            <lpage>3263</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12036942</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Serial analysis of gene expression: rapid RT-PCR analysis of unknown SAGE tags</p>
            </title>
            <aug>
               <au>
                  <snm>van den Berg</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>van der Leij</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Poppema</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1999</pubdate>
            <volume>27</volume>
            <fpage>e17</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">148578</pubid>
                  <pubid idtype="pmpid" link="fulltext">10446260</pubid>
                  <pubid idtype="doi">10.1093/nar/27.17.e17</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Substantially enhanced cloning efficiency of SAGE (Serial Analysis of Gene Expression) by adding a heating step to the original protocol</p>
            </title>
            <aug>
               <au>
                  <snm>Kenzelmann</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Muhlemann</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1999</pubdate>
            <volume>27</volume>
            <fpage>917</fpage>
            <lpage>918</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">148268</pubid>
                  <pubid idtype="pmpid" link="fulltext">9889294</pubid>
                  <pubid idtype="doi">10.1093/nar/27.3.917</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>SAGE</p>
            </title>
            <url>http://www.sagenet.org/protocol/index.htm</url>
         </bibl>
      </refgrp>
   </bm>
</art>
