<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-9-431</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Methodology article</dochead>
      <bibl>
         <title>
            <p>Probabilistic base calling of Solexa sequencing data</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Rougemont</snm>
               <fnm>Jacques</fnm>
               <insr iid="I1"/>
               <insr iid="I3"/>
               <email>jacques.rougemont@epfl.ch</email>
            </au>
            <au id="A2">
               <snm>Amzallag</snm>
               <fnm>Arnaud</fnm>
               <insr iid="I1"/>
               <insr iid="I3"/>
               <email>arnaud.amzallag@epfl.ch</email>
            </au>
            <au id="A3">
               <snm>Iseli</snm>
               <fnm>Christian</fnm>
               <insr iid="I2"/>
               <insr iid="I3"/>
               <email>christian.iseli@licr.org</email>
            </au>
            <au id="A4">
               <snm>Farinelli</snm>
               <fnm>Laurent</fnm>
               <insr iid="I5"/>
               <email>laurent.farinelli@fasteris.com</email>
            </au>
            <au id="A5">
               <snm>Xenarios</snm>
               <fnm>Ioannis</fnm>
               <insr iid="I3"/>
               <insr iid="I4"/>
               <email>ioannis.xenarios@isb-sib.ch</email>
            </au>
            <au id="A6" ca="yes">
               <snm>Naef</snm>
               <fnm>Felix</fnm>
               <insr iid="I1"/>
               <insr iid="I3"/>
               <email>felix.naef@epfl.ch</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>School of Life Sciences, Ecole Polytechnique F&#233;d&#233;rale de Lausanne (EPFL), 1015 Lausanne, Switzerland</p>
            </ins>
            <ins id="I2">
               <p>Ludwig Institute for Cancer Research (LICR), B&#226;timent G&#233;nopode, Universit&#233; de Lausanne, 1015 Lausanne, Switzerland</p>
            </ins>
            <ins id="I3">
               <p>Swiss Institute of Bioinformatics (SIB), B&#226;timent G&#233;nopode, Universit&#233; de Lausanne, 1015 Lausanne, Switzerland</p>
            </ins>
            <ins id="I4">
               <p>Vital-IT, B&#226;timent G&#233;nopode, Universit&#233; de Lausanne, 1015 Lausanne, Switzerland</p>
            </ins>
            <ins id="I5">
               <p>Fasteris SA, P.O. box 28, 1228 Plan-les-Ouates, Switzerland</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2008</pubdate>
         <volume>9</volume>
         <issue>1</issue>
         <fpage>431</fpage>
         <url>http://www.biomedcentral.com/1471-2105/9/431</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">18851737</pubid>
               <pubid idtype="doi">10.1186/1471-2105-9-431</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>04</day>
               <month>6</month>
               <year>2008</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>13</day>
               <month>10</month>
               <year>2008</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>13</day>
               <month>10</month>
               <year>2008</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2008</year>
         <collab>Rougemont et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data pose new challenges; currently a fair proportion of the tags are routinely discarded because they cannot be matched to a reference sequence, which reduces the effective throughput of the technology.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We propose a novel base calling algorithm using model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content to remove uncertain bases towards the ends of the reads.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>We show that the method improves genome coverage and number of usable tags as compared with Solexa's data processing pipeline by an average of 15%. An R package is provided which allows fast and accurate base calling of Solexa's fluorescence intensity files and the production of informative diagnostic plots.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="bmc" subtype="user_supplied_xml" id="endnote"/>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Ultra-high-throughput sequencing is having a growing impact on biological research by providing fast, high-resolution access to genome-scale information. The versatile technique can be used for unbiased genotyping <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>, transcriptome analysis <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>, the study of protein-DNA interactions<abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>, and <it>de-novo </it>sequencing<abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr></abbrgrp>. While the sample processing is relatively streamlined, innovations in data management and information processing are necessary to exploit the full potential of the technology. A standard Solexa/Illumina Genome Analyzer "classic" run produces 700 Gb of image files and 200 Gb of processed data files over 3.5 days, totaling nearly 400,000 image files and 20,000 processed files. The latest GAII upgrade further increases this volume of data, mostly by acquiring larger images (although only 100 tiles) and by adding the ability to perform paired-end sequencing (72 bases per colony). The computing infrastructure required for managing daily sequencing runs is extremely costly to set up and maintain. Developing new algorithms to extract more information from available images and reduce the number of sequencing runs per project will therefore prove extremely valuable. Finally, well-designed quality metrics and diagnostic tools will allow a rapid assessment of the quality of the sequencing runs and inform the applicable data retention policy.</p>
         <p>The Solexa/Illumina Genome Analyzer performs sequencing-by-synthesis of a random array of clonal DNA colonies attached to the surface of a flow cell. There are about 8 million such colonies on each of the 8 lanes of the cell. At each cycle of synthesis all four nucleotides, labelled with four different fluorescent dyes and blocked at the 3'-ends, are introduced in the flow cell. Up to 36 such cycles of synthesis are performed.</p>
         <p>The data acquisition on the Genome Analyzer "classic" proceeds as follows: each lane of the cell is divided into roughly 300 tiles that are individually photographed through four different filters. The image analysis software localizes each colony on each picture and quantifies the corresponding four fluorescence intensities. The output consists of one file per tile with one row per colony made of four coordinates and up to 144 real numbers for 36 intensity quadruples. The base calling starts downstream of this quantification and reconstructs the DNA sequence that likely generated each colony. The Solexa data analysis pipeline outputs two important files for each tile in each lane: a sequence file with the sequence determined from each intensity row and a fast-q file with a quality score for each base called. This fast-q score measures the most likely base intensity relative to the three other intensities on a logarithmic scale from -5 to 40 (it is asymptotically equal to a Phred score<abbrgrp><abbr bid="B11">11</abbr></abbrgrp>). Here we propose an alternative probabilistic base calling method based on the fluorescence intensity quantifications that uses the extended IUPAC alphabet to code ambiguous bases. An information criterion is used to control the length of trustworthy reads. We show that this methodology increases the specific mapping of the tags onto reference genomes by about 15% (typically 10&#8211;25%) on raw sequences and by up to 70% after quality filtering. The method is implemented in a freely distributed software package called Rolexa.</p>
         <p>Similar approaches have recently been published. Closest to ours in their use of Gaussian mixtures is the method introduced by Cokus et al. in their analysis of Arabidopsis methylation patterns<abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. The Alta-Cyclic base caller <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> uses a support vector machine that needs to be trained on a known dataset. Our approach is computationally light and modular in that it offers a set of complementary functionalities that attempt to address the various biases observed in Solexa sequences <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr></abbrgrp> based on simple models of the biochemistry involved.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Statistical properties of the fluorescent emissions</p>
            </st>
            <p>Several sources of noise perturb the acquisition step: the signal-to-noise ratio in the images depends on the position of the colony within the imaging field (boundary effect), colonies can be hard to segment on the pictures, and fluorophore emission spectra partially overlap, so that emissions "leak" into adjacent channels. Moreover, synthesis efficiency is limited and therefore, within each colony, some DNA strands incorporate a non-complementary base or are de-synchronized because they failed to incorporate a nucleotide at a previous step. Both effects lead to the emission of a fluorophore different from that of the majority of the colony. These effects possibly depend on the base composition of the sequence<abbrgrp><abbr bid="B17">17</abbr></abbrgrp> and worsen with each additional chemistry cycle.</p>
            <p>We use the sequencing of the phiX174 genome (see Material and Methods) to analyze the signal in the four color channels as the sequencing progresses. We first observe that the distribution of intensities in the individual channels shows a good separation between background noise and signal, although the shape of the histograms strongly depends on the dye used (Fig. <figr fid="F1">1A</figr> and Additional file <supplr sid="S1">1</supplr>). For example, <it>G </it>has a tighter dynamic range than <it>T </it>and the range generally decreases with the cycle number. The largest range spans 4&#8211;5 logs. As the sequencing progresses, the dynamic range decreases, signal-to-noise ratios worsen and the separation between background noise and signal becomes increasingly blurred (Additional file <supplr sid="S1">1</supplr>). Next, we observe that the <it>A </it>and <it>C </it>channels, as well as the <it>T </it>and <it>G </it>channels, are highly correlated (Fig. <figr fid="F1">1A</figr>).</p>
            <suppl id="S1">
               <title>
                  <p>Additional File 1</p>
               </title>
               <text>
                  <p><b>Signal over noise decays with sequencing cycle number</b>. Histograms of the raw fluorescence intensities are shown for cycles 5, 15, 25, and 35. The separation between signal and noise becomes increasingly blurred, and does so faster in the <it>A </it>and <it>G </it>channels than in the <it>C </it>and <it>T </it>channels. Red lines indicate a fit by a mixture of two Gaussian distributions with blue vertical bars indicating the mean and one standard deviation for the highest component of the mixture.</p>
               </text>
               <file name="1471-2105-9-431-S1.png">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Signal and noise in fluorescence intensities</p>
               </caption>
               <text>
                  <p><b>Signal and noise in fluorescence intensities</b>. Representation of the first cycle of synthesis on five concatenated tiles of the phiX174 sequencing data. <b>A</b>. Projection of the intensity quadruples on the axes corresponding to the <it>A </it>and <it>C </it>channels and the <it>G </it>and <it>T </it>channels at cycles 1 and 15. The ellipses represent the Gaussian mixtures (centers and the line for one standard deviation are shown). <b>B</b>. Same data after de-correlation transformations (see Methods). Coloring reflects the mixture component with largest probability.</p>
               </text>
               <graphic file="1471-2105-9-431-1"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Reducing positional bias, dephasing and cross-talk</p>
            </st>
            <p>As observed above, there are three main sources of systematic bias at the level of intensity data. The first is the cross-talk between color channels: for example the <it>A </it>and <it>C </it>channels are not independent. Thus we transform the raw intensities by a linear mapping to the basis with axes at angles <it>&#981; </it>and <it>&#952; </it>with respect to the original axes (cf. Methods). We optimize the two angles so as to minimize the overall correlation between the transformed coordinates. We repeat this operation at each cycle of sequencing, and likewise for the other pair of channels, <it>G </it>and <it>T </it>(Fig. <figr fid="F1">1B</figr>).</p>
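            <p>The angle optimization can be illustrated with a minimal Python sketch (Python rather than the R of the Rolexa package). Everything in it is an assumption made for illustration: the parametrization of the oblique basis, the use of the absolute Pearson correlation as the objective, the grid search, the function names and the synthetic data. The actual transformation is the one defined in Methods.</p>

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def to_oblique(points, phi, theta):
    """Coordinates of (a, c) points in the basis e1, e2 (assumed parametrization)."""
    e1 = (math.cos(phi), math.sin(phi))      # new first axis, tilted by phi
    e2 = (math.sin(theta), math.cos(theta))  # new second axis, tilted by theta
    det = e1[0] * e2[1] - e2[0] * e1[1]
    return [((a * e2[1] - c * e2[0]) / det, (c * e1[0] - a * e1[1]) / det)
            for a, c in points]

def abs_corr(points):
    """Absolute correlation between the two coordinates of a point cloud."""
    xs, ys = zip(*points)
    return abs(pearson(xs, ys))

def decorrelate(points, step=0.02, n_steps=31):
    """Grid search for the angle pair giving the smallest absolute correlation."""
    grid = [i * step for i in range(n_steps)]
    return min(((phi, theta) for phi in grid for theta in grid),
               key=lambda angles: abs_corr(to_oblique(points, *angles)))

# Synthetic colonies (hypothetical data): each one is bright in one channel
# and leaks into the other; Gaussian noise is added to both channels.
random.seed(0)
points = []
for _ in range(300):
    s = random.uniform(500, 2000)
    if random.random() > 0.5:
        a, c = s * math.cos(0.3), s * math.sin(0.3)
    else:
        a, c = s * math.sin(0.2), s * math.cos(0.2)
    points.append((a + random.gauss(0, 30), c + random.gauss(0, 30)))

phi, theta = decorrelate(points)
raw_r = abs_corr(points)
best_r = abs_corr(to_oblique(points, phi, theta))
print(f"phi={phi:.2f} theta={theta:.2f} |r| raw={raw_r:.3f} corrected={best_r:.3f}")
```

            <p>Because the grid contains the identity transformation (both angles equal to zero), the selected angles can never give a larger absolute correlation than the raw intensities. Whether they coincide with the true leakage angles depends on the objective chosen, which is why this sketch should not be read as the published procedure.</p>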
            <p>The second important bias is the colony dephasing: the amount of fluorescence emitted in a particular channel at cycle <it>n </it>depends on the number of corresponding bases present in the sequence at positions <it>1</it>, ..., <it>n</it>-<it>1 </it>because incorporation failures accumulated from previous cycles will be partly compensated at cycle <it>n</it>, thereby increasing the signal in all channels. This cross-cycle dependence can be modelled by a binomial distribution with parameter <it>q</it>, the probability of not elongating the complementary strand at each cycle of synthesis. We assume that this rate is equal for all nucleotides and all cycles. We determine the value of <it>q </it>by minimizing the average correlation between intensities at cycles <it>n </it>and <it>n+1</it>.</p>
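            <p>One plausible reading of this binomial model is sketched below in Python. We assume, for illustration only, that a strand which failed <it>j </it>times during the previous <it>n</it>-<it>1 </it>cycles emits the base at position <it>n</it>-<it>j </it>at cycle <it>n</it>, so that the observed signal is a binomially weighted mixture of the current and earlier positions. The function names and the value of <it>q </it>are hypothetical; the exact model is the one given in Methods.</p>

```python
import math

def lag_weights(n, q):
    """Weights of positions n, n-1, ..., 1 in the signal observed at cycle n.

    Assumption: the number of failures over the n-1 earlier cycles is
    Binomial(n-1, q), and a strand with j failures lags by j positions.
    """
    m = n - 1
    return [math.comb(m, j) * q ** j * (1 - q) ** (m - j) for j in range(n)]

def mix(signal, q):
    """Forward model: dephased intensities produced by an ideal signal."""
    observed = []
    for n in range(1, len(signal) + 1):
        w = lag_weights(n, q)
        observed.append(sum(w[j] * signal[n - 1 - j] for j in range(n)))
    return observed

def unmix(observed, q):
    """Invert the forward model cycle by cycle (forward substitution)."""
    signal = []
    for n in range(1, len(observed) + 1):
        w = lag_weights(n, q)
        lagged = sum(w[j] * signal[n - 1 - j] for j in range(1, n))
        signal.append((observed[n - 1] - lagged) / w[0])
    return signal

# Ideal signal in one channel: 1 where the sequence carries that base.
signal = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
observed = mix(signal, q=0.02)
recovered = unmix(observed, q=0.02)
print([round(x, 3) for x in observed])
```

            <p>The mixing is lower triangular in the cycle index, which is what makes the cycle-by-cycle inversion possible once a value of <it>q </it>has been chosen.</p>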
            <p>The last major source of systematic variation is due to an optical effect: on each tile, the colonies near the center of the image appear brighter than the ones near the edges (Additional file <supplr sid="S2">2</supplr>). We correct this by fitting a two-dimensional lowess <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> model to the intensities for each tile and subtracting the difference between the fit and the median intensity.</p>
            <suppl id="S2">
               <title>
                  <p>Additional File 2</p>
               </title>
               <text>
                  <p><b>Correction of positional bias</b>. <b>A</b>. Images show local averages of the fluorescence intensities across the area of a tile. The center of the tile is clearly brighter than the edges. <b>B</b>. After correction by lowess fit, the averages are visually more constant across the tile.</p>
               </text>
               <file name="1471-2105-9-431-S2.png">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>The three corrections are applied sequentially (cf. Methods) to the raw intensities before applying the model-based clustering algorithm described next.</p>
         </sec>
         <sec>
            <st>
               <p>Model-based clustering and information-theoretic base calling</p>
            </st>
            <p>We used a model-based clustering algorithm<abbrgrp><abbr bid="B12">12</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr></abbrgrp> to classify the intensity quadruples into four groups. Clearly, four well-delineated clusters corresponding to the four bases emerge (Fig. <figr fid="F1">1A&#8211;B</figr>). Specifically, we model the intensities measured in each channel by a mixture of four 4-dimensional Gaussian random variables and we use the intensity quadruples from all colonies in one or few combined tiles to fit the model parameters. The fitted model provides four probability distributions on the space of intensity quadruples, namely the probability <it>P</it><sub>A</sub>(<it>k</it>) = <it>P</it>(<it>A</it>|<it>I</it><sub>1</sub>(<it>k</it>), ..., <it>I</it><sub>4</sub>(<it>k</it>)) that the <it>k</it><sup>th </sup>base to call is an <it>A </it>knowing the measured intensities in all four channels at cycle <it>k</it>, and similarly for <it>P</it><sub>C</sub>, <it>P</it><sub>G </sub>and <it>P</it><sub>T</sub>. We can measure the level of uncertainty in our base calling by the entropy <inline-formula><m:math name="1471-2105-9-431-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mi>h</m:mi><m:mo stretchy="false">(</m:mo><m:mi>k</m:mi><m:mo stretchy="false">)</m:mo><m:mo>=</m:mo><m:mo>&#8722;</m:mo><m:mstyle displaystyle="true"><m:munder><m:mo>&#8721;</m:mo><m:mrow><m:mi>&#945;</m:mi><m:mo>&#8712;</m:mo><m:mo>{</m:mo><m:mtext>ACGT</m:mtext><m:mo>}</m:mo></m:mrow></m:munder><m:mrow><m:msub><m:mi>P</m:mi><m:mi>&#945;</m:mi></m:msub><m:mo stretchy="false">(</m:mo><m:mi>k</m:mi><m:mo stretchy="false">)</m:mo><m:mi>l</m:mi><m:mi>o</m:mi><m:msub><m:mi>g</m:mi><m:mn>2</m:mn></m:msub><m:msub><m:mi>P</m:mi><m:mi>&#945;</m:mi></m:msub><m:mo stretchy="false">(</m:mo><m:mi>k</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiAaGMaeiikaGIaem4AaSMaeiykaKIaeyypa0JaeyOeI0YaaabuaeaacqWGqbaudaWgaaWcbaGaeqySdegabeaakiabcIcaOiabdUgaRjabcMcaPiabcYgaSjabc+gaVjabcEgaNnaaBaaaleaacqaIYaGmaeqaaOGaemiuaa1aaSbaaSqaaiabeg7aHbqabaGccqGGOaakcqWGRbWAcqGGPaqkaSqaaiabeg7aHjabgIGiolabcUha7jabbgeabjabboeadjabbEeahjabbsfaujabc2ha9bqab0GaeyyeIuoaaaa@5034@</m:annotation></m:semantics></m:math></inline-formula> which measures the uncertainty (in bits) in the determination of the correct <it>k</it><sup>th </sup>base<abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. Knowing <it>h </it>and the four probabilities we then use cutoffs in the probability simplex to decide which IUPAC code to call (Figure <figr fid="F2">2A</figr>, Methods). As the sequencing progresses, we also compute the cumulative entropy of each colony, <inline-formula><m:math name="1471-2105-9-431-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mi>H</m:mi><m:mo stretchy="false">(</m:mo><m:mi>n</m:mi><m:mo stretchy="false">)</m:mo><m:mo>=</m:mo><m:mstyle displaystyle="true"><m:munder><m:mo>&#8721;</m:mo><m:mrow><m:mi>k</m:mi><m:mo>=</m:mo><m:mn>1</m:mn><m:mo>,</m:mo><m:mo>&#8230;</m:mo><m:mo>,</m:mo><m:mi>n</m:mi></m:mrow></m:munder><m:mrow><m:mi>h</m:mi><m:mo stretchy="false">(</m:mo><m:mi>k</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemisaGKaeiikaGIaemOBa4MaeiykaKIaeyypa0ZaaabuaeaacqWGObaAcqGGOaakcqWGRbWAcqGGPaqkaSqaaiabdUgaRjabg2da9iabigdaXiabcYcaSiablAciljabcYcaSiabd6gaUbqab0GaeyyeIuoaaaa@3F34@</m:annotation></m:semantics></m:math></inline-formula>, which estimates the log<sub>2 </sub>of the number of actual sequences compatible with the codes called up to position <it>n</it>. This total entropy is used to rank tags from least to most ambiguous. Figure <figr fid="F3">3A</figr> shows that this ambiguity score correlates with, but is markedly different from, the Solexa fast-q quality score. The ambiguity metric is useful for genome assembly or polymorphism identification because it allows low-quality tags to be down-weighted when deriving statistics from multiple alignments of tags. As shown below, this metric can also be used to optimize tag lengths and increase the chance of identifying a match on the reference genome.</p>
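            <p>The entropy-based call can be sketched in a few lines of Python. The entropy and cumulative entropy follow the definitions above. The decision rule, however, is an assumption on our part: we keep the round(2<sup><it>h</it></sup>) most probable bases, which places the cutoffs at the midpoints 1.5, 2.5 and 3.5 of the state variable, and we emit the IUPAC code of that set. The function names are hypothetical and the cutoffs actually used are those described in Methods.</p>

```python
import math

# IUPAC code for each alphabetically sorted set of candidate bases
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}

def entropy_bits(probs):
    """h(k): Shannon entropy, in bits, of the four base probabilities."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def call_base(probs):
    """Return (IUPAC code, entropy) for one cycle of one colony."""
    h = entropy_bits(probs)
    # Assumed rule: keep round(2**h) bases, i.e. cutoffs at S = 1.5, 2.5, 3.5
    n_states = min(4, max(1, int(2 ** h + 0.5)))
    ranked = sorted(probs, key=probs.get, reverse=True)
    kept = "".join(sorted(ranked[:n_states]))
    return IUPAC[kept], h

def cumulative_entropy(per_cycle_probs):
    """H(n): sum of the per-cycle entropies of one colony."""
    return sum(entropy_bits(p) for p in per_cycle_probs)

confident = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
two_way = {"A": 0.02, "C": 0.02, "G": 0.50, "T": 0.46}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
for probs in (confident, two_way, uniform):
    code, h = call_base(probs)
    print(code, round(h, 3))
```

            <p>With this rule a confidently called cycle contributes almost nothing to the cumulative entropy, whereas an uninformative cycle contributes two bits, consistent with the interpretation of the cumulative entropy as the log<sub>2 </sub>of the number of compatible sequences.</p>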
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Base calling determined by entropy</p>
               </caption>
               <text>
                  <p><b>Base calling determined by entropy</b>. <b>A</b>. Probability simplex for a 3-letter alphabet (<it>A </it>= blue, <it>C </it>= red, <it>G </it>= green). Each point in the triangle is a probability triplet (<it>P</it><sub>A</sub>, <it>P</it><sub>C</sub>, <it>P</it><sub>G</sub>) represented by the corresponding color mixture. Blue lines are iso-entropic levels, black lines are the cutoffs between the various IUPAC codes. These correspond to midpoints in the state variable (<it>S </it>= 2<sup><it>h</it></sup>). <b>B</b>. Distribution of entropy per base across 10 tiles on 36 bases. Red lines at the bottom indicate the IUPAC cutoffs. Mass within each segment is indicated in red.</p>
               </text>
               <graphic file="1471-2105-9-431-2"/>
            </fig>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Quality and entropy depend on position in the sequence</p>
               </caption>
               <text>
                  <p><b>Quality and entropy depend on position in the sequence</b>. <b>A</b>. Quantile-quantile plot of fast-q quality score against the information content per base. The two measures are loosely correlated, but clearly not equivalent. <b>B</b>. Boxplot of the fast-q score along the first 35 bases of the sequencing. The overall base quality decreases sharply after base 14, but the distribution still extends up to the top 40 score at bases 30&#8211;35. <b>C</b>. Frequency of the four categories of ambiguous IUPAC codes as a function of the position in the sequence.</p>
               </text>
               <graphic file="1471-2105-9-431-3"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Genome coverage statistics</p>
            </st>
            <p>To assess the quality of our base calling and to compare it with the sequences obtained via Solexa's analysis pipeline, we compute the mapping efficiency #{reads mapping exactly to the genome}/#{total number of reads}. We used the <it>fetchGWI </it>tool <abbrgrp><abbr bid="B24">24</abbr></abbrgrp> to search for unique exact matches of each sequenced tag encoded in the IUPAC code on the 5386 nt reference phiX174 genome sequence [RefSeq:NC_001422]. We thus discard every tag that matches at more than one position or does not match exactly anywhere on the reference sequence. One lane (330 tiles) of the Solexa flow cell produced 8 M tags, 3 M unique tags and 3.8 M mappable tags, which amounts to a throughput of 137 million immediately usable bases per run. Sorting tags by decreasing quality, we see (Figure <figr fid="F4">4</figr>) that low-entropy tags are easily identified by both the Solexa and Rolexa pipelines, but that the coverage achieved by Rolexa-called tags increases significantly among the low-quality sequences, resulting in an increase in total coverage of 10&#8211;25% (average 15%). We also see that ranking by quality (or entropy, data not shown) is a judicious prioritization strategy since the coverage increase is sharp in the top part of the list and subsequently plateaus off.</p>
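            <p>The mapping efficiency defined above can be made concrete with a small Python sketch. It is not <it>fetchGWI</it>: it scans a toy reference on the forward strand only, with a naive sliding window, and all names and sequences in it are hypothetical. It only illustrates what a unique exact match means for a tag written in IUPAC codes.</p>

```python
# Bases allowed by each IUPAC code
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

def match_positions(tag, reference):
    """Start positions where every reference base is allowed by the tag code."""
    hits = []
    for start in range(len(reference) - len(tag) + 1):
        window = reference[start:start + len(tag)]
        if all(base in IUPAC_SETS[code] for code, base in zip(tag, window)):
            hits.append(start)
    return hits

def mapping_efficiency(tags, reference):
    """Fraction of tags with exactly one exact match on the reference."""
    unique = sum(1 for tag in tags if len(match_positions(tag, reference)) == 1)
    return unique / len(tags)

reference = "ACGTTGCAAC"               # toy reference, forward strand only
tags = ["ACGT", "RCGT", "AC", "GGGG"]  # R stands for A or G
print(mapping_efficiency(tags, reference))
```

            <p>In this toy example the tag containing the ambiguous code still maps uniquely, the two-base tag is discarded because it matches at two positions, and the last tag is discarded because it matches nowhere.</p>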
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Rolexa base-calling increases the coverage</p>
               </caption>
               <text>
                  <p><b>Rolexa base-calling increases the coverage</b>. Black: Solexa base calling, blue: Rolexa base calling using only the ACGT alphabet (most probable base calling), green: Rolexa base calling using IUPAC codes, red: Rolexa base calling with IUPAC codes and tag length optimization. Numbers in the right margin are the number of matching tags in millions. Sequence tags were sorted by decreasing quality (fast-q) and unique exact matches on the reference phiX174 genome were searched. Vertical axis shows the proportion of tags finding an exact match.</p>
               </text>
               <graphic file="1471-2105-9-431-4"/>
            </fig>
            <p>To estimate error rates of sequencing, we used <it>align0 </it><abbrgrp><abbr bid="B25">25</abbr></abbrgrp> to search for an optimal match between each tag and the phiX genome, and then computed the number of mismatches between tag and reference. Figure <figr fid="F5">5A</figr> shows how the error rate increases as a function of the sequencing cycle for Solexa tags. Rolexa tags called with the most probable ACGT base showed a slower increase, and introducing IUPAC codes significantly decreased both the intercept and slope of the error rate as a function of the sequencing cycle.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Disequilibrium between complementary bases ratio</p>
               </caption>
               <text>
                  <p><b>Disequilibrium between complementary bases ratio</b>. <b>A</b>. Error rate at each cycle of sequencing. Each tag was aligned on the genome using <it>align0 </it>and the error rate defined by counting the number of differences between the bases called and the reference at the corresponding position. Black is the error rate for Solexa-called tags, blue for Rolexa tags called using only the ACGT alphabet and green for Rolexa-called tags with IUPAC codes. <b>B</b>. Proportion of bases <it>A</it>, <it>C</it>, <it>G </it>and <it>T </it>at each position in the tags for Solexa base calling (dashed lines) and Rolexa base calling (continuous line). The complementary <it>A </it>and <it>T </it>proportions are different (ratio is not 1) and degrade along the sequences (lines drift apart). The proportions are less dependent on position with Rolexa base calling, although the ratios remain different from 1. Label on y-axis is wrong. Panels <b>C-D </b>focus on tags "rescued" by Rolexa base calling, namely those tags that could not be mapped on the genome after Solexa base calling, but had a matching position via Rolexa base calling. <b>C</b>. The distribution of substitutions between the Solexa tags and the corresponding Rolexa tags shows a predominance of <it>C </it>to <it>A </it>and <it>T </it>to <it>G </it>substitutions, which is consistent with a re-equilibration of the base complementarity. <b>D</b>. Introducing one to six mutations in the Solexa tags with the same frequencies as the Rolexa algorithm at random positions only rescues about 2% of the tags that were rescued by Rolexa with the same number of ambiguous bases (green bars).</p>
               </text>
               <graphic file="1471-2105-9-431-5"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Base distribution statistics</p>
            </st>
            <p>A surprising property of Solexa sequences is the imbalance between complementary <it>A </it>and <it>T </it>base counts as well as between <it>G </it>and <it>C</it><abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. As shown in Figure <figr fid="F5">5B</figr>, there is progressive deterioration in the proportions as the sequencing progresses, which is likely related to the varying noise levels across fluorescent dyes for complementary base pairs as well as dye-specific chemical effects (see Fig. <figr fid="F1">1</figr>). As a consequence, an intensity close to the background is more likely to be called <it>T </it>than <it>A</it>, or <it>C </it>than <it>G</it>. Applying our corrections at the level of intensities stabilizes the proportions of bases, an effect that is particularly pronounced for the T's. For reasons we do not currently understand, the A/T ratio is not exactly one but stabilizes around 0.9 (Figure <figr fid="F5">5B</figr>).</p>
            <p>To verify that our increased coverage is not simply a consequence of the more degenerate alphabet, we checked that introducing ambiguities at random positions does not similarly improve the mapping. We thus selected the tags that did not match on the genome based on Solexa base calling, but did match after Rolexa introduced one to five ambiguous bases. Then we introduced ambiguities in these tags, with the same frequency as Rolexa, but at random positions. Figure <figr fid="F5">5D</figr> shows that only about 2% of those randomized mutations found a match on the genome, indicating that the entropy is a specific predictor of ambiguous positions.</p>
         </sec>
         <sec>
            <st>
               <p>Optimizing tag length</p>
            </st>
            <p>While Solexa's quality score tends to decrease along the sequence, its distribution mostly spreads, rather than shifts, downwards (Fig. <figr fid="F3">3B</figr>). Computing a global length cutoff based on the average quality will therefore discard a lot of high-quality bases and not necessarily ensure a uniform quality. Thus we expect to increase the number of tags that can be mapped to a reference sequence by cutting them to a shorter length <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. However this procedure has a downside since it will reduce the coverage length per tag and increase the probability of finding multiple matches. Similarly, standard Solexa procedures suggest selecting tags with high average fast-q. Yet, a low average can be the result of just a few uncertain bases near the end of an otherwise useful tag.</p>
            <p>We tested the different selections by applying the following quality filters. For the Solexa method we cut the tags at length 20, 25, 26, 28, 30, and then filtered out all sequences with average fast-q score below 30, 25, or 20. In comparison, we used the following filtering procedure for Rolexa tags: we chose 3 different length-dependent entropy cutoffs <it>IT(k) </it>(see Methods) and searched within each read for the longest <it>k</it>-mer with total entropy less than <it>IT(k)</it>. We then extended this subsequence in both directions up to the next ambiguous base and finally removed all tags shorter than 10 bases. The coverage statistics for the different filters are summarized in Figure <figr fid="F6">6</figr>. We performed a similar analysis of the 330 tiles of the sequencing of targeted human genomic regions and found an average of 50% increase in nucleotide coverage (Additional file <supplr sid="S3">3</supplr>). We see that the efficiency of Rolexa is superior in all datasets as measured by the ratio of actual coverage to expected coverage as well as by the ratio of tags having a unique match on the genome. The latter criterion is important since in many applications of high-throughput sequencing (such as gene expression measures or ChIP-Seq), the extent of the coverage is less important than the number of hits on the genome. Similarly, in genotyping and targeted re-sequencing, where inexact matches are expected, the ability to reliably filter out low-quality tags before doing the matching to the reference sequence is of the highest importance, since actual polymorphisms must be distinguished from sequencing errors.</p>
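            <p>The Rolexa filtering procedure described above can be sketched in Python as follows. The function name, the example read and the constant cutoff are hypothetical, and two details are assumptions: among equally long candidate sub-tags the leftmost one is kept, and "extending up to the next ambiguous base" is read as absorbing neighbouring bases as long as they are plain A, C, G or T.</p>

```python
def best_subtag(codes, entropies, cutoff, min_len=10):
    """Longest k-mer with total entropy below cutoff(k), then extended.

    codes     -- called tag, one IUPAC letter per cycle
    entropies -- per-base entropies h(k), same length as codes
    cutoff    -- function k -> IT(k), the length-dependent entropy cutoff
    """
    n = len(codes)
    prefix = [0.0]  # prefix sums give every window total in constant time
    for h in entropies:
        prefix.append(prefix[-1] + h)
    for k in range(n, 0, -1):            # try the longest windows first
        for start in range(n - k + 1):
            if cutoff(k) > prefix[start + k] - prefix[start]:
                end = start + k
                # extend over unambiguous neighbours on both sides
                while start > 0 and codes[start - 1] in "ACGT":
                    start -= 1
                while n > end and codes[end] in "ACGT":
                    end += 1
                if min_len > end - start:
                    return None          # too short to be useful
                return codes[start:end]
    return None                          # no window passes the cutoff

read = "ACGTACGTACGTANNN"
read_entropy = [0.1] * 13 + [2.0] * 3
print(best_subtag(read, read_entropy, cutoff=lambda k: 2.0))
```

            <p>Here the three uninformative trailing bases are trimmed and the thirteen confident bases are kept, whereas a global length cutoff would either keep the noisy end or discard good bases from other reads.</p>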
            <suppl id="S3">
               <title>
                  <p>Additional File 3</p>
               </title>
               <text>
                  <p><b>Increased coverage of Rolexa data relative to Solexa data on a human sample</b>. A complete sequencing lane (330 tiles) was analyzed with the Rolexa and Solexa pipelines. The X axis represents the number of nucleotides covered by the sequences of a tile with Rolexa base-calling and the Y axis represents the ratio to the corresponding coverage with Solexa base-calling, with tags either restricted to 25 bases or kept at the full length of 36 bases.</p>
               </text>
               <file name="1471-2105-9-431-S3.png">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Tag-dependent quality filtering improves the mapping efficiency</p>
               </caption>
               <text>
                  <p><b>Tag-dependent quality filtering improves the mapping efficiency</b>. Several entropy cutoffs were used to filter low-quality Rolexa-called tags and to reduce tags to higher scoring sub-tags. Solexa-called tags were filtered to the same length as the average length of the previous sets and to various average fast-q scores. <b>A</b>. The actual coverage of the target genome as a function of the expected coverage (if all tags could have been mapped). <b>B</b>. The efficiency of the filtering in coverage ratio (actual number of nucleotides covered divided by expected number, X axis) and in tag mapping ratio (number of tags mapped to the genome divided by number of tags passing the quality filter, Y axis). Rolexa (red points) has superior efficiency to Solexa (green points) in all data sets. Points are labeled with the cutoffs used (see text): Rolexa cutoffs are either constant (2, 4, 6, 8), growing logarithmically (Log) or exponentially (Exp); Solexa cutoffs are indicated by two numbers, the length cutoff followed by the fast-q cutoff.</p>
               </text>
               <graphic file="1471-2105-9-431-6"/>
            </fig>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>Several points in the analysis of Solexa high-throughput sequencing technology can likely benefit from further improvements. First, the disequilibrium between complementary bases should be reduced. Although phiX174 is a single-stranded DNA virus, the library was prepared from the double-stranded covalently closed circular form of the genome. As shown, the output of the sequencing shows an increasing deterioration of the equilibrium between complementary bases as the sequencing cycles proceed (Figure <figr fid="F5">5B</figr>). Our approach improves on this but does not solve the issue completely.</p>
         <p>Similar approaches have recently been proposed. Dohm et al.<abbrgrp><abbr bid="B14">14</abbr></abbrgrp> have observed biases similar to the ones described here, but only proposed to correct them at the level of the sequence alignment, not at the level of the base calling. Cokus et al.<abbrgrp><abbr bid="B12">12</abbr></abbrgrp> use Solexa's pre-treated data (_sig2 files) and apply a very similar EM procedure to fit a Gaussian mixture model for probabilistic base calling. They do not use information-based metrics to reduce the probabilities to IUPAC codes, but rather construct position-weight matrices with which they scan the reference genome, which is computationally expensive and not directly applicable to <it>de-novo </it>sequencing. Erlich et al.<abbrgrp><abbr bid="B13">13</abbr></abbrgrp> train a Support Vector Machine optimized on a reference sequence, which is computationally highly expensive. Rolexa only needs a (nowadays common) multi-core computer and runs a complete analysis of one lane in 10 hours on 5 cores. Moreover, it is based on modeling the biochemical properties of the system.</p>
         <p>We have not considered here the potentially important benefits of fine-tuning the image analysis algorithms. Looking at images generated by the microscopic device shows that when the density of colonies is high in some region of the images, bleeding-over occurs and assigning the correct fluorescence intensity to each colony is clearly a delicate problem (see <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>).</p>
         <p>Due to the large file size and format of the Solexa output data, concurrently (and randomly) accessing 20,000 text files puts a heavy strain on any standard file system, not to mention backup devices. Rolexa works with compressed inputs and outputs, which already reduces file size considerably. Still, a better suited file format could help both the storage and the processing, e.g. using suffix tables and trees<abbrgrp><abbr bid="B27">27</abbr><abbr bid="B28">28</abbr></abbrgrp>. The latest GAII upgrade to the Solexa/Illumina sequencer generates even more data, through larger acquisition area, longer reads, and paired-end sequencing. Generating longer reads requires efficient and reliable algorithms for base calling with reasonable levels of accuracy up to the end of the read. Furthermore, this increased throughput requires these algorithms to be fast and based on direct and simple methods that are re-usable without tuning from one run to the next.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>Solexa/Illumina high-throughput sequencing has already produced, and will increasingly produce, vast amounts of systems-scale genomics and functional genomics data. As with other high-throughput techniques, improvements in signal processing and statistical assessment of the data will prove to be a key step in the maturation of the technology and the progress towards reliable applications and new discoveries<abbrgrp><abbr bid="B29">29</abbr></abbrgrp>.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Sample preparation and Genome Analyzer sequencing</p>
            </st>
            <p>The phiX174 Control Library used was prepared by Illumina (Cat. No CT-901-1001). Briefly, the double-stranded covalently closed circular form of the viral DNA was broken into 100&#8211;400 bp fragments by nebulization; the ends were repaired with Klenow, T4 DNA polymerase and PNK; and a base <it>A </it>was added to the 3' ends. After ligation of the double-stranded genomic adapters the sample was gel-purified to isolate fragments with "inserts" of approximately 200 bp and amplified by 18 cycles of PCR (Illumina protocol "Preparing Samples for Sequencing Genomic DNA", Part # 11251892 Rev. A). The library was quality-controlled by cloning an aliquot into a TOPO plasmid and capillary sequencing 5&#8211;10 clones.</p>
            <p>DNA colonies were prepared by using a "Standard Cluster Generation Kit" (Cat. No. FC-103-1001) and 35 cycles of isothermal amplification in the flow-cell on the "Illumina Cluster Station" using a pM dilution of the 10 nM library. After amplification, one of the strands was removed; the free 3'-ends were blocked by terminal transferase in presence of dideoxynucleotides; and the genomic sequencing primer was hybridized. The flow-cell was transferred to the Genome Analyzer "classic" and sequencing was performed for 36 cycles using a "36 Cycle Sequencing Kit" (Cat. No FC-104-1003) with version 2.0 of the scanning buffer.</p>
         </sec>
         <sec>
            <st>
               <p>Sequencing of Human cells</p>
            </st>
            <p>The samples used for Additional file <supplr sid="S3">3</supplr> came from the pooled DNA obtained by long-range PCR amplification<abbrgrp><abbr bid="B30">30</abbr></abbrgrp> of a 30 kb region of chromosome 19 from 3 different individuals plus a 50 kb region of chromosome 3 from a fourth individual. Sequencing was performed as described above for the phiX174.</p>
         </sec>
         <sec>
            <st>
               <p>Data analysis</p>
            </st>
            <p>All data analysis for this paper has been performed with the R statistical framework <url>http://www.r-project.org/</url> and the Rolexa package. This package uses the <it>mclust </it>routines<abbrgrp><abbr bid="B20">20</abbr></abbrgrp> as well as the <it>fork </it>package to run efficiently on multi-core architectures. Matching of short tags onto the genome has been performed with the <it>fetchGWI </it>tool<abbrgrp><abbr bid="B24">24</abbr></abbrgrp> by first generating a comprehensive index of the phiX174 genome and matching each query with its index entry. We used <it>align0 </it><abbrgrp><abbr bid="B25">25</abbr></abbrgrp> to search for best matches from tags to the genome and estimate error rates (see Fig. <figr fid="F5">5A</figr>). When counting errors, an alignment of an IUPAC code with one of its compatible bases was counted as a correct match.</p>
            <p>Raw data analysis (image analysis, initial base calling and fast-q scores) used the <it>Firecrest </it>image analysis module and the <it>Bustard </it>base-caller from the Illumina software suite (SolexaPipeline-0.2.2.6). No filtering or analysis with <it>Gerald </it>was performed.</p>
         </sec>
         <sec>
            <st>
               <p>Preliminary data transformation</p>
            </st>
            <p>We model the measured intensities I(<it>&#945;</it>, <it>n</it>, <it>x</it>) (<it>&#945; </it>= <it>A</it>, <it>C</it>, <it>G</it>, <it>T </it>is the dye channel, <it>n </it>= 1, ..., <it>36 </it>is the cycle number and <it>x </it>denotes the colony coordinates) as the following combination of unbiased intensities <it>J</it>(<it>&#945;</it>, <it>n</it>, <it>x</it>):</p>
            <p>
               <display-formula>
                  <m:math name="1471-2105-9-431-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>I</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>&#945;</m:mi>
                           <m:mo>,</m:mo>
                           <m:mi>n</m:mi>
                           <m:mo>,</m:mo>
                           <m:mi>x</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:munder>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>m</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                    <m:mo>,</m:mo>
                                    <m:mn>...</m:mn>
                                    <m:mo>,</m:mo>
                                    <m:mi>n</m:mi>
                                 </m:mrow>
                              </m:munder>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:munder>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mrow>
                                          <m:mi>&#946;</m:mi>
                                          <m:mo>=</m:mo>
                                          <m:mi>A</m:mi>
                                          <m:mo>,</m:mo>
                                          <m:mi>C</m:mi>
                                          <m:mo>,</m:mo>
                                          <m:mi>G</m:mi>
                                          <m:mo>,</m:mo>
                                          <m:mi>T</m:mi>
                                       </m:mrow>
                                    </m:munder>
                                    <m:mrow>
                                       <m:mi>M</m:mi>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>&#945;</m:mi>
                                       <m:mo>,</m:mo>
                                       <m:mi>&#946;</m:mi>
                                       <m:mo stretchy="false">)</m:mo>
                                       <m:mi>J</m:mi>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>&#946;</m:mi>
                                       <m:mo>,</m:mo>
                                       <m:mi>m</m:mi>
                                       <m:mo>,</m:mo>
                                       <m:mi>x</m:mi>
                                       <m:mo stretchy="false">)</m:mo>
                                       <m:mi>R</m:mi>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>m</m:mi>
                                       <m:mo>,</m:mo>
                                       <m:mi>n</m:mi>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                           </m:mstyle>
                           <m:mo>,</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemysaKKaeiikaGIaeqySdeMaeiilaWIaemOBa4MaeiilaWIaemiEaGNaeiykaKIaeyypa0Zaaabuaeaadaaeqbqaaiabd2eanjabcIcaOiabeg7aHjabcYcaSiabek7aIjabcMcaPiabdQeakjabcIcaOiabek7aIjabcYcaSiabd2gaTjabcYcaSiabdIha4jabcMcaPiabdkfasjabcIcaOiabd2gaTjabcYcaSiabd6gaUjabcMcaPaWcbaGaeqOSdiMaeyypa0JaemyqaeKaeiilaWIaem4qamKaeiilaWIaem4raCKaeiilaWIaemivaqfabeqdcqGHris5aaWcbaGaemyBa0Maeyypa0JaeGymaeJaeiilaWIaeiOla4IaeiOla4IaeiOla4IaeiilaWIaemOBa4gabeqdcqGHris5aOGaeiilaWcaaa@64BE@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>where the 4 &#215; 4 matrix <it>M </it>is a mixture matrix which is block diagonal and depends on the 4 parameters <it>&#981;</it><sub><it>AC</it></sub>, <it>&#952;</it><sub><it>AC</it></sub>, <it>&#981;</it><sub><it>GT </it></sub>and <it>&#952;</it><sub><it>GT</it></sub>:</p>
            <p>
               <display-formula>
                  <m:math name="1471-2105-9-431-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>M</m:mi>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mrow>
                                 <m:mrow>
                                    <m:mo>{</m:mo>
                                    <m:mrow>
                                       <m:mi>A</m:mi>
                                       <m:mo>,</m:mo>
                                       <m:mi>C</m:mi>
                                    </m:mrow>
                                    <m:mo>}</m:mo>
                                 </m:mrow>
                                 <m:mo>,</m:mo>
                                 <m:mrow>
                                    <m:mo>{</m:mo>
                                    <m:mrow>
                                       <m:mi>A</m:mi>
                                       <m:mo>,</m:mo>
                                       <m:mi>C</m:mi>
                                    </m:mrow>
                                    <m:mo>}</m:mo>
                                 </m:mrow>
                              </m:mrow>
                              <m:mo>)</m:mo>
                           </m:mrow>
                           <m:mo>=</m:mo>
                           <m:mrow>
                              <m:mo>(</m:mo>
                              <m:mrow>
                                 <m:mtable>
                                    <m:mtr>
                                       <m:mtd>
                                          <m:mrow>
                                             <m:mi>cos</m:mi>
                                             <m:mo>&#8289;</m:mo>
                                             <m:msub>
                                                <m:mi>&#952;</m:mi>
                                                <m:mrow>
                                                   <m:mi>A</m:mi>
                                                   <m:mi>C</m:mi>
                                                </m:mrow>
                                             </m:msub>
                                          </m:mrow>
                                       </m:mtd>
                                       <m:mtd>
                                          <m:mrow>
                                             <m:mi>sin</m:mi>
                                             <m:mo>&#8289;</m:mo>
                                             <m:msub>
                                                <m:mi>&#952;</m:mi>
                                                <m:mrow>
                                                   <m:mi>A</m:mi>
                                                   <m:mi>C</m:mi>
                                                </m:mrow>
                                             </m:msub>
                                          </m:mrow>
                                       </m:mtd>
                                    </m:mtr>
                                    <m:mtr>
                                       <m:mtd>
                                          <m:mrow>
                                             <m:mi>cos</m:mi>
                                             <m:mo>&#8289;</m:mo>
                                             <m:msub>
                                                <m:mi>&#966;</m:mi>
                                                <m:mrow>
                                                   <m:mi>A</m:mi>
                                                   <m:mi>C</m:mi>
                                                </m:mrow>
                                             </m:msub>
                                          </m:mrow>
                                       </m:mtd>
                                       <m:mtd>
                                          <m:mrow>
                                             <m:mi>sin</m:mi>
                                             <m:mo>&#8289;</m:mo>
                                             <m:msub>
                                                <m:mi>&#966;</m:mi>
                                                <m:mrow>
                                                   <m:mi>A</m:mi>
                                                   <m:mi>C</m:mi>
                                                </m:mrow>
                                             </m:msub>
                                          </m:mrow>
                                       </m:mtd>
                                    </m:mtr>
                                 </m:mtable>
                              </m:mrow>
                              <m:mo>)</m:mo>
                           </m:mrow>
                           <m:mo>,</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyta00aaeWaaeaadaGadaqaaiabdgeabjabcYcaSiabdoeadbGaay5Eaiaaw2haaiabcYcaSmaacmaabaGaemyqaeKaeiilaWIaem4qameacaGL7bGaayzFaaaacaGLOaGaayzkaaGaeyypa0ZaaeWaaeaafaqabeGacaaabaGaei4yamMaei4Ba8Maei4CamNaeqiUde3aaSbaaSqaaiabdgeabjabdoeadbqabaaakeaacyGGZbWCcqGGPbqAcqGGUbGBcqaH4oqCdaWgaaWcbaGaemyqaeKaem4qameabeaaaOqaaiGbcogaJjabc+gaVjabcohaZjabeA8aQnaaBaaaleaacqWGbbqqcqWGdbWqaeqaaaGcbaGagi4CamNaeiyAaKMaeiOBa4MaeqOXdO2aaSbaaSqaaiabdgeabjabdoeadbqabaaaaaGccaGLOaGaayzkaaGaeiilaWcaaa@5E4C@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>and similarly for the <it>G</it>, <it>T </it>block, and the dephasing matrix <it>R </it>is a function of the parameter <it>q </it>and has a binomial structure:</p>
            <p>
               <display-formula>
                  <m:math name="1471-2105-9-431-i5" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>R</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>m</m:mi>
                           <m:mo>,</m:mo>
                           <m:mi>n</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mrow>
                              <m:mo>{</m:mo>
                              <m:mrow>
                                 <m:mtable>
                                    <m:mtr>
                                       <m:mtd>
                                          <m:mrow>
                                             <m:mn>0</m:mn>
                                             <m:mtext>&#160;if&#160;</m:mtext>
                                             <m:mi>m</m:mi>
                                             <m:mo>></m:mo>
                                             <m:mi>n</m:mi>
                                             <m:mo>,</m:mo>
                                          </m:mrow>
                                       </m:mtd>
                                    </m:mtr>
                                    <m:mtr>
                                       <m:mtd>
                                          <m:mrow>
                                             <m:mrow>
                                                <m:mo>(</m:mo>
                                                <m:mrow>
                                                   <m:mtable>
                                                      <m:mtr>
                                                         <m:mtd>
                                                            <m:mi>n</m:mi>
                                                         </m:mtd>
                                                      </m:mtr>
                                                      <m:mtr>
                                                         <m:mtd>
                                                            <m:mi>m</m:mi>
                                                         </m:mtd>
                                                      </m:mtr>
                                                   </m:mtable>
                                                </m:mrow>
                                                <m:mo>)</m:mo>
                                             </m:mrow>
                                             <m:msup>
                                                <m:mi>q</m:mi>
                                                <m:mrow>
                                                   <m:mi>n</m:mi>
                                                   <m:mo>&#8722;</m:mo>
                                                   <m:mi>m</m:mi>
                                                </m:mrow>
                                             </m:msup>
                                             <m:msup>
                                                <m:mrow>
                                                   <m:mo stretchy="false">(</m:mo>
                                                   <m:mn>1</m:mn>
                                                   <m:mo>&#8722;</m:mo>
                                                   <m:mi>q</m:mi>
                                                   <m:mo stretchy="false">)</m:mo>
                                                </m:mrow>
                                                <m:mi>m</m:mi>
                                             </m:msup>
                                             <m:mtext>&#160;if&#160;</m:mtext>
                                             <m:mi>m</m:mi>
                                             <m:mo>&#8804;</m:mo>
                                             <m:mi>n</m:mi>
                                             <m:mo>.</m:mo>
                                          </m:mrow>
                                       </m:mtd>
                                    </m:mtr>
                                 </m:mtable>
                              </m:mrow>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemOuaiLaeiikaGIaemyBa0MaeiilaWIaemOBa4MaeiykaKIaeyypa0ZaaiqaaeaafaqabeGabaaabaGaeGimaaJaeeiiaaIaeeyAaKMaeeOzayMaeeiiaaIaemyBa0MaeyOpa4JaemOBa4MaeiilaWcabaWaaeWaaeaafaqabeGabaaabaGaemOBa4gabaGaemyBa0gaaaGaayjkaiaawMcaaiabdghaXnaaCaaaleqabaGaemOBa4MaeyOeI0IaemyBa0gaaOGaeiikaGIaeGymaeJaeyOeI0IaemyCaeNaeiykaKYaaWbaaSqabeaacqWGTbqBaaGccqqGGaaicqqGPbqAcqqGMbGzcqqGGaaicqWGTbqBcqGHKjYOcqWGUbGBcqGGUaGlaaaacaGL7baaaaa@5893@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
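            <p>As an illustration of the binomial structure above, the dephasing matrix can be tabulated directly from its definition. The following Python sketch is not taken from the Rolexa package; the function name <it>dephasing_matrix</it> and the value of <it>q</it> are chosen only for the example.</p>
            <p>
```python
from math import comb

import numpy as np


def dephasing_matrix(n_cycles, q):
    """Return R with R[m-1, n-1] = C(n, m) * q**(n - m) * (1 - q)**m for n >= m, else 0."""
    R = np.zeros((n_cycles, n_cycles))
    for n in range(1, n_cycles + 1):
        for m in range(1, n + 1):
            R[m - 1, n - 1] = comb(n, m) * q ** (n - m) * (1 - q) ** m
    return R


# Hypothetical value of q, for illustration only; 36 cycles as in the text.
R = dephasing_matrix(36, 0.01)
```
            </p>
            <p>With the summation over <it>m </it>starting at 1, column <it>n </it>of this matrix sums to 1 &#8722; <it>q</it><sup><it>n</it></sup>, since the <it>m </it>= 0 term of the binomial expansion is not included.</p>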
            <p>The parameters <it>&#981;</it><sub><it>AC</it></sub>, <it>&#952;</it><sub><it>AC</it></sub>, <it>&#981;</it><sub><it>GT</it></sub>, <it>&#952;</it><sub><it>GT </it></sub>are determined by minimizing the following function:</p>
            <p>
               <display-formula><it>F</it><sub><it>n</it></sub>(<it>&#952;</it><sub><it>AC</it></sub>, <it>&#981;</it><sub><it>AC</it></sub>, <it>&#952;</it><sub><it>GT</it></sub>, <it>&#981;</it><sub><it>GT</it></sub>) = cor(<it>M</it><sup>-1</sup><it>I </it>(<it>A</it>, <it>n</it>, &#8226;), <it>M</it><sup>-1 </sup><it>I</it>(<it>C</it>, <it>n</it>, &#8226;))<sup>2 </sup>+ cor(<it>M</it><sup>-1 </sup><it>I</it>(<it>G</it>, <it>n</it>, &#8226;), <it>M</it><sup>-1</sup><it>I</it>(<it>T</it>, <it>n</it>, &#8226;))<sup>2</sup>,</display-formula>
            </p>
            <p>which defines an intermediate intensity matrix <it>K </it>= <it>M</it><sup>-1 </sup><it>I</it>. This is then introduced into the function</p>
            <p>
               <display-formula>
                  <m:math name="1471-2105-9-431-i6" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>G</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>q</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:munder>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>&#945;</m:mi>
                                    <m:mo>,</m:mo>
                                    <m:mi>n</m:mi>
                                 </m:mrow>
                              </m:munder>
                              <m:mrow>
                                 <m:mi>cor</m:mi>
                                 <m:mo>&#8289;</m:mo>
                                 <m:msup>
                                    <m:mrow>
                                       <m:mrow>
                                          <m:mo>(</m:mo>
                                          <m:mrow>
                                             <m:msup>
                                                <m:mi>R</m:mi>
                                                <m:mrow>
                                                   <m:mo>&#8722;</m:mo>
                                                   <m:mn>1</m:mn>
                                                </m:mrow>
                                             </m:msup>
                                             <m:mi>K</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>&#945;</m:mi>
                                             <m:mo>,</m:mo>
                                             <m:mi>n</m:mi>
                                             <m:mo>,</m:mo>
                                             <m:mo>&#8226;</m:mo>
                                             <m:mo stretchy="false">)</m:mo>
                                             <m:mo>,</m:mo>
                                             <m:msup>
                                                <m:mi>R</m:mi>
                                                <m:mrow>
                                                   <m:mo>&#8722;</m:mo>
                                                   <m:mn>1</m:mn>
                                                </m:mrow>
                                             </m:msup>
                                             <m:mi>K</m:mi>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>&#945;</m:mi>
                                             <m:mo>,</m:mo>
                                             <m:mi>n</m:mi>
                                             <m:mo>+</m:mo>
                                             <m:mn>1</m:mn>
                                             <m:mo>,</m:mo>
                                             <m:mo>&#8226;</m:mo>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                          <m:mo>)</m:mo>
                                       </m:mrow>
                                    </m:mrow>
                                    <m:mn>2</m:mn>
                                 </m:msup>
                              </m:mrow>
                           </m:mstyle>
                           <m:mo>,</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4raCKaeiikaGIaemyCaeNaeiykaKIaeyypa0ZaaabuaeaacqGGJbWycqGGVbWBcqGGYbGCdaqadaqaaiabdkfasnaaCaaaleqabaGaeyOeI0IaeGymaedaaOGaem4saSKaeiikaGIaeqySdeMaeiilaWIaemOBa4MaeiilaWIaeyOiGCRaeiykaKIaeiilaWIaemOuai1aaWbaaSqabeaacqGHsislcqaIXaqmaaGccqWGlbWscqGGOaakcqaHXoqycqGGSaalcqWGUbGBcqGHRaWkcqaIXaqmcqGGSaalcqGHIaYTcqGGPaqkaiaawIcacaGLPaaadaahaaWcbeqaaiabikdaYaaaaeaacqaHXoqycqGGSaalcqWGUbGBaeqaniabggHiLdGccqGGSaalaaa@5A73@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>which is minimized to determine <it>q</it>.</p>
            <p>Lastly, we correct systematic bias as a function of the cluster coordinates as follows: for each tile and cycle, we fit a 2-dimensional lowess <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> as a function of the (<it>x</it>, <it>y</it>) coordinates and then subtract the difference between that fit and the median intensity across all four channels.</p>
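            <p>The position-dependent correction can be illustrated with a minimal Python sketch. It is not the Rolexa implementation: a tricube-weighted local mean (degree-0 local regression, without robustness iterations) is swapped in for the full lowess fit, and the function names, the <it>span</it> parameter and the choice to smooth each channel separately are assumptions made for illustration.</p>

```python
import math
import statistics


def tricube(u):
    """Tricube kernel (the lowess weight function); zero for u at or beyond 1."""
    return (1 - u ** 3) ** 3 if 1 > u else 0.0


def local_smooth(xs, ys, values, span=0.5):
    """Simplified stand-in for a 2-D lowess fit (assumption, not Rolexa code).

    Each fitted value is the tricube-weighted mean of the values at the
    nearest clusters; `span` is the fraction of clusters in each neighbourhood.
    """
    n = len(values)
    k = min(n, max(2, math.ceil(span * n)))
    fitted = []
    for i in range(n):
        dists = [math.hypot(xs[i] - xs[j], ys[i] - ys[j]) for j in range(n)]
        # Bandwidth = distance to the k-th nearest cluster (kept strictly positive).
        bandwidth = max(sorted(dists)[k - 1], 1e-12)
        weights = [tricube(d / bandwidth) for d in dists]
        total = sum(weights)  # never zero: the cluster itself has weight 1
        fitted.append(sum(w * v for w, v in zip(weights, values)) / total)
    return fitted


def correct_tile(xs, ys, channels, span=0.5):
    """Correct the four channels of one tile and cycle.

    `channels` holds four lists of intensities, one per base, indexed like
    xs and ys. The reference level is the median over all four channels.
    """
    reference = statistics.median([v for chan in channels for v in chan])
    corrected = []
    for chan in channels:
        fitted = local_smooth(xs, ys, chan, span)
        # Subtract the difference between the spatial fit and the reference.
        corrected.append([v - (f - reference) for v, f in zip(chan, fitted)])
    return corrected
```

            <p>In this sketch, a channel whose intensity is spatially constant is mapped onto the common median, whereas local deviations from the fitted surface are preserved.</p>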
         </sec>
         <sec>
            <st>
               <p>Model-based clustering and data fitting</p>
            </st>
            <p>We used the <it>EEV</it> model of the <it>mclust</it> algorithm <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> to fit the Gaussian mixtures used to assign base probabilities as a function of the four-dimensional intensity vector, similar to the approach taken in <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. This model assumes a Gaussian mixture with four covariance matrices of the same shape and volume but with varying orientations. We initialize the classification by assigning each colony to the nucleotide with the highest (corrected) intensity. Given that initial classification, an M step of the <it>mclust</it> algorithm is performed, which computes the maximum-likelihood parameter estimates given the class assignments; the parameters to estimate are the global scale and shape parameters as well as the center and orientation of each class (using the covariance parameterization described in <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>). This is followed by an E step of the EM algorithm, which estimates the conditional probability of each data point belonging to each class given the parameter estimates obtained previously. Full convergence of the EM algorithm is offered as an option but occasionally runs into spurious optima due to the effect of outliers (similar to what was observed in <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>). Further details of the implementation can be found in the package documentation (see Availability section).</p>
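            <p>The sequence of a hard initial classification, one M step and one E step can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the <it>mclust</it> or Rolexa code: to stay dependency-free, a single spherical variance shared by all classes is swapped in for the EEV covariance model, and all function names are hypothetical.</p>

```python
import math


def initial_classes(intensities):
    """Hard start: assign each colony to the channel with the highest intensity."""
    return [max(range(4), key=lambda k: row[k]) for row in intensities]


def m_step(intensities, classes):
    """Estimate mixing weights, class centers and one shared spherical variance.

    Simplification (assumption): the EEV model would instead estimate a common
    volume and shape plus one orientation per class.
    """
    n = len(intensities)
    counts = [0] * 4
    sums = [[0.0] * 4 for _ in range(4)]
    for row, k in zip(intensities, classes):
        counts[k] += 1
        for j in range(4):
            sums[k][j] += row[j]
    if min(counts) == 0:
        raise ValueError("every base needs at least one colony in this sketch")
    centers = [[s / counts[k] for s in sums[k]] for k in range(4)]
    squared = 0.0
    for row, k in zip(intensities, classes):
        squared += sum((row[j] - centers[k][j]) ** 2 for j in range(4))
    variance = squared / (4 * n)  # must be positive for the E step
    weights = [c / n for c in counts]
    return weights, centers, variance


def e_step(intensities, weights, centers, variance):
    """Conditional probability of each base for each colony."""
    posteriors = []
    for row in intensities:
        logs = []
        for k in range(4):
            dist2 = sum((row[j] - centers[k][j]) ** 2 for j in range(4))
            # The Gaussian normalisation is shared by all classes and cancels.
            logs.append(math.log(weights[k]) - dist2 / (2 * variance))
        top = max(logs)
        unnorm = [math.exp(v - top) for v in logs]
        total = sum(unnorm)
        posteriors.append([u / total for u in unnorm])
    return posteriors
```

            <p>Calling <it>initial_classes</it>, then <it>m_step</it>, then <it>e_step</it> on the corrected intensities of one cycle yields one probability vector per colony; iterating the last two calls would correspond to running EM to convergence.</p>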
         </sec>
         <sec>
            <st>
               <p>Cutoffs for base calling and tag length</p>
            </st>
            <p>The Rolexa algorithms require two types of cutoffs, both of which can easily be set by the user in the Rolexa package. First, in the analyses presented, the limits between the different IUPAC bases in the probability simplex (Figure <figr fid="F2">2A</figr>) were set to <it>HT</it>(<it>n</it>) = log<sub>2</sub>(<it>n</it> + 0.5) with <it>n</it> = 1, 2, 3 (Figure <figr fid="F2">2B</figr>). Second, the length-dependent cutoffs <it>IT</it>(<it>n</it>) were used to filter out uncertain bases by selecting the longest sub-tag <it>S</it> with total entropy smaller than <it>IT</it>(<it>n</it> = length(<it>S</it>)). In Figure <figr fid="F6">6</figr> we used the following six choices: the constant cutoffs <it>IT</it><sub><it>c</it></sub>(<it>n</it>) = <it>c</it> with <it>c</it> set to 2, 4, 6, or 8, and two cutoffs that increase with the tag length: <it>IT</it><sub>Log</sub>(<it>n</it>) = log<sub>2</sub>(4 + (<it>n</it> - 1)/5) and <it>IT</it><sub>Exp</sub>(<it>n</it>) = 2<sup>(1+(<it>n</it>-1)/36)</sup>. The latter two cutoffs interpolate between 2 and approximately 4 over the length of the sequence, but the first is concave (it increases faster at the beginning) and the second is convex.</p>
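            <p>The cutoffs above translate directly into code. The following Python sketch is illustrative only and is not the Rolexa implementation: it assumes that the total entropy of a sub-tag is the sum of its per-base entropies, that a sub-tag may start at any position, and that the number of bases in an IUPAC call is the smallest <it>n</it> whose threshold <it>HT</it>(<it>n</it>) exceeds the entropy of the base probabilities. The function names are hypothetical.</p>

```python
import math


def ht(n):
    """IUPAC threshold HT(n) = log2(n + 0.5) for n = 1, 2, 3."""
    return math.log2(n + 0.5)


def it_const(c):
    """Constant cutoff IT_c(n) = c."""
    return lambda n: c


def it_log(n):
    """Concave cutoff IT_Log(n) = log2(4 + (n - 1) / 5)."""
    return math.log2(4 + (n - 1) / 5)


def it_exp(n):
    """Convex cutoff IT_Exp(n) = 2 ** (1 + (n - 1) / 36)."""
    return 2 ** (1 + (n - 1) / 36)


def entropy(probs):
    """Shannon entropy (in bits) of one base-probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def iupac_degeneracy(probs):
    """Number of bases in the IUPAC call (assumed rule, see lead-in)."""
    h = entropy(probs)
    for n in (1, 2, 3):
        if ht(n) > h:
            return n
    return 4


def longest_subtag(entropies, cutoff):
    """Return (start, length) of the longest sub-tag whose total entropy is
    smaller than cutoff(length); (0, 0) if no sub-tag qualifies.

    Ties are resolved in favour of the leftmost sub-tag (assumption).
    """
    prefix = [0.0]
    for h in entropies:
        prefix.append(prefix[-1] + h)
    size = len(entropies)
    for length in range(size, 0, -1):
        for start in range(size - length + 1):
            total = prefix[start + length] - prefix[start]
            if cutoff(length) > total:
                return start, length
    return 0, 0
```

            <p>For example, passing the per-base entropies of a read together with <it>it_log</it> or <it>it_const(2)</it> to <it>longest_subtag</it> returns the start and length of the retained sub-tag under that cutoff.</p>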
         </sec>
         <sec>
            <st>
               <p>Availability</p>
            </st>
            <p>We have developed an R package called Rolexa, which is freely available from <url>http://bbcf.epfl.ch/Software</url>. It is distributed under the GPL and uses the <it>mclust</it> package, which is part of the R distribution.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>JR and AA implemented the method; JR and CI analyzed the data; JR and FN wrote the manuscript; FN and IX designed and supervised the study. LF provided insight and data and performed the experiments. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>FN thanks the Swiss National Science Foundation grant no 3100A0-113617 for financial support. We are grateful to Carlo Rivolta for providing early access to his data. Part of the data analysis was performed on the Vital-IT high-performance computing facility of the Swiss Institute of Bioinformatics.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Whole-genome re-sequencing</p>
            </title>
            <aug>
               <au>
                  <snm>Bentley</snm>
                  <fnm>DR</fnm>
               </au>
            </aug>
            <source>Current Opinion in Genetics &amp; Development</source>
            <pubdate>2006</pubdate>
            <volume>16</volume>
            <issue>6</issue>
            <fpage>545</fpage>
            <lpage>552</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.gde.2006.10.009</pubid>
                  <pubid idtype="pmpid" link="fulltext">17055251</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Mapping translocation breakpoints by next-generation sequencing</p>
            </title>
            <aug>
               <au>
                  <snm>Chen</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Kalscheuer</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Tzschach</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Menzel</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Ullmann</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Schulz</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Erdogan</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Kijas</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Arkesteijn</snm>
                  <fnm>G</fnm>
               </au>
               <etal/>
            </aug>
            <source>Genome Research</source>
            <pubdate>2008</pubdate>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pubmed">18326688</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Paired-end mapping reveals extensive structural variation in the human genome</p>
            </title>
            <aug>
               <au>
                  <snm>Korbel</snm>
                  <fnm>JO</fnm>
               </au>
               <au>
                  <snm>Urban</snm>
                  <fnm>AE</fnm>
               </au>
               <au>
                  <snm>Affourtit</snm>
                  <fnm>JP</fnm>
               </au>
               <au>
                  <snm>Godwin</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Grubert</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Simons</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Kim</snm>
                  <fnm>PM</fnm>
               </au>
               <au>
                  <snm>Palejev</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Carriero</snm>
                  <fnm>NJ</fnm>
               </au>
               <au>
                  <snm>Du</snm>
                  <fnm>L</fnm>
               </au>
               <etal/>
            </aug>
            <source>Science</source>
            <pubdate>2007</pubdate>
            <volume>318</volume>
            <issue>5849</issue>
            <fpage>420</fpage>
            <lpage>426</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.1149504</pubid>
                  <pubid idtype="pmpid" link="fulltext">17901297</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Identification of microRNAs and other small regulatory RNAs using cDNA library sequencing</p>
            </title>
            <aug>
               <au>
                  <snm>Hafner</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Landgraf</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Ludwig</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Rice</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Ojo</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Holoch</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Lim</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Tuschl</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Methods</source>
            <pubdate>2008</pubdate>
            <volume>44</volume>
            <issue>1</issue>
            <fpage>3</fpage>
            <lpage>12</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.ymeth.2007.09.009</pubid>
                  <pubid idtype="pmpid" link="fulltext">18158127</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing</p>
            </title>
            <aug>
               <au>
                  <snm>Vera</snm>
                  <fnm>JC</fnm>
               </au>
               <au>
                  <snm>Wheat</snm>
                  <fnm>CW</fnm>
               </au>
               <au>
                  <snm>Fescemyer</snm>
                  <fnm>HW</fnm>
               </au>
               <au>
                  <snm>Frilander</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Crawford</snm>
                  <fnm>DL</fnm>
               </au>
               <au>
                  <snm>Hanski</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Marden</snm>
                  <fnm>JH</fnm>
               </au>
            </aug>
            <source>Mol Ecol</source>
            <pubdate>2008</pubdate>
            <volume>17</volume>
            <issue>7</issue>
            <fpage>1636</fpage>
            <lpage>1647</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1111/j.1365-294X.2008.03666.x</pubid>
                  <pubid idtype="pmpid" link="fulltext">18266620</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Discovering microRNAs from deep sequencing data using miRDeep</p>
            </title>
            <aug>
               <au>
                  <snm>Friedl&#228;nder</snm>
                  <fnm>MR</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Adamidi</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Maaskola</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Einspanier</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Knespel</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Rajewsky</snm>
                  <fnm>N</fnm>
               </au>
            </aug>
            <source>Nat Biotechnol</source>
            <pubdate>2008</pubdate>
            <volume>26</volume>
            <issue>4</issue>
            <fpage>407</fpage>
            <lpage>415</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nbt1394</pubid>
                  <pubid idtype="pmpid" link="fulltext">18392026</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Genome-wide maps of chromatin state in pluripotent and lineage-committed cells</p>
            </title>
            <aug>
               <au>
                  <snm>Mikkelsen</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Ku</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Jaffe</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Issac</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Lieberman</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Giannoukos</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Alvarez</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Brockman</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Kim</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Koche</snm>
                  <fnm>R</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nature</source>
            <pubdate>2007</pubdate>
            <volume>448</volume>
            <issue>7153</issue>
            <fpage>553</fpage>
            <lpage>560</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature06008</pubid>
                  <pubid idtype="pmpid" link="fulltext">17603471</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>High-resolution profiling of histone methylations in the human genome</p>
            </title>
            <aug>
               <au>
                  <snm>Barski</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Cuddapah</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Cui</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Roh</snm>
                  <fnm>TY</fnm>
               </au>
               <au>
                  <snm>Schones</snm>
                  <fnm>DE</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Wei</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Chepelev</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Zhao</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>Cell</source>
            <pubdate>2007</pubdate>
            <volume>129</volume>
            <issue>4</issue>
            <fpage>823</fpage>
            <lpage>837</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.cell.2007.05.009</pubid>
                  <pubid idtype="pmpid" link="fulltext">17512414</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer</p>
            </title>
            <aug>
               <au>
                  <snm>Hernandez</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Fran&#231;ois</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Farinelli</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Oster&#229;s</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Schrenzel</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Genome Research</source>
            <pubdate>2008</pubdate>
            <volume>18</volume>
            <issue>5</issue>
            <fpage>802</fpage>
            <lpage>809</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1101/gr.072033.107</pubid>
                  <pubid idtype="pmpid" link="fulltext">18332092</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Genome sequencing in microfabricated high-density picolitre reactors</p>
            </title>
            <aug>
               <au>
                  <snm>Margulies</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Egholm</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Altman</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Attiya</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Bader</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Bemben</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Berka</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Braverman</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>Z</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nature</source>
            <pubdate>2005</pubdate>
            <volume>437</volume>
            <issue>7057</issue>
            <fpage>376</fpage>
            <lpage>380</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1464427</pubid>
                  <pubid idtype="pmpid" link="fulltext">16056220</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Base-calling of automated sequencer traces using phred. II. Error probabilities</p>
            </title>
            <aug>
               <au>
                  <snm>Ewing</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Green</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Genome Research</source>
            <pubdate>1998</pubdate>
            <volume>8</volume>
            <issue>3</issue>
            <fpage>186</fpage>
            <lpage>194</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9521922</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning</p>
            </title>
            <aug>
               <au>
                  <snm>Cokus</snm>
                  <fnm>SJ</fnm>
               </au>
               <au>
                  <snm>Feng</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Merriman</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Haudenschild</snm>
                  <fnm>CD</fnm>
               </au>
               <au>
                  <snm>Pradhan</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Nelson</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Pellegrini</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Jacobsen</snm>
                  <fnm>SE</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2008</pubdate>
            <volume>452</volume>
            <issue>7184</issue>
            <fpage>215</fpage>
            <lpage>219</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2377394</pubid>
                  <pubid idtype="pmpid" link="fulltext">18278030</pubid>
                  <pubid idtype="doi">10.1038/nature06745</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Alta-Cyclic: a self-optimizing base caller for next-generation sequencing</p>
            </title>
            <aug>
               <au>
                  <snm>Erlich</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Mitra</snm>
                  <fnm>PP</fnm>
               </au>
               <au>
                  <snm>Delabastide</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>McCombie</snm>
                  <fnm>WR</fnm>
               </au>
               <au>
                  <snm>Hannon</snm>
                  <fnm>GJ</fnm>
               </au>
            </aug>
            <source>Nat Methods</source>
            <pubdate>2008</pubdate>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">18604217</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Substantial biases in ultra-short read data sets from high-throughput DNA sequencing</p>
            </title>
            <aug>
               <au>
                  <snm>Dohm</snm>
                  <fnm>JC</fnm>
               </au>
               <au>
                  <snm>Lottaz</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Borodina</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Himmelbauer</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2008</pubdate>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2532726</pubid>
                  <pubid idtype="pmpid" link="fulltext">18660515</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Using quality scores and longer reads improves accuracy of Solexa read mapping</p>
            </title>
            <aug>
               <au>
                  <snm>Smith</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Xuan</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2008</pubdate>
            <volume>9</volume>
            <fpage>128</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2335322</pubid>
                  <pubid idtype="pmpid" link="fulltext">18307793</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-9-128</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>TileQC: a system for tile-based quality control of Solexa data</p>
            </title>
            <aug>
               <au>
                  <snm>Dolan</snm>
                  <fnm>PC</fnm>
               </au>
               <au>
                  <snm>Denver</snm>
                  <fnm>DR</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2008</pubdate>
            <volume>9</volume>
            <issue>1</issue>
            <fpage>250</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2443380</pubid>
                  <pubid idtype="pmpid" link="fulltext">18507856</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-9-250</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Base-stacking and base-pairing contributions into thermal stability of the DNA double helix</p>
            </title>
            <aug>
               <au>
                  <snm>Yakovchuk</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Protozanova</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Frank-Kamenetskii</snm>
                  <fnm>MD</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2006</pubdate>
            <volume>34</volume>
            <issue>2</issue>
            <fpage>564</fpage>
            <lpage>574</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1360284</pubid>
                  <pubid idtype="pmpid" link="fulltext">16449200</pubid>
                  <pubid idtype="doi">10.1093/nar/gkj454</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Robust locally weighted regression and smoothing scatterplots</p>
            </title>
            <aug>
               <au>
                  <snm>Cleveland</snm>
                  <fnm>WS</fnm>
               </au>
            </aug>
            <source>J Amer Statist Assoc</source>
            <pubdate>1979</pubdate>
            <volume>74</volume>
            <issue>368</issue>
            <fpage>829</fpage>
            <lpage>836</lpage>
            <xrefbib>
               <pubid idtype="doi">10.2307/2286407</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Model-based Gaussian and non-Gaussian clustering</p>
            </title>
            <aug>
               <au>
                  <snm>Banfield</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Raftery</snm>
                  <fnm>AE</fnm>
               </au>
            </aug>
            <source>Biometrics</source>
            <pubdate>1993</pubdate>
            <volume>49</volume>
            <issue>3</issue>
            <fpage>803</fpage>
            <lpage>821</lpage>
            <xrefbib>
               <pubid idtype="doi">10.2307/2532201</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>MCLUST: Software for model-based cluster analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Fraley</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Raftery</snm>
                  <fnm>AE</fnm>
               </au>
            </aug>
            <source>J Classification</source>
            <pubdate>1999</pubdate>
            <volume>16</volume>
            <issue>2</issue>
            <fpage>297</fpage>
            <lpage>306</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1007/s003579900058</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Model-based clustering, discriminant analysis, and density estimation</p>
            </title>
            <aug>
               <au>
                  <snm>Fraley</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Raftery</snm>
                  <fnm>AE</fnm>
               </au>
            </aug>
            <source>J Amer Statist Assoc</source>
            <pubdate>2002</pubdate>
            <volume>97</volume>
            <issue>458</issue>
            <fpage>611</fpage>
            <lpage>631</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1198/016214502760047131</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Enhanced model-based clustering, density estimation, and discriminant analysis software: MCLUST</p>
            </title>
            <aug>
               <au>
                  <snm>Fraley</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Raftery</snm>
                  <fnm>AE</fnm>
               </au>
            </aug>
            <source>J Classification</source>
            <pubdate>2003</pubdate>
            <volume>20</volume>
            <issue>2</issue>
            <fpage>263</fpage>
            <lpage>286</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1007/s00357-003-0015-3</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Elements of Information Theory</p>
            </title>
            <aug>
               <au>
                  <snm>Cover</snm>
                  <fnm>TM</fnm>
               </au>
               <au>
                  <snm>Thomas</snm>
                  <fnm>JA</fnm>
               </au>
            </aug>
            <publisher>John Wiley</publisher>
            <pubdate>1991</pubdate>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Indexing strategies for rapid searches of short words in genome sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Iseli</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Ambrosini</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Bucher</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Jongeneel</snm>
                  <fnm>CV</fnm>
               </au>
            </aug>
            <source>PLoS ONE</source>
            <pubdate>2007</pubdate>
            <volume>2</volume>
            <issue>6</issue>
            <fpage>e579</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1894650</pubid>
                  <pubid idtype="pmpid" link="fulltext">17593978</pubid>
                  <pubid idtype="doi">10.1371/journal.pone.0000579</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Optimal alignments in linear space</p>
            </title>
            <aug>
               <au>
                  <snm>Myers</snm>
                  <fnm>EW</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>Comput Appl Biosci</source>
            <pubdate>1988</pubdate>
            <volume>4</volume>
            <issue>1</issue>
            <fpage>11</fpage>
            <lpage>17</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid">3382986</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Using quality scores and longer reads improves accuracy of Solexa read mapping</p>
            </title>
            <aug>
               <au>
                  <snm>Smith</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Xuan</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2008</pubdate>
            <volume>9</volume>
            <issue>1</issue>
            <fpage>128</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2335322</pubid>
                  <pubid idtype="pmpid" link="fulltext">18307793</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-9-128</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Compressed representations of sequences and full-text indexes</p>
            </title>
            <aug>
               <au>
                  <snm>Ferragina</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Manzini</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>M&#228;kinen</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Navarro</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>ACM Transactions on Algorithms (TALG)</source>
            <pubdate>2007</pubdate>
            <volume>3</volume>
            <issue>2</issue>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Optimized design and assessment of whole genome tiling arrays</p>
            </title>
            <aug>
               <au>
                  <snm>Gr&#228;f</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Nielsen</snm>
                  <fnm>FG</fnm>
               </au>
               <au>
                  <snm>Kurtz</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Huynen</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Birney</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Stunnenberg</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Flicek</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>23</volume>
            <issue>13</issue>
            <fpage>i195</fpage>
            <lpage>i204</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btm200</pubid>
                  <pubid idtype="pmpid" link="fulltext">17646297</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Bioinformatics challenges of new sequencing technology</p>
            </title>
            <aug>
               <au>
                  <snm>Pop</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Salzberg</snm>
                  <fnm>SL</fnm>
               </au>
            </aug>
            <source>Trends Genet</source>
            <pubdate>2008</pubdate>
            <volume>24</volume>
            <issue>3</issue>
            <fpage>142</fpage>
            <lpage>149</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">18262676</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Whole-genome patterns of common DNA variation in three human populations</p>
            </title>
            <aug>
               <au>
                  <snm>Hinds</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Stuve</snm>
                  <fnm>LL</fnm>
               </au>
               <au>
                  <snm>Nilsen</snm>
                  <fnm>GB</fnm>
               </au>
               <au>
                  <snm>Halperin</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Eskin</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Ballinger</snm>
                  <fnm>DG</fnm>
               </au>
               <au>
                  <snm>Frazer</snm>
                  <fnm>KA</fnm>
               </au>
               <au>
                  <snm>Cox</snm>
                  <fnm>DR</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2005</pubdate>
            <volume>307</volume>
            <issue>5712</issue>
            <fpage>1072</fpage>
            <lpage>1079</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.1105436</pubid>
                  <pubid idtype="pmpid" link="fulltext">15718463</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
