Methodology article (Open Access)

Probabilistic base calling of Solexa sequencing data

Jacques Rougemont (1,3), Arnaud Amzallag (1,3), Christian Iseli (2,3), Laurent Farinelli (5), Ioannis Xenarios (3,4) and Felix Naef (1,3)*

Author Affiliations

1 School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland

2 Ludwig Institute for Cancer Research (LICR), Bâtiment Génopode, Université de Lausanne, 1015 Lausanne, Switzerland

3 Swiss Institute of Bioinformatics (SIB), Bâtiment Génopode, Université de Lausanne, 1015 Lausanne, Switzerland

4 Vital-IT, Bâtiment Génopode, Université de Lausanne, 1015 Lausanne, Switzerland

5 Fasteris SA, P.O. box 28, 1228 Plan-les-Ouates, Switzerland

BMC Bioinformatics 2008, 9:431  doi:10.1186/1471-2105-9-431

Published: 13 October 2008



Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data pose new challenges. Currently, a fair proportion of the tags are routinely discarded because they cannot be matched to a reference sequence, which reduces the effective throughput of the technology.


We propose a novel base calling algorithm using model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content to remove uncertain bases towards the ends of the reads.
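The two ideas in this paragraph can be illustrated with a minimal Python sketch. It is not the authors' implementation (their method is distributed as an R package and derives base probabilities from model-based clustering of fluorescence intensities). Here the per-cycle base probabilities are simply assumed as input. The function names, the 0.9 probability-mass cutoff, the "2 bits minus entropy" information measure and the per-base penalty are all illustrative assumptions.

```python
import math

# IUPAC nucleotide codes, keyed by the alphabetically sorted set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}

def call_base(probs, mass=0.9):
    """Return the IUPAC symbol for the smallest set of most probable
    bases whose total probability reaches `mass` (assumed cutoff)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for base, p in ranked:
        chosen.append(base)
        total += p
        if total >= mass:
            break
    return IUPAC["".join(sorted(chosen))]

def information(probs):
    """Information content in bits: 2 minus the Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return 2.0 - entropy

def best_subtag(read_probs, penalty=1.0):
    """Return (start, end) of the contiguous sub-tag that maximises the
    summed (information - penalty); `penalty` is a hypothetical
    per-base cost. This is a maximum-subarray (Kadane) scan."""
    best_score, best_span = 0.0, (0, 0)
    score, start = 0.0, 0
    for i, probs in enumerate(read_probs):
        score += information(probs) - penalty
        if score <= 0:
            score, start = 0.0, i + 1
        elif score > best_score:
            best_score, best_span = score, (start, i + 1)
    return best_span

# Toy read: two confident cycles, one A/G ambiguity, one uninformative cycle.
confident = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
ambiguous = {"A": 0.50, "C": 0.03, "G": 0.45, "T": 0.02}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
read = [confident, confident, ambiguous, uniform]

called = "".join(call_base(p) for p in read)
start, end = best_subtag(read)
print(called, called[start:end])
```

With these assumed settings, low-information cycles at the end of a read lower the running score, so the selected sub-tag stops before them. The ambiguous positions that remain are kept as IUPAC symbols rather than forced to a single base.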


We show that the method improves genome coverage and the number of usable tags by an average of 15% compared with Solexa's data processing pipeline. An R package is provided that allows fast and accurate base calling of Solexa's fluorescence intensity files and the production of informative diagnostic plots.