Optimal spliced alignments of short sequence reads

De Bona, Fabio; Ossowski, Stephan; Schneeberger, Korbinian; Rätsch, Gunnar

doi:10.1186/1471-2105-9-S10-O7

Volume 9 Supplement 10

Highlights from the Fourth International Society for Computational Biology (ISCB) Student Council Symposium

Oral presentation
Open access
Published: 30 October 2008

BMC Bioinformatics volume 9, Article number: O7 (2008)

Motivation

Next generation sequencing technologies open exciting new possibilities for genome and transcriptome sequencing. While reads produced by these technologies are relatively short and error-prone compared to those from the Sanger method, their throughput is several orders of magnitude higher. We present a novel approach, called QPALMA, for computing accurate spliced alignments of short sequence reads that takes advantage of the reads' quality information as well as computational splice site predictions. In computational experiments we illustrate that both the quality information and the splice site predictions [1] considerably improve the alignment quality. Our algorithms were optimized and tested using artificially spliced genomic reads produced with the Illumina Genome Analyzer for the model plant Arabidopsis thaliana.

Methods

In this work we aim to develop a method that exploits all available information to accurately align as many spliced reads as possible to the genome. In previous work we proposed a method, PALMA [2], that takes advantage of splice site predictions and an intron length model. We extend this method to benefit from the reads' quality scores. The algorithm is based on extensions of the Smith-Waterman algorithm using more sophisticated parametrized scoring functions. The idea is to tune the parameters of the scoring functions such that the true alignment not only achieves a large score but also scores higher than all other alignments [3].
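To make the idea of a quality-dependent scoring function concrete, the following minimal Python sketch runs the standard Smith-Waterman recurrence with a substitution score that depends on the base quality. It is an illustration under stated assumptions, not the QPALMA implementation: the function names, the linear quality ramp, the fixed gap penalty and the 0 to 40 quality range are all hypothetical, the parameters are set by hand rather than trained, and splice sites and introns are left out entirely.

```python
from typing import Callable, Sequence


def quality_score(read_base: str, genome_base: str, quality: int) -> float:
    """Hypothetical scoring function that depends on the base quality.

    A trained model would learn this function; here it is a fixed linear ramp.
    """
    weight = min(max(quality, 0), 40) / 40.0  # clamp quality, scale to [0, 1]
    if read_base == genome_base:
        return 1.0 + weight           # confident matches are rewarded more
    return -(0.5 + 1.5 * weight)      # confident mismatches are penalized more


def local_alignment_score(
    read: str,
    qualities: Sequence[int],
    genome: str,
    gap_penalty: float = 2.0,
    score: Callable[[str, str, int], float] = quality_score,
) -> float:
    """Best local alignment score (Smith-Waterman) with quality-aware scoring."""
    if len(read) != len(qualities):
        raise ValueError("one quality value per read base is required")
    previous = [0.0] * (len(genome) + 1)
    best = 0.0
    for i in range(1, len(read) + 1):
        current = [0.0] * (len(genome) + 1)
        for j in range(1, len(genome) + 1):
            diagonal = previous[j - 1] + score(
                read[i - 1], genome[j - 1], qualities[i - 1]
            )
            up = previous[j] - gap_penalty        # read base against a gap
            left = current[j - 1] - gap_penalty   # genome base against a gap
            current[j] = max(0.0, diagonal, up, left)
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    # Same mismatch (G vs C), once at a low-quality and once at a high-quality base.
    print(local_alignment_score("ACGT", [40, 40, 0, 40], "ACCT"))
    print(local_alignment_score("ACGT", [40, 40, 40, 40], "ACCT"))
```

With these hand-set values, a mismatch at a low-quality base costs less than the same mismatch at a high-quality base, which is the kind of behaviour a learned scoring function could capture.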

In a typical application scenario one needs to align millions of short reads against the genome, and applying the extended Smith-Waterman algorithm directly to every read is not feasible. We therefore propose to combine our method with a fast suffix-array-based approach that identifies a seed for the alignment. This combined strategy allows us to efficiently align even very large numbers of reads (cf. Figure 1).
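The seeding step can be sketched as follows, as a rough illustration rather than the pipeline used in this work. The sketch builds a naive suffix array over a toy genome, looks up exact occurrences of a read prefix by binary search, and returns genomic windows that a more expensive alignment could then examine. The seed choice (an exact match of the first few read bases), the function names and the `flank` parameter are assumptions made for this example; a real spliced aligner would need windows large enough to span introns.

```python
from typing import List, Tuple


def build_suffix_array(genome: str) -> List[int]:
    """Naive construction by sorting suffixes; suitable for toy inputs only."""
    return sorted(range(len(genome)), key=lambda start: genome[start:])


def find_seed_hits(genome: str, suffix_array: List[int], seed: str) -> List[int]:
    """Return the sorted genome positions where `seed` occurs exactly."""
    k = len(seed)
    lo, hi = 0, len(suffix_array)
    # Binary search for the first suffix whose length-k prefix is >= seed.
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if genome[start:start + k] < seed:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # All suffixes that start with the seed are adjacent in the suffix array.
    while lo < len(suffix_array):
        start = suffix_array[lo]
        if genome[start:start + k] != seed:
            break
        hits.append(start)
        lo += 1
    return sorted(hits)


def candidate_windows(
    genome: str,
    suffix_array: List[int],
    read: str,
    seed_length: int = 4,
    flank: int = 10,
) -> List[Tuple[int, int]]:
    """Half-open genome windows around each seed hit (hypothetical helper)."""
    seed = read[:seed_length]
    windows = []
    for hit in find_seed_hits(genome, suffix_array, seed):
        start = max(0, hit - flank)
        end = min(len(genome), hit + len(read) + flank)
        windows.append((start, end))
    return windows


if __name__ == "__main__":
    genome = "GATTACAGATTACA"
    suffix_array = build_suffix_array(genome)
    print(find_seed_hits(genome, suffix_array, "TTAC"))
    print(candidate_windows(genome, suffix_array, "TTACAG", flank=2))
```

Restricting the expensive alignment to these windows is what keeps the overall cost manageable: the suffix array is built once, and each read then needs only a logarithmic-time lookup before the dynamic programming step.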

Results

We first studied the accuracy of aligning 30,000 spliced sequences using different variants of QPALMA: with and without quality information, splice site predictions, and intron length information. From the results given in Table 2 we conclude that all three components help to reduce the alignment error rate. We also tested the proposed pipeline on about 3 million short reads, about 10% of which were spliced. The alignment took 12.5 h on one CPU, and almost all of the reads (98.4%) were aligned correctly. This illustrates that the approach is not only accurate but also fast enough to be used in a next generation mRNA sequencing project.

Conclusion

We have presented a novel approach to the difficult task of aligning short reads, as generated by next generation sequencing techniques, across exon boundaries. We successfully exploited all available information sources (the read including its quality score information, splice site predictions, the intron length and, of course, the genome), each contributing significantly to a lower alignment error rate.

As future work, it would be interesting to consider the downstream analysis of deriving the gene structure based on these reads and to estimate its error as well. This will be particularly interesting for predicting gene structures with alternative transcripts.

References

1. Sonnenburg S, Schweikert G, Philips P, Behr J, Rätsch G: Accurate Splice Site Prediction Using Support Vector Machines. BMC Bioinformatics 2007, 8(Suppl 10):S7. 10.1186/1471-2105-8-S10-S7
2. Schulze U, Ong C, Hepp B, Rätsch G: PALMA: mRNA to genome alignments using large margin algorithms. Bioinformatics 2007, 23(15):1892–1900. 10.1093/bioinformatics/btm275
3. Tsochantaridis I, Hofmann T, Joachims T, Altun Y: Support Vector Machine Learning for Interdependent and Structured Output Spaces. Proceedings of the 16th International Conference on Machine Learning 2004.

Author information

Authors and Affiliations

Friedrich Miescher Laboratory, Max Planck Society, Spemannstr. 39, 72076, Tübingen, Germany
Fabio De Bona & Gunnar Rätsch
Max Planck Institute for Developmental Biology, Spemannstr. 35, 72076, Tübingen, Germany
Stephan Ossowski & Korbinian Schneeberger

Corresponding author

Correspondence to Fabio De Bona.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

De Bona, F., Ossowski, S., Schneeberger, K. et al. Optimal spliced alignments of short sequence reads. BMC Bioinformatics 9 (Suppl 10), O7 (2008). https://doi.org/10.1186/1471-2105-9-S10-O7
