Open Access Methodology article

Universal sequence map (USM) of arbitrary discrete sequences

Jonas S Almeida1,2 and Susana Vinga2

1Dept Biometry & Epidemiology, Medical Univ South Carolina, 135 Cannon street, Suite 303, PO Box 250835, Charleston SC 29425, USA

2Inst. Tecnologia Química e Biológica Univ. Nova Lisboa, Av. da República (EAN), PO Box 127, 2781-901 Oeiras, Portugal

BMC Bioinformatics 2002, 3:6 doi:10.1186/1471-2105-3-6

Published: 5 February 2002

Abstract

Background

For over a decade, the idea of representing biological sequences in a continuous coordinate space has maintained its appeal but has not been fully realized. The basic idea is that any sequence of symbols may define a trajectory in a continuous space that conserves all of the sequence's statistical properties. Ideally, such a representation would allow scale-independent sequence analysis, that is, analysis that does not assume a fixed memory length. A simple example would consist of being able to infer the homology between two sequences solely by comparing the coordinates of any two homologous units.

Results

We have successfully identified such an iterative function for the bijective mapping ψ of discrete sequences into objects of a continuous state space that enable scale-independent sequence analysis. The technique, named Universal Sequence Mapping (USM), is applicable to sequences of arbitrary length with an arbitrary number of unique units, and generates a representation in which map distance estimates sequence similarity. The novel USM procedure is based on earlier work by these and other authors on the properties of Chaos Game Representation (CGR). The latter enables the representation of sequences with 4 unit types (like DNA) as an order-free Markov chain transition table. The properties of USM are illustrated with test data and can be verified for other data by using the accompanying web-based tool: http://bioinformatics.musc.edu/~jonas/usm/.
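The abstract does not spell out the iterative function, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a CGR-style iteration in which each unit moves the current point halfway toward a corner of the unit hypercube, and it assumes that an alphabet of arbitrary size is handled by giving each unique unit a distinct binary corner code. The function names (`corner_codes`, `iterated_map`, `similarity_estimate`) and the choice of the hypercube centre as the starting point are hypothetical choices made for illustration.

```python
from math import ceil, inf, log2


def corner_codes(units):
    """Assign each unique unit a distinct binary corner of the unit hypercube.

    Assumption: ceil(log2(number of unique units)) dimensions are used.
    """
    symbols = sorted(set(units))
    n_dims = max(1, ceil(log2(len(symbols))))
    return {
        s: tuple((i >> bit) & 1 for bit in range(n_dims))
        for i, s in enumerate(symbols)
    }


def iterated_map(sequence, codes):
    """Return the trajectory of a CGR-style half-step iteration.

    Starting from the centre of the hypercube, each unit moves the point
    halfway toward that unit's corner.
    """
    n_dims = len(next(iter(codes.values())))
    point = (0.5,) * n_dims
    trajectory = []
    for unit in sequence:
        corner = codes[unit]
        point = tuple(p + 0.5 * (c - p) for p, c in zip(point, corner))
        trajectory.append(point)
    return trajectory


def similarity_estimate(p, q):
    """Estimate the length of the shared preceding context of two positions.

    Each shared unit halves the coordinate difference, so the largest
    difference d satisfies d <= 2**-k after k shared units, and -log2(d)
    is at least k (it may overestimate).
    """
    d = max(abs(a - b) for a, b in zip(p, q))
    return inf if d == 0 else -log2(d)


codes = corner_codes("ACGT")
end_a = iterated_map("ACGTACGT", codes)[-1]
end_b = iterated_map("TTGTACGT", codes)[-1]
estimate = similarity_estimate(end_a, end_b)  # both sequences end in "GTACGT"
```

In this sketch the two example sequences share their last six units, and the distance between their final map positions yields an estimate of at least six. This illustrates how map distance can stand in for sequence similarity without fixing a memory length in advance.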

Conclusions

USM is shown to enable a statistical mechanics approach to sequence analysis. The scale-independent representation frees sequence analysis from the need to assume a memory length when investigating syntactic rules.

