MS4 - Multi-Scale Selector of Sequence Signatures: An alignment-free method for classification of biological sequences
* Corresponding author: Claudine Devauchelle devauchelle@genopole.cnrs.fr
1 Georg-August-Universität, Institut für Mikrobiologie und Genetik, Goldschmidtstraße 1, 37077 Göttingen, Germany
2 Partner Institute for Computational Biology, CAS-MPG, 320 Yue Yang Rd, 200031 Shanghai, China
3 Laboratoire Statistique et Génome (LSG), CNRS UMR 8071, INRA 1152, Université d'Evry, Tour Evry2, Place des Terrasses, 91034 Evry Cedex, France
4 Institut de Mathématiques de Luminy, UMR 6206, Luminy, Marseille, France
BMC Bioinformatics 2010, 11:406 doi:10.1186/1471-2105-11-406
Published: 30 July 2010
Additional files
Additional file 1:
Network for Compendium2000 sequences. Network for the 46 Compendium2000 sequences computed by SplitsTree4 on our MS4 dissimilarity matrix with κ = 1 (from N = 2 to N = 60).
Format: PNG Size: 64KB
Additional file 2:
Network for gag sequences. Network for the 70 gag sequences computed by SplitsTree4 on the MS4 dissimilarity matrix with κ = 1 (Nmax = 510).
Format: PNG Size: 56KB
Additional file 3:
Network for the pol sequences. Network for the 66 pol sequences computed by SplitsTree4 on the MS4 dissimilarity matrix with κ = 1 (Nmax = 962).
Format: PNG Size: 59KB
Additional file 4:
Network for env sequences. Network for the 66 env sequences computed by SplitsTree4 on the MS4 dissimilarity matrix with κ = 1 (Nmax = 794).
Format: PNG Size: 72KB
Additional file 5:
Network for LTR sequences obtained with NLD. The SplitsTree4 network for non-coding LTR sequences computed with the NLD method for a fixed word length of N = 11. The NLD method is described in [10]; it uses a similar similarity index, but with a fixed word length. In [10] we used Neighbor Joining instead of Splits Networks.
Format: PNG Size: 67KB
Additional file 6:
SplitsTree network for κ = 5 for LTR sequences. Network for the 43 non-coding parts of HIV LTR sequences computed by SplitsTree4 on the MS4 dissimilarity matrix with κ = 5 (N from 2 to 100).
Format: PNG Size: 17KB
Additional file 7:
SplitsTree network for κ = 10 for LTR sequences. Network for the 43 non-coding parts of HIV LTR sequences computed by SplitsTree4 on the MS4 dissimilarity matrix with κ = 10 (N from 2 to 100).
Format: PNG Size: 17KB
Additional file 8:
Similarity blocks found by MS4 in non-coding LTR sequences. Superposition of MS4 classes on a manually curated alignment of the non-coding part of 43 HIV-SIV LTR sequences, focused on the NFκB region. This is a nucleotide sequence alignment of the 43 non-coding LTR sequences. Apart from minor modifications, the alignment is the same as that in Fig. 5 of [10]. It is focused on the transcription factor NFκB binding site (GGGACTTTCC[A|G]) and its flanking regions. Sequence names are given with their accession numbers in the Los Alamos HIV sequence database. The sequences are grouped according to their phylogeny. The letters are rewritten by applying the MS4 method to the whole non-coding LTR sequences. The MS4 identifier is constructed as follows: e.g. C24_8 (class C24 for an N value of 8). Identical recoded letters that lie in the same column are displayed in the same colour. No colour is used when they are not all aligned in the same column, or when they are unique in this part of the alignment. Motifs repeated within one sequence are placed one under the other; the sequences are therefore often written over several lines to highlight similarities between and within sequences. In most cases the similarity blocks are aligned, and the great majority of identically indexed letters fall in a single column.
Format: XLS Size: 54KB
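The binding-site pattern quoted in the description of Additional file 8, GGGACTTTCC[A|G], can be located in a raw nucleotide sequence with a short regular-expression scan. The following is only a minimal sketch and is not part of the MS4 distribution: the function name `find_nfkb_sites` and the toy sequence are hypothetical, and the pattern is read here as GGGACTTTCC followed by either A or G.

```python
import re

# Reading of the motif quoted in the text: GGGACTTTCC followed by A or G.
# The lookahead lets overlapping occurrences be reported as well.
NFKB_SITE = re.compile(r"(?=(GGGACTTTCC[AG]))")


def find_nfkb_sites(sequence):
    """Return the 0-based start position of every match in `sequence`."""
    return [m.start() for m in NFKB_SITE.finditer(sequence.upper())]


# Toy sequence invented for illustration only (not taken from the dataset).
toy = "ttGGGACTTTCCAggcGGGACTTTCCGct"
print(find_nfkb_sites(toy))  # prints [2, 16]
```

Such a scan only reports exact occurrences of the motif; it does not reproduce the MS4 classes or the alignment shown in the file.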
Additional file 9:
Region of the NFκB binding site. The complete alignment, part of which is featured in Fig. 4. This figure corresponds to the one in Additional file 8 and uses the same colours, but the MS4 identifier has been simplified: only the letter and the value of N are indicated. As a result, two different MS4 classes that lie in the same column with the same letter and the same N value are distinguished only by their colour (e.g. A18, and also T18, in HIV-1-M/G, which appear in red or green).
Format: PDF Size: 32KB
Additional file 10:
Network for the nef sequences. Network for the 66 nef nucleotide sequences computed by SplitsTree4 on the MS4 dissimilarity matrix with κ = 1 (Nmax = 543).
Format: PNG Size: 62KB
Additional file 11:
Network for the Nef protein sequences. Network for the 66 Nef protein sequences computed on the MS4 dissimilarity matrix with κ = 1 (from N = 2 to N = 100).
Format: PNG Size: 452KB
Additional file 12:
Python source code. Python implementation of the MS4 algorithm for Linux systems. See the INSTALL and README files for usage instructions.
Format: GZ Size: 75KB
