
Open Access Methodology article

JACOP: A simple and robust method for the automated classification of protein sequences with modular architecture

Peter Sperisen1* and Marco Pagni2

Author Affiliations

1 Swiss Institute of Bioinformatics, Computational Cancer Genomics Group – ISREC, Ch. des Boveresses 155, 1066 Epalinges, Switzerland

2 Swiss Institute of Bioinformatics, Vital IT Group, BEP-UNIL, 1015 Lausanne, Switzerland


BMC Bioinformatics 2005, 6:216 doi:10.1186/1471-2105-6-216

Published: 31 August 2005



Whole-genome sequencing projects are rapidly producing an enormous number of new sequences. Consequently, almost every family of proteins now contains hundreds of members. It has thus become necessary to develop tools that classify protein sequences automatically, quickly and reliably. The difficulty of this task is intimately linked to the mechanisms by which protein sequences diverge, i.e. by simultaneous residue substitutions, insertions and/or deletions, and by whole-domain reorganisations (duplication, swapping, fusion).


Here we present a novel approach based on random sampling of sub-sequences (probes) from a set of input sequences. The probes are compared to the input sequences, after a normalisation step; the results are used to partition the input sequences into homogeneous groups of proteins. In addition, the method provides information on diagnostic parts of the proteins. The performance of the method is tested on two data sets. The first contains the sequences of prokaryotic lyases that could be arranged as a multiple sequence alignment. The second contains all proteins from Swiss-Prot Release 36 with at least one Src homology 2 (SH2) domain, a classical example of proteins with modular architecture.
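The probe-sampling idea can be illustrated with a minimal Python sketch. It is not the JACOP implementation: the function names, the fixed probe length, the uniform sampling scheme and the exact-substring comparison are all assumptions made for illustration. The abstract does not specify the comparison software or the normalisation step, so neither is reproduced here.

```python
import random


def sample_probes(sequences, n_probes, probe_len, seed=0):
    """Draw random sub-sequences (probes) from a dict of id -> sequence.

    Assumption: probes have a fixed length and are drawn uniformly;
    the abstract does not state how probes are actually sampled.
    """
    rng = random.Random(seed)
    # Only sequences at least as long as the probe can be sampled from.
    eligible = [(sid, seq) for sid, seq in sequences.items()
                if len(seq) >= probe_len]
    if not eligible:
        raise ValueError("no sequence is long enough for this probe length")
    probes = []
    for _ in range(n_probes):
        sid, seq = rng.choice(eligible)
        start = rng.randrange(len(seq) - probe_len + 1)
        probes.append((sid, start, seq[start:start + probe_len]))
    return probes


def hit_matrix(probes, sequences):
    """Toy comparison: 1 if a probe occurs verbatim in a sequence, else 0.

    Stand-in only: a real comparison would use an alignment tool and
    normalised scores rather than exact substring matching.
    """
    ids = sorted(sequences)
    rows = [[1 if probe in sequences[sid] else 0 for sid in ids]
            for _, _, probe in probes]
    return ids, rows


if __name__ == "__main__":
    # Hypothetical toy sequences, not taken from the paper's data sets.
    toy = {"seqA": "MKTAYIAKQRQISFVK", "seqB": "MKTAYIAKGGGGSFVK"}
    ids, rows = hit_matrix(sample_probes(toy, 5, 6, seed=42), toy)
    print(ids)
    for row in rows:
        print(row)
```

Each row of the resulting matrix describes which input sequences a probe matches. Sequences with similar columns across many probes are candidates for the same group, and probes that match only one group point to potentially diagnostic regions.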


The outcome of our method is robust and highly reproducible, as shown by bootstrap and resampling validation procedures. The results are essentially coherent with the biology. The method depends solely on well-established, publicly available software and algorithms.