Open Access Research article

A model-based information sharing protocol for profile Hidden Markov Models used for HIV-1 recombination detection

Ingo Bulla1,2*, Anne-Kathrin Schultz3, Christophe Chesneau4, Tanya Mark5 and Florin Serea6

Author Affiliations

1 Institut für Mathematik und Informatik, Universität Greifswald, Walther-Rathenau-Straße 47, 17487 Greifswald, Germany

2 Theoretical Biology and Biophysics, Group T-6, Los Alamos National Laboratory, Los Alamos, New Mexico, USA

3 Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany

4 Université de Caen, LMNO, CNRS UMR 6139, Bd Maréchal Juin, BP 5186, 14032 Caen Cedex, France

5 University of Guelph, 50 Stone Road, Guelph, Ontario N1G 2W1, Canada

6 Technical University Gheorghe Asachi, Faculty of Electrical Engineering, Power Engineering and Applied Informatics, Bld. Dimitrie Mangeron 23, 700050 Iasi, Romania


BMC Bioinformatics 2014, 15:205  doi:10.1186/1471-2105-15-205

Published: 19 June 2014



In many applications, a family of nucleotide or protein sequences that is classified into several subfamilies has to be modeled. Profile Hidden Markov Models (pHMMs) are widely used for this task, with each subfamily modeled separately by one pHMM. A major drawback of this approach is the difficulty of dealing with subfamilies composed of very few sequences. One of the most important bioinformatics tasks affected by the problem of small subfamilies is the subtyping of human immunodeficiency virus type 1 (HIV-1) sequences, since for some HIV-1 subtypes only a small number of sequences is known.


To deal with small samples for particular subfamilies of HIV-1, we introduce a novel model-based information sharing protocol. It estimates the emission probabilities of the pHMM modeling a particular subfamily based not only on the nucleotide frequencies of that subfamily but also on the nucleotide frequencies of all available subfamilies. To this end, the underlying probabilistic model mimics the pattern of commonality and variation between the subtypes with regard to the biological characteristics of HIV. To implement the proposed protocol, we use an existing HMM architecture and its associated inference engine.
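The general idea of borrowing information across subfamilies can be illustrated with a minimal sketch. It is not the probabilistic model proposed in this article. It assumes a much simpler Dirichlet-style pseudocount scheme in which the pooled nucleotide frequencies of all subfamilies at one alignment column act as a prior for the emission probabilities of a single subfamily. The function names and the `prior_weight` parameter are hypothetical and chosen for illustration only.

```python
from typing import Dict, List

NUCLEOTIDES = "ACGT"


def column_counts(column: str) -> List[float]:
    """Count A, C, G, T in one alignment column (other symbols are ignored)."""
    return [float(column.count(n)) for n in NUCLEOTIDES]


def shared_emissions(
    columns_by_subtype: Dict[str, str],
    target: str,
    prior_weight: float = 4.0,  # hypothetical tuning parameter
) -> List[float]:
    """Estimate emission probabilities for `target` at one column.

    Assumption: the pooled frequencies over all subtypes serve as a prior
    that is worth `prior_weight` pseudo-observations.
    """
    # Pool the counts of every subtype at this column.
    pooled = [0.0] * len(NUCLEOTIDES)
    for column in columns_by_subtype.values():
        for i, count in enumerate(column_counts(column)):
            pooled[i] += count

    # Add-one smoothing so that no pooled frequency is exactly zero.
    pooled_total = sum(pooled) + len(NUCLEOTIDES)
    pooled_freq = [(count + 1.0) / pooled_total for count in pooled]

    # Combine the subtype's own counts with the weighted pooled prior.
    own = column_counts(columns_by_subtype[target])
    denominator = sum(own) + prior_weight
    return [
        (own[i] + prior_weight * pooled_freq[i]) / denominator
        for i in range(len(NUCLEOTIDES))
    ]


# One alignment column; subtype "J" is represented by a single sequence.
columns = {"A": "AAAAAAAAAA", "B": "AAAAAAAAGG", "J": "G"}
probs_j = shared_emissions(columns, "J")
print(dict(zip(NUCLEOTIDES, probs_j)))
```

With a single observed sequence, the estimate for subtype "J" is dominated by the pooled frequencies. As more sequences of a subtype become available, its own counts outweigh the prior, which is the qualitative behavior one would expect from any information sharing scheme of this kind.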


We apply the modified algorithm to classify HIV-1 sequence data in the form of partial HIV-1 sequences and semi-artificial recombinants. We thereby demonstrate that the proposed technique can significantly improve the performance of pHMMs. Moreover, we show that our algorithm performs significantly better than Simplot and Bootscanning.