This article is part of the supplement: Proceedings of the Fifth Annual MCBIOS Conference. Systems Biology: Bridging the Omics
Clustering ionic flow blockade toggles with a Mixture of HMMs
1 The Research Institute for Children, 200 Henry Clay Ave., New Orleans, LA 70118, USA
2 Department of Computer Science, University of New Orleans, New Orleans, LA, 70148, USA
BMC Bioinformatics 2008, 9(Suppl 9):S13 doi:10.1186/1471-2105-9-S9-S13. Published: 12 August 2008
Ionic current blockade signal processing, for use in nanopore detection, offers a promising new way to analyze single-molecule properties, with potential implications for DNA sequencing. The α-Hemolysin transmembrane channel interacts with a translocating molecule in a nontrivial way, frequently evidenced by a complex ionic flow blockade pattern with readily distinguishable modes of toggling. Effective processing of such signals requires developing machine learning methods capable of learning the various blockade modes for classification and knowledge discovery purposes. Here we propose a method aimed at improving our stochastic analysis capabilities, to better understand the discriminatory power of the observed nanopore channel interactions with analytes.
We tailored our memory-sparse distributed implementation of a Mixture of Hidden Markov Models (MHMM) to the problem of channel current blockade clustering and associated analyte classification. By using probabilistic, fully connected HMM profiles as mixture components, we were able to cluster the channel blockades of the various 9 base-pair hairpins. We obtained very high Maximum a Posteriori (MAP) classification accuracy with a mixture of 12 different channel blockade profiles, each with 4 levels, a configuration that can be computed quickly enough for real-time experimental feedback. MAP classification performance depends on several factors, such as the number of mixture components, the number of levels in each profile, and the duration of a channel blockade event. We distribute the Baum-Welch Expectation Maximization (EM) algorithm running on our model in two ways. First, a distributed implementation of the MHMM data processing accelerates data clustering efforts. The second, simultaneous, strategy uses an EM checkpointing algorithm to lower memory use and efficiently distribute the bulk of EM processing when handling large data sequences (such as for the progressive sums used in the HMM parameter estimates).
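The MAP classification step described above can be sketched in a few lines of Java. This is an illustrative sketch under simplifying assumptions, not the authors' implementation: it runs a log-space forward algorithm over discrete observations for each mixture component and picks the component maximizing log prior plus log likelihood. All class, method, and parameter names here are hypothetical.

```java
// Illustrative sketch (not the paper's code): MAP classification of a
// discrete observation sequence under a mixture of HMMs.
public class MhmmMapSketch {

    // Log-space forward algorithm: returns log P(obs | pi, A, B).
    static double logLikelihood(int[] obs, double[] pi, double[][] a, double[][] b) {
        int n = pi.length;
        double[] logAlpha = new double[n];
        for (int i = 0; i < n; i++)
            logAlpha[i] = Math.log(pi[i]) + Math.log(b[i][obs[0]]);
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                // log-sum-exp over predecessor states to avoid underflow
                double max = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < n; i++)
                    max = Math.max(max, logAlpha[i] + Math.log(a[i][j]));
                double sum = 0.0;
                for (int i = 0; i < n; i++)
                    sum += Math.exp(logAlpha[i] + Math.log(a[i][j]) - max);
                next[j] = max + Math.log(sum) + Math.log(b[j][obs[t]]);
            }
            logAlpha = next;
        }
        double max = Double.NEGATIVE_INFINITY;
        for (double v : logAlpha) max = Math.max(max, v);
        double sum = 0.0;
        for (double v : logAlpha) sum += Math.exp(v - max);
        return max + Math.log(sum);
    }

    // MAP component: argmax_k [ log w_k + log P(obs | HMM_k) ].
    static int mapComponent(int[] obs, double[] weights,
                            double[][] pis, double[][][] as, double[][][] bs) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < weights.length; k++) {
            double score = Math.log(weights[k])
                         + logLikelihood(obs, pis[k], as[k], bs[k]);
            if (score > bestScore) { bestScore = score; best = k; }
        }
        return best;
    }
}
```

In the paper's setting, each mixture component would be a 4-level blockade profile and the observations quantized channel current samples; the same argmax structure applies.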
The proposed distributed MHMM method has many appealing properties, such as precise classification of analytes in real-time scenarios and the ability to incorporate new domain knowledge into a flexible, easily distributable architecture. The distributed HMM provides feature extraction equivalent to that of the sequential HMM, with a speedup factor approximately equal to the number of independent CPUs operating on the data. The MHMM topology learns clusters existing within data samples via distributed HMM EM learning. A Java implementation of the MHMM algorithm is available at http://logos.cs.uno.edu/~achurban.
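The memory-saving idea behind the EM checkpointing strategy can be illustrated with the forward variables alone. The sketch below is an assumption-laden simplification, not the authors' code: forward vectors are stored only every c steps, and a block of at most c vectors is recomputed on demand during the backward sweep, cutting memory from O(Tn) to O((T/c + c)n), which is minimized at c ≈ √T. Numerical scaling is omitted for brevity.

```java
// Hypothetical helper illustrating EM checkpointing for the forward pass:
// keep alpha only at checkpoints, recompute intermediate alphas on demand.
public class CheckpointForward {

    // One (unscaled) forward step: next[j] = (sum_i alpha[i] * a[i][j]) * b[j][sym].
    static double[] step(double[] alpha, double[][] a, double[][] b, int sym) {
        int n = alpha.length;
        double[] next = new double[n];
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int i = 0; i < n; i++) s += alpha[i] * a[i][j];
            next[j] = s * b[j][sym];
        }
        return next;
    }

    // Forward sweep storing alpha only at t = 0, c, 2c, ...
    static double[][] checkpoints(int[] obs, double[] pi,
                                  double[][] a, double[][] b, int c) {
        int bigT = obs.length;
        double[][] cps = new double[(bigT - 1) / c + 1][];
        double[] alpha = new double[pi.length];
        for (int i = 0; i < pi.length; i++) alpha[i] = pi[i] * b[i][obs[0]];
        cps[0] = alpha.clone();
        for (int t = 1; t < bigT; t++) {
            alpha = step(alpha, a, b, obs[t]);
            if (t % c == 0) cps[t / c] = alpha.clone();
        }
        return cps;
    }

    // Recover alpha_t by replaying at most c steps from the nearest checkpoint,
    // as the backward sweep of EM would do when it needs alpha_t.
    static double[] alphaAt(int t, double[][] cps, int[] obs,
                            double[][] a, double[][] b, int c) {
        double[] alpha = cps[t / c].clone();
        for (int s = (t / c) * c + 1; s <= t; s++)
            alpha = step(alpha, a, b, obs[s]);
        return alpha;
    }
}
```

The same pattern applies to the progressive sums in the HMM parameter estimates: each block's contribution is accumulated as it is recomputed, so no full-length table of forward variables ever resides in memory.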