Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

Open Access Highly Accessed Software

zipHMMlib: a highly optimised HMM library exploiting repetitions in the input to speed up the forward algorithm

Andreas Sand12*, Martin Kristiansen2, Christian NS Pedersen12 and Thomas Mailund1

Author Affiliations

1 Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark

2 Department of Computer Science, Aarhus University, Aarhus, Denmark

For all author emails, please log on.

BMC Bioinformatics 2013, 14:339  doi:10.1186/1471-2105-14-339


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/14/339


Received:18 June 2013
Accepted:14 November 2013
Published:22 November 2013

© 2013 Sand et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Hidden Markov models are widely used for genome analysis as they combine ease of modelling with efficient analysis algorithms. Calculating the likelihood of a model using the forward algorithm has worst case time complexity linear in the length of the sequence and quadratic in the number of states in the model. For genome analysis, however, the length runs to millions or billions of observations, and when maximising the likelihood hundreds of evaluations are often needed. A time efficient forward algorithm is therefore a key ingredient in an efficient hidden Markov model library.

Results

We have built a software library for efficiently computing the likelihood of a hidden Markov model. The library exploits commonly occurring substrings in the input to reuse computations in the forward algorithm. In a pre-processing step our library identifies common substrings and builds a structure over the computations in the forward algorithm which can be reused. This analysis can be saved between uses of the library and is independent of concrete hidden Markov models so one preprocessing can be used to run a number of different models.

Using this library, we achieve up to 78 times shorter wall-clock time for realistic whole-genome analyses with a real and reasonably complex hidden Markov model. In one particular case the analysis was performed in less than 8 minutes compared to 9.6 hours for the previously fastest library.

Conclusions

We have implemented the preprocessing procedure and forward algorithm as a C++ library, zipHMM, with Python bindings for use in scripts. The library is available at http://birc.au.dk/software/ziphmm/ webcite.

Background

Hidden Markov models (HMMs) are a class of statistical models for sequential data with an underlying hidden structure. They were first introduced to bioinformatics in the late 1980s [1] and have since then been used in a wide variety of applications, for example for gene finding [2], modelling of protein structures [3,4], sequence alignment [5] and phylogenetic analysis [6-9]. Because of their computational efficiency HMMs are one of the few widely used statistical methodologies that are feasible for genome wide analysis where sequences often are several hundred million characters long. With data sets of this size, however, analysis time is still often measured in hours and days. Improving on the performance of HMM analysis is therefore important to keep up with the quickly growing amount of biological sequence data to be analysed, and to make more complex analyses feasible.

The most time consuming part of using hidden Markov models is often parameter fitting, since the likelihood of a model needs to be computed repeatedly when optimising the parameters. Depending on the optimisation strategy, this means that the forward algorithm (or both the forward and the backward algorithm) will be evaluated in potentially hundreds of points in parameter space. Optimising the forward algorithm is therefore the most effective strategy for efficient HMM implementations.

The forward algorithm can be seen as a sequence of vector-matrix operations along an input sequence. This, however, can be rewritten as a sequence of matrix-matrix operations. This replaces an <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M1">View MathML</a> time vector-matrix operation with an <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M2">View MathML</a> time matrix-matrix operation but opens up possibilities for changing the evaluation order since different parts of the computation now can be handled independently. This then makes it possible to reuse computations whenever the input contains repeated substrings [10] or to parallelise the algorithm across a number of independent threads [11].

The main contribution of this paper is a software library that uses both of these ideas to greatly speed up the forward algorithm. We present a preprocessing of the observed sequence that finds common substrings and constructs a data structure that makes the evaluation of the likelihood close to two orders of magnitude faster (not including the preprocessing time). The preprocessing of a specific sequence can be saved and later reused in the analysis of a different HMM topology. The algorithms have been implemented in a C++ library, zipHMMlib, available at http://birc.au.dk/software/ziphmm/ webcite. The library also provides an interface to the Python programming language.

Much of the theory used in zipHMMlib was also developed by Lifshits et al. [10], but while they developed the theory in the context of the Viterbi algorithm, where the preprocessing cannot be reused, we concentrate on the forward algorithm and introduce a data structure to save the preprocessing for later reuse. We furthermore extend the theory to make the computations numerically stable and introduce practical measures to make the algorithm run fast in practice and make the library accessible.

Our implementation is tested on simulated data and on alignments of chromosomes from humans with chimpanzees, gorillas and orangutans analysed with the CoalHMM framework [7,8,12], a framework which uses changes in coalescence trees along a sequence alignment to make inference in population genetics and phylogenetics and which has been used in a number of whole-genome analyses [13-16]. Using an “isolation-with-migration” CoalHMM [17], we train the model using the Nelder-Mead-simplex algorithm and measure the preprocessing time and total optimisation time. Looking at the time required to perform the entire training procedure, we achieve up to 78 times shorter wall-clock time compared to the previously fastest implementation of the forward algorithm. Even for data of high complexity and with few repetitions we achive a speedup of a factor 4.4.

Implementation

Hidden Markov models

A Hidden Markov Model (HMM) describes a joint probability distribution over an observed sequence <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M3">View MathML</a> and a hidden sequence <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M4">View MathML</a>, where and are finite alphabets of observables and hidden states, respectively. The hidden sequence is a realisation of a Markov process which explains hidden properties of the observed data. We can formally define an HMM [18] as consisting of:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M7">View MathML</a>, a finite alphabet of hidden states;

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M8">View MathML</a>, a finite alphabet of observables;

•a vector Π=(πi)1≤iN, where πi= Pr(x1=hi) is the probability of the model starting in hidden state hi;

•a matrix A={aij}1≤i, jN, where aij= Pr(xt=hj|xt−1=hi) is the probability of a transition from state hi to state hj;

•a matrix <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M9">View MathML</a>, where bij= Pr(yt=oj|xt=hi) is the probability of state hi emitting oj.

An HMM is parameterised by π, A and B, which we will denote by λ=(π,A,B).

The classical forward algorithm

The forward algorithm [18] finds the probability of observing a sequence Y1:T in a model λ by summing the joint probability of the observed and hidden sequences for all possible hidden sequences: <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M10">View MathML</a> Pr(Y1:T,X1:T=x1:T | λ). This sum is normally computed by first filling out a table, α, with entries αt(xt)= Pr(Y1:t,<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M11">View MathML</a> column by column from left to right, using the recursion

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M12">View MathML</a>

(1)

After filling out α, Pr(Y1:T | λ) can be computed as <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M13">View MathML</a>.

The algorithm as linear algebra

In the classical forward algorithm, we compute the columns of α from left to right by the recursion in equation (1). If we can compute the last one of these columns, αT, efficiently, we can compute <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M14">View MathML</a>. Now let αt be the column vector containing the αt(xt)’s:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M15">View MathML</a>

let <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M16">View MathML</a> be the diagonal matrix, having the emission probabilities of oi on the diagonal:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M17">View MathML</a>

and let

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M18">View MathML</a>

(2)

where A is the transpose of A. Then αt can be computed using only <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M19">View MathML</a> and the previous column vector αt−1:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M20">View MathML</a>

(3)

Thus the classical forward algorithm can be described as a series of matrix-vector multiplications of length T−1 as illustrated in Figure 1. The classical forward algorithm corresponds to computing this from right to left, but since matrix-matrix multiplication is associative the product can be computed in any order. Since repeated substrings corresponds to repeated matrix-matrix multiplications, running time can be improved by reusing shared expressions [10,11]. In the following sections we will show how we precompute a grouping of the terms based on Y1:T in order to minimise the workload in the actual computation of the likelihood.

thumbnailFigure 1. Classical approach to the forward algorithm. The classical forward algorithm, as described by Rabiner [18]. The rectangles represent matrices and vectors. The black lines denote matrix-vector multiplications. The top row is the <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M21">View MathML</a> matrices. αi is obtained from αi−1 and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M22">View MathML</a>. Note that the input sequence is inverted to illustrate that the series of matrix-vector multiplication should be carried out from right to left.

Exploiting repetitions in the observed sequence

Let <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M23">View MathML</a> be the pair of symbols occurring most often without overlap in Y1:T, and let <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M24">View MathML</a> be the number of occurrences. We can then reduce the length of Y1:T with <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M25">View MathML</a> characters by introducing a new symbol oM+1 and replacing all occurrences of oioj by this symbol. To mimic this in the computation described above, we introduce a new C matrix:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M26">View MathML</a>

Now notice that we only need to compute this matrix once and substitute it for all occurrences of <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M27">View MathML</a> in equation (3). Hence we can save <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M28">View MathML</a> matrix-vector multiplications by introducing one matrix-matrix multiplication, potentially saving us a large amount of work.

These observations suggest that we can split the computation of the likelihood of a given observed sequence in a preprocessing of the sequence and in the actual computation of the likelihood. In the preprocessing phase we compress the observed sequence by repeatedly finding the most frequent pair of symbols oioj in the current sequence and replacing all occurrences of this pair by a new symbol. This is repeated until <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M29">View MathML</a> becomes too small to gain a speedup (see next section). The result is a sequence <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M30">View MathML</a> over a new alphabet <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M31">View MathML</a>, where li,ri∈{o1,o2,…oi−1}. This compression will be identical independent of the HMM, meaning that we can save it along with the observed sequence and reuse it for any HMM.

The actual computation of the likelihood is then split in two stages. In the first stage we compute C1 and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M32">View MathML</a> for i=1,…,M using (2). We then compute <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M33">View MathML</a> for increasing i=M+1,…,M by <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M34">View MathML</a>. In the second stage, we compute αT by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M35">View MathML</a>

This is illustrated in Figure 2 where the actual computation is drawn in solid black, while the saved work due to redundancy is shown in gray.

thumbnailFigure 2. Reusing common expressions to speed up the forward algorithm. The actual computation of the likelihood of the observed sequence 10100010011001 given a specific HMM. The rectangles represent matrices and vectors and lines between them represent dependencies. In stage 1 the <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M36">View MathML</a> matrices are computed. The solid black lines show the amount of work performed, while the grey lines show the amount of work saved due to redundancy. C2 is for example computed as the product C1C0, and this multiplication is saved three times. In the second stage αT is computed from the compressed sequence and the <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M37">View MathML</a> matrices. As in Figure 1αi is computed from αi−1 and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M38">View MathML</a>.

Compression stopping criterion

While the first iterations of the preprocessing procedure compress the sequence very effectively, the last iterations do not decrease the sequence length by much, since most pairs are uncommon when more characters are introduced. This is illustrated in Figure 3, where we see that the number of occurrences of the most frequent pair of symbols decreases superexponentially as a function of the number of iterations performed on an alignment of the human and chimpanzee chromosome 1. This means that we potentially save a lot of time on the likelihood computation by performing the first iterations, but as the slope of the curve increases towards 0 we risk to spend a long time on the preprocessing and save very little time on the actual likelihood computation. To overcome this problem, we do not compress the input sequence all the way down to a single character. Assume we know that the preprocessing will not be reused for an HMM with less than Nmin states, and let tmv be the time required for an (Nmin×NminNmin matrix-vector multiplication and tmm be the time required for an (Nmin×Nmin)×(Nmin×Nmin) matrix-matrix multiplication. In iteration i of the preprocessing we replace the most frequent pair of two symbols in the current sequence and find the most frequent pair of two symbols in the resulting sequence.

thumbnailFigure 3. Number of pattern occurrences against the number of iterations performed. The number of occurrences of the most frequent pair of symbols plotted on a log-scale against the number of iterations performed of the preprocessing procedure on an alignment of the human and chimpanzee chromosome 1 encoded using the alphabet {0,1,2} for identical sites, differing sites or missing data, respectively.

Thus if pi is the number of occurrences of the pair found in iteration i, prei is the time required for iteration i, and e is an estimate (given by the user) of the number of times the preprocessing is going to be reused (for example in a number of training procedures each calling forward several times), then, assuming that the matrix-vector multiplications and matrix-matrix multiplications dominate the runtime of the actual likelihood computations, the amount of time that is saved by running iteration i is e(tmvpi−1tmm)−prei, as we save pi−1 matrix-vector multiplications in each likelihood computation, and we do this by introducing one new matrix-matrix multiplication. This means that the optimal time to stop the preprocessing is before iteration j, where j is the minimal value of i making e(tmvpi−1tmm)−prei less than or equal to 0. However, we do not know prei before iteration i has been completed, but we can estimate it by prei−1. Thus we stop the preprocessing just before iteration j, where j is the minimal value of i making e(tmvpi−1tmm)−prei−1 less than or equal to 0.

The values tmv and tmm are measured prior to the preprocessing, whereas the user has to supply an estimate, e, of the number of reuses of the preprocessing and Nmin. If a single value of Nmin can not be determined, we allow the user to specify a list of state space sizes <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M39">View MathML</a> for which he wants the preprocessing to be saved. If no Nmin values are provided, the compression is stopped whenever pi=pi−1 for the first time.

Numerical stability

All our matrices contain probabilities, so all entries are between 0 and 1. This means that their products will tend towards 0 exponentially fast. The values of these products will normally be stored in one of the IEEE 754 floating-point formats. These formats have limited precision, and if the above was implemented naïvely the results would quickly underflow.

If we can make do with log(Pr(Y1:T | λ)), we can prevent this underflow by repeatedly rescaling the matrices, much in the same way as the columns are rescaled in the numerically stable version of the classical forward algorithm [18]. To make this work in our case, we will normalise the results of each matrix-matrix multiplication or matrix-vector multiplication we do throughout the algorithm and work with the normalised matrices instead. We first take care of the rounding errors that can propagate through the first stage of the likelihood computations (depicted in the top part of Figure 3) if the dependency graph between the new symbols is deep. Let

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M40">View MathML</a>

be the sum of all entries in <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M41">View MathML</a> for all symbols in the original observed sequence, and let <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M42">View MathML</a> be the corresponding normalised matrix. Now for each new symbol oi=(li,ri) in the compressed sequence define

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M43">View MathML</a>

and let

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M44">View MathML</a>

Finally let

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M45">View MathML</a>

Then

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M46','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M46">View MathML</a>

(4)

Thus to handle the underflow in the first stage, we compute <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M47','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M47">View MathML</a> along with <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M48','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M48">View MathML</a> for i=1,…M (see Figure 4) and compute the product above in the second stage of the likelihood computation.

thumbnailFigure 4. Dependencies between scaling factors. Dependencies between scaling factors. The square boxes are the <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M49','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M49">View MathML</a> matrices and s0,s1,s2 and s3 are scaling factors. The gray lines show the dependencies between the <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M50">View MathML</a> matrices, while the solid black arrows show how the scaling factors are computed. For example, s0 is computed directly from <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M51">View MathML</a> as the sum of all its entries. s2 is the sum of all entries in <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M52','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M52">View MathML</a> times s1 and s0.

However, the <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M53','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M53">View MathML</a> matrices still only contain values between 0 and 1, and their product will therefore still tend towards 0 exponentially fast, causing underflow. To prevent this we introduce a scaling factor di for each of the T−1 matrix-vector multiplications in (4), set to be the sum of the entries in the resulting vector. Each di is used two times: First we normalise the corresponding resulting vector by dividing each entry by di, and next we use it to restore the correct result at the end of the computations. Assume that <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M54','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M54">View MathML</a> is the result of the T−1 normalised matrix-vector multiplications. Then

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M55','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M55">View MathML</a>

and we can compute the final likelihood as

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M56">View MathML</a>

Notice, however, that we now risk getting an underflow when computing these products if T is big. We handle this by working in log-space. Define

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M57','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M57">View MathML</a>

and

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M58','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M58">View MathML</a>

Then

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M59','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M59">View MathML</a>

Practical implementation details

In our implementation of the preprocessing phase described above, we simply build a map symbol2pair, mapping each new alphabet symbol oi to its two constituents (li,ri). In each scan every pair of symbols is counted, and the most frequent pair in the previous round is replaced by a new symbol. The data structure being saved in the end is symbol2pair along with two other maps: nstates2alphabetsize and nstates2seq. The map nstates2alphabetsize maps each Nmini to the size of the alphabet Mj′ after j iterations, where j is the number of iterations determined by the stopping criterion. The map nstates2seq maps <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M60','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M60">View MathML</a> to the resulting sequence after j rounds of compression. These maps can be saved to disk for later use along with the original observed sequence.

Given a specific HMM with N states, the first stage of the actual computation of the likelihood builds a list, symbol2matrix, containing the <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M61','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M61">View MathML</a> matrices, and a list of scaling values, symbol2scale, containing the <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M62','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M62">View MathML</a> values. These are computed in <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M63','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M63">View MathML</a> time in an iteration over the first M symbols in the symbol2pair map, where M is the alphabet size saved in nstates2alphabetsize in the preprocessing procedure. In the second stage log(Pr(Y1:T | λ)) is computed in <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M64','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M64">View MathML</a> time by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M65','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M65">View MathML</a>

(5)

using the two maps created in the first stage. To obtain maximal performance, we use a BLAS implementation for C++ to perform the series of matrix multiplication.

Our implementation uses <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M66','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M66">View MathML</a> space in the preprocessing phase and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M67','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M67">View MathML</a> space in the actual computation, where k is the number of Nmin values supplied by the user, N is the number of states in the HMM used in the actual computation, and M is the number of symbols in the extended alphabet corresponding to N in nstates2alphabetsize. If the preprocessed data structure is saved to disk, it will take up <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M68','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M68">View MathML</a> space.

We have also implemented the algorithm in a parallelised version. In this version, stage 2 is parallelised much like the implementation in parredHMMlib [11], where the series of matrix multiplications in (5) is split into a number of blocks which are then processed in parallel. Stage 1 can clearly also be parallelised by computing independent <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/339/mathml/M69','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/339/mathml/M69">View MathML</a> matrices in parallel. However, we found that this does not work well in practice, as the workload in stage 1 is not big enough to justify the parallelization. Stage 1 is therefore not parallelised in the library. The parallelisation of stage 2 gives the greatest speedup for long sequences that are not very compressible. This is because the parallelisation in general works best for long sequences [11], and if the input sequence is very compressible then the compressed sequence will be short and more work will be done in the non-parallelised stage 1. The experiments presented in the next section have all been run single-threaded to get a clearer picture of how the runtime of the basic algorithm is influenced by the characteristica of the input sequence and model. But in general a slightly faster running time can be expected if parallelisation is enabled, especially for long sequences of high complexity.

Results and discussion

We have implemented the above algorithms in a C++ library named zipHMM. The code provides both a C++ and a Python interface to the functionality of reading and writing HMMs to files, preprocessing input sequences and saving the results, and computing the likelihood of a model using the forward algorithm described in the previous section. The library uses BLAS for linear algebra operations and pthreads for multi-threaded parallelisation.

Using the library

The library can be used directly in C++ programs or through Python wrappers in scripts.

Using zipHMM from C++

When using the library in C++ the most important objects are from the Forwarder class, which is responsible for both preprocessing sequences, reading and writing the preprocessed data structure, and for computing the likelihood of a hidden Markov model. The code snippet in Figure 5(a) shows a complete C++ program that reads in an input sequence, f.read_seq(...), preprocess it (as part of reading in the sequence), stores the preprocessed structure to disk, f.write_to_directory(...), reads in an HMM from disk, read_HMM(...), and computes the likelihood of the HMM, f.forward(...).

thumbnailFigure 5. Using the library in C++(a)and Python(b).

The sequence reader takes the alphabet size as parameter. This is because we cannot necessarily assume that the observed symbols in the input sequence are all the possible symbols the HMM can emit, so we need to know the alphabet size explicitly. It furthermore takes an optional parameter in which the user can specify an estimate, e, of the number of times the preprocessing will be reused. The default value of this parameter is 1.

If the preprocessed sequence is already stored on disk, we can simply read that instead like this:

Forwarder f;number_of_states = 4;f.read_from_directory(“example_preprocessed”, number_of_states);

This will cause the saved sequence matching Nmin≤4 to be read from the directory example_preprocessed together with additional information on the extended alphabet used in this sequence.

In the library, HMMs are implicitly represented simply by a vector and two matrices, the π vector of initial state probabilities and the transition, A, and emission, B, matrices as described in the Implementation section. These are all represented in a Matrix class, and in the program in Figure 5(a) these are read in from disk. They can also be directly constructed and manipulated in a program. In our own programs we use this, together with a numerical optimisation library, to fit parameters by maximising the likelihood.

The f.forward(...) method computes the likelihood sequentially using the preprocessed structure. To use the multi-threaded parallelisation instead, one simply uses the f.pthread_forward(...) function, with the same parameters, instead.

For completeness the library also offers implementations of the Viterbi and posterior decoding algorithms. To use these in C++ the headers viterbi.hpp and posterior_decoding.hpp should be included and the functions viterbi(...) and posterior_decoding(...) should be called as described in the README file in the library.

Using zipHMM from Python

All the C++ classes in the library are wrapped in a Python module so the full functionality of the zipHMM is available for Python scripting using essentially the same API, except with a more Python flavour where appropriate, e.g. reading in data is handled by returning multiple values from function calls instead of pass-by-reference function arguments and with a more typical Python naming convention. Figure 5(b) shows the equivalent of the C++ code in Figure 5(a) in Python.

Performance

To evaluate the performance of zipHMM we performed a number of experiments using a hidden Markov model previously developed to infer population genetics parameters of a speciation model. All experiments were run on a machine with two Intel Sandy Bridge E5-2670 CPUs, each with 8 cores running at 2.67GHz and having access to a 64Gb main memory. We compare the performance of our forward algorithm to the performance of the implementations of the forward algorithm in HMMlib [19] and in parredHMMlib [11] and to a simple implementation of equation (3) using BLAS to perform the series of matrix-vector multiplications. HMMlib is an implementation that takes advantage of all the features of a modern computer, such as SSE instructions and multiple cores. The individual features of HMMlib can be turned on or off by the user, and we recommend only enabling these features for HMMs with large state spaces. In all our experiments we enabled the SSE parallelisation but used only a single thread. The parredHMM library implements equation (3) as a parallel reduction, splitting the series of matrix multiplications into a number of blocks and processing the blocks in parallel. The parredForward algorithm was calibrated to use the optimal number of threads.

For performance evaluation we wanted to evaluate how well the new algorithm compares to other optimised forward implementations, evaluate the trade-off between preprocessing and computing the likelihood, and explore how the complexity of the input string affects the running time.

Our new implementation of the forward algorithm is expected to perform best on strings of low complexity because they are more compressible. To investigate this we measured the per-iteration running time of the forward algorithm for parredHMMlib, HMMlib and the simple implementation of equation (3) on random binary sequences (over the alphabet {0,1}) of length L=107 with the frequency of 1s varying from 0.0001 to 0.05, and divided it by the per-iteration running time for zipHMMlib (excluding the preprocessing time) to obtain the speedup factor. This experiment is summarised in Figure 6, where we note that the speedup factor decreases linearly with the complexity of the input sequence; however, speedup factors of more than two orders of magnitude are obtained for less complex sequences, and even for sequences of low complexity a (modest) speedup is obtained.

thumbnailFigure 6. Speedup vs. sequence complexity. The speedup factor decreases linearly with the sequence complexity.

In the rest of our experiments, we used a coalescent hidden Markov model (CoalHMM) from [17] together with real genomic data for the experiments. A CoalHMM [7,8,12] exploits changing gene-trees along a genomic alignment to infer population genetics parameters. The “Isolation-with-Migration” CoalHMM from [17] considers a pairwise alignment as its observed sequence and a discretisation of the time to the most recent common ancestor, or “coalescence time”, of the aligned sequences as its hidden states. The coalescence time can change from any point to another, so the transition matrix of the CoalHMM is fully connected, and the number of hidden states can be varied depending on how fine-grained we want to model time. Varying the number of states lets us explore the performance as a function of the number of states. The performance as a function of the length of the input was explored by using alignments of varying length. Finally, to explore how the complexity of the string affects the performance we used alignments of sequences at varying evolutionary distance, since closer related genomes have fewer variable sites and thus the alignments have lower complexity. The CoalHMM model uses a Jukes-Cantor model in its emission probabilities and thus only distinguishes between if a specific site has two identical nucleotides or two different nucleotides in the alignment. We therefore also varied the complexity of the strings by compressing either the actual sequence alignment or summarising it as an alphabet of size three, {0,1,2}, for identical sites, differing sites, or missing data/gaps. This way we obtain sequences with alphabets of size M=3 ({0,1,2}), M=16 (full alignments over {A,C,G,T}×{A,C,G,T}, where columns with missing data or gaps were deleted) and M = 25 (full alignments over {A,C,G,T,N}×{A,C, G,T,N}, where N is either missing data or a gap).

For all experiments we trained the model using the Nelder-Mead-simplex algorithm and measured the preprocessing time and total optimisation time, and the expected number of likelihood computations was set to e=500.

Figure 7(a) shows how the performance of zipForward changes, when the size of the model is increased. We note that the total time, as expected, depends very heavily on the number of states (the time complexities of stage 1 and 2 are qubic and quadratic in the number of states, respectively), while the time used on preprocessing, also as expected, varies very little and shows no clear pattern as the number of states is increased.

thumbnailFigure 7. Performance experiments with the CoalHMM framework, showing the running time of zipHMMlib as a function of (a) model size, (b) sequence length, (c) sequence complexity, and (d) compared to other implementations.

Figure 7(b) shows how the runtime for training the CoalHMM with zipForward changes when the sequence length is varied. We expected the runtime to increase with the sequence length, however this is not what the results show for the shorter sequences. This is due to the optimisation procedure, which required more iterations of the likelihood computation for the shorter sequences than for the longer sequences. For the longest sequences the runtime grows sublinearly, which was expected, since longer sequences often compress relatively more than shorter sequences.

We expected alignments of sequences at short evolutionary distance to be more compressible than alignments of sequences at longer evolutionary distance, and therefore expected the training procedure to be faster for alignments of sequences at short evolutionary distance. We recognise this in Figure 7(c) except for the sequences with M=16, where the human-orangutan alignment was processed faster than the human-gorilla alignment. However, this was caused by the trade-off between the time for the preprocessing and the time for the actual training procedure: the preprocessing procedure took significantly longer time for the human-gorilla alignment (because it was more compressible) than for the human-orangutan alignment, but this extra time was not all gained back in the training procedure, although the compressed sequence indeed was shorter (the per-iteration running time for the human-gorilla alignment was 7.913s and 8.488s for the human-orangutan alignment).

We also expected the total time of the training procedure to increase as the number of symbols in the initial alphabet was increased, because sequences with small initial alphabets are expected to be more compressible than sequences with larger initial alphabets. But as Figure 7(c) shows, the sequences with an initial alphabet of size M=25 were processed faster than the sequences with an initial alphabet of size M = 16. This is again caused by the optimisation procedure, which converges faster for the sequences with M=25 than for the sequences with M= 16 (e.g. for the human-gorilla alignments, the number of evaluations of the likelihood were 860 and 1160 for M=25 and M=16, respectively). This may be a result of the sequences with M=25 containing more information than the sequences with M=16. The lengths of the compressed sequences and the per-iteration running times match our expectations, and for the sequences with M=3 and M=25, which contain the same data, the algorithm behaves as expected.

Figure 7(d) shows the performance of the four different implementations of the forward algorithm. Each of the four algorithms was used to train the CoalHMM on an alignment of the entire human and chimpanzee chromosome 1, using an HMM with 16 states and an initial alphabet of size 3. The training procedure was finished in 7.4 minutes (446 seconds) using zipForward and including the preprocessing time. This gives a speedup factor of 77.7 compared to the previously fastest implementation using parredHMMlib, which used 9.6 hours (34,657 seconds). It is therefore evident that zipForward is clearly superior to the three other implementations on this kind of input. The time used per iteration of the likelihood computation was 0.5042 seconds for zipHMMlib, while it was 46.772 seconds for parredHMMlib, leading to a speedup of a factor 92.8 on the actual optimization procedure (excluding preprocessing time). Repeating the same experiment on full alignments over alphabets of size 16 and 25 (not shown here), where zipForward clearly performs worse than for sequences with alphabets of size 3 (see Figure 7(c)), we still obtained total speedup factors of 4.4 for both experiments.

Conclusions

We have engineered a variant of the HMM forward algorithm to exploit repetitions in strings to reduce the total amount of computation, by exploring shared sub-expressions. We have implemented this in an easy to use C++ library, with a Python interface for use in scripting, and we have demonstrated that our library can be used to achieve speedups of 4 - 78 factors for realistic whole-genome analysis with a reasonably complex hidden Markov model.

Availability and requirements

Project name: zipHMMProject home page:http://birc.au.dk/software/ziphmm/ webciteOperating system(s): Platform independentProgramming language: C++, PythonLicense: LGPL

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

AS, MK and TM developed the algorithms. AS implemented the library. AS, CNSP and TM designed experiments. AS, CNSP and TM drafted the manuscript. All authors read and approved the final manuscript.

Acknowledgements

TM receives funding from the Danish Council for Independent Research. We also thank the three anonymous reviewers for their helpful comments which helped to improve this paper.

References

  1. Churchill GA: Stochastic models for heterogeneous DNA sequences.

    Bull Math Biol 1989, 51:79-94. PubMed Abstract OpenURL

  2. Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA.

    J Mol Biol 1997, 268:78-94. PubMed Abstract | Publisher Full Text OpenURL

  3. Krogh A, et al.: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes.

    J Mol Biol 2001, 305(3):567-580. PubMed Abstract | Publisher Full Text OpenURL

  4. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, et al.: The Pfam protein families database.

    Nucleic Acids Res 2004, 32(suppl 1):D138-D141. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  5. Eddy S: Profile hidden Markov models.

    Bioinformatics 1998, 14(9):755. PubMed Abstract | Publisher Full Text OpenURL

  6. Siepel A, Haussler D: Phylogenetic hidden Markov models. In Statistical Methods in Molecular Evolution. Edited by Nielsen R. New York: Springer; 2005:325-351. OpenURL

  7. Hobolth A, et al.: Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov model.

    PLoS Genet 2007, 3(2):e7. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  8. Dutheil JY, et al.: Ancestral population genomics: the coalescent hidden Markov model approach.

    Genetics 2009, 183:259-274. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  9. Kern AD, Haussler D: A population genetic hidden Markov model for detecting genomic regions under selection.

    Mol Biol Evol 2010, 27(7):1673-1685. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  10. Lifshits Y, Mozes S, Weimann O, Ziv-Ukelson M: Speeding up HMM decoding and training by exploiting sequence repetitions.

    Algorithmica 2009, 54(3):379-399. Publisher Full Text OpenURL

  11. Nielsen J, Sand A: Algorithms for a parallel implementation of hidden Markov models with a small state space.

    Proceedings of the 2011 IEEE IPDPS Workshops & PhD Forum, IEEE 2011, 452-459.

    Anchorage, Alaska, USA. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6008865 webcite

    OpenURL

  12. Mailund T, et al.: Estimating divergence time and ancestral effective population size of Bornean and Sumatran Orangutan subspecies using a Coalescent Hidden Markov model.

    PLoS Genet 2011, 7(3):e1001319. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  13. Locke DP, et al.: Comparative and demographic analysis of orang-utan genomes.

    Nature 2011, 469(7331):529-533. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  14. Hobolth A, et al.: Incomplete lineage sorting patterns among human, chimpanzee, and orangutan suggest recent orangutan speciation and widespread selection.

    Genome Res 2011, 21(3):349-356. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  15. Scally A, et al.: Insights into hominid evolution from the gorilla genome sequence.

    Nature 2012, 483(7388):169-175. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  16. Prüfer K, et al.: The bonobo genome compared with the chimpanzee and human genomes.

    Nature 2012, 486:527-531. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  17. Mailund T, et al.: A new isolation with migration model along complete genomes infers very different divergence processes among closely related great ape species.

    PLoS Genet 2012, 8(12):e1003125.

    http://dx.doi.org/10.1371 webcite

    PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  18. Rabiner L: A tutorial on hidden Markov models and selected applications in speech recognition.

    Proc IEEE 1989, 77(2):257-286. Publisher Full Text OpenURL

  19. Sand A, et al.: HMMlib: A C++ Library for General Hidden Markov Models Exploiting Modern CPUs.

    2010 Ninth International Workshop on Parallel and Distributed Methods in Verification/2010 Second International Workshop on High Performance Computational Systems Biology., IEEE 2010, 126-134.

    Enschede, The Netherlands. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5698478 webcite

    OpenURL