Abstract
Markov statistical methods may make it possible to develop an unsupervised learning process that can automatically identify genomic structure in prokaryotes in a comprehensive way. This approach is based on mutual information, probabilistic measures, hidden Markov models, and other purely statistical inputs. This approach also provides a uniquely common ground for comparative prokaryotic genomics. The approach is an ongoing effort by its nature, as a multipass learning process, where each round is more informed than the last, and thereby allows a shift to the more powerful methods available for supervised learning at each iteration. It is envisaged that this "bootstrap" learning process will also be useful as a knowledge discovery tool. For such an ab initio prokaryotic genefinder to work, however, it needs a mechanism to identify critical motif structure, such as those around the start of coding or start of transcription (and then, hopefully more).
For eukaryotes, even with better startofcoding identification, parsing of eukaryotic coding regions by the HMM is still limited by the HMM's single gene assumption, as evidenced by the poor performance in alternatively spliced regions. To address these complications an approach is described to expand the states in a eukaryotic genepredictor HMM, to operate with two layers of DNA parsing. This extension from the single layer gene prediction parse is indicated after preliminary analysis of the C. elegans altsplice statistics. State profiles have made use of a novel hashinterpolating MM (hIMM) method. A new implementation for an HMMwithDuration is also described, with farreaching application to genestructure identification and analysis of channel current blockade data.
Background
Motivations for seeking MM and HMM Variants
Part of the problem in developing a reliable gene predictor is having reliable, biologistverified, training data. In its simplest form, the training data needed is the raw genomic DNA together with a minimal annotation that labels coding regions. For the prokaryotic gene prediction much of the problem with obtaining highconfidence training data can be circumvented by using a bootstrap geneprediction approach. This is possible in prokaryotes because of their simpler and more compact genomic structure: simpler in that long ORFs (open reading frames) are usually long genes, and compact in that motif searches upstream usually range over hundreds of bases rather than thousands (as in human).
In work on prokaryotic gene prediction (focusing on V. cholerae), software (smORF) was developed for an augmented ORF (open reading frame) characterization. Using that software, with a simple startofcoding heuristic, it was possible to establish very good gene prediction for ORFs of length greater than 500 nucleotides. The smORF gene identification was used in a bootstrap geneannotation process (where no initial training data was provided). The strength of the gene identification was then improved by use of a software tool that performed gapinterpolatingMarkovmodeling (gIMM). When applied to the identified coding regions (most of the >500 length ORFs), six gIMMs were used (one for each frame of the codons, with forward and backward read senses). If poorly gIMMscoring coding regions were rejected, performance improved. Failure analysis clearly indicates that startcodon modeling is needed in order to strengthen predictions. One of the benefits of the gIMM software is its gapinterpolating generalization. This permits motifs to be identified, particularly those sharing the same approximate alignment with the startofcoding region. Using the bootstrapidentified genes from the smORFbased geneprediction (including errors) as a train set permitted an unsupervised search for upstream regulatory structure. The classic ShineDalgarno sequence (the ribosome binding site) was the strongest signal in the 30base window upstream from the start codon.
In preliminary work, a Hidden Markov Model based gene predictor was trained and tested on the C. elegans genome. Splice signal motifs and intron/exon statistical profiles were extracted, from which a gene predictor was constructed. To boost detection of exons, and indirectly introns, EST information was included. As with the prokaryotic research, further work entails identification of transcription regulation fingerprints, such as the promoter motifs, to clarify starts on transcription, etc. Even with better startofcoding identification, however, parsing of eukaryotic coding regions by the HMM is still limited by the HMM's single gene assumption, as evidenced by the typically poor geneprediction performance in alternatively spliced regions. To address these complications an approach is described to expand the states in the genepredictor HMM to operate with two layers of DNA parsing. This extension from the single layer gene prediction parse was indicated after preliminary analysis of the C. elegans altsplice statistics. State profiles were implemented using a hashinterpolating MM (hIMM) in this process (see Methods).
Another development described here is the incorporation of length distribution information, on exons and introns for example (see Figure 1, figure 2 and figure 3), into the HMM optimization via a novel, algorithmically convenient, implementation of HMMwithDuration (an HMM that includes true length distribution information). One important application of such an HMMwithduration includes feature extraction from channel current data. In this setting, information about the mean dwell times for the upper level and lower level is used as a first step in discriminating the class type. HMM with duration incorporates the length distribution information (dwell times distribution) on upper level and lower level states into the HMM optimization.
Figure 1. Frequencies of different length junk regions in a test subset of C. elegans.
Figure 2. Frequencies of different length intron regions in a test subset of C. elegans (with smoothing).
Figure 3. Frequencies of different length exon regions in a test subset of C. elegans.
Markov Chains
Key property of a Markov chain:
P(x_{i } x_{i1}, ..., x_{1}) = P (x_{i } x_{i1}) = ,
where are sometimes referred to as "transition probabilities" due to their sequential product contribution to sequence probability:
P(x) = P(x_{L}, x_{L1 }..., x_{1}) = P(x_{1}) ∏_{i = 2..L }
C_{y }is the count of events y, and C_{xy }is the count of simultaneous events x and y, T_{y }is the count of strings of length one, and T_{xy }is the count of strings of length two:
= P(x  y) = P(x,y)/P(y) = [C_{xy}/T_{xy}]/[C_{y}/T_{y}]
Since T_{xy}+1 = T_{y }→ T_{xy }≅ T_{y }(sequential data sample property if one long training block), = C_{xy}/C_{y }= C_{xy}/Σ_{x}C_{xy}, so C_{xy }is complete info for determining transition probabilities.
Viterbi Path
The recursive algorithm for the most likely state path given an observed sequence (the Viterbi algorithm) is expressed in terms of v_{ki}, the probability of the most probable path that ends with observation Z_{i }= z_{i}, and state S_{i }= k. The recursive relation is v_{ki }= max_{n}{e_{ki}a_{nk}v_{n(i1)}}, where the max_{n}{...} operation returns the maximum value of the argument over different values of index n, and the boundary condition on the recursion is v_{k0 }= e_{k0}p_{k}. The a_{kl }are the transition probabilities P(X_{i }= lX_{i1 }= k) to go from state k to state l. The e_{kb }are the emission probabilities P(S_{i }= bX_{i }= k) while in state k. The Viterbi path labelings are then recursively defined by p(S_{i}S_{(i+1) }= n) = argmax_{k}{v_{ki}a_{kn}}, where the argmax_{n}{...} operation returns the index n with maximum value of the argument. The evaluation of sequence probability (and its Viterbi labeling) take the emission and transition probabilities as a given. Estimates on those emission and transition probabilities are usually obtained by the Expectation/Maximization (EM) algorithm (commonly referred to as the BaumWelch algorithm in the context of HMMs [1]).
Further details on the information measures used, such as mutual information, and on Expectation/Maximization (EM), are placed in Appendix A.
EVA Projection
The HMM method is based on a stationary set of emission and transition probabilities. Emission broadening, via amplification of the emission state variances, is a filtering heuristic that leads to levelprojection that strongly preserves transition times between major levels (see [13] for further details). Results from the emission variance amplification (EVA) emission broadening method are described in [13] (with varying amounts of variance amplification). This approach does not require the user to define the number of levels (classes). This is a major advantage compared to existing tools that require the user to determine the levels (classes) and perform a state projection. This allows kinetic features to be extracted with a "simple" FSA (Finite State Automaton) that requires minimal tuning.
Results
Ab initio prokaryotic genefinding
Ab initio genefinding begins with identification of codon structure on the basis of a simple, genomewide, mutual information analysis. The idea is to examine the genome's twobase pairings, so first consider pairings where the bases are adjacent in the genome, i.e., a sample set consisting of all dinucleotides found upon moving a window of twobase width across the genome. The counts on those pairings can be used to obtain a mutual information between the first base and second base. The next iteration is to take twobasepairings separated by one base (gap = 1), etc., with mutual information results as shown in Figure 4 (applied to V. Cholerae). This overall method is generalized further to define a gapinterpolating Markov model (details in Methods).
Figure 4. The yaxis shows the mutual information on basepairs in he V. cholerae gemone, the xaxis is the gap size between bases used to construct the base pairs. Mutual Information, MI(X;Y) is: μ = Σ_{x}Σ_{y}p(xy)log(p(xy)/p(x)p(y)). If X and Y are independent r.v's then MI = 0. If we have a DNA sequence x_{1}.... x_{i}x_{i}+1 x_{i+2}........x_{n }(where x_{k }= {a,c,g, or t}) then we can get counts on pairs x_{i}x_{i+1 }for i = 1..n, and assuming stationarity on the data, and large enough n, we can speak of the joint probability p(X,Y). Calculation of MI(X,Y) then gives an indication of the linkage between base probabilities in dinucleotide probabilities. This can be extended to linkages when the two bases aren't sequential (have a base gap between them greater than zero), such as pairs based on xixi+2 (gap = 1), etc. This type of statistical framework can then be iterated to higher order MI calculations in a variety of ways to explore a number of statistical linkages and build towards a motif identifier based on such linkages (gIMM). Such an analysis on the V. Cholerae Chr. I genome, above, clearly indicates a threecomponent encoding of data, i.e., a 3element codon structure is revealed. Furthermore, judging from the strong linkages for gaps 1 through 5, it is also clear that hexamer Markov model statistics will be strong in many regions of the genome.
Ab initio genefinding can then identify the stop codons and, thus, ORFs (see Figure 5). A generalization to codon void regions, then, leads to recognition of different, overlapping, potential gene regions (with two orientations). A tool has been developed to identify the genomic ORF topology in this generalized way, named "smORF" (see Figure 6 and Table 1 for further definitions and results). This genometopology tool also clearly shows differences between bacteria (a possible "fingerprint" tool) (see Figure 6 and Figure 7).
Figure 5. An examination of the void sizes encountered for various codons (smORFs), or groupings of codons, reveals the stop codons very clearly as codons with anomalously lengthy stopcodon void regions. Shown in green are the standard ORFs (voids in the codon subset {(taa),(tag),(tga)}), which are clearly behaving differently from other codon voids (blue) when length is greater than 500 bases. The voids shown in blue are voids in the codon subset {(aaa),(gaa),(gat)}.
Figure 6. a. Topology Index Histograms shown for the V. cholerae Chr. I genome, where the xaxes are the topology index, and the yaxes show the event counts (i.e., occurance of that particular topolgy index in the genome). The topology index is computed by the following scheme: (i) initialize index for all bases in sequesnce to zero. (ii) Each base in a forward sense ORF, with length greater than a specified cutoff, is incremented by +10,000 for each such ORF overlap. Similarly, bases in reverse sense ORFs are incremented by +1,000 for each such overlap. Voids larger than the cutoff length in the nonstandard smORFs each give rise to an increment of +1. The figure shows that V. cholerae only has a small portion of its genome involved in multiple gene encodings. b. Shows a closeup view of the 10000 topology index peak.
Table 1. TopologyIndex Histogram results are shown for Vibrio Cholerae. The N = 400 set correspond to indexing with ORFs greater than 400 bases long. Similarly, the N = 200 set correspond to indexing with ORFs greater than 200 bases long.
Figure 7. a. TopologyIndex histograms are shown for the Chlamydia trachomatis genome. b. TopologyIndex histograms are shown for the Deinococcus radiodurans genome. C. trachomatis, like V. cholerae, shows very little overlapping gene structure. D. radiodurans, on the other hand, is dominated by genes that overlap other genes (note the strong 11000 peak).
smORF offers information about open reading frames (ORFs), and tallies information about other such codon void regions (an ORF is a void in three codons: TAA, TAG, TGA). This allows for a more informed selection process when sampling from a genome, such that nonoverlapping gene starts can be cleanly and unambiguously sampled. The goal is, initially, to identify key gene structures (e.g., stop codons, etc.) and use only the highest confidence examples to train profilers. Once this is done, Markov models (MMs) can be constructed on the suspected start/stop regions and coding/noncoding regions. The algorithm then iterates again, informed with the MM information, and partly relaxes the high fidelity sampling restrictions (essentially, the minimum allowed ORF length is made smaller). A crude genefinder can be constructed on the high fidelity ORFs by use of a very simple heuristic: scan from the start of an ORF and stop at the first inframe "atg". This analysis was applied to the V. cholerae genome (Chr. I). 1253 high fidelity ORFs were identified out of 2775 known genes. This first"atg" heuristic provided a gene prediction accuracy of 1154/1253 (92.1% of predictions of gene regions were exactly correct). If small shifts are allowed in the predicted position of the startcodon relative to the first"atg" (within 25 bases on either side), then prediction accuracy improves to 1250/1253 (99.8%). This actually elucidates a key piece of information needed to improve such a prokaryotic genefinder. Basically, information is needed to help identify the correct start codon in a 50 base window. Such information exists in the form of DNA motifs corresponding to the binding footprint of regulatory biomolecules (that play a role in transcriptional or translational control).
For an ab initio genefinder to work, it will need to have a mechanism to identify critical motif structure, such as those around the start of coding or start of transcription (and then, hopefully more). In essence, a Markov model is needed with greater "reach" – the gapinterpolating Markov model (gIMM) was developed for this purpose, and is described in the Methods. To set up an ab initio motif discovery windows around the (1253) purported start of genes were sampled. The windows ranged from the 40 bases preceding the start codon to the first 20 bases of coding (a 60 base window). Some of the windows represent noise, as the first pass of the bootstrap feature extraction has only 92.1% accuracy. Even so, the gIMM is able to clearly discern the ShineDalgarno consensus sequence. With the critical motifs already discerned, further iterations of the MM construction, possibly as an HMM now, will undoubtedly aid in improving performance.
AlternateSplice Labeling Scheme for Eukaryotic HMM
The labeling scheme assigns a label to each base in the sequence. Exon frame position 0 bases have label 0 if in the forward read or A if in the reverse. Likewise, exon frame position 1 bases have label 1 or B, and exon frame position 2 bases have label 2 or C. Introns, for purposes of the analysis here, are represented as 'i' or 'I' for intron on the forward or reverse strand (in the HMM implementation the intron states are actually split out in order to maintain proper frame on reentry to coding regions via state transition restrictions). Junk DNA is labeled 'j'.
Track Label Information
Suppose there were multiple annotations regarding the labeling of a base (i.e., alternative splicing). As the genome is traversed in the forward direction, gene annotations that aren't in conflict with annotations already seen are used to determine labels on labeltrackone. If a gene annotation is in conflict (an alternative splicing) then its label information is recorded on a second, adjacent, label track. Table 2 shows the label counts on track one and on track two (where the default base label is taken to be 'j'). From Table 2 it can be seen that about 8% of the first chromosome of C. elegans genes has alternate splicing. Similarly, Table 3 shows the transition counts between labels.
Table 2. (a) shows the Track 1 Label Counts, and (b) shows the Track 2 Label Counts.
Table 3. (a) shows the Track 1 Transition Counts, and (b) shows the Track 2 Transition Counts.
VLabels and VTransitions
Counts on codingoverlap Vlabel are shown in Table 4. Notice how the Vlabels tend NOT to favor overlapping that is a simple frameshift in a given read direction (i.e., the V01 count is very low compared to V00, etc.). There are 263 transitions on Vlabels with nonzero counts. Many of the Vtransitions have very low counts and can either be ignored in the initial model, or can have their stat's boosted by bringing information from related genomes (C. Briggsae). Ignoring those Vtransitions with negligible counts as not allowed transitions, as well as those implicitly describing no alternative splicing locally (an overlap with 'j' in either track), reduces to an active Vtransition set consisting of 86 transitions between Vlabels. This is a tractable number of states to manage in the HMM analysis, suggesting a simple and direct approach to alternative splice HMM analysis. The number of Vtransitions, whether counting all 263, or the 86 'active' ones, is still much smaller than the 72*72 = 5,184 transitions that would have been surmised for track annotations that were entirely independent.
Table 4. Vlabel Counts.
HMMwithDuration with application to channel current signal analysis
Synthetic Data Generation
We have generated two sets of synthetic data with states upper level (UL) and lower level (LL) (see Figure 8). The life times of the UL & LL are drawn from a Poisson distribution. The means of UL & LL dwell times for data set 1 are 20 ms & 40 ms respectively. The means of UL & LL dwell times for data set 2 are 40 ms each. A mean value of 48 pA is used for LL and a mean value of 72 pA is used for UL. We have assumed that the noise at UL & LL vary at around +/ 5pA around their respective means (Gaussian distribution). Hence we have assumed a variance of 2.78. See Figure 8.
Figure 8. a. Synthetic data with Poisson distributed length statistics is shown in the upper trace. Emission broadening is introduced with an emission variance amplification factor of 20. This effectively broadens the noise band (thickness) seen in the upper trace by a factor of 20, which leads to a blurring between the upper and lower levels of blockade since the noise bands now overlap (i.e., a toggling crossover instability is introduced to challenge the projection method). The middle trace shows the clean, highly accurate Viterbi parsing into the appropriate levels that is still obtained with the HMMwithDuration implementation. The lower trace shows the Viterbi parse with a simple HMM, that is uninformed about the underlying length distributions, thus giving rise to a Viterbi traceback parse that fails to penalize unlikely, very short duration, blockade events (seen as the unstable, rapid levelprojection toggles). b. Synthetic data with Gaussian distributed length statistics is shown in the upper trace, with the suceessful HMMwithDuration parsing shown in the middle trace. Emission broadening is introduced with an emission variance amplification factor of 20 as in Fig. 8a, with a similar failure in the HMMwithoutDuration's ability to parse critical kinetic feature information.
The HMMwithDuration implementation (described in the Methods) has been tested in terms of its performance at parsing synthetic blockade signals with attributes and visual appearance almost indiscernible from the experimentally observed blockade data. The benefit of the synthetic data is that the mean lifetimes of the blockade levels can range over an exhaustive set of possibilities for thorough testing of the HMMwithDuration. The synthetic data was designed to have two levels, with lifetime in each level determined by a governing distribution (Poisson and Gaussian distributions with a range of mean values were considered). In Figure 8, the data was generated with Poisson distributed lifetimes and parsed by an HMM's with and without duration, with length distribution generated from a Poisson (Figure 8a) or a Gaussian (Figure 8b) Distributions. The results clearly demonstrate the superior performance of the HMMwithduration over its simpler, HMM without Duration, formulation. With use of the EVAprojection method described in the Background (and in [13]) this affords a robust means to obtain kinetic feature extraction. The emission broadening introduced in the HMM w/wo duration comparison in Figure 8a and 8b, is precisely what occurs with EVAprojection, with similar susceptibility to failure due to overrepresentation of brief blockade lifetimes in the HMM without Duration parse. Using HMM with duration this problem is resolved, brief toggles between states are now penalized according to the statistical rarity of the comparatively brief lifetime events (see Fig.'s 8a and 8b). Resolving this problem is critical for accurate kinetic feature extraction, and the results suggests that this problem can be elegantly solved with a pairing of the HMMwithDuration stabilization with EVAprojection.
Discussion
aTOPO
As already mentioned, the gIMM tool identifies statistically anomalous motifs (usually restricted to some zone upstream). Another tool being developed, "aTOPO" (for motifs with Anomalous TOPOlogy), uses the information from the upstream zones swept with the gIMM tool to annotate upstream motifs, and offers further information on whether an anomalous motif is biologically significant by looking at is positional distribution and its correlations to other motifs. This work is still preliminary, so this will not be discussed further.
Methods
Expectation/Maximization
Expectation/Maximization, EM, is a general method to estimate the maximum likelihood when there is hidden or missing data. The method is guaranteed to find a maximum, but it may only be a local maximum, as is shown here (along the lines of [1]). For a statistical model with parameters θ, observed quantities x, and hidden labels π, the EM goal is to maximize the log likelihood of the observed quantities with respect to θ : log P(xθ) = log[Σ_{π}P(x,πθ)]. At each iteration of the estimation process we would like the new log likelihood, P(xθ), to be greater than the old, P(xθ*). The difference in log likelihoods can be written such that one part is a relative entropy, the positivity of which makes the EM algorithm work:
log P(xθ)  log P(xθ*) = Q(θθ*)  Q(θ*θ*) + D[P(πx,θ*)P(πx,θ)],
where D[......] is the KullbackLeibler divergence, or relative entropy, and Q(θθ*) = Σ_{π}P(x,πθ). Now a greater log likelihood results simply by maximizing Q(θθ*) with respect to parameters θ. The EM iteration is comprised of two steps: (1) Estimation – calculate Q(θθ*), and (2) Maximization – maximize Q(θθ*) with respect to parameters θ.
For an HMM the hidden labels π correspond to a path of states. Along path π the emission and transition parameters will be used to varying degrees. Along path π, denote usage counts on transition probability a_{kl }by A_{kl}(π) and those on emission probabilities e_{kb }by E_{k}(b,π) (following [1] conventions), P(x,πθ) can then be written:
P(x,πθ) = ∏_{k = 0}∏_{b}[e_{kb}]^E_{k}(b,π) ∏_{k = 0}∏_{l = 1}[a_{kl}]^A_{kl}(π).
Using the above form for P(x,πθ), A_{kl }for the expected value of A_{kl}(π) on path π, and E_{k}(b) for the expected value of E_{k}(b,π) on path π, it is then possible to write Q(θθ*) as:
Q(θθ*) = Σ_{k = 1}Σ_{b }E_{k}(b) log[e_{kb}] + Σ_{k = 0}Σ_{l = 1 }A_{kl }log[a_{kl}].
It then follows (relative entropy positivity argument again) that the maximum likelihood estimators (MLEs) for a_{kl }and e_{kb }are:
a_{kl }= A_{kl }/(Σ_{l }A_{kl}) and e_{kb }= E_{k}(b)/(Σ_{b }E_{k}(b)).
The latter estimation is for when the state sequence is known. For an HMM (with BaumWelch algorithm) it completes the Q maximization step (Mstep), which is obtained with the MLEs for a_{kl }and e_{kb}. The Estep requires that Q be calculated, for the HMM this requires that A_{kl }and E_{k}(b) be calculated. This calculation is done using the forward/backward formalism with rescaling in the next section.
Emission and Transition Expectations with Rescaling
For an HMM, the probability that transition a_{kl }is used at position i in sequence Z is:
p(S_{i }= k, S_{(i+1) }= lX) = p(S_{i }= k, S_{(i+1) }= l, Z)/p(Z), where
p(S_{i }= k,S_{(i+1) }= l,Z) = p(z_{0},...,z_{i},S_{i }= k)p(S_{(i+1) }= lS_{i }= k)p(z_{i+1}S_{(i+1) }= l)p(z_{i+2},...,z_{L1}S_{(i+1) }= l).
In terms of the previous notation with forward/backward variables:
p(S_{i }= k, S_{(i+1) }= lX) = f_{ki }a_{kl }e_{l(i+1) }b_{l(i+1) }/p(Z),
So the expected number of times a_{kl }is used, A_{kl}, simply sums over all positions i (except last with indexing):
A_{kl }= Σ_{i }f_{ki }a_{kl }e_{l(i+1)}b_{l(i+1)}/p(Z),
Similarly, the probability that b is emitted by state k at position i in sequence Z:
p(z_{i }= b, S_{i }= kX) = [ p(z_{0},...,z_{i},S_{i }= k) p(z_{i+1},...,z_{L1}S_{i }= k)/p(Z) ] δ(z_{i}b),
where a Kronecker delta function is used to enforce emission of b at position i. The expected number of times b is emitted by state k for sequence Z:
E_{k}(b) = Σ_{i }f_{ki }b_{ki}/p(Z) δ(z_{i}b),
In practice, direct computation of the forward and backward variables can run into underflow errors. Rescaling variables at each step can control this problem. One rescaling approach is to rescale the forward variables such that Σ_{i}F_{ki }= 1, where F_{ki }is the rescaled forward variable, and B_{ki }is the rescaled backward variable: F_{ki }= a_{βk}e_{ki}F_{β(i1)}/s_{i}, and B_{ki }= a_{kβ}e_{β(i+1) }B_{β(i+1)}/t_{i+1}, where s_{i }and t_{i+1}, are the rescaling constants. The expectation on counts for the various emissions and transitions then reduce to:
A_{kl }= Σ_{i }F_{ki }a_{kl }e_{l(i+1) }B_{l(i+1)}/[Σ_{k}Σ_{l }F_{ki }a_{kl }e_{l(i+1)}] and E_{k}(b) = Σ_{i }F_{ki }B_{ki }δ(z_{i}b).
Gap Interpolating Markov Model (gIMM)
We construct a gIMM using Mutual Information (MI). Based on the three reading frames of the DNA, the Mutual Information is frame dependent.
Find out which position among the 24 prior positions has the maximum mutual information with frame position 0. Suppose it is p1
Frame positions: ...012012012012012012012012012...
Prior positions: ...987654321
Find out a position from p = 1...24 and p ≠ p1 such that it has the maximum mutual information with frame position 0 and the prior position p1. If it is p2: MI(x0,x1;x2)>MI(x0,x1;xk).
Repeat the above procedure until we get an nthorder MIlinkage, this defines an interpolated Markov model that allows for gaps (gIMM). Likewise, for the frame 1 and 2, and reverse frames.
Most importantly, this can also be done for the regions upstream of coding regions, with reference base just prior to the start codon, in searches for regulatory motifs.
Hash Interpolated MM (hIMM)
Given the current state and the emitted sequence as x1,...., xL; compute:
P(xLx1,...., xL1) ≈ Count(x1,...., xL)/Count(x1,...., xL1)
Iff Count(x1,...., xL1) ≥ 400 i.e. only if the parental sequence shows statistical significance
Store P(xLx1,...., xL1) in the hash
Maintain a separate hash for each of the following states – Junk, Intron and Exon0, Exon1 & Exon2
Pseudo Code:
The emission probabilities for states such as (jj), (ee) and (ii) are computed using hash IMM as:
Given sequence 'x1,...., xL'
If sequence is defined;
Return Probability
Else
Recurse on 'x1,...., xL1'
Minimum string length allowed for:
1. Junk Region = 6
2. Intron Region = 6
3. Exon Region = 8
UL/LL Data Generation Method
We generated 75000 random numbers that were Gaussian distributed with mean 48 pA and variance 2.78. We also generated 50000 random numbers that were Gaussian distributed with mean 72 pA and variance 2.78.
Since the dwell times of the UL & LL are Poisson distributed, the transition times (duration times between state transitions) are assumed to be exponentially distributed. Therefore for data set 1, the mean interval between successive UL states is 40 ms and the mean interval between successive LL states is 20 ms.
 Cumulant Distribution Function (CDF) of exponential distribution:
F = 1  exp (t/a) where 'a' is the mean
Hence t = a.log_{e}(1  F)
We generate uniformly distributed 'F' between [0, 1). We then compute 't' from the above equation.
We have 50,000 samples per second. Hence a dwell time of 't' ms corresponds to 50.t samples rounded off to the nearest integer.
HMMwithDuration via Cumulant Transition Probabilities
The transition probabilities for (ee) – (ee); (jj) – (jj) and (ii) – (ii) type transitions are computed as:
Prob(jjj_{length }= L) = Prob(j_{length }≥ L+1)/Prob(j_{length }≥ L)
The transition probabilities for transitions such as (jj) – (je), (ee) – (ej), (ee) – (ei), (ii) – (ie) are computed as:
If the total number of (ej) transitions is 60 and the total number of (ei) transitions is 40, then:
Prob(eie_{length }= L) = Prob(e_{length }= L)/Prob(e_{length }≥ L) × 40/(40 + 60)
Prob(eje_{length }= L) = Prob(e_{length }= L)/Prob(e_{length }≥ L) × 60/(40 + 60)
Pseudo Code:
Maintain separate counters for the junk, exon and intron regions.
The counters are updated as:
1. The exon counter is set to 2 for a (je) – (ee) transition
2. The exon counter gets incremented by 1 for every (ee) – (ee) transition
Prob(e_{length }≥ L+1) is computed as:
We have Prob(e_{length }≥ L+1) = 1  Σ_{i = 1..L }Prob(e_{length }= i)
Hence we generate a list such that for each index 'k > 0', the value 1  Σ_{i = 1..k }Prob(e_{length }= i) is stored
HMMwithDuration Viterbi Implementation
The transition probabilities for (UL) – (UL); (LL) – (LL) are computed as:
Prob(ulul_len = L) = Prob(ul_len ≥ L+1)/Prob(ul_len ≥ L)
The transition probabilities for transitions such as (UL) – (LL), (LL) – (UL) are computed as:
Prob(ulllul_len = L) = Prob(ul_len = L)/Prob(ul_len ≥ L)
 Pseudo Code
Figure 9. The HMM dynamic programming table is modified to track new count index information at each cell to enable the HMMwithDuration implementation. The figure shows a view of Viterbi modifications in the dynamic programming table – the transition probabilities are now modified to be a ratio of transition probability cumulants. The transition probability cumulant is calculated based on a predetermined length distribution profile and uses information on the length that is also a (new) celllevel parameter. There are two new celllevel parameters, one each for tracking length on UL and LL states. (For genefinding, the information to track at the celllevel would be the length of the purported exon, intron, or junk, region.)
Maintain separate counters for the UL and LL regions
The counters are updated as:
1. The UL counter is set to 1 for a (ll) – (ul) transition
2. The UL counter gets incremented by 1 for every (ul) – (ul) transition
Prob(ul_len ≥ L+1) is computed as:
We have Prob(ul_len ≥ L+1) = 1  Σ_{i = 1 }^{L }Prob(ul_len = i)
Hence we generate a list such that for each index 'k > 0', the value 1  Σ_{i = 1 }^{k}Prob(ul_len = i) is stored
Prob(ul_len = L) is computed as:
We generate a list such that for each index 'k', the value Prob(ul_len = k) is stored
Appendix A
Information measures
The fundamental information measures are Shannon entropy, mutual information, and relative entropy (also known as the KullbackLeibler divergence or distance). Shannon entropy: σ = Σ_{x}p(x)log(p(x)), is a measure of the information in distribution p(x). Mutual Information: μ = Σ_{x}Σ_{y}p(xy)log(p(xy)/p(x)p(y)), is a measure of information one random variable has about another random variable. Relative Entropy (KullbackLeibler distance): ρ = Σ_{x }p(x) log(p(x)/q(x)), is a measure of distance between two probability distributions. Mutual information is a special case of relative entropy between a joint probability (twocomponent in simplest form) and the product of component probabilities.
Hidden Markov Models with possible Side Information
Hidden Markov Models [1] provide a statistical framework for sequences of observations obeying stationary Markov statistics. The "hidden" part of the HMM consists of the labelings, s_{i}, for each observation, z_{i}, where the index i labels the observation. The stationary statistics for a first order HMM are described in terms of emission probabilities, e_{ni }= p(Z_{j }= z_{i}S_{j }= n), and transition probabilities, a_{nm }= p(S_{j }= mS_{(j1) }= n). (The indexing on j is left in for clarity on the transition probability definition; from stationarity the expressions are valid for any choice of j.) Given (i) sequence of observations z_{i}, (ii) hidden labels s_{i}, and (iii) stationary Markov statistics, one can calculate: (i) p(Z_{0...L1}), or (ii) the most likely hidden labeling (path with largest contribution to p(Z_{0...L1})), or (iii) the reestimation of emission and transmission probabilities such that p(Z_{0...L1}) is maximized (using Expectation/Maximization).
Forward and Backward Variables
The forward/backward variables can be used to evaluate p(Z_{0...L1}) by breaking the sequence probability p(Z_{0...L1}) into two pieces via use of a single hidden variable treated as a Bayesian parameter: p(Z_{0...L1}) = Σ_{k}p(Z_{0...i},s_{i }= k)p(Z_{i+1...L1},s_{i }= k) = Σ_{k}f_{ki}b_{ki}, where f_{ki }= p(Z_{0...i},s_{i }= k) and b_{ki }= p(Z_{i+1...L1},s_{i }= k). A proof for the twopiece split is obtained by directly expanding the sequence probability (via Chow expansion: P(xyz) = P(x)P(yx)P(zxy), and Markov conditional reduction: P(xyz) = P(x)P(yx)P(zy), with appropriate notational bookkeeping). Given stationarity, the state transition probabilities and the state probabilities at the i^{th }observation satisfy the trivial relation p_{qi }= Σ_{k}a_{kq}p_{k(i1)}, where p_{qi }= p(S_{i }= q), and p_{q0 }= p(S = q), and the latter probabilities are the state priors. The trivial recursion relation that is implied can be thought of as an operator equation, with operation the product by a_{kq }followed by summation (contraction) on the k index. The operator equation can be rewritten using an implied summation convention on repeated Greekfont indices (Einstein summation convention): p_{q }= a_{βq}p_{β }. Transitionprobabilities in a similar operator role, but now taking into consideration local sequence information via the emission probabilities, are found in recursively defined expressions for the forward variables, f_{ki }= a_{βk}e_{ki }f_{β(i1)}, and backward variables, b_{ki }= a_{kβ}e_{β(i+1) }b_{β(i+1)}. The recursive definitions on forward and backward variables permit efficient computation of observed sequence probabilities using dynamic programming tables. It is at this critical juncture that side information must mesh well with the states (column components in the table), i.e., in a manner like the emission or transition probabilities. Length information, for example, can be incorporated via lengthdistributionbiased transition probabilities (as described in a new method in the Methods).
Acknowledgements
The author would like to thank Anil Yelundur for helpful discussions. Funding was provided by grants from the National Institutes for Health, The National Science Foundation, The Louisiana Board of Regents, and NASA.
References

Durbin R: Biological sequence analysis : probabilistic models of proteins and nucleic acids. Cambridge, UK & New York: Cambridge University Press; 1998.

Bezrukov SM, Vodyanoy I, Parsegian VA: Counting polymers moving through a single ion channel.
Nature 1994, 370(6457):279281. PubMed Abstract  Publisher Full Text

Bak P, Tang C, Wiesenfeld K: Selforganized Criticality – An explanation of 1/f noise.
Phys Rev Lett 1987, 59:381. PubMed Abstract  Publisher Full Text

Sinai Y: Introduction to Ergodic Theory. Princeton University Press; 1976.

Sklar L: Physics and Chance. Cambridge Univerisity Press; 1993.

Kittel C, Kroemer H: Thermal Physics. W.H. Freeman and Company, New York; 1980.

Jaynes E: Paradoxes of Probability Theory. [http://omega.albany.edu:8008/JaynesBook.html] webcite

Jumarie G: Relative Information : theories and applications. Volume 47. SpringerVerlag. (Springer series in synergetics); 1990.

Katchalsky A, Curran PF: Nonequilibrium Thermodynamics in Biophysics. Harvard University Press, Cambridge, MA; 1965.

Prigogine I: Dynamical Foundations of Thermodynamics and Statistical Mechanics. In A Critical Review of Thermodynamics. Edited by Stuart E, GalOr B, Brainard A. Mono Books, Baltimore; 1970.

A United Foundation of Dynamics and Thermodynamics. Chemica Scripta. 1973, 4:532.

Progigine I: Order out of Chaos. Bantam Books, New York; 1984.

WintersHilt S, Landry M, Akeson M, Tanase M, Amin I, Coombs A, Morales E, Millet J, Baribault C, Sendamangalam S: Cheminformatics Methods for Novel Nanopore analysis of HIV DNA termini.
BMC Bioinformatics 2006, 7(Suppl 2):S22.