Abstract
Background
Continuous time Markov chains (CTMCs) is a widely used model for describing the evolution of DNA sequences on the nucleotide, amino acid or codon level. The sufficient statistics for CTMCs are the time spent in a state and the number of changes between any two states. In applications past evolutionary events (exact times and types of changes) are unaccessible and the past must be inferred from DNA sequence data observed in the present.
Results
We describe and implement three algorithms for computing linear combinations of expected values of the sufficient statistics, conditioned on the endpoints of the chain, and compare their performance with respect to accuracy and running time. The first algorithm is based on an eigenvalue decomposition of the rate matrix (EVD), the second on uniformization (UNI), and the third on integrals of matrix exponentials (EXPM). The implementation in R of the algorithms is available at http://www.birc.au.dk/~paula/ webcite.
Conclusions
We use two different models to analyze the accuracy and eight experiments to investigate the speed of the three algorithms. We find that they have similar accuracy and that EXPM is the slowest method. Furthermore we find that UNI is usually faster than EVD.
Background
In this paper we consider the problem of calculating the expected time spent in a state and the expected number of jumps between any two states in discretely observed continuous time Markov chains (CTMCs). The case where the CTMC is only recorded at discretely observed time points arises in molecular evolution where DNA sequence data is extracted at present day and past evolutionary events are missing. In this situation, efficient methods for calculating these types of expectations are needed. In particular, two classes of applications can be identified.
The first class of applications is concerned with rate matrix estimation. [1] describes how the expectationmaximization (EM) algorithm can be applied to estimate the rate matrix from DNA sequence data observed in the leaves of an evolutionary tree. The EM algorithm is implemented in the software XRate [2] and has been applied in [3] for estimating empirical codon rate matrices. [1] uses the eigenvalue decomposition of the rate matrix to calculate the expected time spent in a state and the expected number of jumps between states.
The second class of applications is concerned with understanding and testing various aspects of evolutionary trajectories. In [4] it is emphasized that analytical results for jump numbers are superior to simulation approaches and various applications of jump number statistics are provided, including a test for the hypothesis that a trait changed its state no more than once in its evolutionary history and a diagnostic tool to measure discrepancies between the data and the model. [4] assumes that the rate matrix is diagonalizable and that the eigenvalues are real, and applies a spectral representation of the transition probability matrix to obtain the expected number of state changes.
[5] and [6] describe a method, termed substitution mapping, for detecting coevolution of evolutionary traits, and a similar method is described in [7]. The substitution mapping method is based on the expected number of substitutions while [7] base their statistics on the expected time spent in a state. Furthermore [7] describes an application concerned with mapping synonymous and nonsynonymous mutations on branches of a phylogenetic tree and employs the expected number of changes between any two states for this purpose. [8] uses the expected number of state changes to calculate certain labeled evolutionary distances. A labeled evolutionary distance could for example be the number of state changes from or to a specific nucleotide. In [9] substitution mapping is invoked for identifying biochemically constrained sites. In [7] and [8] the summary statistics are calculated using the eigenvalue decomposition method suggested by [1]. In [5,6] and [9] the substitution mapping is achieved using a more direct formula for calculating the number of state changes. In this direct approach an infinite sum must be truncated and it is difficult to control the error associated with the truncation. An alternative is described in [10] where uniformization is applied to obtain the expected number of jumps. [10] uses the expected number of jumps on a branch to detect lineages in a phylogenetic tree that are under selection.
A third algorithm for obtaining the number of changes or time spent in a state is outlined in [11]. The algorithm is based on [12] where a method for calculating integrals of matrix exponentials is described. A natural question arises: which of the three methods (eigenvalue decomposition, uniformization or matrix exponentiation) for calculating conditional expectations of summary statistics for a discretely observed CTMC should be preferred? The aim of this paper is to provide an answer to this question. We describe and compare the three methods. Our implementations in R [13] are available at http://www.birc.au.dk/~paula/ webcite. (Furthermore the eigenvalue decomposition and uniformization methods are also available as a C++ class in the bio++ library at http://biopp.univmontp2.fr/ webcite.) The performance and discussion of the algorithms are centered around two applications. The first application is concerned with rate matrix estimation; we estimate the GoldmanYang codon model [14] using the expectationmaximization algorithm. The second application is based on the labeled distance estimation presented in [8].
Consider a stochastic process {X(s): 0 ≤ s ≤ t} which can be described by a CTMC with n states and an n × n rate matrix Q = (q_{cd}). The offdiagonal entries in Q are nonnegative and rows sum to zero, i.e. q_{cc }=  Σ_{d≠c }q_{cd }= q_{c}. Maximum likelihood estimation of the rate matrix from a complete observation of the process is straight forward. The likelihood of the process, conditional on the beginning state X(0), is given by (e.g. [15])
where T_{c }is the total time spent in state c and N_{cd }is the number of jumps from c to d. The necessary sufficient statistics for a CTMC are thus the time spent in each state and the number of jumps between any two states. In applications, however, access is limited to DNA data from extant species. The CTMC is discretely observed and we must estimate the mean values of T_{c }and N_{cd }conditional on the endpoints X(0) = a and X(t) = b. From [15] we have that
where P(t) = (p_{ij}(t)) = e^{Qt }is the transition probability matrix and
Many applications require a linear combination of certain substitutions or times. Examples include the number of transitions, transversions, synonymous and nonsynonymous substitutions. In the two applications described below the statistics of interest is a linear combination of certain substitutions and times. Let therefore C be an n × n matrix and denote by Σ(C; t) the matrix with entries
We describe, compare and discuss three methods for calculating Σ(C; t). The evaluation of the integrals (4) takes O(n^{3}) time and therefore a naive calculation, assuming that C contains just one entry different from zero has a O(n^{5}) running time. Even worse, if C contains O(n^{2}) entries different from zero, then the naive implementation has a O(n^{7}) running time. For all three methods our implementations of Σ(C; t) run in O(n^{3}) time.
Results
Applications
Application 1: Rate matrix estimation
Our first application is the problem of estimating the parameters in a CTMC for evolution of coding DNA sequences which we describe using the 61 × 61 rate matrix (excluding stop codons) given by Goldman and Yang [14]:
where π is the stationary distribution, κ is the transition/transversion rate ratio, ω is the nonsynonymous/synonymous ratio and α is a scaling factor. The stationary distribution π is determined directly from the data using the codon frequencies. We estimate the remaining parameters θ = (α, κ, ω) using the expectationmaximization (EM) algorithm [16] as described below.
Suppose the complete data x is available, consisting of times and types of substitutions in all sites and in all branches of the tree. The complete data log likelihood is, using (1) and (6),
where we use the notation
where e.g.
A similar notation applies for L_{s,tv}, L_{ns,ts}, L_{ns},_{tv}, N_{ns }and N, where the last statistic is the sum of substitutions between all states (i, j) that differ at one position and s, ns, ts and tv subscripts stand for synonymous, nonsynonymous, transition and transversion.
The complete data log likelihood can be maximized easily by making the reparametrization β = ακ. We find that
where a = L_{ns,tv}L_{ns,ts}N_{s}, b = L_{ns,tv}L_{s,ts}(N_{ns } N_{tv}) + L_{ns,ts}L_{s,tv}(N_{ns } N_{ts}) and c = L_{s,tv}L_{s,ts}N_{ns}.
In reality the data y is only available in the leaves and the times and types of substitutions in all sites and all branches of the tree are unaccessible. The EM algorithm is an efficient tool for maximum likelihood estimation in problems where the complete data log likelihood is analytically tractable but full information about the data is missing.
The EM algorithm is an iterative procedure consisting of two steps. In the Estep the expected complete log likelihood
conditional on the data y and the current estimate of the parameters θ_{0 }is calculated. In the Mstep the parameters are updated by maximizing G(θ; θ_{0},y). The parameters converge to a local maximum of the likelihood for the observed data.
The expected log likelihood conditional on the data y and under the three parameters α, κ and ω is
Therefore the Estep requires expectations of linear combinations of waiting times in a set of states and number of jumps between certain states. Because of the Markov property this calculation can be divided in two parts. First we use the peeling algorithm [17,18] to obtain the probability that a branch k of length t_{k }with nodes γ_{k }and β_{k }above and below the branch, respectively, has endpoints a and b. Second, we calculate the desired summary statistic by summing over all branches. For example we have
The Estep thus consists of calculating conditional expectations of linear combinations of times such as and substitutions such as where L_{s,ts }and N_{ts }are given by (8). In our application n = 61 and the first type of statistics is (up to a factor p_{ab}(t)) on the form (5) with diagonal entries and all off diagonal entries equal to zero. The second type of statistics is also on the form (5) with offdiagonal entries and zeros on the diagonal.
Application 2: Robust distance estimation
The second application is a new approach for estimating labeled evolutionary distance, entitled robust counting and introduced in [8]. The purpose is to calculate a distance that is robust to model misspecification. The method is applied to labeled distances, for example, the synonymous distance between two coding DNA sequences. As it is believed that selection mainly acts at the protein level, synonymous substitutions are neutral and phylogenies built on these type of distances are more likely to reveal the true evolutionary history. The distance is calculated using the mean numbers of labeled substitutions conditioned on pairwise site patterns averaged over the empirical distribution of site patterns observed in the data. In the conventional method the average is done over the theoretical distribution of site patterns. The robustness is therefore achieved through the usage of more information from the data and less from the model.
Let Q be the rate matrix of the assumed model, P(t) = e^{Qt}, the labeling be given through a set of pairs ℒ and the data be represented by a pairwise alignment y = (y_{1}, y_{2}) of length m. As data only contains information about the product Qt, where t is the time distance between the sequences, we can set t = 1.
Suppose we observe the complete data consisting of the types of substitutions that occurred in all sites and let be the labeled number of substitutions. A natural labeled distance is given by . The labeled distance is estimated as the average across all sites of the expected number of labeled substitutions conditioned on the observed end points:
Therefore this application requires evaluating a sum on the form (5) with offdiagonal entries and zeros on the diagonal.
Algorithms
The calculation of Σ(C; t) is based on the integrals . In this section we present three existing methods for obtaining the integrals and extend them to obtain Σ(C; t).
Eigenvalue decomposition (EVD)
When the rate matrix Q is diagonalizable, the computation of transition probabilities p_{ab}(t) and integrals can be done via the eigenvalue decomposition (EVD). EVD is a widely used method for calculating matrix exponentials. Let Q = UΛU^{1 }be the eigenvalue decomposition, with Λ = diag(λ_{1}, ..., λ_{n}). It follows that
Because Λ is diagonal, e^{Λt }is also diagonal with .
The integral (4) becomes
Replacing with (16) in (5), rearranging the sums and using and we find
where ○ represents the entrywise product.
The eigenvalues and eigenvectors might be complex, but they come in complex conjugate pairs and the final result is always real; for more information we refer to the Supplementary Information in [2]. If the CTMC is reversible, the decomposition can be done on a symmetric matrix obtained from Q (e.g. [15]), which is faster and tends to be more robust. Let π be the stationary distribution. Due to reversibility, π_{a}q_{ab }= π_{b}q_{ba}, which can be written as ΠQ = Q*Π where Π = diag(π). Let S = Π^{1/2}QΠ^{1/2}.
We have that
where S* is the transpose of S. Then S is symmetric. Let Λ, V be its eigenvalues and eigenvectors, respectively. Then VΛV^{1 }= S = Π^{1/2}QΠ^{1/2}, which implies Q = (Π^{1/2}V)Λ(V^{1}Π^{1/2}) and it follows that Q has the same eigenvalues as S and Π^{1/2}V for eigenvectors.
The results can be summarized in the following algorithm:
Algorithm 1: EVD
Input: Q, C, t
Output: Σ(C; t)
Step 1: Determine eigenvalues λ_{i}.
Determine the eigenvectors U_{i }for Q and compute U^{1}.
Step 2: Determine matrix J(t) from (17).
Step 3: Determine matrix Σ(C;t) from (18).
Uniformization (UNI)
The uniformization method was first introduced in [19] for computing the matrix exponential P(t) = e^{Qt}. In [11] it was shown how this method can be used for calculating summary statistics, even for statistics that cannot be written in integral form. Let μ = max_{i }(q_{i}) and , where I is the identity matrix.
Then
where Pois(m; λ) is the probability of m occurrences from a Poisson distribution with mean λ. Using (20) we also have
Replacing (21) in (5), rearranging the sums and using that and we derive
The main challenge with this method is the infinite sum and we use (20) to determine a truncation point. In particular if we let λ = μt and truncate at s(λ) we can bound the error using the tail of the Poisson distribution:
We have that, for large values of λ, , where is the normal distribution with mean μ and variance σ^{2}. Therefore, for large λ, the error bound
where Φ(·) is the cumulative distribution function for the standard normal distribution. Consequently we can approximate the truncation point s(λ) with . If b = 10^{8 }we obtain Φ^{1 }(1  b) = 5.6.
Another way to determine s(λ) is to use R to evaluate Pois(m; λ) for values of m that gradually increase, until the tail is at most b = 10^{8}. Combining these two approaches, we performed a linear regression, approximating the tails from R by . We obtained c_{1 }= 4.0731, c_{2 }= 5.6469, c_{3 }= 0.9963 but, in order to be conservative, we use where ⌈x⌉ is the smallest integer greater than or equal to x. In Figure 1 we compare the exact truncation value and the linear regression approximation.
Figure 1. Poisson truncation. Comparison between the exact truncation value and the approximation. In the plot on the left, λ ranges from 0 to 30, while the plot on the right is a zoomin for values between 0 to 0.5. The plot shows that the approximation is a conservative fit of the Poission tail.
The linear regression provides an excellent fit to the tail of the distribution.
In summary we have the following algorithm:
Algorithm 2: UNI
Input: Q, C, t
Output: Σ(C; t)
Step 1: Determine μ, s(μt) and R.
Step 2: Calculate R^{m }for 2 ≤ m ≤ s(μt).
Step 3: Calculate for 0 ≤ m ≤ s(μt).
using the recursion A(m + 1) = A(m)R + R^{m+1}C.
Step 4: Determine Σ(C; t) from (22).
Exponentiation (EXPM)
This method for calculating the integral (4) was developed in [12] and emphasized in [11]. Suppose we want to evaluate , where Q and B are n × n matrices. To calculate this integral, we use an auxiliary matrix and the desired integral can be found in the upper right corner of the matrix exponential of A:
We are interested in
where is a matrix with 1 in entry (c, d) and zero otherwise. We can use this method to determine by simply setting , construct the auxiliary matrix A, calculate the matrix exponential of At, and finally read off the integral in entry (a, b) in the upper right corner of the matrix exponential.
Replacing (24) in (5) and rearranging the terms we have
Therefore by setting B = C in the auxiliary matrix we obtain Σ(C;t).
The EXPM algorithm is as follows:
Algorithm 3: EXPM
Input: Q, C, t
Output: Σ(C; t)
Step 2: Calculate the matrix exponential e^{At}.
Step 3: Σ(C; t) is the upper right corner of the matrix exponential.
Testing
We implemented the presented algorithms in R and tested them with respect to accuracy and speed.
Accuracy
The accuracy of the methods depends on the size of the rate matrix and the time t. To investigate how these factors influence the result, we used two different CTMCs that allow an analytical expression for (4). The first investigation is based on the JukesCantor model where the rate matrix has uniform rates and variable size n:
Q has two unique eigenvalues: 0 with multiplicity 1 and with multiplicity n1. We obtain
We compared the result from all three methods against the true value of (5) for size n ranging from 5 to 100, t = 0.1 and random binary matrices C. Entries in C are 1 with probability . For each fixed size, we generated 5 different matrices C. The average normalized deviation is shown in Figure 2.
Figure 2. Accuracy results. Accuracy has been tested using JC and HKY models. For each run, the normalized deviation is calculated: where Σ is the correct value and is the calculated one. Each plotted point represents the average over a, b and 5 different randomly generated matrices C as described in the main text.
The second CTMC is the HKY model:
where π = (0.2,0.2,0.3,0.3) is the stationary distribution and κ = 2.15 is the transition/transversion rate ratio. This rate matrix has an analytical result for (4) which can be obtained through the eigenvalue decomposition. The eigenvalues and eigenvectors of Q are
From this, using the symbolic operations in Matlab [20], we obtained the final analytic expression for (4). Using this model we compared for all three methods the true value of (5) for various values of t and randomly generated binary matrices C. For each t we generated 5 different matrices C. The average normalized deviation is shown in Figure 2.
In both cases, all methods showed good accuracy as the normalized deviation was no bigger than 3 × 10^{9}. We also note that EXPM tended to be the most precise while UNI provided the worst approximation. To further investigate the accuracy, we performed calculations on randomly generated reversible rate matrices: we first obtained the stationary distribution from the Dirichlet distribution with shape parameters equal to 1, then all entries q_{ij }with i ≥ j from the exponential distribution with parameter 1 and finally calculated the remaining entries using the reversibility property. In all the runs the relative difference between EVD, UNI and EXPM was less than 10^{5}. This indicated that all three methods have a similar performance in a wide range of applications.
Speed
Partition of computation
Assume we need to evaluate Σ(C; t) for a fixed matrix C and multiple time points t ∈ {t_{1},...t_{k}}. In each iteration of the EMalgorithm in Application 1 we need this type of calculation while in order to calculate the labeled distance in Application 2 just one time point is required. Using EVD (Algorithm 1) we do the eigenvalue decomposition (Step 1) once and then, for each time point t_{i}, we apply Step 2 and Step 3. The eigenvalue decomposition, achieved through the R function eigen, has a running time of O(n^{3}). In Step 2 we determine J(t) and this takes O(n^{2}) time. Step 3 has a running time of O(n^{3}) due to the matrix multiplications.
If instead we apply UNI (Algorithm 2), we run Steps 13 for the largest time point max(t_{i}) and then, for each time point t_{i}, we apply Step 4. Steps 13 take O (s(μmax (t_{i})) n^{3}) time, and Step 4 takes O(s(μt_{i})n^{2}) time for each i ∈ {1,..., k}. Therefore, even though the total time for both methods is O(n^{3}), the addition of one time point contributes with O(n^{3}) for EVD, but only O(s(μt)n^{2}) for UNI. Recall that the constant s(μt) is the truncation point for the infinite sum in the uniformization method.
In the case of EXPM (Algorithm 3) we need to calculate the matrix exponential at every single time point. We used the expm R package [21] with the Higham08 method. This is a Padé approximation combined with an improved scaling and squaring [22] and balancing [23]. The running time is O(n^{3}).
Table 1 provides an overview of the running times for each of the methods. The algorithms are divided into precomputation and main computation where the precomputation consists of steps that must be executed only once, while in the main computation we calculate the value of Σ(C;t) for every time point under consideration.
Table 1. Running time complexity
Experiments
We tested the speed of the algorithms in six experiments based on the presented applications and two more experiments using a nonreversible matrix.
GY
The first experiment corresponded to running the EM algorithm on real data consisting of DNA sequences from the HIV pol gene described in [24]. HIV has been extensively studied with respect to selection pressure and drug resistance and in [24] the authors document convergent evolution in pol gene caused by drug resistance mutations. The observed data y was a multiple codon alignment of the sequences. For simplicity, we did not consider the columns with gaps or ambiguous nucleotides. To compare the performance of the methods as a function of the size of the data set, we applied the EM algorithm for 15 data sets containing from 2 up to 16 sequences each, extracted from the HIV pol gene data. For each set we assumed the sequences were related according to a fixed tree; we have reconstructed the phylogenies in Mega [25] using the JukesCantor model and NeighborJoining. We ran the EM algorithm until all three parameters converged. Experiments two and three used the previously estimated matrix Q given by (6) with α = 10.5, κ = 4.27 and ω = 0.6. We let C_{ij }= q_{ij }and C_{ii }= 0, corresponding to calculating the total number of expected substitutions , and computed the value of Σ(C; t_{k}) for 10 equidistant sorted time points t_{k }with 1 ≤ k ≤ 10 (Table 2).
Table 2. Experimental design
GTR
In the fourth experiment we estimated the robust labeled distance of two sequences, using the same setup as in [8]. For each considered evolutionary distance t between 0.1 and 1, we generated 50 pairwise sequence data sets of length 2000 which have evolved for time t under the general time reversible (GTR) model with
where r = (0.5, 0.3,0.6, 0.2,0.3, 0.2) and π = (0.2, 0.2,0.3, 0.3). For labeling, we considered the jumps to and from nucleotide A, leading to C_{ij }= q_{ij }if i or j represents nucleotide A. For each data set, we estimated the GTR parameters as described in [8] and calculated the robust distance. Experiments 5 and 6 used the same GTR matrix and C_{ij }= q_{ij }if i or j represents nucleotide A and zero otherwise, and computed the value of Σ(C;t_{k}) for 10 equidistant sorted time points t_{k }with 1 ≤ k ≤ 10 (Table 2).
UNR
In the last two experiments we used the same setup as in experiments 5 and 6 but with a different matrix and time points (Table 2). As the speed of EVD is influenced by the type of the model, we decided to employ a nonreversible matrix. We chose the unrestricted model and carefully set the rates such that the matrix has a complex decomposition:
Figure 3 shows the results. For experiments 1 and 4, the plots show the recorded running time under each setup (different number of sequences or different evolutionary distance). For the remaining experiments each plot starts with the running time of the precomputation which, for UNI, is done on the largest time point t_{10}. Then, at position k, we plot the cumulative running time for precomputation and the evaluation of Σ(C;t_{i}) for all i ≤ k. Since EVD and EXPM have running times that are independent of t_{k}, the running times for these two algorithms are the same in experiments 2 and 3, 5 and 6, and 7 and 8. Even more, as EXPM is dependent only on the size of the matrix, the running times in experiments 58 are the same. We observe that in all our experiments EXPM is the slowest method. Deciding if EVD or UNI is faster depends on the size and type of the matrix, the number of time points and the values of s(μt). As the main computation for UNI has a running time of O(n^{2}) as opposed to O(n^{3}) for EVD (Table 1), this method should have an increased advantage when the rate matrix is bigger. This means that if many time points are considered, then UNI is generally the faster method. Importantly, we note that the EVD precomputation tends to be faster than the UNI precomputation. We remark that, in the first experiment, UNI proved to be the fastest method while, in the fourth experiment, UNI became slower with the increase of the evolutionary distance between the sequences and it was only faster than EVD for small distances (< 0.2). By setting t_{k }in an appropriate manner (Table 2), we have the same running time for UNI and EXPM for experiment 7 compared to experiment 6. Due to the fact that in experiment 7 we used the UNR matrix, EVD is slower as opposed to experiment 6. In this case, the difference is observable but not very big, but as the size of the matrix increases, this discrepancy increases too. We also note that the difference between the reversible and nonreversible cases is enough to make UNI faster than EVD in the latter case.
Figure 3. Experiments results. Running times for the eight experiments. Experiment 1: rate matrix estimation using EM. The plot shows the running time for calculating the statistics for each method, as a function of the number of sequences included in the data set. For experiments 2 and 3 we calculated the value of Σ(C;t_{k}) for 10 time points t_{k}. Each plot starts with the running time of the precomputation and at position k we plot the cumulative running time for precomputation and the evaluation of Σ(C;t_{i}) for all t_{i}∈ {t_{1},...,t_{k}}. The values of t_{k }are provided in Table 2. Experiment 4: robust distance estimation. The plot shows the running time for computing the robust distance as a function of the evolutionary distance t. Experiments 58: similar as for experiments 2 and 3 but with a GTR model (experiments 5 and 6) and UNR model (experiments 7 and 8) instead of a GY model.
Discussion
The EVD algorithm assumes that the rate matrix is diagonalizable. However, a direct calculation of the integral (4) in the nondiagonalizable case is actually possible using the Jordan normal form for the rate matrix. Let Q = PJP^{1 }where J is the Jordan normal form of Q and P consists of the generalized eigenvectors (we recognize that we used P and J for other quantities earlier but for this discussion this should not cause any confusion and we prefer to use standard notation), i.e. J has a block diagonal form J = diag(J_{1},..., J_{κ}) where J_{k }= λ_{k}I + N is a matrix with λ_{k }on the diagonal and 1 on the superdiagonal. We have
and noting that N is a nilpotent matrix with degree d_{k }(equal to the size of block J_{k}) we obtain
In order to calculate the integral (4) the expressions (26) and (27) are used. It is evident that this procedure is feasible but also requires much bookkeeping.
In [26] an extension of uniformization, adaptive uniformization, is described for calculating transition probabilities in a CTMC. The basic idea is to perform a local uniformization instead of a global uniformization of the rate matrix and thereby have fewer jumps in the jump process. [26] considers a model with rate matrix
(state 4 is an absorbing state). If this process starts in state 1 then the first jump is to state 2 and the second is from state 2 to either state 1 or state 3. This feature can be taken into account by having a socalled adaptive uniformized (AU) jump process where the rate for the first jump is 3ν, for the second is μ + 2ν and, assuming μ + ν > 3ν, the rate for the third jump is μ + ν. From the third jump the rate in the AU jump process is μ + 2ν as in the standard uniformized jump process. The AU jump process has a closedform expression for the jump probabilities (it is a pure birth process) but is of course more complicated than a Poisson jump process. The advantage is that the AU jump process exhibits fewer jumps. This procedure could very well be useful for codon models where the set of states that the process can be in after one or two jumps are limited because only one nucleotide change is allowed in each state change.
In an application concerned with modeling amongsite rate variation, [27] applies the uniformization procedure (20) to calculate the transition probabilities instead of the eigenvalue decomposition method (15). [27] shows, in agreement with our results, that uniformization is a faster computational method than eigenvalue decomposition.
The presented methods are not the only ones for calculating the desired summary statistics. For example, in [5] it is suggested to determine the expected number of jumps from the direct calculation
where the infinite sum is truncated at k = 10. The problem with this approach is that it is difficult to bound the error introduced by the truncation. In UNI a similar type of calculation applies but the truncation error can be controlled.
Conclusion
Recall that EVD assumes that the rate matrix is diagonalizable and this constraint means that EVD is less general than the other two algorithms. We have shown in the Discussion how a direct calculation of the integral (4) is actually still possible but requires much bookkeeping. On top of being less general, EVD is dependent on the type of the matrix: reversible or nonreversible. We have shown how this discrepancy can make EVD slower than UNI even when the state space has size of only 4.
We found that the presented methods have similar accuracy and EXPM is the most accurate one. With respect to running time, it is not straightforward which method is best. We found that both the eigenvalue decomposition (EVD) and uniformization (UNI) are faster than the matrix exponentiation method (EXPM). The main reason for EVD and UNI being faster is that they can be decomposed into a precomputation and a main computation. The precomputation only depends on the rate matrix for EVD while for UNI it also depends on the largest time point and the matrix C. We also remark that EXPM involves the exponentiation of a matrix double in size. UNI is particularly fast when the product μt is small because in this case only a few terms in the sum (22) are needed.
Authors' contributions
PT extended the existing methods to linear combinations of statistics, implemented the algorithms and performed the testing. AH conceived the study and guided the development and evaluation of the methods. Both authors wrote the paper. All authors read and approved the final manuscript.
Acknowledgements
We are grateful to Thomas Mailund and Julien Y. Dutheil for very useful discussions on the presentation and implementation of the algorithms. We would also like to thank the anonymous reviewers for constructive comments and suggestions that helped us improve the paper.
References

Holmes I, Rubin GM: An expectation maximization algorithm for training hidden substitution models.
J Mol Bio 2002, 317:753764. Publisher Full Text

Klosterman PS, Holmes I: XRate: a fast prototyping, training and annotation tool for phylogrammars.
BMC Bioinf 2006, 7:428. BioMed Central Full Text

Kosiol C, Holmes I, Goldman N: An empirical codon model for protein sequence evolution.
Mol Biol Evol 2007, 24:146479. PubMed Abstract  Publisher Full Text

Minin VN, Suchard MA: Counting labeled transitions in continuoustime Markov models of evolution.
J Math Biol 2008, 56:391412. PubMed Abstract  Publisher Full Text

Dutheil J, Pupko T, JeanMarie A, Galtier N: A ModelBased Approach for Detecting Coevolving Positions in a Molecule.

Dutheil J, Galtier N: Detecting groups of coevolving positions in a molecule: a clustering approach.
BMC Evol Biol 2007, 7:242. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Minin VN, Suchard MA: Fast, accurate and simulationfree stochastic mapping.
Phil Trans R Soc B 2008, 363(1512):39853995. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

O'Brien JD, Minin VN, Suchard MA: Learning to count: robust estimates for labeled distances between molecular sequences.
Mol Biol Evol 2009, 26:801814. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dutheil J: Detecting sitespecific biochemical constraints through substitution mapping.
J Mol Evol 2008, 67:25765. PubMed Abstract  Publisher Full Text

Siepel A, Pollard KS, Haussler D: New methods for detecting lineagespecific selection.
Proceedings of the 10th International Conference on Research in Computational Molecular Biology (RECOMB) 2006, 190205.

Hobolth A, Jensen JL: Summary statistics for endpoint conditioned continuoustime Markov chains.
J Appl Prob 2011, 48:114. Publisher Full Text

Van Loan CF: Computing integrals involving the matrix exponential.
IEEE Transactions on Automatic Control 1978, 23:395404. Publisher Full Text

R Development Core Team: R: A Language and Environment for Statistical Computing. [http://www.Rproject.org] webcite
R Foundation for Statistical Computing

Goldman N, Yang Z: A Codonbased Model of Nucleotide Substitution for Proteincoding DNA Sequences.
Mol Biol Evol 1994, 11:725736. PubMed Abstract  Publisher Full Text

Hobolth A, Jensen JL: Statistical Inference in Evolutionary Models of DNA Sequences via the EM Algorithm.

Dempster AP, Laird NM, Rubin DB: Maximum Likelihood from Incomplete Data via the EM Algorithm.

Yap VB, Speed T: Estimating Substitution Matrices. In Statistical Methods in Mol Evolution. Edited by Nielsen R. Springer; 2005:420422.

Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach.
J Mol Evol 1981, 17:368376. PubMed Abstract  Publisher Full Text

Jensen A: Markov chains as an aid in the study of Markov processes.

MATLAB R2010a Natick, Massachusetts: The MathWorks Incorporated;

Goulet V, et al.: expm: Matrix exponential. [http://CRAN.Rproject.org/package=expm] webcite

Higham J: The Scaling and Squaring Method for the Matrix Exponential Revisited.

Stadelmann M: Matrixfunktionen. Analyse und Implementierung. In Master thesis. ETH Zurich, Mathematics Department; 2009.

Lemey P, et al.: Molecular footprint of drugselective pressure in a human immunodeficiency virus transmission chain.
J Virol 2005, 79:1198111989. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tamura K, et al.: MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods.

Van Moorsel APA, Sanders WH: Adaptive uniformization.
Stochastic Models 1994, 10:619647. Publisher Full Text

Mateiu L, Rannala B: Inferring complex DNA substitution processes on phylogenies using uniformization and data augmentation.
Systematic Biol 2006, 55:259269. Publisher Full Text