Department of Plant Systems Biology, VIB, B-9052 Ghent, Belgium
Bioinformatics and Evolutionary Genomics, Department of Molecular Genetics, Ghent University, B-9052 Ghent, Belgium
Department of Microbiology and Immunology, Rega Institute, K.U. Leuven, Kapucijnenvoer 33 blok I bus 7001, B-3000 Leuven, Belgium
Department of Applied Mathematics and Computer Science, Ghent University, Krijgslaan 281 S9, B-9000 Ghent, Belgium
Abstract
Background
Accurate modelling of substitution processes in protein-coding sequences is often hampered by the computational burden of full codon models. Codon partition models have recently been proposed as a viable alternative, mimicking the substitution behaviour of codon models at a low computational cost. Such codon partition models, however, impose independent evolution of the different codon positions, which is overly restrictive from a biological point of view. Given that empirical research has provided indications of context-dependent substitution patterns at fourfold degenerate sites, we take those indications into account in this paper.
Results
We present so-called context-dependent codon partition models to assess previous empirical claims that the evolution of fourfold degenerate sites is strongly dependent on the composition of their two flanking bases. To this end, we have estimated and compared various existing independent models, codon models, codon partition models and context-dependent codon partition models for the atpB and rbcL genes of the chloroplast genome, which are frequently used in plant systematics. Such context-dependent codon partition models employ a full dependency scheme for fourfold degenerate sites, whilst maintaining the independence assumption for the first and second codon positions.
Conclusions
We show that, in both the atpB and rbcL alignments of a collection of land plants, these context-dependent codon partition models significantly improve model fit over existing codon partition models. Using Bayes factors based on thermodynamic integration, we show that in both datasets the same context-dependent codon partition model yields the largest increase in model fit compared to an independent evolutionary model. Context-dependent codon partition models hence perform closer to codon models than codon partition models do. Codon models remain the best-performing models, but at a drastically increased computational cost, so that context-dependent codon partition models remain computationally interesting alternatives. Finally, we observe that the substitution patterns in both datasets are drastically different, leading to the conclusion that combined analysis of these two genes using a single model may not be advisable from a context-dependent point of view.
Background
While the modelling of evolutionary processes in noncoding sequences has received much attention from a context-dependence point of view in the last two decades, the same cannot be said for coding sequences, at least not in terms of developed model-based approaches. With the advent of new evolutionary models and drastic increases in computational power during the past decades, as desktop machines have become more powerful and computer clusters with large numbers of processors (and processor cores) and vast amounts of memory have become available, maximum likelihood and Bayesian MCMC approaches now allow very complex evolutionary models to be used in the analysis of large alignments.
Probabilistic modelling of sequence evolution has become the norm in phylogenetic inference, but complex evolutionary models are often not used in studies on molecular evolution, in part due to their increased computational burden but mainly due to the absence of such models in popular model testing tools
Chloroplast genes, such as atpB and rbcL (the subjects of the analyses in this paper), are protein-coding genes that are often analyzed in concatenated alignments with noncoding sequences using independent nucleotide models (see e.g.
Models of codon substitution (i.e. full codon models) consider a codon triplet as the unit of evolution and can distinguish between synonymous and nonsynonymous substitutions when analyzing protein-coding sequences
In this paper, we aim to provide an overview of current codon partition models and present an extension of these models based upon previous empirical observations. Morton
Here, we provide an overview of currently used codon partition models and assess their increase in model fit compared to the independent general time-reversible model
Methods
Data
We have selected 26 sequences from the available 34 in the work of Karol et al.
Fixed consensus tree, based on the original tree inferred by Karol et al.
Site-independent evolutionary models
Site-independent models of evolution have been the main subject of many phylogenetic studies since the inception of the model of Jukes and Cantor
Context-dependent (CD) models
Over the past two decades, various context-dependent models have been developed for analysing noncoding sequences (see
Even though protein-coding sequences can in principle not benefit from such models, we have tested the performance of our previously introduced context-dependent model
Full codon models
Goldman and Yang
Goldman and Yang
Codon Partition (CP) models
Codon partition (CP) models are nucleotide models that accommodate the differences in the evolutionary dynamics at the three codon positions (see e.g.
We here describe currently used codon partition models (along with their notation in this paper) but restrict ourselves to those models that use the general time-reversible (GTR) evolutionary model. We neither use nor describe models of invariable sites plus gamma-distributed rates (so-called "I+Γ" models), given the strong correlation between the proportion of invariable sites and the gamma shape parameter (see e.g.
A first series of models, denoted GTR_{112} and GTR_{123}, employ different general time-reversible models depending on codon position. While the GTR_{112} model groups the first and second codon positions together and hence uses 2 models (10 free parameters), the GTR_{123} model considers each codon position separately and hence uses 3 models (15 free parameters). This is also referred to as the accommodation of substitution rate bias.
These two models can be linked with one common set of base frequencies (3 free parameters), in which case we simply denote them as GTR_{112} and GTR_{123}. However, these two models can also be combined with 2 sets of base frequencies (6 free parameters) for the GTR_{112} model and 3 sets of base frequencies (9 free parameters) for the GTR_{123} model, and we denote such models with a "+F" notation, i.e. GTR_{112}+F_{112} and GTR_{123}+F_{123}. This is also referred to as the accommodation of nucleotide frequency bias.
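The parameter counts quoted above can be tallied with a short sketch. It assumes that each GTR model contributes 5 free exchangeability parameters and each set of base frequencies 3 free parameters, as stated in the text; the function name `free_parameters` and its arguments are hypothetical and only serve this illustration. Rate ratios are not counted here.

```python
# Illustrative tally of free parameters for the codon partition models
# described above. Names are hypothetical, not taken from the authors' software.

GTR_FREE_RATES = 5   # free exchangeability parameters per GTR model (10 for 2 models, 15 for 3)
FREQ_FREE = 3        # four base frequencies that sum to one


def free_parameters(n_rate_partitions, n_freq_sets=1, n_gamma=0):
    """Count free parameters for a codon partition model.

    n_rate_partitions: 2 for GTR_{112}, 3 for GTR_{123}
    n_freq_sets:       1 (shared), 2 (+F_{112}) or 3 (+F_{123})
    n_gamma:           0 (none), 1 (+Γ), 2 (+Γ_{112}) or 3 (+Γ_{123})
    """
    return (n_rate_partitions * GTR_FREE_RATES
            + n_freq_sets * FREQ_FREE
            + n_gamma)


if __name__ == "__main__":
    print("GTR_112:", free_parameters(2))
    print("GTR_123+F_123:", free_parameters(3, n_freq_sets=3))
    print("GTR_123+F_123+Γ_123:", free_parameters(3, n_freq_sets=3, n_gamma=3))
```

Under these assumptions, GTR_{112} with a shared frequency set has 10 + 3 = 13 free parameters, and GTR_{123}+F_{123} has 15 + 9 = 24.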
An assumption concerning the evolutionary behaviour of sites in an alignment that typically greatly improves model fit as well as phylogenetic tree reconstruction is that of gamma-distributed rate heterogeneity, although it is typically used for noncoding sequence data. In the case of a discrete approximation to rate heterogeneity for all positions, we use the typical "+Γ" notation. Modelling rate heterogeneity requires only one extra parameter. However, given the difference in evolutionary dynamics across codon positions, it may be important to allow the extent of rate variation to differ across codon positions. This means using two independent gamma distributions when the first and second codon positions are grouped together and using three independent gamma distributions when the three codon positions are modelled separately. We denote these distributions, which require two and three extra parameters respectively, with a "+Γ_{112}" and a "+Γ_{123}" notation.
When simply allowing for different distributions for the rate heterogeneity at different codon positions, it is assumed that the mean rate for each of these distributions is 1. As this may turn out to be unrealistic, so-called rate ratios can be added to allow for different mean rates in the gamma distributions (i.e. to allow for variable mutation rates among the different codon positions). We have used the random-rates approach presented in the work of Burgess and Yang
Context-Dependent Codon Partition (CDCP) models
While the current codon partition (CP) models (see e.g.
As a first attempt, we assume context-dependent evolution of the third codon position on its two immediate flanking bases (i.e. the second codon position of the same codon and the first codon position of the succeeding codon). The first and second codon positions, however, are still assumed to evolve independently from one another. We model this dependence by assuming that evolution at the third codon position occurs according to our full context-dependent GTR16C model
One of the main conclusions of the study of Morton
Fourfold degenerate sites
A position of a codon is said to be a fourfold degenerate site if any nucleotide at this position specifies the same amino acid. For example, the third position of the glycine codons (GGA, GGG, GGC, GGU) is a fourfold degenerate site, because all nucleotide substitutions at this site are synonymous, i.e. they do not change the amino acid. Only the third positions of some codons may be fourfold degenerate. The following amino acids have fourfold degenerate codon positions: alanine (GCX), arginine (CGX), glycine (GGX), leucine (CUX), proline (CCX), threonine (ACX), serine (UCX) and valine (GUX).
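The list of fourfold degenerate codon families above can be checked with a short sketch. It assumes the standard genetic code, written in the DNA alphabet (T in place of U) and in the conventional TCAG ordering; the function names are hypothetical and chosen for this illustration only.

```python
# Sketch: derive the fourfold degenerate codon families from the standard
# genetic code (assumption: standard code, DNA alphabet with T for U).

BASES = "TCAG"
# Amino acids for all 64 codons in TCAG order (first, second, third base).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def standard_code():
    """Return a dict mapping each codon to its one-letter amino acid ('*' = stop)."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    return dict(zip(codons, AMINO_ACIDS))


def fourfold_prefixes(code):
    """Return the two-base prefixes whose third position is fourfold degenerate."""
    prefixes = set()
    for a in BASES:
        for b in BASES:
            encoded = {code[a + b + c] for c in BASES}
            # Fourfold degenerate: all four third bases give the same amino acid.
            if len(encoded) == 1 and "*" not in encoded:
                prefixes.add(a + b)
    return prefixes


if __name__ == "__main__":
    print(sorted(fourfold_prefixes(standard_code())))
```

The eight prefixes found this way (GC, CG, GG, CT, CC, AC, TC, GT) correspond to the eight amino acids listed in the text.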
Here, we expand the context-dependent codon partition model introduced in the previous section in that the context-dependent model at the third codon position is only used for the fourfold degenerate sites, as identified at the start of each branch. Non-fourfold degenerate sites are assumed to evolve according to a separate site-independent general time-reversible model, which shares the same (single) set of base frequencies as the context-dependent model at the fourfold degenerate sites. When assuming the full context-dependent model at the third codon position, we denote this model "+FF_{16}"; when using the context-dependent model at the third codon position that is aimed at modelling a correlation to the A+T context of its two immediate flanking bases, we denote the model as "+FF_{3}".
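As a rough sketch of the bookkeeping this implies, the code below walks through an in-frame sequence at the start of a branch and, for every third codon position, reports whether the site is fourfold degenerate and which of the 16 neighbouring base combinations applies. Several details are assumptions made for illustration only: the function name, the numbering of contexts from 0 to 15, the treatment of the last codon (skipped, since it has no succeeding codon), and the reading of "+FF_{3}" as three classes given by the number of A or T bases among the two flanking bases.

```python
# Sketch: classify third codon positions by fourfold degeneracy and context.
# All names and the context numbering are hypothetical.

BASES = "ACGT"
# Two-base prefixes of the fourfold degenerate codon families (DNA alphabet).
FOURFOLD_PREFIXES = {"GC", "CG", "GG", "CT", "CC", "AC", "TC", "GT"}


def third_position_contexts(seq):
    """Return one tuple per third codon position that has a right neighbour.

    Each tuple is (site_index, is_fourfold, context16, at_count), where
    context16 indexes the (left, right) neighbouring base combination and
    at_count is the number of A/T bases among the two neighbours (0, 1 or 2).
    """
    results = []
    # The last codon is skipped: its third position has no succeeding codon.
    for start in range(0, len(seq) - 3, 3):
        codon = seq[start:start + 3]
        left = codon[1]            # second position of the same codon
        right = seq[start + 3]     # first position of the succeeding codon
        is_fourfold = codon[:2] in FOURFOLD_PREFIXES
        context16 = BASES.index(left) * 4 + BASES.index(right)
        at_count = (left in "AT") + (right in "AT")
        results.append((start + 2, is_fourfold, context16, at_count))
    return results


if __name__ == "__main__":
    for row in third_position_contexts("ATGGCTTTA"):
        print(row)
```

For the toy sequence ATGGCTTTA, the third position of ATG is not fourfold degenerate, whereas the third position of GCT (alanine) is, with flanking bases C and T.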
Ancestral root sequence distribution
In previous work on context-dependent models for noncoding sequences, we have extensively discussed the importance of an adequate ancestral root sequence distribution to estimate the evolutionary parameters of context-dependent models
The standard approach would be to use the values of the base frequencies describing the contents of the third codon position as the prior probability for the states of the third codon position (see e.g.
An extension of this approach is to use a first-order Markov chain to allow for dependence of the third codon position on the second codon position (of the same codon) in the ancestral root sequence, which requires four independent sets of base frequencies
A final approach to model context-dependence at the third codon position in the ancestral root sequence is to allow its identity to depend upon the identity of its two immediate flanking bases, i.e. the identity of the second codon position of the same codon and the identity of the first codon position of the succeeding codon. We model this "second-order" dependence through 16 sets of base frequencies
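The three root distributions differ only in how many sets of base frequencies they condition on (1, 4 or 16). The sketch below illustrates the first-order case under simplifying assumptions: it only scores the third codon positions of an in-frame root sequence, and the conditional frequencies passed in are hypothetical values, not estimates from this paper. Function names are likewise hypothetical.

```python
import math

# Sketch: log-probability of the third codon positions of a root sequence
# under a first-order dependence on the second codon position.


def root_log_prob_first_order(seq, cond_freqs):
    """Sum log P(third base | second base of the same codon) over all codons.

    cond_freqs[second][third] holds one set of base frequencies per
    identity of the second codon position (four sets in total).
    """
    total = 0.0
    for start in range(0, len(seq) - 2, 3):
        second, third = seq[start + 1], seq[start + 2]
        total += math.log(cond_freqs[second][third])
    return total


def free_root_parameters(order):
    """Free frequency parameters: 4**order sets with 3 free values each."""
    return (4 ** order) * 3


if __name__ == "__main__":
    # Hypothetical, uniform conditional frequencies for illustration.
    uniform = {b: {c: 0.25 for c in "ACGT"} for b in "ACGT"}
    print(root_log_prob_first_order("GGAGCT", uniform))
    print([free_root_parameters(k) for k in (0, 1, 2)])
```

With one, four and sixteen sets of base frequencies, the three approaches carry 3, 12 and 48 free frequency parameters respectively.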
Bayesian Markov Chain Monte Carlo using data augmentation
Bayesian inference of phylogeny is based on a quantity called the posterior probability function of a tree, in the same way as maximum-likelihood inference is based on the likelihood function. While the posterior probability is generally tedious to calculate, simulating from it is relatively easy through the use of Markov chain Monte Carlo (MCMC) methods (
Prior Distributions
Let
For each set of model frequencies of which the ancestral root sequence is composed, the following prior distribution is assumed:
As mentioned earlier, we have used the random-rates approach presented in the work of Burgess and Yang
For each codon position, the parameter
For the model parameters of each context (i.e. neighbouring base combination) independently, the following prior distribution is assumed (see e.g.
Further, branch lengths are assumed i.i.d. given
and
Results
atpB dataset
Using the model-switch thermodynamic integration scheme, we have compared all site-independent, codon partition and context-dependent codon partition models to the independent general time-reversible model (GTR). Figure
atpB dataset: Comparison of all models presented in this paper to the GTR model, when grouping the first and second codon positions together. The performance of "traditional" codon partition models (up to the dotted horizontal line) can be improved significantly by assuming context-dependent evolution at the third codon position. As can be seen from this figure, such context-dependent codon partition models systematically outperform the (independent) codon partition models in terms of model fit.
atpB dataset: Comparison of all models presented in this paper to the GTR model, when treating each codon position separately. The performance of "traditional" codon partition models (up to the dotted horizontal line) can be improved significantly by assuming context-dependent evolution at the third codon position. As is confirmed in this figure, such context-dependent codon partition models systematically outperform the (independent) codon partition models in terms of model fit. Further, the models that assume that first and second codon positions evolve according to a separate evolutionary model (in this figure) systematically outperform those models that treat first and second codon positions identically (from an evolutionary perspective).
Baele et al  Supplementary Material. File containing supplementary material and information that was not included in the main document. Word 2003 format.
The overall trend in Figures
Allowing for among-site variation ("+Γ" in both Figures
Given that up to this point, one single set of base frequencies has been used for the different codon positions, we have relaxed this assumption and have modelled the so-called nucleotide frequency bias ("+F_{112}" in Figure
Of the different ancestral root distributions proposed to model a dependency scheme at the third codon position, the first-order Markov chain (i.e., model GTR_{123}+Γ_{123}+CP_{123}+F_{123}+FF_{3}+3F) offers the largest improvement in model fit across the different model comparisons, followed by the GTR_{123}+Γ_{123}+CP_{123}+F_{123}+FF_{16}+3F model. The same pattern can be seen when grouping first and second codon positions together, but these models are consistently outperformed by those models that treat each codon position separately.
Figure
As we have shown in previous work
atpB dataset: Estimated substitution patterns for the independent evolutionary models at first, second and third codon positions and the context-dependent evolutionary model at the third codon position for the GTR_{123}+Γ_{123}+CP_{123}+F_{123}+FF_{16}+3F model. At the left of this figure, i.e. the first two columns, the independent model estimates for first (1^{st} CP), second (2^{nd} CP) and third (3^{rd} CP) codon positions are shown. At the right of the figure (columns 3 through 10), the context-dependent estimates for the third codon position are shown for each evolutionary context, i.e. each neighbouring base combination. In the figure, the substitution parameter types (i.e. rAG, rAC, ..., rTC) each have a unique colour, as indicated in the legend panel within the figure, and the same colour is used for every model (context-dependent or not).
The context-dependent substitution rates, which are grouped by the identity of the left neighbouring base, form the most interesting aspect of Figure
Table
atpB dataset: estimates per codon position for the nucleotide frequencies, among-site rate variation distribution and relative rates (for the GTR_{123}+Γ_{123}+CP_{123}+F_{123}+FF_{16}+3F model).

| CP | Frequency A | Frequency C | Frequency G | Frequency T | Among-site rate variation | CP_{123} |
|---|---|---|---|---|---|---|
| 1 | 0.2396 [0.21; 0.28] | 0.2216 [0.19; 0.26] | 0.3717 [0.33; 0.42] | 0.1671 [0.14; 0.20] | 0.1579 [0.11; 0.24] | 0.1066 [0.08; 0.14] |
| 2 | 0.2804 [0.24; 0.33] | 0.2298 [0.19; 0.27] | 0.1656 [0.13; 0.20] | 0.3242 [0.28; 0.38] | 0.0946 [0.01; 0.15] | 0.0512 [0.04; 0.07] |
| 3 | 0.3784 [0.35; 0.41] | 0.0875 [0.08; 0.10] | 0.0707 [0.06; 0.08] | 0.4634 [0.43; 0.50] | 0.8716 [0.73; 1.04] | 2.8422 [2.80; 2.88] |

Shown are mean estimates over 100,000 MCMC iterations, discarding the first 20,000 iterations as the burn-in, and the corresponding 95% credibility intervals. The values in the table clearly illustrate the different substitution dynamics at the three codon positions.
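Summaries of this kind (a posterior mean and a 95% credibility interval after discarding the burn-in) can be computed from a stored chain along the following lines. This is a minimal sketch rather than the procedure used for the table: the function name is hypothetical, and the interval is taken as the equal-tailed interval between empirical quantiles, which is an assumption since the text does not state how the intervals were obtained.

```python
import math

# Sketch: posterior mean and equal-tailed credibility interval for one
# parameter, given its sampled values in iteration order.


def summarize_chain(samples, burn_in, level=0.95):
    """Return (mean, (lower, upper)) after discarding the first burn_in samples."""
    kept = sorted(samples[burn_in:])
    n = len(kept)
    if n == 0:
        raise ValueError("no samples left after discarding the burn-in")
    mean = sum(kept) / n
    # Conservative empirical quantiles: round outwards to the nearest sample.
    lower_idx = math.floor((1 - level) / 2 * (n - 1))
    upper_idx = math.ceil((1 + level) / 2 * (n - 1))
    return mean, (kept[lower_idx], kept[upper_idx])


if __name__ == "__main__":
    # Toy chain of 100 values; the first 20 are treated as burn-in.
    toy_chain = list(range(100))
    print(summarize_chain(toy_chain, burn_in=20))
```

For the chains summarized in the table, `burn_in` would correspond to the first 20,000 of the 100,000 iterations.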
Table
atpB dataset: estimates for the first-order Markov chain distribution at the ancestral root sequence (for the GTR_{123}+Γ_{123}+CP_{123}+F_{123}+FF_{16}+3F model).

| Root: X | | | | |
|---|---|---|---|---|
| A | 0.5049 [0.38; 0.61] | 0.4089 [0.28; 0.52] | 0.0382 [0.00; 0.13] | 0.0480 [0.00; 0.16] |
| C | 0.1018 [0.00; 0.31] | 0.2481 [0.01; 0.75] | 0.1993 [0.01; 0.55] | 0.4507 [0.06; 0.79] |
| G | 0.0402 [0.00; 0.15] | 0.1610 [0.00; 0.52] | 0.1088 [0.01; 0.40] | 0.6899 [0.16; 0.94] |
| T | 0.1010 [0.01; 0.25] | 0.3925 [0.22; 0.54] | 0.4490 [0.29; 0.59] | 0.0575 [0.00; 0.20] |

Shown are mean estimates over 100,000 MCMC iterations, discarding the first 20,000 iterations as the burn-in, and the corresponding 95% credibility intervals. Despite the wide 95% credibility intervals, the estimates show why such a first-order dependence is useful.
rbcL dataset
As for the atpB dataset, we have compared all codon partition and context-dependent codon partition models to the independent general time-reversible model using the model-switch thermodynamic integration scheme. Figure
rbcL dataset: Comparison of all models presented in this paper to the GTR model, when grouping the first and second codon positions together. The performance of "traditional" codon partition models (up to the dotted horizontal line) can be improved significantly by assuming context-dependent evolution at the third codon position. As can be seen from this figure, such context-dependent codon partition models systematically outperform the (independent) codon partition models in terms of model fit.
rbcL dataset: Comparison of all models presented in this paper to the GTR model, when treating each codon position separately. The performance of "traditional" codon partition models (up to the dotted horizontal line) can be improved significantly by assuming context-dependent evolution at the third codon position. As is confirmed in this figure, such context-dependent codon partition models systematically outperform the (independent) codon partition models in terms of model fit. Further, the models that assume that first and second codon positions evolve according to a separate evolutionary model (in this figure) systematically outperform those models that treat first and second codon positions identically (from an evolutionary perspective).
The overall trend in Figures
Figure
The parameter estimates for the optimal GTR_{123}+Γ_{123}+CP_{123}+F_{123}+FF_{16}+3F model, which are shown in Figure
rbcL dataset: Estimated substitution patterns for the independent evolutionary models at first, second and third codon positions and the context-dependent evolutionary model at the third codon position for the GTR_{123}+Γ_{123}+CP_{123}+F_{123}+FF_{16}+3F model. At the left of this figure, i.e. the first two columns, the independent model estimates for first (1^{st} CP), second (2^{nd} CP) and third (3^{rd} CP) codon positions are shown. At the right of the figure (columns 3 through 10), the context-dependent estimates for the third codon position are shown for each evolutionary context, i.e. each neighbouring base combination. In the figure, the substitution parameter types (i.e. rAG, rAC, ..., rTC) each have a unique colour, as indicated in the legend panel within the figure, and the same colour is used for every model (context-dependent or not).
The observation that the overall substitution behaviour as well as the context-dependent substitution patterns at the fourfold degenerate sites differ between the atpB and the rbcL genes is not surprising as Morton
As for the atpB dataset, Table
rbcL dataset: estimates per codon position for the nucleotide frequencies, among-site rate variation distribution and relative rates (for the GTR_{123}+Γ_{123}+CP_{123}+F_{123}+FF_{16}+3F model).

| CP | Frequency A | Frequency C | Frequency G | Frequency T | Among-site rate variation | CP_{123} |
|---|---|---|---|---|---|---|
| 1 | 0.2202 [0.19; 0.26] | 0.2484 [0.22; 0.28] | 0.3536 [0.31; 0.40] | 0.1778 [0.15; 0.21] | 0.1948 [0.15; 0.26] | 0.2802 [0.23; 0.34] |
| 2 | 0.2712 [0.22; 0.32] | 0.2558 [0.20; 0.31] | 0.2071 [0.16; 0.26] | 0.2660 [0.22; 0.32] | 0.0465 [0.00; 0.11] | 0.1212 [0.09; 0.16] |
| 3 | 0.2200 [0.18; 0.26] | 0.2152 [0.19; 0.25] | 0.0634 [0.05; 0.08] | 0.5014 [0.46; 0.54] | 2.6956 [1.92; 3.84] | 2.5986 [2.53; 2.66] |

Shown are mean estimates over 100,000 MCMC iterations, discarding the first 20,000 iterations as the burn-in, and the corresponding 95% credibility intervals. The values in the table clearly illustrate the different substitution dynamics at the three codon positions. There are many differences with the estimates of the atpB dataset, shown in Table 1. The most notable differences are the among-site rate variation distribution parameter and the nucleotide frequencies for the third codon position.
Table
rbcL dataset: estimates for the first-order Markov chain distribution at the ancestral root sequence (for the GTR_{123}+Γ_{123}+CP_{123}+F_{123}+FF_{16}+3F model).

| Root: X | | | | |
|---|---|---|---|---|
| A | 0.4225 [0.33; 0.51] | 0.5464 [0.46; 0.64] | 0.0159 [0.00; 0.06] | 0.0151 [0.00; 0.06] |
| C | 0.4454 [0.23; 0.69] | 0.0111 [0.00; 0.04] | 0.0364 [0.00; 0.11] | 0.5071 [0.26; 0.73] |
| G | 0.0164 [0.00; 0.07] | 0.0147 [0.00; 0.06] | 0.1294 [0.05; 0.25] | 0.8395 [0.71; 0.93] |
| T | 0.3319 [0.17; 0.48] | 0.2020 [0.11; 0.31] | 0.2660 [0.14; 0.43] | 0.2001 [0.07; 0.34] |

Shown are mean estimates over 100,000 MCMC iterations, discarding the first 20,000 iterations as the burn-in, and the corresponding 95% credibility intervals. Despite the wide 95% credibility intervals, the estimates show why such a first-order dependence is useful. Based on the mean estimates, many differences with the estimates of the atpB dataset, shown in Table 2, can be seen.
Discussion
While nucleotide context-dependent models can offer large improvements in terms of model fit to the data as compared to independent evolutionary models, their performance has to be compared to the performance of both codon-based and codon partition (CP) models. Shapiro et al.
We have shown in this paper that the computational requirements for integrating full codon models in a thermodynamic integration framework for model comparison increase drastically. While we have focused on the model of Goldman and Yang
We have shown in this work that codon partition models that are extended with a context-dependence pattern for the third codon position across the entire underlying phylogenetic tree (so-called context-dependent codon partition models) significantly improve model fit compared to traditional codon partition models. While this work was mainly inspired by empirical findings
The approach we have taken to model context-dependent evolution at the third codon position selects one of 16 models using the neighbouring base combination at the start of each branch, which means that for a given site the context might change at each internal node of the underlying phylogenetic tree. A limiting aspect of this approach, however, is that the neighbouring base combination is assumed not to change along the length of a branch. This leads to only one in three codon positions (i.e. the third codon position and its two immediate neighbours) that can undergo substitutions, which is an assumption similar to that of full codon models (i.e. only one position in the codon can undergo substitutions). Contrary to the original context-dependent model developed for noncoding sequences
Apart from drastically increasing the number of parameters to model the evolution at the third codon position more closely, as indicated in the previous paragraph, a continuous-time approximation is often used to allow the neighbouring base combinations to evolve along the length of a branch. The approach we have taken to do this partitions each branch into parts with length no greater than 0.005
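The branch partitioning mentioned above can be sketched as follows. Only the upper bound of 0.005 per part comes from the text; splitting a branch into the smallest number of equal-length parts that respects this bound is an assumption made here for illustration, as are the function and constant names.

```python
import math

# Sketch: split a branch into parts no longer than a maximum length.
# The equal-length splitting rule is an assumption, not taken from the paper.

MAX_PART_LENGTH = 0.005


def split_branch(length, max_len=MAX_PART_LENGTH):
    """Return a list of equal part lengths, each no greater than max_len."""
    # The small tolerance guards against floating-point noise when the
    # branch length is (almost) an exact multiple of max_len.
    n_parts = max(1, math.ceil(length / max_len - 1e-9))
    return [length / n_parts] * n_parts


if __name__ == "__main__":
    parts = split_branch(0.012)
    print(len(parts), parts)
```

Along each part, the neighbouring base combination would then be held fixed, so that contexts can change between parts rather than only at internal nodes.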
The results shown in this paper hence support the notion that fourfold degenerate sites in the atpB and rbcL protein-coding genes of land plants have a substitution process that is dependent on the composition of their neighbouring bases. In the work that inspired this paper, Morton
Even though we have found that the same context-dependent codon partition model performed best amongst all considered competing models for both atpB and rbcL genes, we have shown that the (context-dependent) substitution patterns at the third codon position differ greatly between both datasets. This means that, should both genes be concatenated to form a larger alignment, different context-dependent codon partition models may need to be used for each gene to perform accurate phylogenetic reconstruction for such a concatenated alignment. The substitution patterns discussed in this paper hence provide additional indications that great care needs to be taken in the analysis of concatenated genes.
As we have introduced the concept of context-dependent codon partition models in this paper, we have not yet undertaken an attempt to perform phylogenetic inference on protein-coding sequences using these models as the relationship between (increases in) model fit and the ability to accurately reconstruct phylogenetic trees is intricately complex
The computational burden of our context-dependent codon partition models is, as for standard CP models, much lower than for actual codon models due to the decrease in number of parameters and the easier computation of eigenvalues and eigenvectors of the substitution matrices. Indeed, the spectral decomposition of the codon probability matrix (i.e. model) as well as the computation of its powers is considerably slower than for a nucleotide model. Computational effort in computing the eigenvalues and eigenvectors of a matrix rises as the cube of the number of rows or columns, hence the effort is 61^{3}/4^{3} ≈ 3,547 times greater for a codon than for a nucleotide model
Codon partition models (context-dependent or not) replace the high dimensionality of codon models by a series (or combination) of low-dimensional matrices, for which spectral decomposition requires much less computational effort.
The computational differences listed in the above paragraph are theoretical, however, since other computations (such as data augmentation) often need to be performed alongside eigenvalue decompositions, and these do not require calculation of new eigenvalues. We have hence measured the computation time for the different classes of model presented in this paper and have listed the results in Table
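The cubic scaling argument can be made concrete with a few lines of arithmetic. The sketch below only reproduces the ratio quoted in the text and, as a hypothetical extension, the ratio against a partition model that decomposes several 4 × 4 matrices; the function name and the `n_small_matrices` argument are illustrative assumptions.

```python
# Sketch: relative cost of eigendecomposition under cubic scaling.


def cubic_cost_ratio(n_large, n_small, n_small_matrices=1):
    """Cost of one n_large x n_large decomposition relative to
    n_small_matrices decompositions of size n_small x n_small."""
    return n_large ** 3 / (n_small_matrices * n_small ** 3)


if __name__ == "__main__":
    # One 61 x 61 codon matrix versus one 4 x 4 nucleotide matrix.
    print(cubic_cost_ratio(61, 4))
    # Versus three separate 4 x 4 matrices, one per codon position.
    print(cubic_cost_ratio(61, 4, n_small_matrices=3))
```

Even when several nucleotide matrices have to be decomposed, the codon matrix remains roughly three orders of magnitude more expensive under this scaling.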
Computational demands: number of iterations and computation time for the comparison of the different models in the thermodynamic integration framework.

| Model | Iterations | System | Time | Time (corr.) |
|---|---|---|---|---|
| CP | 1,020,000 | 4-core Opteron | 7d 23h | - |
| CDCP | 1,050,000 | 4-core Opteron | 16d 7h | - |
| GY94 | 300,000 | 4-core Opteron | 57d 21h | 203 days |
| GY94+Γ | 150,000 | 8-core Xeon | 29d 21h | 210 days |

The first column shows the type of model being considered: CP (codon partition), CDCP (context-dependent codon partition) or GY94 (full codon model), with the most parameter-rich model in each of the CP and CDCP classes being chosen to provide the computational estimates. The second column contains the number of Bayesian MCMC iterations for the comparison of each class of models vis-à-vis the GTR model in the thermodynamic integration framework. Given the increased computational demands, fewer iterations were run for the GY94 model, which is reflected in the annealing and melting estimates being further apart (see Results section). The third column lists the computer architecture on which the calculations were run. The fourth column contains the computation time for the model comparisons if they were executed on a single processor core. Finally, the fifth column provides an estimate of the computation time required for the GY94 model should an equal number of iterations be run as for the CP and CDCP models.
Conclusions
Designing accurate context-dependent models is a complex process, with many different assumptions that require testing using an accurate procedure for model testing, which is computationally very demanding. In this paper, we show that current codon partition models may benefit from allowing the evolution of fourfold degenerate sites to depend upon the two immediate neighbouring sites. Hence, the models we present here do not simply present an intermediate step between codon partition models and full codon models, given that the dependency patterns studied transcend the codon boundaries. Consequently, dependencies are inferred that cannot be inferred even from high-dimensional full codon models.
Our analyses of the atpB and rbcL coding regions of a dataset of land plants show that the context-dependent codon partition models presented in this paper significantly outperform current codon partition models. Further, even though for both datasets the same model is selected as yielding the highest increase in model fit, the parameter estimates indicate different substitution patterns in these datasets from a context-dependent point of view. Additional datasets will need to be analyzed to further study these substitution patterns and the way they differ among protein-coding genes.
Authors' contributions
GB initiated the study, designed the context-dependent codon partition models and the different ancestral root distributions accompanying these models, performed all the analyses, programmed the software routines and wrote a first complete version of the manuscript. YVdP contributed biological expertise to the analyses. SV contributed statistical expertise to the analyses and edited the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We would like to thank three anonymous reviewers for helpful comments concerning a first version of the manuscript. Yves Van de Peer acknowledges support from an Interuniversity Attraction Pole (IUAP) grant for the BioMaGNet project (Bioinformatics and Modelling: from Genomes to Networks, ref. p6/25). Stijn Vansteelandt acknowledges support from IAP research network grant nr. P06/03 from the Belgian government (Belgian Science Policy). We acknowledge the support of Ghent University (Multidisciplinary Research Partnership "Bioinformatics: from nucleotides to networks"). The research leading to these results has received funding from the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013)/ERC Grant agreement n° 260864.