Email updates

Keep up to date with the latest news and content from BMC Evolutionary Biology and BioMed Central.

This article is part of the supplement: First International Conference on Phylogenomics

Open Access Methodology

Methods for selecting fixed-effect models for heterogeneous codon evolution, with comments on their application to gene and genome data

Le Bao1, Hong Gu1, Katherine A Dunn2 and Joseph P Bielawski12*

Author Affiliations

1 Department of Mathematics and Statistics, Dalhousie University Halifax Nova Scotia, Canada

2 Department of Biology, Dalhousie University, Halifax Nova Scotia, Canada

For all author emails, please log on.

BMC Evolutionary Biology 2007, 7(Suppl 1):S5  doi:10.1186/1471-2148-7-S1-S5

The electronic version of this article is the complete one and can be found online at:


Published:8 February 2007

© 2007 Bao et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Models of codon evolution have proven useful for investigating the strength and direction of natural selection. In some cases, a priori biological knowledge has been used successfully to model heterogeneous evolutionary dynamics among codon sites. These are called fixed-effect models, and they require that all codon sites are assigned to one of several partitions which are permitted to have independent parameters for selection pressure, evolutionary rate, transition to transversion ratio or codon frequencies. For single gene analysis, partitions might be defined according to protein tertiary structure, and for multiple gene analysis partitions might be defined according to a gene's functional category. Given a set of related fixed-effect models, the task of selecting the model that best fits the data is not trivial.

Results

In this study, we implement a set of fixed-effect codon models which allow for different levels of heterogeneity among partitions in the substitution process. We describe strategies for selecting among these models by a backward elimination procedure, Akaike information criterion (AIC) or a corrected Akaike information criterion (AICc). We evaluate the performance of these model selection methods via a simulation study, and make several recommendations for real data analysis. Our simulation study indicates that the backward elimination procedure can provide a reliable method for model selection in this setting. We also demonstrate the utility of these models by application to a single-gene dataset partitioned according to tertiary structure (abalone sperm lysin), and a multi-gene dataset partitioned according to the functional category of the gene (flagellar-related proteins of Listeria).

Conclusion

Fixed-effect models have advantages and disadvantages. Fixed-effect models are desirable when data partitions are known to exhibit significant heterogeneity or when a statistical test of such heterogeneity is desired. They have the disadvantage of requiring a priori knowledge for partitioning sites. We recommend: (i) selection of models by using backward elimination rather than AIC or AICc, (ii) use a stringent cut-off, e.g., p = 0.0001, and (iii) conduct sensitivity analysis of results. With thoughtful application, fixed-effect codon models should provide a useful tool for large scale multi-gene analyses.

Background

The ratio dN/dS (ω) has proven a valuable index of the strength and direction of selection pressure. Because genetic data are typically subject to a diversity of evolutionary constraints, estimating ω as an average over many sites diminishes the effectiveness of this approach [1]. Statistical power is substantially improved, however, by accommodating variable selection pressures among sites (e.g., [2-4]). We follow Kosakovsky Pond and Frost [5] by placing such methods in three groups: (i) the counting methods, which estimate ω from counts of substitutions at individual sites (e.g., [3]), (ii) the random-effect models, which assume a parametric distribution of variability in the ω ratio across sites (e.g., [2]), and (iii) the fixed-effect models, which assume sites can be assigned a priori to different partitions [4]. The most generalized form of the fixed-effect models treats each site as an independent partition [5,6].

The recent growth of genome scale sequencing projects has sparked interest in using codon models to study mechanisms of innovation and functional divergence in genome-scale datasets [7]. Although the fixed-effect models were developed for analysis of multiple partitions of sites within a single gene, they are also appropriate for joint analyses of multi-gene datasets [4,8]. Fixed-effect models categorize codon sites into different classes which are allowed to have heterogeneous evolutionary dynamics, and such partitions are easily delineated on the basis of complete gene sequences. Moreover, by partitioning genes according to criteria such as their functional category, or role in a metabolic pathway, the fixed-effect models provide a statistical framework for making use of such information when analysing multi-gene datasets.

Yang and Swanson [4] introduced six fixed-effect models (Table 1) based on the codon model of Goldman and Yang [9]. The simplest model (A) assumes that the pattern of substitution is homogeneous over all sites; i.e., there are no partitions under model A. Branch lengths are included as parameters of the model. The most complex model (F) treats the different site partitions as independent datasets, having independent model parameters. As it involves a substantial increase in branch length parameters, model F is not recommended for datasets with many partitions [4]. The remaining four models (B-E in Table 1) lie between A and F in complexity. These four models scale the branch lengths of k partitions according to the parameter ck, which is a multiple of the branch lengths of the first partition; hence c1 = 1. Models B through E differ in their treatment of parameters ω, κ (transition to transversion ratio) and π (codon frequencies) among partitions (Table 1). We implemented 11 more fixed-effect models, which represent all the remaining combinations of heterogeneity or homogeneity among partitions for the parameters c, ω, κ and π (Table 2). A full description of the fixed effect models and the details of our implementation are presented in the methods section. Hereafter we refer to the complete set of fixed-effect (FE) models by using the revised notation shown in table 2 (FE1–FE16). Note that a capacity to specify fixed-effects under the alternative formulation of Muse and Gaut [10] is available through the program HyPhy [11], although it has not been documented.

Table 1. Fixed-effect models implemented by Yang and Swanson [4].

Table 2. An expanded set of fixed-effect models.

Given a related set of fixed-effect models (Figure 1), one is immediately faced with the non-trivial task of selecting the model that best fits the data in hand. Likelihood ratio tests (LRTs) have been shown to be a powerful and reliable means of testing site specific heterogeneity in selective pressure [8,12]. However, Figure 1 illustrates that there are 32 possible nested comparisons of models. It is not desirable to conduct 32 LRTs because computational costs are expensive for datasets with too many sequences or partitions. A popular method of model selection based on LRTs is "backward elimination" [13-15]. Backward elimination reduces a comparatively complex model to a simpler one in a step-by-step fashion. An alternative to "backward elimination" is the Akaike Information Criterion (AIC) [16], where the model with the smallest AIC score is chosen as the ideal model. For a small sample correction, typically when the number of observations is less than 40 times the number of parameters in the model [17], we borrowed the corrected Akaike Information Criterion (AICc) originally developed by Hurvich and Tsai [17] for regression settings.

thumbnailFigure 1. Relationships among fixed-effect codon models. The most complex model (FE1) is located at the top and a completely homogenous model (FE16) is located at the bottom. Parameters heterogeneous among partitions for a given model are shown after the model name. Lines between models indicate "1-step" differences in complexity among the models.

Although the statistical issues surrounding model selection are well known within the field of molecular evolution [18-20], the established statistical techniques have not been applied to the fixed-effect setting. In this study we used computer simulation to evaluate the performance of backward elimination, AIC and AICc for selecting an optimal model from an array of models specifying different levels of heterogeneity among partitions. We then illustrated the application of these methods on two real datasets. The first was comprised of the buried and exposed sites of the abalone sperm lysin gene; this lysin partition was one of the original test cases of Yang and Swanson [4]. For the second case, we examined the evolutionary heterogeneity of a multi-gene dataset; the region of the genome encoding all the components of the flagellar system of Listeria species and several proteins of unknown function.

Results

Simulated data

A simulation study was used to measure the accuracy of fixed-effect model selection. We simulated under the 16 different scenarios for heterogeneous codon evolution among data partitions shown in table 2 (see methods for a detailed description of the simulation study), and measured the number of cases where each procedure identified the correct generating model. The backward elimination procedure uses the likelihood ratio test (LRT) to simplify a complex model one parameter at a time; in this case we start at the top of Figure 1 (FE1) and use the LRT to remove non-significant parameters in a step-wise fashion. A more detailed description is provided in the methods. When we applied the LRT under a cut-off probability of 0.05, the backward elimination procedure provided more accurate model specification than either AIC or AICc in all the cases except for model 2 (Table 3). Among all 336 datasets, the accuracy of backward elimination was 78% whereas the accuracy of AIC and AICc was 63% and 64% respectively (Table 3). Note that each model can be related to all other models by the number of connections, or "steps", between them in Figure 1. For all models that are wrongly specified by backward elimination, most were just one step away from the true model (85%). Among these 1-step wrong models, there was a bias in the direction of greater complexity for one of ω, κ or c; replicates heterogeneous for these parameters were never misclassified as homogenous. Taken over all replicates homogenous for ω, κ or c, this bias was generally low, with 1-step error rates of 13%, 9% and 9% respectively.

Table 3. Accuracy of model selection under backward elimination, AIC and AICc. Letters in parentheses indicate the model codes formerly used by Yang and Swanson [4].

Similar results were observed for AIC and AICc. Most misclassifications were 1-step errors (77% and 78%), with a bias in the direction of greater complexity for parameter ω, κ or c. Again, heterogeneous replicates were not misclassified as homogenous for these parameters. The 1-step error rates across replicates homogenous for ω, κ or c were 28%, 18% and 26% for AIC, and 26%, 17% and 22% for AICc.

As the number of misclassification errors ≥2-steps was much smaller than the number of 1-step errors, we examined these as an average over backward elimination, AIC and AICc. In 90% of the cases these errors resulted in too simple a model. The involved parameters were ω, c and π; the κ parameter was rarely misclassified.

We simulated under two models of codon frequencies: (i) unbiased (πi = 1/61) and (ii) biased frequencies taken from empirical frequencies of the lysin gene. In composite datasets with a 90:10 partition the number of codons in the smaller partition is too low for reliable empirical estimation of 61 different codon frequencies ("F61" method). Hence, in only those cases we used the "F3 × 4" method, which computes codon frequencies from nucleotide frequencies at the three positions of the codon [9]. In the 50:50 and 70:30 datasets we used the empirical estimates of codon frequencies (F61) in each partition. We note that such empirical estimates do not satisfy the requirements of LRT [21] and, hence, the backward elimination procedure. For backward elimination the 1-step error rate for incorrectly specifying heterogeneous π was 6%, and for incorrectly specifying homogenous π was 14%, indicating a greater tendency towards too simple a model. For AIC and AICc, the misspecification of π was almost entirely for too simple a model. Note that most of these errors were made in the 90:10 datasets, suggesting that misspecification of codon frequency heterogeneity is mainly due to large empirical estimation-errors of codon frequencies due to the insufficient information of small partitions. Thus power is lowest to identify heterogeneity in codon frequencies when a partition consists of a small number of codon sites. We anticipate that power also will be low in larger partitions of real datasets where the difference among partitions is not as great as in our simulations.

Next we investigated the possibility of tuning the cut-off p-value of the backward elimination procedure to improve the accuracy of model specification. We evaluated accuracy for cut-off p-values of 0.01, 0.001 and 0.0001. Substantial improvements were obtained, with average accuracy increasing from 78% (under the original cut-off p-value of 0.05) to 83%, 87% and 88% (Table 3) respectively. Under a cut-off value of 0.0001, 39 models were misspecified, with 33 being too simple with respect to codon frequencies in the 90:10 datasets. Among the 6 remaining misspecified models, 4 were too complex for ω and 2 were too complex for π. All but one of the misspecified models under cut-off p = 0.0001 were one-step errors. Based on these findings we used a cut-off p = 0.0001 in our application of these models to real data.

Abalone sperm lysin gene

Abalone sperm lysin is a reproductive protein well known for rapid evolution under strong diversifying selection [22]. We partitioned the lysin dataset into the same set of 46 buried sites and 88 solvent-exposed sites as in Yang and Swanson [4] and applied backward elimination (cut-off p = 0.0001), AIC and AICc to select among the full set of fixed-effect models. Under backward elimination, we used the likelihood ratio test to compare FE1, which assumes different κ, ω, c, and π's for buried and exposed sites, with those nested models at the next level in Figure 1 (FE2, FE3, FE5 and FE9). Each model at the next level assumes one of these four parameters (κ, ω, c or π) is homogeneous among site classes. FE9 assumes homogenous κ for both buried and solvent-exposed sites, and the likelihood ratio statistic comparing FE1 against FE9 is 2 × (4474.75-4473.88) = 1.74, which is not significant (d.f. = 1; p = 0.1871). As all other LRTs at this level are significant, we simplified our model according to κ and compared FE9 to those models nested at the next level in Figure 1 (FE10, FE11 and FE13). As subsequent LRTs involving FE9 were significant, model FE9 was selected by backward elimination. We note that even when we use a cut-off p = 0.05, we still select FE9 by backward elimination. Table 4 illustrates that FE9 is also selected by using AIC or AICc.

Table 4. Likelihood scores, parameter estimates, ΔAIC and ΔAICC scores for the abalone sperm lysin gene under codon models with two fixed partitions.

Yang and Swanson [4] conducted LRTs of the subset of models shown in Table 1 and found that Model E (FE1) provided the best fit to the lysin data. FE1 and FE9 are qualitatively similar in suggesting heterogeneity in ω, c and π's among the buried (b) and solvent exposed (e) sites. Moreover, these models provide similar quantitative estimates of the strength of selection and rate of evolution in these two partitions (FE1: ωb = 0.45, ωe = 1.28, cb = 1, ce = 2.54; FE9: ωb = 0.43, ωe = 1.31, cb = 1, ce = 2.56). Both models suggest that buried sites are evolving under strong purifying selection and exposed sites under diversifying selection. Note that Yang and Swanson [4] used a model that specified heterogeneous κ's (FE1), as it was not possible to test for heterogeneity in κ's independently of ω's (Table 1). As estimates of ω were very similar under FE1 and FE9, and κ's for partitions under FE1 were very similar (κb = 1.9 and κe = 1.5), the use of FE1 was not problematic in this case.

Components of the Listeria flagellar system

Listeria are gram positive rod shaped bacteria which are motile between 4°C and 30°C, and grow in a wide range of pHs, temperatures, and osmotic pressures. The natural habitat of Listeria is thought to be soil rich in decaying matter; however, Listeria monocytogenes is an important food-borne pathogen of humans and animals capable of both a free-living and intracellular lifestyle. Interestingly, the motility of Listeria monocytogenes is thermoregulated, being reduced above 30°C, and completely shut down above 37°C [23], temperatures which correspond to their host intracellular environment. The consensus opinion is that the shut down of expression of flagellar related proteins, thereby shutting down motility, is an adaptation to avoid recognition by the hosts innate immune system [24]. Specifically, recognition of the flagellin protein, a product of the flaA gene, activates the host inflammatory responses through Toll-like receptor 5 (TLR5) [25].

Twenty-eight genes encoding putative flagellar related proteins, including flaA, are located together in the genome of Listeria. Several proteins having functions unrelated to flagellar machinery, or unknown functions, also are encoded in this region of the genome. We analysed the genes from this region with three issues in mind: (i) to test for heterogeneous evolutionary dynamics among genes, (ii) to examine the evolutionary dynamics of proteins with unknown function and determine if they have any similarities to flagellar machinery proteins or proteins having unrelated functions, and (iii) to compare selection pressures on flaA with other genes known to encode flagellar related proteins. We note that thermoregulation of motility is not always perfect, thereby raising the possibility that the host's innate immune system is occasionally able to recognize flagellin [24]. This would set up selection pressure for a "co-evolutionary chase" between host and pathogen, leading to an elevated rate of nonsynonymous evolution in flaA.

Our dataset was comprised of 37 of the 43 genes (lmo0675 – lmo0717) located contiguously within the genomes of 5 lineages of Listeria. Two genes (lmo0684 and lmo0711) were excluded because their gene trees were incompatible with the genome-tree. Four genes (lmo0677, lmo0698, lmo0709 and lmo0712) were excluded because they were less than 100 codons long. Next we partitioned the genes according to functional category: (i) flagellar machinery (7973 codons), (ii) non-flagellar functions (1427 codons) and (iii) unknown functions (2308 codons). We then applied backward elimination (cut-off p = 0.0001), AIC and AICc to select among the full set of fixed-effect models (Table 5). Unlike the lysin example above, the model selected by backward elimination (FE9) differed from the model selected by AIC and AICc (FE10). Both FE9 and FE10 indicate heterogeneity in c and ω, and homogeneity in κ among partitions. They differ in that FE9 specifies heterogeneous codon frequencies and FE10 does not. Clearly, the genes in this region of the Listeria genome are subject to heterogeneous evolutionary dynamics.

Table 5. Likelihood scores, parameter estimates, ΔAIC and ΔAICC scores for genes located in the regions of the Listeria genome encoding putative flagellar related proteins.

Interestingly, genes encoding proteins of unknown function had levels of selection pressure (FE9: ω = 0.036; FE10: ω = 0.038) highly similar to genes encoding proteins known to comprise the flagellar machinery (FE9: ω = 0.016; FE10: ω = 0.018), whereas those genes that do not encode components of the flagellar machinery were evolving under substantially higher relative rates of amino acid substitution (FE9: ω = 0.11; FE10: ω = 0.11). Genes encoding several components of the flagellar machinery (FliO, FliJ, FliT, FlgM, and FliK) are present in other bacilli but unaccounted for in Listeria. We used BLAST to compare the present set of unknown proteins to other bacilli and found one case (lmo0715) that was similar to a known flagellar protein (FliH). We note that the KEGG database [26] has annotated lmo0715 as a putative flagellar assembly protein. Based on genome location and levels of selection pressure, we suggest that the "unknown" genes in this dataset represent the best candidates for the unaccounted components of the Listeria flagellar machinery. Genes that evolve at high rates can be difficult to identify [27]; however this is not the case here, as estimates of ω indicate a relatively slow rate of nonsynonymous evolution. If these genes indeed encode the missing components of the Listeria flagellar machinery, we speculate an ancestor of Listeria might have acquired them via an LGT event.

The rapidly-evolving non-flagellar genes encoded (i) a protein involved in regulating chemotaxis (lmo0691), (ii) a chemotaxis-related sensory protein (lmo0692), (iii) a cell surface protein (lmo0701), and (iv) a phage-related protein similar to transglycosylase (lmo0717). To further investigate the evolutionary dynamics of these genes we applied the full set of fixed-effect models to them, with each having a unique partition. Again, the model selected by backward elimination (FE9) differed from the model selected by AIC and AICc (FE10). Results under FE9 (Table 6) and FE10 are consistent in suggesting heterogeneous ω among them, with one gene, the cell surface protein, having a substantially higher relative rate of nonsynonymous evolution. Interestingly, a genome wide survey of Listeria genes reveals that, in general, cell-surface proteins exhibit accelerated evolutionary rates as compared to housekeeping genes (unpublished data).

Table 6. Likelihood scores and parameter estimates obtained under the model selected by backward elimination for subsets of the genes located in the regions of the Listeria genome encoding putative flagellar related proteins.

Lastly, we investigated the evolutionary dynamics of flaA as compared to those genes known to encode flagellar related proteins. We applied the full set of models to this subset of proteins, with flaA having a unique partition and the remaining 23 proteins placed in a second partition. In this case backward elimination, AIC and AICc selected model FE13. This indicates that, despite heterogeneity in both c and π, selection pressure on flaA does not differ significantly from the average for genes encoding a flagellar related protein. This result supports the hypothesis that thermoregulation of motility remains an effective adaptation to avoid recognition by the host's innate immune system [24], despite sometimes less than perfect control over gene expression.

Discussion

The simulation results show that under a cut-off p-value = 0.05 the likelihood ratio test is more accurate than AIC and AICc. With the exception of the π parameters, AIC and AICc chose overly complex models more frequently than did the backward elimination procedure. For the π parameters, AIC and AICc chose overly simple models more frequently than did backward elimination. The difference lies in the different cut-off values that are used to penalize the more complex model. Take the heterogeneity test of κ as an example (df = 1), the LRT statistic is defined as twice the difference in log likelihood values between a pair of nested models: Λ = 2 × (ln L( | x) - ln L(θ0 | x)). Based on the LRT under a significance level of 0.05, we reduce the complexity of the model if Λ is smaller than the critical value 3.84. Under AIC we choose a simpler model only when Λ < 2; hence, AIC tends toward more complex models. However when we reduce a model by more than 7 parameters (e.g. homogeneous π's versus heterogeneous π's, with d.f. = 9), the critical value becomes 16.92 for the LRT, which is less than the critical value of Λ under AIC, 18. In this case, AIC will select the same or simpler model than the LRT. Note that AICc compares Λ with 2 × k × n/(n-2) which is always greater than 2 × k used by AIC, hence AICc will always choose the same or simpler model than AIC. This property of AICc had only a small effect on the results of model selection, as AICc performed only slightly better than AIC but substantially worse than backward elimination.

LRT statistics involving parameters such as ω appear to be asymptotically χ2 distributed for random-effect codon models; such models employ a parametric distribution (e.g., the β distribution of Models M7 and M8 [2]) to accommodate among site variability in the ω ratio. However, Aagaard and Phillips [8] reported that for comparisons of Yang and Swanson's [4] models C and E (FE13 and FE1 in Figure 1), the empirical distribution of LRT statistics deviated from the expected χ2 distribution, leading to a type I error rate in excess of that specified by the level of the test. The results of our simulation study suggest this bias might affect all tests in Figure 1 involving parameters κ, ω and c. Several authors have noted that LRT statistics derived from models that employ empirical estimates of nucleotide or codon frequencies might not be well approximated by a χ2 distribution [8,28]. Moreover, when Aagaard and Phillips [8] repeated their simulation study under equal codon frequencies and computed LRT statistics by using models with frequency parameters fixed to the true values (πi = 1/61), they found that the LRT statistics matched the χ2 expectation. Aagaard and Phillips [8] suggested that the approximation of codon frequencies is the source of the observed bias in the LRT.

Indeed, empirical estimates do not satisfy the requirements of LRT [21], and consequently the backward elimination procedure. To further investigate the impact of empirical estimates on model selection, we reanalysed all the simulated datasets under the true codon frequencies; i.e., those used to generate the data. We note that for a given dataset the empirical codon frequencies yielded higher likelihood scores than did the true frequencies. This was expected, as empirical estimation will "pick up" some of the sampling errors in each simulation replicate. We found that the accuracy and bias of backward elimination, AIC and AICc under the true codon frequencies were identical to those when empirical codon frequencies were used. This suggests that bias in the model selection procedures used here did not arise from empirical estimation of π's alone.

There are several possible explanations for the bias of all three model selection methods in the direction of greater complexity for one of ω, κ or c. One possibility is that potential non-independence among the values for different parameters means that AIC might not be a good approximation to the Kullback-Leibler divergence, and that the requirements for the χ2 approximation might not be met for the LRT, thus the degree of freedom is not accurate. Also, backward elimination may find a local optimum solution. Under backward elimination the cut-off p-value is subjectively decided before the tests, leading to the possibility that the procedure will stop too early and suggest an overly complex model. This phenomenon is sometimes seen in the regression context. Clearly, these issues require further attention; in the mean time we explored the possibility of tuning the cut-off p-value in order to improve the performance of backward elimination (see also [8]). After evaluating several cut-off values for p, we found that a substantial improvement in performance was obtained by using p = 0.0001. Moreover, we found that by using this cut-off, nearly all the tendency to select an overly complex model for ω, κ or c was avoided, and that errors for selecting overly-simplistic models happen mostly for datasets where one of the partitions was comprised of a very small number of codon sites.

Our application of fixed-effect models to real data was encouraging, having uncovered previously unrecognized heterogeneity among Listeria genes and among sites within the abalone sperm lysin gene. However, if the objective is only to identify individual positive selection sites within a gene, the a priori structural information is not likely to serve as a perfect proxy for those partitions most relevant to differences in selection pressures. For example, Yang and Swanson [4] showed that the exposed sites of lysin likely include both conserved and positively selected codon positions. Hence, averaging ω over all sites in the exposed partition yields a reduced estimate of positive selection pressure. We note this effect was the same under the best-fit model, FE9, as under FE1 (Model E) used by Yang and Swanson [4]. We agree with Yang and Swanson [4] in anticipating that the power of random-effect models to test the strength and direction of selection pressure at sites within genes will be greater than fixed-effect models in most cases.

If the objective is to investigate heterogeneous evolution among genes, as in genome-scale analyses, then fixed-effect models are useful. The present set of models represents only a small step towards genome-scale evolutionary models. For example, decoupling synonymous and nonsynonymous rates, as in the random-effect model of Kosakovsky Pond and Muse [29], would allow users freedom to model gradients in synonymous substitution rates along a genome while allowing independent variability in nonsynonymous rates among genes. Yang and Swanson [4] made several suggestions, including the intriguing idea of enforcing a molecular clock for synonymous changes and leaving nonsynonymous changes unconstrained. We predict that joining fixed-effect codon models and data-mining methods to obtain new methods analogous to model based clustering [30,31] could provide extremely useful tools for genome-scale data analysis. Lastly, there is growing interest in both the performance of codon models [32,33] and the impact of heterogeneity among genes [34,35] in multi-gene phylogenetic analysis; improved ability to model among-gene heterogeneity at the codon level could improve their utility for comparing alternative phylogenomic hypotheses.

Conclusion

Random- and fixed-effect codon models have unique advantages and disadvantages. Random-effect models are desirable when there is no a priori knowledge by which sites might be partitioned, or when only a few sites are expected to comprise a partition of interest (but see [5]). Their disadvantage is that models for heterogeneity among sites in features such as the transition to transversion ratio (κ) and equilibrium codon frequencies (π) are unavailable. Fixed-effect models are desirable when data partitions are known to exhibit significant heterogeneity in parameters such as c, κ or π, or when a statistical test of such heterogeneity is desired. The disadvantage here is that any uncertainty in the site partition is not accommodated.

The growing importance of phylogenomics and metagenomics (e.g., [36,37]) will lead to a greater need for models suitable for testing hypotheses, and estimating rates and patterns of evolution, in large multi-gene datasets. Although considerable development remains to be done, we believe the present set of models will find many useful applications provided that results are interpreted with the inherent limitations of the methods in mind. In particular we note: (i) power can be low (see also [8]), particularly when partitions are small, (ii) the accuracy of the partitions may influence the results of model specification and (iii) the tree topology is assumed to be known without error. For the time being we make the following recommendations: (i) select among models by using backward elimination rather than AIC or AICc, (ii) use a stringent cut-off p-value; p = 0.0001 seems appropriate, and (iii) sensitivity analysis should be included in an investigation. Sensitivity of results should be investigated for robustness to tree topology and model of codon frequencies. We note that by using Akaike weights [16] to quantify the evidence in favour of a model, estimates of parameters could be obtained that accommodate model uncertainties (e.g., [19]). Where practical, we recommend that sensitivity to alternative data partitions also should be explored. Lastly, any complex model can have convergence problems or implementation errors; one must always inspect the parameter estimates for atypical results. With thoughtful application, fixed-effect codon models should provide a useful tool for large scale multi-gene analyses.

Methods

Fixed-effect models of codon evolution

The basic codon model [9,10] assumes that the process of substitutions from one codon to another is a Markov process, where the next codon state only depends on the present state, and not on any past state. The element Pij(t) in the transition matrix P(t) gives the probability of going from codon i to codon j during time t. Because they do not occur within a functional protein-coding gene, the three stop codons, UAA, UAG and UGA, are excluded. The transition matrix P(t) can be calculated by P(t) = eQt, where Q = {qij} is a 61 × 61 rate matrix. The element qij denotes the instantaneous substitution rate from codon i to codon j as follows:

When a change among codons involves a transition, the rate is multiplied by κ, the transition/transversion rate ratio. In the same way, if the substitution is nonsynonymous, the rate is multiplied by ω, the nonsynonymous/synonymous rate ratio. Usage of codons within a gene can also be highly biased, and consequently the rate is multiplied by the equilibrium frequency of the targeted codon πj, which is assumed to remain unchanged between generations of the evolutionary process. Given prior information by which sites can be partitioned into classes, parameters such as ω, κ and π can be allowed to differ among partitions, with different Q matrices used for different partitions [4]. Independence among all codon sites is assumed; hence the log likelihood of the complete dataset is the sum of the log likelihood of each site [4].

For all fixed-effect codon models described in this paper (Table 2) the tree topology is fixed a priori and, with the exception of the codon frequencies, the parameter values are estimated by numerical maximization of the likelihood function. Codon frequencies are empirically estimated from the data. For parameters heterogeneous among partitions, we use the maximum likelihood estimation implemented by Yang and Swanson [4]. For the models with identical substitution rate, we simply fix the branch length ratio c at 1 across partitions. For models with homogeneous κ or ω, we use an algorithm similar to the Expectation-Maximization (EM) algorithm [38]. Let's take a single parameter, say ω, as an example of a homogenous parameter. We first independently estimate the parameters in each partition by maximum likelihood. At the "E-step" of the algorithm we obtain the weighted average of ω over all partitions, where the weight is given by the proportion of codon sites in the corresponding partition. At the "M-step" we re-estimate parameters heterogeneous over partitions after fixing ω to its weighted average value. Following this we run the "E-step" by fixing the parameters assumed to be heterogeneous, and re-estimating ω from each partition; an updated estimate of ω is again obtained as the weighted average. The E- and M-steps are run iteratively until successive estimates of ω converge.

Model selection among a set of related fixed-effect models

Backward elimination reduces the most complex codon model, shown at the top of Figure 1, to a simpler one in a stepwise fashion. We begin from model FE1, which assumes different κ, ω, c, and π's for different site classes, and then compare it with simpler models which assume one of these four parameters to be homogeneous (see next level in Figure 1). For example, if the hypothesis of homogeneous κ is not rejected, we will go to FE9 at the next level in Figure 1, which assumes different ω, c, π's but the same κ for different site classes. We then compare FE9 with its nested models at the next level (see FE10, FE11 and FE13 in Figure 1). If more than one homogenous model cannot be rejected by the LRTs at a given level, the backward elimination procedure will choose the model with the largest p-value in the LRT.

Akaike Information Criterion (AIC) is based on minimizing the expected Kullback-Leibler divergence [39,40]; AIC = -2 × ln L( | x) + 2 × k, where k denotes the number of free parameters in the candidate model. For small samples, the AIC is corrected by a second order bias adjustment in the regression and time series settings in order to avoid over-fitting, let n denote the number of observations [17]: . This adjustment places a heavier penalty on the number of parameters when the number of observations is not much larger than the number of parameters; thus AICc tends to choose a simpler model than AIC. We note that the basis for this correction is not expected to hold outside of the regression and time-series settings; however, because we desired a correction for small samples we evaluated the performance of AICc by numerical simulations. In our analysis k denoted the number of parameters in a given model (Table 2) and n denoted the number of codon sites. For both AIC and AICc, the model with the smallest score is chosen as the ideal model.

The full set of 16 fixed-effect models (Figure 1) is implemented in a modified version of codeml that we make freely available [41]. The original version of codeml (3.14b) is one of several programs in the PAML package [42] distributed by Ziheng Yang [43].

Simulated and real sequence data

We simulated codon evolution by using the "evolver" program of the PAML package [42]. There were 16 different scenarios, based on the heterogeneity or homogeneity of four parameters (Table 2) among two data partitions. Parameter values are shown in Table 7. We also varied the ratio of the number of codon sites contained in each partition (50:50, 70:30 or 90:10). Data were simulated independently in two partitions and then concatenated to obtain a composite dataset. Each composite dataset contained 2000 codons for 16 species. A total of 432 composite datasets were possible; i.e., the product of 2 options on κ, 3 options on c, 4 options on π, 6 options on ω and 3 partition proportions. However, as we had two options for heterogeneous rates (1:3 or 1:9) and one possibility for the homogeneous rates (1:1) we made an adjustment to obtain similar number of datasets for each of the 16 scenarios. Specifically, we simulated under only 4 of the options on ω (first and second rows for selection pressure in Table 7) when rates were assumed to be heterogeneous among partitions (c2 = 3 and c2 = 9). Thus 2(κ) × 2(c) × 4(π) × 2(ω) × 3(sites proportion) = 96 possibilities were not included in our simulation. This strategy provided a grand total of 336 composite datasets for evaluating the performance of the different model selection strategies.

Table 7. Parameter values used in simulations of two-partition datasets.

The first real dataset was comprised of sequences for the sperm lysin gene from 25 species of abalone. This is one of the original test cases of Yang and Swanson's paper [4], and it is distributed online by Ziheng Yang as part of the PAML package [43]. Note that this lysin dataset and phylogeny is the same as those from [44], except that a single site containing an alignment gap was excluded. The second real dataset was comprised of 37 genes located within the genomic region of Listeria bacteria that encode the components of the flagellar system (lmo0675 – lmo0717). We note that this region includes several proteins of unknown function, and for comparative purposes we included them in our dataset. Genes were partitioned according to functional categories of the ListiList database [45]. The sequence alignments, phylogenetic trees and gene ontologies for this multi-gene dataset is available online [46]. Although multi-gene datasets can be much larger than ours, it represents both a real biological problem and serves as an illustration of the types of data upon which these techniques can be used.

Authors' contributions

All authors contributed to the interpretation of results, were involved in the drafting and revising of the manuscript, and have read and approved the final version. LB developed the computer code. KD contributed the design and assembly of the Listeria data analysis, and made substantial contributions to interpretation of those data.

Acknowledgements

This research was supported by Discovery Grants from the Natural Sciences and Engineering Research Council of Canada (DG298394 to JPB, and DG40156 to HG) and a grant from Genome Canada.

This article has been published as part of BMC Evolutionary Biology Volume 7, Supplement 1, 2007: First International Conference on Phylogenomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcevolbiol/7?issue=S1.

References

  1. Nielsen R, Yang Z: Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene.

    Genetics 1998, 148:929-936. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  2. Yang Z, Nielsen R, Goldman N, Pedersen AM: Codon-substitution models for heterogeneous selection pressure at amino acid sites.

    Genetics 2000, 155:431-449. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  3. Suzuki Y, Gojobori T: A method for detecting positive selection at single amino acid sites.

    Mol Biol Evol 1999, 16:1315-1328. PubMed Abstract | Publisher Full Text OpenURL

  4. Yang Z, Swanson WJ: Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes.

    Mol Biol Evol 2002, 19:49-57. PubMed Abstract | Publisher Full Text OpenURL

  5. Kosakovsky Pond SL, Frost SD: Not so different after all: a comparison of methods for detecting amino acid sites under selection.

    Mol Biol Evol 2005, 22:1208-1222. PubMed Abstract | Publisher Full Text OpenURL

  6. Massingham T, Goldman N: Detecting amino acid sites under positive selection and purifying selection.

    Genetics 2005, 169:1753-1762. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  7. Clark AG, Glanowski S, Nielsen R, Thomas PD, Kejariwal A, Todd MA, Tanenbaum DM, Civello D, Lu F, Murphy B, Ferriera S, Wang G, Zheng X, White TJ, Sninsky JJ, Adams MD, Cargill M: Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios.

    Science 2003, 302:1960-1963. PubMed Abstract | Publisher Full Text OpenURL

  8. Aagaard JE, Phillips P: Accuracy and power of the likelihood ratio test for comparing evolutionary rates among genes.

    Mol Biol Evol 2005, 60:426-433. Publisher Full Text OpenURL

  9. Goldman N, Yang Z: A codon-based model of nucleotide substitution for protein-coding DNA sequences.

    Mol Biol Evol 1994, 11:725-736. PubMed Abstract | Publisher Full Text OpenURL

  10. Muse SV, Gaut BS: A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with applications to the chloroplast genome.

    Mol Biol Evol 1994, 11:715-725. PubMed Abstract | Publisher Full Text OpenURL

  11. Kosakovski Pond SL, Frost SD, Muse SV: HyPhy: hypothesis testing using phylogenies.

    Bioinformatics 2005, 21:676-679. PubMed Abstract | Publisher Full Text OpenURL

  12. Anisimova M, Bielawski JP, Yang Z: Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution.

    Mol Biol Evol 2001, 18:1585-92. PubMed Abstract | Publisher Full Text OpenURL

  13. Mantel N: Why step down procedures in variable selection.

    Technometrics 1970, 12:621-625. Publisher Full Text OpenURL

  14. Ben-Bassat M: Use of distance measures, information measures and error bounds in feature evaluation. In Classification, Pattern Recognition and Reduction of Dimensionality: Handbook of Statistics. Volume 2. Edited by Krishnaiah PR, Kanal LN. North-Holland Publishing Company; 1982::773-791. OpenURL

  15. Miller AJ: Subset selection in regression. Chapman & Hall; 1990. OpenURL

  16. Akaike H: Information theory and an extension of the maximum likelihood principle.

    2nd International Symposium on Information Theory 1973, 267-281. OpenURL

  17. Hurvich CM, Tsai CL: Regression and time series model selection in small samples.

    Biometrika 1989, 76:297-307. Publisher Full Text OpenURL

  18. Posada D, Crandall KA: Selecting the best-fit model of nucleotide substitution.

    Syst Biol 2001, 50:580-601. PubMed Abstract | Publisher Full Text OpenURL

  19. Posada D, Buckley TR: Model selection and model averaging in phylogenetics: advantages of akaike information criterion and bayesian approaches over likelihood ratio tests.

    Syst Biol 2004, 53:793-808. PubMed Abstract | Publisher Full Text OpenURL

  20. Abascal F, Zardoya R, Posada D: ProtTest: Selection of best-fit models of protein evolution.

    Bioinformatics 2005, 21:2104-2105. PubMed Abstract | Publisher Full Text OpenURL

  21. Self SG, Liang KL: Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions.

    American Statistical Association 1987, 82:605-610. Publisher Full Text OpenURL

  22. Yang Z, Swanson WJ, Vacquier VD: Maximum-likelihood analysis of molecular adaptation in abalone sperm lysin reveals variable selective pressures among lineages and sites.

    Mol Biol Evol 2000, 17:1446-1455. PubMed Abstract | Publisher Full Text OpenURL

  23. Peel M, Donachie W, Shaw A: Temperature-dependent expression of flagella of Listeria monocytogenes studied by electron microscopy, SDS-PAGE and western blotting.

    J Gen Microbiol 1988, 134:2171-2178. PubMed Abstract OpenURL

  24. Dons L, Eriksson E, Jin Y, Rottenberg ME, Kristensson K, Larsen CN, Bresciani J, Olsen JE: Role of flagellin and the two-component CheA/CheY system of Listeria monocytogenes in host cell invasion and virulence.

    Infect Immun 2004, 72:3237-3244. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  25. Hayashi F, Smith KD, Ozinsky A, Hawn TR, Yi EC, Goodlett DR, Eng JK, Akira S, Underhill DM, Aderem A: The innate immune response to bacterial flagellin is mediated by Toll-like receptor 5.

    Nature 2001, 410:1099-1103. PubMed Abstract | Publisher Full Text OpenURL

  26. The KEGG Database [http://www.genome.ad.jp] webcite

  27. Schmid KJ, Aquadro CF: The evolutionary analysis of "orphans" from the Drosophila genome identifies rapidly diverging and incorrectly annotated genes.

    Genetics 2001, 159:589-298. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  28. Whelan S, Goldman N: Distributions of statistics used for the comparison of models of sequence evolution in phylogenetics.

    Mol Biol Evol 1999, 16:1292-1299. OpenURL

  29. Kosakovsky Pond SL, Muse SV: Site-to-site variation of synonymous substitution rates.

    Mol Biol Evol 2005, 22:2375-2385. PubMed Abstract | Publisher Full Text OpenURL

  30. Banfield J, Raftery A: Model-based Gaussian and non-Gaussian clustering.

    Biometrics 1993, 49:803-821. Publisher Full Text OpenURL

  31. Fraley C, Raftery A: How many clusters? Which clustering method? Answers via model-based cluster analysis.

    The Computer Journal 1998, 41(8):578-588. Publisher Full Text OpenURL

  32. Ren F, Tanaka H, Yang Z: An empirical examination of the utility of codon-substitution models in phylogeny reconstruction.

    Syst Biol 2005, 54:808-818. PubMed Abstract | Publisher Full Text OpenURL

  33. Inagaki Y, Roger AJ: Phylogenetic estimation under codon models can be biased by codon usage heterogeneity.

    Mol Phylogenet Evol, in press. PubMed Abstract | Publisher Full Text OpenURL

  34. Nylander JA, Ronquist F, Huelsenbeck JP, Nieves-Aldrey JL: Bayesian phylogenetic analysis of combined data.

    Syst Biol 2004, 53:47-67. PubMed Abstract | Publisher Full Text OpenURL

  35. Bevan RB, Lang BF, Bryant D: Calculating the evolutionary rates of different genes: a fast, accurate estimator with applications to maximum likelihood phylogenetic analysis.

    Syst Biol 2005, 54:900-915. PubMed Abstract | Publisher Full Text OpenURL

  36. DeLong EF: Microbial population genomics and ecology: the road ahead.

    Environ Microbiol 2004, 6:875-878. PubMed Abstract | Publisher Full Text OpenURL

  37. Doolittle RF: Evolutionary aspects of whole-genome biology.

    Curr Opin Struct Biol 2005, 15:248-253. PubMed Abstract | Publisher Full Text OpenURL

  38. McLachlan GJ, Krishnan T: The EM Algorithm and Extensions. John Wiley and Sons; 1997. OpenURL

  39. Sawa T: Information criteria for discriminating among alternative regression models.

    Econometrica 1978, 46:1273-1282. Publisher Full Text OpenURL

  40. Sugiura N: Further analysis of the data by Akaike's information criterion and the finite corrections.

    Communications in Statistics: Theory and Methods 1978, 7(1):13-26. OpenURL

  41. Source code and compiled binaries for the described fixed-effect models are available at [http://www.bielawski.info] webcite

  42. Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood.

    Comput Appl Biosci 1997, 13:555-556. PubMed Abstract OpenURL

  43. The PAML package [http://abacus.gene.ucl.ac.uk/software/paml.html] webcite

  44. Lee YH, Ota T, Vacquier VD: Positive selection is a general phenomenon in the evolution of abalone sperm lysin.

    Mol Biol Evol 1995, 12:231-238. PubMed Abstract | Publisher Full Text OpenURL

  45. Moszer I, Glaser P, Danchin A: SubtiList: a relational database for the Bacillus subtilis genome.

    Microbiology 1995, 141:261-268. PubMed Abstract OpenURL

  46. Sequence alignments, phylogenetic trees and gene ontologies for the flagellar system are available at [http://www.bielawski.info] webcite