Bioinformatics Laboratory, Department of Biology, National University of Ireland, Maynooth, Co. Kildare, Ireland

Bork Group, EMBL Heidelberg, Heidelberg, Germany

Department of Computer Science, University College London, Gower Street, London, UK

Department of Computer Science, National University of Ireland, Maynooth, Co. Kildare, Ireland

Abstract

Background

In recent years, model-based approaches such as maximum likelihood have become the methods of choice for constructing phylogenies. A number of authors have shown the importance of using adequate substitution models in order to produce accurate phylogenies. In the past, many empirical models of amino acid substitution have been derived using a variety of different methods and protein datasets. These matrices are normally used as surrogates, rather than deriving the maximum likelihood model from the dataset being examined. With few exceptions, selection between alternative matrices has been carried out in an ad hoc manner.

Results

We start by highlighting the potential dangers of arbitrarily choosing protein models by demonstrating an empirical example where a single alignment can produce two topologically different and strongly supported phylogenies using two different arbitrarily-chosen amino acid substitution models. We demonstrate that in simple simulations, statistical methods of model selection are indeed robust and likely to be useful for protein model selection. We have investigated patterns of amino acid substitution among homologous sequences from the three Domains of life and our results show that no single amino acid matrix is optimal for any of the datasets. Perhaps most interestingly, we demonstrate that for two large datasets derived from the proteobacteria and archaea, one of the most favored models in both datasets is a model that was originally derived from retroviral Pol proteins.

Conclusion

This demonstrates that choosing protein models based on their source or method of construction may not be appropriate.

Background

For a number of years phylogenetic construction has been considered to be a problem of statistical inference. One of the most popular methods of inferring phylogenetic relationships is maximum likelihood (ML). It has often been considered that one of the advantages of ML over parsimony based methods is that it allows for the use of different models of evolution depending on the dataset being examined. Therefore knowing the process of evolution and being able to construct realistic models of evolution is the foundation for being able to infer accurate phylogenetic relationships among species. Currently one of the major challenges in phylogenetics is to accurately model the process of nucleotide or amino acid substitution and to choose among our set of models in order to infer accurate phylogenies. Felsenstein

Almost all models of amino acid replacement assume that all amino acid sites evolve independently according to the same Markov process. It is assumed that the Markov process is stationary and homogeneous, so that all rates of substitution are constant across time. Each of the protein substitution models consists of a 20 × 20 instantaneous rate matrix which includes the set of original amino acid frequencies (π_{i}) obtained from the dataset that was used to generate the model. The π_{i} values represent the equilibrium or stationary frequencies of the 20 amino acids and the matrices are often modified to include the set of observed frequencies in the dataset being examined. Models that take into account the observed amino acid frequencies are often denoted by the '+F' suffix.
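As an illustration of how such a rate matrix can be assembled, the following minimal sketch builds a toy 4-state instantaneous rate matrix from symmetric exchangeabilities and a frequency vector, using the standard reversible parameterisation in which the off-diagonal rate from state i to state j is the exchangeability multiplied by the frequency of j. The 4-state alphabet, the numerical values, and the function name `build_rate_matrix` are illustrative assumptions; real protein models use 20 states and published exchangeabilities. Passing observed frequencies instead of the model's own frequencies corresponds to the '+F' variant.

```python
def build_rate_matrix(exchange, freqs):
    """Return Q with Q[i][j] = exchange[i][j] * freqs[j] for i != j.

    Diagonal entries make each row sum to zero, and Q is rescaled so that
    the expected rate at equilibrium, -sum_i freqs[i] * Q[i][i], equals 1.
    """
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exchange[i][j] * freqs[j]
        # The diagonal is still 0.0 here, so this sums the off-diagonals.
        q[i][i] = -sum(q[i])
    scale = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[x / scale for x in row] for row in q]


# Toy symmetric exchangeabilities (hypothetical values, 4 states only).
S = [
    [0.0, 1.0, 2.0, 0.5],
    [1.0, 0.0, 0.7, 1.5],
    [2.0, 0.7, 0.0, 1.2],
    [0.5, 1.5, 1.2, 0.0],
]
model_freqs = [0.25, 0.25, 0.25, 0.25]  # frequencies shipped with the model
observed_freqs = [0.4, 0.3, 0.2, 0.1]   # frequencies counted in an alignment ('+F')

Q_model = build_rate_matrix(S, model_freqs)
Q_plus_f = build_rate_matrix(S, observed_freqs)

# Stationarity check: freqs * Q should be (numerically) the zero vector.
residual = [
    sum(observed_freqs[i] * Q_plus_f[i][j] for i in range(4)) for j in range(4)
]
```

Because the exchangeabilities are symmetric, the frequency vector supplied to the function is the stationary distribution of the resulting process, which is why replacing the frequencies changes the equilibrium composition without altering the exchangeabilities.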

There has been a great deal of research into various techniques for performing model selection on nucleotide data

Until now many phylogenetic analyses of multiple datasets from a fixed set of taxa have assumed a single substitution model for all sets of homologs (e.g.

Results

To investigate the potentially harmful effects of a non-statistical approach to choosing protein models, we built two phylogenies with two arbitrarily selected protein models using a single gene family alignment consisting of 7 taxa (3580 characters in length) taken from the dataset of Philip et al.

**Alternative Trees**. Two different trees (with bootstrap support values based on 100 replicates) constructed from a single gene family [34] with different protein models using Phyml v2.4.4 [53]. Tree (a) was produced using the MtREV matrix [15] and Tree (b) was produced using the WAG matrix [18].

The likelihood is calculated as the probability of obtaining the data (multiple sequence alignment) given the model of evolution (substitution model and phylogeny). Ideally we would prefer to use the true tree when performing our model selection as this would remove any conflicting signals from an incorrect base tree. However, on real datasets the true tree is unknown, so we must use some approximation of the true tree for the model selection procedure in order to estimate the model parameters

**Base Tree**. The true tree used to generate all of the simulated alignments.

Base tree sensitivity

The results of the simulations using different base trees (true, random, and NJ-JTT tree) for the model selection procedure are presented in Table

Base Tree Simulations. Results of simulated datasets when a random, NJ-JTT, and the true tree are used as the base tree for the model selection procedure and the sequence length is 500 characters. Each entry is the number of times out of 100 replicates the correct model was selected by each measure.

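| Model | AIC_{1} (Random) | AIC_{2} (Random) | BIC (Random) | AIC_{1} (NJ-JTT) | AIC_{2} (NJ-JTT) | BIC (NJ-JTT) | AIC_{1} (True) | AIC_{2} (True) | BIC (True) |
|---|---|---|---|---|---|---|---|---|---|
| Blosum | 0 | 0 | 0 | 91 | 98 | 99 | 84 | 96 | 96 |
| Blosum+I | 0 | 0 | 0 | 94 | 99 | 100 | 100 | 100 | 100 |
| Blosum+G | 58 | 65 | 67 | 75 | 83 | 87 | 75 | 84 | 87 |
| Blosum+I+G | 89 | 86 | 85 | 90 | 88 | 87 | 89 | 85 | 85 |
| CPREV | 0 | 0 | 0 | 92 | 99 | 100 | 93 | 98 | 99 |
| CPREV+I | 0 | 0 | 0 | 98 | 99 | 99 | 97 | 99 | 100 |
| CPREV+G | 80 | 83 | 85 | 89 | 89 | 90 | 89 | 90 | 90 |
| CPREV+I+G | 94 | 91 | 91 | 80 | 75 | 73 | 80 | 75 | 73 |
| Dayhoff | 0 | 0 | 0 | 95 | 100 | 100 | 94 | 98 | 99 |
| Dayhoff+I | 0 | 0 | 0 | 98 | 100 | 100 | 96 | 99 | 100 |
| Dayhoff+G | 68 | 72 | 74 | 77 | 86 | 90 | 79 | 88 | 91 |
| Dayhoff+I+G | 94 | 93 | 93 | 82 | 74 | 74 | 84 | 74 | 72 |
| JTT | 0 | 0 | 0 | 94 | 99 | 100 | 97 | 99 | 99 |
| JTT+I | 0 | 0 | 0 | 96 | 100 | 100 | 97 | 100 | 100 |
| JTT+G | 54 | 59 | 62 | 78 | 85 | 86 | 81 | 87 | 89 |
| JTT+I+G | 94 | 94 | 93 | 89 | 85 | 82 | 92 | 87 | 84 |
| MtREV | 0 | 0 | 0 | 85 | 96 | 97 | 94 | 99 | 99 |
| MtREV+I | 0 | 0 | 0 | 92 | 99 | 100 | 97 | 100 | 100 |
| MtREV+G | 80 | 87 | 87 | 92 | 94 | 95 | 93 | 94 | 94 |
| MtREV+I+G | 86 | 85 | 84 | 68 | 65 | 61 | 70 | 63 | 63 |
| WAG | 0 | 0 | 0 | 88 | 97 | 99 | 95 | 99 | 100 |
| WAG+I | 0 | 0 | 0 | 98 | 100 | 100 | 96 | 100 | 100 |
| WAG+G | 74 | 79 | 79 | 83 | 89 | 89 | 83 | 88 | 89 |
| WAG+I+G | 90 | 89 | 87 | 79 | 73 | 69 | 78 | 73 | 71 |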

We next examined the difference between the models selected using the likelihood values from the quick NJ-JTT base tree and those selected using fully optimised ML phylogenies produced under each of the individual models (see Methods). The two procedures selected different models in fewer than 10% of cases (see Table

Full ML Comparison. A comparison of the models selected from the likelihood values obtained from a full ML tree search using all models and the likelihood values using the default NJ-JTT base tree. The column 'Identical' indicates the number of times (out of 100 alignments) both procedures selected the same model. The column titled 'Rate' indicates cases when the same amino acid matrix but a different ASRV was selected. The column titled 'Matrix' indicates cases when a different amino acid matrix was selected.

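| Dataset | AIC_{1} Identical | AIC_{1} Rate | AIC_{1} Matrix | AIC_{2} Identical | AIC_{2} Rate | AIC_{2} Matrix | BIC Identical | BIC Rate | BIC Matrix |
|---|---|---|---|---|---|---|---|---|---|
| Proteobacteria | 95 | 4 | 1 | 93 | 6 | 1 | 94 | 2 | 4 |
| Archaea | 99 | 1 | 0 | 96 | 2 | 2 | 95 | 2 | 3 |
| Vertebrate | 91 | 7 | 2 | 94 | 5 | 1 | 97 | 1 | 2 |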
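The three categories used in this comparison can be illustrated with a short sketch that classifies a pair of selected model labels. It assumes labels of the form `WAG+I+G`, with the matrix name first and the suffixes after it; the function names are hypothetical, and treating any difference in suffixes (including '+F') as a 'Rate' difference is an assumption rather than something the caption specifies.

```python
def split_model(name):
    """Split a label such as 'WAG+I+G+F' into (matrix, set of suffixes)."""
    parts = name.split("+")
    return parts[0], frozenset(parts[1:])


def compare_selections(full_ml_model, nj_jtt_model):
    """Classify two selected models as 'Identical', 'Rate' or 'Matrix'."""
    matrix_a, suffixes_a = split_model(full_ml_model)
    matrix_b, suffixes_b = split_model(nj_jtt_model)
    if matrix_a != matrix_b:
        return "Matrix"      # a different amino acid matrix was selected
    if suffixes_a != suffixes_b:
        # Assumption: any suffix difference is counted as a rate difference.
        return "Rate"
    return "Identical"       # same matrix and same suffixes


# Hypothetical pairs of (full ML selection, NJ-JTT selection).
examples = [
    ("WAG+I+G", "WAG+I+G"),
    ("WAG+G", "WAG+I+G"),
    ("JTT+G", "WAG+G"),
]
labels = [compare_selections(a, b) for a, b in examples]
```

Tallying these labels over the 100 alignments of a dataset would give one row of the table for a single measure.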

Sequence length

One of the factors believed to affect the results of nucleotide model selection is sequence length. We therefore compared the three measures (AIC_{1}, AIC_{2}, and BIC) for the three different alignment lengths (100, 500, and 1000 characters). As expected, the rates of correct model selection increase for the longer sequences compared to the shorter sequences. One noticeable feature of the 100 character dataset is that the number of times the correct model was selected when a +I+G ASRV was present was substantially reduced for all matrices. Further examination of the results shows that this is almost always due to the model selection procedure picking the +G version of the model. The difference in likelihoods between the +I+G and +G models is quite small at short sequence lengths and is not large enough for the measures to prefer the more parameterised +I+G models. In these cases, we have observed that the

Alignment Length Simulations. Results of the simulated datasets for alignments of 100, 500, and 1000 characters in length. Each entry is the number of times out of 100 replicates the correct model was selected by each measure (using the default NJ-JTT base tree).

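| Model | AIC_{1} (100) | AIC_{2} (100) | BIC (100) | AIC_{1} (500) | AIC_{2} (500) | BIC (500) | AIC_{1} (1000) | AIC_{2} (1000) | BIC (1000) |
|---|---|---|---|---|---|---|---|---|---|
| Blosum | 86 | 96 | 95 | 91 | 98 | 99 | 94 | 99 | 100 |
| Blosum+I | 95 | 100 | 95 | 94 | 99 | 100 | 98 | 100 | 100 |
| Blosum+G | 89 | 95 | 95 | 75 | 83 | 87 | 79 | 85 | 88 |
| Blosum+I+G | 44 | 30 | 30 | 90 | 88 | 87 | 95 | 95 | 94 |
| CPREV | 92 | 99 | 99 | 92 | 99 | 100 | 95 | 100 | 100 |
| CPREV+I | 94 | 100 | 100 | 98 | 99 | 99 | 99 | 99 | 100 |
| CPREV+G | 87 | 99 | 98 | 89 | 89 | 90 | 91 | 96 | 97 |
| CPREV+I+G | 51 | 37 | 37 | 80 | 75 | 73 | 95 | 94 | 94 |
| Dayhoff | 92 | 99 | 99 | 95 | 100 | 100 | 93 | 99 | 100 |
| Dayhoff+I | 94 | 100 | 100 | 98 | 100 | 100 | 96 | 100 | 100 |
| Dayhoff+G | 83 | 93 | 93 | 77 | 86 | 90 | 94 | 94 | 95 |
| Dayhoff+I+G | 54 | 35 | 38 | 82 | 74 | 74 | 95 | 92 | 91 |
| JTT | 95 | 98 | 98 | 94 | 99 | 100 | 93 | 98 | 100 |
| JTT+I | 95 | 99 | 98 | 96 | 100 | 100 | 96 | 100 | 100 |
| JTT+G | 87 | 94 | 94 | 78 | 85 | 86 | 91 | 91 | 93 |
| JTT+I+G | 48 | 36 | 40 | 89 | 85 | 82 | 96 | 95 | 94 |
| MtREV | 95 | 98 | 98 | 85 | 96 | 97 | 91 | 97 | 97 |
| MtREV+I | 97 | 100 | 100 | 92 | 99 | 100 | 97 | 100 | 100 |
| MtREV+G | 86 | 97 | 97 | 92 | 94 | 95 | 92 | 95 | 96 |
| MtREV+I+G | 29 | 17 | 17 | 68 | 65 | 61 | 87 | 85 | 83 |
| WAG | 91 | 97 | 96 | 88 | 97 | 99 | 97 | 98 | 100 |
| WAG+I | 94 | 100 | 99 | 98 | 100 | 100 | 97 | 99 | 100 |
| WAG+G | 85 | 95 | 93 | 83 | 89 | 89 | 86 | 95 | 95 |
| WAG+I+G | 50 | 34 | 36 | 79 | 73 | 69 | 97 | 96 | 94 |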

Among-site rate variation parameters

ASRV parameters can vary greatly in real datasets; it is therefore important to investigate whether the model selection procedure is affected by varying the ASRV parameters. Table

Gamma Distribution Simulations. Results of simulations when the

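| Model | AIC_{1} | AIC_{2} | BIC | AIC_{1} | AIC_{2} | BIC | AIC_{1} | AIC_{2} | BIC |
|---|---|---|---|---|---|---|---|---|---|
| BLOSUM62+G | 75 | 83 | 87 | 32 | 62 | 69 | 36 | 68 | 74 |
| BLOSUM62+I+G | 90 | 88 | 87 | 95 | 93 | 92 | 100 | 100 | 100 |
| CPREV+G | 89 | 89 | 90 | 39 | 72 | 77 | 39 | 65 | 79 |
| CPREV+I+G | 80 | 75 | 73 | 93 | 89 | 89 | 100 | 100 | 100 |
| Dayhoff+G | 77 | 86 | 90 | 33 | 36 | 74 | 38 | 60 | 66 |
| Dayhoff+I+G | 82 | 74 | 74 | 98 | 95 | 92 | 100 | 100 | 100 |
| JTT+G | 78 | 85 | 86 | 43 | 71 | 76 | 25 | 54 | 63 |
| JTT+I+G | 89 | 85 | 82 | 98 | 96 | 94 | 100 | 100 | 100 |
| MtREV+G | 92 | 94 | 95 | 46 | 72 | 76 | 51 | 75 | 84 |
| MtREV+I+G | 68 | 65 | 61 | 90 | 85 | 83 | 100 | 100 | 100 |
| WAG+G | 83 | 89 | 89 | 35 | 70 | 76 | 32 | 70 | 79 |
| WAG+I+G | 79 | 73 | 69 | 97 | 91 | 90 | 100 | 100 | 100 |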

Amino acid frequency perturbation

Each of the protein substitution models consists of an instantaneous rate matrix (Q) which includes a set of original amino acid frequencies (π_{i}) obtained from the dataset that was used to generate the model. If we instead use the observed amino acid frequencies of the dataset being examined (denoted by the '+F' suffix), then we include 19 extra free parameters when evaluating each model. We were interested in investigating what effect a change in amino acid frequencies would have on the model selection procedure and whether the corresponding '+F' versions of the models would be selected. We would expect our model selection procedure to be robust enough to select the corresponding amino acid matrix despite the variation in amino acid frequencies. Table
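To make the cost of the 19 extra parameters concrete, the additional penalty that a '+F' model incurs relative to its fixed-frequency counterpart can be worked out for each of the three measures described in Methods. This is a worked illustration only, assuming that the sample size in the BIC is taken to be the alignment length of 500 characters used in these simulations:

```latex
\begin{align*}
\Delta_{\mathrm{AIC}_1} &= 2 \times 19 = 38,\\
\Delta_{\mathrm{AIC}_2} &= 5 \times 19 = 95,\\
\Delta_{\mathrm{BIC}}   &= 19 \ln 500 \approx 118.1.
\end{align*}
```

Under these assumptions, a '+F' model is preferred by a given measure only if it reduces -2 ln L by more than the corresponding amount.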

Amino Acid Frequency Simulations. Results of the simulated datasets where the original amino acid frequencies are randomly perturbed by up to 10% from the original values and the alignment length is 500 characters. Each entry indicates the number of times out of 100 replicates the correct model was selected by each measure.

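| Model | AIC_{1} | AIC_{2} | BIC | Model | AIC_{1} | AIC_{2} | BIC |
|---|---|---|---|---|---|---|---|
| Blosum+F | 94 | 100 | 100 | JTT+F | 93 | 100 | 100 |
| Blosum+I+F | 71 | 91 | 95 | JTT+I+F | 67 | 89 | 94 |
| Blosum+G+F | 86 | 93 | 96 | JTT+G+F | 75 | 89 | 92 |
| Blosum+I+G+F | 99 | 97 | 96 | JTT+I+G+F | 98 | 96 | 96 |
| CPREV+F | 92 | 100 | 100 | MtREV+F | 93 | 99 | 99 |
| CPREV+I+F | 87 | 98 | 99 | MtREV+I+F | 86 | 96 | 99 |
| CPREV+G+F | 93 | 96 | 97 | MtREV+G+F | 86 | 93 | 95 |
| CPREV+I+G+F | 89 | 87 | 84 | MtREV+I+G+F | 85 | 82 | 80 |
| Dayhoff+F | 93 | 99 | 99 | WAG+F | 95 | 100 | 100 |
| Dayhoff+I+F | 91 | 98 | 99 | WAG+I+F | 82 | 96 | 97 |
| Dayhoff+G+F | 86 | 93 | 96 | WAG+G+F | 88 | 95 | 96 |
| Dayhoff+I+G+F | 99 | 97 | 96 | WAG+I+G+F | 90 | 89 | 89 |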

Expected model selections

Some of the amino acid substitution matrices were developed specifically for use with certain types of datasets. For example, the MtREV

Real Dataset Analysis. Results of the model selection on the specialised datasets (see the references for full descriptions of the individual datasets). Amino acid matrix expectations are based on previously published information about the sequences ([19, 54, 55] and LANL [56]).

| Dataset | Source | Expected | AIC_{1} | AIC_{2} | BIC |
|---|---|---|---|---|---|
| mtCDNApri | Yang [54] | MtMam | MtMam+I+G | MtMam+G | MtMam+G |
| mtCDNAape | Yang [54] | MtMam | MtMam+F | MtMam+F | MtMam+F |
| 70pep_nogap | Reyes | MtMam | MtMam+I+G+F | MtMam+I+G | MtMam+I+G |
| BETA | Dimmic | RtREV | RtREV+G+F | RtREV+G | RtREV+G |
| ENDO | Dimmic | RtREV | RtREV+I+G+F | RtREV+I+G+F | RtREV+I+G+F |
| GAGGAM | Dimmic | JTT | JTT+G+F | JTT+G+F | JTT+G+F |
| GAGHIV | Dimmic | JTT | JTT+G+F | JTT+G+F | JTT+G+F |
| GAMMA | Dimmic | RtREV | CPREV+G+F | RtREV+G | RtREV+G |
| LENTI | Dimmic | RtREV | RtREV+I+G+F | RtREV+I+G+F | RtREV+I+G+F |
| SPUMA | Dimmic | RtREV | RtREV+G | RtREV+G | RtREV+G |
| NONLTR | Dimmic | RtREV | RtREV+I+G+F | RtREV+I+G+F | RtREV+I+G+F |
| SIVPOLPRO | LANL | RtREV | RtREV+G+F | RtREV+G+F | RtREV+G |

Model variation among multi-gene datasets

Figure

**Proteobacteria Dataset**. A break-down of the set of best-fit protein models for the proteobacteria dataset.

**Vertebrate Dataset**. A break-down of the set of best-fit protein models for the vertebrate dataset.

**Archaea Dataset**. A break-down of the set of best-fit protein models for the archaea dataset.

Model selection and tree accuracy

Table

Tree Accuracy Simulations. Results of the simulated tree accuracy test where alignments were generated with a particular model and then phylogenies were built using all of the other available models. Each entry is the average scaled Robinson-Foulds (RF) distance [40] over the trees inferred using the alternative models. This test was repeated 10 times for each model and the values in brackets are the RF distances from the true tree when phylogenies were inferred using the model that generated the alignment. Phyml [53] was used to build all trees.

| Model | RF Distance | Model | RF Distance |
|---|---|---|---|
| Blosum | 0.03 (0.03) | JTT | 0.05 (0.05) |
| Blosum+I | 0.02 (0.02) | JTT+I | 0.05 (0.04) |
| Blosum+G | 0.08 (0.06) | JTT+G | 0.04 (0.03) |
| Blosum+I+G | 0.05 (0.05) | JTT+I+G | 0.12 (0.11) |
| CPREV | 0.05 (0.04) | MtREV | 0.06 (0.05) |
| CPREV+I | 0.09 (0.04) | MtREV+I | 0.08 (0.08) |
| CPREV+G | 0.06 (0.05) | MtREV+G | 0.07 (0.06) |
| CPREV+I+G | 0.07 (0.06) | MtREV+I+G | 0.12 (0.1) |
| Dayhoff | 0.07 (0.07) | WAG | 0.02 (0.02) |
| Dayhoff+I | 0.06 (0.05) | WAG+I | 0.04 (0.04) |
| Dayhoff+G | 0.06 (0.06) | WAG+G | 0.1 (0.1) |
| Dayhoff+I+G | 0.05 (0.04) | WAG+I+G | 0.04 (0.04) |
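For readers unfamiliar with the measure used in the table above, the following sketch computes a scaled Robinson-Foulds distance between two trees on the same taxa. It is an illustration under stated assumptions: trees are written as nested Python tuples of leaf names (a hypothetical representation), both trees are fully resolved, and the distance is scaled by its maximum value for binary unrooted trees, 2(n - 3), which may differ from the exact scaling used for the reported values.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial splits induced by the internal edges."""
    all_taxa = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        below = leaves(node)
        other = all_taxa - below
        # Keep only informative splits: at least two taxa on each side.
        if len(below) > 1 and len(other) > 1:
            # Store one canonical side so the same edge is recognised
            # regardless of where the tuple representation is rooted.
            splits.add(min(below, other, key=sorted))
        for child in node:
            visit(child)

    visit(tree)
    return splits


def scaled_rf(tree_a, tree_b):
    """Symmetric difference of the split sets, divided by 2 * (n - 3)."""
    splits_a, splits_b = bipartitions(tree_a), bipartitions(tree_b)
    n_taxa = len(leaves(tree_a))
    return len(splits_a ^ splits_b) / (2 * (n_taxa - 3))


# Two hypothetical 5-taxon trees that differ in one internal edge.
true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))
distance = scaled_rf(true_tree, inferred_tree)
```

A value of 0 means the two topologies are identical, and a value of 1 means they share no informative split.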

In real datasets, the true tree is unknown and therefore it is impossible to know with certainty if we have found the true tree. One possible indication as to whether the choice of model is improving the inferred phylogenies might be to take a large dataset of orthologs and measure the level of congruence among the inferred trees. It would be expected that the congruence among the trees would increase as the optimal models are used to build the trees. We took our proteobacteria dataset (2135 orthologs) and built phylogenies using fixed amino acid matrices and also built phylogenies using the optimal protein model for each alignment. Table

Proteobacteria Tree Accuracy Analysis. The scaled Robinson-Foulds (RF) distances [40] among the trees produced from the Proteobacteria dataset when a single fixed model was used to build the trees from every alignment. The values reported are the median and average distance computed by comparing every tree against every other tree. When the optimal set of models was used, the median was 0.22 and the average was 0.34. Phyml [53] was used to build all trees.

| Model | Median RF | Mean RF | Model | Median RF | Mean RF |
|---|---|---|---|---|---|
| Blosum | 0.23 | 0.35 | JTT | 0.23 | 0.34 |
| Blosum+I | 0.25 | 0.35 | JTT+I | 0.25 | 0.35 |
| Blosum+G | 0.25 | 0.35 | JTT+G | 0.25 | 0.35 |
| Blosum+I+G | 0.25 | 0.35 | JTT+I+G | 0.25 | 0.35 |
| CPREV | 0.24 | 0.35 | MtREV | 0.25 | 0.35 |
| CPREV+I | 0.25 | 0.35 | MtREV+I | 0.25 | 0.35 |
| CPREV+G | 0.25 | 0.35 | MtREV+G | 0.25 | 0.35 |
| CPREV+I+G | 0.25 | 0.35 | MtREV+I+G | 0.25 | 0.35 |
| Dayhoff | 0.2 | 0.34 | WAG | 0.21 | 0.34 |
| Dayhoff+I | 0.21 | 0.34 | WAG+I | 0.23 | 0.35 |
| Dayhoff+G | 0.22 | 0.34 | WAG+G | 0.25 | 0.35 |
| Dayhoff+I+G | 0.22 | 0.34 | WAG+I+G | 0.25 | 0.35 |

Discussion

We have studied the influence of various factors on protein model selection. Our simulations have confirmed previous work showing that the model selection procedure performs quite accurately using an approximate tree for model selection. One of the most interesting results that we have shown using real datasets is that a full ML analysis selected a different matrix from the quick NJ-JTT method less than 9% of the time. This further strengthens the recent results presented by Sullivan

It should be emphasized that many of the current models of amino acid or nucleotide substitution make unrealistic assumptions such as reversibility, amino acid composition stationarity, and homogeneous substitution rates. However, much work is currently taking place to develop methods to loosen many of these restrictions

We have highlighted an example where two highly-supported and topologically different phylogenies were produced from the same alignment using two arbitrarily selected amino acid substitution matrices (see Fig.

The results of our cross-domain substitution model analysis are interesting as there are noticeable differences in the groups of models selected by each dataset with no single matrix emerging as the best for any of the datasets. The large diversity of amino acid matrices cannot come as a great surprise as it would seem intuitively unreasonable to assume that a very large group of independently evolving gene families from a fixed taxon set followed an identical amino acid substitution pattern. Perhaps one of the most significant findings is that the RtREV matrix

Conclusion

In this study, we have analysed the ability of the AIC and the BIC to select the appropriate evolutionary model in cases where the model is known. We have shown that both methods are suitable for this purpose. We have also shown that none of the currently available models is universally preferred for all alignments and that there is considerable variation in the substitution process across protein families. What we have not attempted to show is that for any given alignment the selected model is the actual model that gave rise to the observed data. However, on the basis of our results we can speculate on the appropriateness of the models. Considering that a viral model is one of the most preferred models for these cellular sequences, perhaps none of the models are really capturing the data. The models are homogeneous across the tree and this is likely to be a simplification. Therefore, even though we have produced a robust method of model selection, it is likely that the models themselves need to be improved.

Methods

The AIC is a popular model selection measure that attempts to strike a balance between the goodness-of-fit and complexity of a model. The AIC is calculated by

AIC_{1} = -2 ln L_{i} + 2N_{i}, (1)

where N_{i} is the number of free parameters in model i and L_{i} is the likelihood value of model i. A second version of the measure, AIC_{2}, can sometimes be more accurate at determining the correct nucleotide substitution model. It is calculated by replacing the 2N_{i} term with 5N_{i}, thus further penalising models of greater complexity. The BIC is another model selection measure and is equivalent to selecting the model with the maximum posterior probability and is calculated from

BIC = -2 ln L_{i} + N_{i} ln n, (2)

where n is the sample size. AIC_{2} and BIC tend to select simpler models than the AIC_{1} because they penalise the addition of further model parameters more than the AIC_{1}.
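The three measures described above can be sketched in a few lines of Python, using penalties of 2N_{i}, 5N_{i}, and N_{i} ln n respectively. The log-likelihoods, the parameter counts (which here count only the rate-variation parameters and ignore parameters shared by both models, such as branch lengths), the sample size, and all function names are hypothetical values chosen for illustration; this is not the MODELGENERATOR implementation.

```python
import math


def aic1(log_likelihood, n_params):
    """AIC with the usual penalty of 2 per free parameter."""
    return -2.0 * log_likelihood + 2.0 * n_params


def aic2(log_likelihood, n_params):
    """Modified AIC with a penalty of 5 per free parameter."""
    return -2.0 * log_likelihood + 5.0 * n_params


def bic(log_likelihood, n_params, sample_size):
    """BIC with a penalty of ln(sample size) per free parameter."""
    return -2.0 * log_likelihood + n_params * math.log(sample_size)


def select_models(candidates, sample_size):
    """Return the lowest-scoring model under each measure.

    candidates maps a model name to (log-likelihood, free parameters).
    """
    scores = {
        "AIC1": {m: aic1(lnl, k) for m, (lnl, k) in candidates.items()},
        "AIC2": {m: aic2(lnl, k) for m, (lnl, k) in candidates.items()},
        "BIC": {m: bic(lnl, k, sample_size) for m, (lnl, k) in candidates.items()},
    }
    return {name: min(vals, key=vals.get) for name, vals in scores.items()}


# Hypothetical fits: the richer model gains only 1.5 log-likelihood units.
candidates = {
    "WAG+G": (-10000.0, 1),
    "WAG+I+G": (-9998.5, 2),
}
best = select_models(candidates, sample_size=500)
```

With these made-up numbers the small likelihood gain is enough for AIC_{1} to prefer the richer model, while the heavier penalties of AIC_{2} and BIC favour the simpler one.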

We have recently developed a protein model selection program called MODELGENERATOR. We assessed the three measures (AIC_{1}, AIC_{2}, BIC) when applied to protein model selection. For all of the simulations, we used the same 20-taxon clocklike tree used by Posada and Crandall

Base tree sensitivity

In order to compare the sensitivity of protein model selection to the accuracy of the base tree, we generated 2400 individual alignments of 500 characters in length using each of the protein models available in Seq-Gen (100 alignments per model) fixing the proportion of invariable sites at 0.2 and the

To further investigate the effect of using a distance-based tree for comparison rather than the fully optimised ML tree of each model, we obtained three real datasets, one from each of the Domains of life. The first dataset consists of 2135 gene families obtained from 25 complete proteobacteria genomes. The homologs were identified by performing all-against-all BLAST searches ^{-7}. The sequences were aligned using ClustalW 1.81 ^{-7}). Each of these families consisted of between 4 and 16 taxa and was aligned using ClustalW 1.81 with the default settings

Sequence length

We generated 100 replicate alignments of each of the protein models available in Seq-Gen consisting of 100, 500, and 1000 characters in length. For these tests, we fixed the proportion of invariable sites at 0.2 and the

Rate-distribution parameters

In order to investigate the possible effect of varying ASRV parameters, we generated a number of different simulated datasets (100 replicate alignments per model) with a fixed sequence length of 500 characters and varied the

Amino acid frequency perturbation

In order to create these simulated '+F' alignments, we took the original amino acid frequencies of each model and randomly perturbed each of the individual amino acid frequencies by up to 10% change from its original value in each model (ensuring that the summation of the new set of frequencies remained 1.0) and then used Seq-Gen to generate an alignment using the new set of amino acid frequencies according to the substitution process of the individual model (see algorithm in Figure

**Pseudo Code**. The algorithm used to generate the simulated +F alignments can be described in pseudocode as follows. The function random returns a random number greater than the first argument and less than the second argument.
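A minimal Python sketch of the perturbation step described in this section is given here. It assumes that each frequency is multiplied by a random factor within 10% of 1 and that the result is then rescaled to sum to 1.0; the function name, the `max_change` parameter, and the uniform starting frequencies are hypothetical, and the details of the original figure may differ.

```python
import random


def perturb_frequencies(freqs, max_change=0.10, rng=random):
    """Randomly perturb each frequency by up to max_change of its value.

    Each frequency is scaled by a factor drawn uniformly between
    1 - max_change and 1 + max_change, and the perturbed values are
    then renormalised so that they sum to 1.0.
    """
    perturbed = [f * (1.0 + rng.uniform(-max_change, max_change)) for f in freqs]
    total = sum(perturbed)
    return [f / total for f in perturbed]


# Hypothetical starting point: 20 equal amino acid frequencies.
original = [0.05] * 20
new_freqs = perturb_frequencies(original, rng=random.Random(0))
```

Note that the renormalisation can move an individual frequency slightly further than 10% from its original value; the perturbed vector would then be passed to the sequence simulator in place of the model's own frequencies.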

Expected model selection

We obtained the two primate mitochondrial datasets that are included as example datasets in Paml 3.14

Model variation among empirical datasets

For this test, we used the full set of sequences from the three real datasets, one from each Domain of life (as described above). We performed model selection for each alignment in the datasets in order to assess the extent of model differences within the gene families.

Model selection and tree accuracy

To test for the effect of

In an attempt to analyse the effect of

Supplementary data

All of the simulated and real alignments mentioned in the paper are available for download from

Authors' contributions

TMK and JOM initially formulated the idea for the manuscript. CJC and MMP provided some of the real datasets used in the analyses. TMK developed the software, performed the experiments, and drafted the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We thank Peter Foster and Zheng Yang for providing valuable comments on the manuscript. We wish to acknowledge James Cotton and Rod Page for making their vertebrate dataset available for our analysis. We thank Davide Pisani and Jennifer Commins for helping create the figures. We would like to acknowledge the financial support of the Irish Research Council for Science, Engineering and Technology (IRCSET). The authors wish to acknowledge the SFI/HEA Irish Centre for High-End Computing (ICHEC) for the provision of computational facilities and support.