Center for Computational Biology and Bioinformatics, Institute for Cellular and Molecular Biology, and Section of Integrative Biology, The University of Texas at Austin, Austin, TX 78712, USA

Abstract

Background

Protein structure mediates site-specific patterns of sequence divergence. In particular, residues in the core of a protein (solvent-inaccessible residues) tend to be more evolutionarily conserved than residues on the surface (solvent-accessible residues).

Results

Here, we present a model of sequence evolution that explicitly accounts for the relative solvent accessibility of each residue in a protein. Our model is a variant of the Goldman-Yang 1994 (GY94) model in which all model parameters can be functions of the relative solvent accessibility (RSA) of a residue. We apply this model to a data set of nearly 600 yeast genes, and find that an evolutionary-rate ratio

Conclusions

Structure-aware models of sequence evolution provide a significantly better fit than traditional models that neglect structure. The linear relationship between

Background

Substitution patterns in protein-coding genes are shaped by the 3-dimensional structure of the expressed proteins. To account for this influence of structure on sequence evolution, evolutionary biologists increasingly aim to combine sequence analysis with structural information or to develop models of sequence evolution that incorporate structural features of the expressed protein. Some authors calculate amino-acid substitution matrices as a function of protein structure

These various analyses differ in their specific results as well as in the approaches taken. However, one pattern consistently emerges: Residues in the core of proteins are more conserved than residues on the surface. This finding agrees with our understanding of protein biochemistry. Substitutions in the core of a protein are more likely to disrupt fold stability than substitutions on the surface, and the loss of the structural integrity of a protein is frequently the underlying cause of loss of function

Inspired by the observed linear relationship between evolutionary conservation and RSA, we here take the standard Goldman-Yang model of coding-sequence evolution (GY94,

Results

An RSA-dependent Markov model of coding-sequence evolution

Previous works assessing the relationship between evolutionary rate and RSA subdivided sites into groups with comparable RSA and then calculated evolutionary rates separately for each group

To resolve these shortcomings, we developed a variant of the GY94 model
_{ij}) of the Markov process describing the substitution process as (for

where π_{j} is the frequency of codon

where
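The rate-matrix equation itself did not survive in this copy of the text, so the following is only a minimal sketch of the standard GY94 rate for a single codon pair as it is usually stated in the literature. It is not necessarily the exact parameterization used here: the RSA dependence of the parameters and the parameter t are not included, and the function name `gy94_rate` and its arguments are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code: codon -> amino acid ('*' marks a stop codon).
GENETIC_CODE = {"".join(c): aa
                for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
PURINES = {"A", "G"}


def gy94_rate(codon_i, codon_j, kappa, omega, pi_j):
    """Off-diagonal rate from codon_i to codon_j under standard GY94.

    kappa -- transition-transversion bias
    omega -- nonsynonymous/synonymous rate ratio
    pi_j  -- equilibrium frequency of the target codon
    """
    aa_i, aa_j = GENETIC_CODE[codon_i], GENETIC_CODE[codon_j]
    if "*" in (aa_i, aa_j):
        return 0.0  # stop codons are not states of the process
    diffs = [(a, b) for a, b in zip(codon_i, codon_j) if a != b]
    if len(diffs) != 1:
        return 0.0  # identical codons or multi-nucleotide changes
    a, b = diffs[0]
    rate = pi_j
    if (a in PURINES) == (b in PURINES):
        rate *= kappa  # transition (purine<->purine or pyrimidine<->pyrimidine)
    if aa_i != aa_j:
        rate *= omega  # nonsynonymous change
    return rate
```

In a full rate matrix, the diagonal entries would be set so that each row sums to zero; the sketch returns 0 for identical codons and leaves that step to the caller.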

We implemented this model in the phylogenetic modeling language HyPhy
_{k}. In this way, we approximate a single matrix _{k}=_{k}), with

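Our model contains three fitted parameters: ω, t, and κ. Each parameter can take one of three functional forms. First, a parameter can be constant, in which case we write ω = ω_{0}, t = t_{0}, or κ = κ_{0}. Second, a parameter can be a linear function of RSA. In this case, we have ω = ω_{0} + ω_{1} RSA, t = t_{0} + t_{1} RSA, or κ = κ_{0} + κ_{1} RSA (in practice evaluated at the RSA value of bin k, because of the binning procedure). Finally, we can allow for separate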
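The three functional forms (constant, linear, and per-bin) can be sketched as small Python helpers. This is only an illustration: the names `make_parameter` and `count_free_parameters` are hypothetical, and the sketch assumes each bin is represented by a single RSA value.

```python
def make_parameter(form, values):
    """Return a function mapping (bin index k, RSA of bin k) to a parameter value.

    form   -- "constant", "linear", or "per-bin"
    values -- [p0] for constant, [p0, p1] for linear, one value per bin for per-bin
    """
    if form == "constant":
        (p0,) = values
        return lambda k, rsa: p0
    if form == "linear":
        p0, p1 = values
        return lambda k, rsa: p0 + p1 * rsa
    if form == "per-bin":
        per_bin = list(values)
        return lambda k, rsa: per_bin[k]
    raise ValueError(f"unknown functional form: {form!r}")


def count_free_parameters(forms, n_bins):
    """Number of fitted values for one combination of functional forms."""
    sizes = {"constant": 1, "linear": 2, "per-bin": n_bins}
    return sum(sizes[f] for f in forms)
```

With 20 bins, `count_free_parameters` gives 6 for the fully linear model and 60 for the fully per-bin model, which agrees with the df values reported for those models in the table of model fits.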

**Examples of RSA-dependent sequence-evolution models considered.** All models have three parameters, evolutionary-rate ratio (**A**) All parameters are estimated per-bin. (**B**) (**C**) All parameters are estimated as linear functions.

A linear RSA dependency for all estimated parameters provides the best model fit

We fitted our model to a data set of yeast sequences with available structural information. We identified 587

Since we considered three different functional forms of RSA dependence (constant, linear, and per-bin) for each of the three parameters

| ω | t | κ | ln L | df | AIC | t | κ |
|---|---|---|---|---|---|---|---|
| linear | linear | per-bin | −839713.86 | 24 | 1679476 | + | − |
| linear | linear | linear | −839736.74 | 6 | 1679485 | + | − |
| per-bin | linear | per-bin | −839701.37 | 42 | 1679487 | + | − |
| per-bin | linear | linear | −839722.37 | 24 | 1679493 | + | − |
| linear | per-bin | linear | −839723.27 | 24 | 1679495 | + | − |
| linear | per-bin | per-bin | −839707.75 | 42 | 1679499 | + | − |
| per-bin | per-bin | linear | −839710.08 | 42 | 1679504 | + | − |
| per-bin | per-bin | per-bin | −839694.42 | 60 | 1679509 | + | − |
| linear | constant | linear | −839757.23 | 5 | 1679524 | 0 | − |
| per-bin | constant | linear | −839740.64 | 23 | 1679527 | + | − |
| linear | constant | per-bin | −839742.62 | 23 | 1679531 | 0 | − |
| per-bin | constant | per-bin | −839727.25 | 41 | 1679537 | 0 | − |
| linear | linear | constant | −839825.99 | 5 | 1679662 | − | 0 |
| per-bin | linear | constant | −839809.70 | 23 | 1679665 | − | 0 |
| linear | per-bin | constant | −839817.06 | 23 | 1679680 | − | 0 |
| per-bin | per-bin | constant | −839800.41 | 41 | 1679683 | − | 0 |
| linear | constant | constant | −839867.98 | 4 | 1679744 | 0 | 0 |
| per-bin | constant | constant | −839856.43 | 22 | 1679757 | 0 | 0 |
| constant | linear | per-bin | −840468.84 | 23 | 1680984 | + | − |
| constant | per-bin | per-bin | −840459.99 | 41 | 1681002 | + | − |
| constant | per-bin | linear | −840479.14 | 23 | 1681004 | + | − |
| constant | linear | linear | −840524.57 | 5 | 1681059 | + | − |
| constant | linear | constant | −840697.41 | 4 | 1681403 | + | 0 |
| constant | per-bin | constant | −840688.35 | 22 | 1681421 | + | 0 |
| constant | constant | linear | −840738.77 | 4 | 1681486 | 0 | − |
| constant | constant | constant | −840740.37 | 3 | 1681487 | 0 | 0 |
| constant | constant | per-bin | −840726.86 | 22 | 1681498 | 0 | 0 |

In general, we found that all parameters varied significantly with RSA. The top eight models did not contain a single model in which even one parameter was constant over RSA. This result shows that it is not sufficient to just make

Whenever the transition–transversion bias

**Evolutionary-rate ratio increases linearly with RSA.** The solid line shows

To assess the effect of the binning procedure on model estimation, we re-fitted the fully linear model (with linear

| n | ω_{1} | ω_{0} | t_{1} | t_{0} | κ_{1} | κ_{0} | ln L |
|---|---|---|---|---|---|---|---|
| 4 | 0.1205 | 0.0106 | 0.7110 | 2.4706 | −2.5487 | 5.3465 | −839824.56 |
| 5 | 0.1208 | 0.0116 | 0.6967 | 2.4734 | −2.5948 | 5.3547 | −817178.82 |
| 6 | 0.1162 | 0.0135 | 0.7012 | 2.4828 | −2.5361 | 5.3136 | −839781.41 |
| 7 | 0.1149 | 0.0143 | 0.7034 | 2.4849 | −2.5102 | 5.2976 | −839764.54 |
| 8 | 0.1138 | 0.0148 | 0.7269 | 2.4805 | −2.5336 | 5.2996 | −839760.69 |
| 9 | 0.1123 | 0.0154 | 0.7062 | 2.4900 | −2.4831 | 5.2759 | −835407.29 |
| 10 | 0.1129 | 0.0156 | 0.7020 | 2.4898 | −2.5003 | 5.2811 | −839745.29 |
| 11 | 0.1132 | 0.0159 | 0.6742 | 2.4879 | −2.4497 | 5.2669 | −797981.33 |
| 12 | 0.1119 | 0.0161 | 0.6706 | 2.5007 | −2.4451 | 5.2571 | −837291.42 |
| 13 | 0.1110 | 0.0162 | 0.7114 | 2.4902 | −2.4846 | 5.2703 | −836692.33 |
| 14 | 0.1108 | 0.0164 | 0.6956 | 2.5005 | −2.4632 | 5.2532 | −837806.63 |
| 15 | 0.1115 | 0.0164 | 0.6959 | 2.4941 | −2.4759 | 5.2653 | −839684.07 |
| 16 | 0.1102 | 0.0167 | 0.7174 | 2.4897 | −2.4858 | 5.2666 | −839740.91 |
| 17 | 0.1098 | 0.0169 | 0.7146 | 2.4886 | −2.4609 | 5.2562 | −835852.76 |
| 18 | 0.1097 | 0.0170 | 0.7074 | 2.4942 | −2.4652 | 5.2548 | −839148.15 |
| 19 | 0.1100 | 0.0169 | 0.7038 | 2.4937 | −2.4785 | 5.2627 | −839318.45 |
| 20 | 0.1097 | 0.0171 | 0.7038 | 2.4943 | −2.4732 | 5.2592 | −839736.74 |
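To make the fitted linear relationships concrete, the sketch below evaluates the n = 20 estimates at a few RSA values. It assumes that, within each pair of columns in the table, the first value is the slope and the second the intercept; this is the only reading under which κ stays positive over the whole RSA range, but it is an inference from the numbers rather than something stated in the table. The names `FIT_N20` and `evaluate` are hypothetical.

```python
# Values copied from the n = 20 row of the table above.
# ASSUMPTION: in each pair of columns, the first value is the slope and the
# second the intercept (the only reading under which kappa stays positive).
FIT_N20 = {
    "omega": {"slope": 0.1097, "intercept": 0.0171},
    "t": {"slope": 0.7038, "intercept": 2.4943},
    "kappa": {"slope": -2.4732, "intercept": 5.2592},
}


def evaluate(name, rsa, fit=FIT_N20):
    """Evaluate the fitted linear function for one parameter at a given RSA."""
    if not 0.0 <= rsa <= 1.0:
        raise ValueError("RSA is expected to lie in [0, 1]")
    p = fit[name]
    return p["intercept"] + p["slope"] * rsa


# Print the three parameters for a buried, intermediate, and exposed residue.
for rsa in (0.0, 0.5, 1.0):
    row = {name: round(evaluate(name, rsa), 4) for name in FIT_N20}
    print(rsa, row)
```

Under this reading, ω rises from about 0.017 for fully buried residues to about 0.127 for fully exposed ones, while κ falls from about 5.26 to about 2.79.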

Surprisingly, the log-likelihood did not vary smoothly in

GY94 model provides a better model-fit than MG94 model

The GY94 model describes evolutionary rates using the two parameters

Here, we have allowed both

To assess whether the nonsynonymous rate

**Comparison of the GY94 and the MG94 models.** The solid line shows

Effect of relative solvent accessibility on synonymous and nonsynonymous substitution rates

The previous subsections have shown that substitution rates at both synonymous and nonsynonymous sites are affected by RSA, and that the ratio

The quantities

The mutational-opportunity and the physical-sites definitions gave nearly identical results for

**Evolutionary rates dN and dS.** (

The effect of core size and expression level on evolutionary rate

In yeast, the primary determinant of evolutionary rate is gene expression level
_{0} + _{1}

Franzosa and Xia showed that the slope of ^{−9}). By contrast, the intercepts were not significantly different (likelihood ratio test,
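For readers unfamiliar with the likelihood ratio test, the sketch below shows the computation for two nested models that differ by a single parameter. How the tests were actually set up is not recoverable from this copy of the text, so the example is an illustration only; the function name `lrt_one_df` is hypothetical, and the chi-square tail for one degree of freedom is computed with `math.erfc` to keep the example free of external dependencies.

```python
import math


def lrt_one_df(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models that differ by one parameter.

    Returns the test statistic 2 * (lnL_alt - lnL_null) and its p-value under
    a chi-square distribution with 1 degree of freedom.
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat < 0:
        raise ValueError("the alternative model should not fit worse than the null")
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Illustration with two log-likelihoods from the table of model fits:
# all parameters constant (df = 3) versus constant omega and t with linear kappa (df = 4).
stat, p = lrt_one_df(-840740.37, -840738.77)
print(f"statistic = {stat:.2f}, p = {p:.3f}")
```

Models that differ by more than one parameter would need the general chi-square survival function, for example from a statistics library.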

**Dependency of ω = dN/dS on protein core size and expression level.** (**A**) Core size affects evolutionary rate on the surface of the protein but not in the core. (**B**) Expression level affects evolutionary rate both on the surface and in the core. However, it has a bigger effect on the surface of the protein. In both panels, the solid lines were estimated jointly from the data using a linear dependency of

The two slopes we found were more similar to each other than the ones found by Franzosa and Xia

We carried out a similar analysis on high-expression and low-expression genes, fitting a separate line to each group of proteins (
^{−62}). We also found a difference in intercept (
^{−12}). Similar results were found when we used codon adaptation index as a proxy for gene expression level (data not shown).

Finally, we carried out a joint analysis of core size and expression level by extracting four groups of proteins from our data set: proteins with (1) high expression level and large core, (2) high expression level and small core, (3) low expression level and large core, and (4) low expression level and small core. Figure
^{−4}). Surprisingly, the effect of core size on slope was reversed for high- and low-expression genes. For high-expression genes, proteins with larger core size showed a larger slope in

**Joint analysis of the effects of both core size (small or large) and expression level (low or high) on the relationship between ω = dN/dS and RSA.** Only the fitted lines are shown. Surprisingly, for low-expression genes, small-core proteins evolve faster than larger-core proteins. This relationship is reversed in a larger data set obtained with less-stringent criteria (see text).

Discussion

We have developed a method that models the evolutionary rate of a coding sequence within the context of the protein’s 3-dimensional structure. Our method is a simple extension of the standard GY94 model, modified such that all parameters are functions of relative solvent accessibility (RSA). We have found that the evolutionary-rate ratio

Our method presents a unified statistical framework for comparing RSA-dependent model parameters among different groups of proteins. Using this framework, we have shown that protein core size affects only the slope of

We have found that the variation in

Our findings here are broadly consistent with the findings of Franzosa and Xia

In our joint analysis of core size and expression level, we made the unexpected observation that the effect of core size on the slope of

Our approach is conceptually related to other recent works attempting to combine protein structure with sequence evolution

Following Franzosa and Xia

We found that in our model, both

The challenge in developing any such models will be to make them realistic yet sufficiently simple so they can be fit to moderately sized data sets. An alternative, simpler strategy could be to calculate equilibrium codon frequencies in an RSA-dependent manner. We considered calculating codon frequencies per bin and found that doing so generally improved AIC scores but did not eliminate the need for RSA-dependent

Our method requires a solved crystal structure to calculate RSA values. Although the Protein Data Bank (PDB) has been growing rapidly over the past decade, the number of available structures is still small compared to the number of available sequences. For example, many of the yeast sequences we used in our analysis did not have a corresponding structure. For those sequences, we relied on homologous protein structures solved in related organisms. Homology mapping performs relatively well in predicting relative solvent accessibility

Our method assumes that RSA remains constant throughout evolution. Yet every amino-acid replacement will cause some distortion in the protein structure

Conclusions

Our work has shown that the evolutionary rate ratio

Methods

Homology mapping and categorization of genes

In order to construct a large data set of sequences with corresponding structures, we obtained open reading frames (ORFs) of the yeast

The percent solvent-accessible surface area (ASA) for each aligned residue was calculated using DSSP
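As a rough illustration of how an absolute ASA value is turned into a relative one, the sketch below divides by a residue-specific maximum ASA. The reference scale used in the analysis is not recoverable from this copy of the text, so `max_asa` is left as an argument and the numbers in the usage line are made up. Clipping to [0, 1] is an assumption of this sketch, and the function name is hypothetical.

```python
def relative_solvent_accessibility(asa, max_asa):
    """Normalize an absolute ASA by the residue's maximum ASA (same units).

    ASSUMPTION: the result is clipped to [0, 1], because an observed ASA can
    slightly exceed the reference maximum of whatever scale is used.
    """
    if max_asa <= 0:
        raise ValueError("max_asa must be positive")
    return min(max(asa / max_asa, 0.0), 1.0)


# Made-up numbers, for illustration only.
print(relative_solvent_accessibility(45.0, 180.0))
```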

Calculation of evolutionary rates

The codons from the yeast alignments were binned by the RSA value of their respective residues, as described
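The binning step can be sketched as follows. The exact bin boundaries are not recoverable from this copy of the text, so the sketch assumes equal-width bins over [0, 1] and uses each bin's midpoint as its representative RSA value; both choices, and the function names, are assumptions.

```python
def assign_bins(rsa_values, n_bins):
    """Map each RSA value in [0, 1] to one of n_bins equal-width bins."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    bins = []
    for rsa in rsa_values:
        if not 0.0 <= rsa <= 1.0:
            raise ValueError(f"RSA out of range: {rsa}")
        # RSA = 1.0 would land in bin n_bins, so fold it into the last bin.
        bins.append(min(int(rsa * n_bins), n_bins - 1))
    return bins


def bin_midpoints(n_bins):
    """Representative RSA value for each bin (the bin's midpoint)."""
    return [(k + 0.5) / n_bins for k in range(n_bins)]
```

Codons sharing a bin index would then be analyzed with the same parameter values, so the number of bins trades resolution in RSA against the amount of data per bin.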

We implemented the model described by Eq. (1) in the HyPhy batch language
π_{j}) using the F3×4 model.

We calculated synonymous (

Statistical analysis

We used the Akaike information criterion (AIC)
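The AIC values in the table of model fits can be reproduced from the reported log-likelihoods and degrees of freedom with the usual definition AIC = 2k − 2 ln L. The sketch below does this for a handful of models and ranks them; the dictionary `fits` holds values copied from that table, and the variable names are hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


# (omega, t, kappa) functional forms -> (ln L, df), copied from the model table.
fits = {
    ("linear", "linear", "per-bin"): (-839713.86, 24),
    ("linear", "linear", "linear"): (-839736.74, 6),
    ("per-bin", "per-bin", "per-bin"): (-839694.42, 60),
    ("constant", "constant", "constant"): (-840740.37, 3),
}

# Rank models from best (lowest AIC) to worst.
ranked = sorted(fits, key=lambda m: aic(*fits[m]))
for model in ranked:
    print(model, round(aic(*fits[model])))
```

The fully per-bin model has the highest log-likelihood of the four, but its 60 parameters are penalized enough that it ranks below the two models with linear ω and t.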

Abbreviations

AIC: Akaike information criterion; CAI: Codon adaptation index; GY94: Goldman-Yang 1994; HyPhy: Hypothesis testing using Phylogenies (software); MG94: Muse-Gaut 1994; ORF: Open reading frame; PDB: Protein data bank; RSA: Relative solvent accessibility.

Authors’ contributions

MPS collected data, developed HyPhy batch files, ran analyses, prepared figures, and wrote the manuscript. AGM developed HyPhy batch files. COW conceived of the study, participated in its design and coordination, and wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This work was supported by NIH grant R01 GM088344, NSF grant EF-0742373, and NSF Cooperative Agreement No. DBI-0939454.