Fred Hutchinson Cancer Research Center, Seattle, WA, USA

Department of Applied Mathematics and Statistics, University of California Santa Cruz, Santa Cruz, CA, USA

Abstract

Background

Statistical models and methods that associate changes in the physicochemical properties of amino acids with natural selection at the molecular level typically do not take into account the correlations between such properties. We propose a Bayesian hierarchical regression model with a generalization of the Dirichlet process prior on the distribution of the regression coefficients that describes the relationship between the changes in amino acid distances and natural selection in protein-coding DNA sequence alignments.

Results

The Bayesian semiparametric approach is illustrated with simulated data and the abalone sperm lysin data. Our method identifies groups of properties which, for this particular dataset, have a similar effect on evolution. The model also provides nonparametric site-specific estimates of the strength of conservation of these properties.

Conclusions

The model described here is distinguished by its ability to handle a large number of amino acid properties simultaneously, while taking into account that such data can be correlated. The multi-level clustering ability of the model allows for appealing interpretations of the results in terms of properties that are roughly equivalent from the standpoint of molecular evolution.

Background

The structural and functional role of a codon in a gene determines its ability to change freely. For example, nonsynonymous (amino acid altering) substitutions may not be tolerated at certain codon sites due to strong negative selection, while at other sites some nonsynonymous substitutions may be allowed if they do not affect key physicochemical properties associated with protein function.

A common feature of all the methods listed above is the implicit assumption that properties are independent from each other in terms of their effect on evolution. A review of the amino acid index database (available for example at

A natural way to account for correlations in the data is by considering a factor structure, see for example

Although the clusters of properties can in principle be considered nuisance parameters that are of no direct interest, in practice posterior inference on the clustering structure can provide interesting insights about the molecular evolution process of a given gene. Indeed, as will become clear in the following sections, our approach incorporates the effect of amino acid usage bias. Hence, any significant differences between the cluster structure estimated from the observed protein-coding sequence alignment and the correlation structure derived from the raw distances between the properties in such clusters can be interpreted as a signal of extreme amino acid usage bias in that particular region of the genome.

The rest of the paper is organized as follows. A brief review of DP mixture models along with the details of our model is provided in the Methods section. This section also includes a review of some of the currently available methods for characterizing molecular evolution that take into account changes in amino acid properties. The model is then evaluated via simulation studies and illustrated through a real data example. The simulated and real data analyses, as well as comparisons between the proposed semiparametric regression approach and other methods, are presented in Results and discussion. Finally, the Conclusions section provides our concluding remarks.

Methods

Dirichlet process mixture models

The Dirichlet process (DP) was formally introduced by
a DP$(\alpha, G_{0})$ prior is indexed by a precision parameter $\alpha$ and a baseline distribution $G_{0}$, which is also the prior expectation of a random distribution drawn from the process.

One of the most commonly used definitions of the DP is its constructive definition, under which a realization $G \sim \mathrm{DP}(\alpha, G_{0})$ is almost surely of the form

$$G(\cdot) = \sum_{l=1}^{\infty} w_{l}\, \delta_{\theta_{l}}(\cdot),$$

where $\delta_{\theta_{l}}$ denotes a point mass at $\theta_{l}$. The locations $\theta_{l}$ are i.i.d. draws from $G_{0}$, while the corresponding weights $w_{l}$ are generated using the following “stick-breaking” mechanism. Let $z_{1}, z_{2}, \ldots$ be i.i.d. $\mathrm{Beta}(1, \alpha)$ random variables, set $w_{1} = z_{1}$, and define

$$w_{l} = z_{l} \prod_{r=1}^{l-1} (1 - z_{r}), \qquad l \ge 2.$$
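Since the paper's implementation is in R (not reproduced here), the following illustrative Python sketch simulates the stick-breaking weights just described; the `tol` argument is an assumption of ours, controlling when the leftover stick length is considered negligible:

```python
import random

def stick_breaking_weights(alpha, tol=1e-8, rng=random):
    """Simulate DP stick-breaking weights: z_l ~ Beta(1, alpha),
    w_1 = z_1, w_l = z_l * prod_{r<l} (1 - z_r).  Stops once the
    remaining stick length drops below tol."""
    weights, remaining = [], 1.0
    while remaining > tol:
        z = rng.betavariate(1.0, alpha)   # proportion broken off the stick
        weights.append(remaining * z)
        remaining *= 1.0 - z              # stick left for later atoms
    return weights

weights = stick_breaking_weights(alpha=2.0)
# The weights sum to (almost) one; small alpha concentrates
# mass on the first few atoms.
```

Smaller values of the precision parameter produce fewer effectively nonzero weights, which is the mechanism behind the clustering behavior discussed next.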

The DP is most often used to model the distribution of random effects in hierarchical models. In the simplest case where no covariates are present, these models reduce to nonparametric mixture models (e.g., consider observations $y_{1}, y_{2}, \ldots, y_{n}$ such that $y_{i} \mid \theta_{i} \sim f(y_{i} \mid \theta_{i})$, where $f(\cdot \mid \theta_{i})$ is a parametric density). Then, the DP mixture model places a DP prior on the distribution of the $\theta_{i}$s:

$$\theta_{i} \mid G \overset{\text{i.i.d.}}{\sim} G, \qquad G \sim \mathrm{DP}(\alpha, G_{0}).$$

The almost sure discreteness of realizations of $G$ induces ties among the $\theta_{i}$s, making DP mixture models appealing in applications where clustering is expected. The clustering nature is easier to see from the Pólya urn characterization of the DP, which describes the joint distribution of the $\theta_{i}$s obtained by marginalizing $G$: the distinct values are i.i.d. draws from $G_{0}$, and the indicators $c_{1}, \ldots, c_{n}$ matching observations to distinct values are sequentially generated with $c_{1} = 1$ and

$$\Pr(c_{i} = k \mid c_{1}, \ldots, c_{i-1}) \propto \begin{cases} n_{k} & k = 1, \ldots, K_{i-1}, \\ \alpha & k = K_{i-1} + 1, \end{cases}$$

where $n_{k}$ is the number of previous indicators equal to $k$ and $K_{i-1}$ is the number of distinct clusters among the first $i-1$ observations.

One advantage of DP mixture models over other approaches to clustering and classification is that they allow us to automatically estimate the number of components in the mixture. Indeed, from the Pólya urn representation of the process it should be clear that, although the number of potential components is infinite, the number of distinct components represented in a sample of size $n$, $n^{\ast} = \max_{i \le n}\{c_{i}\}$, is finite and random.
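The sequential scheme above is straightforward to simulate. The Python sketch below (illustrative, not the authors' code) draws indicators with the Pólya urn probabilities: an existing cluster is chosen with probability proportional to its current size, and a new cluster with probability proportional to the precision parameter `alpha`:

```python
import random

def polya_urn_indicators(n, alpha, rng=random):
    """Sequentially draw cluster indicators c_1,...,c_n:
    c_1 = 1; thereafter c_i = k with prob n_k / (alpha + i - 1),
    or c_i opens a new cluster with prob alpha / (alpha + i - 1)."""
    counts, indicators = [], []
    for i in range(n):
        r = rng.random() * (alpha + i)   # i observations seen so far
        cum = 0.0
        for k, n_k in enumerate(counts):
            cum += n_k
            if r < cum:                  # join existing cluster k+1
                counts[k] += 1
                indicators.append(k + 1)
                break
        else:                            # open a new cluster
            counts.append(1)
            indicators.append(len(counts))
    return indicators
```

Running this repeatedly shows that the number of distinct clusters grows slowly with the sample size, which is the automatic estimation of the number of components mentioned above.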

The model

Our data consist of observed and expected amino acid distances derived from a DNA sequence alignment, a specific phylogeny, a stochastic model of sequence evolution, and a predetermined set of physicochemical amino acid properties. In the analyses presented here, we disregard uncertainty at the alignment/phylogeny/ancestral-sequence level, since our main focus is the development and implementation of models that allow us to make inferences on the latent effects that several amino acid properties may have on molecular evolution for a given phylogeny and an underlying model of sequence evolution. Extensions of these analyses that take into account these uncertainties are briefly described in Conclusions. For further discussion on this issue, see also

In order to calculate the observed distances, we first infer the ancestral sequences under a specific substitution model and a given phylogeny. In our applications, we use PAML version 3.15; this yields an observed distance for each site $i$ and property $j$.

To compute the expected distances, note that each codon can mutate to one of at most nine alternative codons through a single nucleotide substitution. Let $n_{k}$ be the number of nonsynonymous mutations possible through a single nucleotide change, corresponding to a particular codon $k$. The frequency of codon
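To make the count $n_{k}$ concrete, the following Python sketch enumerates, for any codon, the single-nucleotide neighbors that code for a different amino acid. Excluding mutations to stop codons is our assumption here; the text's exact convention is not shown:

```python
# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def n_nonsynonymous(codon):
    """Number of single-nucleotide changes to `codon` that alter the
    encoded amino acid (changes to stop codons excluded)."""
    count = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            neighbor = codon[:pos] + base + codon[pos + 1:]
            if CODE[neighbor] not in ("*", CODE[codon]):
                count += 1
    return count
```

For example, every single-nucleotide change to ATG (methionine) is nonsynonymous, so its count is the maximum of nine, while TTT (phenylalanine) has one synonymous neighbor (TTC) and count eight.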

We consider a hierarchical regression model that relates the observed distance to the expected distance at each codon site and allows us to compare the two at the codon level for several properties simultaneously, with the following rationale. If a given site $i$ evolves without constraint with respect to property $j$, the observed distance should be approximately equal to the expected distance. If property $j$ is conserved at site $i$, the observed distance should be much smaller than the expected distance. Finally, if property $j$ changes radically at site $i$, the observed distance should be much larger than the expected distance.

To construct our model, we first standardize the observed and expected distances by dividing them by the maximum possible distance for each property. This enables us to use priors with the same scale for all the regression coefficients. Our regression model for the standardized distances is

$$o_{i,j} \sim \mathrm{N}\!\left(\beta_{i,j}\, e_{i,j},\, \sigma^{2}\right),$$

where $o_{i,j}$ and $e_{i,j}$ denote the standardized observed and expected distances for site $i$ and property $j$, and $\beta_{i,j}$ is the regression coefficient capturing the strength of selection on property $j$ at site $i$.
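Data from a Gaussian regression of observed on expected distances (the sampling model used in the first simulation study below; the function and argument names in this sketch are our own) can be generated as follows:

```python
import random

def simulate_observed(expected, beta, sigma, rng=random):
    """Draw o_ij ~ N(beta_ij * e_ij, sigma^2) for every site i
    and property j, given matching matrices of expected distances
    and regression coefficients."""
    return [[rng.gauss(b * e, sigma) for b, e in zip(b_row, e_row)]
            for b_row, e_row in zip(beta, expected)]

expected = [[0.2, 0.5], [0.8, 0.1]]
beta = [[1.0, 0.0],   # row 1: neutral for one property, conserved for the other
        [2.5, 1.0]]   # row 2: radically changing for the first property
obs = simulate_observed(expected, beta, sigma=0.001 ** 0.5)
```

With `sigma` set to zero the draws reduce to the products `beta * expected`, which makes the conserved (`beta = 0`) and radically changing (`beta > 1`) cases easy to see.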

To complete the model, we need to describe a model for the matrix of regression coefficients $\beta_{i,j}$. There are a number of possible models for this type of data which utilize Bayesian nonparametric methods; some recent examples include the infinite relational model (IRM)

In this paper we focus on the NIRM, which is constructed by partitioning the original matrix into groups corresponding to entries with similar behavior. This is done by generating partitions in one of the dimensions of the matrix (say, rows) that are nested within clusters of the other dimension (columns). This structure allows us to identify groups of (typically correlated) properties with similar patterns and then, within each such group, identify clusters of sites with similar values of $\beta_{i,j}$ (Figure

**Stylized representation of our model.** Each sub-table at the second level of clustering shares a common value for the regression coefficient $\beta_{i,j}$. Rows correspond to properties, while columns correspond to sites.

More specifically, we denote by **θ** the array of distinct regression coefficient values, and associate with each cluster of properties $k$ a random distribution $G_{k}$.

To obtain cluster-specific partitions for the sites (rows), $G_{k}$ (the joint distribution associated with all sites for a given cluster of properties) has to be chosen carefully. In particular, we write

$$G_{k} = \sum_{l=1}^{\infty} w_{l,k}\, \delta_{\theta_{l,k}},$$

with $\theta_{l,k} \sim G_{0,l,k}$ for every $l$ and $k$; that is, the atoms $\theta_{l,k}$ are independently drawn from the baseline measure $G_{0,l,k}$.

The baseline measure $G_{0,l,k}$ is chosen to accommodate the fact that some properties may be strongly conserved at some sites, so that the corresponding coefficients are exactly zero. Hence, $G_{0,l,k}$ is a mixture with a point mass at zero and a continuous density otherwise. To allow for a more flexible model we assume that different prior variances are associated with the different clusters, and write $G_{0,l,k}$ as

$$G_{0,l,k} = \pi_{l,k}\,\delta_{0} + (1 - \pi_{l,k})\,\mathrm{N}(\mu_{k}, \lambda_{l,k}),$$

with $\pi_{l,k} \sim \mathrm{Beta}(a_{\kappa}, b_{\kappa})$, where $\pi_{l,k}$ denotes the probability that $\theta_{l,k}$ has the value zero (i.e., the properties associated with this cluster are strongly conserved at this cluster of sites).
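A draw from such a spike-and-slab baseline is easy to simulate; in this illustrative Python sketch (argument names are our own), an atom equals zero with probability `p_zero` and is otherwise drawn from the cluster-specific normal component:

```python
import random

def draw_atom(p_zero, mu, sd, rng=random):
    """One draw from the spike-and-slab baseline: a point mass at zero
    with probability p_zero, otherwise Normal(mu, sd^2)."""
    if rng.random() < p_zero:
        return 0.0          # property conserved at this cluster of sites
    return rng.gauss(mu, sd)
```

With `p_zero` small, most atoms are nonzero coefficients centered at `mu`, matching the prior expectation that only a few coefficients are exactly zero.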

Note that our model implies that both sites and properties are exchangeable a priori. If no additional prior information is available, this type of assumption seems reasonable. However, a posteriori, it is possible to have sites behave differently in different clusters.

To complete the model we place hyperpriors on all the parameters of the resulting model. Conjugate priors are chosen for ease of computation. $\mu_{k}$ denotes the mean for the $\theta_{l,k}$s that are different from zero belonging to a specific cluster of properties $k$; we use a common normal prior with hyperparameters $(a_{\alpha}, b_{\alpha})$ for all the $\mu_{k}$. The DP precision parameters are assumed to follow Gamma$(a_{\rho}, b_{\rho})$ priors with mean $a_{\rho}/b_{\rho}$ at the column level, and Gamma$(a_{\gamma}, b_{\gamma})$ priors with mean $a_{\gamma}/b_{\gamma}$ at the row level. The variance of the continuous component of $G_{0,l,k}$, $\lambda_{l,k}$, follows an inverse-Gamma$(a_{\lambda}, b_{\lambda})$ prior. The specific choice of hyperparameters is discussed later as part of each data analysis. In general, we center the prior for the $\mu_{k}$ at one, to correspond to our assumption of neutrality a priori for the properties.

Related work

We compare results from our proposed method with results from a few currently available methods that aim to characterize molecular evolution while also taking into account changes in amino acid properties, namely, the regression model in

In

Posterior simulation

Various algorithms exist for posterior inference in DP mixtures; some of the most popular ones use (i) the Pólya urn characterization to marginalize out the unknown distribution(s)

We use an extension of the finite mixture approximation discussed in
We truncate the stick-breaking representations at a sufficient level and introduce configuration variables: column indicators $\{c_{j}\}$ such that $c_{j} = k$ when property $j$ is assigned to cluster $k$, and row indicators $\{r_{i,k}\}$ indicating the atom to which site $i$ is assigned within column cluster $k$.

To determine the truncation levels, note that the expected probability mass not captured by the first $K-1$ atoms of the stick-breaking representation is $(\alpha/(\alpha+1))^{K-1}$. Using prior guesses for the precision parameters, the truncation levels can be chosen so that this mass is negligible.
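A truncated stick-breaking draw, with the last stick proportion forced to one so that exactly $K$ weights are returned and they sum to one, can be sketched as follows (illustrative Python, our own naming):

```python
import random

def truncated_stick_weights(alpha, K, rng=random):
    """Finite DP approximation: z_l ~ Beta(1, alpha) for l < K and
    z_K = 1, so exactly K weights are returned and they sum to one."""
    weights, remaining = [], 1.0
    for l in range(1, K + 1):
        z = 1.0 if l == K else rng.betavariate(1.0, alpha)
        weights.append(remaining * z)
        remaining *= 1.0 - z
    return weights

# The mass expected beyond the first K-1 atoms, (alpha/(alpha+1))**(K-1),
# guides the choice of K.
```

For instance, with `alpha = 1` and `K = 20` the expected truncated mass is $2^{-19}$, which is negligible for most practical purposes.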

Results and discussion

Empirical exploration via simulation studies

We present two simulation studies to check the performance of the model under different scenarios. Additional simulation scenarios that may be of interest are available as an Additional file

Additional simulations are provided in a separate supplemental file.


Simulation study 1

The setup for the first simulation is as follows. We generate values for the distinct regression coefficients ($\theta_{l,k}$) and then simulate the observed distances $o_{i,j}$ from $\mathrm{N}(\theta_{l,k}\, e_{i,j},\, \sigma^{2} = 0.001)$. The expected distances $e_{i,j}$ are obtained from the lysin data set described below with analyses for 32 properties, which implies

We fitted the model in The Model subsection to the simulated data under two prior specifications for the distinct coefficients. We ran two chains initialized by assigning the $\beta_{i,j}$s to different partitions; posterior summaries based on the two chains were consistent with each other.

In this scenario, we had four clusters for the columns, each with a differing number of groups, leading to twelve distinct cluster combinations for the entire matrix of $\beta_{i,j}$s (Figure

**Image plots for true **$\beta_{i,j}$** values (left panel) and posterior means.**

**Marginal posterior probabilities of each pair of columns belonging to the same cluster.**

Similar graphical summaries show that the correct clustering structures for the rows, within each cluster of columns, are inferred (see Figure

**Marginal posterior probabilities of each pair of rows belonging to the same cluster for two different clusters of columns.**

This scenario corresponds to the type of situation we expect in most real datasets: properties will cluster into groups and, within each group of properties, clusters of sites with similar responses can be clearly identified. Our results suggest that, as expected, the model is capable of identifying these multiple clusters with high accuracy and therefore of accurately estimating the values of the regression coefficients. Other scenarios, including extreme cases where all properties belong to a common cluster while sites belong to one of several clusters, and cases where each property has a different effect on amino acid rates, are available as Additional file

To investigate the effect of the truncation levels and the priors on our model, we performed a sensitivity analysis by varying the truncation levels as well as the different hyperparameters. Increasing the truncation level to 35 did not affect the results, and changing the prior on $\sigma^{2}$ only makes the results marginally better, i.e., the posterior means of the $\beta_{i,j}$s,

Simulation study 2 - data simulated from a biological model

In our second simulation study the model is evaluated in the context of biological sequences generated from an evolutionary model. In particular, a Markov model was used to generate 20 sequences of 90 codons each. For the first one-third of the sites (sites 1-30) we used transition probabilities obtained from the codon-substitution model of

Once we obtained the sequences, we generated ancestral sequences as described above and computed the observed and expected distances for five properties, namely, hydropathy, molecular volume, polarity, isoelectric point, and partial specific volume. Of these, molecular volume and partial specific volume are both related to volume.

Our model was fitted with

The analyses found three clusters of properties: the first cluster contains hydropathy and polarity, the second contains molecular volume and partial specific volume, and the third only the isoelectric point, as shown in Figure
The posterior means of the $\beta_{i,j}$s for representative properties of the three clusters are shown in Figure
The model detects sites with similar $\beta_{i,j}$ values in a specific cluster and groups them together. Groups of sites that change a property can also be identified for clusters 2 and 3 in Figure
For cluster 2 (molecular volume and partial specific volume), there is a big group of sites which conserve these properties. Most of these sites are in the central one-third portion (i.e., the portion that includes sites 31-60), which was simulated under a transition probability matrix that favors transitions that conserve volume. Finally, for cluster 3 (isoelectric point), there is one large group of sites which conserve the property and one group, comprising sites 39 and 80, which change the property greatly.

**Marginal posterior probabilities of any two properties being in the same cluster for the data simulated under a biological model.**

**Posterior means of **$\beta_{i,j}$**s for the three clusters in Figure **** for the simulated data under a biological model.** The sites are sorted according to the increasing value of posterior means.

**Marginal posterior probabilities of any two sites for the simulated data being grouped together in the first cluster in Figure ****.** The sites are sorted according to the increasing value of the posterior means of the $\beta_{i,j}$s.

To better understand the performance of our method, we also analyzed the sequences generated above with the parametric regression model in
The table below lists the sites with the largest and lowest posterior means for molecular volume under the parametric regression model and under our semiparametric approach.

| | **Parametric regression** | **Semiparametric regression** |
| --- | --- | --- |
| 30 sites with largest posterior mean | 4, 5, 14, 18, 19, 21, 24, 33, 48, 52, 54, 59, 62, 64, 65, 67, 71, 74, 75, 77, 80, 81, 82, 84, 85, 89 | 4, 5, 14, 18, 19, 21, 24, 33, 48, 52, 54, 59, 62, 64, 65, 67, 71, 72, 74, 75, 77, 80, 81, 82, 84, 85, 86, 89 |
| 30 sites with lowest posterior mean for molecular volume | 5, 6, 7, 9, 16, 19, 24, 25, 26, 27, 28, 31, 32, 36, 49, 58, 59, 60, 61, 64, 65, 67, 79, 80, 83, 88 | 5, 6, 7, 9, 16, 19, 24, 25, 26, 27, 28, 31, 32, 34, 36, 38, 49, 58, 59, 60, 61, 64, 65, 67, 79, 80, 83, 88 |

Sites marked in bold are the ones which are in the region of interest - for molecular volume, where small changes were encouraged while generating the sequences. Underlined sites are identified by both methods.

Table

| **Property** | **Radically changing (1.645)** | **Radically changing (3.695)** | **Conserved (1.645)** | **Conserved (3.695)** |
| --- | --- | --- | --- | --- |
|  | 5, 59, **65**, **67**, **71**, **74**, **81**, **82**, **89** | **74** | 36, 83 | None |
|  | 21, 24, 37, **64**, **65**, **67**, **71**, **74**, **75**, **81**, **82**, **89** | None | 7, 18, 36, 49, 55 | None |
| Molecular volume | 10, 33, 66 | None | 5, 18, **36**, **49** | None |
| Partial specific volume | 10, 13, 33, 66 | None | 18, **36** | None |
| Isoelectric point | 39, 55, 72 | None | 11, 64, 72 | None |

Values in parentheses denote the cut-off values for the

Finally, we analyzed the sequences generated previously with an additional method, again focusing on molecular volume. We chose to run

Table
shows that the results for molecular volume are not in full agreement. This is probably due to the fact that partitions are not always directly comparable with the amino acid distances. For example, under the volume partition of

| **Property** | **ω** | **ω** | **ω** | **ω** |
| --- | --- | --- | --- | --- |
|  | None | None | None | 1, 2, 5, 7, 10, 11, 12, 13, 14, 18, 19, 20, 26, 27, 30, 32, 33, 34, 36, 37, 42, 43, 47, 53, 57, 59, **61**, **62**, **63**, **64**, **66**, **67**, **68**, **69**, **72**, **73**, **74**, **75**, **77**, **82**, **83**, **86**, **87**, **88**, **90** |
| Molecular volume | None | None | None | 2, 7, 9, 18, 19, 20, 22, 27, 31, 32, 36, 38, 53, 55, 61, 62, 64, 67, 72, 74, 86 |

Sites marked in bold are in the region of interest.

Illustration with Lysin data

Our proposed model was applied to the sperm lysin data set which consisted of cDNA from 25 abalone species with 135 codons in each sequence
_{0} and allows for an additional positive selection category with _{1}.

The lysin data was analyzed with the model in The Model subsection with the 32 amino acid properties listed in Table
The precision parameters were assigned Gamma priors, and the probability of $\theta_{l,k}$ being 0 was assumed to follow a prior with small mean, since few $\beta_{i,j}$s were expected to be 0 a priori. $a_{\kappa}$ and $b_{\kappa}$, the hyperparameters for the prior on the probability that $\theta_{l,k}$ is 0, were chosen as 2 and 100, which implied a prior mean of 0.01. When $\theta_{l,k}$ is different from zero, its prior is given by the continuous component of the baseline measure $G_{0}$, whose scale factor and variance were assigned the hyperpriors described in The Model subsection, and the $\mu_{k}$s were assumed to follow a normal prior.

| **AAindex accession number (if available)** | **Property** | **Symbol** |
| --- | --- | --- |
| KYTJ820101 | Hydropathy |  |
| ∗ | Helical contact area | _{a} |
| GRAR740103 | Molecular volume | _{v} |
| ZIMJ680104 | Isoelectric point | _{i} |
| MANP780101 | Surrounding hydrophobicity | _{p} |
| OOBM770103 | Long-range non-bonded energy | _{l} |
| ZIMJ680103 | Polarity (Zimmerman) | _{zim} |
| ∗ | Mean r.m.s. fluctuation displacement |  |
| CHOP780201 | Alpha-helical tendencies | _{α} |
| FASG760101 | Molecular weight | _{w} |
| GRAR740102 | Polarity (Grantham) |  |
| ∗ | Normalized consensus hydrophobicity | _{nc} |
| PONP800108 | Average number of surrounding residues | _{s} |
| COHE430101 | Partial specific volume | ^{0} |
| ∗ | Power to be at the C-terminal | _{c} |
| WOEC730101 | Polar requirement | _{r} |
| GRAR740101 | Composition |  |
| ∗ | Power to be at the middle of alpha-helix | _{m} |
| ∗ | Compressibility | ^{0} |
| ∗ | Power to be at the N-terminal | _{n} |
| FAUJ880113 | Equilibrium constant (ionization of COOH) | ^{′} |
| MCMT640101 | Refractive index |  |
| CHOP780202 | Beta-structure tendencies | _{β} |
| OOBM770102 | Short and medium range non-bonded energy | _{sm} |
| ZIMJ680102 | Bulkiness | _{l} |
| PONP800107 | Solvent accessible reduction ratio | _{a} |
| ∗ | Buriedness | _{r} |
| ∗ | Thermodynamic transfer hydrophobicity | _{t} |
| ∗ | Chromatographic index | _{F} |
| OOBM770101 | Total non-bonded energy | _{t} |
| CHAM830101 | Coil tendencies | _{c} |
| CHOP780101 | Turn tendencies |  |

Properties marked by ∗ are from

Figure
displays the marginal posterior probabilities that any two properties belong to the same cluster. The largest cluster includes, among others, molecular volume, ^{0}, molecular weight, alpha-helical tendencies, and Polarity (Zimmerman), which shows a large correlation value (about 0.9) with the isoelectric point. There is some uncertainty regarding the membership of ^{0} and the short and medium range non-bonded energy, since both of them are assigned to the largest cluster about 50% of the time, while the latter is clustered with properties related to volume to a lesser extent. ^{1} is the only property that is almost never clustered with other properties.

**Marginal posterior probabilities of any two properties being in the same cluster for the lysin data.**

Site-specific results are based on the posterior means of the regression coefficients for the four clusters of properties, each denoted by a representative property (one of them molecular volume). The first three clusters also have a fairly large number of sites with similar mean coefficients. The fourth cluster, denoted by Polarity (Zimmerman), corresponds to the properties Polarity (Zimmerman) and the isoelectric point. A large number of sites in cluster 4 strongly conserve these properties (e.g., sites 35, 43, 49, 51, 64, 114, 117, 121), as is evident by the very small mean coefficients.

**Posterior means of the **$\beta_{i,j}$**s for the four clusters (denoted by representative properties) in Figure **** for lysin.** The sites are sorted according to the increasing value of posterior means.

Figure
shows posterior summaries of the $\beta_{i,j}$s different from zero for sites 82, 99, 120 and 127, for properties belonging to different clusters. Of these, sites 120 and 127 were found to be under positive selection by other methods. We can also see similarities in the posterior summaries across sites; for example, for property ^{1}, sites 82, 120 and 127 have similar values of $\beta_{i,j}$. One of the advantages of using the semiparametric approach is that we can identify groups of sites that either conserve or radically change a set of similar amino acid properties. For example, sites 122 and 127 behave in opposite ways with respect to the first large cluster of properties related to Polarity (Zimmerman): site 122 strongly conserves properties in this cluster while site 127 radically changes them.

**Posterior summaries of **$\beta_{i,j}$**s different from zero for sites 82, 99, 120 and 127 in lysin data.** The first 4 properties on the x-axis belong to 4 different clusters and the next 2 do not belong to any specific cluster all the time. The vertical lines are 90% posterior intervals of the $\beta_{i,j}$s that are different from 0; the medians (filled circles) and the 25^{th} and 75^{th} percentiles (stars) are highlighted.

Table
lists the groups of sites associated with each cluster of properties, including the cluster containing Polarity (Zimmerman) and the isoelectric point, in agreement with Figure

| **Cluster** | **Site number** |
| --- | --- |
| 1 | 96 |
| 2 and 3 | 22, 28, 35, 51, 111, 117, 128 |
| 4 | 11, 17, 18, 19, 24, 25, 27, 29, 33, 35, 42, 43, 47, 49, 51, 53, 58, 64, 66, 68, 69, 71, 73, 79, 81, 88, 94, 96, 98, 100, 101, 104, 105, 110, 111, 114, 115, 117, 121, 122, 129, 131 |

The results are fairly robust to the choice of different hyperparameter values. Note that the scale factor for the prior variance does influence the estimated $\beta_{i,j}$ values, and it is advisable to choose it so that the prior variance for the unique $\beta_{i,j}$s is not too large.

Conclusions

In this paper, we present a Bayesian hierarchical regression model with a nested infinite relational model on the regression coefficients. The model is capable of identifying sites which radically change or strongly conserve amino acid properties. The (almost sure) discreteness of the DP realizations induces clustering at the level of properties which is analogous to the factor model in

The main advantage of the models we have described is their ability to simultaneously handle multiple properties with potentially correlated effects on molecular evolution. Our simulations suggest that our models are flexible but robust, being capable of dealing with a range of situations including those where properties are perfectly correlated, as well as those where all properties are uncorrelated. Our semiparametric regression models also work well, particularly in comparison with the regression model in

The NIRM that is the basis of our model defines a separately exchangeable prior on matrices. This means that the prior is invariant to the order in which properties and sites are included. This is due to the fact that the rows as well as the columns of the parameter of interest are independent draws from a DP. From the point of view of modeling multiple properties, this is a highly desirable property. However, assuming that DNA sites are exchangeable can be questionable. Although this is a potential limitation of our model, we should note that the assumption of independence across sites (which is a stronger assumption than exchangeability) underlies all the methods discussed in the Background section. If information about the 3-dimensional structure of the encoded protein or other sequence specific information that can guide the construction of the dependence model is available, our model could be easily extended to account for this feature. In the absence of such information, exchangeability across DNA sites seems to be a reasonable prior assumption. Indeed, in contrast to the most common independence assumption, our exchangeability assumption allows us to explain correlations at the level of sites.

In our applications, we have used codon substitution models for reconstructing ancestral sequences as we wished to compare our methods to other methods for detecting selective sites that also use codon substitution models, such as those implemented in

Finally, it is important to note that the “observed” distances are not really directly observed, but are instead constructed from ancestral sequences and, therefore, subject to error. A simple way to account for this additional level of uncertainty is to modify the computation of expected distances by incorporating the ideas of

Appendix: details about the Gibbs sampler

The truncations and the introduction of the configuration variables imply that (2) and (3) can be written as

with $\theta_{l,k} \sim G_{0,l,k}$, and with the column-level and row-level stick-breaking weights defined as above. Writing the model as in (5) helps in obtaining the forms of the full conditionals below.

The column indicators _{
j
} for

where
_{
l,k
}=0 or is
_{
l,k
} is different from zero. _{
k
} is sampled in two parts: first, by generating _{
k
}from a
_{
K
}=1, where _{
k
}is the number of columns assigned to cluster

The row indicators $r_{i,k}$ are also sampled from a multinomial with probabilities of the form

The updated row-level weights are sampled in a manner similar to the column-level ones; i.e., the corresponding stick proportions are generated from Beta full conditionals, with the proportion for the last atom set equal to 1, where the relevant count is the number of $\beta_{i,j}$s assigned to atom $l$ within column cluster $k$.
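In a blocked Gibbs sampler this conjugate update has a standard form: each stick proportion is drawn from a Beta distribution whose parameters add the number of observations assigned to that atom and to all later atoms. The Python sketch below is illustrative (the names are our own), not the authors' implementation:

```python
import random

def update_stick_weights(counts, precision, rng=random):
    """Conjugate update of truncated stick-breaking weights, given
    counts[l] = number of observations assigned to atom l+1.
    The last stick proportion is fixed at 1 so the weights sum to one."""
    L = len(counts)
    weights, remaining = [], 1.0
    for l in range(L):
        if l == L - 1:
            z = 1.0
        else:
            tail = sum(counts[l + 1:])   # assignments beyond atom l+1
            z = rng.betavariate(1.0 + counts[l], precision + tail)
        weights.append(remaining * z)
        remaining *= 1.0 - z
    return weights
```

Atoms with many assigned observations receive stochastically larger weights, while empty atoms fall back toward the prior stick-breaking distribution.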

Following
_{
k
} are sampled in two steps by introducing auxiliary variables _{1}and _{2}. First, sample _{1}from

and then

where
_{
j
}. Similarly, for each

where
_{
i,k
}, for a specific cluster of columns

To sample the unique $\theta_{l,k}$s, we introduce indicator variables that take the value 1 when $\theta_{l,k}$ is different from zero. Each indicator and the corresponding $\theta_{l,k}$ are jointly sampled in the following way: the indicator is sampled by integrating out $\theta_{l,k}$, and then $\theta_{l,k}$ is sampled conditional on the value of its indicator,

with the individual expressions obtained as follows.

First, let

where

where

The full conditional of

Finally, for _{
k
}is given by

where

and

.

Software availability

The R code implementing the models in the paper is freely available at

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

SD, AR and RP formulated the model. SD performed the analyses and drafted the manuscript. AR and RP revised the manuscript draft. All authors read and approved the final version of the manuscript.

Acknowledgements

RP and SD were supported by the NIH/NIGMS grant R01GM072003-02. AR was supported by the NIH/NIGMS grant R01GM090201-01.