Open Access Research article

Bayesian semiparametric regression models to characterize molecular evolution

Saheli Datta1*, Abel Rodriguez2 and Raquel Prado2

Author affiliations

1 Fred Hutchinson Cancer Research Center, Seattle, WA, USA

2 Department of Applied Mathematics and Statistics, University of California Santa Cruz, Santa Cruz, CA, USA


Citation and License

BMC Bioinformatics 2012, 13:278  doi:10.1186/1471-2105-13-278

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/13/278


Received: 15 December 2011
Accepted: 11 October 2012
Published: 30 October 2012

© 2012 Datta et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Statistical models and methods that associate changes in the physicochemical properties of amino acids with natural selection at the molecular level typically do not take into account the correlations between such properties. We propose a Bayesian hierarchical regression model with a generalization of the Dirichlet process prior on the distribution of the regression coefficients that describes the relationship between the changes in amino acid distances and natural selection in protein-coding DNA sequence alignments.

Results

The Bayesian semiparametric approach is illustrated with simulated data and the abalone lysin sperm data. Our method identifies groups of properties which, for this particular dataset, have a similar effect on evolution. The model also provides nonparametric site-specific estimates for the strength of conservation of these properties.

Conclusions

The model described here is distinguished by its ability to handle a large number of amino acid properties simultaneously, while taking into account that such data can be correlated. The multi-level clustering ability of the model allows for appealing interpretations of the results in terms of properties that are roughly equivalent from the standpoint of molecular evolution.

Background

The structural and functional role of a codon in a gene determines its ability to freely change. For example, nonsynonymous (amino acid altering) substitutions may not be tolerated at certain codon sites due to strong negative selection, while at other sites some nonsynonymous substitutions may be allowed if they do not affect key physicochemical properties associated with protein function [1]. Thus, at such preferentially changing sites, more frequent substitutions occur between physicochemically similar amino acids (or codons which lead to those amino acids) than dissimilar ones [2-4]. Methods which use changes in physicochemical amino acid properties have thus been proposed in the study of evolution. For example, [5-7] use distances to calculate deviations from neutrality for a particular amino acid property. Alternative approaches model the evolution of protein coding sequences as continuous-time Markov chains with rate matrices that distinguish between property-altering and property-conserving mutations as in [8] and [9]. More recently, [10] proposed a Bayesian hierarchical regression model that compares the observed amino acid distances to the expected distances under neutrality for a given set of amino acid properties and incorporates mixture priors for variable selection. The hierarchical mixture priors enable the model in [10] to identify neutral, conserved and radically changing sites, while automatically adjusting for multiple comparisons and borrowing information across properties and sites.

A common feature of all the methods listed above is the implicit assumption that properties are independent from each other in terms of their effect on evolution. A review of the amino acid index database (available, for example, at http://www.genome.jp/dbget/aaindex.html), which lists more than 500 amino acid properties, shows that a large number of them are highly correlated. Although the correlations we observe in the data can be different from those computed from the raw amino acid scores due to the influence of factors such as codon bias, by ignoring these correlations we are also ignoring the fact that correlated properties may affect a particular site in similar ways. Hence, approaches that do not take into account the correlations in the rates of mutations on different codons do not make use of key information about the relative importance of different physicochemical properties on molecular evolution.

A natural way to account for correlations in the data is to consider a factor structure (see, for example, [11]). However, selecting the number and order of the factors can be a difficult task in this type of factor model. In addition, the particular structure of the model in [11] makes it difficult to incorporate the effect of the factors on regions that are very strongly conserved. This paper extends the Bayesian hierarchical regression model in [10] by placing a nonparametric prior on the distribution of the regression coefficients describing the effect of properties on molecular evolution. The prior is an extension of the well-known Dirichlet process prior [12,13] to model separately exchangeable arrays [14,15]. As in [10], the main goal of the model described in this paper is to identify sites that are either strongly conserved or radically changing. In order to account for correlations across properties, our model clusters properties with similar effects on evolution and, within each such group, clusters sites with similar regression coefficients and nonparametrically estimates their distribution. In addition to accounting for correlations across properties, this structure allows us to dramatically reduce the number of parameters in the model and generate interpretable insights about molecular evolution at the codon level.

Although the clusters of properties can in principle be considered nuisance parameters that are of no direct interest, in practice posterior inference on the clustering structure can provide interesting insights about the molecular evolution process of a given gene. Indeed, as will become clear in the following sections, our approach incorporates the effect of amino acid usage bias. Hence, any significant differences between the cluster structure estimated from the observed protein-coding sequence alignment and the correlation structure derived from the raw distances between the properties in such a cluster can be interpreted as a signal of extreme amino acid usage bias in that particular region of the genome.

The rest of the paper is organized as follows. A brief review of DP mixture models, along with the details of our model, is provided in the Methods section. This section also includes a review of some of the currently available methods for characterizing molecular evolution that take into account changes in amino acid properties. The model is then evaluated via simulation studies and illustrated through a real data example. The simulated and real data analyses, as well as comparisons between the proposed semiparametric regression approach and other methods, are presented in Results and discussion. Finally, the Conclusions section provides our concluding remarks.

Methods

Dirichlet process mixture models

The Dirichlet process (DP) was formally introduced by [12] as a prior probability model for random distributions G. A DP(ρ, G0) prior for G is characterized by two parameters: a positive scalar parameter ρ and a parametric base distribution (or centering distribution) G0. The parameter ρ can be interpreted as the precision parameter, with larger values of ρ resulting in realizations of G that are closer to the base distribution G0.

One of the most commonly used definitions of the DP is its constructive definition [13], which characterizes DP realizations as countable mixtures of point masses. Specifically, a random distribution G generated from DP(ρ, G0) is almost surely of the form

$$G(\cdot) = \sum_{l=1}^{\infty} w_l \, \delta_{\phi_l}(\cdot),$$

where $\delta_{\phi_l}$ denotes a point mass at ϕl. The locations ϕl are i.i.d. draws from G0, while the corresponding weights wl are generated using the following “stick-breaking” mechanism. Let w1 = v1 and define $w_l = v_l \prod_{s=1}^{l-1} (1 - v_s)$ for l=2,3,…, where {vl:l=1,2,…} are i.i.d. draws from a Beta(1, ρ) distribution. Defining the weights in this way ensures $\sum_{l=1}^{\infty} w_l = 1$. Furthermore, the sequences {vl:l=1,2,…} and {ϕl:l=1,2,…} are independent.
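The stick-breaking mechanism described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `stick_breaking_weights`, the finite number of atoms `n_atoms`, and the random seed are assumptions introduced here.

```python
import random


def stick_breaking_weights(rho, n_atoms, rng):
    """Return the first n_atoms stick-breaking weights and the unassigned mass."""
    weights = []
    remaining = 1.0  # length of the stick that has not been assigned yet
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, rho)   # v_l ~ Beta(1, rho)
        weights.append(v * remaining)   # w_l = v_l * prod_{s<l} (1 - v_s)
        remaining *= 1.0 - v
    return weights, remaining


rng = random.Random(0)  # hypothetical seed, used only for reproducibility
weights, leftover = stick_breaking_weights(rho=1.0, n_atoms=50, rng=rng)
```

The weights are nonnegative and, together with the leftover mass, sum to one. Larger values of `rho` spread the mass over more atoms, which matches the interpretation of ρ as a precision parameter.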

The DP is most often used to model the distribution of random effects in hierarchical models. In the simplest case where no covariates are present, these models reduce to nonparametric mixture models (e.g., [16-18]). Assume that we have an independent sample of observations y1, y2, …, yn such that $y_i \mid \theta_i \sim k(\cdot\,; \theta_i)$, where k(·;θi) is a parametric density. Then, the DP mixture model places a DP prior on θi as

$$\theta_i \mid G \sim G, \qquad G \sim \mathrm{DP}(\rho, G_0).$$

The almost sure discreteness of realizations of G from the DP prior allows ties in θi, making DP mixture models appealing in applications where clustering is expected. The clustering nature is easier to see from the Pólya urn characterization of the DP [19], which gives the induced joint distribution for the θis by marginalizing G over its DP prior. Under that representation, we can write $\theta_i = \theta^{*}_{\xi_i}$, where $\theta^{*}_1, \theta^{*}_2, \ldots$ is an independent and identically distributed sample from G0 and the discrete indicators ξ1,…,ξn are sequentially generated with ξ1=1 and

$$\Pr(\xi_i = k \mid \xi_{i-1}, \ldots, \xi_1) = \begin{cases} \dfrac{n_k^{i-1}}{i - 1 + \rho}, & k \le K^{i-1}, \\[2mm] \dfrac{\rho}{i - 1 + \rho}, & k = K^{i-1} + 1, \end{cases}$$

where $K^{i-1} = \max_{j < i}\{\xi_j\}$ and

$$n_k^{i-1} = \sum_{j < i} \mathbf{1}(\xi_j = k).$$

One advantage of DP mixture models over other approaches to clustering and classification is that they allow us to automatically estimate the number of components in the mixture. Indeed, from the Pólya urn representation of the process it should be clear that, although the number of potential mixture components is infinite, the model implicitly places a prior on the number of components that, for moderate values of ρ, favors the data being generated by an effective number of components $K = \max_{i \le n}\{\xi_i\} < n$.
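The sequential generation of the indicators under the Pólya urn scheme can be illustrated as follows. This is a minimal sketch under the standard urn probabilities (an existing component is chosen with weight equal to its current size, a new one with weight ρ); the function name `polya_urn_indicators` and the seed are hypothetical.

```python
import random


def polya_urn_indicators(n, rho, rng):
    """Sequentially draw cluster indicators xi_1, ..., xi_n with xi_1 = 1."""
    xi = [1]
    counts = [1]  # counts[k - 1] = number of earlier indicators equal to k
    for _ in range(2, n + 1):
        # Existing component k has weight n_k; a new component has weight rho.
        labels = range(1, len(counts) + 2)
        k = rng.choices(labels, weights=counts + [rho])[0]
        if k == len(counts) + 1:
            counts.append(1)      # open a new component
        else:
            counts[k - 1] += 1    # join an existing component
        xi.append(k)
    return xi


rng = random.Random(0)  # hypothetical seed
xi = polya_urn_indicators(n=20, rho=1.0, rng=rng)
n_components = max(xi)  # effective number of components K
```

Because large components attract new observations more strongly, the number of occupied components typically grows much more slowly than n for moderate ρ.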

The model

Our data consist of observed and expected amino acid distances derived from a DNA sequence alignment, a specific phylogeny, a stochastic model of sequence evolution, and a predetermined set of physicochemical amino acid properties. In the analyses presented here, we disregard uncertainty in the alignment/phylogeny/ancestral sequence level since our main focus is the development and implementation of models that allow us to make inferences on the latent effects that several amino acid properties may have on molecular evolution for a given phylogeny and an underlying model of sequence evolution. Extensions of these analyses that take into account these uncertainties are briefly described in Conclusions. For further discussion on this issue, see also [10].

In order to calculate the observed distances, we first infer the ancestral sequences under a specific substitution model and a given phylogeny. In our applications, we use PAML version 3.15 [20] and the codon substitution model of [21], which accounts for the possibility of multiple substitutions at a given site. Nonsynonymous substitutions are then counted by comparing DNA sequences between two neighboring nodes in the phylogeny. The observed mean distance, denoted as yi,j for site i and property j, is obtained as the mean absolute difference in the property scores due to all nonsynonymous substitutions at site i. Only those sites with at least one nonsynonymous change from the ancestral level are retained for further analysis.
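As a small illustration of the observed mean distance, the sketch below averages the absolute change in a property score over the nonsynonymous substitutions inferred at one site. It is not the authors' pipeline: the function name is hypothetical, the substitutions are made up, and the scores are a subset of the Kyte-Doolittle hydropathy scale, chosen here only as an example property.

```python
# Subset of the Kyte-Doolittle hydropathy scale (illustrative choice of property).
HYDROPATHY = {"N": -3.5, "D": -3.5, "Q": -3.5, "E": -3.5,
              "K": -3.9, "I": 4.5, "V": 4.2}


def observed_mean_distance(substitutions, scores):
    """Mean absolute score difference over (ancestral, derived) amino acid pairs."""
    if not substitutions:
        return None  # sites without nonsynonymous changes are not retained
    diffs = [abs(scores[a] - scores[b]) for a, b in substitutions]
    return sum(diffs) / len(diffs)


# Hypothetical nonsynonymous substitutions inferred at two sites.
site_a = observed_mean_distance([("I", "V"), ("K", "E")], HYDROPATHY)
site_b = observed_mean_distance([("N", "D")], HYDROPATHY)
```

The second site shows that a nonsynonymous change can produce an observed distance of exactly zero when both amino acids share the same score.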

To compute the expected distances, note that each codon can mutate to one of at most nine alternative codons through a single nucleotide substitution [5], only some of which are nonsynonymous (changes to stop codons are ignored). Let Nk be the number of nonsynonymous mutations possible through a single nucleotide change, corresponding to a particular codon k (k=1,…,61). Let <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M12">View MathML</a> be the absolute difference in property j between nonsynonymous codon pairs at site i differing at one codon position, where l=1,…,Nk. The frequency of codon k at a particular site i in the DNA sequence under study is denoted by <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M13">View MathML</a>. Then, the expected mean distance for a particular site i and a given property j is given by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M14">View MathML</a>

We consider a hierarchical regression model that relates xi,j to yi,j and allows us to compare the expected and observed distances at the codon level for several properties simultaneously, with the following rationale. If a given site i is neutral with respect to property j, then yi,j ≈ xi,j. If property j is conserved at site i, then yi,j ≪ xi,j. Finally, if property j is radically changing at site i, then yi,j ≫ xi,j.

To construct our model, we first standardize the distances xi,j and yi,j by dividing them by the maximum possible distance for each property. This enables us to use priors with the same scale for all the regression coefficients. Our regression model for the standardized distances <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M15">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M16">View MathML</a>, for sites i=1,…,I and properties j=1,…,J, can be written as

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M17">View MathML</a>

(1)

where <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M18">View MathML</a> is the observed number of nonsynonymous changes at a particular site i and βi,j and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M19">View MathML</a> are the regression coefficient and variance parameter associated with site i and property j. The mixture model accounts for the fact that some of the <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M20">View MathML</a>s can be equal to zero, as some nonsynonymous changes do not change the value of the property being measured (e.g., asparagine, aspartic acid, glutamine and glutamic acid all have the same hydropathy score).

To complete the model, we need to describe a model for the matrix of regression coefficients βi,j. There are a number of possible models for this type of data which utilize Bayesian nonparametric methods; some recent examples include the infinite relational model (IRM) [22,23], the matrix stick breaking process (MSBP) [24], and the nested infinite relational model (NIRM) [14,15].

In this paper we focus on the NIRM, which is constructed by partitioning the original matrix into groups corresponding to entries with similar behavior. This is done by generating partitions in one of the dimensions of the matrix (say, rows) that are nested within clusters of the other dimension (columns). This structure allows us to identify groups of (typically correlated) properties with similar patterns and then, within each such group, identify clusters of sites with similar values of βi,j (Figure 1 provides a graphical representation of this idea). In our setting, we take <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M21">View MathML</a> and employ a NIRM to generate a prior for [θi,j].

Figure 1. Stylized representation of our model. Each sub table at the second level of clustering shares a common value for the regression coefficient βi,j. Rows correspond to properties, while columns correspond to sites.

More specifically, we denote by θj=(θ1,j,…,θI,j) the vector of regression coefficients and the associated variances corresponding to property (column) j. To obtain clusters for the properties, we assume that θj ∼ F, where

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M22">View MathML</a>

(2)

is a random distribution such that <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M23">View MathML</a>, vk ∼ Beta(1,ρ), and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M24">View MathML</a>. Indeed, the discrete nature of F ensures that ties among the θj happen with non-zero probability.

To obtain cluster-specific partitions for the sites (rows), Hk (the joint distribution associated with all sites for a given cluster of properties) has to be chosen carefully. In particular, we write <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M25">View MathML</a> for any specific cluster of properties k and let

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M26">View MathML</a>

(3)

with <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M27">View MathML</a>, ul,k ∼ Beta(1,γk) for every k, and φl,k independently drawn from the baseline measure G0,l,k.
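The nested structure described above (clusters of properties first, then clusters of sites within each cluster of properties) can be sketched by sampling cluster labels from truncated stick-breaking weights. This is an illustration only and not the authors' implementation: the function names, the truncation device of setting the last stick fraction to one, and the use of a single `gamma` shared by all property clusters are simplifying assumptions made here.

```python
import random


def truncated_stick_weights(concentration, n_atoms, rng):
    """Stick-breaking weights truncated at n_atoms; the last atom takes the rest."""
    weights, remaining = [], 1.0
    for l in range(n_atoms):
        v = 1.0 if l == n_atoms - 1 else rng.betavariate(1.0, concentration)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


def sample_nested_partition(n_sites, n_props, rho, gamma, K, L, rng):
    """Draw property-cluster labels, then site-cluster labels within each one."""
    prop_weights = truncated_stick_weights(rho, K, rng)
    zeta = rng.choices(range(K), weights=prop_weights, k=n_props)
    xi = {}
    for k in sorted(set(zeta)):
        # Each occupied property cluster gets its own partition of the sites.
        site_weights = truncated_stick_weights(gamma, L, rng)
        xi[k] = rng.choices(range(L), weights=site_weights, k=n_sites)
    return zeta, xi


rng = random.Random(1)  # hypothetical seed
zeta, xi = sample_nested_partition(n_sites=94, n_props=32, rho=1.0,
                                   gamma=1.0, K=25, L=25, rng=rng)
```

Here `zeta[j]` is the cluster of property j and `xi[k][i]` is the cluster of site i within property cluster k, so two properties in the same cluster share one partition of the sites.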

The baseline measure G0,l,k is chosen to accommodate the fact that some <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M28">View MathML</a>s can be zero, since some nonsynonymous changes can keep the value of the property being measured unchanged. Thus, G0,l,k is a mixture with a point mass at zero and a continuous density otherwise. To allow for a more flexible model, we assume that different prior variances are associated with the <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M29">View MathML</a>s which are zero and those <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M30">View MathML</a>s that are different from zero, with the specific form of G0,l,k given below.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M31">View MathML</a>

(4)

with

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M32">View MathML</a>

where <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M33">View MathML</a> ∼ Inv-Ga(aκ,bκ), <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M34">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M35">View MathML</a> ∼ Inv-Ga<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M36">View MathML</a>. Here ϕl,k and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M37">View MathML</a> respectively denote the unique values βi,j and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M38">View MathML</a> can take, whereas λ is the prior probability that ϕl,k has the value zero (i.e., the properties associated with this cluster are strongly conserved at this cluster of sites).
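The point-mass-plus-continuous structure of the baseline measure for the regression coefficients can be illustrated with a short sketch. Only the mixture structure (zero with probability λ, a continuous draw otherwise) comes from the text; the normal form of the continuous part, its variance `slab_var`, the function name and the numerical values are assumptions made purely for illustration.

```python
import math
import random


def draw_coefficient(lam, alpha_k, slab_var, rng):
    """Draw a coefficient: exactly 0 with probability lam, continuous otherwise."""
    if rng.random() < lam:
        return 0.0  # point mass at zero: property strongly conserved
    # Assumed continuous component: normal centred at the cluster mean alpha_k.
    return rng.gauss(alpha_k, math.sqrt(slab_var))


rng = random.Random(0)  # hypothetical seed
draws = [draw_coefficient(lam=0.3, alpha_k=1.0, slab_var=0.25, rng=rng)
         for _ in range(1000)]
fraction_zero = sum(d == 0.0 for d in draws) / len(draws)
```

With many draws, `fraction_zero` should be close to the chosen λ, while the nonzero draws scatter around `alpha_k`.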

Note that our model implies that both sites and properties are exchangeable a priori. If no additional prior information is available, this type of assumption seems reasonable. However, a posteriori, it is possible to have sites behave differently in different clusters.

To complete the model we place hyperpriors on all parameters of the resulting model. Conjugate priors are chosen for ease of computation. αk denotes the mean for the ϕl,ks that are different from zero belonging to a specific cluster of properties k and is assumed to have a N(mα,Cα) prior for all k. The DP concentration parameters ρ and γk are assumed to follow Ga(aρ,bρ) with mean aρ/bρ, and Ga(aγ,bγ) with mean aγ/bγ for all k, respectively. λ, which is the prior probability for the point mass at 0 in G0,l,k, follows a Beta(aλ,bλ). The specific choice of hyperparameters is discussed later as part of each data analysis. In general, we use Ga(1,1) priors for the DP concentration parameters and a N(1,Cα) prior for αk to correspond to our assumption of neutrality a priori for the properties.

Related work

We compare results from our proposed method with results from a few currently available methods that aim to characterize molecular evolution while also taking into account changes in amino acid properties, namely, the regression model in [10], TreeSAAP [25], and EvoRadical [9].

In [10], the first level of the model is the regression equation on <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M39">View MathML</a> as in equation (1), but it implicitly assumes independence among properties and independence among sites unlike our current model. The model in [10] is suitable for use when a few mostly independent amino acid properties are being analyzed whereas the new semiparametric model is better suited to the analysis of a large number of possibly correlated properties.

TreeSAAP uses the methods of [6] to classify nonsynonymous substitutions into one of M categories, with higher numbered categories corresponding to sites showing radical changes and lower numbered categories used for sites showing conserved changes for a given property. For the analysis considered here, we used 8 categories, where categories 6, 7, and 8 corresponded to sites showing radical changes, and categories 1 and 2 to sites showing conserved changes. Nonsynonymous changes are inferred from the ancestral reconstruction using the nucleotide substitution models in baseml implemented in PAML. We used a Bonferroni correction to correct for multiple comparisons.

EvoRadical implements the models of [9], which use partitions of amino acids to parameterize the rates of property-conserving and property-altering codon substitutions in a maximum likelihood framework. The model considers three types of substitutions (synonymous, property-conserving nonsynonymous and property-altering nonsynonymous), which is a slight improvement over [8]. For analyses with multiple properties, one has to create different partitions for the different properties and run EvoRadical for each property.

Posterior simulation

Various algorithms exist for posterior inference in DP mixtures. Some of the most popular ones use (i) the Pólya urn characterization to marginalize out the unknown distribution(s) [26,27], (ii) a truncation approximation to the stick-breaking representation of the process, which paves the way for the use of methods employed in finite mixture models [28,29], or (iii) reversible jump MCMC or split-merge methods [30,31]. Other recent approaches have used variational methods [32] and slice samplers [33].

We use an extension of the finite mixture approximation discussed in [28] for its ease of implementation. Truncating F at a sufficiently large K, we write <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M40">View MathML</a>, with the weights Πk and locations <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M41">View MathML</a> generated as described earlier in this section. Next we introduce configuration variables {ζj} such that, for k=1,…,K, ζj=k if and only if <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M42">View MathML</a>. Similarly for Gk, we truncate at a sufficiently large level L and introduce another set of configuration variables {ξi,k}, where ξi,k=l, with l=1,…,L, if and only if <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M43">View MathML</a>. Additional details about the algorithm are provided in the Appendix.

To determine the truncation levels K and L, we follow [29]. In particular, note that conditional on ρ (the DP concentration parameter), the tail probability <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M44">View MathML</a> has expectation {ρ/(1 + ρ)}^{K−1}. Using prior guesses for ρ and an acceptable tolerance level for the tail probability, one can then solve for the truncation level K. In our analyses, we used K and L in the range of 25 to 35. These values are in line with those used in other applications (for example, see [34]).
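Solving for the truncation level from the expected tail probability is a one-line calculation: requiring {ρ/(1 + ρ)}^{K−1} ≤ ε gives K ≥ 1 + log ε / log{ρ/(1 + ρ)}. The sketch below implements this; the function name `truncation_level` and the tolerance value are assumptions used for illustration, not values reported in the paper.

```python
import math


def truncation_level(rho, tol):
    """Smallest K such that {rho / (1 + rho)}**(K - 1) <= tol.

    Requires rho > 0 and 0 < tol < 1.
    """
    ratio = rho / (1.0 + rho)  # expected fraction of the stick left after one break
    return 1 + math.ceil(math.log(tol) / math.log(ratio))


K = truncation_level(rho=1.0, tol=1e-8)
```

With ρ = 1 and a tolerance of 1e-8 this gives K = 28, which falls within the range of 25 to 35 mentioned above. Larger prior guesses for ρ require larger truncation levels for the same tolerance.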

Results and discussion

Empirical exploration via simulation studies

We present two simulation studies to check the performance of the model under different scenarios. Additional simulation scenarios that may be of interest are available in Additional file 1.

Additional file 1. Additional simulations are provided in a separate supplemental file.


Simulation study 1

The setup for the first simulation is as follows. We generate values for the distinct regression coefficients (ϕl,k) from a N(1,0.25) distribution. The number of distinct regression coefficients depends on the particular clustering structure for the corresponding simulation. Once we obtain the regression coefficients, we generate observations yi,j from N(ϕl,k xi,j, σ2=0.001). The xi,js are obtained from the lysin data set described below with analyses for 32 properties, which implies J=32 and I=94.
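The data-generating mechanism just described can be sketched as follows. This is a toy version under stated assumptions: the second argument of N(·,·) is read as a variance, the expected distances `x` are random stand-ins rather than values from the lysin data, and the small clustering structure (`col_cluster`, `row_cluster`) is hypothetical.

```python
import math
import random


def simulate_distances(x, col_cluster, row_cluster, rng,
                       coef_var=0.25, noise_var=0.001):
    """Simulate y[i][j] ~ N(phi[l, k] * x[i][j], noise_var) for a nested clustering."""
    n_sites, n_props = len(x), len(x[0])
    phi = {}  # distinct regression coefficients, one per (site cluster, property cluster)
    y = [[0.0] * n_props for _ in range(n_sites)]
    for j in range(n_props):
        k = col_cluster[j]            # cluster of property j
        for i in range(n_sites):
            l = row_cluster[k][i]     # cluster of site i within property cluster k
            if (l, k) not in phi:
                phi[(l, k)] = rng.gauss(1.0, math.sqrt(coef_var))
            y[i][j] = rng.gauss(phi[(l, k)] * x[i][j], math.sqrt(noise_var))
    return phi, y


rng = random.Random(2)  # hypothetical seed
n_sites, n_props = 6, 4
x = [[rng.uniform(0.05, 1.0) for _ in range(n_props)] for _ in range(n_sites)]
col_cluster = [0, 0, 1, 1]
row_cluster = {0: [0, 0, 0, 1, 1, 1], 1: [0, 0, 1, 1, 2, 2]}
phi, y = simulate_distances(x, col_cluster, row_cluster, rng)
```

In this toy structure there are two clusters of properties with two and three clusters of sites, respectively, so five distinct coefficients are generated.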

We fitted the model described in the subsection The model to the <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M45">View MathML</a>s and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M46','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M46">View MathML</a>s, with the following modifications: (i) the NIRM is imposed on βi,j, so φl,k=ϕl,k and (ii) ϕl,k ∼ G0, where G0 = N(α,τ2). We used K=25 and L=25 for the simulations. The MCMC algorithm was run with the following hyperpriors: ρ ∼ Ga(1,1), γk ∼ Ga(1,1) for all k, and α ∼ N(1,0.25). The priors σ2 ∼ Inv-Ga(100, 10) and τ2 ∼ Inv-Ga(2,4) were chosen such that the prior means corresponded to the true values for these hyperparameters. Results are based on 15000 iterations, with the first 5000 discarded as burn-in. Convergence was assessed by running two chains, where each chain was initialized by randomly assigning the βi,js to different partitions. Posterior summaries based on the two chains were consistent with each other.

In this scenario, we had four clusters for the columns, each with a differing number of groups, leading to twelve distinct cluster combinations for the entire matrix of βi,js (Figure 2, left panel). Figure 3 shows the marginal posterior probabilities of any two columns (properties) belonging to the same cluster. The model correctly identifies that there are 4 clusters for the columns and assigns each set of columns to its corresponding cluster with no uncertainty.

Figure 2. Image plots of the true βi,j values (left panel) and the posterior means β̂i,j (right panel).

Figure 3. Marginal posterior probabilities of each pair of columns belonging to the same cluster.
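Pairwise probabilities of the kind summarized in Figure 3 are typically computed from the sampled cluster indicators. The following is a minimal sketch, assuming the MCMC output is available as an integer array with one row per retained draw and one column per item (property or site); the array `draws` is a made-up toy example, not output from the model.

```python
import numpy as np

def coclustering_probabilities(labels):
    """labels: (n_draws, n_items) integer cluster indicators from MCMC draws.

    Returns an (n_items, n_items) matrix whose (a, b) entry is the fraction
    of draws in which items a and b share a cluster label.
    """
    labels = np.asarray(labels)
    same = labels[:, :, None] == labels[:, None, :]   # (draws, items, items)
    return same.mean(axis=0)

# Toy indicators: 4 draws, 4 items. Label values differ between draws,
# but only co-membership matters.
draws = np.array([[0, 0, 1, 1],
                  [2, 2, 0, 1],
                  [1, 1, 3, 3],
                  [0, 0, 0, 1]])
P = coclustering_probabilities(draws)
```

Working with co-membership rather than the labels themselves makes the summary invariant to label switching across iterations.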

Similar graphical summaries obtained for the structure of rows within each cluster of columns show that the correct clustering structures for the rows, within each cluster of columns, are inferred (see Figure 4). For this level, however, there is some uncertainty about the cluster membership of a few rows. See, for example, the right panel of Figure 4: some rows in cluster 1 (in the lower left) are sometimes assigned to cluster 3 (top right). The distinct values of ϕ used for these two clusters were 0.73 and 0.98, so some uncertainty in the cluster assignments is not unreasonable. The posterior means of the βi,js agree closely with the true values, as shown in Figure 2.

Figure 4. Marginal posterior probabilities of each pair of rows belonging to the same cluster for two different clusters of columns.

This scenario corresponds to the type of situation we expect in most real datasets: properties will cluster into groups and, within each group of properties, clusters of sites with similar responses can be clearly identified. Our results suggest that, as expected, the model is capable of identifying these multiple clusters with high accuracy and therefore of accurately estimating the values of the regression coefficients. Other scenarios, including extreme cases where all properties belong to a common cluster while sites belong to one of several clusters, and cases where each property has a different effect on amino acid rates, are available in Additional file 1.

To investigate the effect of the truncation levels and the priors on our model, we performed a sensitivity analysis by varying the truncation levels as well as the different hyperparameters. Increasing the truncation level to 35 did not affect the results, and the estimated posterior means of the βi,js showed close agreement with the true values. The analysis was also fairly robust to the choice of the priors, since varying the hyperparameters had almost no effect on the results. Decreasing the prior variance of τ2 makes the results marginally better, i.e., the posterior means of the βi,js (denoted β̂i,j) are slightly closer to the true values.

Simulation study 2 - data simulated from a biological model

In our second simulation study the model is evaluated in the context of biological sequences generated from an evolutionary model. In particular, a Markov model was used to generate 20 sequences of 90 codons each. For the first one-third of the sites (sites 1-30) we used transition probabilities obtained from the codon-substitution model of [21] with equal equilibrium probabilities for all 61 codons. For the second one-third of the sites (sites 31-60), we modified the transition probability matrix from the previous step by increasing the probabilities of transitions between codons that have small distances for volume and decreasing the probabilities of transitions between codons that have large distances for volume - this was done to encourage only those changes that conserve volume in this part of the sequences. Finally, for the last one-third of the sites (sites 61-90), we modified the original transition probability to encourage radical changes in hydropathy. Thus, we increased some transition probabilities between codons that have very different hydropathy scores and decreased a few of those that have similar hydropathy scores. Note that, since the equilibrium probabilities are either uniform or roughly uniform across all sites, the correlation structure across properties is retained in the expected distances, which simplifies the interpretation of the results.
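The text does not specify how the transition probabilities were modified, so the following is only one hypothetical way to do it: off-diagonal entries are reweighted by exp(−strength × distance) and rescaled so that every row keeps its original probability of change. The function name, the exponential form and the `strength` constant are all assumptions made for illustration; the 3-state matrix is a toy stand-in for the 61 × 61 codon matrix.

```python
import numpy as np

def favor_conservative_changes(P, dist, strength=1.0):
    """Reweight a row-stochastic matrix to favor small property distances.

    P        : (n, n) transition probabilities between states (codons)
    dist     : (n, n) non-negative property distances between states
    strength : hypothetical tuning constant; a negative value would instead
               favor radical (large-distance) changes
    Each row must have some off-diagonal probability mass.
    """
    P = np.asarray(P, dtype=float)
    dist = np.asarray(dist, dtype=float)
    off = P.copy()
    np.fill_diagonal(off, 0.0)
    weighted = off * np.exp(-strength * dist)
    # Rescale so each row keeps its original total off-diagonal probability.
    scale = off.sum(axis=1, keepdims=True) / weighted.sum(axis=1, keepdims=True)
    Q = weighted * scale
    np.fill_diagonal(Q, np.diag(P))                  # leave the diagonal unchanged
    return Q

P = np.full((3, 3), 1.0 / 3.0)                       # toy uniform transition matrix
dist = np.array([[0.0, 1.0, 3.0],
                 [1.0, 0.0, 2.0],
                 [3.0, 2.0, 0.0]])
Q = favor_conservative_changes(P, dist)
```

Calling the same function with a negative `strength` and hydropathy distances would mimic the treatment described for sites 61-90, again only as a sketch.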

Once we obtained the sequences, we generated ancestral sequences using PAML, version 3.15 [20], and calculated observed and expected distances yi,j and xi,j for five properties, namely, hydropathy (h), volume (Mv), polarity (p), isoelectric point (pHi) and partial specific volume (V0). Of these, h and p are correlated, and so are Mv and V0.

Our model was fitted with K=25 and L=25 as truncation levels. The prior distributions were the same as the ones used for our previous simulation. Results are based on 15000 iterations, of which the first 5000 were burn-in. There did not seem to be any obvious problems with convergence, which was assessed by visual inspection of trace plots of some of the parameters.

The analysis found three clusters of properties: the first cluster comprises properties h and p, the second comprises Mv and V0, and the third contains only pHi, as shown in Figure 5. Figure 6 shows the posterior means of the βi,js for representative properties of the three clusters in Figure 5. Sites 24, 65, 67, 71, 81, 82, and 89 have large posterior means β̂i,j for cluster 1 (h and p). These are also the sites that show up in the small cluster at the top right in Figure 7, which shows how often any two sites in cluster 1 are grouped together. The sites in the lower left (16, 28, 46, 51) have small posterior means β̂i,j for these properties (h and p) and are grouped together more often. The big group of sites in the middle mostly have posterior means β̂i,j around 1, while sites 81, 89, 71, and 65 have the largest β̂i,j values and very large probabilities of being clustered together in cluster 1. Thus, the model successfully identifies sites that have similar βi,j values in a specific cluster and groups them together. Groups of sites that change a property can also be identified for clusters 2 and 3 in Figure 5. In particular, for cluster 2 (Mv and V0), there is a big group of sites that conserve these properties. Most of these sites are in the central one-third of the sequence (sites 31-60), which was simulated under a transition probability matrix that favors transitions that conserve volume. Finally, for cluster 3 (pHi) there is one large group of sites that conserve the property and one group, comprising sites 39 and 80, that changes the property greatly.

Figure 5. Marginal posterior probabilities of any two properties being in the same cluster for the data simulated under a biological model.

Figure 6. Posterior means of the βi,js for the three clusters in Figure 5 for the data simulated under a biological model. The sites are sorted by increasing value of the posterior means.

Figure 7. Marginal posterior probabilities of any two sites being grouped together in the first cluster in Figure 5, for the data simulated under a biological model. The sites are sorted by increasing value of the posterior means of the βi,js.

To better understand the performance of our method, we also analyzed the sequences generated above with the parametric regression model in [10], TreeSAAP [25], and EvoRadical [9]. Table 1 lists the thirty sites with the largest posterior means β̂i,j for h, and the thirty sites with the smallest posterior means β̂i,j for Mv, for the regression model of [10] and for our new semiparametric approach. Many of the same sites are identified by both methods; however, our new method performs slightly better than the regression model in [10]. In particular, the new method identifies two additional sites in the 61-90 region as sites that change h.

Table 1. Comparing results between the model in [10] and the new semiparametric model, for the data simulated under a biological model

Table 2 lists sites that TreeSAAP finds significant for the different properties. All of the sites that TreeSAAP finds significant are also identified by our methods. However, note that once we correct for multiple comparisons in the TreeSAAP results, only one site (74) still remains significant. We note that the hierarchical specification of the priors in our models automatically accounts for multiple comparisons and no corrections are needed (see [10] for more discussion on this).

Table 2. Sites identified as significant by TreeSAAP for the different properties for the simulation study based on a biological model

Finally, we analyzed the sequences generated previously with EvoRadical using two different partitions [8], one for p and the other for Mv. We chose to run EvoRadical with p instead of h, since a partition of the amino acids for polarity was already available in [8]. Additionally, given that h and p are correlated, we expect to see somewhat similar results for these two properties.

Table 3 lists site-specific results from EvoRadical. The sites listed have high posterior probabilities (>0.95) of being in the different site classes. This was the criterion used to identify significant sites in [9]. The results presented here correspond to Model A1 in [9], in which ω denotes the nonsynonymous to synonymous substitution rate ratio for codons encoding amino acids with properties in the same partition, and γ measures the nonsynonymous to synonymous substitution rate ratio between codons for properties belonging to different partitions. While the sites listed for p somewhat match results from the other methods, the results for Mv are not in agreement. This is probably because partitions are not always directly comparable with the amino acid distances. For example, under the volume partition of [8], both glycine and valine are small and glutamine is large, whereas according to the volume scores glycine is very different from both valine and glutamine. Thus, our models would consider a change from glycine to valine as radical, whereas the partition-based method of [9] would register no change. The fact that the user has to define a property-specific partition in advance, as opposed to working directly with the physicochemical distances, is one of the disadvantages of partition-based methods.

Table 3. Sites that have high posterior probabilities (>0.95) of belonging to each site class for the different partitions for EvoRadical for the simulated data

Illustration with Lysin data

Our proposed model was applied to the sperm lysin data set, which consists of cDNA from 25 abalone species with 135 codons in each sequence [35]. Sites with alignment gaps were removed from all sequences, which resulted in 122 codons for the analysis presented here. The phylogeny of [35] and the codon substitution model M8 in PAML, version 3.15 [20], were used to generate the ancestral sequences. Model M8 uses a discretized beta distribution to model ω values between zero and one with probability p0 and allows for an additional positive selection category with ω>1 and probability p1.

The lysin data were analyzed with the model in The Model subsection with the 32 amino acid properties listed in Table 4. A few of the properties were chosen because of their functional importance. Some of the other properties have been previously used in analyses by [25]. Only sites which showed at least one nonsynonymous change were retained for the final analysis, which led to a data set with 94 sites. We used K=25 and L=35 as truncation levels for these data. The following prior distributions and hyperparameters were used in the analysis. The DP concentration parameters ρ and γk were assumed to follow a Ga(1,1) distribution. λ, the prior probability of ϕl,k being 0, was assumed to follow a Beta(2,8) distribution, which implies that about 20% of the unique βi,js were expected to be 0 a priori. aκ and bκ, the hyperparameters for the prior of <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M59','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M59">View MathML</a> when ϕl,k is 0, were chosen as 2 and 100, which implies a prior mean of 0.01. When ϕl,k is different from zero, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M60','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M60">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M61','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M61">View MathML</a> control the prior for <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M62','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M62">View MathML</a>. V0, the scale factor for <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M63','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M63">View MathML</a>, was fixed at the ratio of the prior means of σ2 and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M64','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M64">View MathML</a> (the variance terms in the regression model in [10], for which we had used prior means of 0.1 and 0.01, respectively). Finally, the αks were assumed to follow a N(1, 0.25) distribution to conform to our prior assumption of neutrality for the properties. Results are based on 20000 iterations, of which the first 10000 were burn-in. Convergence was assessed by visual inspection of trace plots of some of the parameters, and there did not seem to be any obvious problems with convergence.

Table 4. List of 32 amino acid properties used in the analysis

Figure 8 shows the marginal posterior probabilities of any two properties being assigned to the same cluster. There seem to be four mostly distinct clusters among the properties in our list. The biggest cluster consists of 20 properties that are related to polarity and hydropathy. All 20 properties are assigned to this cluster with very high probability. The next cluster comprises the properties Bl and c. There is also a fairly big cluster whose members are related to volume (Mv, V0, Mw, Cα, μ). pzim, which is correlated with p to some extent, is clustered with pHi, with which it shows a large correlation (about 0.9). There is some uncertainty regarding the membership of K0 and Esm, since both of them are assigned to the largest cluster about 50% of the time, while Esm is clustered with properties related to volume to a lesser extent. pK1 is the only property that is almost never clustered with other properties.

Figure 8. Marginal posterior probabilities of any two properties being in the same cluster for the lysin data.

Site-specific results based on the posterior means (denoted by β̂i,j) for one representative property from each of the four clusters in Figure 8 are shown in Figure 9. The sites are sorted by increasing value of the mean β̂i,j for each image. Sites on the far right radically change properties in each group. For example, most of the sites that appear on the far right for cluster 1 (represented by h), such as sites 15, 16, 21, 75, 82, 99 and 126, have β̂i,j values of 1.2-1.4. There seem to be more sites radically changing properties in cluster 1 than in clusters 2 (represented by c) or 3 (represented by Mv). The first three clusters also have a fairly large number of sites with mean β̂i,j between 0 and 1. This is different from what we see for cluster 4 (represented by pzim), which corresponds to properties pzim and pHi. A large number of sites in cluster 4 strongly conserve the properties (e.g., sites 35, 43, 49, 51, 64, 114, 117, 121), as is evident from the very small mean β̂i,j values for sites on the far left, unlike in the other clusters.

Figure 9. Posterior means β̂i,j for the four clusters (denoted by representative properties) in Figure 8 for lysin. The sites are sorted by increasing value of the posterior means.

Figure 10 shows the posterior summaries of the βi,js different from zero for sites 82, 99, 120 and 127 for properties belonging to different clusters. Of these, sites 120 and 127 were found to be under positive selection by PAML, while sites 82, 99 and 127 were identified as radically changing some of the properties by the regression model in [10]. The sites show different behavior for the different properties; for example, site 82 shows radical changes for h, while it conserves Mv. We can also see similarities in the posterior summaries across sites. For example, for property pK1, sites 82, 120 and 127 have similar values of βi,j. One of the advantages of the semiparametric approach is that we can identify groups of sites that either conserve or radically change a set of similar amino acid properties. For example, sites 122 and 127 both seem to alter the amino acid properties in the first large cluster of properties related to p and h. However, sites 122 and 127 behave very differently in cluster 4, related to pzim: site 122 strongly conserves properties in this cluster while site 127 radically changes them.

Figure 10. Posterior summaries of the βi,js different from zero for sites 82, 99, 120 and 127 in the lysin data. The first 4 properties on the x-axis belong to 4 different clusters, and the next 2 are not consistently assigned to any specific cluster. The vertical lines are 90% posterior intervals of the βi,js that are different from 0; the medians (filled circles) and the 25th and 75th percentiles (stars) are highlighted.

Table 5 lists sites that are highly conserved, with posterior means β̂i,j less than 0.4, for the different clusters. The largest number of highly conserved sites appears in cluster 4, which includes properties pzim and pHi, in agreement with Figure 9. Some of these sites, such as 35, 51, 111 and 117, also conserve properties in clusters 2 and 3. A number of them, such as sites 28, 35, 58, 66, 94, 104, 117, and 128, are also identified as sites under negative selection by methods that take into account the nonsynonymous to synonymous rate ratio, such as those implemented in PAML [20]. In order to determine which sites are under positive and negative selection by PAML, we follow an approach similar to that used by [35] in the analysis of the lysin data. In particular, [35] found that PAML model M8, which supports positive selection, is the model that better fits the lysin data. Therefore, we classified sites as negatively selected if the estimated ω was smaller than 0.3 and if Pr(ω>1|data)<0.5 using PAML model M8. Results comparing sites conserving or radically changing a small group of properties with sites inferred to be under positive or negative selection by PAML were also presented in [10].
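The classification rule for negative selection stated above is simple enough to write down directly. In this sketch the per-site values are invented for illustration and do not come from the lysin analysis; the function and argument names are likewise hypothetical.

```python
def negatively_selected(omega_hat, prob_omega_gt_1,
                        omega_cutoff=0.3, prob_cutoff=0.5):
    """Rule from the text: estimated omega < 0.3 and Pr(omega > 1 | data) < 0.5."""
    return omega_hat < omega_cutoff and prob_omega_gt_1 < prob_cutoff

# Hypothetical per-site summaries: site -> (estimated omega, Pr(omega > 1 | data)).
site_summaries = {
    "site_a": (0.12, 0.01),   # low omega, low probability of omega > 1
    "site_b": (0.45, 0.20),   # omega above the 0.3 cutoff
    "site_c": (2.80, 0.97),   # strong evidence for omega > 1
}
negative_sites = [s for s, (w, p) in site_summaries.items()
                  if negatively_selected(w, p)]
```

Both conditions must hold, so a site with a small estimated ω but a substantial posterior probability of ω>1 would not be labeled as negatively selected.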

Table 5. Strongly conserved sites (β̂i,j < 0.4) for the lysin data for different clusters

The results are fairly robust to the choice of different hyperparameter values. Note that the scale factor for <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M74">View MathML</a> ultimately affects the variation in the βi,j values, and it is advisable to choose it so that the prior variance for the unique βi,js is not too large.

Conclusions

In this paper, we present a Bayesian hierarchical regression model with a nested infinite relational model on the regression coefficients. The model is capable of identifying sites which show radical or conserved amino acid changes. The (almost sure) discreteness of the DP realizations induces clustering at the level of properties, which is analogous to the factor model in [11], with the advantage that the nonparametric method automatically determines the appropriate number of clusters. The multi-level clustering ability of the NIRM also induces clustering at the level of sites and allows us to capture skewness and heterogeneity in the random effects distribution associated with each cluster of properties.

The main advantage of the models we have described is their ability to simultaneously handle multiple properties with potentially correlated effects on molecular evolution. Our simulations suggest that our models are flexible but robust, being capable of dealing with a range of situations including those where properties are perfectly correlated, as well as those where all properties are uncorrelated. Our semiparametric regression models also work well, particularly in comparison with the regression model in [10], TreeSAAP and EvoRadical, when applied to DNA sequence data generated from an evolutionary model. In addition, the analysis of the lysin data suggests that the model leads to reasonable results.

The NIRM that is the basis of our model defines a separately exchangeable prior on matrices. This means that the prior is invariant to the order in which properties and sites are included. This is due to the fact that the rows as well as the columns of the parameter of interest are independent draws from a DP. From the point of view of modeling multiple properties, this is a highly desirable property. However, assuming that DNA sites are exchangeable can be questionable. Although this is a potential limitation of our model, we should note that the assumption of independence across sites (which is a stronger assumption than exchangeability) underlies all the methods discussed in the Background section. If information about the 3-dimensional structure of the encoded protein, or other sequence-specific information that can guide the construction of the dependence model, is available, our model could be easily extended to account for this feature. In the absence of such information, exchangeability across DNA sites seems to be a reasonable prior assumption. Indeed, in contrast to the more common independence assumption, our exchangeability assumption allows us to explain correlations at the level of sites.

In our applications, we have used codon substitution models for reconstructing ancestral sequences as we wished to compare our methods to other methods for detecting selective sites that also use codon substitution models, such as those implemented in PAML and EvoRadical. However, it is possible to perform the proposed Bayesian semiparametric analyses using amino acid substitution models instead of codon substitution models. Note that the substitution model is only used in the calculation of the observed distances. First, we infer the ancestral sequences under a specific substitution model and a given phylogeny. We then compute the observed distances for a given property and a given site as the mean absolute difference in property scores due to all nonsynonymous substitutions at that site, where the nonsynonymous substitutions are counted by comparing the DNA sequences between two neighboring nodes in the phylogeny. The reconstructed ancestral sequences, and therefore the observed distances in our model, may differ under different substitution models, but the method can be implemented under any substitution model, including amino acid substitution models. The gain in execution time from using amino acid substitution models instead of codon-based ones could potentially be significant if the uncertainty in the alignment/phylogeny/ancestral level is taken into account.
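The definition of the observed distance given above (the mean absolute difference in property scores over the nonsynonymous substitutions at a site) can be sketched as follows. Two simplifications are assumed here: each branch change is supplied directly as a (parent, child) amino acid pair rather than being derived from codon sequences, and the Kyte-Doolittle hydropathy scale is used purely as an illustrative set of scores, which may differ from the scale used in the analyses.

```python
# Kyte-Doolittle hydropathy scores, used here only as an illustrative scale.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def observed_distance(substitutions, scores):
    """Mean absolute score difference over nonsynonymous substitutions.

    substitutions : iterable of (parent_aa, child_aa) pairs, one per branch on
                    which the codon at this site changed between neighboring nodes
    scores        : dict mapping one-letter amino acid codes to property scores
    Returns None when the site has no nonsynonymous substitution.
    """
    diffs = [abs(scores[a] - scores[b]) for a, b in substitutions if a != b]
    if not diffs:
        return None
    return sum(diffs) / len(diffs)

# Hypothetical site: two nonsynonymous changes and one synonymous change.
site_changes = [("G", "V"), ("V", "I"), ("L", "L")]
d_obs = observed_distance(site_changes, HYDROPATHY)
```

Since only amino acid pairs enter the calculation, the same function applies whether the ancestral states were reconstructed under a codon model or an amino acid model.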

Finally, it is important to note that the “observed” distances are not really directly observed, but are instead constructed from ancestral sequences and, therefore, subject to error. A simple way to account for this additional level of uncertainty is to modify the computation of expected distances by incorporating the ideas of [37]. This approach was previously employed in [10], with little impact on the final results.

Appendix: details about the Gibbs sampler

The truncations and the introduction of the configuration variables imply that (2) and (3) can be written as

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M75','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M75">View MathML</a>

(5)

with φl,k ∼ G0 for all l and k, and Πk and wl,k being the appropriate stick-breaking weights. Writing the model as in (5) helps in obtaining the forms of the full conditionals given below.

The column indicators ζj for j=1,…,J are sampled from a multinomial distribution with probabilities

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M76','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M76">View MathML</a>

where <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M77','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M77">View MathML</a> is <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M78','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M78">View MathML</a> if ϕl,k=0 or is <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M79','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M79">View MathML</a> if ϕl,k is different from zero. Πk is sampled in two parts: first, by generating vk from a <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M80','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M80">View MathML</a> for k=1,…,K−1 and vK=1, where mk is the number of columns assigned to cluster k, and then by constructing <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M81','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M81">View MathML</a>.

For i=1,…,I and k=1,…,K, the indicators ξi,k are also sampled from a multinomial distribution with probabilities of the form

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M82">View MathML</a>

The updated weights wl,k are sampled in a manner similar to the Πk, i.e., the ul,k are generated from a <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M83">View MathML</a> for l=1,…,L−1 and uL,k=1, where nl,k is the number of βi,js assigned to atom l of cluster k, and then by constructing <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M84','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M84">View MathML</a>.
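The two-part weight updates described above have the shape of the usual truncated stick-breaking step in blocked Gibbs samplers. The sketch below implements that standard update, in which each stick proportion is drawn from a Beta distribution with parameters 1 + (count in the cluster) and concentration + (count in all later clusters), and the last proportion is fixed at 1. Whether this matches the exact expressions in the original equations is an assumption; the function name and the example counts are hypothetical.

```python
import numpy as np

def sample_stick_weights(counts, concentration, rng):
    """Standard truncated stick-breaking update (assumed form).

    counts        : length-K array, number of items currently in each cluster
    concentration : DP concentration parameter (rho or gamma_k in the text)
    """
    counts = np.asarray(counts, dtype=float)
    K = counts.size
    # tail[k] = number of items in clusters after k
    tail = np.concatenate((np.cumsum(counts[::-1])[::-1][1:], [0.0]))
    v = np.ones(K)                                   # last proportion fixed at 1
    v[:-1] = rng.beta(1.0 + counts[:-1], concentration + tail[:-1])
    # weight_k = v_k * prod_{s<k} (1 - v_s)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

rng = np.random.default_rng(1)
weights = sample_stick_weights([5, 3, 0, 2], concentration=1.0, rng=rng)
```

Fixing the final proportion at 1 guarantees that the truncated weights sum to one, which is what allows the indicators to be sampled from a finite multinomial distribution.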

Following [18], the DP concentration parameters ρ and γk are sampled in two steps by introducing auxiliary variables η1 and η2. First, sample η1 from

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M85','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M85">View MathML</a>

and then ρ from

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M86','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M86">View MathML</a>

where <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M87','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M87">View MathML</a> is the number of unique column indicators ζj. Similarly, for each k=1,…,K,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M88','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M88">View MathML</a>

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M89','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M89">View MathML</a>

where <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M90','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M90">View MathML</a> is the number of unique row indicators ξi,k, for a specific cluster of columns k.
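A widely used two-step scheme of this kind is the auxiliary-variable update of Escobar and West for a DP concentration parameter with a gamma prior. The sketch below implements that scheme; that it coincides with the expressions in the original equations is an assumption, as are the function name and the illustrative counts (32 items in 4 clusters). The Ga(1,1) prior from the text is used, for which the rate and scale readings agree.

```python
import numpy as np

def update_concentration(alpha, n_items, n_clusters, rng, a=1.0, b=1.0):
    """One auxiliary-variable update of a DP concentration parameter.

    Prior: alpha ~ Gamma(shape=a, rate=b); a = b = 1 mirrors the Ga(1,1)
    hyperprior quoted in the text.
    n_items    : number of items being clustered
    n_clusters : number of distinct (occupied) clusters
    """
    eta = rng.beta(alpha + 1.0, n_items)             # auxiliary variable
    rate = b - np.log(eta)
    odds = (a + n_clusters - 1.0) / (n_items * rate)
    # Mixture of two gamma distributions with mixing weight odds / (1 + odds).
    if rng.random() < odds / (1.0 + odds):
        shape = a + n_clusters
    else:
        shape = a + n_clusters - 1.0
    return rng.gamma(shape, 1.0 / rate)              # NumPy uses scale = 1 / rate

rng = np.random.default_rng(2)
rho = 1.0
rho_draws = []
for _ in range(50):
    rho = update_concentration(rho, n_items=32, n_clusters=4, rng=rng)
    rho_draws.append(rho)
```

The same function would be called once for ρ with the column counts and once per column cluster k for γk with the corresponding row counts.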

To sample the unique <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M91','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M91">View MathML</a>s given in (4), we introduce a set of indicator variables ψl,k, which take the value 1 when ϕl,k is different from zero. For l=1,…,L and k=1,…,K, ψl,k, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M92','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M92">View MathML</a> and ϕl,k are jointly sampled as follows: ψl,k is sampled from its full conditional after integrating out ϕl,k and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M93','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M93">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M94','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M94">View MathML</a> is sampled conditional on ψl,k, and ϕl,k is sampled conditional on both the corresponding ψl,k and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M95','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M95">View MathML</a>, i.e.,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M96','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M96">View MathML</a>

with the individual expressions obtained as follows.

First, let <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M97','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M97">View MathML</a>. Then,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M98','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M98">View MathML</a>

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M99','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M99">View MathML</a>

where <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M100','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M100">View MathML</a> and the update terms are given by <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M101','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M101">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M102','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M102">View MathML</a>.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M103','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M103">View MathML</a>

where <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M104','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M104">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M105','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M105">View MathML</a>.
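The joint draw of an indicator, a variance and a location described above has the structure of a conjugate spike-and-slab update. The sketch below illustrates that structure under assumptions that are not taken from the paper: the values y assigned to an atom are modelled as normal with mean ϕ and variance σ², the variance has an inverse-gamma(a, b) prior, the nonzero ϕ has a N(0, σ²/κ) prior, and λ is taken to be the prior probability that ψ = 1. The names `sample_atom`, `kappa`, `a` and `b` are hypothetical, and the paper's own expressions may differ.

```python
import numpy as np

def sample_atom(y, lam, kappa, a, b, rng):
    """Jointly draw (psi, sigma2, phi) for one atom from the values y assigned to it."""
    y = np.asarray(y, dtype=float)
    n = y.size
    s0 = np.sum(y ** 2)                     # sum of squares when phi = 0
    s1 = s0 - np.sum(y) ** 2 / (n + kappa)  # same, with a nonzero phi integrated out
    # Log ratio of marginal likelihoods (psi = 1 vs psi = 0),
    # with both phi and sigma2 integrated out
    log_ratio = (0.5 * np.log(kappa / (kappa + n))
                 + (a + 0.5 * n) * (np.log(b + 0.5 * s0) - np.log(b + 0.5 * s1)))
    log_odds = np.log(lam) - np.log1p(-lam) + log_ratio
    p_one = 1.0 / (1.0 + np.exp(-log_odds))
    psi = int(rng.random() < p_one)
    # sigma2 | psi, y ~ Inverse-Gamma(a + n/2, b + S_psi/2)
    s = s1 if psi else s0
    sigma2 = 1.0 / rng.gamma(a + 0.5 * n, 1.0 / (b + 0.5 * s))
    # phi | psi, sigma2, y: exactly zero under the spike, normal under the slab
    if psi:
        phi = rng.normal(np.sum(y) / (n + kappa), np.sqrt(sigma2 / (n + kappa)))
    else:
        phi = 0.0
    return psi, sigma2, phi

rng = np.random.default_rng(0)
psi, sigma2, phi = sample_atom(np.full(50, 3.0), lam=0.5, kappa=1.0, a=2.0, b=1.0, rng=rng)
```

Sampling the indicator from its marginal, rather than conditionally on ϕ, lets the chain move between the zero and nonzero states without the current value of ϕ blocking the switch.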

The full conditional of λ is given by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M106','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M106">View MathML</a>

Finally, for k=1,…,K, the full conditional of αk is given by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M107','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M107">View MathML</a>

where

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M108','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M108">View MathML</a>

and

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/278/mathml/M109','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/278/mathml/M109">View MathML</a>.

Software availability

The R code implementing the models in the paper is freely available at http://www.ams.ucsc.edu/~raquel/software/.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

SD, AR and RP formulated the model. SD performed the analyses and drafted the manuscript. AR and RP revised the manuscript draft. All authors read and approved the final version of the manuscript.

Acknowledgements

RP and SD were supported by the NIH/NIGMS grant R01GM072003-02. AR was supported by the NIH/NIGMS grant R01GM090201-01.

References

  1. Pakula AA, Sauer RT: Genetic analysis of protein stability and function. Annu Rev Genet 1989, 23:289-310.

  2. Zuckerkandl E, Pauling L: Evolutionary divergence and convergence in proteins. In Evolving Genes and Proteins. New York: Academic Press; 1965:97-166.

  3. Sneath PHA: Relations between chemical structure and biology. J Theor Biol 1966, 12:157-195.

  4. Miyata T, Miyazawa S, Yasunaga T: Two types of amino acid substitutions in protein evolution. J Mol Evol 1979, 12(3):219-236.

  5. Xia X, Li WH: What amino acid properties affect protein evolution? J Mol Evol 1998, 47:557-564.

  6. McClellan DA, McCracken KG: Estimating the influence of selection on the variable amino acid sites of the cytochrome b protein functional domains. Mol Biol Evol 2001, 18:917-925.

  7. McClellan D, Palfreyman E, Smith M, Moss J, Christensen R, Sailsbery J: Physicochemical evolution and molecular adaptation of the cetacean and artiodactyl cytochrome b proteins. Mol Biol Evol 2005, 22:437-455.

  8. Sainudiin R, Wong WSW, Yogeeswaran K, Nasrallah JB, Yang Z, Nielsen R: Detecting site-specific physicochemical selective pressures: applications to the class I HLA of the human major histocompatibility complex and the SRK of the plant sporophytic self-incompatibility system. J Mol Evol 2005, 60:315-326.

  9. Wong WSW, Sainudiin R, Nielsen R: Identification of physicochemical selective pressure on protein encoding nucleotide sequences. BMC Bioinf 2006, 7:148-157.

  10. Datta S, Prado R, Rodriguez A, Escalante AA: Characterizing molecular evolution: a hierarchical approach to assess selective influence of amino acid properties. Bioinformatics 2010, 26:2818-2825.

  11. Datta S, Prado R, Rodriguez A: Bayesian factor models in characterizing molecular adaptation. Tech. rep., University of California, Santa Cruz; 2012.

  12. Ferguson T: A Bayesian analysis of some nonparametric problems. Ann Stat 1973, 1:209-230.

  13. Sethuraman J: A constructive definition of Dirichlet priors. Statistica Sinica 1994, 4:639-650.

  14. Shafto P, Kemp C, Mansinghka V, Gordon M, Tenenbaum JB: Learning cross-cutting systems of categories. In Proceedings of the 28th Annual Conference of the Cognitive Science Society. Erlbaum; 2006:2146-2151.

  15. Rodriguez A, Ghosh K: Nested partition models. Tech. rep., University of California, Santa Cruz; 2009.

  16. Lo AY: On a class of Bayesian nonparametric estimates: I. density estimates. Ann Stat 1984, 12:351-357.

  17. Escobar MD: Estimating normal means with a Dirichlet process prior. J Am Stat Assoc 1994, 89:268-277.

  18. Escobar MD, West M: Bayesian density estimation and inference using mixtures. J Am Stat Assoc 1995, 90:577-588.

  19. Blackwell D, Macqueen JB: Ferguson distribution via Pólya urn schemes. Ann Stat 1973, 1:353-355.

  20. Yang Z: Phylogenetic analysis using parsimony and likelihood methods. J Mol Evol 1997, 42:294-307.

  21. Nielsen R, Yang Z: Likelihood models for detecting positively selected amino acid sites and applications to the HIV–1 envelope gene. Genetics 1998, 148:929-936.

  22. Kemp C, Tenenbaum JB, Griffiths TL, Yamada T, Ueda N: Learning systems of concepts with an infinite relational model. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1. AAAI Press; 2006:381-388.

  23. Xu Z, Tresp V, Yu K, Kriegel HP: Infinite hidden relational models. In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence. AUAI Press; 2006:544-551.

  24. Dunson DB, Xue Y, Carin L: The matrix stick-breaking process: flexible Bayes meta-analysis. J Am Stat Assoc 2008, 103:317-327.

  25. Woolley S, Johnson J, Smith MJ, Crandall KA, McClellan DA: TreeSAAP: Selection on Amino Acid Properties using phylogenetic trees. Bioinformatics 2003, 19:671-672.

  26. MacEachern SN: Estimating normal means with a conjugate style Dirichlet process prior. Communications Stat, Part B - Simul Comput 1994, 23:727-741.

  27. MacEachern SN, Muller P: Estimating mixture of Dirichlet process models. J Comput Graphical Stat 1998, 7:223-238.

  28. Ishwaran H, James LF: Gibbs sampling methods for stick-breaking priors. J Am Stat Assoc 2001, 96:161-173.

  29. Ishwaran H, Zarepour M: Dirichlet process sieves in finite normal mixtures. Statistica Sinica 2002, 12:941-963.

  30. Green PJ, Richardson S: Modelling heterogeneity with and without the Dirichlet process. Scand J Stat 2001, 28:355-375.

  31. Jain S, Neal RM: A split-merge Markov Chain Monte Carlo procedure for the Dirichlet process mixture model. J Comput Graphical Stat 2004, 13:158-182.

  32. Blei DM, Jordan MI: Variational inference for Dirichlet process mixtures. Bayesian Anal 2006, 1:121-144.

  33. Walker SG: Sampling the Dirichlet mixture model with slices. Commun Stat - Simul Comput 2007, 36:45.

  34. Rodriguez A, Dunson DB, Gelfand AE: The nested Dirichlet process. J Am Stat Assoc 2008, 103:534-546.

  35. Yang Z, Swanson W, Vacquier V: Maximum-likelihood analysis of molecular adaptation in abalone sperm lysin reveals variable selective pressures among lineage and sites. Mol Biol Evol 2000, 17:1446-1455.

  36. Gromiha MM, Oobatake M, Sarai A: Important amino acid properties for enhanced thermostability from mesophilic to thermophilic proteins. Biophys Chem 1999, 82:51-67.

  37. Minin VN, Suchard MA: Counting labeled transitions in continuous-time Markov models of evolution. J Math Biol 2008, 56:391-412.