Abstract
Background
An important question in the analysis of biochemical data is that of identifying subsets of molecular variables that may jointly influence a biological response. Statistical variable selection methods have been widely used for this purpose. In many settings, it may be important to incorporate ancillary biological information concerning the variables of interest. Pathway and network maps are one example of a source of such information. However, although ancillary information is increasingly available, it is not always clear how it should be used nor how it should be weighted in relation to primary data.
Results
We put forward an approach in which biological knowledge is incorporated using informative prior distributions over variable subsets, with prior information selected and weighted in an automated, objective manner using an empirical Bayes formulation. We employ continuous, linear models with interaction terms and exploit biochemicallymotivated sparsity constraints to permit exact inference. We show an example of priors for pathway and networkbased information and illustrate our proposed method on both synthetic response data and by an application to cancer drug response data. Comparisons are also made to alternative Bayesian and frequentist penalisedlikelihood methods for incorporating networkbased information.
Conclusions
The empirical Bayes method proposed here can aid prior elicitation for Bayesian variable selection studies and help to guard against misspecification of priors. Empirical Bayes, together with the proposed pathwaybased priors, results in an approach with a competitive variable selection performance. In addition, the overall procedure is fast, deterministic, and has very few userset parameters, yet is capable of capturing interplay between molecular players. The approach presented is general and readily applicable in any setting with multiple sources of biological prior knowledge.
Background
Ongoing advancements and cost reductions in biochemical technology are enabling acquisition of ever richer datasets. In many settings, in both basic biology and medical studies, it may be important to model the relationship between assayed molecular entities, such as genes, proteins or metabolites, and a biological response of interest.
Molecular players may act in concert to influence biological response: this has motivated a need for multivariate methods capable of modelling such joint activity. When sample sizes are smalltomoderate, as is often the case in molecular studies, robust modelling of joint influences becomes especially challenging. However, often it is likely that only a small number of players are critical in influencing the response of interest. Then, the challenge is to identify appropriate variable subsets.
Statistical variable selection methods have been widely used in the bioinformatics domain to discover subsets of influential molecular predictors. Both penalised likelihood and Bayesian approaches have been used in a diverse range of applications [16].
Bayesian approaches can facilitate the integration of ancillary information regarding variables under study through prior probability distributions. Ongoing development of online tools and databases have meant that such information is widely available, and depending on context, may include networks and pathway maps, public gene expression datasets, molecular interaction databases, ontologies and so on. However, while the idea of incorporating such information into variable selection has a clear appeal, it is not always obvious what information should be included nor how it should be weighted. Indeed, many existing Bayesian variable selection approaches do not attempt integrative analyses exploiting such information and instead employ standard priors that do not specify preferences for particular variables, but may, for example, encode a preference for sparse models [4,7]. Several Bayesian variable selection studies have put forward simple approaches for incorporating prior knowledge by independently assigning each variable a prior probability of being included in the model [1,6,8,9]. However, subjectively setting such hyperparameters for each variable may be difficult. Furthermore, prior independence may be a questionable assumption, since molecular variables are unlikely to influence a response independently of one another.
We develop a variable selection procedure in which an empirical Bayes approach is used to objectively select between a choice of informative priors incorporating ancillary information (‘biologically informative priors’) and also to objectively weight the contribution of the prior to the overall analysis. The work presented here is motivated by questions concerning the relationship between signalling proteins and drug response in human cancers. In the protein signalling setting (as also in gene regulation) there is now much information available, both in the literature and in diverse online resources, concerning relevant pathways and networks. We therefore develop pathway and networkbased informative priors for this setting, applying the methods proposed to automatically select and weight the prior and thence carry out variable selection.
The relationship between response and predictors is modelled using a continuous, linear model with interaction terms. In this way we avoid data discretization (which can lose information) yet retain the ability to capture combinatorial interplay. We take advantage of biochemicallymotivated sparsity constraints to permit exact inference, thereby avoiding the need for approximate approaches such as Markov chain Monte Carlo (MCMC). This enables the calculation of exact probability scores over which variables are likely to be influential. The overall procedure is computationally fast: empirical Bayes analysis and subsequent calculation of posterior (inclusion) probabilities for 52 predictors via full model averaging required only 10 minutes (in MATLAB R2010a on a standard singlecore personal computer; code freely available, together with simulation scripts, at http://go.warwick.ac.uk/stevenhill/IBKVS webcite). Moreover, the overall procedure we put forward is simple from the user perspective, requiring very few userset parameters or MCMC convergence diagnostics.
The remainder of the paper is organised as follows. We begin below by defining notation and reviewing Bayesian variable selection. We then describe methods, including empirical Bayes analysis to objectively select and weight biologically informative prior information, pathwaybased informative priors and exact inference. We illustrate our method on published single cell proteomic data [10] and on proteomic data and drug response from ongoing work in breast cancer. We also compare the proposed approach to alternative methods. We conclude with a discussion of our results, the merits and shortcomings of our work, and highlight directions for further work.
Notation
Let be a vector of response values and for be corresponding pdimensional candidate predictors. X_{i } forms row i of the n × p predictor matrix X.
Let be a binary vector and be the number of nonzeros in γ. Then X_{γ} is the matrix obtained by removing from X those columns j for which γ_{j }= 0. Similarly, for a vector , a_{γ} is obtained from a by removing components a_{j }for which γ_{j }= 0.
Bayesian variable selection
Bayesian linear model
Consider the classical linear model Y = Xβ + ε, where are regression coefficients and , where denotes a Normal distrbution. In some settings it makes sense to assume that some of the regression coefficients can be set to zero, thereby removing the corresponding predictors from the model. Variable selection addresses the question of which subset of predictors best models the response. An inclusion indicator vector specifies which regression coefficients vanish. That is, predictor j is included in the model if and only if γ_{j }= 1. We use γ to denote both the inclusion indicator vector and the model it specifies. Given model γ we have the reduced linear model
We are interested in the posterior distribution over models P(γ  Y,X). From Bayes’ rule we have
where p(Y  γ,X_{γ}) is the marginal likelihood. P(γ) is a prior over models (the ‘model prior’), and is the main focus of the present paper. The marginal likelihood is obtained by integrating out the regression parameters β_{γ} and variance parameter σ^{2} and thereby automatically penalises complex models with many parameters. This penalisation occurs because a more complex model has a larger parameter space. This means that for more complex models, the integral that defines the marginal likelihood is over a greater number of dimensions, with prior mass spread over a larger space. This in turn results in a lower marginal likelihood score.
Model selection and model averaging
The posterior distribution over models P(γ  YX) can be used to find a single, maximum a posteriori (MAP) model, . However, considering a single ‘best’ model may be misleading, especially in the smallsample setting where the posterior distribution is likely to be diffuse with several similarly highscoring models. Model averaging [11,12] can ameliorate such effects by averaging over the entire space of models to calculate posterior inclusion probabilities for each individual predictor,
These inclusion probabilities are a measure of the importance of each individual predictor in determining the response.
Evaluating the summation in Equation 3 requires enumerating the entire posterior over models P(γ  YX). The model space Γ can be vast ( ) even for moderate values of p . Thus, Markov chain Monte Carlo [13] is often used to sample from the posterior over models thereby providing asymptotically valid estimates of the inclusion probabilities [4]. As outlined in Methods below, we instead calculate exact inclusion probabilities; an approach rendered computationally viable through restricting the size of model space Γ. Justifications for such a restriction and advantages and disadvantages of an exact approach are provided in the Discussion below.
Model prior
Calculating the posterior distribution over models (2) requires specifying a prior over Γ, P(γ). A common choice of prior assumes that the a priori inclusion probabilities are independent and have Bernoulli distributed marginal distributions P(γ_{j}) with success parameter Π. This hyperparameter may be a userdefined constant or may itself have a Beta prior [9,14]. In the former case, small values are often chosen to promote parsimonious models [1].
These priors provide no information regarding specific predictors and do not utilise domain knowledge. Employing predictor dependent hyperparameters Π_{j }enables incorporation of prior knowledge that some predictors are more important than others. However, utilising such a prior may be difficult in practice due to the many hyperparameters that must be subjectively specified. We note also in this formulation, prior inclusion probabilities are still independent.
Methods
We now describe the Bayesian variable selection method used in the present work. We describe in turn, an extended linear model including interactions between predictors, exact computation of posterior inclusion probabilities, biologically informative model priors and empirical Bayes learning of associated hyperparameters.
Bayesian linear model with interaction terms
We extend the classical linear model in Equation 1 above to enable combinatorial relationships between predictors and response to be captured. Given model γ, response Y_{i} depends in a nonlinear fashion on the included predictors X_{iγ }whilst remaining linear in the regression parameters. In particular, the mean for Y_{i} is a linear combination of included predictors and all possible products of included predictors. For example, if with γ_{3} = γ_{5} = 1, we have Y_{i }= X_{iγ}β_{γ} + αX_{i3}X_{i5} + ε_{i}. We extend the predictor matrix X_{γ} and regression coefficient vector β_{γ }to include the interaction terms and coefficients respectively, and we denote the extended versions by and . All columns in are standardised to have zero mean and unit variance.
The likelihood now takes the form
We choose hierarchical parameter priors following Smith and Kohn [15] and Nott and Green [14], taking the prior for given γ and σ^{2} to be Normal
and the prior for σ^{2} to be . Integrating out the parameters results in the following closed form marginal likelihood,
We note that, in contrast to the widelyused normal inversegamma prior [8,16], this formulation has no free hyperparameters and enjoys attractive invariance properties under rescaling [17].
Exact posterior inclusion probabilities
We enforce a restriction on the number of predictors that are allowed to be included in the model. That is, we only allow γ with for some . Thus, instead of being exponential in p, the model space Γ has polynomial size of order , thereby allowing explicit calculation of posterior inclusion probabilities via Equation 3. We take d_{max }= 4, giving for the p = 52 predictors in the cancer drug response application below; the original size of Γ was of order 10^{15}.
Biologically informative model priors
We now turn our attention to the model prior P(γ). In many molecular biology settings, there is much valuable information available which may be used to construct biologically informative model priors. This could be network and pathway structures, providing information on relationships between predictors, or information from publicly available datasets. However, it may not be obvious precisely how such information should be used and it is usually possible to encode several different, apparently plausible priors. We are therefore interested in investigating the question of how to choose between such priors.
Suppose we have M priors to choose from, with each prior, indexed by , encoded by a function which scores a proposed model γ according to the prior information. Following Mukherjee et al.[4], we take the overall prior to be of the following form,
where m is a hyperparameter (the ‘source parameter’) that selects amongst priors and λ is a hyperparameter controlling the overall strength of the prior.
We consider two simple pathwaybased priors, capturing information regarding number of pathways and intrapathway distances via functions f_{1} and f_{2} respectively. Below we proceed to give details for each, making use of the following notation. We let denote the set of proteins contained in pathway k, , and we let be the set of proteins that are both in model γ and in pathway k. We note that a protein is both allowed to be a member of more than one pathway or to not be a member of any pathways. If there is no prior information available, the pathwaybased priors reduce to a flat prior over model space. Figure 1 illustrates properties of the two prior components.
Figure 1. Properties of pathwaybased priors. Priors are encoded by functions f_{1}(γ) (number of pathways) and f_{2}(γ) (intrapathway distance). Shaded components are contained in model γ and shapes represent different pathways. Top row: Comparisons of the scoring functions. Top left  γ_{1} has larger intrapathway distance than γ_{2}; Top middle  distance is agnostic to number of pathways; Top right  addition of a singleton has no effect on distance. Bottom row: The root component in each network is in both pathways. However, f_{1}(γ) is defined so as to avoid double counting.
Number of pathways (f_{1})
The first pathwaybased feature encodes the notion that predictors that are influential in determining response may belong to a small number of pathways or, in contrast, may be spread across many pathways. We encode such beliefs by a function f_{1}(γ) which counts the number of pathways represented in a model γ. Specifically, where K_{γ} is the pathway count given by for satisfying
This definition prevents the empty model being a priori most probable and avoids double counting (proteins that are members of multiple pathways are considered to be a member of only one pathway for the purpose of calculating f_{1}(γ), and this single pathway is selected to minimise the pathway count; see Figure 1). If the strength parameter λ is negative, the prior increasingly penalises models as number of pathways increases, whereas a positive value results in a prior that prefers models representing many pathways.
Intrapathway distance (f_{2})
The second feature we consider is that variables which jointly influence the response may either be close to each other in a network sense, or may in fact be far apart in the network. This is done by a function f_{2}(γ) which gives the average distance between pairs of proteins that are both in γ and in the same pathway. Specifically, the distance between two proteins j_{1} and j_{2}, denoted d(j_{1},j_{2}), is the number of edges in the shortest (undirected) path between them. Then, we define where D_{γ} is the average of all d(j_{1},j_{2}) with for some k. In order for the distance to be defined for any two proteins in a pathway, we assume that the network topology for a pathway consists of a single connected component (in the undirected sense). We term a protein included in γ as a singleton if there are no other included proteins in the same pathway (i.e. protein j is a singleton if for some k). For models that only contain singletons or the empty model we set D_{γ }= 0. The function f_{2}(γ) defined in this way satisfies a number of natural desiderata. It is agnostic to γ and to the pathway count K_{γ}(see Figure 1). Also, it avoids double counting (the distance between each protein pair contributes to f_{2}(γ) once only) and is indifferent between models including only singletons and models with the smallest possible average distance of D_{γ }= 1. A negative strength parameter λ results in a prior that penalises larger intrapathway distances, whilst a positive value encourages larger distances.
Empirical Bayes
We set the prior source parameter m and strength parameter λ in an objective manner using empirical Bayes [18]. Specifically, we maximise the following marginal likelihood,
For a given choice of hyperparameters, the marginal likelihood can be calculated exactly by exploiting the model space restriction described above. The score is calculated for varying hyperparameters and those resulting in the largest score are used for variable selection.
Prediction
Given already observed data X,Y, we can predict the expected value of new response Y^{′ } from new predictor data X^{′ } by model averaging:
with
and the model posterior P(γ  Y,X) calculated via Equations 2, 6 and 7.
Results
We first show an application of our proposed approach to synthetic response data generated from a published study of cell signalling, and then further illustrate the approach with an analysis of proteomic data and drug response from breast cancers.
Synthetic response data
In ongoing studies, such as that presented below, truly objective performance comparisons may be challenging, since we usually do not know which molecules are truly influential in driving biological response. At the same time, in fully synthetic data it can be difficult to mimic realistic correlations between variables within a pathway or across a network. For this reason, we empirically assessed the methods proposed using published singlecell, phosphoproteomic data [10] with responses generated from that data. This preserved pathwayrelated correlation structure between predictors but permitted objective assessment. The dataset consists of 11 proteins and n_{tot }= 853 samples. (The complete dataset from [10] contains data obtained under nine different conditions, corresponding to different interventions. Here, we use the baseline dataset which contains 853 samples).
Figure 2 shows a network and pathway structure for the 11 proteins for use with the biologically informative priors of the type described above. The network structure was taken from Sachs et al.[10] and reflects current knowledge of signalling interactions. The proteins were assigned into four pathways.
Figure 2. Protein network and pathway structure for biologically informative priors in the synthetic response data study. Responses were generated from published phosphoproteomic data [10] consisting of 11 proteins and 853 samples (baseline data only). Network structure shown here is based on that given in Sachs et al.[10] and reflects current knowledge of signalling interactions. Proteins were divided into four pathways, denoted by node colours red, blue, green and yellow. The grey nodes are each members of all four pathways.
Figure 3. Synthetic response data, average ROC curves. Number of true positives plotted against number of false positives for Simulations 1, 2 and 3. Proteomic data from Sachs et al.[10] were used to create response data with true underlying model known to favour a particular prior: Simulaton 1  distance prior with positive λ; Simulations 2 and 3  either distance prior or number of pathways prior with negative λ. Legend  ‘BVS’: Bayesian variable selection; ‘+int’: linear model with interaction terms; ‘int’: linear model without interaction terms; ‘EB’: empirical Bayes used to select and weight pathwaybased priors automatically; ‘flat’: flat prior; ‘incorrect’: wrong prior with respect to true, underlying protein set (see main text for details); ‘MRF prior’: Markov random field prior [19]; ‘Lasso’: Lasso linear regression (curve produced by thresholding absolute regression coefficients, whilst marker ‘X’ is single model obtained by taking only predictors with nonzero coefficients); ‘Li&Li’: penalisedlikelihood approach proposed by Li and Li [21] that also incorporates network information (see main text for details); ‘corr’: absolute Pearson correlations between each protein and response. Area under the (average) ROC curve ("AUC") appears in parentheses.
We first considered two simulation models, and , each of which is a predictor subset consisting of three proteins; PIP3, ERK1/2, p38 for Simulation 1, and RAF, MEK1/2, PKA for Simulation 2. In each case, the three proteins were chosen to be favoured by a particular prior. is favoured by the intrapathway distance prior (f_{2}) with positive λ; the proteins included in had a large average intrapathway distance and incorporated a medium number of pathways . is favoured by either the number of pathways prior (f_{1}) with negative λ or the intrapathway distance prior (f_{2}) with negative λ; the proteins included in had both a small intrapathway distance and incorporated a small number of pathways . Since, by construction, each model is favoured by a particular prior, we can test the ability of the empirical Bayes approach to select appropriate hyperparameter values. Response data Y were generated using a linear model with interaction terms (4); Y=A + BC + ε, where A,B,C are the three influential variables.
We are especially interested in the smallsample regime that is often of interest in molecular studies. We therefore subsampled (without replacement) n=35 training data from the complete dataset (this matched the sample size of the drug response study reported below), and assessed predictive ability on the remaining, heldout data ( ).
Subsampling was repeated to give 5,000 training/test pairs, over which results are reported below. At each iteration, only smallsample training data was used for inference. The empirical Bayes method was employed to set prior source and strength parameters (using training data only), with (this specification permits a flat prior if empirical Bayes analysis supports neither prior). Posterior inclusion probabilities were then calculated as described above.
We assessed performance by comparing the true underlying model γ^{*} to the model γ_{τ }obtained by thresholding posterior inclusion probabilities at level τ. For results from each smallsample dataset, a receiver operating characteristic (ROC) curve was constructed by plotting number of true positives against number of false positives for varying thresholds τ. Figure 3a,b shows average ROC curves over the 5,000 iterations for Simulation 1 and Simulation 2, together with the area under the ROC curve (AUC). AUC is a summary of the curve and provides a measure of variable selection accuracy, with higher scores indicating better performance. The score is normalised to take a value between 0 and 1. Our Bayesian variable selection (BVS) method with empirical Bayes and linear model with interaction terms (‘BVS: EB +int’) is compared with eight other approaches:
(i) BVS with flat prior and linear model with interaction terms (‘BVS: flat +int’);
(ii) BVS with a prior that is incorrect with respect to the true, underlying model: intrapathway distance prior (f_{2}) favouring small distances (λ = − 5) for Simulation 1 and large distances (λ = 5) for Simulation 2 (‘BVS: incorrect +int’);
(iii) BVS with flat prior and linear model with no interaction terms (‘BVS: flat int’);
(iv) BVS with a Markov random field prior [19] and linear model with interaction terms (see below for further details; ‘BVS: MRF prior +int’);
(v) penalisedlikelihood Lasso regression [20] using a linear model with no interaction terms (see below for further details; ‘Lasso int’);
(vi) penalisedlikelihood Lasso regression [20] using a linear model with pairwise interaction terms (‘Lasso +int’);
(vii) a penalisedlikelihood approach, proposed by Li and Li [21], based on the Lasso and also incorporates network structure information (‘Li&Li’); and
(viii) absolute correlation coefficients between each predictor and response (‘corr’).
Markov random field priors have previously been used in Bayesian variable selection to take network structure of predictors into account [19,22]. A Markov random field is an undirected graphical model G = (VE) in which vertices V represent variables (here, the predictors) and edges E represent probabilistic relationships between them. Let A = (a_{i,j}) be a binary symmetric matrix with a_{i,j }= 1 if and only if edge (ij)∈E. Then, the Markov random field prior is given by
The strength parameter λ is usually constrained to be nonnegative, resulting in a prior that encourages selection of predictors whose neighbours in G are also included in the model. Here, we do not enforce this constraint and also allow negative values for λ. Negative values result in a prior that penalises models containing predictors that are neighbours in G. As with the proposed Bayesian variable selection method, we use a linear model with interaction terms and set λ with empirical Bayes. The graph structure G is obtained from the structure shown in Figure 2 by converting all directed edges to undirected edges.
Lasso regression performs variable selection by placing an ℓ_{1} penalty on the regression coefficients. This has the effect of shrinking a subset of regression coefficients to exactly zero; the predictors with nonzero coefficients are taken as the inferred model. Sparsity of the inferred model is controlled by a tuning parameter, which we set by 5fold crossvalidation. This method results in a single inferred model (i.e. point estimate). However, a full ROC curve can still be obtained by thresholding absolute regression coefficients.
The penalisedlikelihood method proposed by Li and Li [21] combines a Lasso penalty with an additional penalty term that incorporates predictor network structure. Together, these penalties encourage sparse estimates for regression coefficients that are also ‘smooth’ over the network structure; that is, coefficients for neighbours in the network are encouraged to be similar. In terms of variable selection, the approach promotes models containing predictors that are neighbours in the graph. The two penalty terms each have their own tuning parameter, controlling sparsity and smoothness over the network respectively; we set these parameters by 5fold crossvalidation. (This approach, and the standard Lasso regression approach, were implemented using Matlab package glmnet[23].)
We observe that, in both simulations, the automated empirical Bayes analysis, with pathwaybased priors, improves performance over the flat prior and provides substantial gains over an incorrect prior. The empirical Bayes approach selected the correct prior in 85% of iterations for Simulation 1 and 96% of iterations for Simulation 2 (for Simulation 1 correct prior parameters were m = 2 with λ > 0, median value of λ selected was λ = 3.5; for Simulation 2 correct prior parameters were m = 1 or m = 2 with λ < 0, median value of λ selected was λ = −5 for both m = 1 and m = 2). Since the Lasso regression method (with interaction terms) does not incorporate prior information, it is unsurprising that it is also outperformed by the empirical Bayes approach. Hence, it is fairer to compare it to Bayesian variable selection with a flat prior (and interaction terms). In Simulation 1 these regimes both show a similar performance with the Lasso approach displaying some gains at small numbers of false positives. However, in Simulation 2 the Bayesian approach offers a clear improvement in performance over Lasso regression (AUC scores of 0.91 and 0.83 respectively). Due to its inability to model combinatorial interplay, the linear model without interaction terms is outperformed by the linear model with interaction terms for both Bayesian and Lasso approaches.
In Simulation 1, the strength parameter for the Markov random field prior was set to λ = −5 by empirical Bayes in 86% of iterations, thereby correctly promoting models that do not contain predictors that are neighbours in the network. However, although the Markov random field prior offers improvements over a flat prior, it is outperformed by the proposed pathwaybased priors at small numbers of false negatives. This is due to the fact that our intrapathway distance prior is able to promote models with large distances between predictors, whilst the Markov random field prior can only penalise models that contain neighbours. The Markov random field prior is in general less flexible because it considers neighbours rather than distances. Intriguingly, in Simulation 2, λ > 0 (which correctly promotes models containing neighbours in the network) was only selected by empirical Bayes in 41% of iterations. As a result, performance of the Markov random field prior is inferior to a flat prior. We discuss this further in Discussion below.
The penalisedlikelihood approach proposed in [21], incorporating network information, performs poorly in both simulations, with similar performance compared to simply looking at correlations between predictors. Whilst in Simulation 2, a clear improvement is observed over standard Lasso regression (without interaction terms), this is not the case in Simulation 1. This is because the approach promotes models containing predictors that are neighbours in the network, which reflects the true underlying model for Simulation 2 only. The general poor performance of this approach is likely due to its inability to capture combinatorial interplay since it necessarily employs a linear model without interaction terms.
Since the networkbased penalisedlikelihood approach [21] does not incorporate interaction terms we performed a third simulation to investigate its performance under a datagenerating model without interaction terms. In particular, we used the same true underlying predictor subset as in Simulation 2 (i.e. ), which contains predictors that are neighbours in the network, but generated data using a linear model without interaction terms; Y = A + 2B + 3C + ε, where ABC are the three influential variables. We note that each predictor in the datagenerating model has a different magnitude of influence on the response (i.e. different regression coefficients). Average ROC curves are shown in Figure 3c. Comparisons are made to other approaches as described above, but all methods now use linear models without interaction terms. As in Simulations 1 and 2 the Bayesian variable selection approach with empirical Bayes and pathwaybased priors outperforms a flat prior and an incorrect prior, with empirical Bayes selecting the correct prior in 99% of iterations (correct and incorrect priors are the same as for Simulation 2). The Bayesian approach with Markov random field prior showed a similar performance to the proposed pathwaybased priors (a correct value of λ > 0 was selected in 90% of iterations). However, the approach of Li and Li [21], whilst now more competitive compared with Simulation 2, is still outperformed by the empirical Bayes approach with pathwaybased priors. Moreover, it does not display a clear improvement over Lasso regression.
Table 1. Synthetic response data, predictive errors from heldout test data
The failure of the incorrect prior illustrates the importance of prior elicitation. Moreover, our results demonstrate that the proposed empirical Bayes approach can select a suitable prior automatically, even under very small sample conditions (here n = 35). If the data is not in agreement with a proposed prior, then it is desirable that λ = 0 is selected by empirical Bayes, resulting in a flat prior. To test this, we used the model in Simulation 2 with a prior that favoured models with predictors from many pathways (i.e. number of pathways prior with λ restricted to be nonnegative). This prior does not reflect the true, underlying model, which contains a small number of pathways. Empirical Bayes analysis successfully selected λ = 0 in 95% of iterations.
For each dataset, we used the posterior predictive distribution (Equation 10; calculated via exact model averaging) to predict responses for heldout test data. Mean absolute predictive errors, obtained by averaging over all 5,000 train/test iterations, are shown in Table 1 (‘MA’). The empirical Bayes approach with pathwaybased priors shows improvements in predictive accuracy over a flat prior, and substantial improvements over both the ‘incorrect’ prior and a baseline linear model without interaction terms including all 11 predictors (i.e. no variable selection). It also outperforms the Markov random field prior in Simulations 1 and 2, and outperforms Lasso regression in Simulation 3. In Simulations 1 and 2, Lasso regression offers the best predictive performance (we note that prediction used regression coefficients obtained by maximum penalised likelihood estimation; the alternative of using Equation 11 with the single model corresponding to nonzero coefficients gave very poor predictive accuracy, inferior to the baseline linear approach; data not shown). In Simulation 3, the best predictive accuracy is provided by the Markov random field prior. The penalised regression approach proposed in [21] has very poor predictive performance across all simulations, inferior to the baseline linear model. We also found that model averaging provided gains relative to prediction using the MAP model (Equation 11), with a 5%, 7% and 3% decrease in error on average for Simulations 13 respectively (see Table 1, ‘MAP’).
Figure 4. Synthetic response data; effect of sparsity restriction and range of prior strength parameter. Results reported in Figure 3, for the empirical Bayes approach, were obtained by exact model averaging with the number of predictors included in a model restricted to not exceed d_{max }= 4. Posterior inclusion probabilities for 50 simulated datasets from Simulation 1 were compared with results obtained by exact model averaging with an increased maximum number of included predictors of d_{max }= 5 (left) and using Markov chain Monte Carlobased model averaging with no sparsity restriction (centre). Sensitivity to the range of prior strength parameter values considered by empirical Bayes was also assessed by comparing the posterior inclusion probabilities obtained with to those obtained with an increased range of .
Figure 5. Network and pathway structure for biologically informative priors in the cancer drug response data study. Network constructed using information from cellsignal.com webcite. Square nodes represent fully connected subnetworks consisting of isoforms and phosphoforms of the named protein (see Additional file 1). Node colouring represents pathway structure. Red, blue, yellow, green and purple nodes denote 5 pathways. Orange nodes are in both the red and yellow pathways. Light grey nodes are in all 5 pathways. Dark grey node is in all pathways except the purple pathway. White nodes are not assigned to a pathway.
Addtional file 1. Cancer drug response application. Tables of proteins and cell lines included in the analysis, further details of the experimental procedure and Figure S1.
Format: PDF Size: 112KB Download file
This file can be viewed with: Adobe Acrobat Reader
The only userset parameters in the proposed method are d_{max} (the maximum number of predictors allowed in a model), and the range of values for the prior strength parameter λ to optimise over in empirical Bayes. We sought to check the sensitivity of our results to these parameters. As described in ‘Methods’ above, we set d_{max }= 4 and considered . We compared the posterior inclusion probabilities inferred from 50 iterations of Simulation 1 to those obtained using (i) an increased maximum number of included predictors of d_{max }= 5; (ii) Markov chain Monte Carlobased (MCMC) inference with no restriction on number of included predictors, and (iii) an increased range for the prior strength (see Figure 4). We found very close agreement in all cases, indicating that results reported do not depend on the sparsity restriction or the chosen range for λ.
In Simulation 2 and Simulation 3, the smallest value of λ = −5 was selected by empirical Bayes in a majority of iterations. The true, underlying model has the minimum possible number of pathways and intrapathway distance. Hence, the strong (negative) prior strength is appropriate because it causes the prior to heavily penalise any model not satisfying these minima. Under the increased range for λ, the smallest value (λ = −10) was still selected in these iterations, but results were almost identical (as seen for Simulation 1 in Figure 4). This indicates that the prior was already having close to maximal influence at the lower value of λ = −5.
Cancer drug response data
Aberrant signalling is heavily implicated in almost every aspect of cancer biology [24,25] and, as a result, signalling proteins are targets for many emerging cancer therapies. Here, we apply the methods proposed to probing phosphoproteomic influences on response to an anticancer agent Triciribine.
Phosphoprotein abundance was assayed in a highthroughput manner using the KinetWorks^{TM} system (Kinexus Inc, Vancouver, Canada), for p = 52 proteins related to epidermal growth factor receptor (EGFR) signalling, in each of n = 35 breast cancer cell lines (see Additional File 1:Sections 1.11.2 for details). The EGFR signalling network plays a central role in breast cancer biology [26] and the cell lines used have previously been shown to retain much of the biological heterogeneity of primary tumours [27]. GI50 (log transformed) was used to quantify response to Triciribine for each of the 35 cell lines [28]. GI50 is the concentration that causes 50% growth inhibition compared to a baseline. A network (with a total of five pathways) was constructed using http://cellsignal.com webcite (see Figure 5).
Figure 6 shows marginal likelihood scores arising from empirical Bayes. This selects the intrapathway distance prior (m = 2) with hyperparameter λ = 5 (i.e. a prior promoting larger distances). Due to the small sample size, we tested robustness of this choice by running empirical Bayes with each data sample removed. The same prior was selected in 86% of the iterations.
Figure 6. Drug response data, empirical Bayes analysis. Parameters controlling source of prior information and prior strength (m and λ respectively) were set objectively using the data. Log marginal likelihood (calculated exactly up to a constant) is plotted against λ for m = 1 (number of pathways prior) and m = 2 (intrapathway distance prior). Parameters were set to the values with maximal marginal likelihood: m = 2 and λ = 5.
Figure 7. Drug response data, posterior inclusion probabilities. Obtained via exact model averaging with (a) biologically informative pathwaybased model prior with parameters set objectively using empirical Bayes (m = 2 (intrapathway distance), λ = 5  see Figure 6), (b) flat prior and (c) "incorrect" biologically informative prior that is not optimal according to empirical Bayes analysis (m = 1 (number of pathways), λ = −5). Posterior inclusion probabilities provide a measure of how influential each protein is in determining drug response. Proteins contained in the single MAP model are shaded red.
Table 2. Drug response data, predictive errors from crossvalidation
Figure 7 shows posterior inclusion probabilities obtained under three prior regimes: empirical Bayes (intrapathway distance prior with λ = 5), flat prior and an "incorrect" prior that is not optimal according to the empirical Bayes analysis (number of pathways prior with λ = −5). PhosphoIR and phosphoRB(S259) stand out in the empirical Bayes analysis. Triciribine targets AKT, which inhibits apoptotic processes and is heavily implicated in cancer signalling [29]. IR (insulin receptor) is a tyrosine kinase receptor, known to stimulate the AKT pathway [30], and it has been suggested that the RB/E2F pathway, which is also known to play a role in cancer [31], has an effect on AKT activity via transcriptional regulation [32]. Hence, the salience of IR and RB accords with known biology and drug mechanism. The MAP model for each prior regime is highlighted in red in Figure 7. We note that these models do not always contain the proteins with highest inclusion probabilities.
We performed LeaveOneOutCrossValidation (LOOCV), making predictions for the heldout test sample using both posterior model averaging (Equation 10) and the MAP model (Equation 11). The full variable selection approach, including selection of hyperparameters with empirical Bayes, was carried out at each crossvalidation iteration. Table 2 shows mean absolute predictive errors, with comparisons made as in the synthetic response data study above. For the ‘incorrect’ prior, the prior source parameter not selected by empirical Bayes was used, along with the optimal strength parameter λ for that prior. Mirroring the synthetic data results (Simulations 1 and 2), we observe that prior elicitation with empirical Bayes provides an increase in mean predictive accuracy over a flat prior, an ‘incorrect’ prior, the Markov random field prior and the approach proposed by Li and Li [21], whilst Lasso regression has lowest mean predictive error. We note, however, that due to the very small sample size, differences in mean predictive error between these regimes are not conclusive. Yet, they all show an improvement over the baseline linear approach, and model averaging results in an average 26% decrease in predictive error over using MAP models. The prior strength parameter for the Markov random field prior was set to λ = −5 by empirical Bayes in every crossvalidation iteration (this agrees with the selection of a pathwaybased prior promoting large distances).
We again checked sensitivity of results to the restriction on the number of predictors included in a model, d_{max }= 4. The results in Figure 7 were compared with those obtained using an increased maximum number of included predictors of d_{max }= 5 and using MCMCbased inference with no such restriction (see Figure 8). The strong agreement between d_{max }= 4 and d_{max }= 5 suggests that the minor differences observed between d_{max }= 4 and MCMC are a result of inherent Monte Carlo error. We also see a close agreement between results in Figure 7a (using ) and those obtained by optimising over the increased range of (see Additional File 1:Figure S1). This shows that results reported do not depend on the sparsity restriction or the range of values considered for the prior strength parameter.
Figure 8. Drug response data; effect of sparsity restriction. Posterior inclusion probabilities in Figure 7 were obtained by exact model averaging with the number of predictors included in a model restricted to not exceed d_{max }= 4. These results were compared with results obtained by exact model averaging with an increased maximum number of included predictors of d_{max }= 5 (left column) and using Markov chain Monte Carlobased model averaging with no sparsity restriction (right column).
Table 3. Illustrative computation times
Discussion
Model priors incorporating biological information can play an important role in variable selection, especially at the small sample sizes characteristic of molecular studies. In applications where there are multiple sources of prior information, or multiple possible prior specifications, the empirical Bayes approach we put forward permits objective selection and weighting. This aids prior elicitation and guards against the use of misspecified priors. We demonstrated that a biologically informative prior, with hyperparameters set by empirical Bayes, can have benefits over both a flat prior and a subjectively formed prior which is incorrect with respect to the underlying system. We also observed that, whilst Lasso regression can offer some improvement in predictive performance over the Bayesian approaches, its accuracy in selecting the correct underlying model (i.e. variable selection) can be inferior to the proposed empirical Bayes approach, thereby affecting interpretability of results. Empirical Bayes approaches have previously been used in variable selection, but with standard Bernoullidistributed priors [33,34].
We developed informative priors in the context of protein signalling based on two highlevel features derived from network information: the number of pathways a subset of predictors incorporates and the intrapathway distance between proteins in a proposed model. This formulation used the entire network structure in an intuitive way, removing the the need to specify individual prior probabilities for each variable and avoiding assumptions of prior independence between variables.
Our pathwaybased priors form part of a growing literature on exploiting existing domain knowledge to aid inference, especially in the small sample setting. For example, recent variable selection studies also make use of graph structure within a Bayesian Markov random field prior [19,35,36] and within a nonBayesian framework [21,37,38], essentially preferring models containing predictors that are neighbours in the graph. This is similar in spirit to the special case of our prior where the network consists of a single pathway and short intrapathway distances are strongly preferred.
We compared our pathwaybased priors to the Markov random field prior, but found in Simulation 2 that empirical Bayes frequently set the prior strength parameter to an incorrect value, resulting in a prior that penalises models containing predictors that are neighbours in the network, instead of promoting them. This is likely due to the parameterisation of the Markov random field prior, which is not agnostic to the number of included predictors in the model γ; addition of a predictor to a model could lead to a substantial increase in the prior score. Indeed, it has previously been noted that Markov random field priors can be unstable with the occurance of phase transitions in γ [19]. Hence, the prior prefers less sparse models, but these models do not agree well with the data, as more complex models are penalised by the marginal likelihood. In contrast, our distance prior is based on an average distance measure and so is somewhat indifferent to γ. In Simulation 3, we do not observe this behaviour of the Markov random field prior; since the linear model does not include interaction terms, the model complexity does not increase as sharply with γ and so there is less disagreement between the prior and the marginal likelihood. We note that biologically informative priors have also been used for classification [22,39,40] and network inference [4143].
We also compared our approach to the networkbased penalisedlikelihood method proposed by Li and Li [21]. It performed poorly in Simulations 1 and 2, primarily due to its inability to capture nonlinear interplay. However, even in Simulation 3, with no interaction terms in the datagenerating model, it failed to match the performance of our proposed empirical Bayes approach with pathwaybased priors. This could be due to lack of similarity between the regression coefficients in the datagenerating model, which goes against an assumption of the penalisedlikelihood approach; that coefficients are similar for predictors that are neighbours in the network. This could also explain its poor predictive performance.
We used a continuous regression framework with interaction terms. Whilst discrete models are naturally capable of capturing nonlinear interplay between components, the discretisation process results in a loss of information. Continuous models avoid this loss, but the response is usually assumed to depend linearly on predictors. The product terms in our model provide the possibility of capturing influences on the response of interest by interplay between predictors, including higherorder interactions. Chipman [44] and Jensen et al.[3] have employed a related approach allowing pairwise interactions only. We note that, under our formulation, model complexity grows rapidly with number of included predictors. However, complex models are naturally penalised by the marginal likelihood formulation giving overall sparse, parsimonious models, yet allowing for complex interplay via product terms.
We carried out variable selection using exact model averaging. This was made possible by means of a sparsity restriction. Sparsity constraints have been employed in previous work in Bayesian variable selection [4,45] and also in the related setting of inference of gene regulatory networks [42,46]. The sparsityconstrained approach proposed is attractive as it yields exact posterior probabilities and facilitates exact empirical Bayes analysis. Sparsity is a reasonable assumption in settings where it is likely that only a few predictors play a key role in influencing a response. In such settings, and where data is of smalltomoderate dimensionality, our exact approach is fast and deterministic with no requirement of MCMC convergence diagnostics. This, together with empirical Bayes and the choice of parameter priors, results in the overall approach having very few userset parameters.
In applications of higher dimensionality, where the exact calculation is no longer feasible, empirical Bayes can still be performed using an approximate conditional marginal ‘likelihood’ approach as seen in George and Foster [33] and Yuan and Lin [34]. This involves optimisation over the model space instead of averaging. MCMC, with the selected hyperparameter values, can then be used to estimate inclusion probabilities. Alternatively, a fully Bayes MCMC approach could be taken, which places a prior on the hyperparameters and integrates them out (see e.g. [14]).
Illustrative computational times for our approach are shown in Table 3, for four values of p (number of predictors) and four values of d_{max}(maximum number of predictors allowed in a model). We also considered linear models with and without interaction terms. Empirical Bayes was used to select between two priors (M = 2) and to set the prior strength parameter (optimisation performed over ten values of λ). The computation time scales as for the model without interaction terms and for the model with interaction terms. We see that the approach is fast on datasets of moderate dimensionality (∼100 variables) with d_{max }= 3. We note that shortage of memory was the limiting factor on our machine. Computational time could also be improved by using multiple cores to calculate empirical Bayes marginal likelihood scores for multiple values of λsimultaneously.
We showed examples of automated selection between multiple sources of ancillary information, but, rather than selecting a single source, the methods proposed could be generalised to allow combinations of complementary information sources as seen in Jensen et al.[3]. Whilst our priors were based on pathway and network structure, the methods can also permit integration and weighting of publicly available data, which while plentiful, can be of uncertain relevance to a given study.
Conclusions
In this paper we have proposed an empirical Bayes method for objective selection and weighting of biologically informative prior information for integration within Bayesian variable selection. The method is computationally efficient, exact and has very few userset parameters. We developed informative pathwaybased priors in the context of protein signalling and illustrated our method on synthetic repsonse data. We demonstrated that in situations where there are several plausible formulations for the prior, it is capable of selecting the most appropriate. In particular, the approach has potential to significantly improve results by guarding against misspecification of priors. Comparisons were made to alternative methods, demonstrating that the proposed approach offers a competitive variable selection performance. We have also shown an application on cancer drug response data and obtained biologically plausible results. Our method is general and can be applied in any setting with multiple sources of prior knowledge.
Competing interests
The authors declare that they have no competing interests.
Authors contributions
SMH designed the priors, carried out all computational analyses and wrote the paper. SM conceived the study, provided feedback on all aspects and revised the manuscript. RMN and NB carried out the cell line assays. WLK and SZ carried out the drug response assays. PTS and JWG led the biological aspects of the work and provided feedback. All authors read and approved the final manuscript.
Acknowledgements
This work was supported by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DEAC0205CH11231, by the National Institutes of Health, National Cancer Institute grants U54 CA 112970 and P50 CA 58207 to JWG, and the Cancer Systems Biology Center grant from the Netherlands Organisation for Scientific Research. SMH and SM were supported under EPSRC EP/E501311/1.
References

Lee KE, Sha N, Dougherty ER, Vannucci M, Mallick BK: Gene selection: a Bayesian variable selection approach.
Bioinformatics 2003, 19:9097. PubMed Abstract  Publisher Full Text

Rogers S, Girolami M: A Bayesian regression approach to the inference of regulatory networks from gene expression data.
Bioinformatics 2005, 21(14):31313137. PubMed Abstract  Publisher Full Text

Jensen ST, Chen G, Stoeckert CJ: Bayesian variable selection and data integration for biological regulatory networks.
Ann Appl Stat 2007, 1:612633. Publisher Full Text

Mukherjee S, et al.: Sparse combinatorial inference with an application in cancer biology.
Bioinformatics 2009, 25:265271. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wu TT, Chen YF, Hastie T, Sobel E, Lange K: Genomewide association analysis by lasso penalized logistic regression.
Bioinformatics 2009, 25:714721. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

AiJun Y, XinYuan S: Bayesian variable selection for disease classification using gene expression data.
Bioinformatics 2010, 26:215222. PubMed Abstract  Publisher Full Text

Brown PJ, Vannucci M, Fearn T: Bayes model averaging with selection of regressors.
J R Stat Soc B 2002, 64:519536. Publisher Full Text

George EI, McCulloch RE: Approaches for Bayesian variable selection.

Chipman H, et al.: The practical implementation of Bayesian model selection.

Sachs K, Perez O, Pe’er D, Lauffenburger DA, Nolan GP: Causal proteinsignaling networks derived from multiparameter singlecell data.
Science 2005, 308:523529. PubMed Abstract  Publisher Full Text

Madigan D, York J, Allard D: Bayesian graphical models for discrete data.
Int Stat Rev 1995, 63:215232. Publisher Full Text

Raftery AE, Madigan D, Hoeting JA: Bayesian model averaging for linear regression models.
J Am Stat Assoc 1997, 92:179191. Publisher Full Text

Robert CP, Casella G: Monte Carlo Statistical Methods. New York: Springer; 2004.

Nott DJ, Green PJ: Bayesian variable selection and the SwendsenWang algorithm.
J Comput Graph Stat 2004, 13:141157. Publisher Full Text

Smith M, Kohn R: Nonparametric regression using Bayesian variable selection.
J Econometrics 1996, 75:317343. Publisher Full Text

Denison DGT, Holmes CC, Mallick BK, Smith AFM: Bayesian Methods for Nonlinear Classification and Regression. London: Wiley; 2002.

Kohn R, Smith M, Chan D: Nonparametric regression using linear combinations of basis functions.
Stat Comput 2001, 11:313322. Publisher Full Text

Carlin BP, Louis TA: Bayesian Methods for Data Analysis. Chapman & Hall; 2008.

Li F, Zhang NR: Bayesian variable selection in structured highdimensional covariate spaces with applications in genomics.
J Am Stat Assoc 2010, 105:12021214. Publisher Full Text

Tibshirani R: Regression shrinkage and selection via the lasso.

Li C, Li H: Networkconstrained regularization and variable selection for analysis of genomic data.
Bioinformatics 2008, 24:11751182. PubMed Abstract  Publisher Full Text

Stingo FC, Vannucci M: Variable selection for discriminant analysis with Markov random field priors for the analysis of microarray data.
Bioinformatics 2011, 27:495501. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Friedman J, Hastie T, Tibshirani R: Regularization paths for generalized linear models via coordinate descent.

Weinberg RA: The Biology of Cancer. New York: Garland Science; 2006.

Hanahan D, Weinberg RA: The hallmarks of cancer.
Cell 2000, 100:5770. PubMed Abstract  Publisher Full Text

Yarden Y, Sliwkowski MX: Untangling the ErbB signalling network.
Nat Rev Mol Cell Biol 2001, 2:127137. PubMed Abstract  Publisher Full Text

Neve RM, et al.: A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes.
Cancer Cell 2006, 10:515527. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Heiser LM, et al.: Subtype and pathway specific responses to anticancer compounds in breast cancer.
Proceedings of the National Academy of Sciences 2011,.
in press

Yang L, et al.: Akt/protein kinase B signaling inhibitor2, a selective small molecule inhibitor of Akt signaling with antitumor activity in cancer cells overexpressing Akt.
Cancer Res 2004, 64:43944399. PubMed Abstract  Publisher Full Text

Burgering BMT, Coffer PJ: Protein kinase B (cAkt) in phosphatidylinositol3OH kinase signal transduction.

Nevins JR: The Rb/E2F pathway and cancer.
Hum Mol Genet 2001, 10:699703. PubMed Abstract  Publisher Full Text

Chaussepied M, Ginsberg D: Transcriptional Regulation of AKT Activation by E2F.
Mol Cell 2004, 16:831837. PubMed Abstract  Publisher Full Text

George EI, Foster DP: Calibration and empirical Bayes variable selection.
Biometrika 2000, 87:731747. Publisher Full Text

Yuan M, Lin Y: Efficient empirical Bayes variable selection and estimation in linear models.
J Am Stat Assoc 2005, 100:12151225. Publisher Full Text

Wei Z, Li H: A hidden spatialtemporal Markov random field model for networkbased analysis of time course gene expression data.

Li H, Monni S: Bayesian methods for networkstructured genomics data. In Frontiers of Statistical Decision Making and Bayesian Analysis. Edited by Chen MH, Müller P, Sun D, Ye K, Dey DK. New, York: Springer; 2010:303315.

Binder H, Schumacher M: Incorporating pathway information into boosting estimation of highdimensional risk prediction models.
BMC Bioinf 2009, 10:18. BioMed Central Full Text

Slawski M, zu Castell W, Tutz G: Feature selection guided by structural information.
Ann Appl Stat 2010, 4:10561080. Publisher Full Text

Zhu Y, Shen X, Pan W: Networkbased support vector machine for classification of microarray samples.

Guillemot V, Tenenhaus A, Le Brusquet L, Frouin V: Graph constrained discriminant analysis: A new method for the integration of a graph into a classification process.
PLoS ONE 2011, 6:e26146. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hartemink AJ, Bernard A: Informative structure priors: Joint learning of dynamic regulatory networks from multiple types of data. In Pac Symp Biocomput 2005. Singapore: World Scientific; 2005:459470.

Werhli AV, Husmeier D: Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge.

Mukherjee S, Speed TP: Network inference using informative priors.
Proc Natl Acad Sci USA 2008, 105:1431314318. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chipman H: Bayesian variable selection with related predictors.
Can J Stat 1996, 24:1736. Publisher Full Text

Jiang W: Bayesian variable selection for high dimensional generalized linear models: convergence rates of the fitted densities.
Ann Stat 2007, 35:14871511. Publisher Full Text

Husmeier D: Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks.
Bioinformatics 2003, 19:22712282. PubMed Abstract  Publisher Full Text