The Netherlands Cancer Institute, 1066 CX Amsterdam, The Netherlands

Centre for Complexity Science, University of Warwick, Coventry CV4 7AL, UK

Department of Statistics, University of Warwick, Coventry CV4 7AL, UK

Genentech Inc., San Francisco, CA 94080

Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720

Center for Spatial Systems Biomedicine, Oregon Health & Science University, Portland, OR 97239

Abstract

Background

An important question in the analysis of biochemical data is that of identifying subsets of molecular variables that may jointly influence a biological response. Statistical variable selection methods have been widely used for this purpose. In many settings, it may be important to incorporate ancillary biological information concerning the variables of interest. Pathway and network maps are one example of a source of such information. However, although ancillary information is increasingly available, it is not always clear how it should be used nor how it should be weighted in relation to primary data.

Results

We put forward an approach in which biological knowledge is incorporated using informative prior distributions over variable subsets, with prior information selected and weighted in an automated, objective manner using an empirical Bayes formulation. We employ continuous, linear models with interaction terms and exploit biochemically-motivated sparsity constraints to permit exact inference. We present example priors encoding pathway- and network-based information and illustrate our proposed method both on synthetic response data and in an application to cancer drug response data. Comparisons are also made to alternative Bayesian and frequentist penalised-likelihood methods for incorporating network-based information.

Conclusions

The empirical Bayes method proposed here can aid prior elicitation for Bayesian variable selection studies and help to guard against mis-specification of priors. Empirical Bayes, together with the proposed pathway-based priors, results in an approach with a competitive variable selection performance. In addition, the overall procedure is fast, deterministic, and has very few user-set parameters, yet is capable of capturing interplay between molecular players. The approach presented is general and readily applicable in any setting with multiple sources of biological prior knowledge.

Background

Ongoing advancements and cost reductions in biochemical technology are enabling acquisition of ever richer datasets. In many settings, in both basic biology and medical studies, it may be important to model the relationship between assayed molecular entities, such as genes, proteins or metabolites, and a biological response of interest.

Molecular players may act in concert to influence biological response: this has motivated a need for multivariate methods capable of modelling such joint activity. When sample sizes are small-to-moderate, as is often the case in molecular studies, robust modelling of joint influences becomes especially challenging. However, often it is likely that only a small number of players are critical in influencing the response of interest. Then, the challenge is to identify appropriate variable subsets.

Statistical variable selection methods have been widely used in the bioinformatics domain to discover subsets of influential molecular predictors. Both penalised likelihood and Bayesian approaches have been used in a diverse range of applications.

Bayesian approaches can facilitate the integration of ancillary information regarding variables under study through prior probability distributions. Ongoing development of online tools and databases has meant that such information is widely available, and depending on context, may include networks and pathway maps, public gene expression datasets, molecular interaction databases, ontologies and so on. However, while the idea of incorporating such information into variable selection has a clear appeal, it is not always obvious what information should be included nor how it should be weighted. Indeed, many existing Bayesian variable selection approaches do not attempt integrative analyses exploiting such information and instead employ standard priors that do not specify preferences for particular variables, but may, for example, encode a preference for sparse models.

We develop a variable selection procedure in which an empirical Bayes approach is used to objectively select between a choice of informative priors incorporating ancillary information (‘biologically informative priors’) and also to objectively weight the contribution of the prior to the overall analysis. The work presented here is motivated by questions concerning the relationship between signalling proteins and drug response in human cancers. In the protein signalling setting (as also in gene regulation) there is now much information available, both in the literature and in diverse online resources, concerning relevant pathways and networks. We therefore develop pathway- and network-based informative priors for this setting, applying the methods proposed to automatically select and weight the prior and thence carry out variable selection.

The relationship between response and predictors is modelled using a continuous, linear model with interaction terms. In this way we avoid data discretization (which can lose information) yet retain the ability to capture combinatorial interplay. We take advantage of biochemically-motivated sparsity constraints to permit exact inference, thereby avoiding the need for approximate approaches such as Markov chain Monte Carlo (MCMC). This enables the calculation of exact probability scores over which variables are likely to be influential. The overall procedure is computationally fast: empirical Bayes analysis and subsequent calculation of posterior (inclusion) probabilities for 52 predictors via full model averaging required only 10 minutes (in MATLAB R2010a on a standard single-core personal computer; code freely available, together with simulation scripts, at

The remainder of the paper is organised as follows. We begin below by defining notation and reviewing Bayesian variable selection. We then describe methods, including empirical Bayes analysis to objectively select and weight biologically informative prior information, pathway-based informative priors and exact inference. We illustrate our method on published single-cell proteomic data.

Notation

Let **Y** = (Y_1, …, Y_n)ᵀ denote the response vector for n samples, and let **X** denote the n × p matrix of predictor data; the p predictor values for sample i form row **X**_i of **X**.

Let γ = (γ_1, …, γ_p) ∈ {0,1}^p be a binary vector indexing a subset of predictors (a 'model'). **X**_γ is the matrix obtained from **X** by removing those columns j for which γ_j = 0. Similarly, for a vector **a**, **a**_γ is obtained from **a** by removing components a_j for which γ_j = 0.

Bayesian variable selection

Bayesian linear model

Consider the classical linear model **Y** = **X**_γ **β**_γ + **ε** (Equation 1), where, given model γ, **ε** is an n-vector of independent, zero-mean Gaussian noise terms with variance σ².

We are interested in the posterior distribution over models P(γ | **Y**, **X**). From Bayes' rule we have

P(γ | **Y**, **X**) ∝ P(**Y** | **X**, γ) P(γ),   (2)

where P(**Y** | **X**, γ) is the marginal likelihood. This quantity integrates out the regression coefficients **β**_γ and variance parameter σ² and thereby automatically penalises complex models with many parameters. This penalisation occurs because a more complex model has a larger parameter space: the integral that defines the marginal likelihood is then over a greater number of dimensions, with prior mass spread over a larger space. This in turn results in a lower marginal likelihood score.

Model selection and model averaging

The posterior distribution over models P(γ | **Y**, **X**) can be used to find a single, maximum a posteriori (MAP) model, or averaged over to obtain posterior inclusion probabilities for each predictor j:

P(γ_j = 1 | **Y**, **X**) = Σ_{γ: γ_j = 1} P(γ | **Y**, **X**).   (3)

These inclusion probabilities are a measure of the importance of each individual predictor in determining the response.

Evaluating the summation in Equation 3 requires enumerating the entire posterior over models P(γ | **Y**, **X**). The model space Γ can be vast (it contains 2^p models), so exhaustive enumeration is infeasible in general.
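As a concrete illustration of Equation 3, the sketch below computes posterior inclusion probabilities from an already-computed posterior over models. The toy posterior and predictor count are hypothetical, standing in for the exact enumeration described later in the paper.

```python
def inclusion_probabilities(model_posterior, p):
    """Posterior inclusion probabilities (Equation 3): for each predictor j,
    sum P(gamma | Y, X) over all models gamma that include predictor j."""
    incl = [0.0] * p
    for gamma, prob in model_posterior.items():  # gamma: tuple of included indices
        for j in gamma:
            incl[j] += prob
    return incl

# Toy posterior over models on p = 3 predictors (weights sum to 1)
post = {(): 0.1, (0,): 0.3, (0, 1): 0.4, (2,): 0.2}
print(inclusion_probabilities(post, 3))  # approximately [0.7, 0.4, 0.2]
```

Each predictor's score simply accumulates the posterior mass of every model containing it, so the scores need not sum to one across predictors.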

Model prior

Calculating the posterior distribution over models (2) requires specifying a prior over Γ. A standard choice takes the indicators γ_j to be independent Bernoulli variables, each with a common success parameter (the prior inclusion probability).

These priors provide no information regarding specific predictors and do not utilise domain knowledge. Employing predictor-dependent hyperparameters for each γ_j enables incorporation of prior knowledge that some predictors are more important than others. However, utilising such a prior may be difficult in practice due to the many hyperparameters that must be subjectively specified. We note also that in this formulation, prior inclusion probabilities are still independent.
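In this formulation the prior factorises over predictors. Writing w_j for the success parameter of the j-th Bernoulli indicator (a generic name, used here only for illustration), the prior probability of a model is a simple product:

```python
def bernoulli_model_prior(gamma, w):
    """Independent Bernoulli prior over models: indicator gamma_j equals 1
    with probability w_j, independently across predictors."""
    prob = 1.0
    for g, wj in zip(gamma, w):
        prob *= wj if g == 1 else (1.0 - wj)
    return prob

# A common w_j = 0.5 assigns every model the same prior mass 2**(-p)
print(bernoulli_model_prior([1, 0, 1], [0.5, 0.5, 0.5]))  # 0.125
```

With w_j < 0.5 the prior encodes a preference for sparse models, but, as noted above, it carries no information about interplay between specific predictors.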

Methods

We now describe the Bayesian variable selection method used in the present work. We describe, in turn, an extended linear model including interactions between predictors, exact computation of posterior inclusion probabilities, biologically informative model priors, and empirical Bayes learning of the associated hyperparameters.

Bayesian linear model with interaction terms

We extend the classical linear model in Equation 1 above to enable combinatorial relationships between predictors and response to be captured. Given model γ, the mean of response Y_i depends in a non-linear fashion on the included predictors **X**_{iγ}, whilst remaining linear in the regression parameters. In particular, the mean for Y_i is a linear combination of the included predictors and all possible products of included predictors. For example, if γ_3 = γ_5 = 1 (and all other γ_j = 0), we have Y_i = **X**_{iγ} **β**_γ + β_{3,5} X_{i3} X_{i5} + ε_i. We extend the design matrix **X**_γ and regression coefficient vector **β**_γ accordingly to include these product terms.

The likelihood now takes the form

We choose hierarchical parameter priors following Smith and Kohn, taking the prior for **β**_γ given σ² to be Normal and placing a standard diffuse prior on σ². We note that this choice is in contrast to the widely-used normal inverse-gamma prior.

Exact posterior inclusion probabilities

We enforce a restriction on the number of predictors that are allowed to be included in the model. That is, we only allow models γ with at most d_max included predictors. In the applications below we set d_max = 4, giving a restricted model space of roughly 3 × 10^5 models for p = 52 predictors, in contrast to the 2^52 ≈ 4.5 × 10^15 models of the unrestricted space.
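The size of the restricted space follows directly: with at most d_max of p predictors included, the number of allowed models is the binomial sum over subset sizes k = 0, …, d_max. A minimal sketch (using p = 52 and d_max = 4 as in the text):

```python
from math import comb
from itertools import combinations

def restricted_model_space_size(p, d_max):
    """Number of models with at most d_max of p predictors included."""
    return sum(comb(p, k) for k in range(d_max + 1))

def enumerate_models(p, d_max):
    """Generate every allowed model as a tuple of included predictor indices."""
    for k in range(d_max + 1):
        yield from combinations(range(p), k)

# For p = 52 predictors and d_max = 4 the restricted space is small enough
# to enumerate exhaustively, in contrast to the 2**52 unrestricted models
print(restricted_model_space_size(52, 4))  # 294204
```

Exhaustive enumeration over this restricted space is what makes the exact posterior computations in the paper feasible.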

Biologically informative model priors

We now turn our attention to the model prior P(γ).

Suppose we have M candidate sources of prior knowledge, each encoded by a function f_m(γ) that scores models according to the corresponding biological information, together with a strength parameter that controls how heavily the prior scores are weighted (positive values favouring models with large f_m(γ), negative values penalising them).

We consider two simple pathway-based priors, capturing information regarding number of pathways and intra-pathway distances via functions f_1 and f_2 respectively. Below we give details for each.

Properties of pathway-based priors

**Properties of pathway-based priors.** Priors are encoded by functions f_1(γ) (number of pathways) and f_2(γ) (intra-pathway distance). Top row: model γ_1 has larger intra-pathway distance than γ_2 (top left); distance is agnostic to number of pathways (top middle); addition of a singleton has no effect on distance (top right). Bottom row: the root component in each network is in both pathways; however, the pathway counts, and hence f_1(γ), differ.

Number of pathways (_{1})

The first pathway-based feature encodes the notion that predictors that are influential in determining response may belong to a small number of pathways or, in contrast, may be spread across many pathways. We encode such beliefs by a function f_1(γ) of the pathway count N_γ, the number of distinct pathways containing at least one predictor included in γ.

This definition prevents the empty model from being favoured by f_1(γ).
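The exact functional form of f_1 is not fully reproduced above, but the pathway count on which it operates can be sketched as follows; the membership map and pathway names are hypothetical.

```python
def pathway_count(included, pathway_membership):
    """N_gamma: number of distinct pathways touched by the included predictors.
    pathway_membership maps predictor index -> set of pathway names."""
    touched = set()
    for j in included:
        touched |= pathway_membership.get(j, set())
    return len(touched)

# Hypothetical membership: predictors 0, 1 in MAPK; 2 in PI3K; 3 in both
membership = {0: {"MAPK"}, 1: {"MAPK"}, 2: {"PI3K"}, 3: {"MAPK", "PI3K"}}
print(pathway_count((0, 1), membership))  # 1
print(pathway_count((0, 2), membership))  # 2
```

A prior built on this count can then favour models concentrated in few pathways (or, with the opposite sign of the strength parameter, models spread across many).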

Intra-pathway distance (_{2})

The second feature we consider is that variables which jointly influence the response may either be close to each other in a network sense, or may in fact be far apart in the network. Such beliefs are encoded by a function f_2(γ). The distance between two predictors v_1 and v_2, denoted d(v_1, v_2), is the number of edges in the shortest (undirected) path between them. We then define f_2(γ) in terms of D_γ, the average of all pairwise distances d(v_1, v_2) over predictors included in γ; models with fewer than two included predictors have D_γ = 0. The function f_2(γ) decreases with D_γ (see Figure), taking its maximum value when D_γ = 1, that is, when all pairs of included predictors are direct neighbours. A negative strength parameter reverses this preference, favouring models whose included predictors are distant in the network.

Empirical Bayes

We set the prior source and strength parameters in an objective manner using empirical Bayes, that is, by maximising the marginal likelihood of the data with respect to these hyperparameters.

For a given choice of hyperparameters, the marginal likelihood can be calculated exactly by exploiting the model space restriction described above. The score is calculated for varying hyperparameters and those resulting in the largest score are used for variable selection.
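The empirical Bayes step therefore amounts to a grid search over the source and strength hyperparameters, keeping the pair that maximises the exactly computed marginal likelihood. A minimal sketch, with the marginal likelihood abstracted as a caller-supplied function and a toy stand-in used for illustration:

```python
from itertools import product

def empirical_bayes_select(log_marginal_likelihood, sources, strengths):
    """Grid search over prior source and strength hyperparameters, returning
    the pair with the largest marginal likelihood score.
    log_marginal_likelihood(source, strength) is supplied by the caller."""
    best, best_score = None, float("-inf")
    for source, strength in product(sources, strengths):
        score = log_marginal_likelihood(source, strength)
        if score > best_score:
            best, best_score = (source, strength), score
    return best, best_score

# Toy score peaked at source "f2", strength -1 (a stand-in for the exact score)
toy = lambda s, lam: -(lam + 1) ** 2 - (0.0 if s == "f2" else 1.0)
print(empirical_bayes_select(toy, ["f1", "f2"], [-2, -1, 0, 1]))
# (('f2', -1), 0.0)
```

Because each grid point is independent, the scores for different hyperparameter values can be computed in parallel.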

Prediction

Given already observed data **X**, **Y**, we can predict the expected value of a new response Y′ from new predictor data **X**′ by model averaging:

with

and the model posterior **Y**,**X**) calculated via Equations 2, 6 and 7.
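The model-averaged prediction above is simply a posterior-weighted sum of per-model predictions. A minimal sketch, with toy posterior weights and per-model point predictions standing in for the exact quantities:

```python
def model_averaged_prediction(model_posterior, per_model_prediction):
    """Predicted E[Y' | X'] as a posterior-weighted sum over models:
    sum_gamma P(gamma | Y, X) * E[Y' | X', gamma].
    per_model_prediction maps each model to its point prediction."""
    return sum(prob * per_model_prediction[gamma]
               for gamma, prob in model_posterior.items())

# Toy example: two models with posterior weights 0.75 and 0.25
post = {(0,): 0.75, (0, 1): 0.25}
preds = {(0,): 2.0, (0, 1): 4.0}
print(model_averaged_prediction(post, preds))  # 2.5
```

Averaging over models in this way accounts for model uncertainty, rather than committing to a single selected model.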

Results

We first show an application of our proposed approach to synthetic response data generated from a published study of cell signalling, and then further illustrate the approach with an analysis of proteomic data and drug response from breast cancers.

Synthetic response data

In ongoing studies, such as that presented below, truly objective performance comparisons may be challenging, since we usually do not know which molecules are truly influential in driving biological response. At the same time, in fully synthetic data it can be difficult to mimic realistic correlations between variables within a pathway or across a network. For this reason, we empirically assessed the methods proposed using published single-cell, phospho-proteomic data
with n_tot = 853 samples.

Figure

Protein network and pathway structure for biologically informative priors in the synthetic response data study

**Protein network and pathway structure for biologically informative priors in the synthetic response data study.** Responses were generated from published phospho-proteomic data

Synthetic response data, average ROC curves

**Synthetic response data, average ROC curves.** Number of true positives plotted against number of false positives for Simulations 1, 2 and 3. Proteomic data from Sachs et al.

We first considered two simulation models, with data-generating models chosen in accordance with particular prior specifications: an intra-pathway distance prior (f_2) with positive strength, a pathway-count prior (f_1) with negative strength, and an intra-pathway distance prior (f_2) with negative strength.

We are especially interested in the small-sample regime that is often of interest in molecular studies. We therefore subsampled (without replacement)

Subsampling was repeated to give 5,000 training/test pairs, over which results are reported below. At each iteration, only small-sample training data was used for inference. The empirical Bayes method was employed to set prior source and strength parameters (using training data only), with

We assessed performance by comparing the true underlying model γ* to the model γ_τ obtained by thresholding posterior inclusion probabilities at level τ. We compared the following methods:

(i) BVS with flat prior and linear model with interaction terms (‘BVS: flat +int’);

(ii) BVS with a prior that is incorrect with respect to the true, underlying model: an intra-pathway distance prior (f_2) favouring small distances ('BVS: incorrect prior');

(iii) BVS with flat prior and linear model with no interaction terms (‘BVS: flat -int’);

(iv) BVS with a Markov random field prior

(v) penalised-likelihood Lasso regression

(vi) penalised-likelihood Lasso regression

(vii) a penalised-likelihood approach, proposed by Li and Li

(viii) absolute correlation coefficients between each predictor and response (‘corr’).

Markov random field priors have previously been used in Bayesian variable selection to take the network structure of predictors into account. Let **A** = (A_{i,j}) be a binary, symmetric adjacency matrix with A_{i,j} = 1 if and only if edge (i, j) is present in the network.

The strength parameter
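For reference, a common parameterisation of such a Markov random field prior (not necessarily the exact one used here) scores a model by a linear term in the number of included predictors plus a pairwise term rewarding included neighbours:

```python
def mrf_log_prior(gamma, A, a, b):
    """Log of a Markov random field model prior, up to its normalising constant:
    a * sum_j gamma_j + b * (pairwise sum of gamma_i * A[i][j] * gamma_j).
    With b > 0, models whose included predictors are network neighbours score higher."""
    p = len(gamma)
    linear = sum(gamma)
    pairwise = sum(gamma[i] * A[i][j] * gamma[j]
                   for i in range(p) for j in range(i + 1, p))
    return a * linear + b * pairwise

# Chain network 0 - 1 - 2: including neighbours (0, 1) scores higher than (0, 2)
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(mrf_log_prior([1, 1, 0], A, a=-1.0, b=2.0))  # 0.0
print(mrf_log_prior([1, 0, 1], A, a=-1.0, b=2.0))  # -2.0
```

Note how the pairwise term grows with the number of included predictors, which relates to the instability of this prior discussed later in the paper.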

Lasso regression performs variable selection by placing an ℓ_1 penalty on the regression coefficients. This has the effect of shrinking a subset of regression coefficients to exactly zero; the predictors with non-zero coefficients are taken as the inferred model. Sparsity of the inferred model is controlled by a tuning parameter, which we set by 5-fold cross-validation. This method results in a single inferred model (i.e. point estimate). However, a full ROC curve can still be obtained by thresholding absolute regression coefficients.
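ROC curves of the kind reported here can be traced by sweeping a threshold over the ranked scores (posterior inclusion probabilities, or absolute Lasso coefficients). A minimal sketch, with hypothetical scores and true model:

```python
def roc_points(scores, true_gamma):
    """True/false positive counts obtained by thresholding per-predictor scores
    (e.g. posterior inclusion probabilities) at each observed level tau."""
    thresholds = sorted(set(scores), reverse=True)
    points = []
    for tau in thresholds:
        selected = [j for j, s in enumerate(scores) if s >= tau]
        tp = sum(1 for j in selected if true_gamma[j] == 1)
        fp = len(selected) - tp
        points.append((fp, tp))
    return points

# True model includes predictors 0 and 2; scores are hypothetical
print(roc_points([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 0]))
# [(0, 1), (0, 2), (1, 2), (2, 2)]
```

Averaging such curves over repeated training/test draws gives the average ROC curves shown in the figures.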

The penalised-likelihood method proposed by Li and Li

We observe that, in both simulations, the automated empirical Bayes analysis, with pathway-based priors, improves performance over the flat prior and provides substantial gains over an incorrect prior. The empirical Bayes approach selected the correct prior in 85% of iterations for Simulation 1 and 96% of iterations for Simulation 2 (for Simulation 1 correct prior parameters were

In Simulation 1, the strength parameter for the Markov random field prior was set to

The penalised-likelihood approach proposed in

Since the network-based penalised-likelihood approach

**Predictions using small-sample training data.** Results shown are mean absolute predictive errors ± SEM for Simulations 1, 2 and 3, using model averaging (MA) and the maximum a posteriori model (MAP).

| Method | Simulation 1 (MA) | Simulation 1 (MAP) | Simulation 2 (MA) | Simulation 2 (MAP) | Simulation 3 (MA) | Simulation 3 (MAP) |
| --- | --- | --- | --- | --- | --- | --- |
| BVS: EB prior^† | 0.819±0.004 | 0.850±0.004 | 0.837±0.004 | 0.889±0.005 | 0.899±0.002 | 0.918±0.002 |
| BVS: flat prior^† | 0.845±0.004 | 0.919±0.005 | 0.845±0.004 | 0.919±0.006 | 0.904±0.002 | 0.927±0.003 |
| BVS: 'incorrect' prior^† | 0.858±0.003 | 0.895±0.003 | 0.918±0.003 | 1.003±0.004 | 0.969±0.003 | 1.036±0.003 |
| BVS: MRF prior^† | 0.830±0.004 | 0.877±0.005 | 0.871±0.004 | 0.920±0.006 | 0.886±0.002 | 0.911±0.002 |
| Lasso^† | 0.791±0.003 | − | 0.790±0.003 | − | 0.913±0.002 | − |
| Li&Li | 1.246±0.009 | − | 1.476±0.012 | − | 1.760±0.012 | − |
| Baseline linear | 1.000±0.002 | − | 1.000±0.002 | − | 1.000±0.002 | − |

^† linear model with interaction terms for Simulations 1 and 2, and without interaction terms for Simulation 3. Lasso, Li&Li and baseline linear regression each yield a single point prediction per simulation (no separate MA/MAP estimates).

The failure of the incorrect prior illustrates the importance of prior elicitation. Moreover, our results demonstrate that the proposed empirical Bayes approach can select a suitable prior automatically, even under very small sample conditions.

For each dataset, we used the posterior predictive distribution (Equation 10; calculated via exact model averaging) to predict responses for held-out test data. Mean absolute predictive errors, obtained by averaging over all 5,000 train/test iterations, are shown in Table

Synthetic response data; effect of sparsity restriction and range of prior strength parameter

**Synthetic response data; effect of sparsity restriction and range of prior strength parameter.** Results reported in the Figure were obtained with a sparsity restriction of d_max = 4. Posterior inclusion probabilities for 50 simulated datasets from Simulation 1 were compared with results obtained by exact model averaging with an increased maximum number of included predictors of d_max = 5 (left) and using Markov chain Monte Carlo-based model averaging with no sparsity restriction (centre). Sensitivity to the range of prior strength parameter values considered by empirical Bayes was also assessed by comparing the posterior inclusion probabilities obtained with an enlarged range (right).

Network and pathway structure for biologically informative priors in the cancer drug response data study

**Network and pathway structure for biologically informative priors in the cancer drug response data study.** Network constructed using information from

**Cancer drug response application.** Tables of proteins and cell lines included in the analysis, further details of the experimental procedure and Figure S1.


The only user-set parameters in the proposed method are d_max (the maximum number of predictors allowed in a model) and the range of values considered for the prior strength parameter. We used d_max = 4 and checked sensitivity of results to these choices by comparing against: (i) an increased maximum of d_max = 5; (ii) Markov chain Monte Carlo-based (MCMC) inference with no restriction on the number of included predictors; and (iii) an increased range for the prior strength parameter.

In Simulation 2 and Simulation 3, the smallest value of

Cancer drug response data

Aberrant signalling is heavily implicated in almost every aspect of cancer biology.

Phospho-protein abundance was assayed in a high-throughput manner using the KinetWorks^{TM} system (Kinexus Inc, Vancouver, Canada), for

Figure

Drug response data, empirical Bayes analysis

**Drug response data, empirical Bayes analysis.** Parameters controlling source of prior information and prior strength.

Drug response data, posterior inclusion probabilities

**Drug response data, posterior inclusion probabilities.** Obtained via exact model averaging with **(a)** a biologically informative pathway-based model prior with parameters set objectively using empirical Bayes, **(b)** a flat prior and **(c)** an 'incorrect' biologically informative prior that is not optimal according to the empirical Bayes analysis.

**Cancer drug response data, predictive performance.** Predictions using leave-one-out cross-validation (see text for details). Results shown are mean absolute predictive errors ± SEM for the following methods: Bayesian variable selection (BVS) with biologically informative pathway-based prior with source and strength parameters set by empirical Bayes, BVS with flat prior, BVS with 'incorrect' prior (contradicting empirical Bayes; see text for details), BVS with a Markov random field (MRF) prior, Lasso regression, the penalised-likelihood approach proposed by Li and Li, and baseline linear regression. MA: model averaging; MAP: maximum a posteriori model.

| Method | MA | MAP |
| --- | --- | --- |
| BVS: EB prior +int | 0.84±0.12 | 1.00±0.16 |
| BVS: flat prior +int | 0.86±0.11 | 1.26±0.17 |
| BVS: 'incorrect' prior +int | 0.93±0.15 | 1.22±0.17 |
| BVS: MRF prior +int | 0.86±0.11 | 1.24±0.17 |
| Lasso +int | 0.73±0.10 | − |
| Li&Li | 0.96±0.21 | − |
| Baseline linear | 1.00±0.14 | − |

Lasso, Li&Li and baseline linear regression each yield a single point prediction (no separate MA/MAP estimates).

Figure

We performed Leave-One-Out-Cross-Validation (LOOCV), making predictions for the held-out test sample using both posterior model averaging (Equation 10) and the MAP model (Equation 11). The full variable selection approach, including selection of hyperparameters with empirical Bayes, was carried out at each cross-validation iteration. Table

We again checked sensitivity of results to the restriction on the number of predictors included in a model, d_max = 4. The results in Figure agree closely with those obtained using an increased maximum of d_max = 5 and using MCMC-based inference with no such restriction (see Figure). The close agreement between d_max = 4 and d_max = 5 suggests that the minor differences observed between d_max = 4 and MCMC are a result of inherent Monte Carlo error. We also see a close agreement between results in Figure

Drug response data; effect of sparsity restriction

**Drug response data; effect of sparsity restriction.** Posterior inclusion probabilities in Figure were obtained with a sparsity restriction of d_max = 4. These results were compared with results obtained by exact model averaging with an increased maximum number of included predictors of d_max = 5 (left column) and using Markov chain Monte Carlo-based model averaging with no sparsity restriction (right column).

**Computation times.** Times (in seconds) for the proposed Bayesian variable selection procedure, using empirical Bayes to select between two priors (**M** = 2), for linear models without (−int) and with (+int) interaction terms, varying the number of predictors p and the sparsity restriction d_max. A dash indicates settings that exceeded available memory on the machine used.

| | −int, d_max=2 | −int, d_max=3 | −int, d_max=4 | −int, d_max=5 | +int, d_max=2 | +int, d_max=3 | +int, d_max=4 | +int, d_max=5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| p=30 | 0.1 | 1.1 | 8.7 | 9.5 | 0.4 | 4.7 | 38.6 | 374.6 |
| p=60 | 0.5 | 10.5 | 114.3 | − | 1.8 | 39.4 | 661.6 | − |
| p=120 | 2.8 | 116.3 | − | − | 8.2 | 350.1 | − | − |
| p=500 | 150.3 | − | − | − | 238.7 | − | − | − |

Discussion

Model priors incorporating biological information can play an important role in variable selection, especially at the small sample sizes characteristic of molecular studies. In applications where there are multiple sources of prior information, or multiple possible prior specifications, the empirical Bayes approach we put forward permits objective selection and weighting. This aids prior elicitation and guards against the use of mis-specified priors. We demonstrated that a biologically informative prior, with hyperparameters set by empirical Bayes, can have benefits over both a flat prior and a subjectively formed prior which is incorrect with respect to the underlying system. We also observed that, whilst Lasso regression can offer some improvement in predictive performance over the Bayesian approaches, its accuracy in selecting the correct underlying model (i.e. variable selection) can be inferior to the proposed empirical Bayes approach, thereby affecting interpretability of results. Empirical Bayes approaches have previously been used in variable selection, but with standard Bernoulli-distributed priors.

We developed informative priors in the context of protein signalling based on two high-level features derived from network information: the number of pathways a subset of predictors incorporates and the intra-pathway distance between proteins in a proposed model. This formulation used the entire network structure in an intuitive way, removing the need to specify individual prior probabilities for each variable and avoiding assumptions of prior independence between variables.

Our pathway-based priors form part of a growing literature on exploiting existing domain knowledge to aid inference, especially in the small sample setting. For example, recent variable selection studies also make use of graph structure within a Bayesian Markov random field prior.

We compared our pathway-based priors to the Markov random field prior, but found in Simulation 2 that empirical Bayes frequently set the prior strength parameter to an incorrect value, resulting in a prior that penalises models containing predictors that are neighbours in the network, instead of promoting them. This is likely due to the parameterisation of the Markov random field prior, which is not agnostic to the number of included predictors in the model |γ|; addition of a predictor to a model could lead to a substantial increase in the prior score. Indeed, it has previously been noted that Markov random field priors can be unstable, with the occurrence of phase transitions in |γ|.

We also compared our approach to the network-based penalised-likelihood method proposed by Li and Li

We used a continuous regression framework with interaction terms. Whilst discrete models are naturally capable of capturing non-linear interplay between components, the discretisation process results in a loss of information. Continuous models avoid this loss, but the response is usually assumed to depend linearly on predictors. The product terms in our model provide the possibility of capturing influences on the response of interest by interplay between predictors, including higher-order interactions. Chipman

We carried out variable selection using exact model averaging. This was made possible by means of a sparsity restriction. Sparsity constraints have been employed in previous work on Bayesian variable selection.

In applications of higher dimensionality, where the exact calculation is no longer feasible, empirical Bayes can still be performed using an approximate conditional marginal 'likelihood' approach as seen in George and Foster.

Illustrative computational times for our approach are shown in Table. We varied the number of predictors p and d_max (the maximum number of predictors allowed in a model). We also considered linear models with and without interaction terms. Empirical Bayes was used to select between two priors (M = 2); for the largest problems, only d_max = 3 or lower was feasible. We note that shortage of memory was the limiting factor on our machine. Computational time could also be improved by using multiple cores to calculate empirical Bayes marginal likelihood scores for multiple values of the prior hyperparameters in parallel.

We showed examples of automated selection between multiple sources of ancillary information, but, rather than selecting a single source, the methods proposed could be generalised to allow combinations of complementary information sources as seen in Jensen et al.

Conclusions

In this paper we have proposed an empirical Bayes method for objective selection and weighting of biologically informative prior information for integration within Bayesian variable selection. The method is computationally efficient, exact and has very few user-set parameters. We developed informative pathway-based priors in the context of protein signalling and illustrated our method on synthetic response data. We demonstrated that in situations where there are several plausible formulations for the prior, it is capable of selecting the most appropriate. In particular, the approach has potential to significantly improve results by guarding against mis-specification of priors. Comparisons were made to alternative methods, demonstrating that the proposed approach offers a competitive variable selection performance. We have also shown an application on cancer drug response data and obtained biologically plausible results. Our method is general and can be applied in any setting with multiple sources of prior knowledge.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

SMH designed the priors, carried out all computational analyses and wrote the paper. SM conceived the study, provided feedback on all aspects and revised the manuscript. RMN and NB carried out the cell line assays. WLK and SZ carried out the drug response assays. PTS and JWG led the biological aspects of the work and provided feedback. All authors read and approved the final manuscript.

Acknowledgements

This work was supported by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, by the National Institutes of Health, National Cancer Institute grants U54 CA 112970 and P50 CA 58207 to JWG, and the Cancer Systems Biology Center grant from the Netherlands Organisation for Scientific Research. SMH and SM were supported under EPSRC EP/E501311/1.