Department of Microbiology, University of Washington, Box 358070, Seattle, WA, 98195, USA

Department of Statistics, University of Washington, Box 354320, Seattle, WA, 98195, USA

Department of Biochemistry, University of Washington, Box 357350, Seattle, WA, 98195, USA

Department of Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York, NY, 10029, USA

Abstract

Background

Inference about regulatory networks from high-throughput genomics data is of great interest in systems biology. We present a Bayesian approach to infer gene regulatory networks from time series expression data by integrating various types of biological knowledge.

Results

We formulate network construction as a series of variable selection problems and use linear regression to model the data. Our method summarizes additional data sources with an informative prior probability distribution over candidate regression models. We extend the Bayesian model averaging (BMA) variable selection method to select regulators in the regression framework.

Conclusions

We demonstrate our method on simulated data and a set of time-series microarray experiments measuring the effect of a drug perturbation on gene expression levels, and show that it outperforms leading regression-based methods in the literature.

Background

With recent advances in high-throughput biological data collection, reverse engineering of regulatory networks from large-scale genomics data has become a problem of broad interest to biologists. The construction of regulatory networks is essential for defining the interactions between genes and gene products, and predictive models may be used to develop novel therapies

A regulatory network can be represented as a directed graph, in which each node represents a gene (in our case an mRNA level) and each directed edge (

In this article, we present a network inference method that addresses the dimensionality challenge with a Bayesian variable selection method. Our method uses a supervised learning framework to incorporate external data sources. We applied our method to a set of time-series mRNA expression profiles for 95 yeast segregants and their parental strains, over six time points in response to a drug perturbation. This extends our previous work

Previous work

Bayesian networks

In regression-based methods, network construction is recast as a series of variable selection problems to infer regulators for each gene. The greatest challenge is the fact that there are usually far more candidate regulators than observations for each gene. Some authors have used singular value decompositions to regularize the regression models

Ordinary differential equations (ODE) provide another class of network construction strategies

To help mitigate problems with using gene expression data in network inference, external data sources can be integrated into the inference process. Public data repositories provide a rich resource of biological knowledge relevant to transcriptional regulation. Integrating such external data sources into network inference has become an important problem in systems biology. James et al.

Our contributions

This article is an extension of Yeung et al.

Our contributions are four-fold. First, we develop a new method called iBMA-prior that explicitly incorporates external biological knowledge into iBMA in the form of a prior distribution. Intuitively, we consider models consisting of candidate regulators supported by considerable external evidence to be frontrunners. A model that contains many candidate regulators with little support from external knowledge is penalized. Second, we demonstrate the merits of specifying the expected number of regulators per gene as priors through iBMA-size, which is a simplified version of iBMA-prior without using gene-specific external knowledge. Third, we refine the supervised framework to adjust for sampling bias towards positive cases in the training data, thereby calibrating the prior distribution. Fourth, we expand our benchmark to include simulated data, and compare our iBMA methods to L1-regularized regression-based methods. Specifically, we applied iBMA-prior to real and simulated time-series gene expression data, and found that it out-performed our previous work

**Overview of iBMA-prior with a highlight of our main contributions.**

Results and discussion

We applied our method, iBMA-prior, to a time-series data set of gene expression levels for 95 genotyped haploid yeast segregants perturbed with the macrolide drug rapamycin over 6 time points

Comparison of different methods

First, we assessed the improvement of iBMA-prior over our previous method, iBMA-shortlist, from Yeung et al.

We also compared the performance of our method with and without using external biological knowledge. We assessed hybrid methods by combining LASSO and LAR with the same supervised learning stage that was used in iBMA-prior and iBMA-shortlist. Table

| Method | Data used | Description |
|---|---|---|
| iBMA-prior | Gene expression + external data | Our proposed methodology that incorporates prior model probabilities in BMA. These prior probabilities were computed using external data sources. |
| iBMA-shortlist | Gene expression + external data | Iterative BMA that uses external knowledge to shortlist |
| Network A from Yeung et al. | Gene expression + external data | This method is the same as in iBMA-shortlist, but using the old version of supervised step described in Yeung et al. |
| LASSO-shortlist | Gene expression + external data | LASSO |
| LAR-shortlist | Gene expression + external data | LAR |
| iBMA-size | Gene expression data only | A simplified version of iBMA-prior that disregards external knowledge, except for setting $\pi_{gr}$ = |
| iBMA-noprior | Gene expression data only | Iterative BMA without any use of external knowledge. |
| LASSO-noprior | Gene expression data only | LASSO without any use of external knowledge. |
| LAR-noprior | Gene expression data only | LAR without any use of external knowledge. |

Assessment: recovery of documented relationships

To evaluate the accuracy of the network constructed by each method, we assessed its concordance with the Yeastract database, a curated repository of regulatory relationships between known TFs and target genes in the

**Supplementary figures.**

Table

**Supplementary tables.**

| Method | Data used | Network size | p | TPR (%)^b | # misclassified^c | TP | O/E^d |
|---|---|---|---|---|---|---|---|
| iBMA-prior | Gene expression + external data | 21951 | <1.00E-320 | 18.00 | 19282 | 593 | 4.11 |
| iBMA-shortlist | Gene expression + external data | 67440 | <1.00E-320 | 12.78 | 24673 | 1287 | 2.92 |
| Network A from Yeung et al. | Gene expression + external data | 65122 | 1.68E-111 | 9.98 | 22485 | 662 | 2.28 |
| LASSO-shortlist | Gene expression + external data | 255293 | <1.00E-320 | 11.07 | 46482 | 4169 | 2.53 |
| LAR-shortlist | Gene expression + external data | 242495 | <1.00E-320 | 11.28 | 44765 | 4017 | 2.57 |
| iBMA-size | Gene expression data only | 17202 | 5.75E-56 | 16.84 | 17622 | 114 | 3.84 |
| iBMA-noprior | Gene expression data only | 63026 | 1.75E-23 | 8.85 | 18903 | 186 | 2.02 |
| LASSO-noprior | Gene expression data only | 564321 | 2.56E-10 | 5.20 | 38399 | 1231 | 1.19 |
| LAR-noprior | Gene expression data only | 194687 | 1.38E-40 | 7.71 | 22777 | 511 | 1.76 |

^a The

^b True positive rate (TPR) is defined as the proportion of inferred regulatory relationships that are documented in Yeastract.

^c The number of misclassified cases is the sum of false positives and false negatives.

^d The O/E ratio is the observed number of recovered relationships (i.e., TP) divided by the number expected to be recovered by chance.

Next, we compared our iBMA-based methods to L1-regularized methods. All the approaches that used LASSO and LAR generated networks that had far more mis-classifications than the iBMA-based methods. Specifically, applications of LASSO or LAR without the supervised framework (LASSO-noprior and LAR-noprior) had TPRs of 5.20% and 7.71% respectively, the lowest among all the methods considered. Incorporating external knowledge did improve both LASSO and LAR, increasing the TPRs to about 11% in both LASSO-shortlist and LAR-shortlist. However, these TPRs were still lower than the TPRs for our iBMA-based methods. Our iBMA-based methods therefore outperformed methods based on LASSO and LAR for these data.

Finally, we investigated the impact of priors in iBMA-size, in which we applied a model size prior to calibrate the sparsity of the inferred networks without using any external data sources. iBMA-size can be considered as a simplified version of iBMA-prior that sets the regulatory potential (the prior probability that a candidate regulates a given gene) to a constant parameter that controls the expected number of regulators per gene. From Table

In Table

Assessment: transcription factor binding site analysis

In another assessment, we checked whether the set of target genes containing known binding sites for a certain TF were enriched among the child nodes of that TF in each inferred network. We first extracted the known binding sites for 129 TFs documented in the JASPAR database

| Method | Data used | # TFs with enriched gene sets^a |
|---|---|---|
| iBMA-prior | Gene expression + external data | 38 |
| iBMA-shortlist | Gene expression + external data | 30 |
| LASSO-shortlist | Gene expression + external data | 41 |
| LAR-shortlist | Gene expression + external data | 44 |
| iBMA-size | Gene expression data only | 4 |
| iBMA-noprior | Gene expression data only | 9 |
| LASSO-noprior | Gene expression data only | 13 |
| LAR-noprior | Gene expression data only | 10 |

^a FDR was controlled at 10%.

Comparison with Lirnet

Lee et al.

We applied iBMA-prior to the same 3152-gene subset of the Brem et al. data that Lee et al.

As before, we evaluated the methods by assessing the concordance of the inferred networks with the Yeastract database using Pearson’s chi-square test. The assessment results in Table

| Method | Network size | p | TPR (%)^b | # misclassified^c | TP | O/E^d |
|---|---|---|---|---|---|---|
| iBMA-prior | 8000 | 7.75E-65 | 15.62 | 10198 | 323 | 2.41 |
| iBMA-shortlist | 35995 | 1.02E-59 | 10.99 | 14581 | 818 | 1.70 |
| Lirnet | 10491 | 1.90E-03 | 8.42 | 10080 | 132 | 1.30 |

^a The

^b True positive rate (TPR) is defined as the proportion of inferred regulatory relationships that are documented in Yeastract.

^c The number of misclassified cases is the sum of false positives and false negatives.

^d The O/E ratio is the observed number of recovered relationships (i.e., TP) divided by the number expected to be recovered by chance.
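
The concordance assessment behind these tables can be illustrated with a small sketch. It computes Pearson's chi-square statistic for a 2×2 table of inferred versus documented relationships, the corresponding p-value (1 degree of freedom), and the O/E ratio. The function name `concordance` and all counts are hypothetical and are not taken from our analysis; in particular, the universe of regulator-target pairs used in the paper is not reproduced here.

```python
import math

def concordance(tp, n_inferred, n_documented, n_pairs):
    """Pearson chi-square test and O/E ratio for a 2x2 concordance table.

    tp           -- inferred pairs that are also documented
    n_inferred   -- inferred pairs within the assessed universe
    n_documented -- documented pairs within the assessed universe
    n_pairs      -- all regulator-target pairs in the assessed universe
    """
    # Observed table: rows = inferred yes/no, columns = documented yes/no.
    observed = [
        [tp, n_inferred - tp],
        [n_documented - tp, n_pairs - n_inferred - n_documented + tp],
    ]
    row = [sum(r) for r in observed]
    col = [observed[0][j] + observed[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n_pairs
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    # Count of documented pairs expected among the inferred pairs by chance.
    expected_tp = n_inferred * n_documented / n_pairs
    return chi2, p_value, tp / expected_tp

# Hypothetical counts, for illustration only.
chi2, p_value, oe = concordance(tp=30, n_inferred=100, n_documented=200, n_pairs=2000)
```

With these made-up counts, 10 documented pairs would be expected among the inferred ones by chance, so recovering 30 gives an O/E ratio of 3.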

Simulation study

We designed and conducted a series of simulations to further assess our proposed method. We used the fitted model obtained from applying iBMA-prior to the yeast time-series microarray data set as the true underlying network, and generated simulated expression data from the estimated linear regression model. Twenty data sets, each with the same dimensions as the real time-series expression data, were independently generated as follows:

1. Set the prior probability of a regulatory relationship for each gene pair to the same value as the regulatory potential obtained at the supervised learning stage using the real external data.

2. Set the expression levels of the 3556 genes for the 95 yeast segregants and the two parental strains at time

3. For each target gene $g$, define the set $R_g$ of true regulators as those with a posterior probability of ≥50% in our inferred network using iBMA-prior and the real time-series data.

4. For time

For gene

where the

5. Generate the simulated observed gene expression levels by adding noise to the true expression levels without measurement errors, i.e.,

where $\varepsilon_{g,t,s} \sim N(0, \sigma_g^2)$, with $\sigma_g^2$ given by the sample variance of the regression residuals in the real data analysis. Others, e.g.

To assess the accuracy of networks inferred with the simulated data sets, we compared each of these networks to the true network created in Step 3 of the data generation algorithm. We used the same assessment criteria as in the real data analysis with the true network replacing Yeastract as the reference. As shown in Table
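
The data generation steps above can be sketched for a single segregant as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: the function name `simulate`, the gene names, coefficients and noise levels are hypothetical, and adding noise at every time point (including the first) is our assumption.

```python
import random

def simulate(x0, regulators, intercepts, sigmas, n_steps, rng):
    """Simulate one segregant under a time-lag linear model.

    x0         -- {gene: expression level at the first time point}
    regulators -- {gene: {regulator: regression coefficient}}
    intercepts -- {gene: intercept}
    sigmas     -- {gene: standard deviation of the measurement noise}
    Returns (true, observed): lists of {gene: level}, one dict per time point.
    """
    true = [dict(x0)]
    for _ in range(n_steps):
        prev = true[-1]
        # Noise-free level at time t depends on regulator levels at time t - 1.
        true.append({
            g: intercepts[g] + sum(b * prev[r] for r, b in regulators[g].items())
            for g in prev
        })
    # Observed level = noise-free level + gene-specific Gaussian noise.
    observed = [
        {g: v + rng.gauss(0.0, sigmas[g]) for g, v in levels.items()}
        for levels in true
    ]
    return true, observed

rng = random.Random(0)
x0 = {"A": 1.0, "B": 0.5, "C": -0.2}
regulators = {"A": {}, "B": {"A": 0.8}, "C": {"A": 0.3, "B": -0.5}}
intercepts = {"A": 1.0, "B": 0.0, "C": 0.1}
sigmas = {"A": 0.1, "B": 0.1, "C": 0.1}
true, observed = simulate(x0, regulators, intercepts, sigmas, 5, rng)
```

Five steps from the initial values give six time points, matching the length of the real time series.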

| Method | Data used | Network size | p | TPR (%)^b | # misclassified^c | TP |
|---|---|---|---|---|---|---|
| iBMA-prior | Generated data + prior probability matrix | 14011 | <1.00E-320 | 71.13 | 16029 | 9966 |
| iBMA-shortlist | Generated data + prior probability matrix | 30753 | <1.00E-320 | 47.23 | 23652 | 14526 |
| iBMA-size | Generated data only | 9349 | <1.00E-320 | 20.31 | 27503 | 1899 |
| iBMA-noprior | Generated data only | 29393 | <1.00E-320 | 8.55 | 46317 | 2513 |

^a The

^b True positive rate (TPR) is defined as the proportion of correctly inferred regulatory relationships.

^c The number of misclassified cases is the sum of false positives and false negatives.

Remark: The values reported in the table were averaged across the 20 replications. The true network for the simulation study contained a total of 21951 edges.
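
As a worked check of how the columns of this table relate to each other, the sketch below derives the TPR and the number of misclassified cases from the network size, the number of true positives and the size of the true network (21951 edges). The function name `assess` is hypothetical. Because the reported values are averages over 20 replications, plugging in the averaged counts reproduces the table only up to rounding.

```python
def assess(network_size, tp, n_true_edges):
    """Return (TPR in percent, number of misclassified edges)."""
    fp = network_size - tp      # inferred edges absent from the reference network
    fn = n_true_edges - tp      # reference edges that were not inferred
    return 100.0 * tp / network_size, fp + fn

# iBMA-prior row of the simulation table.
tpr, misclassified = assess(network_size=14011, tp=9966, n_true_edges=21951)
```

For the iBMA-prior row this gives a TPR of about 71.13% and 16030 misclassified cases, against the reported average of 16029.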

Conclusions

In this article, we have proposed a methodology that systematically integrates external biological knowledge into BMA for network construction. A key feature of our approach is a formal mechanism to account for model uncertainty. For each target gene, we arrive at a compact set of promising models from which to draw inference, the weights of which are calibrated by the external biological knowledge. Given a reasonable estimate of network density, our method infers sparse, compact and accurate networks from both real and simulated data. It does not put a hard limit on the number of regulators per target gene, unlike some other methods, such as Bayesian network approaches that impose this constraint to reduce the computational burden. While known TFs are in general favored

We showed that our method, iBMA-prior, consistently outperformed our previous method

A key contribution of this work is the derivation of more compact networks with higher TPRs. Unfortunately, due to incomplete knowledge, the evaluation of false positives and false negatives is difficult using real data. Therefore, we supplemented our study with a simulation study designed to mimic the real data, and showed that iBMA-prior produced fewer misclassified cases (i.e. the sum of false positives and false negatives) than other iBMA-based methods.

There are many directions for future work. Our methodology uses a time-lag regression model, i.e., one that explains the current expression level of a target gene by the past expression levels of its regulators. This model formulation is in line with many other regression-based methods targeting time-series gene expression data

Methods

Time-series gene expression data for yeast segregants

We applied our method to a set of time-series mRNA expression data measuring the gene expression levels of 95 genotyped haploid yeast segregants perturbed with the macrolide drug rapamycin

Bayesian model averaging (BMA)

BMA is a variable selection approach that takes model uncertainty into account by averaging over the posterior distribution of a quantity of interest based on multiple models, weighted by their posterior model probabilities. The posterior distribution of a quantity of interest $\Theta$ given the data $D$ is

$$\Pr(\Theta \mid D) = \sum_{k=1}^{K} \Pr(\Theta \mid D, M_k)\,\Pr(M_k \mid D),$$

where $M_1,\ldots,M_K$ are the models considered. Each model consists of a set of candidate regulators. In order to efficiently identify a compact set of promising models $M_k$ out of all possible models, two approaches are sequentially applied. First, the leaps and bounds algorithm

While BMA has performed well in many applications

Supervised framework for the integration of external knowledge

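We formulated network construction from time series data as a regression problem in which the expression of each gene is predicted by a linear combination of the expression of candidate regulators at the previous time point. Let $X_{g,t,s}$ be the expression of gene $g$ at time $t$ in segregant $s$, and denote by $R_g$ the set of regulators for gene $g$. The regression model is

$$X_{g,t,s} = \beta_{g0} + \sum_{r \in R_g} \beta_{gr}\,X_{r,t-1,s} + \varepsilon_{g,t,s},$$

where $\beta_{gr}$ is the regression coefficient of regulator $r$ for gene $g$ and $\varepsilon_{g,t,s}$ is the error term.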

To account for external knowledge in the network construction process, Yeung et al.

To study the relative importance of the various types of external knowledge from the supervised framework, we collected 583 positive examples of known regulatory relationships between TFs and target genes from the _{
gr
} of a candidate regulator

Incorporating prior probabilities into iBMA

The potential benefit of using information from external knowledge to refine the search for regulators was shown by Yeung et al. and many others

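We associate each candidate model $M_k$ with a prior probability, namely:

$$\Pr(M_k) = \prod_{r} \pi_{gr}^{\delta_{kr}}\,(1 - \pi_{gr})^{1 - \delta_{kr}}, \tag{5}$$

where $\pi_{gr}$ is the regulatory potential of a candidate regulator $r$ for gene $g$, and $\delta_{kr} = 1$ if $r \in M_k$ and $\delta_{kr} = 0$ otherwise.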
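
To make the effect of this prior concrete, the following sketch evaluates the log prior probability of a model as a product over candidate regulators, each contributing its regulatory potential if included and one minus its potential if excluded. The function name `log_model_prior`, the regulator names and the potentials are hypothetical illustration values.

```python
import math

def log_model_prior(potentials, model):
    """Log prior of a model under independent inclusion of each candidate.

    potentials -- {regulator: regulatory potential in (0, 1)} for one target gene
    model      -- set of regulators included in the model
    """
    return sum(
        math.log(p) if r in model else math.log1p(-p)
        for r, p in potentials.items()
    )

potentials = {"TF1": 0.9, "TF2": 0.05, "TF3": 0.01}   # hypothetical values
supported = log_model_prior(potentials, {"TF1"})           # well-supported regulator only
unsupported = log_model_prior(potentials, {"TF2", "TF3"})  # poorly supported regulators
```

Here the model containing only the well-supported candidate receives prior probability 0.9 × 0.95 × 0.99 ≈ 0.846, whereas the model built from the two poorly supported candidates receives 0.1 × 0.05 × 0.01 = 0.00005, which is the penalization described above.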

The posterior model probability of model $M_k$ is given by

$$\Pr(M_k \mid D) \propto f(D \mid M_k)\,\Pr(M_k), \tag{6}$$

where $f(D \mid M_k)$ is the integrated likelihood of the data $D$ under model $M_k$, and the proportionality constant ensures that the posterior model probabilities sum up to 1.

Then Occam’s window was used to discard any model _{
k
} having a posterior odds less than 1/_{
opt
}. The parameter
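
The combination of integrated likelihood and prior, followed by Occam's window, can be sketched as below. This is an illustration under assumptions: the log integrated likelihoods are taken as given, the threshold argument `odds_ratio` is a hypothetical name for the Occam's window parameter, and renormalizing over the retained models is our choice for the sketch.

```python
import math

def posterior_probs(log_liks, log_priors):
    """Turn log integrated likelihoods and log priors into posterior probabilities."""
    scores = [l + p for l, p in zip(log_liks, log_priors)]
    top = max(scores)                      # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def occams_window(probs, odds_ratio):
    """Keep models whose posterior odds against the best model are at least
    1/odds_ratio, then renormalize the surviving probabilities."""
    best = max(probs)
    kept = {k: p for k, p in enumerate(probs) if p / best >= 1.0 / odds_ratio}
    total = sum(kept.values())
    return {k: p / total for k, p in kept.items()}

# Three hypothetical models for one target gene.
probs = posterior_probs(
    [-100.0, -101.0, -110.0],
    [math.log(0.2), math.log(0.5), math.log(0.3)],
)
kept = occams_window(probs, odds_ratio=20.0)
```

In this example the third model has posterior odds far below 1/20 relative to the best model and is discarded, while the second model stays competitive because its larger prior offsets its lower likelihood.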

Extension of iBMA: cumulative model support

In Yeung et al.

At the end of each iteration of iBMA, and after applying Occam’s window to all models considered, we compute the posterior inclusion probabilities for each candidate regulator $r$:

$$\Pr(\beta_{gr} \neq 0 \mid D) = \sum_{M_k \in F} \Pr(M_k \mid D)\,\delta_{kr},$$

where $F$ is the set of all possible models for gene $g$, $\beta_{gr}$ is the regression coefficient of a candidate regulator $r$ for gene $g$, and $\delta_{kr} = 1$ if $r \in M_k$ and $\delta_{kr} = 0$ otherwise. Finally, we infer regulators for each target gene
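
A minimal sketch of this final step follows: the posterior inclusion probability of a candidate is the sum of the posterior probabilities of the retained models that contain it, and candidates reaching the 50% threshold are inferred as regulators. The function names, regulator names and probabilities are hypothetical.

```python
def inclusion_probabilities(models, posteriors):
    """Sum posterior model probabilities over the models containing each regulator.

    models     -- list of sets of regulators, one set per retained model
    posteriors -- posterior probabilities of those models (summing to 1)
    """
    incl = {}
    for model, prob in zip(models, posteriors):
        for r in model:
            incl[r] = incl.get(r, 0.0) + prob
    return incl

def infer_regulators(incl, threshold=0.5):
    """Return the candidates whose inclusion probability reaches the threshold."""
    return {r for r, p in incl.items() if p >= threshold}

models = [{"TF1"}, {"TF1", "TF2"}, {"TF3"}]
posteriors = [0.5, 0.3, 0.2]
incl = inclusion_probabilities(models, posteriors)
regulators = infer_regulators(incl)
```

Only TF1, which appears in models carrying 80% of the posterior mass, is inferred as a regulator in this example.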

Extensions of the supervised framework

We have extended the supervised framework of Yeung et al.

Imputation of missing values in ChIP-chip data

About 9% of the ChIP-chip data used in the training samples were originally undefined. The ChIP-chip data take the form of

Truncation of extreme values in external data

Some of the external data types used in the supervised learning stage contained value ranges for individual genes that far exceeded the ranges for these genes in the training samples, e.g. the SNP-level information in Additional file
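
One simple way to carry out such a truncation is to clip each external data value to the range observed in the training samples. The sketch below assumes clipping to the training minimum and maximum; the exact bounds used in our analysis are not restated here, and the function name and values are hypothetical.

```python
def truncate_to_training_range(values, training_values):
    """Clip each value to the [min, max] range seen in the training samples."""
    lo, hi = min(training_values), max(training_values)
    return [min(max(v, lo), hi) for v in values]

# Hypothetical feature values: two fall outside the training range [0, 2].
clipped = truncate_to_training_range([-5.0, 0.5, 40.0], [0.0, 1.0, 2.0])
```

Clipping keeps the logistic regression from extrapolating far beyond the region in which it was trained.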

Adjustment for sampling bias regarding positive and negative cases

In the supervised framework of Yeung et al., the expected number of regulators per target gene, computed as the sum of regulatory potentials of all candidate regulators, mostly fell between 400 and 600 (see Figure

**The expected number of regulators per target gene in accordance with external knowledge.** Histograms of the expected number of regulators per target gene (**A**) without and (**B**) with an adjustment for the difference in sampling rates between positive and negative examples at the supervised learning stage.

Here, we address this issue by using a strategy that is commonly used in case–control studies, in which disease (positive) cases are usually rare. Let $\tau_1$ and $\tau_0$ be the sampling rates for positive and negative cases respectively. To adjust for the difference in the sampling rates, we add an offset of $-\log(\tau_1/\tau_0)$ to the logistic regression model. Equivalently, we divide the predicted odds by $\tau_1/\tau_0$. Previous literature has suggested that the in-degree distribution of gene regulatory networks decays exponentially, in proportion to $e^{-0.45m}$, where $m$ is the in-degree. Here, $\tau_1/\tau_0 = 2853$. For instance, if the original predicted probability is 0.9, i.e., the predicted odds are 9, then dividing the odds by the ratio of sampling rates gives 9/2853 = 0.0032, implying an adjusted probability of about 0.0031. As shown in Figure
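
The adjustment can be checked numerically with a short sketch. It divides the predicted odds by the ratio of sampling rates and converts back to a probability, and it also confirms that this is the same as subtracting the log of the ratio on the logit scale. The function name `adjust_for_sampling` is hypothetical; the inputs 0.9 and 2853 are the values from the worked example above.

```python
import math

def adjust_for_sampling(prob, rate_ratio):
    """Divide the predicted odds by the ratio of sampling rates
    (positives over negatives) and return the adjusted probability."""
    odds = prob / (1.0 - prob)
    adjusted_odds = odds / rate_ratio
    return adjusted_odds / (1.0 + adjusted_odds)

adjusted = adjust_for_sampling(0.9, 2853)

# Equivalent formulation: add an offset of -log(rate_ratio) to the logit.
logit = math.log(0.9 / 0.1) - math.log(2853)
adjusted_via_offset = 1.0 / (1.0 + math.exp(-logit))
```

Both routes give an adjusted probability of roughly 0.0031.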

To assess the sensitivity of our results to changes in the assumed prior average number of regulators per target gene, we repeated the analysis with various levels of the network density

**Text containing supplementary materials and methods.**

Click here for file

Summary: outline of algorithm

1. For each gene

2. Shortlist the top

3. Fill the BMA window with the top

4. Apply BMA with prior model probabilities based on the external knowledge:

a. Determine the best

b. For each selected model, compute its prior probability relative to the

c. Remove the _{
gr
} ≠ 0 |

5. Fill the

6. Repeat steps 4–5 until all the

7. Compute the prior probability for all selected models relative to all the

8. Take the collection of all models selected at any iteration of BMA, and apply Occam’s window, reducing the set of models.

9. Compute the posterior inclusion probability for each candidate regulator using the set of selected models, and infer candidates associated with a posterior probability exceeding a pre-specified threshold (50%) to be regulators for target gene

External knowledge is used in the following ways:

1. All the candidate regulators are ranked according to their regulatory potentials, which were predicted using the available external data sources at the supervised learning stage.

2. Model selection is performed by comparing models against each other based on their posterior odds. As shown by Equation (6), the posterior odds is proportional to a product of the integrated likelihood and the prior odds. The prior probability and, therefore, the prior odds, of a candidate model are formulated as a function of regulatory potentials.

3. The posterior inclusion probability of each candidate regulator, from which inference is made about the presence or absence of a regulatory relationship, is positively related to its regulatory potential. As shown in Equation (5), a factor of $\pi_{gr}$ is contributed to each model in which the candidate $r$ is included; otherwise, a factor of $1 - \pi_{gr}$ is contributed to each model.

Abbreviations

BMA: Bayesian Model Averaging; iBMA: Iterative Bayesian Model Averaging; LAR: Least angle regression; LASSO: Least absolute shrinkage and selection operator; TF: Transcription factor.

Competing interests

The authors declare that they have no competing interests.

Author contributions

KL and AER developed the methodology. KL implemented the methods. KL and KYY analyzed the data. KMD performed the experiments, which JZ, EES and REB designed. AER and KYY conceived the study. KL, AER and KYY wrote the manuscript. All authors read, edited and approved the final manuscript.

Acknowledgments

We would like to thank Dr. Chris Fraley for her code to generate the precision-recall curves in Supplementary Figure S4 and Supplementary Table S5, and Dr. John E. Mittler for helpful comments and discussions. In addition, we thank the Western Canada Research Grid (WestGrid) for providing computational resources.

KYY, KL, AER, KMD and REB are supported by NIH grant 5R01GM084163. REB, KL and KYY are also supported by 3R01GM084163-02S2. REB, KMD and KYY were supported by a generous basic research grant from Merck. AER was also supported by NIH grants R01 HD54511 and R01 HD070936.