Department of Systems Engineering and Engineering Management, City University of Hong Kong, Hong Kong

School of Mechanical and Electrical Engineering, Central South University, Changsha 410083, China

Abstract

Background

A key challenge in the post-genome era is to identify genome-wide transcriptional regulatory networks, which specify the interactions between transcription factors and their target genes. Numerous methods have been developed for reconstructing gene regulatory networks from expression data. However, most of them are based on coarse-grained qualitative models and cannot provide a quantitative view of regulatory systems.

Results

A binding-affinity-based regulatory model is proposed to quantify the transcriptional regulatory network. Multiple quantities, including the binding affinity and the activity level of each transcription factor (TF), are incorporated into a general learning model. The sequence features of the promoter and the possible occupancy of nucleosomes are exploited to estimate the binding probability of regulators. Compared with previous models that only employ microarray data, the proposed model can bridge the gap between the relative background frequency of the observed nucleotide and the gene's transcription rate.

Conclusions

We test the proposed approach on two real-world microarray datasets. Experimental results show that the proposed model can effectively identify the parameters and the activity levels of TFs. Moreover, the kinetic parameters introduced in the proposed model are easier to interpret biologically than those of previous models.

Background

A challenge facing molecular biology is to develop quantitative, predictive models of gene regulation. Advances in high-throughput microarray techniques make it possible to measure the expression profiles of thousands of genes, and genome-wide microarray datasets have been collected, providing a way to reveal the complex regulatory mechanisms among cells. There are two broad classes of gene regulatory interactions: one based on the 'physical interaction', which aims at identifying relationships among transcription factors and their target genes (gene-to-sequence interaction), and another based on the 'influence interaction', which tries to relate the expression of a gene to the expression of the other genes in the cell (gene-to-gene interaction).

In recent years, researchers have proposed many different computational approaches to reconstruct gene regulatory networks from high-throughput data; see, e.g., the reviews by Bansal et al. and Markowetz and Spang.

Based on the above description, this paper aims to describe the transcriptional regulatory network quantitatively. In this work, a Bayesian inference based regulatory model is proposed to quantify the transcriptional dynamics. Multiple quantities, including binding energy, binding affinity and the activity level of transcription factors, are incorporated into a general learning model. The sequence features of the promoter and the occupancy of nucleosomes are exploited to derive the binding energy. Compared with previous models, the proposed model yields parameters that are easier to interpret biologically.

Results

Case I: Circadian patterns in rat liver

Circadian rhythm is a daily time-keeping mechanism fundamental to a wide range of species. The basic molecular mechanism of circadian rhythm has been studied extensively. To test our approach on a real example, we considered the dynamics of the circadian patterns in rat liver. We employ the datasets from Almon et al.

**Table S1**. The time series gene expression data for circadian patterns in rat liver.

Analysis of the predicted activity levels of transcription factors

To test the proposed model on the above dataset, we use two important transcriptional regulators, hsf1 and ppara, whose activity levels indicate the variation of heat signals in a subset of the circadian gene network. In total, we selected 7 genes to perform posterior inference of TF activities. To ensure identifiability, we included three genes that are regulated solely by hsf1 (HSP110, HSPA8 and COL4A1) and two genes that are regulated solely by ppara (ACSL1 and HMGCS1). The remaining two genes are jointly regulated by hsf1 and ppara. These genes were chosen because they exhibit the largest variance in the microarray time course, and hence are likely to provide a cleaner representation of the output of the system.

The inferred TF activity levels are shown in the figure below, together with the parameters k_{i} for the four selected circadian genes (HSPA8, ACSL1, HSP90AA1 and HSPA1B). The green column represents the response k_{1} to hsf1 alone, the red column is the response k_{2} to ppara alone and the black column represents the joint response k_{12}. For gene HSPA8, the model predicts a clear activation by hsf1 alone, which is consistent with the experimental conclusion from Yan et al.

**Results on circadian patterns data**. (a) mean activity profile for hsf1, (b) mean activity profile for ppara, (c) bar-chart representation of the parameters k_{i}, giving the logical structure of the interaction between two TFs.

The biological sense of kinetic parameters

Relationship between scaling parameter k and the corresponding binding affinity φ.

| Gene | HSP110 | HSPA8 | COL4A1 | ACSL1 | HMGCS1 | HSP90AA1-hsf1 | HSPA1B-hsf1 | HSPA1B-ppara |
|---|---|---|---|---|---|---|---|---|
| k | H | H | H | H | L | H | L | H |
| φ | L | H | L | H | L | H | L | H |

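The bar charts below compare the estimated basal transcription rates and decay rates. The columns on the left, shaded red, show results from our model, and the white columns are the estimates obtained by the method of Barenco et al. The estimates obtained by our method have smaller variance than those of Barenco et al. The predicted mean expression profiles are shown in the subsequent figure.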

**The bar charts for basal transcription rates and decay rates**. (a) Basal transcription rates from our model and that of Barenco et al. Red are estimates obtained with our model, white are the estimates obtained by Barenco et al

**The predicted mean expression profiles**. (a) HSPA8, (b) COL4A1, (c) ACSL1, (d) HMGCS1, (e) HSP90AA1 and (f) HSPA1B. Red circles indicate the observed values at each time point.

Case II: A yeast synthetic network for in vivo assessment

Validation of gene regulatory network (GRN) inference methods has traditionally been done using in silico networks. However, this kind of validation is only as good as the chosen in silico model is realistic, which is an obvious limitation. To our knowledge, the underlying model from which artificial or simulated data are generated is rarely realistic enough. Real biological networks are fairly complex chemical reaction networks, and simulation models tend to be overly simplistic by comparison. In a simulation setting one typically adds noise on top of a hypothetical simulation model, but the noise characteristics may not be realistic either. Data measured from a real biological system do not suffer from these modelling artefacts. To overcome these problems, we use the IRMA network to evaluate our model. The IRMA network is a synthetically constructed GRN in the Saccharomyces cerevisiae genome.

**IRMA network**. The rectangle indicates the gene while the oval represents the protein.

Analysis of the predicted activity levels of transcription factors

To evaluate whether the proposed model can effectively learn the TF activity levels and the regulation type, we first evaluate the model using the switch-on time series data. The inferred TF activity levels are shown in the figure below, together with the response parameters k_{i}, including the joint response k_{12}. For gene GAL80, the model predicts a clear activation by swi5, which is consistent with the experimental conclusion.

**Results on IRMA network data**. (a) mean activity profile for regulator swi5, (b) mean activity profile for regulator ash1, (c) bar-chart representation of the parameters k_{i}.

Analysis of the kinetic parameters

Relationship between k and φ for IRMA network data.

| Gene | GAL80 | GAL4 | SWI5 | ASH1 | CBF1-swi5 | CBF1-ash1 |
|---|---|---|---|---|---|---|
| k | H | L | H | L | H | H |
| φ | L | L | H | L | H | H |

The bar charts below compare the estimated basal transcription rates and decay rates. The columns on the left, shaded red, show results from our model, and the white columns are the estimates obtained by Opper et al. The estimates obtained by our method have smaller variance than those of Opper et al.

**The bar charts for basal transcription rates and decay rates**. (a) Basal transcription rates from our model and that of Opper et al.

For comparison, we also evaluate the model using the switch-off data. The predicted mean expression profiles for both time series are shown in the figure below.

**The predicted mean expression profiles**. Expression profile and mean reconstruction of target genes. Switch-on time series: (a) GAL80, (b) ASH1, (c) CBF1, (d) GAL4; (e)-(h) the same genes in the switch-off time series. Red circles indicate the observed values at each time point.

Discussion

In this study, two real-world microarray datasets were used to evaluate the effectiveness of the proposed model. Comparison shows that the kinetic parameters obtained by our method have smaller variance than those of Barenco et al.

The Bayesian inference based model of transcription rates and regulator activity levels allows us to handle these biologically relevant quantities despite the indirect measurement of the former and the lack of measurements of the latter. It also allows us to handle the inherently noisy measurements in a principled way. However, the proposed model still abstracts away some of the explicit processes that generate the observed expression data. Modelling these processes more explicitly would provide a more principled treatment of the different sources of noise in the data. Furthermore, the model does not directly handle the upstream events that affect regulator activity. In fact, the current model is an open-loop system, in which the regulation of regulator activity is itself viewed as exogenous to the system. By developing a richer modelling language we may capture more complex reaction models, model the upstream regulation of activity levels, and identify systems that involve feedback mechanisms and signalling networks.

Post-Transcriptional Modification Models (PTMMs) have been previously used to model TF activities

Conclusions

In this work, we have proposed a computational model to simultaneously reverse engineer both the activity of TFs and the logical structure of promoters by integrating multiple sources of knowledge, including time-series gene expression data, TF binding information and sequence features of promoters. The approach relies on a detailed model of transcription, which is an approximation to the Michaelis-Menten model from classical enzyme kinetics, and should therefore be able to capture accurately the effects that changes in TF activity have on gene expression dynamics. We tested the proposed approach on two real-world microarray datasets. Experimental results show that the proposed model can effectively identify the parameters and the activity levels of TFs. Moreover, the kinetic parameters introduced in the proposed model are easier to interpret biologically than those of previous models.

Methods

Problem statement

A microarray experiment only measures the "observed" quantities, as shown in Figure

**A qualitative molecular model of transcriptional regulation**. mRNA encoding a transcription factor (TF) is translated to protein. The protein is activated and induces the transcription of a target gene at a certain rate (G). The final accumulation of G mRNA levels is determined by this transcription rate and by the rate of G's mRNA degradation.

Our approach relies on a continuous-time, differential equation description of transcriptional dynamics in which TFs are treated as latent on/off variables and are modelled using a switching stochastic process. The framework of the proposed method is shown in the figure below.

**The framework of the proposed method**. The expression data and sequence features of promoters are incorporated into a general learning model. The outputs of the model are the kinetic parameters and the activity levels of transcription factors.

Kinetic model of regulation

Compared with the gene expression level, the gene transcription rate can capture more dynamic characteristics of transcription regulation. We here employ the transcription rate to model the regulation function. We first assume:

• The derived transcription rates are average rates over a cell population.

• A TF's binding to or dissociation from its target sites is assumed to be much more rapid than the transcription process, so the rapid-equilibrium approximation can be used.

Based on the above assumptions, the transcription rate of a gene is proportional to the fraction of copies of that gene bound by its regulators across the measured cell population. We first consider the case in which a gene is regulated by a single activator. The corresponding regulation function can be described by the Michaelis-Menten equation:

here x represents the mRNA concentration for a particular gene,
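As an illustration only, the single-activator case can be sketched numerically. The sketch assumes the form dx/dt = basal + k·r/(half_sat + r) − decay·x; the function names and the parameter names `basal`, `k`, `half_sat` and `decay` are hypothetical choices made for this example and are not the paper's notation.

```python
import numpy as np

def mm_rate(r, k, half_sat):
    """Michaelis-Menten production rate for activator level r (hypothetical names)."""
    return k * r / (half_sat + r)

def simulate_mrna(r_of_t, t, x0=0.0, basal=0.1, k=1.0, half_sat=0.5, decay=0.3):
    """Forward-Euler integration of dx/dt = basal + mm_rate(r) - decay * x."""
    x = np.empty_like(t, dtype=float)
    x[0] = x0
    for i in range(1, len(t)):
        dt = t[i] - t[i - 1]
        r = r_of_t(t[i - 1])
        x[i] = x[i - 1] + dt * (basal + mm_rate(r, k, half_sat) - decay * x[i - 1])
    return x

# Usage: a binary TF activity that switches on at t = 5.
t = np.linspace(0.0, 60.0, 6001)
x = simulate_mrna(lambda s: 1.0 if s >= 5.0 else 0.0, t)
# With the TF on, x relaxes towards (basal + k * 1 / (half_sat + 1)) / decay.
steady_state = (0.1 + mm_rate(1.0, 1.0, 0.5)) / 0.3
```

Under these assumptions the mRNA level relaxes to a plateau set by the ratio of total production to decay, so a change in TF activity shows up as a change in plateau height.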

To incorporate the sequence feature and the TF binding preference into the model, we set the binding affinity

here

We now consider regulation involving two regulators. Let r_{1}(t) and r_{2}(t) denote the concentrations of the two regulators, each with its own binding affinity (subscripts 1 and 2) for its target sites. The regulation function can then be written as below:

In the general case, a gene is regulated by n regulators, giving 2^{n} different binding states in total. An n-dimensional binary vector is used to indicate a given binding state; e.g., the 4-dimensional vector (0 1 0 1) indicates that the second and the fourth regulators are bound to their own target sites while the first and the third are not bound. The regulation function can be written as:

where S_{j} denotes the set of all 2^{n} possible state vectors, and s_{i} is the i-th element of the state vector
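The enumeration over binding states can be sketched as follows. This is a minimal sketch under a stated assumption: each state's statistical weight is taken to be the product of r_i / affinity_i over the bound regulators, and the regulation function is the weight-averaged response. The names `regulation_function`, `affinity` and `response` are hypothetical.

```python
from itertools import product
import numpy as np

def regulation_function(r, affinity, response):
    """Weighted average response over all 2^n binding states.

    r        : TF activities or concentrations, length n
    affinity : hypothetical dissociation-like constants, length n
    response : dict mapping a state tuple (s_1, ..., s_n) to its rate;
               states missing from the dict contribute a rate of 0
    """
    r = np.asarray(r, dtype=float)
    affinity = np.asarray(affinity, dtype=float)
    weights = {}
    for state in product((0, 1), repeat=len(r)):
        s = np.asarray(state)
        # Statistical weight: product of r_i / affinity_i over bound sites.
        weights[state] = float(np.prod((r / affinity) ** s))
    z = sum(weights.values())  # normaliser over all 2^n states
    return sum(response.get(state, 0.0) * w for state, w in weights.items()) / z

# Usage: two regulators with separate and joint responses k_1, k_2, k_12.
rate = regulation_function([1.0, 1.0], [1.0, 1.0],
                           {(1, 0): 1.0, (0, 1): 2.0, (1, 1): 4.0})
```

For n = 1 this expression reduces to the Michaelis-Menten form k·r/(affinity + r), which is a quick consistency check on the weighting scheme.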

Modelling for binding affinity

Measuring affinities of molecular interactions in high-throughput format remains problematic, especially for transient and low-affinity interactions. Here we describe the landscape of binding affinity from the perspective of the binding energy between the various DNA-binding molecules and their target genes. Binding affinity landscapes describe how each molecule translates an input DNA sequence into a binding potential that is specific to that molecule. The presented framework models several important aspects of the binding process.

• By allowing molecules to bind anywhere along the input sequence, the entire range of affinities is considered, thereby allowing contributions from both strong and weak binding sites.

• Conventional cooperative binding interactions can be explicitly modelled by assigning higher statistical weights to configurations in which two molecules are bound in close proximity.

• The cooperativity that arises between factors when both nucleosomes and transcription factors are integrated is captured automatically.

We first consider the simplest case, in which there is only one target site S_{ij }for TF

The site-specific binding affinity is given by

where C_{i }is a constant, E_{ij }the binding free energy between TF_{i }and the promoter of gene

The above case can be extended to the general case in which binding may occur anywhere along the input sequence and the accessibility of target sites varies due to the occupancy of nucleosomes. The general binding affinity is modelled as

where ^{(n)}_{ij }^{(n)}_{ij }
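One way to picture the general case is as a sum of site-level contributions along the promoter, each scaled by how accessible the site is. The sketch below assumes a Boltzmann form exp(−E/kT) for the site-specific term and an accessibility probability per position; both the functional form and the names `general_affinity`, `accessibility`, `c` and `kT` are assumptions made for illustration.

```python
import numpy as np

def general_affinity(energies, accessibility, c=1.0, kT=1.0):
    """Accessibility-weighted sum of Boltzmann factors over candidate sites.

    energies      : binding free energy at each candidate position
                    (same units as kT; lower energy means stronger binding)
    accessibility : probability in [0, 1] that each position is free of
                    nucleosomes
    c             : constant prefactor (hypothetical)
    """
    energies = np.asarray(energies, dtype=float)
    accessibility = np.asarray(accessibility, dtype=float)
    return c * float(np.sum(accessibility * np.exp(-energies / kT)))

# Usage: one strong open site and one weaker, half-occluded site.
phi = general_affinity([0.0, np.log(2.0)], [1.0, 0.5])
```

In this form a weak site still contributes, but a site covered by a nucleosome contributes little regardless of its sequence match.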

**Employing sequence features and the occupancy of nucleosomes to estimate the binding affinity**. Positional weight matrices are used to represent the sequence motif. Because binding may occur anywhere along the input sequence, the entire range of affinities is considered.

Since the positional weight matrices (PWM) are often used to represent the sequence motif

where

here K^{(q) }is the scaling factor, M*_{L }indicates the maximum background frequency in the motif,
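Scanning a promoter with a PWM can be sketched as below. The sketch scores every window by its log-odds against background base frequencies and takes the negative scaled score as a pseudo-energy; this scoring rule, the uniform default background and the names `pwm_energies` and `scale` are assumptions for illustration rather than the paper's exact definition.

```python
import numpy as np

BASES = "ACGT"

def pwm_energies(sequence, pwm, background=None, scale=1.0):
    """Return a pseudo-energy for every window of the promoter sequence.

    pwm        : array of shape (L, 4), per-position base probabilities in
                 A, C, G, T order (entries must be > 0, e.g. via pseudocounts)
    background : length-4 background base frequencies (uniform if None)
    The energy is minus the scaled log-odds score, so better motif matches
    receive lower energies.
    """
    pwm = np.asarray(pwm, dtype=float)
    bg = np.full(4, 0.25) if background is None else np.asarray(background, dtype=float)
    idx = np.array([BASES.index(b) for b in sequence.upper()])
    motif_len = pwm.shape[0]
    log_odds = np.log(pwm / bg)  # shape (L, 4)
    scores = np.array([
        log_odds[np.arange(motif_len), idx[i:i + motif_len]].sum()
        for i in range(len(idx) - motif_len + 1)
    ])
    return -scale * scores

# Usage: a toy two-column motif that prefers "TA".
toy_pwm = [[0.1, 0.1, 0.1, 0.7],
           [0.7, 0.1, 0.1, 0.1]]
energies = pwm_energies("GGTAGG", toy_pwm)
best_start = int(np.argmin(energies))
```

The resulting energy profile has one value per start position, which is the kind of input an affinity landscape over the whole promoter needs.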

Regulatory network modelling using dynamic Bayesian inference

In many biological processes, transcription factors transit from the inactive to the active state quickly as a consequence of fast post-translational modifications (on a microsecond time scale), so it is reasonable to model the TF activity as a binary variable r(t) ∈{0,1}.

For regulation involving a single regulator, the TF activity can be seen as a two-state Markov jump process. Based on Ref

here p_{1}(t) = p(r(t) = 1), p_{0}(t) = p(r(t) = 0) and _{± }indicates the transition rate.
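The evolution of the marginal probability of a two-state Markov jump process follows the standard master equation dp_1/dt = f_plus·p_0 − f_minus·p_1 with p_0 = 1 − p_1. The sketch below integrates it for constant rates; the names `f_plus` and `f_minus` for the switch-on and switch-off rates are hypothetical, and constant rates are a simplifying assumption.

```python
import numpy as np

def integrate_master_equation(f_plus, f_minus, t, p1_0=0.0):
    """Forward-Euler solution of dp1/dt = f_plus * (1 - p1) - f_minus * p1."""
    p1 = np.empty(len(t))
    p1[0] = p1_0
    for i in range(1, len(t)):
        dt = t[i] - t[i - 1]
        p1[i] = p1[i - 1] + dt * (f_plus * (1.0 - p1[i - 1]) - f_minus * p1[i - 1])
    return p1

# Usage: a TF that starts inactive and switches on faster than it switches off.
t = np.linspace(0.0, 20.0, 2001)
p1 = integrate_master_equation(f_plus=0.8, f_minus=0.2, t=t)
stationary = 0.8 / (0.8 + 0.2)
```

For constant rates the probability of the active state converges to f_plus / (f_plus + f_minus), independent of the initial condition.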

The observed expression data is often assumed to be normally distributed

**Normally distributed observational data**. The solid line indicates the mean predicted expression while the dotted line represents the normally distributed observations.

Let y_{j}(t) denote the observations of mRNA species j, x_{j}(t) the predicted expression and σ_{j} the variation; the noise model can then be described as
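A Gaussian noise model of this kind corresponds to the log-likelihood sketched below. The sketch assumes independent observations with a common standard deviation `sigma` per gene; the function and argument names are hypothetical.

```python
import numpy as np

def gaussian_log_likelihood(y, x_pred, sigma):
    """Log-likelihood of observations y given predicted expression x_pred,
    assuming independent Gaussian noise with standard deviation sigma."""
    y = np.asarray(y, dtype=float)
    x_pred = np.asarray(x_pred, dtype=float)
    resid = y - x_pred
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                        - resid ** 2 / (2.0 * sigma ** 2)))

# Usage: compare a perfect fit with a fit that is off by 0.5 at every time point.
ll_exact = gaussian_log_likelihood([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], sigma=1.0)
ll_off = gaussian_log_likelihood([1.0, 2.0, 3.0], [1.5, 2.5, 3.5], sigma=1.0)
```

A larger mismatch between predicted and observed expression lowers the log-likelihood, which is the term the inference trades off against the prior on TF activity.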

Based on Refs

where

Variational inference and model optimization

We will use a variational formulation of the inference problem

By selecting a suitable family of approximating distributions, the inference problem is then turned into an optimization problem. It can be shown that the KL divergence is a convex functional of

here _{q}

According to Ref

here
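For intuition about the quantity being minimised, the KL divergence between two two-state Markov jump processes that share an initial distribution can be written as a time integral of rate-mismatch terms weighted by the marginals of the approximating process. The sketch below evaluates that integral for constant rates; it illustrates the general result for jump processes rather than the paper's exact free energy, and all names are hypothetical.

```python
import numpy as np

def kl_two_state(g_plus, g_minus, f_plus, f_minus, t, q1_0=0.0):
    """KL[q || p] between two-state Markov jump processes on the time grid t.

    q has rates g_plus (0 -> 1) and g_minus (1 -> 0); p has f_plus and f_minus.
    Both processes are assumed to start from the same initial distribution,
    and all rates must be strictly positive.
    """
    def rate_term(g, f):
        # Non-negative, and zero only when g == f.
        return g * np.log(g / f) - g + f

    q1 = q1_0
    kl = 0.0
    for i in range(1, len(t)):
        dt = t[i] - t[i - 1]
        kl += dt * ((1.0 - q1) * rate_term(g_plus, f_plus)
                    + q1 * rate_term(g_minus, f_minus))
        # Advance the marginal of q with its own master equation.
        q1 += dt * (g_plus * (1.0 - q1) - g_minus * q1)
    return kl

# Usage: identical rates give zero divergence; mismatched rates do not.
t = np.linspace(0.0, 10.0, 1001)
kl_same = kl_two_state(0.5, 0.5, 0.5, 0.5, t)
kl_diff = kl_two_state(1.0, 0.2, 0.5, 0.5, t)
```

Because the divergence depends on the approximating rates both directly and through the marginals they induce, the two cannot be varied independently, which is why a constraint linking them is needed in the optimisation.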

The optimization procedure is based on a forward-backward procedure, leading to ordinary differential equations which can iteratively be solved. Taking the regulation involving two regulators for example, the free energy is a functional of both the approximating processes q^{1}, q^{2 }and their transition rates n_{1}, n_{2}. However, these are not independent, but are related by the Master equation. To incorporate this constraint, we add Lagrange multipliers as

where g_{1 }and g_{2 }are the rates of jumps from the 0 to the 1 state for process q^{1 }and q^{2}, respectively.

The Lagrange multiplier functions obey the final condition λ(T) = 0. Estimation of the parameters _{q}[

**The framework of the inference**.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

SQW proposed the method and performed the analysis; HXL supervised the work and revised the paper critically for important intellectual content.

Acknowledgements

This work was supported by a GRF project from Hong Kong SAR (CityU 117310) and a grant from NNSF China (51175519).

This article has been published as part of