Computational Biology and Machine Learning Lab, Center for Cancer Research and Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen’s University Belfast, Belfast, UK

Laboratory of Biosystem Dynamics, Computational Systems Biology Research Group, Department of Signal Processing, Tampere University of Technology, Tampere, Finland

Abstract

Background

Evidence suggests that in prokaryotes sequence-dependent transcriptional pauses affect the dynamics of transcription and translation, as well as of small genetic circuits. So far, a few pause-prone sequences have been identified from in vitro measurements of transcription elongation kinetics.

Results

Using a stochastic model of gene expression at the nucleotide and codon levels with realistic parameter values, we investigate three different but related questions and present statistical methods for their analysis. First, we show that information from in vivo RNA and protein temporal numbers is sufficient to discriminate between models with and without a pause site in their coding sequence. Second, we demonstrate that it is possible to separate a large variety of models from each other with pauses of various durations and locations in the template by means of a hierarchical clustering and a

Conclusions

This method can aid in detecting unknown pause-prone sequences from temporal measurements of RNA and protein numbers at a genome-wide scale and thus elucidate possible roles that these sequences play in the dynamics of genetic networks and phenotype.

Background

Noise is inherent in gene expression and affects the behavior of genetic circuits and thus phenotype determination. It is unknown to what extent this noise is evolvable. One mechanism that likely contributes to transcriptional noise in prokaryotes is RNA polymerase (RNAP) pausing during elongation
^{′} end and the RNA sequence affect pause duration and proneness for premature termination

Long-duration pauses usually occur only at specific DNA sequences

One of the best studied long-pause sites is the

Studies of transcriptional pausing have focused on the physical-chemical causes and its physiological role in gene expression

So far there are only hypotheses regarding the roles of sequence-dependent pauses in the dynamics of gene expression and genetic circuits

With this aim, here we investigate whether, from temporal RNA and protein numbers, we can determine if there is a long-duration pause site in the elongation region of a gene. Additionally, we aim to estimate, at least by comparison, the mean duration of a pause and its location relative to the transcription start site. For that, we simulate stochastic gene expression dynamics at the nucleotide and codon levels

Methods

Modeling gene expression

We use a delayed stochastic model of prokaryotic transcription and translation at the nucleotide and codon levels that includes closed and open complex formation, stepwise elongation, and alternative pathways to elongation, namely pausing, arrests, editing, pyrophosphorolysis, RNA polymerase traffic, and premature termination. Stepwise translation can begin after the formation of the ribosome binding site and accounts for variable codon translation rates, ribosome traffic, back-translocation, drop-off, and trans-translation

The dynamics follows the delayed Stochastic Simulation Algorithm
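To illustrate how a delayed stochastic simulation proceeds, the following minimal Python sketch simulates a toy system with only two reactions: RNA production with a fixed delay, and first-order RNA degradation. It is not the nucleotide-level model described in this section. The function name `delayed_ssa`, its arguments, and the use of a single fixed delay `tau` are assumptions made for illustration only.

```python
import heapq
import random


def delayed_ssa(k_init, k_deg, tau, t_end, seed=0):
    """Delayed SSA for a toy two-reaction system (illustrative only).

    Reaction 1: Pro -> Pro + RNA, with the RNA released tau seconds later.
    Reaction 2: RNA -> 0 (first-order degradation).
    """
    rng = random.Random(seed)
    t, rna = 0.0, 0
    pending = []  # min-heap holding release times of delayed products
    times, counts = [0.0], [0]
    while t < t_end:
        a1 = k_init        # propensity of initiation (one promoter copy)
        a2 = k_deg * rna   # propensity of degradation
        a0 = a1 + a2
        dt = rng.expovariate(a0)  # waiting time to the next reaction
        if pending and pending[0] <= t + dt:
            # A delayed product is released before the next reaction fires:
            # jump to the release time, update the state, and redraw.
            t = heapq.heappop(pending)
            rna += 1
        else:
            t += dt
            if rng.random() * a0 < a1:
                heapq.heappush(pending, t + tau)  # schedule delayed release
            else:
                rna -= 1
        times.append(t)
        counts.append(rna)
    return times, counts
```

For example, `delayed_ssa(k_init=0.02, k_deg=0.003, tau=40.0, t_end=10000.0)` returns the event times and the RNA count after each event; these rate values are hypothetical and are not taken from the parameter table.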

Chemical reactions, rate constants (in s^{−1}), and delays (in s) used to model transcription and translation. Pro – promoter, Rp – RNA polymerase, Rib – ribosome, [Rib^{R} – number of translating ribosomes on RNA strand, P – complete protein, U – unoccupied nucleotide and O – nucleotide occupied by Rp, A – activated nucleotide; U^{R},O^{R},A^{R} – corresponding ribonucleotides. _{P} – range of nucleotides that Rp occupies, _{P}=25. _{R} – range of ribonucleotides that ribosome occupies, _{R}=31. Notation ^{2}) denotes that the values of ^{2}. Parameter values are from measurements in _{fold}) is set according to measurements of a commonly used GFP mutant

| **#** | **Chemical reaction(s)** | **Parameters** |
|---|---|---|
| 1 | | _{tc}=0.0245, _{oc}∼^{2}) |
| 2 | | _{m}=150 |
| 3 | | _{a}=150 for _{a}=30 for |
| 4 | | _{m}=150 |
| 5 | | _{p}=0.55,_{p}=3 |
| 6 | | _{m}=150 |
| 7 | | _{m}=150 |
| 8 | | _{ar}=100 |
| 9 | | _{ed}=0.009, _{ed}=5 |
| 10 | | _{pre}=1.9·10^{−4} |
| 11 | | _{pyr}=0.75 |
| 12 | | _{f}=2 |
| 13 | | _{dr}=0.025 |
| 14 | | _{tl}=0.53 |
| 15 | | _{trA}=35, _{trB}=8, _{trC}=4.5 |
| 16 | | _{tm}=10,000 |
| 17 | | see above |
| 18 | | see above |
| 19 | | _{bt}=1.5 |
| 20 | | _{drop}=1.14·10^{−4} |
| 21 | | _{tt} is sequence dependent |
| 22 | | _{fold}∼^{2}) |
| 23 | | _{dp}=0.0029 |

The model of transcription accounts for the binding of the RNAP to the template and diffusion along the template (reaction 1 in Table
_{oc} in reaction 1)
_{P} + 1) occupied by the RNAP on the strand while elongating is 25

The model of translation includes translation initiation (reaction 14 in Table

Modeling sequence-dependent pauses

Two types of transcriptional pauses have been identified: i) ubiquitous pauses, which can occur at any nucleotide with approximately uniform probability of occurrence

Reaction 5 (forward direction) in Table

where

As specified in reaction 1, the duration of these pauses is randomly drawn from an exponential distribution with the appropriate mean pause duration each time it occurs. It is noted that the assumption of exponential duration of each pause event is based on measurements where the sequence causing the pause is present, but the subsequent sequence where hairpin loops form (stabilizing the paused state) is not

When a pause occurs, the ribosomes translating the RNA proceed only until the point where the RNAP is stranded. At that point, ribosomes pause until the RNAP is released

Detecting the presence of a sequence-dependent pause site

Simulations of the models are initialized without RNA or proteins in the system. For our analyses we use only the stationary part of a time series. The methods assume the time series to be weakly stationary, meaning that the first two moments (i.e. mean and variance) do not vary over time. This condition is, in all cases, tested by a two-sample t-test for the ensemble mean values for a sample size of 10.

We first present a method to detect a sequence-dependent pause site from the time series of RNA and protein numbers. We denote by _{M},
^{′}. _{M}(_{M},

Previous work based on the simulations of stochastic models similar to the one used here
^{′}from a randomly sampled time series

1. sample

2. sample _{s}∼unif(1:

3. estimate the mean number of mRNAs for model M and M’:

Here, the symbol unif(

This test results in a p-value, ^{′}. Repeating the above procedure _{M}and
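The resampling test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, each model's data are assumed to be a list of runs (each run a list of molecule counts over time), and the p-value uses Welch's statistic with a normal approximation instead of the exact t distribution, which is reasonable only for larger sample sizes.

```python
import random
from statistics import NormalDist, mean, variance


def welch_p_value(x, y):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = (variance(x) / len(x) + variance(y) / len(y)) ** 0.5
    if se == 0.0:
        return 1.0 if mean(x) == mean(y) else 0.0
    z = (mean(x) - mean(y)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def resampled_p_values(series_a, series_b, n, repeats, seed=0):
    """Repeat the test on random samples of size n drawn from each model.

    series_a, series_b: lists of runs, each run a list of counts over time.
    Returns one p-value per repetition.
    """
    rng = random.Random(seed)
    p_values = []
    for _ in range(repeats):
        # Draw a random run and then a random time point, n times per model.
        xa = [rng.choice(rng.choice(series_a)) for _ in range(n)]
        xb = [rng.choice(rng.choice(series_b)) for _ in range(n)]
        p_values.append(welch_p_value(xa, xb))
    return p_values
```

Summarizing the returned list, for instance by its median, for increasing values of `n` gives a curve of p-values as a function of sample size.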

Definition of feature vectors

For each of the models with pauses with distinct kinetic characteristics, we measure the number of mRNAs and of proteins, and the cumulative number of proteins as a function of time, represented by matrices, _{M}, _{M} and _{M}, respectively. Following the previous notation, each matrix has size

To perform a clustering and a classification of the time series data generated from the different models, we define the following 10 features, which we use to define feature vectors. These features capture information about the autocorrelation, cross-correlation and the duration of the transcription and translation processes. Specifically, we estimate the lag-_{xx}(

Here 0≤_{t}}. We estimate the lag-_{M}and _{M}, i.e., _{xx}(_{M}) and _{xx}(_{M}). Then we estimate the mean and the standard deviation of the autocorrelation function, _{xx}, up to lag

For our numerical analysis we set _{xy}(

with 0≤_{t}}. Also, for the cross-correlation function we estimate _{xy}) and _{xy}) up to lag _{M}and _{M}.
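As an illustration of the correlation-based features, the sketch below computes a lagged sample correlation between two time series and then the mean and standard deviation of that function up to a maximum lag. The normalization used here (full-series sums of squares) and the use of the population standard deviation are assumptions; the function names are hypothetical. The autocorrelation is obtained by passing the same series twice.

```python
from statistics import mean, pstdev


def cross_correlation(x, y, lag):
    """Sample correlation between x[t] and y[t + lag] for lag >= 0.

    Assumes x and y have equal length and are not constant.
    """
    n = len(x)
    mx, my = mean(x), mean(y)
    sx = sum((v - mx) ** 2 for v in x) ** 0.5
    sy = sum((v - my) ** 2 for v in y) ** 0.5
    cov = sum((x[t] - mx) * (y[t + lag] - my) for t in range(n - lag))
    return cov / (sx * sy)


def correlation_features(x, y, max_lag):
    """Mean and standard deviation of the correlation function up to max_lag."""
    r = [cross_correlation(x, y, k) for k in range(max_lag + 1)]
    return mean(r), pstdev(r)
```

Calling `correlation_features(rna, rna, max_lag)` gives the two autocorrelation features of a series `rna`, and `correlation_features(rna, protein, max_lag)` gives the corresponding cross-correlation features.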

Further, we estimate the mean decay time of the transcripts and its standard deviation. To obtain these, we first determine a vector,

holds during the time series _{M}(_{M}(_{M}(_{M}(

A summary of all 10 variables is given in Table
_{M}(

Summary of the 10 variables we use to define a feature vector for a model

| **#** | **Feature** | **Description** | **Data** |
|---|---|---|---|
| 1 | _{xx};_{M}) | mean autocorrelation function | _{M} |
| 2 | _{xx};_{M}) | standard deviation of autocorrelation function | _{M} |
| 3 | _{xx};_{M}) | mean autocorrelation function | _{M} |
| 4 | _{xx};_{M}) | standard deviation of autocorrelation function | _{M} |
| 5 | _{xy};_{M},_{M}) | mean cross-correlation function | _{M} and _{M} |
| 6 | _{xy};_{M},_{M}) | standard deviation of cross-correlation function | _{M} and _{M} |
| 7 | _{xy};_{M},_{M}) | mean cross-correlation function | _{M} and _{M} |
| 8 | _{xy};_{M},_{M}) | standard deviation of cross-correlation function | _{M} and _{M} |
| 9 | _{M})) | mean decay time | _{M} |
| 10 | _{M})) | standard deviation of decay time | _{M} |

We would like to emphasize that all three types of measures introduced above, based on autocorrelation, cross-correlation and the decay time, are fundamentally different from each other. Whereas the first two types of measures are based on a different usage of correlation coefficients within (nr. 1, 2, 3, 4 see Table

Results and discussion

We model genes 1,000 nucleotides long. Unless otherwise stated, the long-pause site is at nucleotide 500 and has the same kinetic properties as a

For the following analysis, we consider six models, A through F, described in Table
^{−1}. Once occurring, such pauses last, on average, 3 s following an exponential distribution

The six models with different pause characteristics are considered for the purposes of detection and classification of sequence dependent pauses.

| Model | Features |
|---|---|
| A | No sequence-dependent pause sites. |
| B | Pause site at nucleotide 500. |
| C | Pause site at nucleotide 250. |
| D | Pause site at nucleotide 750. |
| E | Pause site with mean duration |
| F | Pause site with mean duration |

The comparison between models A and B tests whether the presence of a long pause is detectable from time series of RNA and protein numbers. The other models are used to test whether the location and kinetic properties of the pause can be classified. For each model, we simulate 10 instances, each for 1,000,000 s. RNA and protein numbers are sampled at a frequency of 1 s^{−1}. The instances of each model differ in their codon sequences, which are randomly generated as described in the Methods section. However, we note that the sequences used here are long enough that differences in the codon sequence are not expected to cause significant differences in the kinetics of translation elongation.

We found that for

To visualize this problem, Figure


**Average number of proteins.** Average number of proteins for each model. Each time series has been averaged over 10 independent runs and each data point has been averaged over 100 time steps and smoothed over a window of size 20.

We would like to emphasize that, theoretically, different models can be distinguished from each other by calculating the

Detecting a sequence-dependent pause site

First, we test whether a sequence-dependent pause with the aforementioned characteristics is detectable. Such a detection would discriminate a model with a pause site from a model without one. To study this, we compare model A with model B using the hypothesis tests described in the Methods section.

The results of the analysis are shown in the first column in Figure


**P-values for comparing models A and B.** P-values from two-sample t-tests as a function of the sample size. Top row:

It is visible that, with larger sample sizes, the median p-values fall below the

It is interesting to note that the information provided by the protein level allows a better discrimination for

To demonstrate that the null hypothesis is not rejected if the data come from the same model, i.e., when the null hypothesis is true, we repeat the above analysis to obtain p-values for the cases _{A,A} and _{B,B}. The second column in Figure

Classification of models

We hypothesize that, despite the intricate dynamics of the gene expression model where, e.g., RNAPs can bump into each other causing mutual delays of transcription, the information captured in the mRNA and protein numbers suffices to distinguish models with different parameter configurations. To demonstrate this, we estimate feature vectors for each model, based on the 10 features defined in the Methods section, and show numerically their discriminative power.

The rationale of the following analysis is, first, to use an unsupervised clustering analysis to demonstrate that our features are not only sufficient to recover different models in an unsupervised manner but also that such clusters are robust. Second, we use a random forest classifier to classify the models based on our feature vectors. This allows a precise quantification of the errors made by such a categorization.

First, we perform an unsupervised clustering analysis. Specifically, we generate for models A through F time series data from which we estimate 50 feature vectors
_{M}_{M}_{M}_{M}^{2}=1). Here, the symbol ’∼’ indicates that the random variable (left side) is sampled from a model (right side). To these feature vectors (profiles), we apply a hierarchical clustering using a Manhattan distance measure and the McQuitty clustering
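A hierarchical clustering with Manhattan distances and McQuitty linkage (also known as WPGMA) can be sketched in a few lines. The version below is a plain-Python illustration with hypothetical function names and no feature scaling. It stops once a requested number of clusters remains, which is an assumption made to keep the example short; a full dendrogram would record every merge.

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def mcquitty_clusters(vectors, n_clusters):
    """Agglomerate with McQuitty (WPGMA) linkage until n_clusters remain.

    Returns a sorted list of clusters, each a sorted list of vector indices.
    """
    clusters = {i: [i] for i in range(len(vectors))}
    # Pairwise distances, keyed by (smaller id, larger id).
    dist = {(i, j): manhattan(vectors[i], vectors[j])
            for i in clusters for j in clusters if i < j}
    next_id = len(vectors)
    while len(clusters) > n_clusters:
        a, b = min(dist, key=dist.get)  # closest pair of clusters
        merged = clusters.pop(a) + clusters.pop(b)
        new_dist = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            # McQuitty update: simple average of the two distances,
            # ignoring the sizes of the merged clusters.
            new_dist[(c, next_id)] = 0.5 * (d_ac + d_bc)
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new_dist)
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())
```

With feature vectors from several models stacked in one list, the returned index groups can be compared against the known model labels to check whether each cluster is dominated by one model.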


**Results of hierarchical clustering of the models.** Hierarchical clustering of feature vectors from models B, E and F (left tree) and from models A, C and D (right tree). The labels index the feature vectors. Each model is represented by 50 feature vectors.

The sensible cluster formations of our hierarchical clustering in Figure

What the clustering in Figure


**Significance of correlations among features.** Graphical visualization of the p-values of correlation coefficients between different features. The colors red to blue represent low to high p-values. The diagonal is shown in black to indicate that the self-correlations are not of interest.

Next, we quantify the classification abilities of the feature vectors. We use a random forest classifier (RFC)

Overall, these findings demonstrate that the information contained in the mRNA and protein numbers suffices to distinguish the models from each other, although not without error. We studied many additional variables by enlarging the dimension of the feature vectors and found that the above classification errors can be lowered further. However, due to the moderate decrease in the classification errors (3

Estimating the location of a pause site

Finally, we estimate the location of a pause site from time series data. For this, we consider the

Because for the model of gene expression used here there is no known likelihood function available that could be used to obtain a maximum likelihood estimate for this parameter, we use an approximation thereof. The approximation proposed is based on the feature variables defined in Table

Here, **y** is a **y**. The components of **y**_{i}, whereas the index refers to the **y**_{i}=(_{1}(_{V}(

For simplicity, we assume that the multivariate density **y**_{i}|_{i}from each other. In the previous section we saw that all random variables _{j} are required to obtain a sensible classification of the models. This justifies the independence assumption, because if these variables were strongly dependent, the dimension of the feature vector could have been reduced.

Further, we define _{j} in the models
_{θ}. More precisely, the joint probability density is calculated by

Here, the probability densities _{θ}, respectively. **y**, with sample size _{θ} to generate data with sample size ^{′}. Theoretically, ^{′}≠^{′}=

To motivate our approach, we note that the parameter ^{′} in model
**y**, and _{θ} that needs to be estimated. To estimate the probabilities, _{θ}for varying values of the parameter **y**)=**y**) with
**y**) (Equation 12), it follows that

Using this approach, we study if the location of the pause relative to the transcription start site (TSS) can be estimated from the time series measurements. In Figure

The range of the LRL is from zero (maximum) to minus infinity. In the figures, the vertical dotted green lines correspond to the true but unknown position (^{′}) of a pause site, and the vertical dotted blue lines are the maximum likelihood estimates of these positions. The error bars correspond to the standard deviation for the nucleotide positions estimated from
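The following sketch illustrates the general idea of a logarithmic relative likelihood over candidate pause positions under the independence assumption. It makes an assumption that the text does not confirm: each feature's density under a candidate model is approximated by a Gaussian fitted to reference feature vectors simulated from that model. The function names and the data layout (a dictionary mapping each candidate position to a list of reference feature vectors) are hypothetical.

```python
import math
from statistics import mean, stdev


def log_likelihood(observed, reference):
    """Sum of independent Gaussian log-densities over samples and features.

    observed:  feature vectors from the data with unknown pause position.
    reference: feature vectors simulated from one candidate model.
    """
    total = 0.0
    for j in range(len(reference[0])):
        column = [row[j] for row in reference]
        mu, sd = mean(column), stdev(column)  # assumes sd > 0
        for row in observed:
            z = (row[j] - mu) / sd
            total += -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)
    return total


def log_relative_likelihood(observed, references):
    """Map each candidate position to log L(theta) minus the maximum log L."""
    logl = {theta: log_likelihood(observed, ref)
            for theta, ref in references.items()}
    best = max(logl.values())
    return {theta: value - best for theta, value in logl.items()}
```

The candidate with a value of zero is the estimate of the pause position; all other candidates receive negative values, consistent with the stated range of the LRL.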


**Maximum likelihood estimation of the position of pause sites.** Logarithmic relative likelihood for three models: model C (left), model B (middle) and model D (right). The maximum likelihood estimates of the nucleotide positions are 200, 410 and 780 (vertical blue lines), and the true positions (250, 500 and 750) are indicated by vertical green lines. The boundary of the 95% bootstrap confidence region of the ML estimates is indicated by horizontal lines.

Overall, due to the

Conclusions

So far the identification of pause-prone sequences has relied on in vitro studies that make use of complex measurement procedures to characterize the kinetics of elongation of the RNAP

Here we proposed a set of novel statistical methods that allow detecting the presence of pause sites, their location relative to the TSS, and their kinetics (mean duration), from time series data of mRNA and protein numbers at the single molecule level. This is motivated by the fact that such measurements can already be obtained at an almost genome-wide scale

For the cases studied, there may be alternative features that perform better, in one sense or another. For example, to detect the existence of a pause site we used the mean RNA and protein numbers. This feature is only suitable if the induction level is strong enough for several collisions between RNAPs to occur during the simulations. Additionally, this feature is affected by the codon sequences, which here are randomly generated in each simulation. In this case, and for the realistic parameter values used, this feature proved to be sufficient. In other conditions, the use of different or additional features may be required.

At the moment there is no means to experimentally validate the results. For that, one needs to measure, in vivo, RNA numbers at the single molecule level. The MS2-GFP tagging system of RNA molecules is unlikely to be usable, not only because it immortalizes the RNA, but also because it most likely affects the secondary and tertiary structure of the RNA, as the binding of MS2 is likely to hamper the formation of structures such as hairpin loops, which are needed to stabilize transcriptional pauses
^{−1} or faster (the lac promoter is a tentative choice

The methods used here require data from different models to compare them with each other, regardless of the statistical method employed. For example, to detect whether a pause exists from real gene expression data, one must provide a certain amount of data on the expression dynamics of a gene that indeed contains a pause and of a gene that does not. Similar data are required if one wants to determine the location of pause sites and their durations. Hence, regardless of whether a hypothesis test, clustering or a classification method is used, one needs data that can be

From the above, the method proposed here to identify unknown pause-prone sequences is rather laborious on the experimental side. Nevertheless, it is feasible using known, relatively simple experimental techniques

A recent work

In another work

In conclusion, our methods provide a means to detect unknown pause-prone sequences from temporal gene expression measurements and to determine their location in the sequence relative to the transcription start site and their kinetic properties. They may thus facilitate the identification of such sequences from genome-wide temporal gene expression measurements. From this mapping, and by correlating these findings with the functions of the various proteins in the cells, we may enhance our understanding of whether and how this sequence-dependent mechanism is used in the regulation of genetic network dynamics

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

ASR and FES conceived the analysis. FES, AH and ASR performed the analysis and wrote the paper. All authors read and approved the final manuscript.

Acknowledgements

The work of FES is partly supported by the