Cambridge Systems Biology Centre, University of Cambridge, Tennis Court Road, Cambridge CB2 1QR, UK
Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK
Abstract
Background
Site-specific transcription factors (TFs) are proteins that bind to specific sites on the DNA and control the activity of a target gene by enhancing or decreasing the rate at which the gene is transcribed by RNA polymerase. The process by which TF molecules locate their target sites is a key component of transcriptional regulation. Therefore it is essential to gain insight into the mechanisms by which TFs search for the target sites.
Research in this area uses experimental and analytical approaches, but also stochastic simulations of the search process. Previous work based on stochastic simulations focussed only on short sequences, primarily for reasons of technical feasibility. Many of these studies had to disregard possible biases introduced by reducing a genome-wide system to a smaller subsystem. In particular, we identified crucial parameters that require adjustment, which were not adequately changed in these previous studies.
Results
We investigated several methods that adequately adapt the parameters of stochastic simulations of facilitated diffusion when the full sequence space is reduced to smaller regions of interest. We found two methods that scale the system accordingly: the
Conclusions
We propose a strategy to reduce the size of the system that adequately adapts important parameters to capture the behaviour of the full system. This enables correct simulations of a smaller sequence space (which can be as small as 100
Background
Transcription Factors need to locate their target sites on the DNA within a time frame that is shorter than can be achieved by random diffusion. The search process is further complicated by the fact that target sites are usually similar to a significant number of other sites (decoys), and by the fact that there are other molecules searching for their target sites simultaneously. To understand transcriptional regulation better, it is therefore essential to have a complete understanding of the mechanistic way in which this search process takes place.
In the last 40 years, theoretical and experimental research has established that the search mechanism is a combination of three-dimensional diffusion and a one-dimensional random walk, which is often referred to as the
One way to address these questions is through stochastic simulations of the facilitated diffusion mechanism
Despite the significant speedup compared to previous tools, it is still not feasible to use the full genomic sequence as a search space. To address a scientific question with GRiP, multiple simulations need to be performed to allow a meaningful statistical analysis of the results. Thus, even small improvements in simulation speed can add up to significant time savings. Optimising the algorithm or its implementation can potentially increase the speed of the simulations, but the gain is limited by the level of detail in the simulated model. In addition, even in the case of significant algorithm optimisations, simulating eukaryotic systems that have more than 100
One strategy to increase simulation speed is to reduce the system size, on the premise that the properties of the search process are the same whether a subset or the full genomic sequence is simulated. However, this requires a few simulation parameters to be adapted to the size of the subsystem (e.g. the number of TF molecules in the subsystem as compared to the full system). This change in system parameters is required to avoid biases in the results; for example, with an inappropriate number of TFs, the TFs could locate the target sites faster or the target sites might be occupied for longer time intervals. The main advantage of this approach is that smaller systems are faster to simulate, owing to the smaller DNA regions and, consequently, to the lower number of DNA-bound molecules performing the one-dimensional random walk.
Our results indicate that if the diffusion parameters are conserved and if the proportion of covered DNA is similar for the original system and the subsystem, then the subsystem captures the dynamic and steady state behaviour of the original system with negligible error.
In this contribution, we present two adaptation methods (the copy number method and the association rate method) that keep the simulation results of the subsystem consistent with those of the full system. We systematically investigate the degree to which the simulation results are affected when reducing the size of the system. The first method (
The second approach, the (
Overall, we show that the copy number method performs well in the case of high abundance TFs, while for low abundance TFs one needs to rely on the association rate method.
Results and discussion
In this study we consider the lac repressor (lacI) TF, since this is one of the best described TFs with respect to the facilitated diffusion mechanism. Details regarding the lacI parameters used in this paper are presented in the Methods section. For the purpose of this study, we did not aim to provide a complete and exact description of the lac repressor system, but rather to describe under which conditions one can reduce the size of the system.
We consider six subsystems which are smaller than the full system (4.6
DNA subsystems. The vertical solid line indicates the position of the
Figure
Binding energy of lacI to all six subsystems. In this graph we plotted the PWM score as box plot for the six subsystems and for the full system.
Next we present two models that are intended to keep the subsystems equivalent to the full system with respect to the facilitated diffusion mechanism.
Model I: TF copy number reduction
If the full system contains a DNA molecule of size
where
A subsystem with a DNA molecule of size
Supplementary Material. The
where
One way to alter the number of bound molecules is to keep all the parameters the same and only reduce the total number of molecules in the cell proportionally to
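As a minimal sketch of the copy number model, the scaling of the total TF abundance with the DNA length can be written as below. The function name `scale_copy_number`, the rounding to the nearest integer and the genome length `FULL_GENOME_BP` (an assumed value of about 4.6 Mbp) are illustrative assumptions, not taken from the original text.

```python
# Hypothetical helper: scale the TF copy number with the size of the DNA kept.
# Assumption: the copy number is rounded to the nearest integer.

FULL_GENOME_BP = 4_639_675  # assumed length of the full system (about 4.6 Mbp)


def scale_copy_number(tf_total: int, full_bp: int, sub_bp: int) -> int:
    """Return the TF copy number for a subsystem of sub_bp base pairs."""
    if not 0 < sub_bp <= full_bp:
        raise ValueError("sub_bp must be positive and not exceed full_bp")
    return round(tf_total * sub_bp / full_bp)


if __name__ == "__main__":
    for sub_bp in (1_000_000, 460_000, 230_000, 100_000, 46_000):
        scaled = [scale_copy_number(n, FULL_GENOME_BP, sub_bp) for n in (1000, 100, 10)]
        print(sub_bp, scaled)
```

Under these assumptions, a result that rounds to zero signals that the subsystem would contain less than one molecule, which is the situation in which a low abundance TF cannot be handled by reducing the copy number alone.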
Model II: association rate reduction
Alternatively, the number of bound molecules can be modified without changing the total TF abundance, but only by changing the association rate of the molecules (
where
The ratio of bound TF is given by:
In the case of the subsystems, the association rate becomes
In the association rate model, the total number of molecules in the system remains constant (only the number of molecules bound to the DNA decreases) and, thus, we have
from equation (5):
from equation (2):
The association rate of the subsystem (
from equations (5) and (9):
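The idea behind the association rate model can be illustrated with a simplified sketch. It assumes a two-state equilibrium in which the fraction of time a TF is bound is f = k_a / (k_a + k_off), and that the bound fraction in the subsystem should equal the full-system fraction multiplied by the size ratio, while the total number of molecules stays constant. Both the function `scale_association_rate` and the effective unbinding rate `k_off` are hypothetical, and this simplification is not expected to reproduce the association rates reported for the subsystems exactly.

```python
# Hypothetical sketch under a two-state equilibrium assumption:
#   bound fraction f = k_a / (k_a + k_off)
# The subsystem keeps the total TF abundance but should bind only
# f * size_ratio of the molecules, so its association rate is lowered.


def scale_association_rate(k_assoc_full: float,
                           bound_fraction_full: float,
                           size_ratio: float) -> float:
    """Return an adapted association rate for a subsystem.

    k_assoc_full        association rate used for the full system
    bound_fraction_full fraction of time a TF is bound in the full system (0..1)
    size_ratio          subsystem DNA length divided by full DNA length (0..1]
    """
    if not 0.0 < bound_fraction_full < 1.0:
        raise ValueError("bound_fraction_full must lie strictly between 0 and 1")
    if not 0.0 < size_ratio <= 1.0:
        raise ValueError("size_ratio must lie in (0, 1]")
    # Effective unbinding rate implied by the full-system bound fraction.
    k_off = k_assoc_full * (1.0 - bound_fraction_full) / bound_fraction_full
    # Target bound fraction in the subsystem.
    f_sub = bound_fraction_full * size_ratio
    return k_off * f_sub / (1.0 - f_sub)


if __name__ == "__main__":
    print(scale_association_rate(2400.0, 0.9, 0.5))
```

The scaling is non-linear in the size ratio: because the full system is close to saturation, halving the DNA length lowers the association rate by far more than a factor of two.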
Comparison of the two models
Next we will compare the two models (the copy number model and the association rate one) and investigate under which conditions one model is better than the other. The comparison includes four performance parameters, namely: (
Occupancy bias correlation between the full system and the subsystems. We consider the smallest subsequence (46 Kbp) and the corresponding regions in all other sequences, and compute the Pearson correlation coefficient between occupancy biases. First, we compute the average occupancy bias for the full system using 60 independent simulations; then, for each simulation (including those of the full system), we compute the correlation between its occupancy bias and the mean value of the full system. Only lacI molecules were added to the system and each simulation was run for 2000 s (in the case of 10 molecules), 200 s (in the case of 100 molecules) or 20 s (in the case of 1000 molecules). In the first row none of the parameters is changed, while in the second and third rows the number of lacI molecules and the association rate, respectively, were varied according to the system size.
The ratio between normalized affinity and normalized occupancy. We consider the smallest subsequence (46 Kbp) and its top ≈ 180 sites (those whose binding energy is at most 30% lower than that of the strongest site). Only lacI molecules were added to the system and 60 simulations were run for each set of parameters. Each simulation was run for 2000 s (in the case of 10 molecules), 200 s (in the case of 100 molecules) or 20 s (in the case of 1000 molecules). In the first row none of the parameters is changed, while in the second and third rows the number of lacI molecules and the association rate, respectively, were varied according to the system size.
Time to reach the target site. We ran 60 independent simulations in which only lacI molecules were added to the system. Each simulation was run for 2000 s (in the case of 10 molecules), 200 s (in the case of 100 molecules) or 20 s (in the case of 1000 molecules). In the first row none of the parameters is changed, while in the second and third rows the number of lacI molecules and the association rate, respectively, were varied according to the system size.
The probability that the target site is occupied by a TF molecule. The proportion of time the
We consider three cases with respect to TF abundance, namely: (
The full system consists of (

DNA size    lacI copy number          association rate
4.6 Mbp     1000    100     10        2400
            496     50      5         172.28    169.09    168.75
1.0 Mbp     216     22      2         50.78     49.79     49.68
460 Kbp     99      10      1         20.60     20.19     20.15
230 Kbp     50      5       -         9.81      9.61      9.59
100 Kbp     22      2       -         4.15      4.07      4.06
46 Kbp      10      1       -         1.89      1.85      1.85
Note that the association rate for the full system (2400
The proportion of time spent on the DNA can be computed using the approach described in
In addition, Table
Occupancy bias
Figure
The correlation between the occupancy bias of the full system and that of each subsystem indicates that the peaks in the occupancy bias data are captured by all subsystems for both models (copy number and association rate models). However, to obtain a complete picture of the occupancy bias we need to investigate whether the size of these peaks is conserved, i.e., whether the same ratio between occupancy and affinity is found in the subsystems as in the full system. To do this, we use the ratio between normalized affinity and normalized occupancy for all sites that have a certain minimum affinity. This minimum affinity threshold removes low affinity sites from the data, where the noise in occupancy bias is high (and could lead to misinterpretation of the data).
For the sites with an affinity above a certain value we compute the ratio between the normalized affinity and the normalized occupancy. For low and medium abundance TFs we expect the ratio to be around one, but in the case of high abundance TFs, due to the high crowding, the ratio should be significantly lower than 1 (resulting in many false positives, in the sense that these sites are identified as highly occupied sites, with prospective high affinity, but the actual affinity is lower than predicted based on the occupancy).
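The two measures used in this subsection can be sketched as follows. The function names, the normalization by the maximum value over the selected sites and the relative threshold `min_rel_affinity` are illustrative assumptions; the sketch also assumes that affinities are positive numbers.

```python
import numpy as np


def correlation_with_mean(full_system_runs, run):
    """Pearson correlation between one occupancy-bias profile and the
    mean profile of the full-system replicates (rows = simulations)."""
    mean_bias = np.mean(np.asarray(full_system_runs, dtype=float), axis=0)
    return float(np.corrcoef(mean_bias, np.asarray(run, dtype=float))[0, 1])


def affinity_occupancy_ratio(affinity, occupancy, min_rel_affinity=0.7):
    """Ratio of normalized affinity to normalized occupancy for strong sites.

    Assumptions: affinities are positive, sites are kept when their affinity
    is at least min_rel_affinity times the strongest one, and both quantities
    are normalized by their maximum over the kept sites.
    """
    affinity = np.asarray(affinity, dtype=float)
    occupancy = np.asarray(occupancy, dtype=float)
    keep = affinity >= min_rel_affinity * affinity.max()
    norm_aff = affinity[keep] / affinity[keep].max()
    max_occ = occupancy[keep].max()
    norm_occ = occupancy[keep] / max_occ if max_occ > 0 else occupancy[keep]
    # Sites that were never occupied give an undefined ratio (NaN).
    ratio = np.full_like(norm_aff, np.nan)
    np.divide(norm_aff, norm_occ, out=ratio, where=norm_occ > 0)
    return ratio


if __name__ == "__main__":
    runs = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
    print(correlation_with_mean(runs, [1.0, 2.0, 3.0]))
    print(affinity_occupancy_ratio([1.0, 0.8, 0.5], [10.0, 10.0, 1.0]))
```

With this normalization, a weaker site that is occupied as often as the strongest one yields a ratio below one, which is the crowding signature described above.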
Figure
Time to reach the target site
Next, we are interested in how the system size reduction influences the search process. Figure
The probability that the target site is occupied
Usually, the activity of TF regulated genes is controlled by the presence or absence of TF molecules at certain target sites. Using our model, we measured the proportion of time the target site was occupied. For long time intervals, this proportion of time approximates the probability that a target site is occupied by a TF.
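The proportion of time the target site is occupied can be estimated from a simulation trace as sketched below. The representation of the trace as a list of non-overlapping (binding time, unbinding time) pairs and the function name `occupied_fraction` are assumptions made for illustration.

```python
# Hypothetical trace format: non-overlapping (bind_time, unbind_time) pairs,
# in seconds, recorded for the target site during one simulation.


def occupied_fraction(intervals, total_time: float) -> float:
    """Fraction of the simulated time during which the site was occupied."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    occupied = 0.0
    for start, end in intervals:
        # Clip each interval to the simulated time window.
        start = max(0.0, start)
        end = min(total_time, end)
        if end > start:
            occupied += end - start
    return occupied / total_time


if __name__ == "__main__":
    trace = [(0.0, 5.0), (10.0, 15.0)]
    print(occupied_fraction(trace, 20.0))
```

For long simulations this fraction serves as the estimate of the probability that the site is occupied, and averaging it over independent runs reduces the sampling noise.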
Figure
Simulation speed
The main reason we reduced the size of the system was to increase simulation speed. Figure
Time required to simulate
Both models produce accurate results compared to the full system and lead to a significant increase in simulation speed. In particular, the errors of the subsystems relative to the full system are negligible and are outweighed by the speed gain these methods provide.
Conclusions
When simulating the facilitated diffusion mechanism, one usually needs multiple long runs for the same set of parameters. This can take a significant amount of CPU time and can lead to impractically long simulation times (greater than 2 months). One solution is to enhance the current algorithms, but this might lead to coarser grained models unable to capture enough details of the mechanism of facilitated diffusion. Alternatively, one could simulate a subsystem of the full system. Figure
To keep the full system and the subsystems equivalent, we developed two models: (
The association rate model overcomes both drawbacks of the copy number model: it reduces the system independently of the TF copy number and reproduces the results of the full system with high accuracy. However, this model requires measuring the actual time the TF molecules spend on the DNA in the full system
In the context of GRiP software
In conclusion, this paper offers a comprehensive description and analysis of the methods that need to be applied when performing non-genome-wide stochastic simulations of the facilitated diffusion mechanism. More specifically, we show that one does not have to perform genome-wide studies of the process by which TFs search for their target sites, as long as the parameters of the subsystem (which considers only a small area around the region of interest) are correctly adjusted.
Methods
Lac repressor
We consider the case of the lac repressor in
To compute the PWM we use the information theory based approach proposed in
where
Using a pseudocount value of
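For orientation, one common way to build a log-odds position weight matrix from aligned binding sites with a pseudocount is sketched below. This is a generic formulation and not necessarily the exact expression used here; the function names, the uniform background and the default pseudocount of 1.0 are hypothetical placeholders.

```python
import math

# Assumed uniform background; a real genome would have its own base frequencies.
UNIFORM_BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def build_pwm(sites, background=UNIFORM_BACKGROUND, pseudocount=1.0):
    """Log-odds PWM: ln(corrected frequency / background) per base and position.

    sites       equally long aligned binding-site sequences
    pseudocount hypothetical default, distributed according to the background
    """
    if not sites or any(len(s) != len(sites[0]) for s in sites):
        raise ValueError("sites must be non-empty and of equal length")
    n_sites = len(sites)
    pwm = []
    for pos in range(len(sites[0])):
        column = {}
        for base, p_bg in background.items():
            count = sum(1 for s in sites if s[pos] == base)
            freq = (count + pseudocount * p_bg) / (n_sites + pseudocount)
            column[base] = math.log(freq / p_bg)
        pwm.append(column)
    return pwm


def pwm_score(pwm, sequence):
    """Sum of the per-position weights for a sequence of matching length."""
    if len(sequence) != len(pwm):
        raise ValueError("sequence length must match the PWM length")
    return sum(column[base] for column, base in zip(pwm, sequence))


if __name__ == "__main__":
    pwm = build_pwm(["ACGT", "ACGA"])
    print(pwm_score(pwm, "ACGT"), pwm_score(pwm, "TTTT"))
```

The pseudocount keeps the weight finite for bases that never occur at a position in the known sites, so that every sequence on the DNA receives a score.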
LacI sequence logo.
The binding energies for the entire
The lac repressor has three sites that control the activity of the lac operon, namely:
Competing interests
The author declares that he has no competing interests.
Authors’ contributions
NRZ designed the study, performed the analysis and wrote the paper.
Acknowledgements
The author would like to thank Boris Adryan and his group (in particular to Rob Foy) for useful discussions and comments on the manuscript. This work was supported by the Medical Research Council [G1002110].