Oxford Centre for Gene Function, University of Oxford, South Parks Road, Oxford OX1 3QB, UK

Eötvös University and Hungarian Academy of Sciences, Theoretical Biology and Ecology Group, Pázmány Péter sétány 1/c, 1117 Budapest, Hungary

Abstract

Background

Most of the existing RNA structure prediction programs fold a completely synthesized RNA molecule. However, within the cell, RNA molecules emerge sequentially during the directed process of transcription. Dedicated experiments with individual RNA molecules have shown that RNA folds while it is being transcribed and that its correct folding can also depend on the proper speed of transcription.

Methods

The main aim of this work is to study if and how co-transcriptional folding is encoded within the primary and secondary structure of RNA genes. In order to achieve this, we study the known primary and secondary structures of a comprehensive data set of 361 RNA genes as well as a set of 48 RNA sequences that are known to differ from the originally transcribed sequence units. We detect co-transcriptional folding by defining two measures of directedness which quantify the extent of asymmetry between alternative helices that lie 5' and those that lie 3' of the known helices with which they compete.

Results

We show with statistical significance that co-transcriptional folding strongly influences RNA sequences in two ways: (1) alternative helices that would compete with the formation of the functional structure during co-transcriptional folding are suppressed and (2) the formation of transient structures which may serve as guidelines for the co-transcriptional folding pathway is encouraged.

Conclusions

These findings have a number of implications for RNA secondary structure prediction methods and the detection of RNA genes.

Background

Most of the existing computational methods for RNA secondary structure prediction fold an already completely synthesized RNA molecule. This is done either by minimizing its free energy (e.g. done by MFOLD

RNA molecules are known to fold as they emerge during transcription

Co-transcriptional folding leads to the formation of temporary secondary structure elements

The speed of transcription also has an effect on folding which can be investigated by varying the nucleoside triphosphate concentration

Among the multitude of biochemical processes which are known to occur transcriptionally

RNA sequences can also promote the proper folding of other RNA sequences. It is known, for example, that the temporary interaction with highly conserved leader sequences of bacterial rRNA-operons is needed for the proper formation of 30S ribosomal subunits and the maturation of 16S rRNA

All these experimental and the few theoretical findings suggest that co-transcriptional folding may play an important role in the correct folding of RNA molecules. They also show that the functional structure may only be a transient one which is available during a certain time span and that the functional structure need not correspond to the structure which would dominate the ensemble of structures after an infinite time span.

Little is known about whether co-transcriptional folding is mainly governed by the specific or non-specific binding of proteins (or other molecules) which target the emerging RNA or whether the primary structure of the RNA molecule itself conveys the desired properties to guide its own correct co-transcriptional folding.

In this paper, we propose several statistics to detect if and how co-transcriptional folding influences RNA sequences. Using these statistics, we show that the effects of co-transcriptional folding are widespread in RNA genes.

Methods

Theory

We want to show that an RNA sequence is organized in such a way as to help the formation of the functional secondary structure during transcription. We aim to support this hypothesis by detecting two different features:

• **Possible competitors of helices in the functional structure are suppressed.** When the 3' end of a helix that is part of the final secondary structure emerges during transcription, the number of possible competitors for the 5' part of the helix should be as low as possible in order to promote the formation of the correct helix.

• **The folding pathway is engineered.** During transcription, several temporary helices are formed which may guide the folding process.

We investigate these features using several statistics which are based on the known primary and secondary structures of our RNA sequences. A crucial point in investigating these features is to define a set of statistics that have an expectation of zero in the H_{0} case, in which we suppose no co-transcriptional folding. However, verifying that these statistics have an expectation value of zero in the H_{0} case cannot simply be achieved by analyzing random sequences. Indeed, even generating random sequences is not trivial. First, it is hard to reliably predict the minimum free energy structure for the randomized sequences, as most secondary structure prediction algorithms discard pseudo-knots and, even without pseudo-knots, predict on average only about 70% of the base-pairs correctly. In addition, there is no guarantee that the secondary structure with the lowest free energy would correspond to the functional one. Second, even if the random sequences are generated by a shuffling algorithm which keeps the given secondary structure fixed, it cannot be guaranteed that the fixed structure remains the correct one for the new primary sequence. Generating random sequences therefore provides no straightforward way of obtaining an H_{0} statistic with expectation value zero.

We circumvent this problem by studying pairs of statistics, where both statistics have the same, unknown expectation value in the H_{0} case and where one statistic is biased away from the H_{0} expectation value in the case of co-transcriptional folding, while the other statistic is not affected by co-transcriptional folding. By studying the difference of these two statistics, we thus gain a new statistic with expectation value zero in the case of no co-transcriptional folding and an expectation value larger or smaller than zero in the case of co-transcriptional folding.
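The construction can be written out in symbols. The labels S_1, S_2, \mu_0 and \delta are introduced here purely for illustration and are not notation from the original text: S_1 is the statistic that is affected by co-transcriptional folding, S_2 the one that is not, \mu_0 their common but unknown null expectation, and \delta the bias.

```latex
\begin{align*}
\text{under } H_0:\quad
  & \mathbb{E}[S_1] = \mathbb{E}[S_2] = \mu_0
  \;\Longrightarrow\; \mathbb{E}[S_1 - S_2] = 0, \\
\text{with co-transcriptional folding:}\quad
  & \mathbb{E}[S_1] = \mu_0 + \delta,\quad \mathbb{E}[S_2] = \mu_0
  \;\Longrightarrow\; \mathbb{E}[S_1 - S_2] = \delta \neq 0 .
\end{align*}
```

The unknown value \mu_0 cancels in the difference, which is why it never needs to be estimated from randomized sequences.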

The statistics (which we will define in detail below) measure the presence of alternative helices which compete for at least one base-pair with the helices of the known secondary structure. These competing alternative helices are required to consist of at least _{stem} = 9 consecutive base-pairs of type {G - C, C - G, A - U, U - A, G - U, U - G} and are calculated by a dynamic programming procedure in which the known primary and secondary structure of the RNA is fixed, see Figure _{stem} values (data not shown). While calculating all helices of at least _{stem} length, we test which of these helices constitute competing alternatives to helices of the known secondary structure and record each such competing case in one of our statistics. These alternative helices may be part of a pseudo-knotted structure and we do not discard them. As each of the two bases _{stem} length. The remaining four classes, see Figure
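The search for competing helices can be illustrated with a short Python sketch. This is not the authors' program: the function names, the `min_loop` parameter, the restriction to maximal helices and the reading of "competing" as "re-using a base that the known structure pairs with a different partner" are all assumptions made for illustration. Only the minimum stem length of 9 and the six allowed base-pair types are taken from the text.

```python
# Allowed base-pairs as listed in the text (Watson-Crick plus G-U wobble).
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def find_helices(seq, min_stem=9, min_loop=3):
    """Return maximal helices as (i, j, length), i.e. pairs (i+k, j-k), k < length.

    min_loop (unpaired bases enclosed by the innermost pair) is an assumption
    of this sketch, not a value taken from the text.
    """
    n = len(seq)
    # inward[i][j] = number of consecutive pairs (i, j), (i+1, j-1), ...
    inward = [[0] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(i + min_loop + 1, n):
            if (seq[i], seq[j]) in PAIRS:
                inward[i][j] = 1 + inward[i + 1][j - 1]
    helices = []
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            # Skip helices that could be extended outward by one more pair.
            extendable = i > 0 and j < n - 1 and inward[i - 1][j + 1] > 0
            if inward[i][j] >= min_stem and not extendable:
                helices.append((i, j, inward[i][j]))
    return helices


def competing_helices(seq, known_pairs, min_stem=9):
    """Keep helices that use a base which the known structure pairs elsewhere."""
    partner = {}
    for a, b in known_pairs:
        partner[a], partner[b] = b, a
    result = []
    for i, j, length in find_helices(seq, min_stem):
        for k in range(length):
            a, b = i + k, j - k
            # Hypothetical criterion: a or b is paired with a different
            # partner in the known structure.
            if partner.get(a, b) != b or partner.get(b, a) != a:
                result.append((i, j, length))
                break
    return result
```

The table `inward` uses memory quadratic in the sequence length, which is manageable for sequences of rRNA size. Pseudo-knotted alternatives are not filtered out, in line with the text.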

**Definition of a competing, alternative helix.** Pictorial definition of a competing, alternative helix. The known base-pair between sequence positions

**Definition of the statistics.** Pictorial definitions of the four configurations 3'_{stem}. See the text for more explanation.

It is important to note that even without co-transcriptional folding, the destabilizing effects of competing

We proceed as follows to detect if co-transcriptional folding takes place: For every RNA sequence of the data set, we detect events of type 3'_{x}_{x}_{x }and _{x }for

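Definitions of the different statistics used.

| statistic | first type of weight | second type of weight |
| --- | --- | --- |
| 3'cis_{x} | 1/(( | \|G_{ci}\|/(( |
| 3'trans_{x} | 1/(( | \|G_{ci}\|/((c - |
| 5'cis_{x} | 1/(( | \|G_{ic}\|/(( |
| 5'trans_{x} | 1/(( | \|G_{ic}\|/(( |

| statistic | definition |
| --- | --- |
| cis_{x} | 5'cis_{x} - 3'cis_{x} |
| trans_{x} | 3'trans_{x} - 5'trans_{x} |
| 3'Cis_{x} | Σ_{#3'cis} 3'cis_{x} |
| 3'Trans_{x} | Σ_{#3'trans} 3'trans_{x} |
| 5'Cis_{x} | Σ_{#5'cis} 5'cis_{x} |
| 5'Trans_{x} | Σ_{#5'trans} 5'trans_{x} |
| Cis_{x} | 5'Cis_{x} - 3'Cis_{x} |
| Trans_{x} | 3'Trans_{x} - 5'Trans_{x} |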

where

We can now define the two statistics which are capable of measuring the two main types of asymmetry within each RNA sequence:

Cis_{x} = 5'Cis_{x} - 3'Cis_{x} and Trans_{x} = 3'Trans_{x} - 5'Trans_{x},

which can be calculated for both types of weights. Without co-transcriptional folding, the expectation value of these two statistics is zero. Co-transcriptional folding induces two types of asymmetries by suppressing the number of alternative helices which compete with the final helices (indicated by an increased number of

Without co-transcriptional folding, the introduced statistics have an expectation of zero; moreover, their distributions should be symmetric. The number of positive cases

where
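The aggregation into the two per-sequence statistics can be sketched in a few lines of Python. The sketch assumes that the competing events of one sequence have already been detected and weighted. The function name, the `events` list and the string labels are illustrative, and the weight formulas themselves are not reproduced here. Only the sign convention (5' minus 3' for cis events, 3' minus 5' for trans events) follows the table of definitions.

```python
from collections import defaultdict


def directedness(events):
    """Return (cis, trans) for one sequence.

    events: iterable of (kind, weight) tuples, where kind is one of
    "3'cis", "3'trans", "5'cis", "5'trans" (labels assumed for this sketch)
    and weight is the already computed weight of that competing event.
    """
    totals = defaultdict(float)
    for kind, weight in events:
        totals[kind] += weight  # sum the weights per event type
    cis = totals["5'cis"] - totals["3'cis"]
    trans = totals["3'trans"] - totals["5'trans"]
    return cis, trans


example_events = [("5'cis", 0.5), ("3'cis", 0.25), ("3'trans", 1.0), ("5'trans", 0.25)]
cis_value, trans_value = directedness(example_events)
```

The example weights are made-up numbers chosen only to show the call; they are not values from the data sets.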

Data

All 16S rRNA, 23S rRNA as well as Group I and Group II type intron sequences with completely known secondary structures were downloaded from the Comparative RNA Web (CRW) Site

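Composition of the two data sets.

| Data set | Taxonomic unit | all | 16S rRNA | 23S rRNA | Group I | Group II |
| --- | --- | --- | --- | --- | --- | --- |
| A | Archaea | 28 | 22 | 6 | 0 | 0 |
| A | Bacteria | 277 | 232 | 45 | 0 | 0 |
| A | Eukaryotes | 41 | 35 | 6 | 0 | 0 |
| A | Chloroplasts | 6 | 6 | 0 | 0 | 0 |
| A | Mitochondria | 9 | 9 | 0 | 0 | 0 |
| A | Sum | 361 | 304 | 57 | 0 | 0 |
| B | Eukaryotes | 15 | 0 | 0 | 15 | 0 |
| B | Bacteria | 5 | 0 | 5 | 0 | 0 |
| B | Chloroplasts | 5 | 0 | 5 | 0 | 0 |
| B | Mitochondria | 23 | 0 | 17 | 0 | 6 |
| B | Sum | 48 | 0 | 27 | 15 | 6 |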

Organellar 23S rRNA sequences frequently contain Group I introns and recent research revealed that the 23S rRNA of several hyperthermophilic bacteria also have Group I intron

rRNA genes in bacteria are encoded in the so-called rrn-operon (see for example

We divided the gathered sequences into two sets: data set A, which consists of all RNA sequences that are thought to correspond to the originally transcribed sequence units, and data set B, which contains all those RNA sequences that are known to differ from the originally transcribed sequence units. Data set B thus contains the Group I and II intron sequences as well as the organellar and hyperthermophilic bacterial 23S rRNA sequences. As we know neither the sequence nor the secondary structure of the original transcript units from which the sequences of data set B were derived, we are limited to detecting the effects of co-transcriptional folding within these shorter sequences. We expect this to be much more difficult than in sequences that correspond to the originally transcribed sequence units, as co-transcriptional folding introduces long-range effects which become harder to detect the shorter the investigated sub-sequence is. See Table

Results

We calculated the 3'Cis_{x}, 3'Trans_{x}, 5'Cis_{x} and 5'Trans_{x} values for both types of weights, as well as the Cis_{x} and Trans_{x} values, again for both

**Distribution of Cis and Trans values.** Distribution of

Average values for different statistics. Final values of the different statistics which were obtained by averaging the values of each sequence in the data set. The error shown is the standard deviation.

| weight | dataset | 3'Cis_{x} | 3'Trans_{x} | 5'Cis_{x} | 5'Trans_{x} | Cis_{x} | Trans_{x} |
| --- | --- | --- | --- | --- | --- | --- | --- |
| first type | A | 0.215 ± 0.009 | 0.461 ± 0.032 | 0.285 ± 0.009 | 0.382 ± 0.032 | 0.070 ± 0.004 | 0.079 ± 0.026 |
| first type | B | 0.298 ± 0.040 | 0.562 ± 0.086 | 0.296 ± 0.043 | 0.521 ± 0.075 | -0.003 ± 0.015 | 0.041 ± 0.082 |
| second type | A | 2.916 ± 0.106 | 6.236 ± 0.431 | 3.710 ± 0.111 | 5.134 ± 0.354 | 0.794 ± 0.061 | 1.102 ± 0.384 |
| second type | B | 3.392 ± 0.406 | 7.033 ± 1.050 | 3.362 ± 0.456 | 6.380 ± 0.954 | -0.030 ± 0.184 | 0.653 ± 1.253 |

The first thing to note in Figure

The mean values of

A

A

In addition,

Overall, we can thus conclude from the average values in Table

In order to quantify the influence of co-transcriptional folding further, we calculated two statistics, a t-test for the hypothesis that the given statistics have an expectation value of zero as well as the p-value of the number of positive cases for our two co-transcriptional folding indicators, see Table

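Statistical significance of results. p-values of t-test for the hypothesis that the final values in Table 3 have an expectation value of zero as well as the p-values for the hypothesis that the number of positive cases follows a binomial distribution with parameter 0.5.

| statistic | weight | A: p-value for t-test | A: p-value for number of positive cases | B: p-value for t-test | B: p-value for number of positive cases |
| --- | --- | --- | --- | --- | --- |
| Cis_{x} | first type | < 0.0001 | < 0.0001 | 0.5733 | 0.6137 |
| Cis_{x} | second type | < 0.0001 | < 0.0001 | 0.5650 | 0.6137 |
| Trans_{x} | first type | 0.0012 | < 0.0001 | 0.3093 | 0.8068 |
| Trans_{x} | second type | 0.0021 | < 0.0001 | 0.3011 | 0.5000 |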
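The two kinds of test can be sketched with the Python standard library. This is an illustration under stated assumptions rather than the procedure actually used: the function names are hypothetical, the sign test is computed as an exact one-sided binomial tail, and only the t statistic is returned, since converting it into a p-value needs a Student t distribution that the standard library does not provide. Whether the reported p-values were obtained exactly or by an approximation is not stated in this text, so the sketch is not guaranteed to reproduce the tabulated numbers.

```python
import math
import statistics


def t_statistic(values):
    """One-sample t statistic for the null hypothesis of zero mean."""
    n = len(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return statistics.mean(values) / standard_error


def sign_test_p(n_positive, n_total):
    """One-sided tail P(X >= n_positive) for X ~ Binomial(n_total, 0.5)."""
    tail = sum(math.comb(n_total, k) for k in range(n_positive, n_total + 1))
    return tail / 2 ** n_total
```

For a data set, `values` would hold one Cis or Trans value per sequence, `n_total` the number of sequences (361 for data set A, 48 for data set B) and `n_positive` the number of sequences with a positive value.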

Discussion

Recent experimental studies

Although our statistics are able to reveal two general effects of co-transcriptional folding within data set A, we cannot conclude that they would be powerful enough to serve as a reliable indicator of co-transcriptional folding for single RNA sequences, as some of the sequences in data set A may not correspond to the originally transcribed sequence units. In addition, all of our statistics consider only a first-order effect of co-transcriptional folding by studying alternative helices for the known helices, but do not take higher-order effects into account, such as alternative helices of alternative helices.

Based on computer simulations, H. Isambert et al.

Conclusions

To summarize, our findings show that co-transcriptional folding is a guiding principle in the formation of functional RNA structure and that it can influence both the primary and potential secondary structures of an RNA molecule. This has several implications. Current algorithms for RNA secondary structure prediction can probably be improved by adopting co-transcriptional folding as a guiding principle rather than only free energy minimization. This may hopefully provide the extra information needed to be able to reliably detect RNA genes

Most importantly, co-transcriptional folding should lead to a better understanding of

In this study, we neither attempted to study the effects that co-transcriptional folding may have on sequences that are transcribed together (e.g. genes in an operon) nor to study the influence that the binding by proteins or RNA sequences or RNA editing may have on the co-transcriptional folding pathway and the final, functional RNA structure. This will almost certainly require more refined investigation methods, but we hope that this study provides enough insight and motivation to start to tackle these exciting questions.

Authors' contributions

I.M.M. proposed this work and contributed the main idea for the statistics. I.M. selected the data and evaluated the statistical significance of the results. Both authors shared the programming tasks and the writing of the manuscript.

Acknowledgments

I.M.M. acknowledges support from EPSRC grant HAMJW and MRC grant HAMKA. I.M. is supported by a Békésy György postdoctoral fellowship.