A comparison of RNA folding measures

Freyhult, Eva; Gardner, Paul P; Moulton, Vincent

doi:10.1186/1471-2105-6-241

Research article
Open access
Published: 03 October 2005

A comparison of RNA folding measures

Eva Freyhult¹,
Paul P Gardner² &
Vincent Moulton³

BMC Bioinformatics volume 6, Article number: 241 (2005) Cite this article

10k Accesses
74 Citations
Metrics details

Abstract

Background

In the last few decades there has been a great deal of discussion concerning whether or not noncoding RNA sequences (ncRNAs) fold in a more well-defined manner than random sequences. In this paper, we investigate several existing measures for how well an RNA sequence folds, and compare the behaviour of these measures over a large range of Rfam ncRNA families. Such measures can be useful in, for example, identifying novel ncRNAs, and indicating the presence of alternate RNA foldings.

Results

Our analysis shows that ncRNAs, but not mRNAs, in general have lower minimal free energy (MFE) than random sequences with the same dinucleotide frequency. Moreover, even when the MFE is significant, many ncRNAs appear to not have a unique fold, but rather several alternative folds, at least when folded in silico. Furthermore, we find that the six investigated measures are correlated to varying degrees.

Conclusion

Due to the correlations between the different measures we find that it is sufficient to use only two of them in RNA folding studies, one to test if the sequence in question has lower energy than a random sequence with the same dinucleotide frequency (the Z-score) and the other to see if the sequence has a unique fold (the average base-pair distance, D).

Background

Noncoding RNAs (ncRNAs) are sequences that are transcribed from DNA that function as RNA rather than being translated to protein. Many of the known ncRNAs, such as transfer RNA (tRNA), ribosomal RNA (rRNA), spliceosomal RNA (snRNA), and microRNAs (miRNA), have key functions in the cell. Moreover, various new families of ncRNAs are emerging, and, as indicated in recent studies in mouse [1] and 10 human chromosomes [2], many more transcripts are for ncRNAs than was previously expected. In the late 1980's Maizel and co-workers proposed the use of thermodynamic stability to identify noncoding RNAs in sequence data [3–5]. Since then, there has been a great deal of discussion concerning whether or not ncRNA sequences support secondary structure features that are significantly different from those of random sequences. In particular, following some contradictory results concerning the stability of messenger RNAs (mRNA) presented in [6–8], in [9] it was concluded that ncRNAs have more stable structures than random sequences, but that the difference is not significant enough to be of use in identifying novel RNAs in sequence data on its own (see also [10]). Even so, more recent findings suggest that thermodynamic stability can be used to identify novel members of special families of RNAs [11], and that stability coupled with comparative genomics data is a useful tool for identifying ncRNAs in general [12].

To shed more light on the above findings, we present a large scale investigation for how well ncRNA sequences fold compared with random sequences. In particular, we investigate six measures for how well an RNA sequence folds (normalised energy (dG), Z-score (Z) and p-value (p) of minimal free energy (MFE), Shannon entropy (Q), average base pair distance (D), and valley index (VI), for definitions see the Methods section), and compare the behaviour of these measures over a large range of Rfam ncRNA families (see Table 1), including many of the families that appeared in the studies mentioned above.

Table 1 The data sets used in this study. The first column contains a short name describing the data set, that is later used in text and figures. The RNA families/data sets can contain several types of sequences, such as the RNase family that contains both RNase P and RNase MRP. The different sequence types, or family members, are given in column two. In column three and four the number of family members (N_FM) and the total number of sequences (N_S) are given, respectively. The last two columns in the table give the mean and standard deviation of the sequence length and %GC-content.

Full size table

Methods

Data sets

All data sets, except the protein control and the ribosomal RNA data sets, were obtained from Rfam 6.1 [13]. Rfam seed alignments were used to select a collection of RNA families, which are specified in Table 1. The rRNA data set consists of a large representative subset of the eukaryotic SSU rRNA sequences in the European rRNA database (see additional file 1 for further details).

For each class of RNA we obtained an alignment of sequences, which we filtered so that it had no more than 80% sequence identity. This was done using the program weight, that is part of the Sean Eddy "squid" utilities (downloaded 2004 from http://selab.wustl.edu/cgi-bin/selab.pl?mode=software#squid).

In addition to the 13 data sets specified in Table 1, two control data sets were included; a protein control data set, consisting of 32 small protein coding sequences, and a set of shuffled RNA sequences (see additional file 1 for further details). The shuffled data set consists of 10 sequences from each of the 13 RNA data sets that were permuted, preserving dinucleotide frequencies [14], resulting in 130 sequences.

RNA folding statistics

Several quantities have been proposed for predicting how well an RNA molecule folds. In this paper we consider the following: The normalised minimal free energy (MFE) per base-pair (dG), the Z-score (Z), the p-value (p), the Shannon entropy (Q), the average base-pair distance (D), and the valley index (VI). We now present formal definitions for each of these measures. Let x = x₁ ⋯ x_L, denote an RNA sequence of length L, so that x_iis either A, C, G or U for each 1 ≤ i ≤ L.

The normalised energy, dG, is arrived at from a free energy minimisation procedure. It is defined as

$d G (x) : = \frac{E (x)}{L},$

where E(x) is the minimal free energy (MFE) for sequence x, as computed using RNAfold [15]. This program implements the folding algorithm presented in [16].

The Z-score and the p-value compare the MFE of the sequence x to the MFEs of permuted versions of x having identical dinucleotide composition. These compositions are preserved due to the importance of stacked base-pairs in the calculation of MFE [8]. For each sequence in this study, 500 shuffled sequences were generated using a mono- and dinucleotide frequency preserving procedure implemented in the program shuffle that is part of the Sean Eddy "squid" utilities.

The Z-score [17] is the number of standard deviations by which the MFE of x deviates from the mean MFE of the set $X_{s h u f f l e d} (x)$ of shuffled sequences [6, 8, 9, 17]. It is defined as

$Z (x) : = \frac{E (x) - < X_{s h u f f l e d} (x) >}{σ (X_{s h u f f l e d} (x))},$

where <·> and σ(·) denote the mean and the standard deviation of the MFEs of the sequences in $X_{s h u f f l e d} (x)$ .

The p-value of x is the fraction of sequences in $X_{s h u f f l e d} (x)$ having MFE lower than x or, expressed differently, the area under the distribution function to the left of the MFE of x. It is defined as

$p (x) : = \frac{M}{N},$

where M is the number of sequences in $X_{s h u f f l e d} (x)$ with MFE lower than the MFE of x, and N is the number of shuffled sequences, | $X_{s h u f f l e d} (x)$ |.

In vivo, RNAs commonly exist in an ensemble of structures. The distribution of these structures can be modelled by a Boltzmann distribution. Using this setup, it is possible to efficiently compute the partition function, Z, for the ensemble $S (x)$ of secondary structures corresponding to an RNA sequence x [18]. In particular, the probability of a structure S_α∊ $S (x)$ (which we regard as a set of base-pairs) is given by $P (S_{α}) = \frac{e^{- E_{α} / R T}}{Z}$ , where $Z = \sum_{S_{α} \in S (x)} e^{- E_{α} / R T}$ , E_αis the free energy of S_α, R = 8.31451 Jmol^-1K^-1 is the molar gas constant, and T is the temperature, which we take as 310.15 K (37°C). The base-pair probability p_ij(the probability that x_ipairs with x_j) is then given by $p_{i j} = \sum_{S_{α} \in S (x)} P (S_{α}) δ_{i j}^{α}$ is 1 if x_iand x_jis a base-pair in S_α, and 0 otherwise.

We use the implementation of McCaskill's algorithm in RNAfold to compute base-pair probabilities. The normalised Shannon entropy of x [19] is then defined as

$Q (x) : = \frac{- \sum_{i < j} p_{i j} \log_{2} (p_{i j})}{L} .$

We can also use base-pair probabilities to compute the average base pair distance between all structures in $S (x)$ , <d_BP> as follows (I.Hofacker, pers. commun.). (Version 1.5beta of RNAfold output this measure as "ensemble diversity".) The base-pair distance, d_BP(S_α, S_β) between two structures S_αand S_βon x is defined as the number of base-pairs not shared by the structures S_αand S_β(see e.g. [20]). Hence, if |S_α| is the number of base-pairs in S_α, i.e. $| S_{α} | = \sum_{i < j} δ_{i j}^{α'}$ , where i and j lie between 1 and L, then the base-pair distance between structures S_αand S_βequals

$\begin{matrix} d_{B P} (S_{α}, S_{β}) = | S_{α} \cup S_{β} | - | S_{α} \cap S_{β} | =| S_{α} | + | S_{β} | - 2 | S_{α} \cap S_{β} | \\ = \sum_{i < j} (δ_{i j}^{α} + δ_{i j}^{} - 2 δ_{i j}^{} δ_{i j}^{}) . \end{matrix}$

In particular,

$< d_{B P} > = \frac{1}{2} \sum_{S_{α}, S_{β} \in S (x)} [P (S_{α}) P (S_{β}) \sum_{i < j} (δ_{i j}^{α} + δ_{i j}^{β} - 2 δ_{i j}^{α} δ_{i j}^{β})] .$

Since $| S_{α} | = \sum_{i < j} δ_{i j}^{α'}$ , <d_BP> can thus be rewritten as

$\begin{array}{l} < d_{B P} > = \frac{1}{2} \sum_{S_{α}, S_{β} \in S (x)} [P (S_{α}) P (S_{β}) \sum_{i < j} (δ_{i j}^{α} + δ_{i j}^{β} - 2 δ_{i j}^{α} δ_{i j}^{β})] \\ = \frac{1}{2} \sum_{i < j} [\underset{p_{i j}}{\underset{}{\sum_{S_{α} \in S (x)} P (S_{α}) δ_{i j}^{α}}} \underset{1}{\underset{}{\sum_{S_{β} \in S (x)} P (S_{β})}} + \underset{1}{\underset{}{\sum_{S_{α} \in S (x)} P (S_{α})}} \underset{p_{i j}}{\underset{}{\sum_{S_{β} \in S (x)} P (S_{β}) δ_{i j}^{β}}} \\ - 2 \underset{p_{i j}}{\underset{}{\sum_{S_{α} \in S (x)} P (S_{α}) δ_{i j}^{α}}} \underset{p_{i j}}{\underset{}{\sum_{S_{β} \in S (x)} P (S_{β}) δ_{i j}^{β}]}} \\ = \frac{1}{2} \sum_{i < j} [p_{i j} + p_{i j} + 2 p_{i j} p_{i j}] \\ = \sum_{i < j} (p_{i j} - p_{i j}^{2}) . \end{array}$

Thus normalising by length, the average base-pair distance is given by

$D (x) : = \frac{\sum_{i < j} (p_{i j} - p_{i j}^{2})}{L} .$

The last measure that we consider in this study is the valley index (VI) [21]. It can be regarded as an approximation to D (see below), and is meant to measure the number of "valleys" in the RNA folding landscape of x.

Formally it is defined as follows: List the suboptimal structures of x according to their free energies so that S_opt, an MFE structure for x, is first and S₁,..., S_nare the next n structures on x with E_opt≤ E₁ ≤ ⋯ ≤ E_n. Put S_subopt= {S_opt, S₁,..., S_n}, and define

$V I (x) : = \frac{\sum_{S_{α}, S_{β} \in S_{s u b o p t}} d_{B P n o r m} (S_{α}, S_{β}) w (α) w (β)}{\sum_{S_{α}, S_{β} \in S_{s u b o p t}} w (α) w (β)},$

where $w (α) = e^{- (E_{α} - E_{o p t}) / R T}$ is the Boltzmann factor, and

$d_{B P n o r m} (S_{α}, S_{β}) : = \frac{d_{B P} (S_{α}, S_{β})}{L} .$

Note that our definition of VI differs slightly from the Kitagawa et al.'s definition since we use normalised base-pair distance, d_BPnorm, rather than the coarse-grained tree metric in their study. The suboptimal structures S₁,..., S_nare randomly sampled with probabilities equal to their Boltzmann weights using the program RNAsubopt [22]. We sample 300 structures resulting in between 16 (regulatory) and 300 (telomerase) unique structures.

In principle, the valley index for an RNA with a low number of valleys in the folding landscape should be low, whereas an RNA with a multi-valley folding landscape should have a correspondingly higher index. Note that the sums in the definition on VI are taken over all structures in a set of suboptimal structures within a certain energy distance from the MFE. If the energy distance is increased this set of structures will eventually include all the sequences in the ensemble $S (x)$ . In this situation, in view of the definition of w(α) it follows that the valley index of x can be rewritten as

$\begin{matrix} V I (x) = \frac{\sum_{S_{α}, S_{β} \in S (x)} d_{B P n o r m} (S_{α}, S_{β}) (e^{- (E_{α} - E_{o p t}) / R T}) (e^{- (E_{β} - E_{o p t}) / R T})}{\sum_{S_{α}, S_{β} \in S (x)} (e^{- (E_{α} - E_{o p t}) / R T)}) (e^{- (E_{β} - E_{o p t}) / R T})} \\ = \frac{\sum_{S_{α}, S_{β} \in S (x)} d_{B P n o r m} (S_{α}, S_{β}) (e^{- E_{α} / R T}) (e^{- E_{β} / R T})}{\sum_{S_{α}, S_{β} \in S (x)} (e^{- E_{α} / R T}) \sum_{S_{β} \in S (x)} (e^{- E_{β} / R T})} \\ = \frac{\sum_{S_{α}, S_{β} \in S (x)} d_{B P n o r m} (s_{α}, S_{β}) (e^{- E_{α} / R T}) (e^{- E_{β} / R T})}{Z^{2}} \\ = \sum_{S_{α}, S_{β} \in S (x)} \sum P (S_{α}) P (S_{β}) d_{B P n o r m} (S_{α}, S_{β}) = 2 D (x) . \end{matrix}$

Thus, $\frac{1}{2} V I (x)$ can be thought of as an approximation of D(x) in case the set S_subopt(x) used in the computation of VI(x) is a proper subset of $S (x)$ .

Results and discussion

Comparison of measures

The six measures that we investigated are correlated to varying degrees; see Table 2 and Figure 1. The measures Q and D are highly correlated (correlation coefficient = 0.98), which could be due to the fact that they are both computed using McCaskill base pair probabilities, p_ij. Also, as expected, the Z-score and p-value are strongly correlated, but not in a linear fashion (see Figure 1). We see that the Z-scores are more sensitive for low values than the p-values (e.g. all Z-scores below -3 correspond to a p-value of 0.0), and so Z-scores are more informative.

Table 2 Correlations between measures. Correlation coefficients between the different measures, values above 0.5 are in bold.

Full size table

The statistic dG is weakly correlated to all other measures. However, it is interesting to note that dG is negatively correlated to %-GC. This is to be expected since GC base pairs have lower energy than the other possible base-pairings. The miRNA family is an exception to this rule, since it has low dG values, but an average %-GC of about 50%, see Figure 1.

Table 2 shows that the correlation between VI and the other measures is low over all families. However, Figure 1 indicates that for a subset of all the sequences the correlation between VI and Q or D is very strong. This is also confirmed by computing the correlation coefficients for the 15 RNA families separately. miRNA, SRP, tRNA, telomerase, and Hh1 show strong correlations (> 0.65) between VI and Q or D, whereas the corresponding correlations for rRNA, snRNA, riboswitch, regulatory, and snoRNA are weak (< 0.3).

Comparison between RNA families

In general, we deem an RNA sequence to have a stable secondary structure if the measures dG, Z, and p are significantly lower than the corresponding values for the shuffled control data sets. To check whether this was the case for the different data sets, we applied a Mann-Whitney rank sum test [23]. This test compares two data sets and computes the probability that the two data sets are sampled from the same distribution. Unlike the t-test, the Mann-Whitney test is distribution free since it compares the ranks of the data values instead of the data values themselves.

At a significance level of 99% the Mann-Whitney test indicated that Z and p are higher for the shuffled data set than for any of the real RNA data sets, except for mRNA and Hh1. The same held for the normalised energy dG, except for the tmRNA, tRNA, regulatory and snoRNA families. This result agrees with those observed in [10], that ncRNAs have significantly lower Z-score than unstructured sequences. This can also be seen in Figure 2.

The measures Q and D can be used to indicate whether a sequence folds into a unique secondary structure or into several alternative structures [24]. The riboswitch data set consists of sequences known to have alternative structures, and so we expected the values of Q and D to be rather high for this data set. We did find this to be the case, but surprisingly they were also as high or even higher for other data sets (see Figure 2).

The high values of Q and D obtained for the mRNA and shuffled data sets is probably due to the fact that these RNAs are unstructured, and hence there are many alternative possible structures. This could also explain the values of Q and D for tmRNA, since tmRNAs are to a large extent mRNA-like (large parts of such molecules are unstructured). Other RNA families like tRNA and RNAse have tertiary interactions that aren't included in secondary structure, which explains their relatively high Q- and D-values. The interaction of rRNAs and snoRNAs with proteins and other RNAs most likely stabilise their native structures, even though alternative structures are possible.

The values of our measures for the telomerase sequences were unexpected. Telomerase has low energy per base, yet it has a rather high Z-score compared to the other ncRNAs. The high stability of this molecule is most likely due to an unusual sequence composition; the telomerase sequences have a high %-GC level, 65% (see Figure 3). The high values of Q and D suggest that the telomerase sequences have alternative structures.

The miRNAs have very stable structures, indicated by low Z and dG, especially in view of their %GC level (~50%). This has previously been observed in [11]. The miRNAs also have low values of Q, D, and VI, indicating a unique structure.

Comparison with previous studies

Seffens and Digby [6] examined 51 mRNA sequences and observed that they have lower folding energy than shuffled versions of the sequences preserving mono- but not dinucleotide frequencies. Shortly after, Workman and Krogh examined 46 of the 51 mRNAs and showed that they do not have lower folding energy than shuffled versions of the sequences, when the dinucleotide frequencies are preserved [8]. In our study, in which sequences were shuffled so as to preserve both mono- and dinucleotide frequencies, we confirm that mRNAs do not have lower folding energy than shuffled sequences. In [8] a small sample of rRNA and tRNA sequences were also investigated and it was indicated that rRNA, but not tRNA has lower folding energy than dinucleotide shuffled sequences. Our study, with significantly more data, agrees with their findings for rRNA, but differs for tRNAs, which we found to have significantly lower Z-scores than shuffled sequences. Rivas and Eddy [9] argue that secondary structure alone is generally not significant for the detection of ncRNA, but note that ncRNAs have slightly lower folding energies than shuffled sequences. Note that in [9] sequences are shuffled preserving mononucleotides only, whereas in our study we shuffled sequences preserving dinucleotide frequencies. Rivas and Eddy computed Z-scores for a large set of tRNAs, and even though we adopt a different shuffling procedure, our results for tRNA are in good agreement with Rivas and Eddy's findings.

Kitagawa et al. [21] observed that five snRNAs have low folding energies compared to shuffled sequences. Our studies confirm this observation, and in general we found that snRNA sequences have lower folding energies than shuffled sequences with the same dinucleotide frequency. Kitagawa et al. also computed VI values for the same five snRNAs, and observed that the values varied considerably (indicating that some have uni-valley landscapes while other have multi-valley landscapes). Although we used a variant of VI, we also found that the VI value varies considerably for different snRNA sequences.

Bonnet et al. observed that miRNAs have considerably lower folding energy than dinucleotide shuffled sequences, unlike tRNA and rRNA [11]. Our studies confirm this observation, although Bonnet et al. investigated shorter regions of the rRNA, while we investigated full rRNA sequences.

In our study, we found the mean Z-scores (and p-values) to be significantly lower for ncRNAs (except the Hammerhead type I family) than for the shuffled sequences (although the Z-scores for mRNA were not lower). This is in agreement with recent results presented in [10], where it is shown that non-coding RNAs have lower Z-scores than coding RNAs for a selection of RNA families (tRNA, Hammerhead type III, a regulatory element (SECIS), SRP, snRNA (U1 and U2), mRNA (divided into coding sequence and 5'- and 3'-untranslated regions)).

Conclusion

We have studied six previously defined measures for predicting how well an RNA molecule is expected to fold (dG, Z, p, Q, D, and VI), and applied them to a large collection of RNAs from the Rfam database. We found all of these measures to be correlated to some degree. The measures Z and p are strongly correlated, but Z is more sensitive than p. Since dG is a measure of MFE it is strongly correlated to the nucleotide composition of the sequence, and so a low dG does not necessarily imply a stable structure. Hence, it is probably sufficient to use Z as opposed to p and dG. For the families that we used in this study, we found the mean Z-scores (and p-values) to be significantly lower for ncRNAs than for the shuffled sequences.

The three measures Q, D and VI can be regarded as measures of the ruggedness of the RNA folding landscape. Both Q and D are computed from the partition function and are thus strongly correlated, and so either of them is probably sufficient for measuring ruggedness. The valley index VI can be viewed as an approximation of the average base-pair distance D (see Methods section), and so there is no advantage in computing VI, especially since it is slow to compute, whereas D can be computed efficiently. RNA families having high values of D (and Q) were either unstructured RNA sequences, long RNA sequences that fold with the help of proteins, or RNAs with alternative folds or pseudoknot structures.

Thus, in summary, we expect that rather than using all of dG, Z, p, Q, D, and VI to predict how well an RNA molecule folds, that it is sufficient to use only Z and D (or Q). Our studies suggest that a combination of Z-score and D value might be useful for identifying well-defined RNA structures, such as the miRNAs (in agreement with results presented in [11]), and, based on our results, we expect that variations of these measures (such as the alignment Z-scores described in [12]), will provide a useful tool for the general problem of RNA structure identification.

References

Suzuki M, Hayashizaki Y: Mouse-centric comparative transcriptomics of protein coding and non-coding RNAs. Bioessays 2004, 26: 833–843. 10.1002/bies.20084
Article CAS PubMed Google Scholar
Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, Patel S, Long J, Stern D, Tammana H, Helt G, Sementchenko V, Piccolboni A, Bekiranov S, Bailey DK, Ganesh M, Ghosh S, Bell I, Gerhard D, Gingeras T: Transcriptional Maps of 10 Human Chromosomes at 5-Nucleotide Resolution. Science 2005, 308(5725):1149–54. 10.1126/science.1108625
Article CAS PubMed Google Scholar
Le S, Chen J, Currey K, Maizel JJV: A program for predicting significant RNA secondary structures. Comput Appl Biosci 1988, 4(1):153–159.
CAS PubMed Google Scholar
Le SY, Chen JH, Maizel JV: Thermodynamic stability and statistical significance of potential stem-loop structures situated at the frameshift sites of retroviruses. Nucleic Acids Res 1989, 17: 6143–6152.
Article PubMed Central CAS PubMed Google Scholar
Chen J, Le S, Currey K, Maizel J: A computational procedure for assessing the significance of RNA secondary structure. Comput Appl Biosci 1990, 6(1):7–18.
PubMed Google Scholar
Seffens W, Digby D: mRNAs have greater negative folding free energies than shuffled or codon choice randomized sequences. Nucleic Acids Res 1999, 27: 1578–1584. 10.1093/nar/27.7.1578
Article PubMed Central CAS PubMed Google Scholar
Schultes EA, Hraber PT, LaBean TH: Estimating the contributions of selection and self-organization in RNA secondary structure. J Mol Evol 1999, 49: 76–83.
Article CAS PubMed Google Scholar
Workman C, Krogh A: No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res 1999, 27: 4816–4822. 10.1093/nar/27.24.4816
Article PubMed Central CAS PubMed Google Scholar
Rivas E, Eddy SR: Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs. Bioinformatics 2000, 16: 583–605. 10.1093/bioinformatics/16.7.583
Article CAS PubMed Google Scholar
Clote P, Ferré F, Kranakis E, Krizanc D: Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency. RNA 2005, in press.
Google Scholar
Bonnet E, Wuyts J, Rouze P, Van de Peer Y: Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free energies than random sequences. Bioinformatics 2004, 20: 2911–2917. 10.1093/bioinformatics/bth374
Article CAS PubMed Google Scholar
Washietl S, Hofacker IL, Stadler PF: Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci USA 2005, 102: 2454–2459. 10.1073/pnas.0409169102
Article PubMed Central CAS PubMed Google Scholar
Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR: Rfam: an RNA family database. Nucleic Acids Res 2003, 31: 439–441. 10.1093/nar/gkg006
Article PubMed Central CAS PubMed Google Scholar
Altschul SF, Erickson BW: Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage. Mol Biol Evol 1985, 2: 526–538.
CAS PubMed Google Scholar
Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P: Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie 1994, 125: 167–188. 10.1007/BF00818163
Article CAS Google Scholar
Zuker M, Stiegler P: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res 1981, 9: 133–148.
Article PubMed Central CAS PubMed Google Scholar
Le SY, Maizel JV Jr: A method for assessing the statistical significance of RNA folding. J Theor Biol 1989, 138: 495–510.
Article CAS PubMed Google Scholar
McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structures. Biopolymers 1990, 29: 1105–1119. 10.1002/bip.360290621
Article CAS PubMed Google Scholar
Huynen M, Gutell R, Konings D: Assessing the reliability of RNA folding using statistical mechanics. J Mol Biol 1997, 267: 1104–1112. 10.1006/jmbi.1997.0889
Article CAS PubMed Google Scholar
Moulton V, Zuker M, Steel M, Pointon R, Penny D: Metrics on RNA secondary structures. J Comput Biol 2000, 7: 277–292. 10.1089/10665270050081522
Article CAS PubMed Google Scholar
Kitagawa J, Futamura Y, Yamamoto K: Analysis of the conformational energy landscape of human snRNA with a metric based on tree representation of RNA structures. Nucleic Acids Res 2003, 31: 2006–2013. 10.1093/nar/gkg288
Article PubMed Central CAS PubMed Google Scholar
Wuchty S, Fontana W, Hofacker IL, Schuster P: Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers 1999, 49: 145–165. 10.1002/(SICI)1097-0282(199902)49:2<145::AID-BIP4>3.0.CO;2-G
Article CAS PubMed Google Scholar
Mann H, Whitney D: On a test whether one of two random variables is stochastically larger than the other. Ann Math Statist 1947, 18: 50–60.
Article Google Scholar
Mathews DH: Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA 2004, 10: 1178–1190. 10.1261/rna.7650904
Article PubMed Central CAS PubMed Google Scholar

Download references

Acknowledgements

The authors thank Ivo Hofacker for documentation regarding the average base pair distance and Peter Clote for providing access to his work [10] before publication. PPG was supported by a Carlsberg Foundation Grant (21-00-0680).

Author information

Authors and Affiliations

The Linnaeus Centre for Bioinformatics, Uppsala University, Uppsala, Sweden
Eva Freyhult
Dept. of Evolutionary Biology, University of Copenhagen, Universitetsparken 15, 2100, Copenhagen Ø, Denmark
Paul P Gardner
School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, UK
Vincent Moulton

Authors

Eva Freyhult
View author publications
You can also search for this author in PubMed Google Scholar
Paul P Gardner
View author publications
You can also search for this author in PubMed Google Scholar
Vincent Moulton
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Vincent Moulton.

Additional information

Authors' contributions

EF was involved in selecting the data sets from Rfam and implementing the analyses. PPG developed the ideas presented in the paper and was involved in selecting the data sets from Rfam and implementing the analyses. VM was involved in developing the ideas presented in the paper. All authors contributed to the writing of this manuscript. All authors read and approved the final manuscript.

Electronic supplementary material

Additional File 1: Data sets. This zip-file contains all the sequences we have used for this study. (ZIP 456 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Freyhult, E., Gardner, P.P. & Moulton, V. A comparison of RNA folding measures. BMC Bioinformatics 6, 241 (2005). https://doi.org/10.1186/1471-2105-6-241

Download citation

Received: 03 May 2005
Accepted: 03 October 2005
Published: 03 October 2005
DOI: https://doi.org/10.1186/1471-2105-6-241

A comparison of RNA folding measures