Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

This article is part of the supplement: Selected articles from the First IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2011): Bioinformatics

Open Access Research

Maximum expected accuracy structural neighbors of an RNA secondary structure

Peter Clote123*, Feng Lou2* and William A Lorenz4

Author affiliations

1 Department of Biology, Boston College, Chestnut Hill, MA 02467, USA

2 Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud XI, 91405 Orsay cedex, France

3 Laboratoire d'Informatique (LIX), Ecole Polytechnique, 91128 Palaiseau, France

4 Department of Mathematics and Computer Science, Denison University, Granville, OH 43023-0810, USA

For all author emails, please log on.

Citation and License

BMC Bioinformatics 2012, 13(Suppl 5):S6  doi:10.1186/1471-2105-13-S5-S6

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/13/S5/S6


Published:12 April 2012

© 2012 Clote et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Since RNA molecules regulate genes and control alternative splicing by allostery, it is important to develop algorithms to predict RNA conformational switches. Some tools, such as paRNAss, RNAshapes and RNAbor, can be used to predict potential conformational switches; nevertheless, no existent tool can detect general (i.e., not family specific) entire riboswitches (both aptamer and expression platform) with accuracy. Thus, the development of additional algorithms to detect conformational switches seems important, especially since the difference in free energy between the two metastable secondary structures may be as large as 15-20 kcal/mol. It has recently emerged that RNA secondary structure can be more accurately predicted by computing the maximum expected accuracy (MEA) structure, rather than the minimum free energy (MFE) structure.

Results

Given an arbitrary RNA secondary structure S0 for an RNA nucleotide sequence a = a1,..., an, we say that another secondary structure S of a is a k-neighbor of S0, if the base pair distance between S0 and S is k. In this paper, we prove that the Boltzmann probability of all k-neighbors of the minimum free energy structure S0 can be approximated with accuracy ε and confidence 1 - p, simultaneously for all 0 ≤ k < K, by a relative frequency count over N sampled structures, provided that <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M1">View MathML</a>, where Φ(z) is the cumulative distribution function (CDF) for the standard normal distribution. We go on to describe the algorithm RNAborMEA, which for an arbitrary initial structure S0 and for all values 0 ≤ k < K, computes the secondary structure MEA(k), having maximum expected accuracy over all k-neighbors of S0. Computation time is O(n3 · K2), and memory requirements are O(n2 · K). We analyze a sample TPP riboswitch, and apply our algorithm to the class of purine riboswitches.

Conclusions

The approximation of RNAbor by sampling, with rigorous bound on accuracy, together with the computation of maximum expected accuracy k-neighbors by RNAborMEA, provide additional tools toward conformational switch detection. Results from RNAborMEA are quite distinct from other tools, such as RNAbor, RNAshapes and paRNAss, hence may provide orthogonal information when looking for suboptimal structures or conformational switches. Source code for RNAborMEA can be downloaded from http://sourceforge.net/projects/rnabormea/ webcite or http://bioinformatics.bc.edu/clotelab/RNAborMEA/. webcite

Background

RNA secondary structure conformational switches play an essential role in a number of biological processes, such as regulation of viral replication [1] and of viroid replication [2], regulation of R1 plasmid copy number in E. coli by hok/sok system [3], transcriptional and translational gene regulation in prokaryotes by riboswitches [4], regulation of alternative splicing in eukaryotes [5], and stress-responsive gene regulation in humans [6], etc. Due to the biological importance of conformational switches, several groups have developed algorithms that attempt to recognize switches - in particular, thermodynamics-based methods such as paRNAss[7], RNAshapes[8], RNAbor[9], as well as an approach using the second eigenvalue of the Laplacian matrix [10].

Riboswitches are portions of the 5' untranslated region (UTR) of messenger RNAs, experimentally known to regulate genes in bacteria by allostery [4], and to regulate alternative splicing of the gene NMT1 in the eukaryote Neurospora crassa [5]. Riboswitches are composed of a 5' aptamer and a 3' expression platform. Since the aptamer binds to a specific ligand with high affinity (KD ≈ 5 nM), thus triggering the conformational change of the expression platform upon ligand binding [11], its sequence and secondary structure tend to be highly conserved. In contrast, there is lower sequence for the expression platform, which forms a bistable switch, effecting gene regulation by premature abortion of transcription (as in guanine riboswitches [12]), or by sequestering the Shine-Dalgarno sequence (as in thiamine pyrophosphate riboswitches [13]). Due to the conserved sequence and secondary structure within the aptamer, all existent algorithms (to the best of our knowledge), such as [14-16], attempt only to detect riboswitch aptamers, without the expression platform. In addition to these specific algorithmic approaches, more general computational tools that rely on stochastic context free grammars, such as Infernal[17] and CMFinder[18], have been trained to recognize riboswitch aptamers; in particular, Infernal was used to create the Rfam database [19], which includes 14 families of riboswitch aptamers.

Since current riboswitch detection algorithms do not attempt to predict the location of the expression platform, we have developed tools, RNAbor-Sample and RNAborMEA, described in this paper, which yield information concerning alternative, or suboptimal, structures of a given RNA sequence. These tools can suggest the presence of a conformational switch; however, much more work must be done to actually produce a riboswitch gene finder, part of the difficulty due to the fact that riboswitch aptamers contain pseudoknots that cannot be captured by secondary structure.

In previous work [20,21], we described a novel program RNAbor to predict RNA conformational switches. For a given secondary structure S of a given RNA sequence s, the secondary structure T of s is said to be a k-neighbor of S, if the base pair distance between S and T is k. (Base pair distance is the minimum number of base pairs that must be either added or removed, in order to transform the structure S into T.) Given an arbitrary initial structure S0, for all values 0 ≤ k < K, the program RNAbor[20], computes the secondary structure MFE(k), having minimum free energy over all k-neighbors of S0. (Note that K ≤ 2 · n, since the base pair distance between any two secondary structures of a length n RNA sequence is at most 2 · n.) As well, RNAbor computes for each value 0 ≤ k K, the Boltzmann probability <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M2">View MathML</a>, where Z(k) is the sum of all Boltzmann factors exp(-E(S)/RT) of all structures S having base pair distance k from an initially given structure S0, and where the partition function Z is the sum of all Boltzmann factors of all secondary structures of the given RNA sequence. Here E(S) is the free energy of secondary structure S, with respect to the Turner energy model [22,23], R = 0.001987 kcal mol-1 K-1 is the universal gas constant, and T is absolute temperature. In the case that S0 is the minimum free energy structure, the existence of one or more 'peaks', or values k ≫ 0, where pk is relatively large, suggests that there are two or more low energy structures having large base pair distance k from S0 - i.e., a potentially distinct meta-stable structure, as shown in Figure 1.

thumbnailFigure 1. Output of RNAboron the 27 nt bistable switch with nucleotide sequence CUUAUGAGGG UACUCAUAAG AGUAUCC and initial structure S0, the minimum free energy (MFE) structure....... ((((((((....)))))))) with free energy -10.3 kcal/mol. The 16-neighbor of S0 is the metastable structure ((((((((....))))))))....... with free energy -9.9 kcal/mol. The MFE structure appears above the leftmost peak, while the MFE(16) structure appears above the rightmost peak. The output of RNAbor includes a graph of the Boltzmann probabilities <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M52','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M52">View MathML</a>, and MFE(k) structures, for all 0 ≤ k ≤ 2n. The existence of distinct 'peaks' suggests the presence of a conformational switch.

In [24], Do et al. introduced the notion of maximum expected accuracy (MEA) secondary structure, determined as follows: (i) compute base pairing probabilities p(i, j) using a trained stochastic context free grammar; (ii) compute probabilities <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M3">View MathML</a> that position i does not pair; (iii) using a dynamic programming algorithm similar to that of Nussinov and Jacobson [25], determine that secondary structure S having maximum score <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M4">View MathML</a>, where the first sum is over paired positions (i, j) of S and the second sum is over positions i located in loop regions of S, and where α, β >0 are parameters with default values 1. Subsequently Kiryu et al. [26] computed the MEA structure by replacing the stochastic context free grammar computation of base pairs in (i) by using McCaskill's algorithm [27], which computes the Boltzmann base pairing probabilities

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M5">View MathML</a>

(1)

The sum in the numerator is taken over all secondary structures of the given RNA sequence, that contain base pair (i, j), while the sum in the denominator is taken over all secondary structures of the given RNA sequence. Thus p(i, j) is the sum of the Boltzmann factors of all secondary structures that contain the fixed base pair (i, j), divided by the partition function, which latter is the sum of Boltzmann factors of all secondary structures. In fact, Kiryu et al. [26] describe an algorithm to compute the MEA structure common to all RNAs in a given alignment. Later, Lu et al. [28] rediscovered Kiryu's method; in addition, Lu et al. computed suboptimal MEA structures by implementing an analogue of Zuker's method [29].

Our motivation in developing both RNAbor-Sample and RNAborMEA, was to simplify and improve our previous software, RNAbor, in detecting conformational switches. Since RNAbor computes the minimum free energy structure, MFE(k), over all structures having base pair distance k from an initially given structure S0, a complex portion of the code in RNAbor concerns the retrieval of free energy parameters from the Turner model [22,23]. The idea of RNAborMEA was to compute the base pairing probabilities p(i, j) - see equation (1) - by McCaskill's algorithm using RNAfold, then to compute the maximum expected accuracy structure, MEA(k), which needs no retrieval of energy parameters, and which we hoped would be very similar to the MFE(k) structure, in light of previously mentioned results [26,28]. Surprisingly, it turns out that MEA(k) structures are quite different from MFE(k) structures, as shown later in one of the figures.

In this paper, we begin by showing rigorously how to approximate the output of RNAbor by frequency counts from sampling, using Sfold[30]. We then extend the MEA technique to compute the maximum expected accuracy k-neighbor of a given RNA secondary structure S0; i.e., that secondary structure which has maximum expected accuracy over all structures that differ from S0 by exactly k base pairs. By analyzing the family of purine riboswitches, obtained by retrieving full riboswitch sequences (aptamer and expression platform) from corresponding EMBL genomic data, by extending the aptamers from the seed alignment of Rfam family RF00167 [31], we show that our software RNAborMEA produces strikingly different results from other software that produce suboptimal structures (RNAbor, RNAbor-Sample, RNAlocopt, RNAshapes, UNAFold).

Since the detection of computational switches remains an open problem, despite the success of some tools such as RNAshapes and RNAbor, we feel the addition of the tool RNAborMEA could prove useful, since it appears to be orthogonal to all other methods of generating suboptimal secondary structures.

Results and discussion

In this paper, we describe the following new results, discussed in the 'Methods' section in greater detail with attendant definitions of unexplained concepts.

1. We describe a Python script RNAbor-Sample that approximates the output <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M6">View MathML</a> of RNAbor by frequency counts <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M7">View MathML</a> from sampled structures, for all 0 ≤ k ≤ 2n, using Sfold[30], or RNAsubopt -p[32].

2. We prove that for any desired accuracy 0 < ε and probability 0 < α <1, if at least

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M8">View MathML</a>

(2)

structures are sampled, then

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M9">View MathML</a>

(3)

for all 0 ≤ k < K; i.e., RNAbor-Sample furnishes estimates <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M10">View MathML</a> of pk, for all 0 ≤ k < K, which with confidence 1 - α are within ε of the actual values pk. Here, Φ(z) is the cumulative distribution function (CDF) for the standard normal distribution.

3. We develop an algorithm, RNAborMEA, running in time O(n3 · K2) and space O(n2 · K), which computes simultaneously for all 0 ≤ k K, the maximum expected accuracy k-neighbors of a given RNA secondary structure S0; i.e., for each 0 ≤ k K, RNAborMEA computes that structure Sk which has maximum expected accuracy over all structures that differ from S0 by exactly k base pairs. The algorithm RNAborMEA additionally computes, for each 0 ≤ k K, the pseudo partition function

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M11">View MathML</a>

Moreover, RNAborMEA allows the user to stipulate (partial) hard constraints, that stipulate whether particular nucleotides are unpaired, or base-pair with certain other nucleotides. The implementation of hard constraints follows ideas from Mathews [33], albeit suitably modified to simultaneously consider all k-neighbors, for 0 ≤ k K.

We now describe the 13 figures and 4 tables, corresponding to computational experiments performed with RNAbor-Sample and RNAborMEA. These tables and figures are only briefly described, and we refer the reader to the captions of the figures and tables, which explain the results in greater detail.

Figure 1 illustrates the presence of two peaks, corresponding to the Boltzmann probability of each of the metastable structures for a 27 nt bistable switch previously considered by Hofacker et al. Figure 2 displays the Boltzmann probabilities pk from RNAbor, Boltzmann probabilities estimates <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M12">View MathML</a> from RNAbor-Sample for the SAM riboswitch aptamer with GenBank accession code AP004597.1/11894-11904. Clearly, probability estimates <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M13">View MathML</a> are close to actual values pk. The figure additionally shows probabilities rk from our software RNAlocopt[34], computed by <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M14">View MathML</a>, where Z(LO) is the sum of Boltzmann factors of all locally optimal secondary structures, and Zk(LO) is the sum of all locally optimal k-neighbors of S0. A secondary structure S is said to be locally optimal, if its energy does not decrease by the addition or removal of a single (valid) base pair; i.e., E(S ∪ {(x, y)}) ≥ E(S), and E(S - {(x, y)}) ≥ E(S). Figure 3 displays the experimentally determined GENE ON and GENE OFF structures of an XPT guanine riboswitch from B. subtilis, taken from [35]. Figure 4 shows the outputs of RNAborMEA, RNAbor, and RNAshapes, which are most similar to the GENE ON structure from the previous Figure 3. Figures 5 and 6 determine the structural simlarity, as measured by the program NestedAlign[36], between that structure output by RNAborMEA (as well as structures output by RNAbor, RNAbor-Sample, RNAlocopt, RNAshapes, and UNAFold), which are most similar to the XPT purine riboswitch, displayed in Figure 3. Figure 5 determines the structural similarity to the GENE ON structure (left panel of Figure 3), while Figure 6 determines the structural similarity to the GENE OFF structure (right panel of Figure 3). None of the structural neighbors, or sampled structures, are identical to the GENE ON or GENE OFF structures; however, there are some candidates that bear some resemblance to those structures. At this point, we can say that RNAbor-Sample and RNAborMEA are methods that generate suboptimal structures, some of which may be similar to the metastable structures of a conformational switch; however, much additional work is necessary before a robust method can be developed to detect conformational switches.

thumbnailFigure 2. Boltzmann density plot for RNAbor, along with approximating relative frequency plots for RNAborMEAandRNAlocoptfor the 101 nt RNA sequence UACUUAUCAA GAGAGGUGGA GGGACUGGCC CGCUGAAACC UCAGCAACAG AACGCAUCUG UCUGUGCUAA AUCCUGCAAG CAAUAGCUUG AAAGAUAAGU U for the SAM riboswitch aptamter with GenBank accession code AP004597.1/118941-119041. The program RNAbor computes the Boltzmann probability <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M53','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M53">View MathML</a>, where <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M54','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M54">View MathML</a>, where S0 is the initial structure (taken as the minimum free energy here). The script RNAbor-Sample calls Sfold on 1000 structures, in order to compute a relative frequence fk pk of all k-neighbors of S0. Finally, we compute relative frequency of RNAlocopt[34], a program that samples only locally optimal secondary structures, having the property that one cannot obtain a lower energy structure by adding or removing a single base pair.

thumbnailFigure 3. GENE ON (left) and GENE OFF (right) secondary structures for the 148 nt. XPT guanine riboswitch from B. subtilis with sequence CACUCAUAUA AUCGCGUGGA UAUGGCACGC AAGUUUCUAC CGGGCACCGU AAAUGUCCGA CUAUGGGUGA GCAAUGGAAC CGCACGUGUA CGGUUUUUUG UGAUAUCAGC AUUGCUUGCU CUUUAUUUGA GCGGGCAAUG CUUUUUUU. Sequence and secondary structure taken from [35]. The free energy of GENE ON resp. GENE OFF secondary structrure is -16.46 kcal/mol resp. -22.6 kcal/mol. Free energies were determined using RNAeval and figures produced using RNAplot, both from the Vienna RNA Package [40].

thumbnailFigure 4. Given riboswitch sequence X83878/168-267 and initial structure S0, the minimum free energy structure, a structure output by RNAborMEAis most structurally similar to the XPT GENE ONstructure, as measured by NestedAlign[36]. The NestedAlign score for RNAborMEA is 87.5, while optimal score for RNAbor is 60.0, and for RNAshapes is 64.0.

thumbnailFigure 5. For each RNA sequence in the seed alignment from Rfam family RF00167 of purine riboswitch aptamers, we retrieved downstream flanking residues from the appropriate EMBL files, in order to ensure likelihood that the expression platform was included. Then the following six programs were run: RNAbor, RNAborMEA, RNAbor-Sample, RNAlocopt, RNAshapes, UNAFold. Each program outputs a number of near-optimal secondary structures, each according to different criteria. Taking RNAbor and RNAborMEA as examples, the programs RNAbor and RNAborMEA were run, in order to compute the MFE(k) structure and the MEA(k) structure, which have minimum free energy resp. maximum expected accuracy among all k-neighbors of the intial minimum free energy structures S0. Subsequently, we applied the program NestedAlign described in [36] to compute the structural similarity between the experimentally determined GENE ON structure for XPT guanine riboswitch of B. subtilis; i.e. the left panel of Figure 3. (Similar structures have positive scores; dissimilar structures have negative scores.) For each RNA in the seed alignment of RF00167, we determined the value k1, such the MEA(k1) structure for that RNA has the greatest structural similarity with the XPT GENE ON structure, as determined by NestedAlign. (See the left panel of Figure 3 for the experimentally determined GENE ON structure of XPT.) As earlier explained, we performed similar computations for the programs RNAshapes[39] and UNAFold [41], the programs RNAborMEA and RNAbor-Sample, described in this paper, and programs RNAbor[9] and RNAlocopt[34], developed by our lab. In 21 out of 34 instances, RNAborMEA produced the secondary structure most structurally similar to the experimentally determined XPT GENE OFF structure, as measured by NestedAlign.

thumbnailFigure 6. For each RNA sequence in the seed alignment from Rfam family RF00167 of purine riboswitch aptamers, we retrieved downstream flanking residues from the appropriate EMBL files, in order to ensure likelihood that the expression platform was included. Then the following six programs were run: RNAbor, RNAborMEA, RNAbor-Sample, RNAlocopt, RNAshapes, UNAFold. Each program outputs a number of near-optimal secondary structures, each according to different criteria. Taking RNAbor and RNAborMEA as examples, the programs RNAbor and RNAborMEA were run, in order to compute the MFE(k) structure and the MEA(k) structure, which have minimum free energy resp. maximum expected accuracy among all k-neighbors of the intial minimum free energy structures S0. Subsequently, we applied the program NestedAlign described in [36] to compute the structural similarity between the experimentally determined GENE OFF structure for XPT guanine riboswitch of B. subtilis; i.e. the right panel of Figure 3. (Similar structures have positive scores; dissimilar structures have negative scores.) For each RNA of the seed alignment of RF00167, we determined the value k1, such the MEA(k1) structure for that RNA has the greatest structural similarity with the XPT GENE OFF structure, as determined by NestedAlign. (See the right panel of Figure 3 for the experimentally determined GENE OFF structure of XPT.) As earlier explained, we performed similar computations for the programs RNAshapes[39] and UNAFold [41], the programs RNAborMEA and RNAbor-Sample, described in this paper, and RNAbor[9] and RNAlocopt[34]. In 22 out of 34 instances, RNAborMEA produced the secondary structure most structurally similar to the experimentally determined XPT GENE OFF structure, as measured by NestedAlign.

Figure 7 shows that the MEA(k) structural neighbors, as computed by RNAborMEA, are very different than the MFE(k) structural neighbors, as computed by RNAbor. At present, such computational experiments show RNAborMEA computes suboptimal structures, which seem to share (chimeric) similarities between parts of low energy structures, but which themselves do not have very low energies. Such suboptimal structures appear to be 'orthogonal' to those output by all other methods, such as Sfold, RNAbor, RNAbor-Sample, RNAlocopt, RNAshapes, UNAFold). Figure 8 displays the output of RNAborMEA, given the sequence of a TPP riboswitch with EMBL accession code AF269819/1811-1669. In this instance, RNAborMEA found two low energy structures having large base pair distance from each other. (Other computational experiments did not yield such a good example.) Figure 9 displays the free energy and maximum expected accuracy scores, for each of the k-neighbors of the given TPP riboswitch sequence, just described in Figure 8. Figures 10 and 11 present the pseudocode for the RNAborMEA algorithm, which given an RNA sequence a1, .. .,an and initial structure S0, computes the MEA(k) structure and pseudo partition function <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M15">View MathML</a>, for each 0 ≤ k K in time O(n3 · K2) and space O(n2 · K). Figure 12 presents pseudocode for the O(n2) algorithm to sample structures from the ensemble of structures having high MEA scores - a maximum expected accuracy analogue of the sampling algorithm Sfold[30]. Figure 13 displays the pseudo-Boltzmann probabilities <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M16">View MathML</a> for two small RNA sequences. While temperature T has a natural significance, when computing Boltzmann probabilities <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M17">View MathML</a>, there is no natural meaning of temperature T, when computing pseudo Boltzmann factors exp(MEA(S)/RT), and indeed very different curves can be obtained with different temperatures.

thumbnailFigure 7. Figure depicting the increasing divergence between RNAborand RNAborMEA. For each RNA sequence in the seed alignment from Rfam family RF00066 of U7 small nuclear RNAs, both RNAbor and RNAborMEA were run, in order to compute the MFE(k) structure and the MEA(k) structure, which have minimum free energy resp. maximum expected accuracy among all k-neighbors of the intial minimum free energy structures S0. We computed the base pair distance between the MFE(k) structure and the MEA(k) structure over all sequences in the seed alignment of RF00066. The figure displays the average ± one standard deviation of base pair distance.

thumbnailFigure 8. Sample outputs from RNAborMEAon a 143 nt TPP-riboswitch, AF269819/1811-1669 with sequence CUACUAGGGG AGCCAAAAGG CUGAGAUGAA UGUAUUCAGA CCCUUAUAAC CUGAUUUGGU UAAUACCAAC GUAGGAAAGU AGUUAUUAAC UAUUCGUCAU UGAGAUGUCU UGGUCUAACU ACUUUCUUCG CUGGGAAGUA GUU. We took the TPP riboswitch aptamer from the Rfam database [19], then extracted right-flanking nucleotides from the corresponding EMBL file, in order to include the expression platform. Displayed from left to right are the structures MEA(0) and MEA(61) (the structure MEA(52) is similar to that of MEA(61) and corresponds to a free energy local minimum in the left figure.) The structure MEA(61) had the highest MEA score over all structural neighbors, including the original structure S0 = MEA(0), and had free energy, -46.0 kcal/mol, that was equal to that of the initial structure S0 = MEA(0), which is the minimum free energy structure for the given sequence.

thumbnailFigure 9. (Left) Free energy for all MEA(k) structural neighbors, 0 ≤ k ≤ 99, of the TPP-riboswitch, AF269819/1811-1669, described in the previous figure. Clearly, MEA(0) and MEA(61) have the least energy, - 46.0 kcal/mol, and MEA(61) has the largest MEA score, 134.555, of all secondary structures for the given RNA sequence. It is more common that the free energy of the MEA(k) structure is monotonically increasing as a function of k. (Right) MEA score for all MEA(k) structural neighbors, 0 ≤ k ≤ 99, of the TPP-riboswitch, AF269819/1811-1669, described in the previous figure. Clearly, MEA(61) has the largest MEA score, 134.555, of all secondary structures for the given RNA sequence.

thumbnailFigure 10. Initial portion of pseudocode for RNAborMEAalgorithm, which continues in Figure 11. Given RNA sequence s = s1, .. .,sn of length n, initial secondary structure S0 of s, RNAborMEA computes for all values of 0 ≤ k n that structure S with base pair distance k from S0, which maximizes the value <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M55','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M55">View MathML</a>. The pseudocode actually computes only values M(i, j, k) for all i, j, k; the MEA structures are obtained by backtracing. This algorithm clearly runs in O(n5) time with O(n3) space.

thumbnailFigure 11. Pseudocode for RNAborMEAalgorithm. Given RNA sequence s = s1, .. ., sn of length n, initial secondary structure S0 of s, RNAborMEA computes for all values of 0 ≤ k n that structure S with base pair distance k from S0, which maximizes the value <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M56">View MathML</a>. The pseudocode actually computes only values M(i, j, k) for all i, j, k; the MEA structures are obtained by backtracing. This algorithm clearly runs in O(n3) time with O(n3) space.

thumbnailFigure 12. Pseudocode for the O(n2) traceback computed by our RNAborMEAalgorithm. Note that run time could be reduced to O(n ln n) by applying the boustrephedonic method described in [42].

thumbnailFigure 13. (Left) Pseudo-Boltzmann and uniform probabilities of structural neighbors MEA(k) for the 49 nt SECIS sequence fdhA, with nucleotide sequence CGCCACCCUG CGAACCCAAU AAUAAAAUAU ACAAGGGAGC AAGGUGGCG and where S0 is (((((((.(((...(((.................))).))).))))))). Here, the (formal) parameter RT taken to be 49 (length of sequence), in order to uniformize MEA scores to range between 0 and 1. The pseudo-Boltzmann probability is defined by <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M57','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M57">View MathML</a>, where (i) Z(k) = Σexp(MEA(S)/RT), the sum being taken over all S such that dBP(S0, S) = k, and (ii) Z = ΣkZ(k). The uniform probability is defined by <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M58','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M58">View MathML</a>, where N(k) is the number of k-neighbors of S0 and N is the total number of secondary structures. (Right) Pseudo-Boltzmann probabilities for MEA(k) structural neighbors of the 27 nt Vienna bistable switch with nucleotide sequence CUUAUGAGGG UACUCAUAAG AGUAUCC and initial (minimum free energy) structure.......((((((((....)))))))). The left curve is when RT = 0.6, the approximate value obtained by multiplying the universal gas constant 0.00198 kcal/mol times 310 Kelvin. In contrast, the right curve is when RT = 27 (length of sequence). Though not shown in this graph, the pseudo-Boltzmann distribution is identical with the uniform distribution, when RT = n, where n is sequence length.

We now briefly describe Tables 1, 2, 3, 4. Table 1 provides some sample sizes N, computed by the formula from equation (2), for an ε approximation of Boltzmann probabilities pk, 0 ≤ k < K, with 1 - α confidence level. Tables 2 and 3 provide the numerical values for the earlier described Figures 5 and 6, where the NestedAlign structural similarity is computed for the most similar k-neighbor, determined by RNAborMEA, RNAbor-Sample and RNAlocopt. Table 4 presents the number of times that each of the methods RNAborMEA, RNAbor, RNAbor-Sample, RNAlocopt, RNAshapes, UNAFold output the most similar structure to the GENE ON resp. GENE OFF structure for the XPT purine riboswitch described in Figure 3. This computational experiment was performed for all RNA sequences in the seed alignment of the Rfam purine riboswitch family RF00167 [31]. This table shows that RNAborMEA and RNAbor both outperform any other method in determining structures similar to the GENE OFF XPT structure; however, RNAborMEA uniquely outperforms all methods, including RNAbor, in determining structures similar to the GENE ON XPT structure. One of the reasons for this excellent result is that unlike other methods, RNAborMEA does not look for low energy structures, but rather for maximum expected accuracy structures.

Table 1. Number of samples needed for high-confidence approximation of Boltzmann probabilities

Table 2. Comparison of NestedAlign similarity scores for the GENE ON structure of the XPT guanine riboswitch

Table 3. Comparison of NestedAlign similarity scores for the GENE OFF structure of the XPT guanine riboswitch

Table 4. Number of times that the most similar structure was produced

The figures and tables show, in summary, that RNAborMEA provides useful suboptimal structures, which may be closer to metastable structures of a conformational switch than more traditional methods, which rely on searching for low energy structures.

Conclusions

We have applied the notion of maximum expected accuracy within the context of structural neighbors of a given RNA sequence a1, .. ., an and structure S0. Our software RNAborMEA not only computes the structures MEA(k) having maximum expected accuracy over all structures S, whose base pair distance dBP(S0, S) is equal to k. In addition, RNAborMEA allows the user to enter structural constraints, which specify partial secondary structures required of all MEA(k) structures, if so desired. Additionally, RNAborMEA computes an analogue of the temperature-dependent partition function, defined by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M18">View MathML</a>

and

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M19">View MathML</a>

Here, the expected accuracy score σ is defined by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M20">View MathML</a>

where first sum is taken over all base pairs (i, j) belonging to S, and the second sum is taken over all unpaired positions in S, and where pi,j [resp. qi] is the probability that i, j are paired [resp. i is unpaired] in the ensemble of low energy structures, as computed by McCaskill's algorithm [27]. Finally, RNAborMEA allows the user to sample structures from the maximum expected accuracy ensemble, in a fashion analogous to Ding-Lawrence sampling from the low energy Boltzmann ensemble, as in Sfold[30].

Our preliminary investigations have not indicated a clear application of the partition function analogue, though it may be construed to provide a representation of the temperature-dependent mixing of various structures having large score σ. On the other hand, in computational experiments reported in the Results Section, it appears that RNAborMEA produces near-optimal structures that are closer to the biologically functional structures, in the case of conformational switches that are difficult to predict by any method.

Indeed, in 18 [resp. 11] out of 34 instances, RNAborMEA produced the secondary structure most structurally similar to the experimentally determined XPT GENE ON [resp. GENE OFF] structure, as measured by NestedAlign[36]. See Table 4. Since there appears to be little to no correlation between the structures MFE(k) output by RNAbor[20] and the structures MEA(k) output by our current program RNAborMEA, it appears that RNAborMEA yields a signal that is orthogonal and complementary to that provided by state-of-the-art thermodynamics software, such as UNAFold, RNAfold, RNAstructure, Sfold, RNAshapes, RNAbor, etc. For these reasons, we feel that RNAborMEA has a certain value, along with the programs UNAFold, RNAfold, RNAstructure, Sfold, RNAshapes, RNAbor, etc. when producing suboptimal structures. RNAborMEA is written in C and available at http://sourceforge.net/projects/rnabormea/ webcite and http://bioinformatics.bc.edu/clotelab/RNAborMEA/. webcite

Methods

Preliminaries

Recall the definition of RNA secondary structure.

Definition 1 A secondary structure S on RNA sequence a1, .. ., an is defined to be a set of ordered pairs (i, j), such that 1 ≤ i < j n and the following are satisfied.

1. Watson-Crick or GU wobble pairs: If (i, j) belongs to S, then pair (ai, aj) must be one of the following canonical base pairs: (A, U), (U, A), (G, C), (C, G), (G, U), (U, G).

2. Threshold requirement: If (i, j) belongs to S, then j - i > θ, where θ, generally taken to be equal to 3, is the minimum number of unpaired bases in a hairpin loop; i.e., there must be at least θ unpaired bases in a hairpin loop.

3. Nonexistence of pseudoknots: If (i, j) and (k, ℓ) belong to S, then it is not the case that i < k < j <ℓ.

4. No base triples: If (i, j) and (i, k) belong to S, then j = k; if (i, j) and (k, j) belong to S, then i = k.

The preceding definition provides for an inductive construction of the set of all secondary structures for a given RNA sequence a1, .. ., an. For all values of d = 0, .. ., n and all values of i = 1, .. ., n - d, the collection <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M21">View MathML</a>of all secondary structures for ai, .. ., ai+d is defined as follows. If 0 ≤ d θ, then <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M22">View MathML</a>; i.e., the only secondary structure for ai, .. ., ai+d is the empty structure containing no base pairs (due to the requirement that all hairpins contain at least θ unpaired bases). If d > θ and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M23">View MathML</a> has been defined by recursion for all i j < i + d, then

Any secondary structure of ai, .. ., ai+d-1 is a secondary structure for ai, .. ., ai+d, in which ai+d is unpaired.

If ai, aj can form a Watson-Crick or wobble base pair, then for any secondary structure S for ai+1, .. ., ai+d-1, the structure S ∪ {(i, j)} is a secondary structure for ai, ..., ai+d.

For any intermediate value i + 1 ≤ r j - θ - 1, if ar, aj can form a Watson-Crick or wobble base pair, then for any secondary structure S for ai, .. .,ar-1 and any secondary structure T for ar+1, ..., aj-1, the structure S T ∪ {(r, j)} is a secondary structure for ai, .. ., ai+d.

Given two secondary structures S, T, we define the base pair distance between S, T, denoted by dBP (S, T), to be the cardinality of the symmetric difference of S, T; i.e., dBP (S, T) = |(S - T) ∪ (T - S)|.

RNAbor-Sample

In this section, we describe how sampling from the Boltzmann ensemble, using Sfold[30], leads to a simple and fast approximation of RNAbor computations, provided that the input consists of an RNA sequence and the minimum free energy (MFE) secondary structure for that sequence. The work of this section is inspired by sampling approximations of the number of structural motifs, such as hairpins, multiloops, etc. of Ding and Lawrence [30], as well as the sampling approximation used in RNAshapes[8] for large sequences. A novelty of our work is that we provide a rigorous justification for the accuracy of the approximation, depending on sample size.

Let RNAbor-Sample denote the protocol, where we apply Sfold[30] to sample N secondary structures S of an input RNA sequence a1, .. .,an, then subsequently compute the relative frequencies fk for 0 ≤ k < K, where <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M24">View MathML</a> is defined to be the number Nk of sampled structures S, whose base pair distance with S0 is k, divided by N. Since Sfold appears to be only available as a web server, where the user can not stipulate the number of sampled structures, we instead use the Vienna RNA Package implementation of Sfold, given in RNAsubopt -p[32].

Let a1, .. .,an be an arbitrary RNA sequence having MFE structure of S0. Following [9], let Zk denote the sum of Boltzmann factors of all k-neighbors of S0; i.e.,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M25">View MathML</a>

As usual, let Z denote the partition function, representing the sum of Boltzmann factors of all secondary structures of a1, .. ., an; i.e.,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M26">View MathML</a>

and let <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M27">View MathML</a> denote the Boltzmann probability of all k-neighbors.

Given a desired approximation accuracy of ε, a probability p, and an upper bound K, we wish to compute a value N = N(ε, p, K), such that whenever we sample at least N secondary structures from the Boltzmann ensemble using Sfold, the relative frequency fk of k-neighbors sampled is within ε of the probability pk of k-neighbors, for all 0 ≤ k < K, with confidence level of (1 - p). Formally, this means that for each 0 ≤ k < K,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M28">View MathML</a>

(4)

Consider the value k as bin number. For a fixed bin k, let us denote the exact value of <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M29">View MathML</a> by pk. If we sample N structures, each falling in the kth bin with probability pk, then the number of structures in the kth bin is given by the random variable Xk having binomial distribution with mean N · pk and variance N · pk(1 - pk). It follows that the proportion <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M30">View MathML</a> of structures in the kth bin has mean µ = pk and standard deviation <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M31">View MathML</a>. To determine minimum sample size sufficient to ensure a certain approximation accuracy with certain confidence interval, we adapt a standard argument from statistics [37] (see equation (24.35) on p. 529), by approximating the binomial distribution by the standard normal distribution using Z-scores.

Before starting, we mention that it will suffice for our intended application of RNAbor-Sample to have a precise approximation of those pk which exceed some modest lower bound, such as δ = 0.01 or δ = 0.0001. Thus we intend to prove that for all 0 ≤ k < K, if pk δ, then Equation (4) holds.

Temporarily, we fix k. Let X be a Bernouilli random variable with success probability pk, corresponding to the indicator random variable that returns 1 if a single sampled secondary structure is a k-neighbor of S0. Provided that we sample a number N of structures, which satisfies N · pk ≥ 30, then the standard normal distribution can be used to approximate the left and right tail of the distribution of Z-scores of sampled proportions <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M32">View MathML</a>, defined by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M33">View MathML</a>

(5)

Let <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M34">View MathML</a> denote the cumulative distribution function (CDF) for the standard normal distribution. Given desired confidence interval of C = 1 - α, recall that the critical value zα/2 is defined by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M35">View MathML</a>

If ε is the margin of error in the left tail (-∞, -zα/2) and right tail (zα/2, +∞) for the normal approximation of the binomial distribution, then by a well-known argument (e.g. equation (24.35) on p. 529 of [37]), we have

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M36">View MathML</a>

It follows that

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M37">View MathML</a>

provides a sufficient lower bound on number of samples necessary to guarantee margin of error ε. Let <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M38">View MathML</a> and define

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M39">View MathML</a>

(6)

We have just shown that for N N(ε, p, K), <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M40">View MathML</a>, hence

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M41">View MathML</a>

The following is now a key step. If we have K bins, and we desire to have a small probability p that we are off by more than ε in our estimate of the probability of any bin (in our intended application, the kth bin, for 0 ≤ k < K, is the collection of k-neighbors of S0, i.e., those structures S, whose base pair distance with S0 is k) then it suffices that we have a probability <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M42">View MathML</a> that we are off by more than ε in any single bin. Indeed, let Yk denote the indicator random variable, with value 1 provided that |fk - pk| > ε, where fk is the relative frequency of sampling a k-neighbor of S0, after sampling N secondary structures, where by Equation (5), N is chosen so that

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M43">View MathML</a>

then

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M44">View MathML</a>

Putting everything together, we have shown that for given ε, p, K, we can define by defining N

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M45">View MathML</a>

(7)

we have

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M46','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M46">View MathML</a>

We have completed a more rigorous argument using Chernoff bounds, but prefer the exposition given here for simplicity. Note that the same argument, given verbatim, can be used for any binning procedure. In particular, this approach provides information on sufficient number of samples in order to approximate the result of RNAshapes [8,38,39] by means of sampling.

We can make some basic conclusions from the above analysis. The number of samples sufficient to ensure that |fk - pk| < ε for 0 ≤ k < K with confidence 1 - p is reasonable, and only slightly increases for a higher number K of bins or to ensure greater confidence 1 - p. However, the number of samples increases greatly when higher precision estimates (smaller ε values) are needed, even for one bin.

In the case of one bin, it is important to remember that the value N is a conservative estimate, sufficient to ensure our conclusion. This estimate uses the worst case of 50% probability of being in a bin, which maximizes the standard deviation. For bins with small probability, one can re-estimate what is needed. However, for bins with smaller probability, higher precision is needed as well, unless all that is required is to verify that the bin has small probability. Also, clearly if a bin has probability of 10-6, then at least on the order of one million samples are needed, just for a reasonable probability of sampling the bin once.

Algorithm description

Given an RNA sequence a = a1, .. ., an, a secondary structure S0 of a, and a maximum desired value Kmax n, the RNAborMEA algorithm computes, for each 1 ≤ i < j n and each 0 ≤ k <Kmax n, the maximum score M(i, j, k)

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M47','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M47">View MathML</a>

where the first sum is taken over all base pairs (i, j) belonging to S, the second sum is taken over all unpaired positions in S, and where pi,j [resp. qi] is the probability that i, j are paired [resp. i is unpaired] in the ensemble of low energy structures, and α, β >0 are weights. Our computational experiments, as in Figure 9, were carried out with default values of 1 for α, β. (See Equation 1 for the formal definition of Boltzmann base pairing probability pi,j.)

The dynamic programming computation of M(i, j, k) is performed by recursion on increasing values of j - i for all values 1 ≤ i j n and 0 ≤ k Kmax. The value of M(i, j, k), stored in the upper triangular portion of matrix M, will involve taking the maximum over three cases, which correspond to the inductive construction of all secondary structures on ai, .. ., aj, as described in the previous section. At the same time, the value M(j, i, k), stored in the lower triangular portion of matrix M, will consist of a triple r, k0, k1 of numbers, such that the following approximately holds (in this section, we provide the motivating idea; the actual algorithm description, which deviates slightly from the description here, is given in the next section and in Figures 10 and 11). (i) If r = 0 then M(i, j, k) is maximized by a k-neighbor S of S0[i, j] for the subsequence ai, .. ., aj in which aj is unpaired. In this case, k0 = k and k1 = 0. (ii) If r = i, then M(i, j, k) is maximized by a k-neighbor S of S0[i, j] for the subsequence ai, ...,aj in which base pair (i, j) ∈ S. In this case, k0 = 0 and k1 = k - 1. (i) If i < r j - θ - 1 then M(i, j, k) is maximized by a k-neighbor S of S0[i, j] for the subsequence ai, .. .,aj in which base pair (r, j) ∈ S. The left portion of S, which is S[i, r - 1] will be a k0 neighbor of S[i, r - 1], while the right portion of S, which is S[r, j] must contain the base pair (r, j) and itself be a k1 neighbor of S[r, j]. In summary, the values r, k0, k1 will be used in computing the traceback, where the maximum expected accurate structure that is a k-neighbor of S[i, j] will be constructed by one of the following: (i) MEA k-neighbor of S[i, j - 1], in the event that aj is unpaired in [i, j]; (ii) MEA k - 1-neighbor of S[i + 1, j - 1], in the event that ai, aj form a base pair; (iii) MEA k0-neighbor of S[i, r - 1] and the MEA k1-neighbor of S[r, j], where k0 + k1 = k, in the event that ar, aj form a base pair.

Pseudocode for the algorithm RNAborMEA is given in Figures 10 and 11. An array M of size n × n × Kmax is required to store the MEA scores in M(i, j, k) for all subsequences [i, j] and all base pair distances 0 ≤ k Kmax between structures S[i, j] and initially given structure S0[i, j]. For 1 ≤ i j n and all 0 ≤ k Kmax, the pseudocode in Figure 11 stores a value of the form (x, y, z) in the lower triangular portion, M(j, i, k), of the array. Here, x = 0 indicates that the optimal structure on [i, j], i.e., having maximum MEA score over all k-neighbors of S0[i, j], is obtained by not pairing j with any nucleotide in [i, j]; for values x >0, hence x ∈ [i, j - θ - 1], the optimal k-neighbor of S0[i, j] is obtained by pairing x with j. The values y, z correspond to the values k0, k1, such that: (i) if x = 0, then the optimal k-neighbor of S0[i, j] is obtained by first computing the optimal k0-neighbor of S0[i, j - 1], where k0 = k - b0, then leaving j unpaired; (ii) if x = i, then the optimal k-neighbor of S0[i, j] is obtained by first computing the optimal k1-neighbor of S0[i + 1, j - 1], where k1 = k - b1, then adding the enclosing base pair (i, j); (iii) if x = r ∈ [i + 1, j - θ - 1], then the optimal k-neighbor of S0[i, j] is obtained by first computing the optimal k0-neighbor of S0[i, r - 1] as well as the optimal k1-neighbor of S0[r + 1, j - 1], then adding the base pair (r, j). This last calculation must be done over all values k0, k1 such that k0 + k1 = k. Using the values M(j, i, k) = (x, y, z), the traceback can be easily computed by recursion; see Figure 12 for pseudocode of traceback.

In a manner similar to the pseudocode of Figures 10 and 11 (essentially, one replaces the operation of taking the maximum by the a summation, and one replaces the MEA score by the pseudo-Boltzmann factor exp(MEA(S)/RT)), RNAborMEA also computes the pseudo-Boltzmann partition function values

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M48','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M48">View MathML</a>

We have graphed the Boltzmann probabilities <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M49','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M49">View MathML</a> as well as the uniform probabilities <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M50">View MathML</a>, where <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S5/S6/mathml/M51">View MathML</a> is the number of k-neighbors, and N1,n is the total number of secondary structures. When RT = n, which normalizes the MEA score to a maximum of 1, it appears that the Boltzmann distribution is the same as the uniform distribution, as shown in Figure 13.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

RNAbor-Sample was conceived by W.A.L., who provided a proof for sample size sufficient to ensure a desired degree of accuracy with a desired level of confidence. RNAborMEA was conceived by P.C., then designed and programmed by P.C. (prototype implementation in Python) and F.L. (implementation in C). Computational experiments were carried out by F.L., figures were created by F.L. and P.C. and the manuscript was written by P.C.

Acknowledgements

Funding for the research of P. Clote and F. Lou was provided by the Digiteo Foundation for the project RNAomics. Additional funding was provided to P. Clote by National Science Foundation grants DMS-1016618 and DMS-0817971. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

This article has been published as part of BMC Bioinformatics Volume 13 Supplement 5, 2012: Selected articles from the First IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2011): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S5.

References

  1. Olsthoorn R, Mertens S, Brederode F, Bol J: A conformational switch at the 3' end of a plant virus RNA regulates viral replication.

    EMBO J 1999, 18:4856-4864. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  2. Repsilber D, Wiese S, Rachen M, Schroder A, Riesner D, Steger G: Formation of metastable RNA structures by sequential folding during transcription: time-resolved structural analysis of potato spindle tuber viroid (-)-stranded RNA by temperature-gradient gel.

    RNA 1999, 5:574-584. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  3. Franch T, Gultyaev AP, Gerdes K: Programmed Cell Death by hok/sok of Plasmid R1: Processing at the hok mRNA 3H-end Triggers Structural Rearrangements that Allow Translation and Antisense RNA Binding.

    J Mol Biol 1997, 273:38-51. PubMed Abstract | Publisher Full Text OpenURL

  4. Mandal M, Boese B, Barrick J, Winkler W, Breaker R: Riboswitches control fundamental biochemical pathways in Bacillus subtilis and other bacteria.

    Cell 2003, 113(5):577-586. PubMed Abstract | Publisher Full Text OpenURL

  5. Cheah MT, Wachter A, Sudarsan N, Breaker RR: Control of alternative RNA splicing and gene expression by eukaryotic riboswitches.

    Nature 2007, 447(7143):497-500. PubMed Abstract | Publisher Full Text OpenURL

  6. Ray PS, Jia J, Yao P, Majumder M, Hatzoglou M, Fox PL: A stress-responsive RNA switch regulates VEGFA expression.

    Nature 2009, 457(7231):915-919. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  7. Voss B, Meyer C, Giegerich R: Evaluating the predictability of conformational switching in RNA.

    Bioinformatics 2004, 20(10):1573-1582. PubMed Abstract | Publisher Full Text OpenURL

  8. Voss B, Giegerich R, Rehmsmeier M: Complete probabilistic analysis of RNA shapes.

    BMC Biol 2006., 4(5) OpenURL

  9. Freyhult E, Moulton V, Clote P: Boltzmann probability of RNA structural neighbors and riboswitch detection.

    Bioinformatics 2007, 23(16):2054-2062.

    Doi: 10.1093/bioinformatics/btm314

    PubMed Abstract | Publisher Full Text OpenURL

  10. Barash D: Second eigenvalue of the Laplacian matrix for predicting RNA conformational switch by mutation.

    Bioinformatics 2004, 20(12):1861-1869. PubMed Abstract | Publisher Full Text OpenURL

  11. Mandal M, Breaker RR: Adenine riboswitches and gene activation by disruption of a transcription terminator.

    Nat Struct Mol Biol 2004, 11:29-35. PubMed Abstract | Publisher Full Text OpenURL

  12. Serganov A, Yuan Y, Pikovskaya O, Polonskaia A, Malinina L, Phan A, Hobartner C, Micura R, Breaker R, Patel D: Structural Basis for Discriminative Regulation of Gene Expression by Adenine- and Guanine-Sensing mRNAs.

    Chem Biol 2004, 11(12):1729-1741. PubMed Abstract | Publisher Full Text OpenURL

  13. Serganov A, Polonskaia A, Phan AT, Breaker RR, Patel DJ: Structural basis for gene regulation by a thiamine pyrophosphate-sensing riboswitch.

    Nature 2006, 441(7097):1167-1171. PubMed Abstract | Publisher Full Text OpenURL

  14. Abreu-Goodger C, Merino E: RibEx: A web server for locating riboswitches and other conserved bacterial regulatory elements.

    Nucleic Acids Res 2005, 33:W690-W692. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  15. Bengert P, Dandekar T: Riboswitch finder - A tool for identification of riboswitch RNAs.

    Nucleic Acids Res 2004, 32:W154-W159. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  16. Chang TH, Huang HD, Wu LC, Yeh CT, Liu BJ, Horng JT: Computational identification of riboswitches based on RNA conserved functional sequences and conformations.

    RNA 2009., 15(7) OpenURL

  17. Nawrocki EP, Kolbe DL, Eddy SR: Infernal 1.0: inference of RNA alignments.

    Bioinformatics 2009, 25(10):1335-1337. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  18. Weinberg Z, Barrick JE, Yao Z, Roth A, Kim JN, Gore J, Wang JX, Lee ER, Block KF, Sudarsan N, Neph S, Tompa M, Ruzzo WL, Breaker RR: Identification of 22 candidate structured RNAs in bacteria using the CMfinder comparative genomics pipeline.

    Nucleic Acids Res 2007, 35(14):4809-4819. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  19. Gardner PP, Daub J, Tate JG, Nawrocki EP, Kolbe DL, Lindgreen S, Wilkinson AC, Finn RD, Griffiths-Jones S, Eddy SR, Bateman A: Rfam: updates to the RNA families database.

    Nucleic Acids Res 2009, 37(Database):D136-D140. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  20. Freyhult E, Moulton V, Clote P: Boltzmann probability of RNA structural neighbors and riboswitch detection.

    Bioinformatics 2007, 23(16):2054-2062. PubMed Abstract | Publisher Full Text OpenURL

  21. Freyhult E, Moulton V, Clote P: RNAbor: a web server for RNA structural neighbors.

    Nucleic Acids Res 2007, 35(Web):W305-W309. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  22. Matthews D, Sabina J, Zuker M, Turner D: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure.

    J Mol Biol 1999, 288:911-940. PubMed Abstract | Publisher Full Text OpenURL

  23. Xia T, SantaLucia JJ, Burkard M, Kierzek R, Schroeder S, Jiao X, Cox C, Turner D: Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs.

    Biochemistry 1999, 37:14719-35. OpenURL

  24. Do CB, Mahabhashyam MS, Brudno M, Batzoglou S: ProbCons: Probabilistic consistency-based multiple sequence alignment.

    Genome Res 2005, 15(2):330-340. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  25. Nussinov R, Jacobson AB: Fast Algorithm for Predicting the Secondary Structure of Single Stranded RNA.

    Proceedings of the National Academy of Sciences, USA 1980, 77(11):6309-6313. Publisher Full Text OpenURL

  26. Kiryu H, Kin T, Asai K: Robust prediction of consensus secondary structures using averaged base pairing probability matrices.

    Bioinformatics 2007, 23(4):434-441. PubMed Abstract | Publisher Full Text OpenURL

  27. McCaskill J: The equilibrium partition function and base pair binding probabilities for RNA secondary structure.

    Biopolymers 1990, 29:1105-1119. PubMed Abstract | Publisher Full Text OpenURL

  28. Lu ZJ, Gloor JW, Mathews DH: Improved RNA secondary structure prediction by maximizing expected pair accuracy.

    RNA 2009, 15(10):1805-1813. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  29. Zuker M: On finding all suboptimal foldings of an RNA molecule.

    Science 1989, 244(7):48-52. PubMed Abstract | Publisher Full Text OpenURL

  30. Ding Y, Lawrence CE: A statistical sampling algorithm for RNA secondary structure prediction.

    Nucleic Acids Res 2003, 31:7280-7301. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  31. Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR, Bateman A: Rfam: Wikipedia, clans and the "decimal" release.

    Nucleic Acids Res 2011, 39(Database):D141-D145. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  32. Gruber A, Lorenz R, Bernhart S, Neubock R, Hofacker I: The Vienna RNA websuite.

    Nucleic Acids Research 2008, 36:70-74. OpenURL

  33. Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH: Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure.

    Proc Natl Acad Sci USA 2004, 101(19):7287-7292. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  34. Lorenz W, Clote P: Computing the partition function for kinetically trapped RNA secondary structures.

    Public Library of Science One (PLoS ONE) 2011, 6:316178.

    Doi:10.1371/journal.pone.0016178

    OpenURL

  35. Wakeman CA, Winkler WC, Dann C: Structural features of metabolite-sensing riboswitches.

    Trends Biochem Sci 2007, 32(9):415-424. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  36. Blin G, Denise A, Dulucq S, Herrbach C, Touz H: Alignments of RNA structures.

    IEEE/ACM Transactions on Computational Biology and Bioinformatics 2010. OpenURL

  37. Zar J: Biostatistical Analysis. Prentice-Hall, Inc; 1999. OpenURL

  38. Giegerich R, Voss B, Rehmsmeier M: Abstract shapes of RNA.

    Nucleic Acids Res 2004, 32(16):4843-4851. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  39. Steffen P, Voss B, Rehmsmeier M, Reeder J, Giegerich R: RNAshapes: an integrated RNA analysis package based on abstract shapes.

    Bioinformatics 2006, 22(4):500-503. PubMed Abstract | Publisher Full Text OpenURL

  40. Hofacker I: Vienna RNA secondary structure server.

    Nucleic Acids Res 2003, 31:3429-3431. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  41. Markham NR, Zuker M: UNAFold: software for nucleic acid folding and hybridization.

    Methods Mol Biol 2008, 453:3-31. PubMed Abstract | Publisher Full Text OpenURL

  42. Ponty Y: Efficient sampling of RNA secondary structures from the Boltzmann ensemble of low-energy: The boustrophedon method.

    J Math Biol 2008, 56(1-2):107-127. PubMed Abstract | Publisher Full Text OpenURL