Institute of Applied Mathematics, University of British Columbia, Vancouver, BC V6T 1Z2, Canada
Department of Computer Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
Abstract
Background
We investigate the empirical complexity of the RNA secondary structure design problem, that is, the scaling of the typical difficulty of the design task for various classes of RNA structures as the size of the target structure is increased. The purpose of this work is to understand better the factors that make RNA structures hard to design for existing, high-performance algorithms. Such understanding provides the basis for improving the performance of one of the best algorithms for this problem, RNASSD, and for characterising its limitations.
Results
To gain insights into the practical complexity of the problem, we present a scaling analysis on random and biologically motivated structures using an improved version of the RNASSD algorithm, and also the RNAinverse algorithm from the Vienna package. Since primary structure constraints are relevant for designing RNA structures, we also investigate the correlation between the number and the location of the primary structure constraints when designing structures and the performance of the RNASSD algorithm. The scaling analysis on random and biologically motivated structures supports the hypothesis that the running time of both algorithms scales polynomially with the size of the structure. We also found that the algorithms are in general faster when constraints are placed only on paired bases in the structure. Furthermore, we prove that, according to the standard thermodynamic model, for some structures that the RNASSD algorithm was unable to design, there exists no sequence whose minimum free energy structure is the target structure.
Conclusion
Our analysis helps to better understand the strengths and limitations of both the RNASSD and RNAinverse algorithms, and suggests ways in which the performance of these algorithms can be further improved.
1 Background
Ribonucleic acids (RNA) are macromolecules that play fundamental roles in many biological processes, and in many cases their structure is essential for their biological function. A secondary structure for an RNA strand is simply a set of pairing interactions between bases in the strand. Each base can be paired with at most one other base. Most base pairings occur between Watson-Crick complementary bases C and G or A and U (canonical pairs). Other pairings, such as G•U, can be found occasionally. Secondary structure determines many important aspects of RNA tertiary structure; it can, for example, be used in part to explain translational controls in mRNA
Almost all widely used computational approaches for prediction of RNA secondary structures from single sequences are based on thermodynamic models that associate a free energy value with each possible secondary structure of a strand. The secondary structure with the lowest possible free energy value, the minimum free energy (MFE) structure, is predicted to be the most stable secondary structure for the strand. There are widely used dynamic programming algorithms that, given an RNA strand of length
1.1 The RNA Secondary Structure Design Problem
This work focuses on the design of RNA strands that are predicted to fold to a given MFE secondary structure, according to a standard thermodynamic model such as that of Mathews et al.
Dirks et al.
The purpose of this work is to understand better the factors that render RNA structures hard to design. Such understanding provides the basis for improving the performance of RNASSD and for characterising its limitations. To our knowledge, it has not been determined whether there is a polynomial-time algorithm for RNA secondary structure design. Schuster et al.
Therefore, to gain insights into the practical complexity of the RNA secondary structure design problem, we present an empirical analysis of an improved version of the RNASSD algorithm of Andronescu et al.
Secondly, we introduce and analyse a version of RNASSD that additionally allows the specification of primary structure constraints. Such constraints are important, for example, when designing RNAs such as ribozymes or tRNAs, where certain base positions must be fixed in order to permit interaction with other molecules. We show that depending on their number and location, such constraints can have a significant impact (positive or negative) on the running time of the design algorithm. Our results indicate that when the primary structure constraints are restricted to stems, our new version of RNASSD is faster than when the constraints are distributed randomly, and in both cases the algorithm's median expected running time scales polynomially with the size of the structure to be designed.
1.2 The RNASSD algorithm
The RNA secondary structure design problem can be formalised as a discrete constraint satisfaction problem, where the constraint variables are the positions in the desired RNA strand, the values assigned to these variables correspond to the bases at the respective positions, and the constraints capture the base pairings that define the given secondary structure. For this problem, evaluating the quality of candidate solutions is computationally expensive, since it uses an implementation of the algorithm of Zuker and Stiegler
RNASSD is a stochastic local search (SLS) algorithm that iteratively modifies single unpaired bases or base pairs of a candidate strand in order to obtain a sequence that is predicted to have the target MFE structure. The
2 Results
The main contributions of our work fall into four categories: Improvements to the RNASSD algorithm, including support for primary structure constraints; results on the scaling of run time for our new RNASSD algorithm and RNAinverse on design problems
2.1 Improvement of the RNASSD algorithm
In preliminary experiments, we found that some structures are very difficult to design by using the hierarchical decomposition of Andronescu et al.
(a) Randomly generated structure of length 75 (RND75n62) with loops separated by short stems. The line represents the location where the structure is split into two substructures. Parts (b) and (c) show the corresponding substructures with a static cap structure and dangling ends, respectively. Parts (d) and (e) show the same substructures with a dynamic cap structure and dynamic dangling ends, respectively.
This mechanism can be improved by introducing a dynamic cap structure and dynamic dangling ends in order to create structural boundary conditions that are exactly identical to the original structure in terms of the number of paired and unpaired bases adjacent to the split point. In our new mechanism, the number of paired bases in the cap structure added to one fragment depends on the number of paired bases at the beginning of the other substructure (Figure
We also extended RNASSD to support primary structure constraints, that is, constraints on the bases that occur in certain sequence positions. The additional sequence constraints limit certain sequence positions to specific bases or sets of bases. For this purpose, the standard IUPAC symbols
IUPAC nomenclature for nucleic acids.
Symbol | Meaning | Origin of designation
G | G | Guanine
A | A | Adenine
T | T | Thymine
C | C | Cytosine
R | G or A | puRine
Y | T or C | pYrimidine
M | A or C | aMino
K | G or T | Ketone
S | G or C | Strong interaction (3 H bonds)
W | A or T | Weak interaction (2 H bonds)
H | A or C or T | not G; H follows G in the alphabet
B | G or T or C | not A; B follows A
V | G or C or A | not T (not U); V follows U
D | G or A or T | not C; D follows C
N | G or A or T or C | aNy
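As an illustration of how a sequence specification in these symbols can be checked against a candidate RNA sequence, the following minimal sketch may be helpful. It is not taken from RNASSD: the names `IUPAC`, `allowed_bases` and `satisfies` are hypothetical, and we assume that the DNA symbol T in the table is read as U for RNA.

```python
# Illustrative sketch only; names and the T -> U convention are assumptions,
# not part of RNASSD.
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT", "H": "ACT", "B": "GTC",
    "V": "GCA", "D": "GAT", "N": "GATC",
}


def allowed_bases(symbol):
    """Return the set of RNA bases allowed by one IUPAC symbol (T read as U)."""
    key = symbol.upper().replace("U", "T")
    return {base.replace("T", "U") for base in IUPAC[key]}


def satisfies(sequence, constraint):
    """True if each base of `sequence` is allowed by the symbol at its position."""
    if len(sequence) != len(constraint):
        raise ValueError("sequence and constraint must have equal length")
    return all(
        base.upper().replace("T", "U") in allowed_bases(symbol)
        for base, symbol in zip(sequence, constraint)
    )
```

For example, under these assumptions the sequence GAUC satisfies the specification RNYS, but not YNNN, since G is not a pyrimidine.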
Our extended version of RNASSD supports primary structure constraints as follows. Given a sequence specification using the IUPAC symbols listed in Table
2.2 Analysis of RNASSD and RNAinverse on secondary structures without constraints
We now report results from our analysis of the empirical complexity of solving RNA secondary structure design problems with the improved version of RNASSD and with the RNAinverse algorithm. We performed experiments on random and biologically motivated structures of different lengths. (Details of our experimental protocol are given in Section 5.)
We study the behaviour of the algorithm on biological structures since it will have an impact in biological applications such as ribozyme design. Because of the limited availability of true biological structures, we generated structures with biological characteristics based on the set of real structures listed in Table
Biological RNA structures.
No. | Description | Size (bases)
1 | Minimal catalytic domains of the hairpin ribozyme satellite RNA of the Tobacco ringspot virus | 65
2 | U3 snoRNA 5' domain from | 79
3 | | 122
4 | VS ribozyme from Neurospora mitochondria | 167
5 | R180 ribozyme | 178
6* | XS1 ribozyme, | 314
7* | Homo Sapiens RiboNuclease P RNA | 342
8 | S20 mRNA from | 372
9 | | 375
10 | Group II intron ribozyme D135 from | 583
Biological structures obtained from the literature and used by Andronescu et al. [9]. The structures marked with an asterisk (*) were obtained from original, pseudoknotted structures by eliminating 8 base pairs in each case to remove the pseudoknot.
Statistics of biological structures from Table 2.
Property | Hairpins | Stems | 2-branch loops | Multiloops | Bulges
Size | [4,8] | [3,12] | [4,11] | [6,17] | [1,3]
Number | | | [1,8] | [0,5] | [0,0.17]*
Branches | | | | [3,4] |
Properties of the structures from Table 2; the intervals specify the minimal and maximal values observed for the respective features. These parameters were used to generate structures with biological properties. * This value denotes the ratio of bulges to base pairs in the stems.
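To make the kind of statistics collected in this table concrete, the following sketch extracts stem lengths and hairpin loop sizes from a structure in dot-bracket notation. It is an illustration under stated assumptions rather than the procedure used in this study: the function names are hypothetical, and a stem is taken here to be a maximal run of directly stacked base pairs, so a stem interrupted by a bulge is counted as two stems.

```python
def parse_pairs(db):
    """Map each paired position to its partner (0-based) for a dot-bracket string."""
    stack, partner = [], {}
    for i, ch in enumerate(db):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            partner[i], partner[j] = j, i
        elif ch != ".":
            raise ValueError("unexpected character: " + ch)
    if stack:
        raise ValueError("unbalanced structure")
    return partner


def stems(db):
    """Return stems as lists of (i, j) pairs; each is a maximal run of stacked pairs."""
    partner = parse_pairs(db)
    result = []
    for i in sorted(partner):
        j = partner[i]
        # Skip closing positions and pairs that continue a stem started earlier.
        if i > j or partner.get(i - 1) == j + 1:
            continue
        stem = [(i, j)]
        while True:
            a, b = stem[-1]
            if a + 1 < b - 1 and partner.get(a + 1) == b - 1:
                stem.append((a + 1, b - 1))
            else:
                break
        result.append(stem)
    return result


def hairpin_sizes(db):
    """Return the sorted numbers of unpaired bases enclosed by hairpin-closing pairs."""
    partner = parse_pairs(db)
    return sorted(
        j - i - 1
        for i, j in partner.items()
        if i < j and all(k not in partner for k in range(i + 1, j))
    )
```

Applied to a set of reference structures, the minimum and maximum of these values would give intervals of the form shown in the table.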
Figure
Scaling analysis of RNASSD and RNAinverse. Scaling analysis of the expected run time (y-axis) of structures of lengths 50, 75, 100, 125, 150, 200 and 450 (x-axis). A logarithmic scale is used on both axes. The lines correspond to best fits of the data, for structures with lengths 50 to 150, using a polynomial that is specified in each case. The expected run times for structures longer than 150 appear close to the corresponding fit line. (a) Expected run time of RNASSD to design biological structures and median (Q50), 0.1-quantile (Q10) and 0.9-quantile (Q90) of expected run time for RNASSD applied to biologically motivated structures. (b) Median of expected run time of random and biologically motivated structures using RNASSD and RNAinverse. The structures of length 200 are the largest structures from the respective data sets that we designed with RNAinverse.
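A fit line on a plot with logarithmic axes corresponds to a model of the form t = a·n^b. The sketch below shows one common way to obtain such a fit, by ordinary least squares on the log-transformed data. It is an assumption that a fit of this kind was used here; the function name `fit_power_law` is hypothetical, and the numbers in the usage lines are synthetic, not measurements from this study.

```python
import math


def fit_power_law(sizes, times):
    """Least-squares fit of log(t) = log(a) + b * log(n); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope on the log-log plot = polynomial degree
    a = math.exp(mean_y - b * mean_x)  # intercept on the log-log plot
    return a, b


# Synthetic example: run times that grow exactly cubically with structure size.
sizes = [50, 75, 100, 125, 150]
times = [2e-6 * n ** 3 for n in sizes]
a, b = fit_power_law(sizes, times)
print(f"t = {a:.3g} * n^{b:.3g}")
```

Fitting on lengths 50 to 150 only and then comparing the extrapolated line with the run times of longer structures, as done in the figure, is a simple check of whether the fitted exponent continues to hold.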
As can be seen from Figure
Search cost distribution of RNASSD. Distribution of expected run time of RNASSD on (a) random structures and (b) biologically motivated structures. For each point, the x-value indicates an expected run time and the y-value corresponds to the fraction of structures whose run time is at most the x-value. We arbitrarily (but unambiguously) report the expected run time for structures that RNASSD is unable to design as 10^6 CPU seconds.
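The points of such a search cost distribution form an empirical cumulative distribution function. A minimal sketch of this computation follows; the names are hypothetical, and representing an undesigned structure by `None` is an assumption made for illustration.

```python
UNDESIGNED = 1e6  # CPU seconds reported for structures that could not be designed


def search_cost_distribution(expected_times):
    """Return sorted (x, y) points, where y is the fraction of structures whose
    expected run time is at most x. A value of None marks an undesigned structure."""
    values = sorted(UNDESIGNED if t is None else t for t in expected_times)
    total = len(values)
    points = []
    for rank, x in enumerate(values, start=1):
        if points and points[-1][0] == x:
            points[-1] = (x, rank / total)  # tie: keep the highest fraction for x
        else:
            points.append((x, rank / total))
    return points
```

With this convention, undesigned structures appear as a final step at 10^6 CPU seconds, so the height of the curve just before that step is the fraction of structures that were designed.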
Search cost distribution of RNAinverse. Distribution of expected run time of RNAinverse on (a) random structures and (b) biologically motivated structures. We report the expected run time for structures that RNAinverse is unable to design as 10^6 CPU seconds.
The random structures are designable by construction since they were obtained by folding a set of random sequences with the
Examples of structures not designed by RNASSD. Structures not designed by RNASSD have short stems separated by loops, indicated by arrows in the Figure. (a) Random structure of length 450 (RND450n84). This is the only random structure in our data set that RNASSD did not design. Note that it has two internal loops separated only by one base pair. (b) Biologically motivated structure of length 74 (BIOM50n262).
As can be seen from Figure
To further explore RNASSD's ability to design larger structures, we evaluated its performance on two additional sets, containing random structures of length 450 and biologically motivated structures of length 500, respectively. In these experiments, we found that RNASSD designed 99.78% of the randomly generated structures and 94.4% of the biologically motivated ones within a cutoff time of 30 CPU minutes.
2.3 Undesignable structures
When examining the structures that appeared to be undesignable by the RNASSD algorithm, we found that they typically have short stems separated by loops, as shown in Figure
Undesignable motifs. Two structure motifs of our data set that are not compatible with the thermodynamic model. Bold lines represent base pairs. (a) Motif B: bulges separated by one base pair. (b) Motif 2I: internal loops separated by one base pair.
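Both motifs share a base pair that is not stacked on any neighbouring pair. As a rough screening aid, the sketch below lists such isolated base pairs in a dot-bracket structure. The function name is hypothetical, and the check is purely structural: it flags candidates only, including isolated pairs in other contexts, and does not evaluate any free energies, so it cannot by itself establish that a structure is undesignable.

```python
def isolated_pairs(db):
    """Return base pairs (i, j), 0-based, that form a stem of length one,
    i.e. neither (i-1, j+1) nor (i+1, j-1) is also a base pair."""
    stack, partner = [], {}
    for i, ch in enumerate(db):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    return [
        (i, j)
        for i, j in sorted(partner.items())
        if i < j
        and partner.get(i - 1) != j + 1
        and partner.get(i + 1) != j - 1
    ]


# Two internal loops separated by a single base pair at positions 4 and 16.
print(isolated_pairs("((..(..((...))..)..))"))
```

A structure containing such a pair could then be examined more closely with the energy argument given in the Appendix.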
Unstable motifs were found in several biologically motivated structures, and they also seem to appear in nature. For example, according to the Comparative RNA Web (CRW) Site, which provides RNA secondary structures based on comparative sequence analysis
2.4 Analysis of RNASSD on secondary structures with constraints
From the previous experiments we learned that the empirical time complexity of the RNA design problem is polynomial for random and biologically motivated structures. Next, we will investigate the hardness of the problem when primary structure constraints are imposed on the design of the biologically motivated structures that we used for the unconstrained case.
The hardness of an instance of this constrained secondary structure design problem not only depends on the given secondary structure, but also on the set of primary structure constraints. To capture the impact of the primary structure constraints on the performance of RNASSD, we used every secondary structure with a number of different sets of primary structure constraints; furthermore, because of the stochastic nature of RNASSD, we performed multiple runs of our algorithm for each such problem instance. The expected CPU time required to design a structure with a given set of primary structure constraints was estimated from these runs. Most of our analysis is based on the median expected run time of RNASSD over all sets of constraints for a given structure. Because of the computational burden incurred by the large number of runs per secondary structure required by this protocol, we performed these experiments on smaller sets of biologically motivated structures; these sets were obtained by uniform random sampling (without replacement) from the respective sets used for our empirical analysis of the unconstrained case. Two different methods were used to create sets of primary structure constraints. One of these essentially selects the base positions to be fixed within the given structure at random, while the other fixes the base assignments of entire stems. In both cases, the bases in the selected positions are fixed according to a sequence that folds stably into the given structure. (Details are described in Section 5.)
As can be seen in Figure
Search cost distribution for the design of structures with primary structure constraints using RNASSD. Distribution of expected run time of RNASSD on three structures of approximately 150 bases: RND150n85, BIOM150n89 and VS ribozyme from Neurospora mitochondria. The structures were designed with two sets of primary base constraints: one where the bases are fixed at random positions and another where the bases are fixed on stems for each structure. Both sets have the same range [
Figure
Scaling analysis on biologically motivated structures with different primary structure constraints using RNASSD. Scaling analysis for the median expected run time of biologically motivated structures with no primary base constraints and with bases constrained in fifty percent of random positions and fifty percent of stems. The lines represent the polynomial that best fits the data for structures with lengths 50, 75 and 100. The experiment with primary structure constraints is computationally expensive, and for this reason, fewer structures of each length were used. Note that the run times for constrained structures longer than 100 appear below the corresponding fit line.
2.5 Performance of RNASSD with different number and locations of primary base constraints
In a second series of experiments, we studied the correlation between the number of bases constrained and the performance of the RNASSD algorithm. The experiments were conducted using some biological structures from Table
Structures for the study of the performance of RNASSD as a function of primary structure constraints.
No. | Description (source) | Size (bases) | Expected run time [CPU sec] | Number of multiloops | Number of stems
1 | VS ribozyme from Neurospora mitochondria | 167 | 0.64 | 2 | 11
2 | Bio150n38 | 172 | 0.53 | 1 | 9
3 | Bio150n14 | 167 | 12.94 | 2 | 10
4 | Group II intron ribozyme D135 from | 584 | 11.54 | 5 | 32
5 | Bio200n19 | 208 | 7.62 | 3 | 12
6 | Bio150n12 | 150 | 0.16 | | 6
Set of structures used to study the correlation between the primary structure constraints and the performance of RNASSD. Structures with similar characteristics (such as size, number of multiloops, etc.) appear in the same group. The structure Bio150n12 was included in this set because it is relatively easy to design.
Figure
Impact of constrained bases on the difficulty of secondary structure design using RNASSD. Correlation between the fraction of bases constrained in a particular structure (x-axis) and the median expected run time for designing the structure with RNASSD (y-axis). We report the fraction of constrained bases after propagation for constraints on randomly chosen base positions. This fraction, for both randomly chosen bases and stems, corresponds to the median fraction of bases constrained in a set of 50 constraints that were generated by fixing a given percentage of bases or stems. There are two curves in each graph, one for designing structures with base constraints located in random positions and the other for constraints located in stems. (a) VS ribozyme from Neurospora mitochondria; (b) Group II intron ribozyme D135 from
3 Discussion
In earlier work by Andronescu et al.
Both RNAinverse and RNASSD failed to design some structures, but there was no case in which RNASSD was unable to design a structure solved by RNAinverse. Some of the structures that could not be designed by RNASSD contain motifs that are provably not allowed by the thermodynamic model of RNA secondary structure and are hence inherently undesignable using that model. Such motifs contain short stems that are not stable enough to compensate for the penalty associated with the adjacent loops; we have observed similar motifs in all structures that RNASSD failed to design, and suspect that most (if not all) of these structures may be inherently undesignable. On the other hand, we also found inherently undesignable structural motifs in trusted structures of biological RNAs. This could be due to inaccuracies of the thermodynamic model commonly used for RNA secondary structure, tertiary structure effects or interaction of the RNA with other molecules, which prevent it from folding into its "true" MFE conformation.
We also found that artificially generated structures with statistical features derived from trusted biological structures (here called "biologically motivated structures") are easier to design than structures of random sequences, probably because they contain more structural motifs that are easy to design. Also, for the undesigned trusted biological structures, it is not clear
One of the improvements over the first version of our RNASSD algorithm (as described by Andronescu et al.
However, we observed marked improvements in the running time and success rates in many cases. For example, Andronescu et al.
Our study also sheds light on the hardness of designing structures with primary structure constraints. In particular, our detailed analysis of primary structure (that is, base sequence) constraints on the performance of RNASSD suggests that it is generally easier to design a structure when the stems are constrained. This is intuitively plausible, given that generally, stems represent the most stable parts of RNA secondary structures. However, there are exceptions: structure
Biologically motivated structure Bio150n14. Biologically motivated structure with ten stems. When constraining the bases in stems 7 and 8, this structure is hard to design. The structure motif formed by these stems, which are short and separated by a bulge, is unstable.
We also observed that for structures with similar characteristics (same number of bases, multiloops or stems, or same difficulty to design without constraints), the behaviour of the RNASSD algorithm shows significant qualitative variation. Structures such as that of the
However, the problem can also get harder as the number of constrained bases increases and then becomes easier again, as approximately 80% or more of the bases are constrained or when all the stems are constrained. This is observed for structures
Somewhat surprisingly, as can be observed for structure
It should be noted that our empirical complexity results do not rule out the possibility that the RNA secondary structure design problem (with or without primary structure constraints) could be NP-hard, but suggest that such worst-case asymptotic scaling is not reflected in the typical behaviour of existing algorithms applied to distributions of random and biologically plausible structures studied here. However, careful examination of our scaling data indicates that the degree of the polynomial characterising the scaling of run time with structure size is considerably higher for the hardest structures in our test sets than it is for typical or easy structures, which could be seen as an indication of possible exponential scaling of the run time of RNAinverse and RNASSD in the worst case.
4 Conclusion
We have presented an empirical analysis of the design of RNA secondary structures with the RNAinverse algorithm from the Vienna RNA Package and with an improved version of RNASSD that supports primary structure constraints. Our analysis helps us to better understand the strengths and limitations of both algorithms. For this study we used a large set of structures (5000 in total) of different lengths, generated either randomly or with structural and statistical properties (such as loop size, number of multiloops, etc.) based on different classes of biological RNAs. We investigated the hardness of designing these structures without primary structure constraints and with different numbers and locations of base constraints in the structure. In every case, the observed run time scaled polynomially with the size of the structure. Experiments on biologically motivated structures show that, in general, it is advantageous to impose primary base constraints in stems. When we tried to determine whether structure design becomes easier as the number of fixed positions increases, we found that this is not always the case. The design of some structures gets harder when approximately 50% of the bases are constrained. This suggests a reduction in the effective search space size that depends on the properties of the structure.
RNASSD performs substantially better than RNAinverse, both in terms of speed and with respect to the structures that can be designed within a given amount of time. We compared both algorithms on random structures without primary structure constraints and found that the scaling of the median expected run time is about
We also identified some structural motifs that make the RNA design task harder (data not shown). In particular, short stems separated by loops are difficult to design. Short stems are not stable enough to compensate for the penalties associated with adjacent loops, and therefore, energetically more favourable motifs are preferred. Some of these motifs are not allowed by the thermodynamic model
The results of this study suggest further improvements to the RNASSD algorithm. For example, it is possible that structural splitting leads to substructures that, apart from the cap structure, are completely determined by primary base constraints. Such substructures can cause artificial challenges to our search algorithm and should be treated differently. Alternatively, the structural decomposition approach could be modified in such a way that the fraction of constrained bases in each substructure is balanced. Another improvement which has already been proposed by Andronescu et al.
Interactions between RNA molecules are of substantial biological interest, and we are therefore planning to extend RNASSD to the design of duplexes of interacting RNAs. With this extension of the algorithm, it will be possible to design pairs of strands in biomolecular nanostructures
Very recently, Busch and Backofen
However, compared to RNAinverse and RNASSD, RNAINFO is more biased towards sequences that form low-energy structures and can hence be expected to find more restricted ensembles of solutions to any given RNA secondary structure design problem. We conjecture that by combining features of RNASSD and RNAINFO, in particular RNASSD's less biased initialisation and balanced hierarchical decomposition approach with RNAINFO's more efficient SLS procedure, further performance improvements could be achieved. Furthermore, RNAINFO currently does not support primary structure constraints, and it would be interesting (and not too hard) to incorporate these into a future version.
5 Methods
To investigate the empirical complexity of designing structures without constraints we used the following data sets. We generated random structures by folding, with the
Sets of randomly generated structures.
Set name | Size (bases) | Number of structures
RND50 | 50 | 1000
RND75 | 75 | 1000
RND100 | 100 | 100
RND125 | 125 | 100
RND150 | 150 | 100
RND200 | 200 | 100
RND450 | 450 | 100
Sets of structures generated by folding random sequences with the
Sets of biologically motivated structures.
Set name | Size (bases) | Number of structures
BIOM50 | [50,75) | 1000
BIOM75 | [75,100) | 1000
BIOM100 | [100,125) | 100
BIOM125 | [125,150) | 100
BIOM150 | [150,175) | 100
BIOM200 | [200,225) | 100
BIOM500 | [500,525) | 100
Sets of structures generated with the RNA structure generator, using the parameters from Table 3.
For the experiment in which RNASSD was used to design structures with primary structure constraints, we utilised only biologically motivated structures. This experiment was computationally expensive because it required the design of a given structure with several constraints. For this reason, we chose subsets of the previously described sets of biologically motivated structures by means of random sampling (without replacement). These subsets consist of 50 structures of the data sets BIOM50 and BIOM75; 45 structures of the data set BIOM100; and 10 structures of the data sets BIOM125, BIOM150, BIOM200 and BIOM500, respectively.
The primary base constraints were generated in the following way. For each structure, we used RNASSD to obtain 100 sequences that are computationally predicted to fold into it. Of these, we selected the sequence that gave the most stable MFE structure and used it for generating base constraints for certain positions using two different methods. In one method, we sampled 50% of the sequence positions uniformly at random (without replacement). Additionally, when generating a constraint for a paired base, we also generated a constraint for the base to which it is paired to be fixed to the correct Watson-Crick complementary base; consequently, more than 50% of the bases may be fixed in the resulting design problem. In the other method, we sampled 50% of the stems in the given structure uniformly at random (without replacement) and fixed all bases occurring in these stems.
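The two generation methods can be sketched as follows. This is an illustration under stated assumptions, not the code used in our experiments: the function names are hypothetical, unconstrained positions are written as N, `partner` maps each paired position to its partner, `stems` is a list of stems given as lists of (i, j) pairs, and a partner that was itself sampled keeps the base of the reference sequence.

```python
import random

COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def random_position_constraints(sequence, partner, fraction=0.5, rng=random):
    """Fix a random fraction of positions to the bases of `sequence`; the partner
    of each fixed paired base is fixed to the Watson-Crick complement."""
    n = len(sequence)
    chosen = set(rng.sample(range(n), round(fraction * n)))
    constraint = ["N"] * n
    for i in chosen:
        constraint[i] = sequence[i]
    for i in chosen:
        j = partner.get(i)
        if j is not None and j not in chosen:
            constraint[j] = COMPLEMENT[sequence[i]]  # constraint propagation
    return "".join(constraint)


def stem_constraints(sequence, stems, fraction=0.5, rng=random):
    """Fix all bases that occur in a random fraction of the stems."""
    chosen = rng.sample(range(len(stems)), round(fraction * len(stems)))
    constraint = ["N"] * len(sequence)
    for s in chosen:
        for i, j in stems[s]:
            constraint[i] = sequence[i]
            constraint[j] = sequence[j]
    return "".join(constraint)
```

The propagation step in the first function is the reason why the fraction of fixed bases can exceed the sampled 50%.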
To control for the variation in run time of the design algorithms due to the choice of constrained bases, we generated all of the possible sets of constraints in cases where this number was found to be less than 50, and random samples of size 50 otherwise. Thus, for each structure in a test set, we considered up to 50 possible sets of constraints obtained by each of the two generation methods. For structures of length 500, which are computationally expensive to design, we used only 10 instead of 50 constraint sets (also obtained by random sampling without replacement).
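The choice between exhaustive enumeration and sampling can be illustrated for subsets of stems as follows. The sketch assumes that a set of constraints is identified with a k-element subset of item indices and that sampled sets are required to be distinct; the function name and the `limit` parameter are hypothetical.

```python
import itertools
import math
import random


def constraint_subsets(n_items, k, limit=50, rng=random):
    """Return all k-subsets of range(n_items) if there are fewer than `limit`,
    otherwise `limit` distinct k-subsets drawn uniformly at random."""
    if math.comb(n_items, k) < limit:
        return [frozenset(c) for c in itertools.combinations(range(n_items), k)]
    seen = set()
    while len(seen) < limit:
        seen.add(frozenset(rng.sample(range(n_items), k)))
    return sorted(seen, key=sorted)
```

For a structure with six stems, of which three are fixed, there are only 20 possible sets, so all of them would be used; with ten stems, of which five are fixed, there are 252 possible sets and 50 of them would be sampled.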
All computational experiments were carried out on PCs with dual Intel Xeon 2.40 GHz processors (only one processor was used in our experiments), 512 KB cache, and 1 GB RAM running Red Hat Linux, kernel version 2.6.5-1.358smp. Both RNASSD and RNAinverse are highly stochastic algorithms: when applied to the same structure multiple times, the time for finding a solution may vary substantially. (Note, however, that by using the same random seed, any run of RNASSD can be perfectly reproduced.) Therefore, it is necessary to perform sufficiently many runs on each problem instance in order to get reasonably stable statistics on run time. For the unconstrained experiment we performed 50 runs on a given structure and estimated the expected time required for finding a solution as
where
For the experiments with primary structure constraints, 50 runs were performed for each structure and set of primary structure constraints. The expected CPU time required for designing a structure with a given constraint was estimated from these runs using the same formula as in the unconstrained case, and the median over the 50 sets of constraints per structure was used for all analyses.
(The data sets and the algorithm will be made available online at the time of publication.)
Authors' contributions
RAH, HH and AC jointly developed the improvements and extensions of RNASSD, designed experiments and analysed results; RAH implemented the improved version of RNASSD and performed all experiments; all authors were involved in writing the paper.
Appendix
Consider the structure motif B from Figure
Energetically favourable structures. (a) Motif I: internal loop formed by breaking the base pair
Let
Δ
following the notation of Andronescu
Free energy parameters for internal loops.
Parameter | Explanation | Values
Δ | Destabilizing energy of internal loop of size | 1.7, 1.8, 2.0,...
Δ | Destabilizing energy of bulge of size | 3.8, 2.8, 3.2,...
Δ | Terminal mismatch free energy of closing base pair ( | 1.1, 0.7, 0.4, 0.0 and 0.7
Δ | Penalty for asymmetric internal loops |
Δ
+ Δ
+ Δ
⇒
+ Δ
= 2.7 + 2·0.7 = 4.1
On the other hand,
Δ
= 2.8 + 3.2 = 6 ∀
Then
Therefore,
Δ
Consider the motif 2
Δ
Δ
Δ
Δ
where
Δ
⇒ Δ
On the other hand,
Δ
+ [Δ
⇒ Δ
⇒ Δ
Therefore,
Δ
Acknowledgements
This material is based upon work supported by the Natural Sciences and Engineering Research Council of Canada, the Mathematics of Information Technology and Complex Systems (MITACS) Network of Centres of Excellence, the Government of Canada Awards Program and CONACyT (Consejo Nacional de Ciencia y Tecnología).