Department of Statistics, University of Oxford, Oxford OX1 3TG, UK

Department of Computer Science, University of Oxford, Oxford OX1 3QD, UK

ETH Zürich, Department of Biosystems Science and Engineering, 4058 Basel, Switzerland

Mathematics Institute, University of Oxford, Oxford OX1 3LB, UK

Abstract

Background

RNA secondary structure prediction, or folding, is a classic problem in bioinformatics: given a sequence of nucleotides, the aim is to predict the base pairs formed in its three dimensional conformation. The inverse problem of designing a sequence folding into a particular target structure has only more recently received notable interest. With a growing appreciation and understanding of the functional and structural properties of RNA motifs, and a growing interest in utilising biomolecules in nano-scale designs, the interest in the inverse RNA folding problem is bound to increase. However, whereas the RNA folding problem from an algorithmic viewpoint has an elegant and efficient solution, the inverse RNA folding problem appears to be hard.

Results

In this paper we present a genetic algorithm approach to solve the inverse folding problem. The main aims of the development were to address the hitherto mostly ignored extension of the inverse folding problem, the multi-target inverse folding problem, while simultaneously designing a method with superior performance when measured on the quality of designed sequences. The genetic algorithm has been implemented as a Python program called Frnakenstein. It was benchmarked against four existing methods on several data sets totalling 769 real and predicted single structure targets, and on 292 two structure targets. It performed as well as or better at finding sequences which folded

Conclusions

Our method illustrates that successful designs for the inverse RNA folding problem do not necessarily have to rely on heavy biases in base pair and unpaired base distributions. The design problem seems to become more difficult on larger structures when the target structures are real structures, while no deterioration was observed for predicted structures. Design for two structure targets is considerably more difficult, but far from impossible, demonstrating the feasibility of automated design of artificial riboswitches. The Python implementation is available at

Background

The function of the RNA molecule depends on the way it folds – structural changes can change protein binding sites, or affect activity for ribozymes, for example. RNA folding allows the single strand of nucleotides to fold upon itself and form more complex structures such as helical junctions and pseudoknots; almost as soon as RNA started to be sequenced, methods were established to determine the structure from the sequence of nucleotides. Early attempts include

The inverse RNA folding problem is defined as follows: given a particular RNA secondary structure (target structure), find a sequence of nucleotides that would fold into this structure. One could adopt two possible solution techniques: either find an exact match, i.e. a sequence whose predicted structure matches the target structure exactly, or look for a sequence whose predicted structure is as close as possible to the target structure (a suboptimal solution). In the latter case, the inverse folding problem becomes an optimization problem: the goal is to minimize a distance metric defined between the given target structure and the predicted structure of a sequence. Here we consider only base pairs {

There are several existing approaches to the inverse RNA folding problem. RNAinverse

However, there are currently no implementations available capable of solving the inverse folding problem under multiple structural constraints. To our knowledge, the only existing method for inverse folding with multiple structure targets was published by

Methods

The inverse folding is implemented by a fairly standard genetic algorithm (GA) approach

In addition to a random search, our method also implements several strategies for more directed evolution. Instead of making uninformed evolutionary changes and leaving it to the selection part of the GA to direct the search, it is also possible to bias the choice of change, e.g. towards mutating positions with a predicted structure that does not match the target, or, in a recombination, towards parents that have a good match to the target structure in the regions where they contribute sequence. These strategies are all available through command line options (and as classes when the implementation is used as a library rather than as a stand-alone program) to allow experimentation and tailoring to specific applications. Through experimentation, a set of parameters that worked reasonably well on all types of structure was found, and this set is used as the pre-defined default. In the following, we will use

Positional fitness

A key concept in the GA is the fitness of positions. This allows directing evolutionary changes by choosing unfit positions for mutations – an approach also taken in

We define positional fitness schemes relative to a single target structure

**Scheme 1.** Binary indicator of whether position has correct predicted structure,

**Scheme 2.** Boltzmann probability of target structure,

**Scheme 3.** Truncated negative logarithm of Boltzmann probability of target structure,

**Scheme 4.** Binary indicator of whether probability of target structure exceeds threshold

**Scheme 5.** Sigmoid transformed difference between Boltzmann probability of target structure and most probable alternative structure,
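As a minimal sketch of the quantities named in schemes 1 to 4, assume that structures are given in dot-bracket notation and that the per-position Boltzmann probabilities of the target structure have already been computed elsewhere (for example from a partition function calculation, which is not shown). The function names, the cap of 10 and the threshold of 0.5 are illustrative assumptions, and whether each quantity is used directly or inverted to act as a cost is not specified here.

```python
import math


def pair_table(structure):
    """Partner index of every position in a dot-bracket string, -1 if unpaired."""
    partner = [-1] * len(structure)
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner


def correct_indicator(target, predicted):
    """Scheme 1 quantity: 1 where a position has the same partner (or is
    unpaired) in both the target and the predicted structure, else 0."""
    t, p = pair_table(target), pair_table(predicted)
    return [1 if a == b else 0 for a, b in zip(t, p)]


def truncated_neg_log(probs, cap=10.0):
    """Scheme 3 quantity: -log(p), truncated at `cap` (assumed value)."""
    return [min(cap, -math.log(p)) if p > 0 else cap for p in probs]


def exceeds_threshold(probs, threshold=0.5):
    """Scheme 4 quantity: 1 where the probability exceeds `threshold` (assumed value)."""
    return [1 if p > threshold else 0 for p in probs]


target = "((..))"
predicted = "(....)"
probs = [0.9, 0.2, 0.8, 0.8, 0.2, 0.9]  # made-up scheme 2 values, for illustration only

print(correct_indicator(target, predicted))
print(truncated_neg_log(probs))
print(exceeds_threshold(probs))
```

In this toy example the inner base pair is missing from the predicted structure, so positions 1 and 4 are the ones flagged by the indicator and the ones with low made-up probabilities.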

The

Finally, the following positional fitness schemes are specifically designed for multiple structure targets:

**Scheme 6.** Minimum Boltzmann probability of target structures,

**Scheme 7.** Product of Boltzmann probabilities of target structure,

For single structure targets, both are equivalent to the Boltzmann scheme, scheme 2. For multiple structure targets, scheme 6 exclusively focuses on the worst fitness over all target structures, while scheme 7 includes Boltzmann probabilities from all target structures. However, by multiplying the probabilities, having a low fitness on just a single target structure will have a much more notable effect than under the sum implicit in the averaging of scheme 2.
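The difference between taking the minimum, the product, and the average can be seen on a single position with two targets. The following is a small illustration with made-up probabilities; the function name `combine` is hypothetical.

```python
import math


def combine(probs_per_target):
    """Summaries of one position's Boltzmann probabilities, one per target structure."""
    return {
        "min": min(probs_per_target),              # quantity used by scheme 6
        "product": math.prod(probs_per_target),    # quantity used by scheme 7
        "mean": sum(probs_per_target) / len(probs_per_target),  # averaging, as for scheme 2
    }


balanced = combine([0.6, 0.6])     # moderately good on both targets
lopsided = combine([0.95, 0.05])   # very good on one target, very poor on the other

print(balanced)
print(lopsided)
```

The mean drops only from 0.6 to 0.5 between the two cases, whereas the minimum and the product both fall to roughly 0.05, which is the stronger penalty for failing on a single target described above.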

In addition to using a single one of the above schemes, it is possible to use a weighted combination of any subset of them to define positional fitness. For example, combining the first two schemes would divide positions based on whether they have a predicted structure matching the target, and further grade the fitness by the marginal Boltzmann probability of the target structure at each position.

The concept of positional fitness underpins most operations of the GA: mutation, recombination, selection, and termination. Whenever fitness of a region (for recombination cross over point selection) or the entire sequence (for selection and termination) is needed, this is obtained as the sum of the positional fitnesses in the region or sequence. Different positional fitness schemes can be used for these four aspects, with the limitation that negative logarithms of Boltzmann probabilities are only used for mutation, and product of Boltzmann probabilities cannot be used for mutation.
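A weighted combination of schemes, and the summation over a region or a whole sequence, can be sketched as follows. The per-position values are made up and are treated here as costs (higher means less fit), which is an assumption of this sketch; the function names are hypothetical.

```python
def combine_schemes(scheme_values, weights):
    """Weighted per-position sum of the values produced by several schemes."""
    length = len(scheme_values[0])
    return [
        sum(w * values[i] for w, values in zip(weights, scheme_values))
        for i in range(length)
    ]


def region_fitness(positional, start, end):
    """Fitness of the half-open region [start, end): sum of positional fitnesses."""
    return sum(positional[start:end])


# Made-up example values for a sequence of length 6.
structure_errors = [0, 1, 0, 0, 1, 0]              # 1 where the predicted structure is wrong
probability_cost = [0.1, 0.8, 0.2, 0.2, 0.8, 0.1]  # e.g. 1 - probability of the target

positional = combine_schemes([structure_errors, probability_cost], weights=[2, 1])
sequence_fitness = region_fitness(positional, 0, len(positional))

print(positional)
print(region_fitness(positional, 1, 3), sequence_fitness)
```

With a 2:1 weighting, positions with a structural error dominate the total, while the probability term still separates positions that share the same indicator value.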

Fitness and objective

Often fitness and objective of GAs are considered equivalent, but we make the distinction of using fitness for the selection in each round of the GA and objective for determining when an adequate solution has been found and the search can be terminated. In a standard design problem where the aim is to find a sequence folding to one specific target structure, it is natural to base the objective on whether positions are correct in the predicted structure and terminate when the number of errors reaches 0. However, a more fine grained selection may be desirable, for example substituting or combining the number of errors with scheme 2 – instead of choosing randomly between two sequences with e.g. 10% positions that are wrong in the predicted structure, we would prefer the one with higher probabilities of positions being correct.

A global, i.e. non-positional, scheme

**Scheme 8.** Logarithm of structure probabilities in Boltzmann ensemble and their variance:

based on the cost functions discussed in

Finally, to maintain diversity in the GA population, the fitness can be augmented with a weighted contribution from the average Hamming distance to already selected sequences. If

Mutation

The position targeted for mutation in a sequence is chosen either uniformly at random, or with probability proportional to positional fitnesses. Similarly, sequences can be chosen for mutation either equally many times, uniformly at random, or with probability proportional to the reciprocal of the sequence fitness (with sequences with fitness 0 given twice the probability of the otherwise most fit sequences).
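The two biased sampling rules described above can be sketched with the standard library. This is an illustration under the assumption that fitness values are non-negative costs with 0 as the best value; the function names and the uniform fallback when all positional fitnesses are zero are assumptions of this sketch.

```python
import random


def sequence_weights(fitnesses):
    """Weights proportional to 1/fitness; a sequence with fitness 0 gets twice
    the weight of the otherwise most fit sequence."""
    reciprocals = [1.0 / f for f in fitnesses if f > 0]
    top = max(reciprocals) if reciprocals else 1.0
    return [1.0 / f if f > 0 else 2.0 * top for f in fitnesses]


def choose_sequence(fitnesses, rng=random):
    """Index of the sequence to mutate, sampled with the weights above."""
    indices = range(len(fitnesses))
    return rng.choices(indices, weights=sequence_weights(fitnesses), k=1)[0]


def choose_position(positional_fitness, rng=random):
    """Position to mutate, sampled proportionally to positional fitness.
    Falls back to a uniform choice if every value is zero (assumption)."""
    if sum(positional_fitness) == 0:
        return rng.randrange(len(positional_fitness))
    indices = range(len(positional_fitness))
    return rng.choices(indices, weights=positional_fitness, k=1)[0]


rng = random.Random(0)
print(sequence_weights([4.0, 2.0, 0.0]))
print(choose_sequence([4.0, 2.0, 0.0], rng))
print(choose_position([0.1, 2.8, 0.2, 0.2, 2.8, 0.1], rng))
```

Passing a seeded `random.Random` instance keeps runs reproducible, which is convenient when comparing different biasing options.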

When choosing a new nucleotide for a position chosen for mutation, we want to maintain

The

In
**P** hard

The following forms our method for sampling nucleotides on a connected component in the TDG, starting with position

Choose

**while**
**do**

Choose

Choose

**end while**

where

Recombination

Due to the hierarchical nature of RNA secondary structures, the GA uses recombination mimicking gene conversion rather than cross over, i.e. an infix of one sequence is recombined with the corresponding prefix and suffix of the other sequence. The easiest way to keep all base pairs canonical is to always take two positions forming a base pair in a target structure from the same sequence. If we create a recombinant on sequences

**for**
**do**

^{
′
}=

for **do**

^{
′
}=^{
′
}∪

(

**endfor**

^{
′
}

**end for**
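The principle stated above, that both positions of any target base pair should come from the same parent, can be illustrated for a single infix as follows. This is a sketch of that principle only, not a reconstruction of the procedure listed above; the function names and example sequences are hypothetical.

```python
def base_pairs(structure):
    """List of (i, j) base pairs in a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return pairs


def is_permissible(i, j, targets):
    """True if no base pair of any target has exactly one end inside [i, j)."""
    for structure in targets:
        for a, b in base_pairs(structure):
            if (i <= a < j) != (i <= b < j):
                return False
    return True


def gene_conversion(recipient, donor, i, j, targets):
    """Replace the infix [i, j) of `recipient` with the same infix of `donor`."""
    if not is_permissible(i, j, targets):
        raise ValueError("infix would split a target base pair between parents")
    return recipient[:i] + donor[i:j] + recipient[j:]


targets = ["((..)).."]
recipient = "GGAACCAA"
donor = "CCUUGGUU"

print(is_permissible(1, 5, targets))   # inner pair (1, 4) lies fully inside
print(is_permissible(0, 2, targets))   # would split pair (0, 5)
print(gene_conversion(recipient, donor, 1, 5, targets))
```

Because each parent contributes both ends of every base pair it covers, the recombinant keeps canonical pairs wherever both parents had them.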

As a starting point, pairs of points are chosen by first choosing a set of pairwise permissible points with probability proportional to the set size, then choosing a pair from the set uniformly at random, ensuring an overall uniform probability that a point is chosen. This distribution can be biased proportional to

where _{
s
} and _{
t
} are positional fitnesses for

Initialisation

Initialisation of sequences in the starting population can either be done randomly, by sampling nucleotides for each connected component in the TDG as outlined for mutation, but without the presence of current nucleotides, or by running RNAinverse from a random starting point. The latter option, an approach also used by RNAexinv

Data

Data was taken from two main sources, to benchmark Frnakenstein and other inverse folding methods. The first data set used in our benchmarks is the data set used in

Secondly, data was taken from RNASTRAND

However, with both data sets, it is possible that there is no sequence which RNAfold will fold into the reference structure, so a method might not be able to achieve 100% accuracy because of RNAfold rather than the search heuristic. Consequently the sequences corresponding to the structures in the RNASTRAND data set were re-folded using RNAfold, so that at least one sequence is known to fold correctly into each target. This dataset will be denoted as the

Results and discussion

Multi-structure targets

One of the main objectives of Frnakenstein was to develop a method capable of solving the inverse folding problem under multiple structural constraints. As mentioned earlier, the only existing method for inverse folding with multiple structure targets was published by


**Design of artificial SV11 RNA.** Dot plot of base pair Boltzmann probabilities for the designed sequence for the bistable SV11 target. Superimposed on the dot plot is a plot of base pairs in the two metastable SV11 structures, shown with open squares in different shades of grey. The secondary structures are also shown in the same shades of grey. Dots reflecting Boltzmann probabilities were rescaled by a factor of 0.75 to clearly separate them from any enclosing square representing a structure base pair. The two conformations, which share no base pairs, are also shown: the native state (top) and the metastable state (bottom).

For all benchmarks on multiple structure targets we set the number of generations to the total target length, i.e. the number of structures in the target multiplied by the length of the sequence to be designed. Again Frnakenstein was run with default values, which means that compared to the single structure target default outlined above, positions for mutations were chosen based on a 1:1:1 combination of schemes 1, 2, and 3; cross over points were chosen based on a 1:1:2 combination of schemes 1, 2, and 7; fitness was based on a 1:1:2:4 combination of schemes 1, 2, and 7, and the diversity maintaining contribution from Hamming distances, except for the SV11 example above where a fitness based on scheme 8 with

To some extent the SV11 target poses an impossible challenge, as we cannot find a sequence having both conformations as the most stable structure. Hence, for bi-stable targets, we cannot measure performance by simply reporting successes and failures. To avoid this problem, we decided to test the performance of multiple structure target design by providing target structures at two different temperatures. We folded the 304 sequences with at most 200 nucleotides of the RNASTRAND data set under 20°C and 37°C, simulating a change from room temperature to normal body temperature. After eliminating duplicates, 291 two structure targets remained.

177 targets, with a total of 11,188 nucleotides, had identical structures at the two temperatures. 114 targets, with a total of 10,578 nucleotides, had different structures at the two temperatures, with an average of 16.9 positions where the structures differed. Even when the structures differ, it is possible to design a sequence that folds to the correct target structure at each temperature, so performance can be measured simply as the number of successes. This does not directly test the ability to design a bi-stable molecule. However, it does test performance on multiple structure targets in a related, realistic scenario for inverse RNA folding, where the aim is to design a molecule that, under different conditions, either remains stable or performs as a riboswitch reacting to the change in conditions.

Table

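Performance on 291 two-structure targets generated by folding shorter RNASTRAND sequences at 20°C and 37°C. Columns are grouped by whether the design folded correctly at both temperatures, at one, or at none. For each group, n is the number of targets, length is the average target length, diff. is the average number of positions where the two target structures differ, and errors is the average number of positions where the predicted structure of the design is wrong.

| Targets | Both: n | Both: length | Both: diff. | One: n | One: length | One: diff. | One: errors | None: n | None: length | None: diff. | None: errors |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Identical structures | 173 | 61.5 | – | 0 | – | – | – | 4 | 139.0 | – | 3.3 |
| Different structures | 54 | 71.5 | 7.5 | 32 | 94.1 | 19.9 | 8.1 | 28 | 132.5 | 31.8 | 9.6 |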

Successful designs were obtained for 227 targets. Of the remaining 64 targets, the design folded correctly at one temperature for 32 targets, with an average of 8.1 positions where the predicted structure of the design differed from the target at the other temperature. For the remaining 32 targets, the design did not fold correctly at either temperature, with an average of 8.8 positions being wrong. It should be remembered that targets were created by folding a specific sequence at two different temperatures using RNAfold, so in all cases we know that a perfect design does exist. Comparing this to the results obtained on RNASTRAND-Refolded, it is evident that the multiple structure target problem is considerably more difficult than the single structure target problem, in particular when the target structures differ.
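The average of 8.8 wrong positions can be checked from the table: among the 32 targets where the design failed at both temperatures, 4 had identical target structures (3.3 wrong positions on average) and 28 had different target structures (9.6 on average), so the weighted mean is

$$
\frac{4 \times 3.3 + 28 \times 9.6}{4 + 28} = \frac{13.2 + 268.8}{32} = \frac{282.0}{32} \approx 8.8 .
$$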

Single-target structures

Since Frnakenstein works for single targets too, the performance of our method on single targets could be benchmarked against other methods that are publicly distributed as source code or executables. This includes RNAinverse

For each method being benchmarked, efforts were made to give it the same number of attempts at the problem, despite the methods employing different search heuristics. Our method and MODENA were both run 10 times with a population size of 50, and a number of generations equal to the number of positions in the structure (with a minimum of 50 generations). RNAinverse, RNA-SSD, INFO-RNA, NUPACK: Design, and Inv were all run 500 times with default parameters. All methods apply the Vienna RNA package for structure prediction, except for NUPACK: Design, which uses the NUPACK suite, and Inv, which uses its own thermodynamic model. NUPACK allows interior loops of arbitrary sizes, whereas the Vienna package limits interior loops to a maximum size of 30. This allows NUPACK: Design to report a successful design on target structures with large interior loops, where all other methods will necessarily fail due to this limitation of RNAfold. Notably this is seen on the RF00016 and RF00024 structures in the Rfam data set, which contain interior loops with 83 and 67 unpaired nucleotides, respectively. Inv is even more restrictive on permissible structures, allowing only structures with a minimum stack size of 3 and a minimum arc length of 4. This means that it deems many trusted structures, for instance in the Rfam database, invalid and produces an error, suggesting it will perform badly in the benchmarks.
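One reading of this budget matching is that 10 GA runs with a population of 50 give 500 candidate sequences per generation, the same count as the 500 independent runs of the other methods. The small sketch below encodes that reading together with the stated rule for the number of generations; the function names are hypothetical.

```python
def generations_for(structure_length, minimum=50):
    """Number of GA generations: the structure length, but at least `minimum`."""
    return max(minimum, structure_length)


def ga_attempts(runs=10, population=50):
    """Candidate sequences per generation summed over all independent GA runs."""
    return runs * population


print(ga_attempts())            # compared with 500 runs of the non-GA methods
print(generations_for(30))      # short structure: the minimum applies
print(generations_for(117))     # longer structure: one generation per position
```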

Frnakenstein was run with default values, which means mutation was biased towards fit sequences, positions were chosen based on a 2:1 combination of schemes 1 and 2; recombination was biased towards pairs of sequences with good complementary match and cross over points chosen based on a 1:1 combination of schemes 1 and 2; fitness was based on a 1:1:2 combination of schemes 1 and 2, and the diversity maintaining contribution from Hamming distance to already selected sequences, while objective was simply the number of erroneous positions in the predicted structure, cf. scheme 1. As one cannot determine statistically what fitness, mutation, and recombination schemes will optimise success of the algorithm, the defaults were determined heuristically to make Frnakenstein most likely to find a successful sequence for the target structure.

Rfam structures

Results from benchmarking on the Rfam data set are shown in Table

Comparison of seven inverse folding methods on 29 Rfam structures.

| Acc. | Len. | Frnakenstein | MODENA | RNA-SSD | INFO-RNA | RNAinverse | NUPACK | Inv |
|---|---|---|---|---|---|---|---|---|
| 01 | 117 | **1.0** / 1204 | **1.0** / 17 | **0.01** / 8.2 | **0.95** / 1.2 | **0.01** / 61 | **0.29** / 11613 | ∗ |
| 02 | 151 | **1.0** / 13505 | **1.0** / 19 | 0.0 / – | 0.0 / 175 | 0.0 / 277 | 0.0 / 141999 | ∗ |
| 03 | 161 | **0.1** / 9997 | **1.0** / 37 | 0.0 / – | **0.01** / 112 | 0.0 / 304 | 0.0 / 50698 | ∗ |
| 04 | 193 | **1.0** / 198 | **1.0** / 27 | 0.0 / – | **0.27** / 64 | **0.10** / 164 | **1.0** / 5597 | ∗ |
| 05 | 74 | **1.0** / 0.14 | **1.0** / 4.8 | **1.0** / 0.34 | **0.99** / 0.31 | **0.87** / 1.4 | **1.0** / 148 | **0.29** / 9.2 |
| 06 | 89 | **1.0** / 27 | **1.0** / 5.9 | **0.98** / 3.4 | **0.66** / 4.6 | **0.06** / 20 | **0.99** / 299 | ∗ |
| 07 | 154 | **1.0** / 115 | **1.0** / 22 | **0.05** / 5.1 | **0.85** / 7.4 | **0.08** / 70 | **0.95** / 1611 | ∗ |
| 08 | 54 | **1.0** / 0.09 | **1.0** / 2.8 | **0.96** / 0.05 | **1.0** / 0.15 | **0.95** / 0.22 | **1.0** / 47 | **0.22** / 1.1 |
| 09 | 348 | **1.0** / 129057 | **1.0** / 123 | 0.0 / – | 0.0 / 4127 | **0.01** / 7100 | **0.78** / 111487 | ∗ |
| 10 | 357 | 0.0 / 245868 | 0.0 / 180 | 0.0 / – | 0.0 / 4046 | 0.0 / 8007 | 0.0 / 92211 | ∗ |
| 11 | 382 | 0.0 / 500078 | 0.0 / 184 | 0.0 / – | 0.0 / 7040 | 0.0 / 16634 | 0.0 / 77273 | ∗ |
| 12 | 215 | **1.0** / 5455 | **1.0** / 35 | 0.0 / – | **0.01** / 329 | **0.01** / 558 | **0.98** / 2825 | ∗ |
| 13 | 185 | **1.0** / 65 | **1.0** / 27 | 0.0 / – | **0.37** / 61 | **0.09** / 127 | **1.0** / 190 | ∗ |
| 14 | 87 | **1.0** / 0.15 | **1.0** / 7.3 | **0.94** / 0.09 | **1.0** / 0.30 | **1.0** / 0.27 | **1.0** / 34 | ∗ |
| 15 | 140 | **1.0** / 333 | **1.0** / 13 | 0.0 / – | **0.51** / 29 | **0.05** / 118 | **1.0** / 40696 | ∗ |
| 16 | 129 | 0.0 / 18734 | 0.0 / 11 | 0.0 / – | 0.0 / 102 | 0.0 / 124 | **0.48** / 10167 | ∗ |
| 17 | 301 | **1.0** / 318 | **1.0** / 117 | 0.0 / – | **0.94** / 21 | **0.23** / 263 | **1.0** / 703 | ∗ |
| 18 | 360 | **1.0** / 210591 | **1.0** / 180 | 0.0 / – | **0.01** / 4260 | **0.01** / 5305 | 0.0 / 101125 | ∗ |
| 19 | 83 | **1.0** / 1.0 | **1.0** / 6.3 | **0.4** / 0.63 | **0.98** / 0.52 | **0.57** / 3.7 | **1.0** / 46 | ∗ |
| 20 | 119 | 0.0 / 3149 | 0.0 / 10 | 0.0 / – | 0.0 / 7.8 | 0.0 / 15 | 0.0 / 810 | ∗ |
| 21 | 118 | **1.0** / 0.23 | **1.0** / 13 | **0.99** / 0.44 | **1.0** / 0.77 | **0.96** / 1.7 | **1.0** / 130 | ∗ |
| 22 | 148 | **1.0** / 293 | **1.0** / 16 | 0.0 / – | **0.15** / 46 | **0.02** / 97 | **1.0** / 5335 | ∗ |
| 24 | 451 | 0.0 / 138348 | 0.0 / 182 | 0.0 / – | 0.0 / 1530 | 0.0 / 4170 | **0.23** / 8533 | ∗ |
| 25 | 210 | **1.0** / 11838 | **1.0** / 29 | 0.0 / – | **0.06** / 132 | 0.0 / 463 | **1.0** / 10420 | ∗ |
| 26 | 102 | **1.0** / 290 | **1.0** / 6.1 | 0.0 / – | **0.04** / 50 | **0.02** / 60 | **1.0** / 1429 | ∗ |
| 27 | 79 | **1.0** / 0.23 | **1.0** / 8.1 | **1.0** / 0.29 | **1.0** / 0.49 | **0.82** / 1.2 | **1.0** / 112 | **0.86** / 9.7 |
| 28 | 344 | **0.3** / 197498 | 0.0 / 125 | 0.0 / – | **0.01** / 4627 | 0.0 / 6003 | **0.09** / 31920 | ∗ |
| 29 | 73 | **1.0** / 33 | **1.0** / 4.1 | **0.82** / 1.4 | **0.67** / 0.58 | **0.2** / 1.5 | **0.72** / 229 | ∗ |
| 30 | 340 | **1.0** / 133627 | **1.0** / 115 | 0.0 / – | **0.01** / 1896 | 0.0 / 5011 | 0.0 / 118373 | ∗ |
| Total successes | | 24 | 23 | 10 | 22 | 19 | 22 | 3 |

This data set was sufficiently small that RNA-SSD could be included in the benchmark by manually uploading the targets to the RNA-SSD server. Our method and MODENA, the only two genetic algorithm based methods in the benchmark, exhibit the best performance, successfully designing sequences for 24 and 23 of the 29 targets, respectively. INFO-RNA also performs well, with 22 successful designs, while RNA-SSD and RNAinverse have more limited success, and Inv performs poorly. Inv succeeds on every target that it does not consider invalid, but it is so restrictive in what it permits that it does not attempt the majority of structures.

Of the 5 target structures for which all RNAfold based methods failed, two (RF00016 and RF00024) have internal loops with more than 30 nucleotides, which makes it impossible to reach a successful design as discussed above. All methods, including NUPACK: Design, failed on the remaining three (RF00010, RF00011, and RF00020). These all contain a bulge of a single nucleotide separated from either a large hairpin loop or the exterior of the structure by an isolated base pair. With the current energy parameters, it is impossible to design a sequence where the same structure with the isolated base pair removed would not be more stable

If we look at how nucleotides and base pairs are utilised by the different methods, Table

Comparison of the nucleotide distributions of the successfully designed sequences from different methods on the Rfam dataset, with distribution observed across the original sequences from Rfam shown in the first row.

| Method | Paired GC | Paired AU | Paired GU | Unpaired A | Unpaired C | Unpaired G | Unpaired U | Total A | Total C | Total G | Total U |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Original data | 0.57 | 0.30 | 0.13 | 0.30 | 0.20 | 0.23 | 0.27 | 0.23 | 0.24 | 0.28 | 0.24 |
| Frnakenstein | 0.55 | 0.36 | 0.09 | 0.32 | 0.31 | 0.09 | 0.29 | 0.25 | 0.29 | 0.19 | 0.26 |
| MODENA | 0.82 | 0.18 | 0 | 0.82 | 0.06 | 0.06 | 0.06 | 0.48 | 0.22 | 0.22 | 0.07 |
| INFO-RNA | 0.93 | 0.06 | 0.01 | 0.36 | 0.22 | 0.20 | 0.22 | 0.19 | 0.35 | 0.32 | 0.14 |
| RNA-SSD | 0.56 | 0.44 | 0 | 0.32 | 0.24 | 0.19 | 0.25 | 0.27 | 0.26 | 0.24 | 0.23 |
| RNAinverse | 0.46 | 0.41 | 0.14 | 0.29 | 0.25 | 0.21 | 0.25 | 0.23 | 0.24 | 0.26 | 0.26 |
| NUPACK | 0.73 | 0.27 | 0 | 0.42 | 0.26 | 0.09 | 0.22 | 0.28 | 0.31 | 0.22 | 0.18 |
| Inv | 0.32 | 0.39 | 0.28 | 0.30 | 0.26 | 0.22 | 0.22 | 0.20 | 0.21 | 0.30 | 0.29 |

The other four methods have less bias in their base pair usage, although only Frnakenstein, RNAinverse, and Inv utilise wobble GU base pairs to any real degree. Indeed, the base pair distribution observed in the Frnakenstein designs is very close to the distribution observed in the original data. There is, however, a mild overrepresentation of Cs and a mild underrepresentation of Gs in the unpaired nucleotides. This is perhaps due to the increased thermodynamic stability of CG base pairs: it is perhaps easy to have Cs unpaired, or Gs unpaired, but with too many of both they will eventually form pairs. RNAinverse, on the other hand, had an unpaired distribution very close to the real data set, but its base pair distribution is off a little. While it may in many cases be less important whether the distributions observed in the designed sequences are heavily biased, one consequence of such bias will be a reduced diversity in the set of solutions that are generated. In this sense, given its performance against RNAinverse and Inv, Frnakenstein is the clear winner when the application dictates a sensible nucleotide distribution.

Frnakenstein was designed with little focus on running time, choosing Python as implementation language for the ease of development and flexibility it offers. Additionally, the more advanced choices in mutation and recombination selection add to the computational burden. Not only does the full partition function for the thermodynamic model have to be calculated to obtain Boltzmann probabilities, but the selection of individuals for recombination and of the recombination points themselves also becomes considerably more computationally expensive. It is thus not overly surprising that among the methods tested Frnakenstein is one of the slowest. Only NUPACK: Design vies with Frnakenstein for bottom slot regarding speed. Even accounting for the fact that average running times should be divided by success ratio to get an approximate value of total time until first successful design, Frnakenstein and NUPACK: Design tend to be three to four orders of magnitude slower than the other four tested methods on some targets. For easy targets, Frnakenstein mitigates this concern by the application of RNAinverse for sequence initialisation. For applications where minimising time to find a successful design is the key priority, as opposed to nucleotide distribution, MODENA is a strong contender, with a run time of at most a few minutes and a high rate of success on most of the Rfam targets.
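The adjustment mentioned above, dividing average running time by success ratio, can be sketched as follows. The example figures are taken from the Rfam table row for target 03, under the assumption that each cell lists the success ratio followed by the average running time; the function name is hypothetical.

```python
def expected_time_to_first_success(avg_time, success_ratio):
    """Approximate total time until the first successful design.
    Returns infinity if no run succeeded."""
    if success_ratio <= 0:
        return float("inf")
    return avg_time / success_ratio


# (success ratio, average running time) for Rfam target 03, as read from the table.
target_03 = {
    "Frnakenstein": (0.1, 9997),
    "MODENA": (1.0, 37),
    "INFO-RNA": (0.01, 112),
    "RNAinverse": (0.0, 304),
}

for method, (ratio, avg_time) in target_03.items():
    print(method, expected_time_to_first_success(avg_time, ratio))
```

On this target the adjustment narrows the gap to INFO-RNA considerably, because INFO-RNA succeeds in only about one run in a hundred, while the gap to MODENA remains at more than three orders of magnitude.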

Additionally, we analysed the performance of different positional fitness schemes. A subset of structures was taken from the Rfam data set and Frnakenstein was run many times with different positional fitness options; for each run, the minimum objective was recorded for each generation. For each positional fitness option, averages were then taken over runs, and results are found in Figure


**Analysis of positional fitness schemes.** Plot showing the minimum objective value in the population through the generations of the GA for the default parameters (solid black line) and ten variations where a single feature is changed by invoking the respective options shown in the legend. These correspond to choosing positions for mutation uniformly at random, as well as based on positional fitness schemes 1, 2, 3, and 5; choosing pairs of sequences for recombination uniformly at random or based on individual fitnesses; and choosing recombination points uniformly at random, as well as based on positional fitness schemes 1 and 2.

RNAStrand data

Results for both the RNASTRAND and RNASTRAND-Refolded can be found in Table

Successes of the benchmarked approaches on the RNASTRAND and RNASTRAND-Refolded data sets.

| Data set | Frnakenstein | MODENA | INFO-RNA | RNAinverse | NUPACK | Inv |
|---|---|---|---|---|---|---|
| RNASTRAND | 189 | 178 | 196 | 176 | 185 | 73 |
| RNASTRAND-Refolded | 383 | 377 | 383 | 336 | 383 | 113 |

These results confirm the general picture seen for the Rfam data set: Frnakenstein, MODENA, and INFO-RNA have similar success rates with RNAinverse lagging slightly behind, although only on the re-folded structures and not quite to the same degree as for the Rfam data set. Inv once again performs poorly, although there are more structures it permits in this data set, once again succeeding on all of them. The dependency of performance on target length, cf. Table

The 363 unique structures of the RNASTRAND data set were binned according to length in 10 bins of roughly equal size, and for each bin, the range of lengths covered by the bin, the average length of structures in the bin, and the number of structures in the bin are listed, as well as the success ratio on each bin computed for each method.

| Range | 10-23 | 24-36 | 37-56 | 57-77 | 78-98 | 99-117 | 118-151 | 152-269 | 270-311 | 312-1037 |
|---|---|---|---|---|---|---|---|---|---|---|
| Av. length | 17.9 | 29.4 | 46.2 | 69.8 | 86.7 | 108.0 | 127.7 | 205.6 | 293.8 | 528.0 |
| Bin size | 38 | 34 | 36 | 41 | 34 | 36 | 35 | 36 | 36 | 37 |
| Frnakenstein | 0.50 | 0.50 | 0.56 | 0.76 | 0.62 | 0.44 | 0.97 | 0.47 | 0.17 | 0.22 |
| MODENA | 0.47 | 0.50 | 0.56 | 0.76 | 0.59 | 0.44 | 0.97 | 0.44 | 0.03 | 0.14 |
| INFO-RNA | 0.50 | 0.50 | 0.56 | 0.78 | 0.65 | 0.53 | 0.91 | 0.50 | 0.28 | 0.19 |
| RNAinverse | 0.50 | 0.50 | 0.56 | 0.78 | 0.62 | 0.42 | 0.97 | 0.42 | 0.08 | 0.00 |
| NUPACK | 0.50 | 0.50 | 0.56 | 0.78 | 0.56 | 0.44 | 0.97 | 0.47 | 0.14 | 0.16 |
| Inv | 0.44 | 0.38 | 0.47 | 0.51 | 0.15 | 0 | 0 | 0 | 0 | 0 |

Conclusions

In this paper we have described how to use a genetic algorithm approach to find useful solutions for the inverse RNA folding problem. The method allows a combination of the predicted minimum free energy structure and the computed Boltzmann distribution over the ensemble of structures to be used to guide the main aspects of the genetic algorithm, in particular mutation, recombination, and selection. It performed as well as, or better than, the other methods tested on all benchmarks, without introducing strong biases in the composition of the designed sequences.

One of the major advantages of our method is that it allows multiple structures, either at identical or different conditions, to be specified as targets. To our knowledge, only one previously published method has this capability, and the software implementing this method is only available on request. While the benchmarks were done on two targets, there are no upper restrictions to how many targets can be aimed for.

Our method uses the RNA secondary structure prediction software as a black box. While a more efficient solution could be obtained by a more complex interaction with the folding software, allowing reuse of already computed values when mutating and recombining sequences, the chosen approach makes Frnakenstein much more flexible. The folding method can be replaced with relative ease, e.g. to use a grammar based method or a method capable of predicting structures with pseudoknots, simply by providing an alternative implementation of the module invoking and parsing the output from the folding software. Combining predictions from several folding methods, possibly using a multi-objective framework similar to MODENA, allows designs more robust to the uncertainties of structure prediction, and is an interesting direction for future research.
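The black-box arrangement described above can be pictured as a small interface that any folding backend has to satisfy. The sketch below is purely illustrative: the `Folder` protocol, the toy `UnfoldedFolder` backend and `symbol_errors` are hypothetical names and do not reflect the actual module layout of Frnakenstein. The error count compares dot-bracket symbols position by position, which is a simplification.

```python
from typing import Protocol


class Folder(Protocol):
    """Anything that maps a sequence to a dot-bracket structure (hypothetical interface)."""

    def fold(self, sequence: str) -> str:
        ...


class UnfoldedFolder:
    """Toy backend that predicts every position as unpaired; a real backend
    would invoke folding software and parse its output instead."""

    def fold(self, sequence: str) -> str:
        return "." * len(sequence)


def symbol_errors(folder: Folder, sequence: str, target: str) -> int:
    """Number of positions whose predicted dot-bracket symbol differs from the target."""
    predicted = folder.fold(sequence)
    return sum(1 for p, t in zip(predicted, target) if p != t)


print(symbol_errors(UnfoldedFolder(), "GGAACC", "((..))"))
```

Swapping in a different prediction method then only requires another class with a compatible `fold` method, while the search code stays unchanged.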

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

RL and JWJA developed the initial idea for the method, which was tested and implemented by RL, AB, TH, and ES. Benchmarks were designed and carried out by JWJA, RL, ES, AB, and TH. Manuscript was drafted by RL, JWJA, and ES. All authors read and approved the final manuscript.

Acknowledgements

This work was carried out as part of the Oxford Summer School in Computational Biology, 2011, in conjunction with the Department of Plant Sciences and the Department of Zoology. Funding was provided by the EU COGANGS Grant. We thank S. Kelly for providing computational resources. JWJA would like to thank the EPSRC for funding.