Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, UK

Bioinformatics Research Centre, Aarhus University, C.F. Møllers Allé 8, DK–8000 Aarhus C, Denmark

Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK

Abstract

Background

Stochastic Context–Free Grammars (SCFGs) were applied successfully to RNA secondary structure prediction in the early 90s, and used in combination with comparative methods in the late 90s. The set of SCFGs potentially useful for RNA secondary structure prediction is very large, but a few intuitively designed grammars have remained dominant. In this paper we investigate two automatic search techniques for effective grammars – exhaustive search for very compact grammars and an evolutionary algorithm to find larger grammars. We also examine whether grammar ambiguity is as problematic to structure prediction as has been previously suggested.

Results

These search techniques were applied to predict RNA secondary structure on a maximal data set and revealed new and interesting grammars, though none are dramatically better than classic grammars. In general, results showed that many grammars with quite different structure could have very similar predictive ability. Many ambiguous grammars were found which were at least as effective as the best current unambiguous grammars.

Conclusions

Overall, the method of evolving SCFGs for RNA secondary structure prediction proved effective in finding many grammars that had strong predictive accuracy, as good as or slightly better than those designed manually. Furthermore, several of the best grammars found were ambiguous, demonstrating that such grammars should not be disregarded.

Background

RNA secondary structure prediction is the process of predicting the position of hydrogen bonds in an RNA molecule based only on its nucleotide sequence. These predictions can be used to better understand the functioning of cells, characteristics of gene expression and the mechanisms involved in protein production.

Stochastic Context Free Grammars

A context–free grammar

One possible grammar, generating strings which may be interpreted as addition/multiplication expressions using only the number 1, may be represented thus:

Note that each instance of

A SCFG is a grammar with an associated probability distribution over the production rules which start from each

Application of SCFGs to RNA Secondary Structure Prediction

The use of SCFGs in RNA secondary structure prediction was based on the success of Hidden Markov Models (HMMs) in protein and gene modelling

The Pfold algorithm

While KH99 was effective, it seems to have been chosen relatively arbitrarily, in that there is little discussion about the motivation behind the choice of production rules. This problem was addressed by

Evolutionary approaches have already been implemented for HMMs. Indeed,

Methods

In this paper, only

Normal forms

To develop algorithms for analysing sequences under grammatical models, it is convenient to restrict the grammar to a normal form, with only a few possible types of productions. The normal form most commonly used is Chomsky Normal Form (CNF), as every context–free grammar is equivalent to one in CNF. However, a grammar in CNF cannot introduce the corresponding parentheses of paired nucleotides in a single production, and therefore does not capture structure in a straightforward manner. Thus it was necessary to create a new double emission normal form (so called because paired bases are emitted simultaneously) which was able to capture the fundamental features of RNA secondary structure: branching, unpaired bases, and paired bases. For any combination of non–terminals (

This normal form allows the development of the structural motifs commonly found in RNA. For example (where _{i} correspond to non–terminals) base–pair stacking can be generated by rules of the type _{1} → (_{1}), hairpins by _{1} → (_{2}), _{2} → _{2} _{3} | _{3} _{3} and _{3} → ., and bulges by _{1} → (_{2}), _{2} → _{3} _{1} and _{3} → . | _{3} _{3}.
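As an illustration of how such rules compose, the following minimal Python sketch encodes the stacking and hairpin rules above and checks whether a dot-bracket string can be derived from them. The non-terminal names `T1` to `T3`, the tuple encoding of rules, and the function name `can_derive` are assumptions made for this example only.

```python
from functools import lru_cache

# Hypothetical encoding of a grammar in double emission normal form:
#   ("pair", U)      stands for  T -> ( U )
#   ("branch", U, V) stands for  T -> U V
#   ("dot",)         stands for  T -> .
HAIRPIN = {
    "T1": [("pair", "T1"), ("pair", "T2")],            # stacking, then loop closure
    "T2": [("branch", "T2", "T3"), ("branch", "T3", "T3")],
    "T3": [("dot",)],
}


def can_derive(grammar, start, structure):
    """Return True if `structure` (dot-bracket) is derivable from `start`."""

    @lru_cache(maxsize=None)
    def ok(nt, i, j):
        # Does non-terminal `nt` derive structure[i:j]? Children always cover
        # strictly shorter spans, so the recursion terminates.
        for kind, *kids in grammar[nt]:
            if kind == "dot":
                if j - i == 1 and structure[i] == ".":
                    return True
            elif kind == "pair":
                if (j - i >= 3 and structure[i] == "(" and structure[j - 1] == ")"
                        and ok(kids[0], i + 1, j - 1)):
                    return True
            else:  # branch
                if any(ok(kids[0], i, k) and ok(kids[1], k, j)
                       for k in range(i + 1, j)):
                    return True
        return False

    return len(structure) > 0 and ok(start, 0, len(structure))


print(can_derive(HAIRPIN, "T1", "((...))"))  # stacked pair closing a loop
print(can_derive(HAIRPIN, "T1", "(.)"))      # loop of one base is not generated
```

In this toy grammar the loop non-terminal needs at least two unpaired bases, so the second call fails while the first succeeds.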

Furthermore, with the exception of the ability to generate empty strings, this normal form allows the expression of dependencies of any context–free grammar producing valid structures. It was also designed to avoid cyclical productions; that is, combinations of production rules which result in the same string that they started from. These are undesirable as they permit a countably infinite number of derivations for some strings. For this reason, rules of the form

As a result of eliminating these rules, many grammars already established in RNA secondary structure prediction cannot be exactly replicated, since they are not initially in the above normal form. For example, KH99 contains the rule

Secondary structure prediction

Secondary structure can be predicted by two methods, both of which are employed here. Firstly, one can find the maximum likelihood derivation of a sequence, during which a structure is generated. The Cocke–Younger–Kasami (CYK) algorithm

Secondly, one can employ a posterior decoding method using base–pairing probability matrices. The base–pair probability matrix for a SCFG is obtained via the inside and outside algorithms

Both methods were used in the search, as this additionally gave a chance to compare the two prediction methods.
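The maximum likelihood approach can be sketched as follows. This is a minimal CYK-style dynamic programme for a grammar in double emission normal form, not the implementation used in the search; the rule encoding, the toy grammar `TOY_RULES`, and all probabilities below are assumptions chosen purely for illustration.

```python
import math

def cyk_structure(seq, rules, start, unpaired, paired):
    """Most probable dot-bracket structure for `seq`, or None if underivable.

    rules: {nonterminal: [(prob, "dot"), (prob, "pair", U), (prob, "branch", U, V)]}
    unpaired: {base: prob}; paired: {(base5, base3): prob}
    """
    n = len(seq)
    best = {}  # (nonterminal, i, j) -> (log-probability, backpointer) for seq[i:j]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for nt, productions in rules.items():
                top = (-math.inf, None)
                for prob, kind, *kids in productions:
                    cands = []
                    if kind == "dot" and length == 1:
                        cands.append((math.log(unpaired[seq[i]]), ("dot",)))
                    elif kind == "pair" and length >= 3:
                        inner = best.get((kids[0], i + 1, j - 1))
                        pair_prob = paired.get((seq[i], seq[j - 1]), 0.0)
                        if inner and pair_prob > 0:
                            cands.append((math.log(pair_prob) + inner[0],
                                          ("pair", kids[0])))
                    elif kind == "branch":
                        for k in range(i + 1, j):
                            left = best.get((kids[0], i, k))
                            right = best.get((kids[1], k, j))
                            if left and right:
                                cands.append((left[0] + right[0],
                                              ("branch", kids[0], kids[1], k)))
                    for score, back in cands:
                        score += math.log(prob)
                        if score > top[0]:
                            top = (score, back)
                if top[1] is not None:
                    best[(nt, i, j)] = top

    def build(nt, i, j):
        # Follow backpointers to recover the structure of the best derivation.
        back = best[(nt, i, j)][1]
        if back[0] == "dot":
            return "."
        if back[0] == "pair":
            return "(" + build(back[1], i + 1, j - 1) + ")"
        _, left_nt, right_nt, k = back
        return build(left_nt, i, k) + build(right_nt, k, j)

    return build(start, 0, n) if (start, 0, n) in best else None


# Toy single-non-terminal grammar with made-up probabilities (illustrative only).
TOY_RULES = {"S": [(0.4, "pair", "S"), (0.3, "branch", "S", "S"), (0.3, "dot")]}
TOY_UNPAIRED = {b: 0.25 for b in "ACGU"}
TOY_PAIRED = {p: 1 / 6 for p in
              [("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")]}

print(cyk_structure("GAAAC", TOY_RULES, "S", TOY_UNPAIRED, TOY_PAIRED))  # (...)
```

The table is filled in order of increasing span length, which gives the cubic dependence on sequence length for a fixed grammar.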

Ambiguity and completeness

A grammar is said to be

Nine grammars are tested in

We define a grammar to be

Practically, it is very difficult to ensure both unambiguity and completeness. A complete, unambiguous grammar cannot be simply modified without compromising one of the properties. Adding any production rules (if they are ever used) will create ambiguity by providing additional derivations. Equally, removing production rules will create incompleteness (unless the rule is never used in a derivation), as the original grammar is assumed unambiguous. Because of this, an automated grammar design based on simple–step modification is practically impossible without creating ambiguous and incomplete grammars. Moreover, grammars that are unambiguous and complete are vastly outnumbered by grammars that are not. Therefore, grammars not possessing these desirable qualities must be considered and as a result our grammar search serves as a test of the capabilities of ambiguous or incomplete grammars.
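To make the notion of ambiguity concrete, one can count the derivations a grammar offers for a given structure. The sketch below does this for grammars in double emission normal form; the rule encoding and both toy grammars are assumptions for illustration, and the second grammar is only shown to give a single derivation for the examples tested, not proven unambiguous in general.

```python
from functools import lru_cache

def count_derivations(grammar, start, structure):
    """Number of distinct parse trees of `structure` (dot-bracket) from `start`.

    Rules: ("dot",) for T -> . ; ("pair", U) for T -> (U) ; ("branch", U, V) for T -> U V
    """

    @lru_cache(maxsize=None)
    def count(nt, i, j):
        total = 0
        for kind, *kids in grammar[nt]:
            if kind == "dot":
                if j - i == 1 and structure[i] == ".":
                    total += 1
            elif kind == "pair":
                if j - i >= 3 and structure[i] == "(" and structure[j - 1] == ")":
                    total += count(kids[0], i + 1, j - 1)
            else:  # branch: sum over every split point
                total += sum(count(kids[0], i, k) * count(kids[1], k, j)
                             for k in range(i + 1, j))
        return total

    return count(start, 0, len(structure)) if structure else 0


# Any binary bracketing of a run of unpaired bases is a separate derivation here.
AMBIGUOUS = {"S": [("branch", "S", "S"), ("pair", "S"), ("dot",)]}

# Forcing a single element on the left of each branch removes that freedom.
RIGHT_LINEAR = {
    "S": [("branch", "L", "S"), ("pair", "S"), ("dot",)],
    "L": [("pair", "S"), ("dot",)],
}

print(count_derivations(AMBIGUOUS, "S", "...."))     # 5
print(count_derivations(RIGHT_LINEAR, "S", "...."))  # 1
```

A structure with more than one derivation witnesses ambiguity, and a count of zero for a valid structure witnesses incompleteness.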

Parameter inference

Training data, consisting of strings of nucleotides and trusted secondary structures, is used to obtain the probabilities associated with each production rule, as well as paired and unpaired nucleotide probabilities. If derivations are known for the training sequences, then there are simple multinomial maximum likelihood estimators for the probabilities. Usually, though, the derivation is unknown. Again, one can estimate probabilities by finding derivations for the training set using CYK, or by the inside and outside algorithms.
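When derivations are known, the multinomial maximum likelihood estimate for a rule is its usage count divided by the total usage of all rules sharing its left-hand side. A minimal sketch, assuming rule usages have already been tallied into a dictionary keyed by hypothetical `(lhs, rhs)` pairs:

```python
from collections import defaultdict

def estimate_rule_probabilities(rule_counts):
    """Maximum likelihood rule probabilities from observed rule usage counts.

    rule_counts: {(lhs, rhs): count}; returns {(lhs, rhs): probability},
    normalised separately for each left-hand side non-terminal.
    """
    totals = defaultdict(int)
    for (lhs, _rhs), count in rule_counts.items():
        totals[lhs] += count
    return {
        rule: count / totals[rule[0]]
        for rule, count in rule_counts.items()
        if totals[rule[0]] > 0
    }


# Made-up counts for a single non-terminal S.
counts = {("S", "(S)"): 3, ("S", "SS"): 1, ("S", "."): 4}
print(estimate_rule_probabilities(counts))
# {('S', '(S)'): 0.375, ('S', 'SS'): 0.125, ('S', '.'): 0.5}
```

The same normalisation applies to paired and unpaired nucleotide emission counts.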

For the CYK algorithm, in the case of ambiguous grammars, one cannot know which derivation produced the known structure, so probabilities cannot be obtained. Consequently, we train these grammars using the same approach as

Again, both CYK and inside–outside were used for parameter inference in the search and evaluation.

Evolutionary algorithm

With the double emission normal form, for ^{3} production rules of type ^{2} of type

Initial population

When forming the initial population, the size of the space of grammars quickly becomes problematic. The space is clearly large, even for small m, so the population size cannot approach that usually afforded in evolutionary algorithms

where between zero and four of the

Mutation

Mutations constitute the majority of movement through the search space, so are particularly important. They give the grammar new characteristics, allow it more structural freedom, and add production rules which may be used immediately or may lie dormant. For non–terminals _{i} ∈

• The start variable (and corresponding production rules) change,

• A production rule is added or deleted,

• A new non–terminal variable ^{′} is added along with two new rules that ensure that ^{′} is reachable and that

• A non–terminal variable is created with identical rules to a pre–existing one,

• A production rule of the form _{i} → _{j} _{k} is changed to _{i} → _{j} _{l}, _{i} → _{l} _{k} or _{i} → _{l} _{p}, or production rule of the form _{i} → (_{j}) is changed to _{i} → (_{k}).
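The second mutation type in the list above (adding or deleting a production rule) can be sketched as below. The rule encoding, the function names, and the even split between adding and deleting are assumptions for illustration; the other mutation types are not covered.

```python
import random

def all_rules(nonterminals):
    """Every rule allowed by the double emission normal form.

    For m non-terminals this yields m**3 branching rules, m**2 pairing rules
    and m unpaired-emission rules.
    """
    rules = set()
    for t in nonterminals:
        rules.add((t, "dot"))                      # T -> .
        for u in nonterminals:
            rules.add((t, "pair", u))              # T -> (U)
            for v in nonterminals:
                rules.add((t, "branch", u, v))     # T -> U V
    return rules


def mutate_add_or_delete(rules, nonterminals, rng):
    """Return a copy of `rules` with one rule added or one rule deleted."""
    mutated = set(rules)
    missing = sorted(all_rules(nonterminals) - mutated)
    present = sorted(mutated)
    # Assumed 50/50 choice between adding and deleting when both are possible.
    if missing and (not present or rng.random() < 0.5):
        mutated.add(rng.choice(missing))
    elif present:
        mutated.remove(rng.choice(present))
    return mutated


nonterminals = ["S", "L", "F"]                     # hypothetical names
grammar = {("S", "branch", "L", "S"), ("S", "dot"), ("L", "pair", "F")}
child = mutate_add_or_delete(grammar, nonterminals, random.Random(0))
print(len(all_rules(nonterminals)), len(child ^ grammar))  # 39 1
```

Passing an explicit `random.Random` instance keeps runs reproducible, which is convenient when comparing search settings.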

This form of mutation is very basic, but allows many structural features to develop over generations. The rate of mutation determines movement speed through the search space and development of these structural features. Adding rules too slowly prevents grammars from developing structure, while adding too many results in a lot of ambiguity and thus creates ineffective grammars. Deleting rules almost always results in a worse grammar. To aid the grammar design, especially in consideration of facets of the normal form, the rule

More complex mutation is clearly possible. The derivation could be used to find the rules used more often and make mutations of those rules more or less likely. A model for simultaneous mutations could be developed, which might be able to make use of expert understanding of RNA structure, in combination with an evolutionary search. We have found the above model to give sufficient mobility in the search space, and therefore did not investigate other extensions.

Breeding

The breeding model forms a grammar which can produce all derivations of its parent grammars. The grammar _{1} and _{2} has start symbol _{1}, _{2}, …, _{n} and _{1}, _{2}, …, _{m},

•

• For _{i}: _{1} are replaced with

• For _{i}: _{2} are replaced with

This breeding model was chosen to keep the size of the grammar relatively small, whilst allowing expression of both bred grammars to be present in derivations.

Selection

We grow the population in each generation by introducing a number of newly mutated or bred grammars, then we pare it back to a fixed population size by stochastic elimination. We determine the probability of elimination of a grammar by the inverse of some fitness measure. Fitness functions we use include mountain metric distances

Brute force search

In addition to the evolutionary algorithm, we have run a brute force search to evaluate small grammars which might be effective. One of the main points of emphasis of

Data

We took data from RNASTRAND

The spectrum of sequence length is of particular significance in selecting data. The CYK and training algorithms are of cubic order in the length of the string, so we decided to use large training and test sets with small strings. Longer strings require longer derivations, thus they have a larger weight in the parameter training, which might lead to overtraining. Equally, if one omits longer strings, poorer predictions may result from overtraining on the shorter strings. We found the trained parameters highly sensitive to the choice of training data set, and struggled to balance this with computational efficiency.

We used a final data set from a variety of families, consisting of 369 sequences with corresponding structures. There was a total of 57,225 nucleotides with 12,126 base pairs. As with

As well as measuring performance on our own data, we have used results obtained with the
^{′} is a good representation of KH99.

Results and discussion

Figure

**Fitness evolution.** The change over generations in average fitness of population, and the fitness of the best SCFG. Here, a lower fitness is more desirable, the SCFG predicting better secondary structure. Many improvements to both the whole population and best SCFG are made in the first 100 or so generations. After this, the best SCFG does not become much better, but the average population fitness continues to fluctuate. Clearly the algorithm continues to explore alternative SCFGs and tries to escape the local optimum.

Across all our experiments, over 300,000 grammars were searched. A number of strong grammars were found using both CYK and IO training and testing, denoted GG1–GG6. KH99^{′} is KH99 in the double emission normal form. Results on the sensitivity, PPV, and F–score of each grammar can be found in Table

The sensitivities, PPVs, and F–scores of grammars GG1–GG6 and KH99^{′} on the evaluation set and on the Dowell and Eddy (DE) data.

| Grammar | KH99^{′} | GG1 | GG2 | GG3 | GG4 | GG5 | GG6 | KH99 | UNAfold | RNAfold | Pfold |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **Our data** | | | | | | | | | | | |
| Sensitivity | 0.496 | 0.505 | 0.408 | 0.413 | 0.474 | 0.469 | **0.526** | | | | |
| PPV | 0.479 | 0.481 | **0.551** | 0.550 | 0.454 | 0.467 | 0.479 | | | | |
| F–score | 0.478 | 0.441 | 0.473 | 0.470 | 0.461 | 0.339 | **0.488** | | | | |
| **DE data** | | | | | | | | | | | |
| Sensitivity | 0.465 | 0.466 | 0.372 | 0.379 | 0.408 | **0.487** | 0.465 | 0.47 | 0.558 | 0.558 | 0.39 |
| PPV | 0.406 | 0.405 | 0.643 | **0.646** | 0.344 | 0.432 | 0.376 | 0.45 | 0.501 | 0.495 | 0.69 |
| F–score | **0.480** | 0.468 | 0.466 | 0.472 | 0.430 | 0.479 | 0.451 | | | | |

The sensitivities, PPVs, and F–scores of grammars GG1–GG6 and KH99^{′} on the evaluation set, using different methods of training and testing. 'CYK' indicates that the CYK algorithm was used, and 'IO' that the inside and outside algorithms were used. The column 'Best' was calculated by selecting, for each structure, the prediction with the highest F–score, and then recording the sensitivity, PPV, and F–score for that prediction. It is perhaps not surprising that the 'best' predictions for CYK are better than the 'best' predictions for IO, as IO is in some sense averaging over all predictions. One might expect the predictions to be more similar than those from CYK, as seen by comparing IO values for GG6 and 'best', giving less increase when considering those with best F–score.

| Grammar | KH99^{′} | GG1 | GG2 | GG3 | GG4 | GG5 | GG6 | Best |
|---|---|---|---|---|---|---|---|---|
| Grammar found by | | Local | IO | IO | CYK | CYK | CYK | |
| **CYK** | | | | | | | | |
| Sensitivity | 0.496 | 0.505 | 0.330 | 0.374 | 0.474 | 0.469 | **0.526** | 0.675 |
| PPV | 0.479 | **0.481** | 0.258 | 0.322 | 0.454 | 0.467 | 0.479 | 0.585 |
| F–score | **0.478** | 0.441 | 0.426 | 0.435 | 0.461 | 0.339 | 0.461 | 0.622 |
| **IO** | | | | | | | | |
| Sensitivity | 0.387 | 0.392 | 0.408 | **0.413** | 0.373 | 0.404 | 0.410 | 0.450 |
| PPV | 0.552 | 0.517 | 0.551 | 0.550 | 0.566 | 0.556 | **0.583** | 0.584 |
| F–score | 0.461 | 0.443 | 0.473 | 0.470 | 0.449 | 0.471 | **0.488** | 0.493 |

This shows grammars with very different structures perform well on the same (full evaluation) data set. KH99^{′} is still a strong performer, but we have shown that there exist many others which perform similarly (these GG1–GG6 form just a subset of the good grammars found in the search).

GG1 is KH99^{′} with two rules added,

GG2 and GG3 were found using the posterior decoding version of the evolutionary algorithm. They have a high density of rules, that is many rules for each non–terminal variable. Particularly, GG2 has almost all of the rules it is possible for it to have, given

GG4 has only two variables (A and C) used almost exclusively in producing base pairs. It then uses various exit sequences to generate different sets of unpaired nucleotides and returns to producing base pairs. Finally, GG5 and GG6 are typical of larger grammars we have found with complex structure. It is not obvious to us how their structure relates to their success in secondary structure prediction. GG4, GG5, and GG6 were all found using the CYK version of the evolutionary algorithm, and perhaps their complex structure can be attributed to this. GG6 is a strong performer throughout, particularly when considering F–score.

Most grammars achieved lower predictive power on the Dowell and Eddy dataset. The difference in performance between KH99 and KH99^{′} is small and confirms that the representation of KH99 as KH99^{′} is a good one. Particularly noteworthy is the performance of GG3 and GG5. GG3 shows a considerable increase in PPV, likely due to the posterior decoding prediction method. Given that many of the structures in the Dowell and Eddy dataset contain pseudoknots, other grammars score poorly by predicting pairs where there are none, in contrast to GG3. By predicting fewer base pairs, GG3 gains higher PPV, as more of its predicted pairs are correct, at the cost of lower sensitivity. GG5 was unremarkable in its results on the original data set; however, it has outperformed the rest of the grammars on the benchmark set and is the only grammar with improved sensitivity when compared to the RNASTRAND dataset.

Figure
^{′} and GG1–GG6. This was produced using the posterior decoding method by varying the parameter ^{′} does not distinguish itself much from the other grammars, being in the middle in terms of area underneath the curve.

**Sensitivity/PPV curve.** A graph showing how sensitivity and PPV change for grammars when the posterior decoding parameter

Overall, the grammars found in the evolutionary search still perform well on the benchmark set, which suggests that they are not over-adapted to the original data. Determining which is best depends on the measure of strength of prediction, whether the size of the grammar is a concern, the ability to approximate structures with pseudoknots effectively, and so on. However, it is clear that a selection of effective grammars has been found. Results shown by UNAfold and RNAfold continue to be superior to those produced by SCFG methods.

We also checked that the grammars obtained from the evolutionary algorithm do not merely produce similar structures to KH99^{′} by using different derivations. To do this we define the relative sensitivity of method A with respect to method B as the sensitivity of method A as a predictor of the structures produced by method B. The relative PPV is defined in a similar manner. We then compared the predictions of the grammars by building a heat map of the relative sensitivities and PPVs (Figure
^{′} and GG1 predict almost identical structures, as they are highly similar. Similarly, it is perhaps not surprising that GG2 and GG3 have very similar predictions given they produce structure through posterior decoding. The rest of the methods have sensitivity and PPV relative to other prediction methods of approximately 0.4 – 0.6. As they are designed to predict RNA secondary structure using the same training set, one would expect some similarity in the predictions, although not as much as with KH99^{′} and GG1. This is confirmed by our results, suggesting that the new grammars produce different kinds of structures which are good representations of RNA secondary structure.
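The relative measures defined above reuse the ordinary accuracy computation with another method's prediction standing in for the known structure. A minimal sketch, assuming the standard base-pair definitions (sensitivity as the fraction of reference pairs that are predicted, PPV as the fraction of predicted pairs that are in the reference, F–score as their harmonic mean) and pseudoknot-free dot-bracket input; the function names are hypothetical.

```python
def base_pairs(structure):
    """Set of (i, j) index pairs encoded by a balanced dot-bracket string."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def accuracy(predicted, reference):
    """Return (sensitivity, PPV, F-score) of `predicted` against `reference`."""
    pred, ref = base_pairs(predicted), base_pairs(reference)
    true_positives = len(pred & ref)
    sensitivity = true_positives / len(ref) if ref else 0.0
    ppv = true_positives / len(pred) if pred else 0.0
    total = sensitivity + ppv
    f_score = 2 * sensitivity * ppv / total if total else 0.0
    return sensitivity, ppv, f_score


def relative_sensitivity(method_a, method_b):
    """Sensitivity of method A's prediction, treating method B's as the truth."""
    return accuracy(method_a, method_b)[0]


print(accuracy("(.....)", "((...))"))  # sensitivity 0.5, PPV 1.0
```

Note that the relative sensitivity of A with respect to B equals the relative PPV of B with respect to A, so a heat map of both quantities is the transpose of itself across the two measures.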

Relative sensitivities/PPVs

**Relative sensitivities/PPVs.** A heat map showing the relative sensitivities and PPVs of the different prediction methods, or between prediction method and known structure. KH99^{′} and GG1, produce very similar structures which is not surprising, given they were found by changing one and two rules of KH99^{′} respectively. Otherwise, the methods have relative sensitivities/PPVs of approximately 0.5-0.6, which is as expected, given they are all designed to predict RNA secondary structure. However, it is clear that they are markedly distinct from KH99^{′} in their structure predictions.

Further analysis of

To test the local features of the space, we evaluated variations of KH99^{′} against the full data set. Where a single rule was deleted, only one grammar had prediction accuracy of the order of KH99^{′}. This is the grammar without the rule ^{′}, with probability 0.014). However, it is clear that deleting rules has a strong negative effect on the predictive power of KH99^{′}, given that no others have sensitivity greater than 0.25. Of course, this might be expected given that this SCFG has been constructed manually, and it is therefore unlikely to have unnecessary production rules.

With addition of rules, the number of grammars to check quickly becomes large. With one production rule added, 32 grammars must be evaluated, with two added this increases to 496. A similar local search for larger grammars would be impractical, since there are many more grammars with one or two altered production rules (for GG6, there are 584 grammars with only one new production rule, and 170,236 with two). Ambiguity of tested grammars had little or no effect. Results of this local search can be seen in Figure
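The counts quoted above are consistent with choosing an unordered pair of rules from the set of single-rule additions. The short check below makes that arithmetic explicit; reading the two-rule counts as binomial coefficients is an inference from the numbers, and the function name is hypothetical.

```python
from math import comb

def two_rule_extensions(single_rule_extensions):
    """Number of grammars obtained by adding two distinct rules, given how
    many single rules could be added."""
    return comb(single_rule_extensions, 2)


print(two_rule_extensions(32))   # 496, the KH99' neighbourhood
print(two_rule_extensions(584))  # 170236, the GG6 neighbourhood
```

The quadratic growth in the number of candidate rules is what makes an exhaustive two-rule neighbourhood impractical for the larger grammars.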

**Local search results.** Summary of the effects of adding one (giving 32 grammars) or two (giving 496 grammars) production rules to KH99^{′}. The plot shows the cumulative proportion of grammars with given sensitivity. The grammars’ sensitivity is mostly still equal to the sensitivity of KH99^{′}, with only a few outliers. GG1 was the top outlier for two production rules added. In this sense the space is reasonably flat.

Brute force search

The brute force search illustrated how, with this normal form, larger grammars are needed to provide effective prediction. Most small grammars will only be able to produce one type of string. Also, it suggested that the existing grammars are close to locally optimal and that the space around them is quite flat, demonstrating the need for intelligent searching methods. Figure

**Brute force search.** The distribution of sensitivity and corresponding PPV of grammars with at most 2 nonterminal variables. Approximately one quarter of grammars have sensitivity 0, as many cannot produce long strings. It is only the larger grammars that start to predict long strings which might correspond to structure. However, the prediction quality is still poor by both measures.

Ambiguity and Completeness

One of the results of the search which we find most interesting is the ambiguity and completeness of GG1–GG6, shown in Table
^{′}, being a slight modification of it. Particularly, it is clear that GG2 and GG3 have many different derivations for each structure, and their strong performance relies on this ambiguity, as they perform poorly when tested with CYK. GG5 demonstrates further that ambiguous grammars can even be effective at approximating structures with pseudoknots. The effectiveness of some ambiguous grammars is likely due to the prediction algorithm picking structures that, whilst perhaps suboptimal, are close to what the best prediction would be. Clearly there is room for further investigation into why some grammars cope better with ambiguity than others.

Ambiguity and completeness of KH99^{′} and GG1–GG6 grammars. All grammars found in the search were ambiguous. Some of the grammars found (GG4 and GG5) are incomplete, but heuristically it seems that the structures that cannot be generated have little biological relevance.

| Grammar | Ambiguity | Completeness |
|---|---|---|
| KH99^{′} | No | Yes |
| GG1 | Yes | Yes |
| GG2 | Yes | Yes |
| GG3 | Yes | Yes |
| GG4 | Yes | No |
| GG5 | Yes | No |
| GG6 | Yes | Yes |

Similarly, it might be surprising that some of the grammars found (GG4 and GG5) are incomplete. However, heuristically it seems that the structures that cannot be generated have little biological relevance (e.g. GG4 cannot generate “(…)(…)(…)(…)”). In some sense, therefore, the incompleteness is permissible, as the grammar is still able to generate any relevant structure.

Conclusions

Our brute force search and search around KH99 demonstrate that intelligent searching methods are necessary, and overall, the method of evolving SCFGs for RNA secondary structure prediction proved effective. We found many grammars with strong predictive accuracy, as good as or better than those designed manually. Furthermore, several of the best grammars found were both ambiguous and incomplete, demonstrating that in grammar design such grammars should not be disregarded. One of the strengths of the method is the ease of application and effectiveness for RNA structure problems. In particular, grammatical models are used in phylogenetic models of RNA evolution

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JWJA conceived the idea in discussion with RL and JH. JWJA then developed the methodology with PT and JS, with help from RL. PT and JS then designed and wrote the code, and results were analysed and written up by JWJA, with help from PT and JS. All authors were involved in critical redrafting of the manuscript. All authors read and approved the final manuscript.

Acknowledgements

JWJA would like to thank the EPSRC for funding. JH would like to acknowledge the Miller Institute for funding and hospitality. All authors would like to thank the Department of Plant Science, University of Oxford for their support and use of facilities. We should like to acknowledge the EU grant, COGANGS, for support.