Department of Computer Science, University of Kaiserslautern, P.O. Box 3049, D-67653 Kaiserslautern, Germany

Abstract

Background

Over the past years, statistical and Bayesian approaches have become increasingly appreciated for addressing the long-standing problem of computational RNA structure prediction. Recently, a novel probabilistic method for predicting RNA secondary structures from a single sequence has been studied that is based on generating statistically representative and reproducible samples of the entire ensemble of feasible structures for a particular input sequence. This method samples the possible foldings from a distribution implied by a sophisticated (traditional or length-dependent) stochastic context-free grammar (SCFG) that mirrors the standard thermodynamic model applied in modern physics-based prediction algorithms. Specifically, that grammar represents an exact probabilistic counterpart to the energy model underlying the Sfold software, which employs a sampling extension of the partition function (PF) approach to produce statistically representative subsets of the Boltzmann-weighted ensemble. Although both sampling approaches have the same worst-case time and space complexities, it has been indicated that they differ in performance (with respect to both prediction accuracy and quality of generated samples), where neither of the two competing approaches generally outperforms the other.

Results

In this work, we consider the SCFG based approach in order to analyze how the quality of generated sample sets and the corresponding prediction accuracy change when different degrees of disturbance are incorporated into the needed sampling probabilities. This is motivated by the following fact: if the results prove resistant to large errors in the distinct sampling probabilities (compared to the exact ones), this indicates that these probabilities need not be computed exactly; it may instead be sufficient and more efficient to approximate them. Thus, it might then be possible to decrease the worst-case time requirements of such an SCFG based sampling method without significant accuracy losses. If, on the other hand, the quality of sampled structures is observed to react strongly to slight disturbances, there is little hope of improving the complexity by heuristic procedures. We hence provide a reliable test of the hypothesis that a heuristic method could be implemented to improve the worst-case time scaling of RNA secondary structure prediction without sacrificing much of the accuracy of the results.

Conclusions

Our experiments indicate that absolute errors generally lead to the generation of useless sample sets, whereas relative errors seem to have only a small negative impact on both the prediction accuracy and the overall quality of the resulting structure samples. Based on these observations, we present some useful ideas for developing a time-reduced sampling method that still guarantees an acceptable prediction accuracy. We also discuss some inherent drawbacks that arise in the context of approximation. The key results of this paper are crucial for the design of an efficient and competitive heuristic prediction method based on the increasingly accepted and attractive statistical sampling approach. This has indeed been indicated by the construction of prototype algorithms.

Background

In computational structural biology, a well-established probabilistic methodology towards single sequence RNA secondary structure prediction is based on modeling secondary structures by

Traditionally, SCFG based prediction approaches are realized by dynamic programming algorithms (DPAs) that require

However, for a very long time, the minimum free energy (MFE) paradigm has been the most common technique for predicting the secondary structure of a given RNA sequence. The respective methods are traditionally realized by DPAs that employ a particular thermodynamic model for the derivation of the corresponding recursions. They basically require

In the traceback steps of the corresponding DPAs, base pairs are successively generated according to the energy minimization principle, such that the predicted set of suboptimal foldings often contains many structures that are not significantly different (i.e., that have the same or very similar shapes and contain mostly the same base pairings). To overcome these problems, several statistical sampling methods and clustering techniques have been invented over the last years that are based on the partition function (PF) approach for computing base pair probabilities as introduced in

In fact, over the past years, statistical approaches to RNA secondary structure prediction have become an attractive alternative to the standard energy-based approach (which basically relies on several thousand experimentally determined energy parameters). In principle, many of these approaches – in contrast to Sfold – rely on (thermodynamic) parameters estimated from growing databases of structural RNAs. For instance, the CONTRAfold tool

Notably, following CONTRAfold, several other statistical methods have been subsequently developed, such as for instance

In any case, statistical methods for RNA folding have previously been chosen to be either purely physics-based (e.g., Sfold) or discriminative while implementing a thermodynamic model (e.g., CONTRAfold), but not generative. This might be due to the misconception that SCFGs cannot easily be constructed to mirror energy-based models (as mentioned e.g. in

However, a generative statistical method for predicting RNA secondary structure has recently been proposed

Lately, in an attempt to improve the quality of generated sample sets, this probabilistic sampling approach has been extended to being capable of additionally incorporating

It remains to mention that although all three sampling approaches (PF, SCFG and LSCFG based variants) need

For these reasons, the main objective of this paper is given as follows: We will consider the (L)SCFG based statistical sampling approach from

The prime motivation for such a disturbance analysis lies in the following facts: Suppose both the samples and the predictive results remain rather resistant even to large errors in the distinct sampling probabilities (compared to the exact values). Then it seems adequate to believe that the sampling procedure does not have to calculate these probabilities exactly; it may suffice if they are only (adequately) approximated. Thus, in this case it might be possible to employ an approximation algorithm (or at least a heuristic method) for the sampling probability calculations in order to decrease the worst-case time (and maybe also space) requirements for statistical sampling and hence for structure prediction. Furthermore, to ensure that the quality of the generated sample sets and the prediction accuracy remain sufficiently high, analysis results on the effects of different disturbance levels and types should be taken into account for the development of an appropriate approximation scheme (or heuristic). From the other perspective, suppose the quality of sampled structures already reacts strongly to rather slight disturbances. In that case, there is obviously little hope that the worst-case complexities of the sampling method can be improved by finding a suitable heuristic procedure for the computation of the needed sampling probabilities.

The aim of our study might hence be declared as to prove or disprove the hypothesis that a heuristic method could be implemented to improve the worst-case complexity of single sequence RNA structure prediction, and to discuss some potential ideas and inherent drawbacks that seem relevant in connection with still guaranteeing highly accurate results. Although existing algorithms are in practice quite fast on any sequence for which reasonable structure prediction accuracy is expected (e.g., it takes less than an hour to predict the thermodynamic PF for a 23S rRNA of 2500 nucleotides), sacrificing a little accuracy might still be assumed worthwhile, given the practical speedup of efficient heuristic methods compared to the corresponding exact (non-heuristic) algorithms (e.g., the conference paper

Note that since for any input sequence, the time (and space) complexities are dominated by those of the inside-outside computations (realized by a corresponding DPA which inherently scales

As we will see subsequently, the (L)SCFG based statistical sampling algorithm reacts strongly even to rather small absolute errors, whereas its reaction to even rather large relative disturbances is in most cases mild enough to still obtain samples of acceptable quality and correspondingly meaningful structure predictions. Hence, it seems possible that a reduction of the worst-case time requirements of the evaluated probabilistic sampling approach might be achieved – without sacrificing too much predictive accuracy – by approximating the needed sampling probabilities in an appropriate way. Throughout this article, we will present some useful considerations on how a corresponding approximation scheme (or heuristic procedure) should be constructed in order to ensure that the sampling quality remains sufficiently high.

The rest of this paper is organized as follows: Section Methods introduces the formal framework, including the (L)SCFG model, definitions of various types and levels of disturbances and a corresponding recursive sampling strategy that will be considered within this article. A comprehensive disturbance analysis based on exemplary RNA data and the corresponding results will follow in Section Results and Discussion, where both the quality of generated sample sets and their applicability to the problem of RNA structure prediction are investigated. Notably, we not only compare different ways for extracting predictions from generated samples in order to assess the predictive accuracy, but also present results on the abstraction level of shapes that is of great interest and relevance for biologists. Section Results and Discussion also includes considerations on how to develop a corresponding time-reduced sampling strategy without significant losses in sampling quality. Notably, some of the key results are discussed in Section Errors Only on Particular Values. Finally, Section Conclusions concludes the paper.

Methods

In this section, we provide all needed information and introduce the formal framework that will be used subsequently. We start by a recap of the relevant details of the probabilistic sampling method from

Note that we assume the reader to be familiar with the notions and basic concepts regarding SCFGs. A fundamental introduction on stochastic context-free languages can be found in

Sampling based on (L)SCFG model

In general, probabilistic sampling based on a suitable (L)SCFG has two basic steps: The first step (preprocessing) computes the inside and outside probabilities for all substrings of a given input sequence based on the considered (L)SCFG model. The second step (structure generation) takes the form of a recursive sampling algorithm to randomly draw a complete secondary structure by consecutively sampling substructures (defined by base pairs and unpaired bases) according to conditional sampling probabilities for particular sequence fragments that strongly depend on the inside and outside values derived in step one.
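As a minimal sketch of this two-step pipeline (the helper names `grammar.inside_outside` and `grammar.sample_one` are hypothetical stand-ins for the paper's preprocessing DPA and recursive sampling strategy):

```python
def sample_structures(sequence, grammar, num_samples):
    """Two-step (L)SCFG sampling sketch: precompute inside/outside values
    once per sequence, then draw structures by repeated recursive sampling.

    `grammar.inside_outside` and `grammar.sample_one` are assumed helpers,
    not part of any published implementation."""
    # Step 1 (preprocessing): inside/outside DPA over all substrings.
    inside, outside = grammar.inside_outside(sequence)
    # Step 2 (structure generation): each draw only reuses the tables.
    return [grammar.sample_one(sequence, inside, outside)
            for _ in range(num_samples)]
```

Note that the expensive preprocessing is performed only once; every additional sampled structure merely reuses the precomputed tables.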

Step One – Preprocessing

According to the traditional DPA approach for predicting RNA structure via (L)SCFGs, a particular underlying grammar, say

**Definition 2.1** (m_{h} := min_{HL} ≥ 1 and m_{s} := min_{hel} ≥ 1).

Note that these two parameters impose, first, a minimum number of min_{HL} unpaired bases for hairpin loops and second, a minimum number of min_{hel} consecutive base pairs for helices, where common choices are min_{HL} ∈ {1, 3} and min_{hel} ∈ {1, 2}. However, within this work we will only consider min_{HL} = min_{hel} = 1, which corresponds to the least restrictive (yet also most unrealistic) choice and usually yields the worst sampling results (see

Moreover, the needed grammar parameters (trained on a suitable RNA structure database) are split into a set of transition probabilities p_{tr}(·) for the distinct production rules and a set of emission probabilities p_{em}(x) for the distinct terminal symbols x,

However, according to the considered grammar model, all inside probabilities and all outside probabilities for a sequence can be computed by the corresponding inside–outside DPA in the preprocessing step.

Step Two – Random structure generation

Once the preprocessing is finished, different strategies may be employed for realizing the recursive sampling step. In general, for any sampling decision (for example choice of a new base pair), a particular strategy relies on the respective set of all possible choices that might actually be formed on the currently considered fragment of the input sequence. Any of these sets contains exactly the mutually exclusive and exhaustive cases as defined by the alternative productions (of a particular intermediate symbol) of the underlying grammar. The corresponding random choice is then drawn according to the resulting conditional sampling distribution (for the considered sequence fragment). This means the respective sampling distributions are defined by the inside and outside values derived in step one (providing information on the distribution of all possible choices according to the actual input sequence) and the grammar parameters (transition probabilities).
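A single sampling decision of the kind described above can be sketched as follows; the weights are assumed to be already derived from the grammar parameters and the precomputed inside values, and the helper name `sample_choice` is hypothetical:

```python
import random

def sample_choice(weighted_cases, rng=random):
    """Draw one of the mutually exclusive and exhaustive cases for the
    currently considered fragment.

    `weighted_cases` maps each possible choice (e.g. an alternative
    production of the current intermediate symbol) to its unnormalized
    weight, assumed here to be the transition probability times the
    inside value(s) of the resulting subfragment(s)."""
    total = sum(weighted_cases.values())
    if total == 0.0:
        return None  # no valid choice exists on this fragment
    u = rng.random() * total  # uniform draw against the cumulative weights
    acc = 0.0
    for choice, w in weighted_cases.items():
        acc += w
        if u < acc:
            return choice
    return choice  # guard against floating-point round-off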

In this work, we will only consider the well-established strategy from

For example, suppose fragment r_{i,j} := r_{i} … r_{j} of input sequence r must correspond to a (valid) derivation of a particular intermediate symbol X. Then the strategy has to consider all alternative productions for X that are applicable to r_{i,j}, which actually correspond to all possible substructures on r_{i,j} (the mutually exclusive and exhaustive cases for r_{i,j}). Under the assumption that the alternatives for intermediate symbol X are given by the rules X → μ_{1} ∣ … ∣ μ_{k}, each alternative μ_{l} receives as weight the corresponding transition probability p_{tr}(X → μ_{l}) multiplied by the inside value(s) of the symbols on its right-hand side for the respective subfragments of r_{i,j}. Consequently, we have to sample from the conditional probability distribution induced by these weights, where obviously the resulting probabilities must sum to one. This can in general easily be guaranteed by using the inside probability α_{X}(i, j) of X for fragment r_{i,j} as normalizing constant (since α_{X}(i, j) equals the sum of the weights of all alternatives), which then ensures that the corresponding sampling probabilities still sum up to unity, such that they indeed define a conditional probability distribution.

Note that the sampling strategy effectively conforms to the SCFG model: it actually samples one of the possible parse trees of the given input sequence by randomly drawing one of the respective mutually exclusive and exhaustive cases (corresponding to the distinct grammar rules with the same premise) at any point in the already partially constructed parse tree, in order to generate one of the possible subtrees for the given input sequence (corresponding to one of the possible substructures on the considered sequence fragment, which is currently being folded recursively).

Hence, according to the sampling process, we could never have gotten to a point where we have to consider all mutually exclusive and exhaustive cases for a particular premise X on a fragment r_{i,j}, 1 ≤ i ≤ j ≤ n, unless there is a derivation of r_{1} … r_{i-1} X r_{j+1} … r_{n} from the start symbol (axiom) S. Notably, the conditional distribution from which the strategy randomly samples one of the possible substructures (one valid subtree of the already partially constructed parse tree) is not influenced by the corresponding outside probability, due to the fact that the outside value of X on r_{i,j} is a common factor of all alternatives and hence cancels out. In fact, the sampling strategy randomly draws one of the elements from the respective set of mutually exclusive and exhaustive cases, since only the inside probabilities (such as α_{X}(i, j) or α_{Y}(i, j) for single intermediate symbols, or α_{VW}(i, j) for pairs of symbols on a right-hand side) enter the needed sampling probabilities.

Formal definitions of all corresponding sets ^{a} (of Additional file


Considered disturbance types and levels

Obviously, under the assumption of a particular (L)SCFG model (trained beforehand on arbitrary RNA data), the most straightforward way to improve the performance of the corresponding overall sampling algorithm is to reduce the worst-case complexity of the inside calculations. Therefore, we decided to quantify to what extent the algorithm reacts to different types and degrees of disturbances incorporated into the considered inside probabilities, in order to get evidence as to whether it could actually be possible to find a corresponding approximation algorithm (or at least an appropriate heuristic method) that requires less time but causes only acceptable losses in accuracy. In fact, with respect to developing a suitable heuristic method for practical use, it is necessary to know the effects of different disturbance levels and types, both to get an idea of how precisely the respective values need to be approximated in order to guarantee sufficiently good results and to find out which types of errors pose fundamental problems and which ones are negligible.

For these reasons, given an arbitrary input sequence^{b}, we decided to incorporate random errors into the precomputed inside probabilities α_{X}(i, j) (for the intermediate symbols X and fragments r_{i,j} of the given sequence).

However, in order to reach our previously declared goal, for any fixed value^{c}, we decided to draw the errors according to one of the following variants: disturbances only inside or outside a fix-sized window of subword lengths, but for all intermediate symbols; errors for all subword lengths, but only for particularly chosen intermediate symbols; or simply disturbances on all considered inside values.

Moreover, two different random schemes might be employed for drawing a relative error (the variants mep and fep), and for similar reasons, two analogous schemes might be used for randomly choosing an absolute error (the variants mev and fev).

Note that random errors on all outside probabilities β_{X}(i, j) would have no effect on the particular sampling strategy considered here, since its sampling probabilities depend only on the corresponding inside values.

Finally, it should be clear that for the inside probabilities α_{X}(i, j)^{d}, it seems practically impossible to tell how to find an appropriate fixed error value for creating absolute disturbances.
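To illustrate the distinction between relative and absolute disturbances, the following sketch injects random errors into a table of precomputed inside values; the uniform error ranges are an assumption for illustration only and do not reproduce the paper's exact mep/fep/mev/fev definitions:

```python
import random

def disturb_inside(values, mode="relative", level=0.5, rng=random):
    """Return a disturbed copy of precomputed inside probabilities.

    `values` maps keys like (symbol, i, j) to exact inside probabilities.
    Relative mode scales each value by a random factor in
    [1 - level, 1 + level]; absolute mode adds a random offset in
    [-level, level], clipped at zero.  Note that relative errors keep
    zero values at zero, whereas absolute errors may turn zeros into
    positive values (and vice versa)."""
    disturbed = {}
    for key, p in values.items():
        if mode == "relative":
            disturbed[key] = p * (1.0 + rng.uniform(-level, level))
        else:  # absolute
            disturbed[key] = max(0.0, p + rng.uniform(-level, level))
    return disturbed
```

The zero-preservation property of the relative variant is exactly the behaviour that turns out to matter for sampling quality later in the paper.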

Resulting modified sampling strategy

It should be clear that after the desired errors (according to any of the previously specified variants of either mep, fep, mev or fev) have been incorporated into the precomputed exact inside (and outside) values for a given sequence, the needed conditional sampling distributions (as considered by a particular strategy) are induced by the exact grammar parameters and the disturbed inside (and outside) probabilities for that sequence. This, however, might create the need to (slightly) modify the employed sampling strategy such that it becomes capable of dealing with these skewed distributions.

As for this work, consider the previously sketched recursive sampling strategy: if a sampled base pair r_{i}.r_{j} should close a multiloop, then the sequence fragment r_{i+1,j-1} := r_{i+1} … r_{j-1} is guaranteed to be folded into an admissible multiloop that by definition contains at least two helical regions radiating out from this loop. However, by using disturbed sampling probabilities (given by the exact parameters of the underlying (L)SCFG model and disturbed inside values for the input sequence), the strategy may decide to form a multiloop on r_{i+1,j-1}, although this would actually not be possible.

Therefore, we had to slightly modify the sampling procedure such that in any case where the chosen substructure type cannot be successfully generated, it settles for the partially formed substructure. That is, it either leaves the complete fragment unpaired (if the desired base pairs could not be sampled at all), or it for example only creates a bulge/interior loop although a multiloop should have been constructed (but only one helix has been successfully sampled). The resulting modified versions of the distinct sampling steps (in pseudocode) are given in Section Sm-I (of Additional file

**Flowchart for recursive sampling of an RNA secondary structure** S_{1,n} **for a given input sequence.**

Note that alternatively, the algorithm could have been modified to revise any decisions that lead to incompletely generated substructures, resulting in some sort of backtracking procedures that obviously would have to be applied in order to sample more realistic overall structures for a given RNA sequence. However, as this effectively results in much more complex modifications and eventually yields significant losses in performance, we opted for the simpler and more straightforward first variant to get rid of the described problem.
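The fallback behaviour of the modified procedure can be illustrated by the following sketch for the multiloop case; `try_sample_helix` is a hypothetical callback that yields only the successfully sampled helices:

```python
def sample_multiloop(fragment, try_sample_helix):
    """Illustrative sketch of the non-backtracking fallback: a multiloop
    needs at least two helices, but with disturbed probabilities the
    helices may fail to materialize, in which case the procedure settles
    for the partially formed substructure instead of revising decisions."""
    helices = list(try_sample_helix(fragment))
    if len(helices) >= 2:
        return ("multiloop", helices)       # admissible multiloop formed
    if len(helices) == 1:
        return ("interior_loop", helices)   # only one helix: bulge/interior loop
    return ("unpaired", [])                 # nothing sampled: leave fragment unpaired
```

This mirrors the design choice described above: accepting a degraded substructure is far cheaper than backtracking over earlier sampling decisions.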

Results and discussion

The aim of this section is to perform a comprehensive experimental analysis on the influence of disturbances (in the ensemble distribution for a given input sequence) on the quality of sample sets generated by the (L)SCFG based statistical sampling approach from

RNA structure data

For our examinations, we decided to consider different sets of trusted RNA secondary structure data for which the (L)SCFG based sampling approach yields good quality results when no disturbances are included in the respective sampling distributions for a given sequence. Therefore, we took the same tRNA database (of 2163 distinct tRNA structures with lengths in [64, 93] and about 76 on average, derived from

Probability profiling for specific loop types

A statistical sample of all possible secondary structures for a given RNA sequence can be used to obtain sampling estimates of the probabilities of arbitrary structural motifs. Actually,

Since this application is rather intuitive, we decided to use it as a starting point for our disturbance analysis. Particularly, we derived a number of statistical samples for the well-known tRNA^{Ala} sequence by applying the sampling strategy from Section Resulting Modified Sampling Strategy on the basis of diverse sets of probabilistic parameters (inside probabilities disturbed according to several particular variants as defined in Section Considered Disturbance Types and Levels) for that sequence and calculated corresponding probability profiles. All relevant results are displayed in Additional file
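As an illustration of how such loop profiles can be estimated, the following sketch computes, for each sequence position, the observed frequency of lying in a hairpin loop over a set of sampled dot-bracket structures (the representation and helper name are illustrative assumptions):

```python
def hairpin_profile(samples):
    """Estimate per-position hairpin-loop probabilities from a set of
    sampled dot-bracket strings of equal length.  A hairpin loop is taken
    to be a maximal run of unpaired positions enclosed directly by a
    base pair, as in "(...)"."""
    n = len(samples[0])
    counts = [0] * n
    for s in samples:
        stack, pair = [], {}
        for i, c in enumerate(s):      # match brackets to recover base pairs
            if c == '(':
                stack.append(i)
            elif c == ')':
                pair[stack.pop()] = i
        for i, j in pair.items():
            inner = s[i + 1:j]
            if inner and set(inner) == {'.'}:  # pair (i, j) closes a hairpin loop
                for k in range(i + 1, j):
                    counts[k] += 1
    return [c / len(samples) for c in counts]
```

The relative frequencies converge to the loop probabilities under the (possibly disturbed) ensemble distribution as the sample size grows.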

**Hairpin loop profiles for tRNA**^{Ala}**, calculated from a random sample of 1000 structures generated with the SCFG (figures on the left) and LSCFG (figures on the right) approach, respectively (under the assumption of the less restrictive grammar parameters min**_{hel}** = 1 and min**_{HL}** = 1).** The exact (undisturbed) results are displayed by the thin black lines, and the correct hairpin loops in tRNA^{Ala} are illustrated by the black points.

**Hairpin loop profiles corresponding to those presented in Figure **b**, where absolute errors were derived according to mev**^{win,+}** (prob)** (thick gray lines).

**Hairpin loop profiles corresponding to those presented in Figure **b**, where absolute errors were derived according to mev**^{win,-}** (prob)** (thick gray lines).

Errors on all values

Let us first consider the profiles displayed in Figure

Note that for any given input sequence, it usually seems much more important for the employed sampling strategy to be able to identify which of the (combinatorially) possible substructures can actually be (validly) formed on the considered sequence fragment than to know their exact probabilities (according to the conditional distribution for the respective fragment), for two complementary reasons: first, to avoid drawing practically impossible choices, which later force it to leave the considered sequence fragment (at least partially) unpaired^{e}; second, to ensure that none of the actually valid choices is prohibited during the folding process, which would inevitably make the sampling procedure prefer other (potentially even impossible) substructures.

Consequently, in order to prevent a decline in the accuracy of generated structures and a reduction of the overall sampling quality, it seems to be of great importance that the sampling strategy is capable of distinguishing between inside values – and especially sampling probabilities – that are equal and unequal to zero according to the exact (undisturbed) ensemble distribution for the given input sequence. By adding absolute errors, inside or sampling probabilities that are equal (unequal) to zero in the exact case might often become unequal (equal) to zero according to the resulting skewed (disturbed) distributions, whereas by incorporating relative errors, all considered inside and sampling probabilities obviously stay equal or unequal to zero (as in the exact case). This intuitively explains the basic observations made from Figure

Relevant sampling probabilities

Nevertheless, in order to draw more detailed conclusions, we counted and compared the relevant (i.e., greater than zero) inside and sampling probabilities that were considered for obtaining the profiles presented in Figure

First, it seems obvious that due to the more explicit length-dependent version of the considered grammar parameters (length-dependently trained transition and emission probabilities), a much smaller number of relevant inside values and sampling probabilities should generally result when applying the LSCFG model rather than the conventional one. Tables S1 and S2 exemplarily confirm this intuitive assumption. Note that this effect might indeed be responsible for the observation that the LSCFG based sampling approach reacts considerably less to large relative errors than the conventional length-independent variant, as indicated by Figure

Moreover, when considering the traditional SCFG model, there are many more relevant exact inside and sampling probabilities than corresponding relevant disturbed values for basically any (intermediate) symbol, whereas for the LSCFG variant the contrary holds; that is, generally far more inside and sampling probabilities are relevant in the disturbed cases than in the exact case. Actually, in both cases (length-dependent and not), the numbers of relevant disturbed inside values

Finally, it remains to mention that under the assumption of the conventional SCFG model, it happens that for any

Errors only on particular values

Now, in an attempt to find out in which cases particular absolute errors have a very significant (negative) impact on the resulting sampling quality and to identify potentially existing situations where they barely influence the output of the applied statistical sampling algorithm, we want to consider some of the more specialized variants for generating absolute disturbances (as defined in Section Considered Disturbance Types and Levels). The corresponding profiles are basically shown in Figures

Notably, even if absolute disturbances may only occur for inside values _{
X
}(_{
X
}(_{
X
}(

Nevertheless, as we can see from Figure _{
X
}(_{
X
}(

Finally, for the sake of completeness, it should be noted that by incorporating absolute errors (for all subword lengths) only for any of the distinct intermediate symbols _{
X
}(

Prediction accuracy – Sensitivity and PPV

In connection with sampling approaches, there exist diverse (more or less) efficient well-defined principles for extracting a particular structure prediction from a generated set of candidate structures for a given input sequence. In fact, under the condition that a corresponding folding can be calculated in ^{f}. Briefly, these two common measures are widely used in order to quantify the accuracy of RNA secondary structure prediction methods and are usually defined as follows (see e.g.

• Sens. is the relative frequency of correctly predicted pairs among all position pairs that are actually paired in a stem of native foldings, whereas

• PPV is defined as the relative frequency of correctly predicted pairs among all position pairs that were predicted to be paired with each other.

Formally, they are given by Sens. = TP · (TP + FN)^{-1} and PPV = TP · (TP + FP)^{-1}, where TP denotes the number of correctly predicted base pairs, FN the number of native base pairs missing from the prediction, and FP the number of predicted base pairs not present in the native structure.
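Both measures can be computed directly from the sets of predicted and native base pairs, as in the following sketch:

```python
def accuracy(predicted_pairs, native_pairs):
    """Sensitivity and PPV of a predicted set of base pairs, each pair
    given as a tuple (i, j) with i < j, against the native structure."""
    predicted, native = set(predicted_pairs), set(native_pairs)
    tp = len(predicted & native)                     # correctly predicted pairs
    sens = tp / len(native) if native else 0.0       # TP / (TP + FN)
    ppv = tp / len(predicted) if predicted else 0.0  # TP / (TP + FP)
    return sens, ppv
```

For example, a prediction sharing two of three pairs with a three-pair native structure scores Sens. = PPV = 2/3.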

In order to investigate to what extent the accuracy of predicted foldings changes when different dimensions of relative disturbances are incorporated into the needed sampling probabilities, we decided to perform a series of cross-validation experiments based on the same partitions of the tRNA and 5S rRNA databases into 10 approximately equal-sized folds, respectively, as considered in

Briefly, we employed two different well-defined selection procedures in order to identify one particular structure from the produced sample as prediction: First, we picked the most likely secondary structure (i.e., the one with the highest probability among all feasible structures for the input sequence according to the induced (L)SCFG model), in strong analogy to traditional SCFG based probabilistic structure prediction methods. This choice will be denoted by

Note that if the samples are indeed representative with respect to the underlying ensemble distribution (i.e., if a sufficiently large number of candidate foldings is randomly generated on the basis of the corresponding conditional probability distributions considered by the employed strategy), then these two predictions should be rather identical in most cases, at least if no disturbances are considered (i.e., under the condition that the exact inside probabilities are used for deriving the respective conditional sampling distributions). In fact, any representative set of candidate structures for a given input sequence obtained by (L)SCFG based statistical sampling obviously reflects the probability distribution on all feasible foldings of that sequence which strongly depends on the corresponding inside probabilities. Thus, if the preprocessed inside values contain any errors, then the MF structure of a particular statistically representative sample set corresponds to the most likely folding of the given sequence with respect to the skewed ensemble distribution induced by the disturbed inside values, whereas the MP structure of that sample is indeed equal to the most likely folding (among all generated candidate structures) with respect to the exact ensemble distribution^{g}. Hence, the results for MP and MF structure predictions might differ in the disturbed cases, especially as the gravity of generated disturbances grows.

However, we decided to additionally apply two different commonly used construction schemes for computing a new structure as predicted folding, where the predicted structure itself need not necessarily be contained in the given sample. Particularly, we first determined a

Last but not least, we derived two different sets of so-called γ_{t-o}-MEA and γ_{t-o}-centroid structures for the produced samples, respectively, where the trade-off parameter satisfies γ_{t-o} ∈ [0, ∞). Here, γ_{t-o} = 1 serves as the neutral element with respect to the prediction, meaning the prediction is neither biased towards a better sensitivity nor towards a better PPV and corresponds to the above described well-known MEA or unique centroid structure, respectively. Notably, by measuring the performance at several different settings of γ_{t-o} (i.e., by determining the (adjusted) sensitivity and PPV for various values of γ_{t-o}), it becomes possible to derive a corresponding trade-off curve^{h} and to calculate the estimated accuracy at the neutral setting γ_{t-o} = 1.

However, the (unadjusted) sensitivity and PPV measures obtained by considering the four different (unparameterized) prediction principles sketched above are listed in Additional file^{i}, where a few selected ones are presented in the tables below. The corresponding results for different settings of γ_{t-o} are all collected in Additional file, where we considered values of the form γ_{t-o} = 1.25^{k} for a range of integer exponents k around 0 (starting at k = −12).

**(a) For our tRNA database**

All values have been computed by 10-fold cross-validation procedures, using sample size 1000 and **min**_{hel}** = min**_{HL}** = 1**.

| Approach | Errors | MP struct. Sens. | PPV | MF struct. Sens. | PPV | MEA struct. Sens. | PPV | Centroid Sens. | PPV |
|---|---|---|---|---|---|---|---|---|---|
| SCFG | — | 0.7818 | 0.8437 | 0.7792 | 0.8445 | 0.7324 | 0.8939 | 0.6754 | 0.9158 |
| | mep (0.5) | 0.7822 | 0.8447 | 0.7599 | 0.8370 | 0.7169 | 0.8927 | 0.6607 | 0.9140 |
| | mep (0.99) | 0.7590 | 0.8388 | 0.6768 | 0.8004 | 0.6414 | 0.8877 | 0.5817 | 0.9127 |
| | fep (0.5) | 0.7798 | 0.8440 | 0.7234 | 0.8184 | 0.6864 | 0.8896 | 0.6292 | 0.9134 |
| | fep (0.99) | 0.4101 | 0.7295 | 0.2864 | 0.5590 | 0.2532 | 0.7776 | 0.2157 | 0.8291 |
| LSCFG | — | 0.8545 | 0.9534 | 0.8542 | 0.9535 | 0.8335 | 0.9736 | 0.8250 | 0.9783 |
| | mep (0.5) | 0.8545 | 0.9534 | 0.8429 | 0.9524 | 0.8236 | 0.9731 | 0.8150 | 0.9773 |
| | mep (0.99) | 0.8519 | 0.9533 | 0.7988 | 0.9413 | 0.7833 | 0.9676 | 0.7735 | 0.9726 |
| | fep (0.5) | 0.8548 | 0.9536 | 0.8224 | 0.9486 | 0.8029 | 0.9707 | 0.7940 | 0.9758 |
| | fep (0.99) | 0.7530 | 0.9325 | 0.5769 | 0.8623 | 0.5668 | 0.9075 | 0.5567 | 0.9195 |

**(b) For our 5S rRNA database**

| Approach | Errors | MP struct. Sens. | PPV | MF struct. Sens. | PPV | MEA struct. Sens. | PPV | Centroid Sens. | PPV |
|---|---|---|---|---|---|---|---|---|---|
| SCFG | — | 0.4251 | 0.5372 | 0.4251 | 0.5363 | 0.3403 | 0.6967 | 0.2689 | 0.8044 |
| | mep (0.5) | 0.4143 | 0.5280 | 0.4160 | 0.5290 | 0.3334 | 0.6987 | 0.2643 | 0.8051 |
| | mep (0.99) | 0.3897 | 0.5227 | 0.3894 | 0.5216 | 0.2957 | 0.7069 | 0.2362 | 0.8072 |
| | fep (0.5) | 0.4055 | 0.5203 | 0.4049 | 0.5198 | 0.3209 | 0.7068 | 0.2532 | 0.8087 |
| | fep (0.99) | 0.2043 | 0.4410 | 0.1756 | 0.3788 | 0.1066 | 0.6867 | 0.0814 | 0.7666 |
| LSCFG | — | 0.8993 | 0.9412 | 0.8997 | 0.9409 | 0.8959 | 0.9513 | 0.8873 | 0.9574 |
| | mep (0.5) | 0.8993 | 0.9412 | 0.8909 | 0.9380 | 0.8903 | 0.9478 | 0.8819 | 0.9541 |
| | mep (0.99) | 0.8989 | 0.9414 | 0.8639 | 0.9269 | 0.8659 | 0.9408 | 0.8574 | 0.9482 |
| | fep (0.5) | 0.8993 | 0.9412 | 0.8796 | 0.9328 | 0.8798 | 0.9445 | 0.8716 | 0.9515 |
| | fep (0.99) | 0.8251 | 0.9052 | 0.7162 | 0.8375 | 0.7148 | 0.8661 | 0.6986 | 0.8879 |

**(a) For our tRNA database**

All values have been computed by 10-fold cross-validation procedures, using sample size 1000 and **min**_{hel}** = min**_{HL}** = 1**.

| Approach | Errors | MEA struct. | Centroid |
|---|---|---|---|
| SCFG | — | 0.828522 | 0.833894 |
| | mep (0.5) | 0.819658 | 0.823811 |
| | mep (0.99) | 0.786645 | 0.788478 |
| | fep (0.5) | 0.805999 | 0.807240 |
| | fep (0.99) | 0.440021 | 0.422778 |
| LSCFG | — | 0.936285 | 0.919736 |
| | mep (0.5) | 0.932121 | 0.916321 |
| | mep (0.99) | 0.916540 | 0.896024 |
| | fep (0.5) | 0.924191 | 0.908943 |
| | fep (0.99) | 0.752030 | 0.722737 |

**(b) For our 5S rRNA database**

| Approach | Errors | MEA struct. | Centroid |
|---|---|---|---|
| SCFG | — | 0.409278 | 0.408549 |
| | mep (0.5) | 0.401914 | 0.400515 |
| | mep (0.99) | 0.376683 | 0.375488 |
| | fep (0.5) | 0.400827 | 0.397566 |
| | fep (0.99) | 0.189628 | 0.182902 |
| LSCFG | — | 0.914801 | 0.918933 |
| | mep (0.5) | 0.911963 | 0.915503 |
| | mep (0.99) | 0.902330 | 0.905126 |
| | fep (0.5) | 0.906507 | 0.911063 |
| | fep (0.99) | 0.776239 | 0.777355 |

Let us first consider the results reported in Table

Moreover, Table

Finally, it should be mentioned that all these observations and conclusions are confirmed by comparing the more reliable AUC results given in Table

Sampling quality – Specific values related to shapes

Obviously, the sensitivity and PPV measures used in the last section for assessing the accuracy of predicted foldings depend only on the numbers of correctly and incorrectly predicted base pairs (compared to the trusted database structure). For biologists, however, it is usually much more important to get the correct shape of a molecule's folding than to predict every individual base pair correctly.

For these reasons, we decided to complete our analysis of the influence of disturbances on the quality of probabilistic statistical sampling by considering the following meaningful specific values related to the shapes of predictions and sampled structures, as defined in

• Frequency of prediction of correct structure (CSP_{freq}): In how many cases is the predicted secondary structure (or its shape) equal to the correct structure (or the correct shape)?

• Frequency of correct shape occurring in a sample (CSO_{freq}): In how many cases can the correct shape (on different levels) be found in the generated sample set?

• Number of occurrences of correct shape in a sample (CS_{num}): How many times can the correct shape be found in the generated sample set?

• Number of different shapes in a sample (DS_{num}): How many different secondary structures (or shapes) can be found in the generated sample set?
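
The four measures above can be computed directly from a sample once each structure has been mapped to its abstract shape (e.g., with an external tool such as RNAshapes; the mapping is assumed here and shapes are treated as opaque strings). The per-database frequencies CSP_{freq} and CSO_{freq} are then simply averages of the per-sequence CSP and CSO indicators. A minimal sketch for one sequence:

```python
from collections import Counter

def sample_shape_stats(predicted_shape, correct_shape, sampled_shapes):
    """Shape-related quality measures for one sequence's sample set."""
    counts = Counter(sampled_shapes)
    return {
        "CSP": predicted_shape == correct_shape,  # prediction hits correct shape?
        "CSO": correct_shape in counts,           # correct shape occurs in sample?
        "CS_num": counts[correct_shape],          # occurrences of correct shape
        "DS_num": len(counts),                    # number of distinct shapes
    }

stats = sample_shape_stats(
    predicted_shape="[[][]]",
    correct_shape="[[][]]",
    sampled_shapes=["[[][]]"] * 700 + ["[[]]"] * 250 + ["[]"] * 50,
)
print(stats)  # {'CSP': True, 'CSO': True, 'CS_num': 700, 'DS_num': 3}
```

Note that CSP and CSO are per-sequence indicators, while CS_num and DS_num directly characterize the sample itself; only the latter two are independent of which selection principle (MP, MF, MEA, centroid) is applied.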

We can easily compute the respective values from the predicted structures and the corresponding sample sets that were derived for the calculation of the sensitivity and PPV measures in the last section. The obtained results are collected in Additional file

**(a) CSP**_{freq}** values (for selection principle MP struct.)**

Tables record specific values related to shapes of predictions and sampled structures, obtained from our tRNA database. All results were computed by 10-fold cross-validation procedures, using sample size 1000 and **min**_{hel}** = min**_{HL}** = 1**.

| Approach | Errors | Level 0 | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
|---|---|---|---|---|---|---|---|
| SCFG | — | 0.2413 | 0.4082 | 0.5548 | 0.5548 | 0.5552 | 0.6278 |
| | mep (0.5) | 0.2409 | 0.4068 | 0.5548 | 0.5548 | 0.5552 | 0.6265 |
| | mep (0.99) | 0.1877 | 0.3551 | 0.5382 | 0.5382 | 0.5386 | 0.6075 |
| | fep (0.5) | 0.2339 | 0.4017 | 0.5511 | 0.5511 | 0.5516 | 0.6269 |
| | fep (0.99) | 0.0014 | 0.0384 | 0.1979 | 0.1979 | 0.1984 | 0.2326 |
| LSCFG | — | 0.3324 | 0.4956 | 0.6574 | 0.6574 | 0.6579 | 0.7351 |
| | mep (0.5) | 0.3324 | 0.4956 | 0.6574 | 0.6574 | 0.6579 | 0.7351 |
| | mep (0.99) | 0.3236 | 0.4892 | 0.6560 | 0.6560 | 0.6565 | 0.7332 |
| | fep (0.5) | 0.3324 | 0.4966 | 0.6588 | 0.6588 | 0.6593 | 0.7369 |
| | fep (0.99) | 0.0624 | 0.2626 | 0.6246 | 0.6250 | 0.6250 | 0.6967 |

**(b) CSP**_{freq}** values (for selection principle MF struct.)**

| Approach | Errors | Level 0 | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
|---|---|---|---|---|---|---|---|
| SCFG | — | 0.2099 | 0.3699 | 0.5594 | 0.5594 | 0.5599 | 0.6302 |
| | mep (0.5) | 0.1683 | 0.3301 | 0.5372 | 0.5372 | 0.5377 | 0.6047 |
| | mep (0.99) | 0.0522 | 0.1822 | 0.4517 | 0.4517 | 0.4517 | 0.5215 |
| | fep (0.5) | 0.1049 | 0.2547 | 0.5155 | 0.5155 | 0.5160 | 0.5793 |
| | fep (0.99) | 0.0000 | 0.0125 | 0.1110 | 0.1110 | 0.1119 | 0.2062 |
| LSCFG | — | 0.3269 | 0.4892 | 0.6560 | 0.6565 | 0.6565 | 0.7337 |
| | mep (0.5) | 0.2534 | 0.4235 | 0.6708 | 0.6708 | 0.6713 | 0.7485 |
| | mep (0.99) | 0.1137 | 0.2954 | 0.6801 | 0.6801 | 0.6801 | 0.7568 |
| | fep (0.5) | 0.1794 | 0.3653 | 0.6704 | 0.6704 | 0.6709 | 0.7531 |
| | fep (0.99) | 0.0023 | 0.1262 | 0.6334 | 0.6334 | 0.6357 | 0.7240 |

**(c) CSP**_{freq}** values (for selection principle MEA struct.)**

| Approach | Errors | Level 0 | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
|---|---|---|---|---|---|---|---|
| SCFG | — | 0.0555 | 0.2094 | 0.4193 | 0.4193 | 0.4207 | 0.4679 |
| | mep (0.5) | 0.0416 | 0.1817 | 0.4045 | 0.4045 | 0.4055 | 0.4489 |
| | mep (0.99) | 0.0125 | 0.0989 | 0.3112 | 0.3112 | 0.3126 | 0.3570 |
| | fep (0.5) | 0.0245 | 0.1364 | 0.3662 | 0.3662 | 0.3666 | 0.4059 |
| | fep (0.99) | 0.0000 | 0.0014 | 0.0245 | 0.0245 | 0.0250 | 0.0546 |
| LSCFG | — | 0.1854 | 0.3574 | 0.4919 | 0.4919 | 0.4919 | 0.5465 |
| | mep (0.5) | 0.1405 | 0.3056 | 0.4998 | 0.4998 | 0.4998 | 0.5567 |
| | mep (0.99) | 0.0730 | 0.2191 | 0.4753 | 0.4753 | 0.4753 | 0.5284 |
| | fep (0.5) | 0.1003 | 0.2556 | 0.4836 | 0.4836 | 0.4836 | 0.5409 |
| | fep (0.99) | 0.0009 | 0.0781 | 0.3902 | 0.3902 | 0.3921 | 0.4508 |

**(d) CSP**_{freq}** values (for selection principle Centroid)**

| Approach | Errors | Level 0 | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
|---|---|---|---|---|---|---|---|
| SCFG | — | 0.0374 | 0.1276 | 0.2973 | 0.2973 | 0.2977 | 0.3130 |
| | mep (0.5) | 0.0273 | 0.1045 | 0.2779 | 0.2779 | 0.2783 | 0.2908 |
| | mep (0.99) | 0.0083 | 0.0541 | 0.2007 | 0.2007 | 0.2007 | 0.2173 |
| | fep (0.5) | 0.0134 | 0.0795 | 0.2473 | 0.2473 | 0.2473 | 0.2603 |
| | fep (0.99) | 0.0000 | 0.0009 | 0.0120 | 0.0120 | 0.0120 | 0.0227 |
| LSCFG | — | 0.1729 | 0.3158 | 0.4300 | 0.4300 | 0.4300 | 0.4762 |
| | mep (0.5) | 0.1322 | 0.2728 | 0.4374 | 0.4374 | 0.4374 | 0.4859 |
| | mep (0.99) | 0.0693 | 0.1914 | 0.4101 | 0.4101 | 0.4101 | 0.4558 |
| | fep (0.5) | 0.0957 | 0.2261 | 0.4207 | 0.4207 | 0.4207 | 0.4642 |
| | fep (0.99) | 0.0009 | 0.0633 | 0.3264 | 0.3264 | 0.3269 | 0.3648 |

**(a) CSP**_{freq}** values (for selection principle MP struct.)**

Tables record specific values related to shapes of predictions and sampled structures, obtained from our 5S rRNA database. All results were computed by 10-fold cross-validation procedures, using sample size 1000 and **min**_{hel}** = min**_{HL}** = 1**.

| Approach | Errors | Level 0 | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
|---|---|---|---|---|---|---|---|
| SCFG | — | 0.0000 | 0.0026 | 0.0052 | 0.0131 | 0.0366 | 0.7110 |
| | mep (0.5) | 0.0000 | 0.0009 | 0.0026 | 0.0113 | 0.0287 | 0.7128 |
| | mep (0.99) | 0.0000 | 0.0026 | 0.0044 | 0.0095 | 0.0227 | 0.6919 |
| | fep (0.5) | 0.0000 | 0.0017 | 0.0043 | 0.0113 | 0.0374 | 0.6954 |
| | fep (0.99) | 0.0000 | 0.0000 | 0.0000 | 0.0017 | 0.0096 | 0.5474 |
| LSCFG | — | 0.2141 | 0.4256 | 0.4744 | 0.4900 | 0.9408 | 0.9843 |
| | mep (0.5) | 0.2141 | 0.4256 | 0.4744 | 0.4900 | 0.9408 | 0.9843 |
| | mep (0.99) | 0.1941 | 0.4221 | 0.4761 | 0.4892 | 0.9452 | 0.9852 |
| | fep (0.5) | 0.2124 | 0.4248 | 0.4726 | 0.4883 | 0.9417 | 0.9852 |
| | fep (0.99) | 0.0209 | 0.3029 | 0.3725 | 0.4186 | 0.8529 | 0.9809 |

**(b) CSP**_{freq}** values (for selection principle MF struct.)**

| Approach | Errors | Level 0 | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
|---|---|---|---|---|---|---|---|
| SCFG | — | 0.0000 | 0.0026 | 0.0052 | 0.0131 | 0.0357 | 0.7128 |
| | mep (0.5) | 0.0000 | 0.0009 | 0.0026 | 0.0122 | 0.0305 | 0.7180 |
| | mep (0.99) | 0.0000 | 0.0026 | 0.0044 | 0.0105 | 0.0235 | 0.6902 |
| | fep (0.5) | 0.0000 | 0.0017 | 0.0043 | 0.0113 | 0.0383 | 0.6971 |
| | fep (0.99) | 0.0000 | 0.0000 | 0.0000 | 0.0035 | 0.0200 | 0.5439 |
| LSCFG | — | 0.2002 | 0.4256 | 0.4700 | 0.4866 | 0.9417 | 0.9861 |
| | mep (0.5) | 0.1332 | 0.3960 | 0.4439 | 0.4587 | 0.9434 | 0.9869 |
| | mep (0.99) | 0.0365 | 0.3630 | 0.4308 | 0.4491 | 0.9304 | 0.9861 |
| | fep (0.5) | 0.0801 | 0.3847 | 0.4404 | 0.4561 | 0.9400 | 0.9861 |
| | fep (0.99) | 0.0035 | 0.1497 | 0.2106 | 0.3325 | 0.5440 | 0.9730 |

**(c) CSP**_{freq}** values (for selection principle MEA struct.)**

| Approach | Errors | Level 0 | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
|---|---|---|---|---|---|---|---|
| SCFG | — | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0261 | 0.3821 |
| | mep (0.5) | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0209 | 0.3698 |
| | mep (0.99) | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0122 | 0.3003 |
| | fep (0.5) | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0252 | 0.3438 |
| | fep (0.99) | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0026 | 0.0444 |
| LSCFG | — | 0.1062 | 0.3891 | 0.4291 | 0.4378 | 0.9051 | 0.9835 |
| | mep (0.5) | 0.1010 | 0.3751 | 0.4134 | 0.4239 | 0.8921 | 0.9782 |
| | mep (0.99) | 0.0392 | 0.3429 | 0.3986 | 0.4213 | 0.8712 | 0.9791 |
| | fep (0.5) | 0.0740 | 0.3839 | 0.4239 | 0.4387 | 0.8877 | 0.9791 |
| | fep (0.99) | 0.0017 | 0.1358 | 0.1863 | 0.2942 | 0.4970 | 0.9634 |

**(d) CSP**_{freq}** values (for selection principle Centroid)**

| Approach | Errors | Level 0 | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
|---|---|---|---|---|---|---|---|
| SCFG | — | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0104 | 0.1097 |
| | mep (0.5) | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0104 | 0.1062 |
| | mep (0.99) | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0078 | 0.0827 |
| | fep (0.5) | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0061 | 0.0932 |
| | fep (0.99) | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0009 | 0.0078 |
| LSCFG | — | 0.0966 | 0.2916 | 0.3238 | 0.3316 | 0.8703 | 0.9686 |
| | mep (0.5) | 0.0879 | 0.3142 | 0.3516 | 0.3621 | 0.8625 | 0.9686 |
| | mep (0.99) | 0.0322 | 0.2924 | 0.3377 | 0.3595 | 0.8294 | 0.9651 |
| | fep (0.5) | 0.0662 | 0.3194 | 0.3551 | 0.3638 | 0.8512 | 0.9695 |
| | fep (0.99) | 0.0017 | 0.1053 | 0.1471 | 0.2219 | 0.4831 | 0.9339 |

First, as regards tRNAs, we observe that for MP predictions, disturbances caused by mep(_{freq} value for shape levels 2 to 5 and under the assumption of the LSCFG approach, where for MF structures, there indeed results a slightly higher CSP_{freq} value with increasing relative error percentage _{freq} on abstraction levels 2 to 5, where for MP and MF structure predictions it obviously behaves quite resistant to the imposed disturbances even for large values of

Similar results are observed for 5S rRNAs (see Additional file _{freq} values (for all shape levels in case of MP predictions and at least for shape levels 1 to 5 for all other prediction types) generally do not get significantly worse when applying the LSCFG sampling approach with inside values disturbed according to mep(

Moreover, comparing the discussed CSP_{freq} results for the LSCFG variant to the corresponding ones for the conventional SCFG approach, we get additional evidence that the length-independent sampling method reacts more strongly to relative disturbances in the underlying ensemble distribution for a given sequence than its length-dependent counterpart. As already mentioned, this is due to the fact that the ensemble distribution considered in the length-dependent case is much more centered, owing to the more explicit (length-dependently trained) grammar parameters, such that randomly generated errors on particular probabilities carry less weight.

Now, let us consider the three remaining specific values CSO_{freq}, CS_{num} and DS_{num}, which can be used to assess the overall quality of generated sample sets rather than the accuracy of corresponding selected predictions. Basically, the obtained CSO_{freq} and CS_{num} results for tRNAs and 5S rRNAs (as reported in Tables S7e to S7f and Tables S8e to S8f, respectively) show a similar picture and thus yield similar conclusions as the corresponding CSP_{freq} values discussed above. Since for larger relative error percentages the CSO_{freq} and CS_{num} values usually get smaller, the corresponding DS_{num} values inevitably increase with growing disturbance influences imposed by mep(

Conclusions

In this article, we performed a comprehensive experimental analysis of the effect of disturbances in the ensemble distribution for a given sequence on the quality of corresponding sets of candidate structures generated with the (L)SCFG based statistical sampling method studied in

During our analysis (on the basis of trusted sets of tRNA and 5S rRNA data), we immediately observed that incorporating even rather small absolute errors into (all or particular instances of) the inside values causes problematic disturbances of the resulting sampling probabilities that generally lead to the generation of useless sample sets. This is presumably because the introduction of absolute errors usually makes it impossible for the employed sampling strategy to identify which of the considered inside probabilities for a given input sequence must originally (i.e., in the exact case) have been equal or unequal to zero. This inevitably results in misguided behavior of the strategy, as it is no longer ensured that only reasonable substructures are created for a considered sequence fragment.
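
The contrast between the two error models can be sketched as follows; the function names and the uniform error scheme are illustrative assumptions, not the exact disturbance procedure used in our experiments. The key point is that relative errors leave exact zeros untouched, whereas absolute errors destroy the zero/nonzero pattern of the inside values:

```python
import random

def disturb_relative(inside, eps, rng):
    """Multiply each inside value by a random factor in [1 - eps, 1 + eps].
    Zero entries (infeasible substructures) remain exactly zero."""
    return [v * (1.0 + rng.uniform(-eps, eps)) for v in inside]

def disturb_absolute(inside, delta, rng):
    """Add a random offset in [0, delta] to each inside value. Entries that
    were exactly zero may become positive, so the sampler can no longer
    tell which substructures were infeasible."""
    return [v + rng.uniform(0.0, delta) for v in inside]

rng = random.Random(0)
inside = [0.4, 0.0, 0.1, 0.0]      # toy inside values; zeros mark infeasible cases
rel = disturb_relative(inside, eps=0.99, rng=rng)
abs_ = disturb_absolute(inside, delta=1e-6, rng=rng)
print([v == 0.0 for v in rel])     # [False, True, False, True]: zeros preserved
print([v == 0.0 for v in abs_])    # zeros (almost surely) lost
```

This is why even a tiny absolute offset can wreck the sampler, while large relative errors merely reweight the feasible foldings among themselves.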

However, both SCFG approaches (the length-dependent and the traditional one) are rather resistant to disturbances of the needed conditional sampling probabilities that are caused by generating (moderate) relative errors on all (or only on particular) inside values for a given input sequence. In general, even large relative errors seem to have no severe negative impact on either the predictive accuracy or the overall quality of generated sample sets. That is, the reaction of the (L)SCFG based statistical sampling algorithm to relative disturbances is mild enough to still obtain meaningful structure predictions (especially if the most likely structure of the sample is selected as the predicted folding, in strong analogy to conventional SCFG based DPAs), and the overall quality of the resulting sample sets remains acceptable, such that they might often also be used for further applications (e.g., probability profiling for specific loop types).

Consequently, it seems reasonable to believe that the needed sampling probabilities do not necessarily have to be computed exactly; it may suffice to approximate them adequately. In fact, the worst-case time complexity of any particular (L)SCFG based sampling method could potentially be reduced by developing a suitable approximation procedure (or at least an adequate heuristic) for computing the needed sampling probabilities, where an appropriate approximation ratio (or at least an acceptable ratio of correctly and incorrectly computed zero values) should ensure that the sampling quality remains sufficiently high, as indicated by the experimental disturbance analysis results discussed in this article.

Endnotes

^{a} All references starting with

^{b} Note that the function max(min(

^{c} Note that

^{d} In general, longer words tend to be generated with smaller probability, since more grammar rules have to be applied, each contributing a factor (typically) less than 1 to the probability.

^{e} If those decisions are not revised by employing backtracking procedures; see the description of the modifications incorporated into the sampling algorithm in order to deal with such situations, as given in Section Resulting Modified Sampling Strategy.

^{f} Note that the positive predictive value is often called

^{g} This is due to the fact that the probability of a particular folding of a given RNA sequence (i.e., the probability of the corresponding derivation tree) depends only on the considered set of grammar parameters (transition and emission probabilities).

^{h} Note that we here treat sensitivity as a function of PPV as an ROC curve, although strictly speaking an ROC curve plots sensitivity as a function of specificity.

^{i} Note that the corresponding standard deviations on sensitivity values and PPV are recorded in Additional file

Competing interests

Both authors declare that they have no competing interests.

Authors’ contributions

AS developed and implemented the algorithms for generating statistical samples based on disturbed ensemble distributions. AS performed all experiments and evaluated the decline of sampling quality implied by considering the diverse kinds of disturbances. MEN supervised the work and the development of ideas. AS drafted the manuscript and also prepared its revised and final versions. Both authors have read and approved the final manuscript.

Acknowledgements

AS thanks the Carl Zeiss Foundation for supporting her research. Both authors wish to thank an anonymous reviewer for careful reading and for helpful remarks and suggestions made on a previous version of this article.