Computer Science Department, University of British Columbia, Vancouver, BC, Canada
Abstract
Background
RNA molecules play critical roles in the cells of organisms, including roles in gene regulation, catalysis, and synthesis of proteins. Since RNA function depends in large part on its folded structures, much effort has been invested in developing accurate methods for prediction of RNA secondary structure from the base sequence. Minimum free energy (MFE) predictions are widely used, based on nearest neighbor thermodynamic parameters of Mathews, Turner et al. or those of Andronescu et al. Some recently proposed alternatives that leverage partition function calculations find the structure with maximum expected accuracy (MEA) or pseudoexpected accuracy (pseudoMEA) methods. Advances in prediction methods are typically benchmarked using sensitivity, positive predictive value and their harmonic mean, namely Fmeasure, on datasets of known reference structures. Since such benchmarks document progress in improving accuracy of computational prediction methods, it is important to understand how measures of accuracy vary as a function of the reference datasets and whether advances in algorithms or thermodynamic parameters yield statistically significant improvements. Our work advances such understanding for the MFE and (pseudo)MEAbased methods, with respect to the latest datasets and energy parameters.
Results
We present three main findings. First, using the bootstrap percentile method, we show that the average Fmeasure accuracy of the MFE and (pseudo)MEAbased algorithms, as measured on our largest datasets with over 2000 RNAs from diverse families, is a reliable estimate (within a 2% range with high confidence) of the accuracy of a population of RNA molecules represented by this set. However, average accuracy on smaller classes of RNAs such as a class of 89 Group I introns used previously in benchmarking algorithm accuracy is not reliable enough to draw meaningful conclusions about the relative merits of the MFE and MEAbased algorithms. Second, on our large datasets, the algorithm with best overall accuracy is a pseudo MEAbased algorithm of Hamada et al. that uses a generalized centroid estimator of base pairs. However, between MFE and other MEAbased methods, there is no clear winner in the sense that the relative accuracy of the MFE versus MEAbased algorithms changes depending on the underlying energy parameters. Third, of the four parameter sets we considered, the best accuracy for the MFE, MEAbased, and pseudoMEAbased methods is 0.686, 0.680, and 0.711, respectively (on a scale from 0 to 1 with 1 meaning perfect structure predictions) and is obtained with a thermodynamic parameter set obtained by Andronescu et al. called BL* (named after the Boltzmann likelihood method by which the parameters were derived).
Conclusions
Large datasets should be used to obtain reliable measures of the accuracy of RNA structure prediction algorithms, and average accuracies on specific classes (such as Group I introns and Transfer RNAs) should be interpreted with caution, considering the relatively small size of currently available datasets for such classes. The accuracy of the MEAbased methods is significantly higher when using the BL* parameter set of Andronescu et al. than when using the parameters of Mathews and Turner, and there is no significant difference between the accuracy of MEAbased methods and MFE when using the BL* parameters. The pseudoMEAbased method of Hamada et al. with the BL* parameter set significantly outperforms all other MFE and MEAbased algorithms on our large data sets.
Background
RNA molecules are essential to many functions in the cells of all organisms. For example, these molecules are involved in gene translation and also act as catalysts and as regulators of gene expression
Our interest is in assessing the merits of some recent advances in secondary structure prediction in statistically robust ways. We focus on thermodynamically informed approaches for predicting pseudoknot free secondary structures from the base sequence. A widely used method finds the minimum free energy (MFE) structure with respect to the nearest neighbour thermodynamic model of Mathews, Turner and colleagues
Knudsen et al.
This recent work motivates the following questions. Are comparisons of the (pseudo)MEAbased and MFE approaches on specific RNA classes reliable when the size of available datasets is small? Do the MEA or pseudoMEAbased approaches produce significantly more accurate predictions than MFE on the latest energy parameter sets? What is the best combination of algorithm and thermodynamic model? To answer these questions, we report on the accuracy of both (pseudo)MEAbased and MFE methods with respect to two versions of the Turner parameters as well as the recent BL* and CG* parameters of Andronescu et al., on datasets for specific RNA classes as well as large datasets that combine multiple RNA classes.
We present three main findings. First, we show that Fmeasure accuracies on our large datasets are likely to be reliable estimates of accuracy of a population represented by such sets, in the sense that highconfidence interval widths for Fmeasure obtained using the bootstrap percentile method are within a small, 2% range. Average accuracy on smaller classes is less reliable. For example, confidence intervals for both MEA and MFE have an 8% range on a class of 89 Group I introns that has been used previously in benchmarking algorithms. Second, there is a clear "winner" in terms of overall prediction accuracy, namely the pseudoMEAbased method of Hamada et al.
Methods
In this section we first describe the datasets, thermodynamic models, and algorithms considered in this paper. We then describe the accuracy measures and statistical methods used in our analyses.
Datasets
We use three datasets, as follows
• SFull is a comprehensive set of 3,245 RNA sequences and their secondary structures that has been assembled from numerous reliable databases
• MT was used by Lu et al.
• MA is a subset of the SFull set, containing exactly those sequence of SFull that are in the RNA classes included in MT. We formed the MA dataset in order to compare our algorithms on the same classes as did Lu et al., while using sets of RNAs from these classes with as large a size as possible.
An overview of different RNA classes in the MT and MA data sets is shown in Table
Overview of the different RNA classes in MT and MA data sets
RNA class
No. in MT
mean ± std of length
Avg. similarity
No. in MA
mean ± std of length
Avg. similarity
16S Ribosomal RNA
89
377.88 ± 167.18
0.60
675
485.66 ± 113.02
0.62
23S Ribosomal RNA
27
460.6 ± 151.3
0.53
159
453.44 ± 117.85
0.57
5S Ribosomal RNA
309
119.5 ± 2.69
0.88
128
120.98 ±3.21
0.88
Group I intron
16
344.88 ± 66.42
0.63
89
368.49 ±103.58
0.63
Group II intron
3
668.7 ± 70.92
0.70
2
578 ± 47
0.72
Ribonuclease P RNA
6
382.5 ± 41.66
0.74
399
332.78 ±52.34
0.72
Signal Recognition RNA
91
267.95 ± 61.72
0.71
364
227.04 ± 109.53
0.65
Transfer RNA
484
77.48 ± 4.8
0.96
489
77.19 ± 5.13
0.95
Total
1024
2305
Overview of different RNA classes in the MT and MA data sets, including the number of RNAs, the mean and standard deviation of the length and the average normalized similarity between RNA sequences in these two data sets for each RNA class.
Lu et al. reported their results on a restricted format of the MT set in which certain bases of some sequences are in lower case, indicating that the base is unpaired in the reference structure. They have used this structural information in their predictions. However, we have not employed this information, and therefore our accuracy measures on the MT set are different from those of Lu et al.
Thermodynamic Models
A thermodynamic model consists of
We also consider variants of the Turner model, used by Andronescu et al.
Algorithms
We analyze four RNA secondary structure prediction algorithms. The first predicts secondary structures that have minimum free energy (MFE) with respect to a given thermodynamic model. The second is the maximum expected accuracy (MEA) algorithm as introduced by Lu et al.
The third algorithm that we analyze is the generalized centroid estimator method of Hamada et al.
The rsMEA and rsMFE algorithms always use the parameters of Turner99 model as their free energy parameter set, while ubcMEA and ubcMFE use parameter sets in the MultiRNAFold model format, namely the BL*, CG* and T99MRF sets. The gCg1 and gCpMFmeas algorithms also employ BL* parameters using a Turner99 format.
Accuracy Measures
We use three measures for determining the structural prediction accuracy, namely sensitivity (also called precision or precision rate), positive predictive value or PPV (also called recall), and Fmeasure, which combines the sensitivity and PPV into a single measure.
Sensitivity is the ratio of correctly predicted base pairs to the total base pairs in the reference structures. PPV is the fraction of correctly predicted base pairs, out of all predicted base pairs. Fmeasure is the harmonic mean of the sensitivity and PPV. This value is equal to the arithmetic mean when sensitivity and PPV are equal. However, Fmeasure becomes smaller than the arithmetic mean as one of the numbers approaches 0 (while the other is fixed). The possible values for these three measures are between 0 and 1; the closer to 1, the better prediction.
The Fmeasure is widely used measure in the literature; it is also the common measure in the studies by Hamada et al. and by Lu et al., to which refer in our study. Mostly to facilitate comparison of their results to ours, we decided to use the Fmeasure rather than the Matthews correlation coefficient (another wellknown measure of accuracy predominantly used to assess binary classification methods).
Throughout this paper, we calculate three types of averages for a given measure "
The three averages are defined as follows when there are
and
where
Bootstrap Percentile Confidence Intervals
We calculated the confidence intervals of average Fmeasure for various RNA classes, by using the bootstrap percentile method
We verified that the bootstrap distributions are close to Gaussian, using the AndersonDarling test, which indicates that the bootstrap percentile intervals can be expected to be reasonably accurate
Permutation Test Method
To assess the statistical significance of the observed performance differences, we used permutation tests, following Aghaeepour and Hoos
The test that we applied proceeds as follows. First, we calculated the difference in means between sets of Fmeasure values obtained by the two given structure prediction procedures, an MEAbased method and MFE. For simplicity, we call these two sets
Furthermore, to assess whether the difference in accuracy between the MEAbased methods and the MFE method on a given parameter set is significant, we performed a twosided permutation test. This test works exactly like the onesided permutation test, except that its pvalue is calculated as the proportion of the sampled permutations where the absolute difference was greater than or equal to that of absolute difference of the means of sets
All of these calculations were performed using the "perm" package of the R statistics software environment.
Results and Discussion
In this section, we investigate to which degree the prediction accuracy of the energybased methods studied in this paper is dependent on different datasets and different thermodynamic parameter sets used for prediction.
We start by considering how the size of a dataset will influence the accuracy achieved by RNA secondary structure prediction methods. We then study the dependency of prediction accuracy of the MFE, MEA, gCg1, and gCpMFmeas algorithms on the thermodynamic parameter sets that they use.
Dependency of the Energybased Methods on Data Characteristics
Results of the Energybased Methods on Different Data Sets
Because measures of accuracy on reference datasets are used to assess the quality of prediction achieved by various algorithms, it is important to understand how such measures vary depending on the reference datasets used. Later in this section we also consider how accuracy measures vary depending on the energy model used.
The baseline data in Tables
Fmeasure prediction accuracy of the MEA, MFE, gCg1, and gCpMFmeas algorithms on the MT dataset
RNA class
Class size
Mean ± std of length
Fmeas (BL*)
Fmeas (Turner99)
ubcMEA
ubcMFE
gCg1
gCpMFmeas
rsMEA
rsMFE
16S Ribosomal RNA
88
377.88 ± 167.18
0.649
0.621
0.640
0.659
0.574
0.539
23S Ribosomal RNA
27
460.6 ± 151.3
0.711
0.683
0.693
0.733
0.681
0.646
5S Ribosomal RNA
309
119.5 ± 2.69
0.739
0.743
0.725
0.746
0.625
0.642
Group I intron
16
344.88 ± 66.42
0.705
0.650
0.674
0.708
0.627
0.599
Group II intron
3
668.7 70.92
0.720
0.739
0.683
0.750
0.744
0.703
Ribonuclease P RNA
6
382.5 ± 41.66
0.471
0.460
0.519
0.495
0.517
0.522
Signal Recognition RNA
91
267.95 ± 61.72
0.641
0.621
0.633
0.637
0.518
0.557
Transfer RNA
484
77.48 ± 4.8
0.718
0.775
0.727
0.782
0.726
0.727
Unweighted Average
0.669
0.662
0.662
0.689
0.627
0.617
Weighted Average
0.710
0.732
0.707
0.743
0.660
0.665
SWeighted Average
0.670
0.652
0.660
0.684
0.612
0.598
Fmeasure prediction accuracy of the MEA, MFE, gCg1, and gCpMFmeas algorithms on the MA dataset
RNA class
Class size
Mean ± std of length
Fmeas (BL*)
Fmeas (Turner99)
ubcMEA
ubcMFE
gCg1
gCpMFmeas
rsMEA
rsMFE
16S Ribosomal RNA
675
485.66 ± 113.02
0.625
0.645
0.630
0.665
0.561
0.521
23S Ribosomal RNA
159
453.44 ± 117.85
0.645
0.643
0.626
0.664
0.588
0.562
5S Ribosomal RNA
128
120.98 ± 3.21
0.782
0.780
0.763
0.782
0.616
0.630
Group I intron
89
368.49 ± 103.58
0.644
0.631
0.642
0.670
0.576
0.550
Group II intron
2
578 ± 47
0.540
0.609
0.524
0.582
0.472
0.471
Ribonuclease P RNA
399
332.78 ± 52.34
0.643
0.603
0.656
0.678
0.615
0.575
Signal Recognition RNA
364
227.04 ± 109.53
0.730
0.721
0.680
0.708
0.609
0.625
Transfer RNA
489
77.19 ± 5.13
0.706
0.764
0.719
0.773
0.727
0.726
Unweighted Average
0.668
0.676
0.655
0.690
0.596
0.583
Weighted Average
0.672
0.682
0.669
0.708
0.618
0.600
SWeighted Average
0.658
0.660
0.648
0.681
0.588
0.568
Using data from Tables
Relative performance of the MEA and MFE algorithms on the MT and MA datasets for Ribonuclease P RNAs and Group I introns
Relative performance of the MEA and MFE algorithms on the MT and MA datasets for Ribonuclease P RNAs and Group I introns.
Bootstrap Confidence Intervals for the Prediction Accuracy of the Energybased Methods on different RNA Classes
Following the work of Aghaeepour and Hoos
Figure
95% bootstrap percentile confidence intervals for the Fmeasure average of the ubcMEA and ubcMFE algorithms
95% bootstrap percentile confidence intervals for the Fmeasure average of the ubcMEA and ubcMFE algorithms. 95% bootstrap percentile confidence intervals are shown for the Fmeasure average of the ubcMEA (dashed red bars) and ubcMFE (solid black bars) algorithms on the MA and SFull sets and also different RNA classes in MA.
95% bootstrap percentile confidence intervals for the Fmeasure average of the gCpMFmeas and ubcMFE algorithms
95% bootstrap percentile confidence intervals for the Fmeasure average of the gCpMFmeas and ubcMFE algorithms. 95% bootstrap percentile confidence intervals are shown for the Fmeasure average of the gCpMFmeas (dashed red bars) and ubcMFE (solid black bars) algorithms on the MA and SFull sets and also different RNA classes in MA.
Confidence intervals obtained by the bootstrap percentile method for the MA and SFull sets and different RNA classes in MA
RNA class
Class size
Confidence Interval (CL = 0.95)
ubcMEA
ubcMFE
rsMEA
rsMFE
gCpMFmeas
16S Ribosomal RNA
675
(0.611, 0.639)
(0.630, 0.660)
(0.548, 0.574)
(0.507, 0.536)
(0.652, 0.679)
23S Ribosomal RNA
159
(0.609, 0.677)
(0.607, 0.677)
(0.556, 0.618)
(0.530, 0.593)
(0.633, 0.693)
5S Ribosomal RNA
128
(0.754, 0.807)
(0.751, 0.806)
(0.573, 0.657)
(0.586, 0.671)
(0.758, 0.804)
Group I intron
89
(0.605, 0.680)
(0.594, 0.666)
(0.540, 0.611)
(0.513, 0.587)
(0.634, 0.704)
Ribonuclease P RNA
399
(0.629, 0.659)
(0.588, 0.618)
(0.602, 0.628)
(0.561, 0.588)
(0.665, 0.690)
Signal Recognition RNA
364
(0.708, 0.750)
(0.698, 0.742)
(0.583, 0.635)
(0.599, 0.651)
(0.685, 0.728)
Transfer RNA
489
(0.683, 0.723)
(0.742, 0.785)
(0.707, 0.747)
(0.705, 0.748)
(0.754, 0.791)
MA
2305
(0.664, 0.680)
(0.673, 0.691)
(0.610, 0.627)
(0.591, 0.609)
(0.700, 0.715)
SFull
3246
(0.673, 0.688)
(0.678, 0.694)
(0.615, 0.631)
(0.598, 0.616)
(0.704, 0.718)
Confidence intervals are obtained by the bootstrap percentile method for the MA and SFull sets and different RNA classes in MA (except Group II intron) when ubcMEA and ubcMFE use the BL* parameter set and rsMEA and rsMFE use the Turner99 parameter set and the confidence level (CL) is set to 0.95.
First, the confidence intervals of all algorithms on the MA and SFull sets have a width of at most 0.018, indicating that the average accuracy measured on these sets is likely to be a good estimate  within 1%  of the accuracy of a population of RNA molecules represented by these sets. However, the interval widths on individual classes can much higher, e.g., 0.075 for ubcMEA on the Group I intron class, suggesting that average accuracy is not a reliable measure of a method's overall accuracy on such classes. We note that, without the use of rigorous statistical methods, a 0.02 difference in accuracy is considered significant in some related work
Second, the confidence interval widths of RNA classes do not strictly decrease as the size of the class increases (i.e., the number of RNAs in the respective part of the reference set). For example, as shown in Table
One possibility is that classes with greater similarity among structures would have smaller confidence intervals, since in the limit, if all sequences in a class are identical then the confidence interval has zero width. However, data from Tables
Understanding classspecific factors underlying differences in prediction accuracy remains a goal for further study.
Third, for any given dataset (either a specific RNA class or the MA or SFull sets), the confidence interval width for the ubcMFE algorithm is close to that for ubcMEA and gCpMFmeas, even though the location of the interval (i.e., the prediction accuracy values) can be quite different. For example, the average Fmeasure of ubcMEA and ubcMFE on the Ribonuclease P RNA class differ by 0.04, but the confidence interval widths are identical. As another example, the smallest and largest confidence interval widths of the 16S Ribosomal RNA class among all four algorithms are, respectively, 0.026 and 0.030, which indicates a difference of less than 0.01 in the interval widths for this class (see Table
Confidence interval width versus RNA class size in the MA set for the ubcMEA, ubcMFE and gCpMFmeas methods
Confidence interval width versus RNA class size in the MA set for the ubcMEA, ubcMFE and gCpMFmeas methods.
The other important observation from this figure is that, as one could expect, the width of the confidence interval tends to decrease with increasing number of RNAs of a given type in the reference set, but (as previously noted) there are notable exceptions to this trend for sets of size below 500.
The final interesting observation from Figure
Similar graphs for the rsMEA and rsMFE algorithms are provided in the additional files. Additional files
95% bootstrap percentile confidence interval graphs for the Fmeasure average of the rsMEA and rsMFE.95% bootstrap percentile confidence intervals are shown for the Fmeasure average of the rsMEA (dashed red bars), and rsMFE (solid black bars) algorithms on the MA and SFull sets and also different RNA classes in MA.
Click here for file
Confidence interval width versus RNA class size in the MA set for the rsMEA and rsMFE methods. The figure shows the confidence interval width of RNA classes in the MA set for the rsMEA and rsMFE methods.
Click here for file
Dependency of the Energybased Methods on Thermodynamic Parameters
The accuracy of algorithmic methods achieved by energybased secondary structure prediction approaches, such as gCPMFmeas, MEA and MFE, depends on the underlying thermodynamic parameter sets. Lu et al. showed that the accuracy of MEA is better than that of MFE when using the Turner99 parameter set (on the MT reference set of RNAs). But does MEA or gCPMFmeas outperform MFE on other published parameter sets? If so, is the difference in accuracy statistically significant? We address these questions in the following.
Table
Accuracy comparison of different prediction algorithms with various parameter sets on the SFull set
Algorithm
FMeasure
T99MRF
BL*
CG*
Turner99
ubcMEA
0.582 (0.574,0.591)
0.680 (0.673,0.688)
0.636 (0.628,0.644)
n/a
ubcMFE
0.601 (0.592,0.609)
0.686 (0.678,0.694)
0.671 (0.663,0.679)
n/a
rsMEA
n/a
n/a
n/a
0.623 (0.615,0.632)
rsMFE
n/a
n/a
n/a
0.607 (0.598,0.615)
gCpMFmeas

0.711 (0.704, 0.718)


The table presents the prediction accuracy of different algorithms with different thermodynamic sets in terms of Fmeasure. The 95% percentile confidence intervals of their accuracies are also shown in parentheses. The parameter set T99MRF refers to the Turner99 parameters in MultiRNAFold format. BL* and CG* are the parameter sets obtained by the BL and CG approaches of Andronescu et al.
Accuracy comparison of different prediction algorithms with various parameter sets on the SFullTest set. The table presents the prediction accuracy of different algorithms with different thermodynamic sets in terms of Fmeasure. The parameter set T99MRF refers to the Turner99 parameters in MultiRNAFold format. BL* and CG* are the parameter sets obtained by the BL and CG approaches of Andronescu et al.
Click here for file
In contrast, the relative accuracy of MEA versus MFE varies, with MFE being slightly better than MEA in two cases, MEA being slightly better than MFE in one case, and both having approximately the same accuracy in one case, namely BL*. To assess whether MEA has significantly better accuracy than MFE on our parameter sets, we first performed the onesided permutation test (see Methods) with the alternative hypothesis that MEA yields higher prediction accuracy than MFE. As recorded in Table
Pvalue of onesided permutation test for the MEA versus MFE algorithm on different parameter sets
Parameter set
T99MRF
BL*
CG*
Turner99
pvalue
0.999
0.848
1
0.003
Pvalue of onesided permutation test is computed for the MEA versus MFE algorithm on different parameter sets; pvalues below the standard significance level of 0.05 are shown in bold face and indicate significantly higher prediction accuracy of MEA compared to MFE.
We next performed the twosided permutation test on all mentioned parameter sets. The results, as recorded in Table
Pvalue of twosided permutation test for the MEA and MFE algorithms on different parameter sets
Parameter set
T99MRF
BL*
CG*
Turner99
pvalue
0.003
0.299
0
0.006
Pvalue of twosided permutation test is computed for the MEA and MFE algorithms on different parameter sets; pvalues below the standard significance level of 0.05 are shown in bold face and indicate significant differences in prediction accuracy.
Furthermore, the results in Table
Finally, we observe that for a given class of RNAs, depending on the energy model, MFE sometimes performs better than MEA and vice versa (see Table
Overall, we conclude that the relative performance of MEA versus MFE depends on the thermodynamic parameter set used, and that both ubcMEA and ubcMFE with the BL* parameter set have significantly better accuracy than rsMEA with the Turner99 parameter set.
Conclusions
Improvements both in algorithmic methods and in thermodynamic models lead to important advances in secondary structure prediction. In this work, we showed that gCpMFmeas with the BL* parameter set significantly outperforms the other MFE and MEAbased methods we studied. However, the relative performance of MEAbased and MFE methods vary depending on the thermodynamic parameter set used. For example, MEAbased methods significantly outperform minimum free energy (MFE) methods for the Turner99 model but the opposite is true for other models and in fact the difference in performance between MEAbased and MFE methods is negligible on our best thermodynamic model, BL*. Our findings suggest that, as thermodynamic models continue to evolve, a diverse toolbox of algorithmic methods will continue to be important.
We also showed the importance of using large datasets when assessing accuracy of specific algorithms and thermodynamic models. Specifically, we showed that bootstrap percentile confidence intervals for average prediction accuracy on our two largest datasets, MA and SFull, have narrow widths (at most 0.018), indicating that the average accuracies measured on these sets are likely to be good estimates of the accuracies of the populations of RNA molecules they represent. In contrast, interval widths for several of the specific RNA classes studied in the paper were much larger, with no clear correlation between confidence interval width and either class size or average normalized similarity. It may be the case that confidence interval widths are correlated with the distribution of evolutionary distances among the class members, rather than on the more simplistic average normalized similarity that we consider in this paper; studying this further would be an interesting direction for future research. Regardless, our analysis shows that larger datasets are needed to obtain reliable accuracy estimates on specific classes of RNAs, underscoring the importance of continued expansion of RNA secondary structure data sets.
Our work illustrates the use of statistical techniques to gauge the reliability and significance of measured accuracies of RNA secondary structure prediction algorithms. We hope that this work will provide a useful basis for ongoing assessment of the merits of RNA secondary structure prediction algorithms.
Authors' contributions
MH acquired, analyzed and interpreted the data and drafted the manuscript. AC supervised the research and participated in analysis of data and revising the manuscript. HH consulted in approaches used and participated in data interpretation and manuscript revision. All authors read and approved the final manuscript.
Acknowledgements
The authors thank and acknowledge Dr. Mirela Andronescu and Dr. David H. Mathews for providing their data sets and software used in the presented work. They also thank the anonymous reviewers for their very helpful comments on an earlier draft of the paper. This research was funded by a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC).