Department of Computer Science, University of Georgia, Athens, GA 30602, USA
Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
Department of Plant Biology, University of Georgia, Athens, GA 30602, USA
Abstract
Background
RNA secondary structure plays a scaffolding role for RNA tertiary conformation. Accurate secondary structure prediction can not only identify double-stranded helices and single-stranded loops but also help provide information on potential tertiary interaction motifs critical to the 3D conformation. The average accuracy in
Results
This research identified energetic rules for coaxial stacks and geometric constraints on stack combinations, which were applied to develop an efficient dynamic programming application for simultaneous prediction of secondary structure and coaxial stacking. Results on a number of noncoding RNA data sets, of short and moderately long lengths, show a performance improvement (especially on tRNAs) for secondary structure prediction when compared with existing methods. The program also demonstrates a capability for prediction of coaxial stacking.
Conclusions
The significant leap of performance on tRNAs demonstrated in this work suggests that a breakthrough to a higher performance in RNA secondary structure prediction may lie in understanding contributions from tertiary motifs critical to the structure, as such information can be used to constrain geometrically as well as energetically the space of RNA secondary structure.
Introduction
RNA secondary structure plays the critical role of scaffolding the tertiary structure (i.e., 3D conformation)
Elements of the secondary structure are interrelated with tertiary interaction motifs
There have been only a few previous computational investigations of RNA helix coaxial stacking. Walter
The new method has been developed into a dynamic programming application (called
Results
We implemented the algorithm in a program named
Data preparation
We downloaded five ncRNA datasets from the seed alignments of Rfam. Ninety-five tRNA sequences (about 10%) were randomly selected from the corresponding seed alignment of 967 tRNAs. All ninety-eight available Intron Group II sequences and all eighty-four available Hammerhead type III sequences were retrieved directly from the seed datasets. We also downloaded all 30 Intron Group I sequences available from its seed alignment and extracted the P4P6 domain of each sequence. Similarly, we retrieved all 79 HCV IRES sequences available from its seed alignment and extracted domain III of each sequence. The average lengths of the tRNA, Intron Group II, Hammerhead type III, P4P6, and HCV IRES domain III sequences are 73.62, 87.18, 55.36, 126, and 111.68, respectively. Many of these sequences contain long inserted regions compared to their annotated consensus structures, with lengths greatly exceeding the corresponding average lengths (see Table
Sensitivity based on the number of correctly predicted base pairs

| ncRNA | Num. of sequences | Avg len. | Min len. | Max len. | Sensitivity ( | Sensitivity ( |
|---|---|---|---|---|---|---|
| Hh3 | 84 | 55 | 40 | 82 | 85.04% | 95.71% |
| tRNA | 95 | 74 | 66 | 93 | 81.67% | 64.59% |
| IntrongII | 98 | 87 | 42 | 154 | 81.94% | 83.71% |
| P4P6 | 30 | 126 | 58 | 191 | 57.42% | 64.62% |
| HCV | 79 | 112 | 85 | 116 | 83.01% | 78.43% |
Performance comparison between
Performance in secondary structure prediction
We conducted two types of evaluations on the predicted structures. The first considers the percentage of base pairs correctly predicted by the programs. The second considers the number of sequences whose overall structure topology is correctly predicted. As shown in the next section, we also evaluated the capability of
Table
where
Test results on the tRNA data set demonstrate the advantage of incorporating coaxial stacking into the prediction of ncRNAs that may contain coaxial stacking motifs.
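The base-pair sensitivity used in this evaluation can be sketched as follows, assuming the standard definition: the number of reference base pairs that the prediction recovers, divided by the total number of reference base pairs. The function name and the toy structures below are hypothetical and are not taken from the data sets above.

```python
from typing import Set, Tuple

BasePair = Tuple[int, int]  # (i, j) sequence positions with i < j


def base_pair_sensitivity(reference: Set[BasePair], predicted: Set[BasePair]) -> float:
    """Fraction of reference base pairs that also appear in the prediction.

    Assumes the standard definition TP / (TP + FN); extra predicted pairs
    do not lower the score.
    """
    if not reference:
        raise ValueError("reference structure has no base pairs")
    true_positives = len(reference & predicted)
    return true_positives / len(reference)


# Toy structures (hypothetical positions, for illustration only).
reference = {(1, 20), (2, 19), (3, 18), (4, 17)}
predicted = {(1, 20), (2, 19), (3, 18), (8, 14)}  # one pair missed, one extra pair

sensitivity = base_pair_sensitivity(reference, predicted)
print(f"sensitivity = {sensitivity:.2%}")  # prints "sensitivity = 75.00%"
```

Note that the extra pair (8, 14) does not affect the score, which is why sensitivity alone does not penalize base pairs predicted in inserted regions.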
Table
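Sensitivity based on the number of correctly predicted topologies

| ncRNA | Topology sen. (%) | Topology sen. (%) | Adjusted topology sen. (%) | Adjusted topology sen. (%) |
|---|---|---|---|---|
| Hh3 | 75 | 92.86 | N/A | N/A |
| tRNA | 72.63 | 24.21 | 86.32 | 27.37 |
| IntrongII | 75.51 | 84.69 | N/A | N/A |
| P4P6 | 30 | 56.67 | 66.67 | 86.67 |
| HCV | 74.68 | 75.95 | N/A | N/A |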
Performance comparison between
The general tRNA tertiary structure (and the secondary structure in the box)
Four helices (in acceptor,
We point out that the relatively low sensitivity for
For longer sequences of P4P6, counting correctly predicted base pairs appeared to distance
Performance in coaxial stacking prediction
To evaluate the performance of our method in coaxial stacking prediction, we computed both the sensitivity and the positive predictive value (PPV) on the number of correctly predicted coaxial stacks. The PPV is defined as

$$\mathrm{PPV} = \frac{TP}{TP + FP},$$
where
Table
PPV and sensitivity based on the number of correctly predicted coaxial stackings

| ncRNA | Num. of sequences | TP | FP | PPV (%) | Sensitivity (%) |
|---|---|---|---|---|---|
| tRNA | 95 | 130 | 44 | 74.71 | 68.42 |
| Hh3 | 84 | 59 | 37 | 61.45 | 70.23 |
| HCV | 79 | 74 | 38 | 66.07 | 46.83 |
Performance of
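As a quick consistency check on the coaxial stacking table, the PPV column can be recomputed from the TP and FP counts, assuming the standard definition PPV = TP / (TP + FP). The function and variable names here are illustrative only.

```python
def positive_predictive_value(tp: int, fp: int) -> float:
    """PPV = TP / (TP + FP), the fraction of predicted stacks that are correct."""
    if tp + fp == 0:
        raise ValueError("no positive predictions")
    return tp / (tp + fp)


# (TP, FP) counts as reported in the coaxial stacking table.
coaxial_counts = {"tRNA": (130, 44), "Hh3": (59, 37), "HCV": (74, 38)}

ppv_percent = {
    name: 100.0 * positive_predictive_value(tp, fp)
    for name, (tp, fp) in coaxial_counts.items()
}

for name, value in ppv_percent.items():
    print(f"{name}: PPV = {value:.2f}%")
```

The recomputed values agree with the reported ones to within 0.01 percentage points. The Hh3 value is 59/96 = 61.458...%, which would be consistent with the reported 61.45 having been truncated rather than rounded.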
We compare these results with previous work by Tyagi and Mathews, who tested the idea of coaxial stack prediction using energy minimization with nearest-neighbor parameters
Discussion
While our program,
We point out the small differences in performance between
We did not use the positive predictive value (PPV) to measure performance on correctly predicted base pairs. This is because some base pairs that do not belong to the consensus structure but are predicted by the programs may be valid if they fall in inserted regions of the consensus structure. Counting such base pairs as false positives would bias the evaluation against sequences substantially longer than the consensus. This situation was evident in our tests on such sequences, typically tRNAs in which the variable loop may contain an extra stem-loop.
We have also examined the coaxial stacking prediction on the P4P6 sequences by
The outcome of the tests on tRNAs is the most interesting. The secondary structures of tRNAs have been difficult to predict from individual sequences with energy-based methods, in spite of the conserved native structure across types and species. This is because a tRNA may have many alternative structures with free energies within 5-10% of the minimum free energy.
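To make the notion of an energy window concrete, the following minimal sketch selects all candidate structures whose free energy lies within a given fraction of the minimum free energy (MFE). The structure labels and energy values are hypothetical toy data, not results from this work, and the function name is an assumption made for illustration.

```python
from typing import List, Tuple

# (label, free energy in kcal/mol); hypothetical values for illustration.
Candidate = Tuple[str, float]


def within_energy_window(candidates: List[Candidate], fraction: float) -> List[Candidate]:
    """Return candidates whose energy is within `fraction` of the MFE.

    With negative free energies, "within 10% of the MFE" is taken to mean
    E - MFE <= 0.10 * |MFE|.
    """
    mfe = min(energy for _, energy in candidates)
    tolerance = fraction * abs(mfe)
    return [(label, energy) for label, energy in candidates if energy - mfe <= tolerance]


candidates = [
    ("cloverleaf", -28.5),
    ("extended hairpin", -30.0),
    ("alternative A", -27.4),
    ("alternative B", -25.0),
    ("alternative C", -20.1),
]

near_optimal = within_energy_window(candidates, fraction=0.10)
print([label for label, _ in near_optimal])
```

In this toy example the structure labeled as the cloverleaf is not the MFE structure but still falls inside the 10% window, which illustrates why a purely energy-based method can select a competing fold.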
Conclusions
This work introduced a new method for simultaneous prediction of RNA secondary structure and coaxial stacking between helices. The aims of incorporating coaxial stacking detection included improving the performance of energy-based
The significant leap in performance on tRNAs in this work suggests that a breakthrough to higher performance in RNA secondary structure prediction may lie in understanding contributions from tertiary motifs critical to the structure, as such information can be used to constrain the space of RNA secondary structures both geometrically and energetically. Since coaxial stacking is still a local tertiary motif, incorporating information on higher-order tertiary motifs, such as junctions, may further improve the prediction performance.
Methods
In the secondary structure, canonical base pairs form double-stranded stems (called helices in the tertiary structure) that join and enclose unpaired, single-stranded loops. Figure
Coaxial stacking of helices
A secondary structure illustration of a coaxial stack between two helices that share the same contiguous single-stranded loop, in which unpaired nucleotides may be present. The terminal base pairs of the two helices stack on each other, resulting in an extra energy reduction calculated as if they were contiguous base pairs (shown in the callout).
Coaxial stacking rules
Previous investigations on threeway junctions
Based on this survey, we were able to identify two energy thresholds: less than 2.5 kcal/mol for
Definition. We denote (
1. Coaxial stack (
2. Coaxial stack (
In particular, coaxial stacks in two-way junctions are always nested. In multi-way junctions, coaxial stacks may be either nested or parallel (see Figures
The amount of reduced energy, attributed to a coaxial stack, is defined as the free energy contributed from the two stacked base pairs on the interface (see Figure
Geometric constraints
We applied additional constraints on coaxially stacked helices based on geometric feasibility. These constraints address the cases in which two or more coaxial stacks occur simultaneously and all involve the same helix. In particular, we identified the following rules to ensure geometric consistency. Assume helix
1. Stacks (
2. Stack (
3. Stack (
Figure
Compound coaxial stacks
An illustration for the three general situations of compound coaxial stacks, where 5' and 3' indicate the backbones from the 5' end and to the 3' end of the sequence, respectively. In the left structure, the stacks (
Algorithm
We developed our method into an algorithm for
Preprocessing of helices
The preprocessing step selects helix candidates and identifies potential coaxial stacks. A semi-global alignment algorithm is used to search for helix candidates
Two helices are recognized as a potential coaxial stack if they share a contiguous single-stranded backbone with at most one unpaired nucleotide. Potential coaxial stacks are classified into parallel and nested stacks based on the conditions given in the section above about
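The adjacency test for potential coaxial stacks can be sketched as below. This is a minimal illustration under stated assumptions, not the preprocessing code of this work: the `Helix` record, the function names, and the example coordinates (loosely modeled on a tRNA cloverleaf) are all hypothetical, and the classification into parallel and nested stacks and the energy thresholds are omitted.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Helix:
    """Hypothetical helix record: two base-pairing regions, inclusive positions."""
    five_start: int
    five_end: int
    three_start: int
    three_end: int

    def regions(self) -> List[Tuple[int, int]]:
        return [(self.five_start, self.five_end), (self.three_start, self.three_end)]


def shared_backbone_gap(a: Helix, b: Helix, max_unpaired: int = 1) -> Optional[int]:
    """Smallest count of unpaired nucleotides separating a region of `a` from a
    region of `b`, or None if no two regions are within `max_unpaired`."""
    best: Optional[int] = None
    for s1, e1 in a.regions():
        for s2, e2 in b.regions():
            # Gap when b's region follows a's region, and the reverse order.
            for gap in (s2 - e1 - 1, s1 - e2 - 1):
                if 0 <= gap <= max_unpaired and (best is None or gap < best):
                    best = gap
    return best


def potential_coaxial_stacks(helices: List[Helix], max_unpaired: int = 1):
    """All helix pairs that share a backbone with at most `max_unpaired` free bases."""
    return [
        (a, b)
        for a, b in combinations(helices, 2)
        if shared_backbone_gap(a, b, max_unpaired) is not None
    ]


# Illustrative coordinates only (not taken from the paper's data).
acceptor = Helix(1, 7, 66, 72)
d_stem = Helix(10, 13, 22, 25)
anticodon = Helix(27, 31, 39, 43)
t_stem = Helix(49, 53, 61, 65)

stacks = potential_coaxial_stacks([acceptor, d_stem, anticodon, t_stem])
```

With these coordinates the acceptor and T stems are directly adjacent (no unpaired nucleotide), and the D and anticodon stems are separated by a single unpaired nucleotide, so both pairs are reported; the acceptor and D stems are separated by two unpaired nucleotides and are not.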
Prediction via dynamic programming
We adopted the idea in Nussinov's algorithm
Candidates and orderings
A helix consists of two base-pairing regions; each region is a contiguous backbone consisting of a number of consecutive nucleotides. A helix found by the preprocessing step can be viewed as two base-pairing regions. Throughout this section we will refer to candidate regions simply as
On an RNA sequence
•
•
If two candidates occupy the exact same region on the sequence, then one of them gets the lower index in a consistent manner throughout the algorithm.
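One way to obtain two consistent orderings of candidate regions is sketched below. It assumes that the Starting Position Order (SPO) ranks candidates by their first nucleotide and the Ending Position Order (EPO) by their last nucleotide, and it breaks ties between identical regions by a fixed candidate id; that tie-breaking rule and the function name are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Region = Tuple[int, int]  # (start, end), inclusive sequence positions


def order_candidates(regions: List[Region]) -> Tuple[Dict[int, int], Dict[int, int]]:
    """Map each candidate id to its rank in SPO and in EPO.

    Ties are broken by the other coordinate and then by the candidate id,
    so identical regions always receive the same relative order.
    """
    ids = range(len(regions))
    spo_sorted = sorted(ids, key=lambda k: (regions[k][0], regions[k][1], k))
    epo_sorted = sorted(ids, key=lambda k: (regions[k][1], regions[k][0], k))
    spo = {cand: rank for rank, cand in enumerate(spo_sorted)}
    epo = {cand: rank for rank, cand in enumerate(epo_sorted)}
    return spo, epo


# Candidates 0 and 2 occupy the same region; candidate 0 gets the lower rank.
regions = [(10, 13), (1, 7), (10, 13), (5, 20)]
spo, epo = order_candidates(regions)
print("SPO ranks:", spo)
print("EPO ranks:", epo)
```

Candidate 3 starts early but ends late, so it ranks second in SPO and last in EPO, which shows why the two orderings need to be tracked separately.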
The recurrence relations in our dynamic programming algorithm have the general form
Algorithm overview
Similar to Nussinov's algorithm, four different cases can happen when finding the optimal structure of the subsequence spanned from candidate
• Region
• Region
• Region
• The optimal structure is formed by putting together the optimal substructures of the subsequence from region
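For reference, the four cases of the classical Nussinov recurrence (left end unpaired, right end unpaired, the two ends paired, and bifurcation) can be sketched as follows. This is the textbook base-pair maximization over single nucleotides, not the helix-based, energy-driven recurrences of this work; the minimum loop length of 3 and the set of allowed pairs are conventional assumptions.

```python
from typing import List

# Canonical and wobble pairs, as conventionally allowed.
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov_max_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in `seq` (classical Nussinov DP)."""
    n = len(seq)
    if n == 0:
        return 0
    dp: List[List[int]] = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Cases 1 and 2: position i or position j is left unpaired.
            best = max(dp[i + 1][j], dp[i][j - 1])
            # Case 3: i pairs with j, enclosing the subsequence i+1..j-1.
            if (seq[i], seq[j]) in ALLOWED_PAIRS:
                best = max(best, dp[i + 1][j - 1] + 1)
            # Case 4: bifurcation into two independent substructures.
            for k in range(i + 1, j):
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]


print(nussinov_max_pairs("GGGAAAUCC"))  # prints 3
```

In the method described here, the same four-way case analysis is applied to candidate regions rather than single nucleotides, with additional cases to track coaxial stacks.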
Our algorithm can recursively generate the following types of topological constructs:
1. An
2. An
(a) an
(b) a two-way junction where the helices coaxially stack,
(c) an
(d) an
Each of the three types of helices, defined earlier (see the
•
• Functions of the form
• Functions of the form
Notation
We use the following notation throughout this section:
•
•
•
•
•
•
•
•
•
•
•
•
•
• In rules of the form
Recurrences
Assuming that the preprocessing step results in
The function
In the above function, the first case, with
The following recurrence is used for generating a helix that does not coaxially stack with any helix outside of the current subsequence.
The following recurrence is used for performing bifurcations such that no helix in the substructure coaxially stacks with any helix outside the current subsequence.
The following recurrence is used for the case that helix
Similarly, the following recurrence is used for the case that helix
The following recurrence is used for the case that helix
The following recurrence is used for the case that helix
The following recurrence is used for generating a helix with the assumption that it forms a parallel coaxial stacking with a helix to the right of the subsequence from region
The following recurrence is used for performing bifurcations such that the rightmost helix of the resulting substructure forms a parallel coaxial stacking with a helix to the right of the subsequence from region
Similarly we define the recurrences
Abbreviations
SPO: Starting Position Order; EPO: Ending Position Order.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
PS designed and implemented the prediction algorithm. In addition to contributing to drafting this manuscript, he was also in charge of acquiring data, testing, and analysing the results. YW designed and implemented the preprocessing algorithm. He also helped with data acquisition and result analysis. RM provided the biological insight, and also contributed to data acquisition and results analysis. LC conceived the overall model and algorithm and drafted the manuscript. All authors read and approved the manuscript.
Acknowledgements
This article has been published as part of
This research project was supported in part by the NSF MRI 0821263 grant, the NIH BISTI R01GM07208001A1 grant, the NIH ARRA Administrative Supplement to NIH BISTI R01GM07208001A1, and the NSF IIS grant of award No: 0916250. We used the software