Current advances of the next-generation sequencing technology have revealed a large number of un-annotated RNA transcripts. Comparative study of the RNA structurome is an important approach to assess their biological functionalities. Due to the large sizes and abundance of the RNA transcripts, an efficient and accurate RNA structure-structure alignment algorithm is in urgent need to facilitate the comparative study. Despite the importance of the RNA secondary structure alignment problem, there are no computational tools available that provide high computational efficiency and accuracy. In this case, designing and implementing such an efficient and accurate RNA secondary structure alignment algorithm is highly desirable.
In this work, through incorporating the sparse dynamic programming technique, we implemented an algorithm that has an O(n3) expected time complexity, where n is the average number of base pairs in the RNA structures. This complexity, which can be shown assuming the polymer-zeta property, is confirmed by our experiments. The resulting new RNA secondary structure alignment tool is called ERA. Benchmark results indicate that ERA can significantly speedup RNA structure-structure alignments compared to other state-of-the-art RNA alignment tools, while maintaining high alignment accuracy.
Using the sparse dynamic programming technique, we are able to develop a new RNA secondary structure alignment tool that is both efficient and accurate. We anticipate that the new alignment algorithm ERA will significantly promote comparative RNA structure studies. The program, ERA, is freely available at http://genome.ucf.edu/ERA webcite.
Non-coding RNAs (ncRNAs) have recently been recognized as important regulators of the biological systems [1,2]. They participate in the control of alternative splicing , gene transcription  and translation , and mRNA localization . Most of the ncRNAs exert their biological functions by folding into specific structures, which makes the study of the RNA structurome a critical step towards complete understanding of the operational mechanism of the biological system . Recently, genome-wide RNA structurome analysis has led to many interesting discoveries regarding novel regulatory mechanisms. For example, analysis of the RNA structural elements in Drosophila melanogaster 3’-UTR suggests a cluster of ncRNA elements that can direct the localization of their upstream genes within the spermatids . Similar studies have also been applied to the Ciona intestinalis genome for novel ncRNA family discovery . With the finishing of the ENCODE  and modENCODE  projects, we expect that much more RNA transcripts will be experimentally identified. Many of these RNA transcripts may have exceptionally large sizes , and calls for more efficient computational tools to analyze their structures.
As more RNA transcripts are being discovered, the experimental approaches for probing ncRNA structures are also being revolutionized, allowing more accurate functional investigation through exploiting the structure-function relationship. Traditional RNA three-dimensional (3D) structure determination techniques such as X-ray crystallography, NMR and cryo-EM are expensive, making them inappropriate for genome-wide survey of RNA structures. Currently, the emerging massive parallel sequencing technology has been incorporated into the traditional chemical probing methods, making genome-wide experimental determination of RNA secondary structures possible and with low cost. Available techniques in this category include PARS , FragSeq , and SHAPE-seq . The RNA secondary structures determined by these techniques are much more accurate than those predicted by pure computational methods. For example, when coupled with SHAPE-seq data, the free energy minimization approach  is able to predict the secondary structure of a 16S rRNA with over 95% accuracy . In this case, the major purpose of this work is to develop an efficient and accurate RNA secondary structure alignment algorithm to facilitate genome-wide comparative studies of these RNA secondary structures.
There are many existing algorithms that focus on the RNA secondary structure alignment problem [18-24]. RNA secondary structures can be represented as tree structures, and the edit-distance between the tree structures can be used to represent their structural similarity . Algorithms using such strategy are usually called tree editing algorithms. Using heavy path decomposition, Klein  improved the time complexity of the tree editing algorithm to O(l3logl). Recently, Demaine et al. further improved the time complexity to O(l3) based on Klein’s algorithm. However, Jiang et al. proposed to compute tree alignment distance for the comparison of trees. Algorithms that compute such a measure are called tree alignment algorithms. The tree alignment algorithm is a special case of the tree editing algorithm . The tree alignment algorithm has been implemented into an RNA secondary structure alignment tool called RNAforester. Both the tree editing and tree alignment algorithms rely on tree representation of the RNA structure, and make sophisticated scoring functions difficult to implement (such as the affine gap penalty for the loop regions). In addition, both tree editing and tree alignment algorithms do not treat base pairs as units of comparison, and make it difficult to implement a complete set of base-pair edit operations for RNA secondary structures editing (base-pair match, mismatch, breaking, altering, and removing; as defined by Jiang et al.). We demonstrate such a problem by showing a real example from the implementation of the widely-used RNA secondary structure alignment tool RNAforester.
Consider that the two RNA structures shown in Figure 1 (a) are being aligned as trees. In the first RNA structure, due to the insertion of a uracil (U), an additional base pair is predicted (dashed arc, Row 1). Both structures are enclosed by G-C base pairs, and we focus on the alignment of their inner regions (boxed regions, Row 1). Following RNAforester’s extended tree representation , the two RNA structures can be transformed into two trees (Row 2). The ‘P’ node represents a base pair formed between the two corresponding nucleotides. Because there is no base pair in the second structure, the only allowed operations are bond breaking and base-pair deletion (Row 3). For the bond breaking operation, the base pair formed between A and U is broken, leaving them aligned to A and G in the second structure, respectively (blue boxes, Row 3). The alignment between the U (first structure) and G (second structure) introduces an unnecessary mismatch, making the alignment incorrect (blue boxes, Row 4). For the base-pair deletion operation, the entire base pair (including the two nucleotides A and U) is deleted (red box, Row 3). This operation opens two unnecessary gaps in the alignment (red boxes, Row 4), making it underestimate the real structural similarity. On the other hand, we expect to handle the mis-predicted base pairs in a more straightforward way. As shown in Figure 1 (b), we simply break the base pair interaction and disassociate the two corresponding nucleotides completely (red cross, Row 2). These two nucleotides are then treated as regular unpaired nucleotides. We can use the standard sequence alignment algorithm  (with affine gap penalty for better alignment quality in the unpaired regions) to evaluate the pure sequence similarity between the boxed hairpin-loop regions (Row 3). The resulting alignment contains only one gap, and correctly interprets the true structural difference between the two RNA structures (red boxes, Row 4).
Figure 1. Comparison between the tree-based alignment approach and the general edit-distance alignment approach in handling mis-predicted base pairs. (a) The tree-based alignment algorithm in handling mis-predicted base pairs. Row 1: The arcs on the sequences indicate the base pairs (solid arc indicates real base pairs, while dashed arc indicates mis-predicted base pairs). The structure regions indicated by the boxes are being aligned. Row 2: The two RNA structures are modeled into trees according to RNAforester. The ‘P’ node was introduced to represent a base pair. Row 3: Either the bond breaking or the base-pair deletion operation is taken. The blue boxes indicate the aligned nucleotides in the bond-breaking case. The red box indicates the base pair (including its nucleotides) being deleted in the base-pair deletion case. Row 4: The corresponding alignments resulted from both operations. The boxes in the alignments correspond to those in the RNA structure trees. Neither of the alignments is correct. (b) The general edit-distance alignment algorithm in handling mis-predicted base pairs. Row 1: The same RNA structures are being aligned. Row 2: The base-pair interaction is deleted (red cross), leaving two free nucleotides. Row 3: The sequence similarity between the boxed regions is assessed using a traditional sequence alignment algorithm . Row 4: The corresponding alignment is generated correctly. The boxes correspond to nucleotides that form the mis-predicted base pair.
The above example clearly shows the limitation of the implementation of the tree-based RNA secondary structure alignment algorithm RNAforester. Implementing the complete set of base-pair edit operations under the tree representation appears to be not a trivial task. Therefore, we propose to implement the general edit-distance alignment approach where all edit operations can be implemented naturally. To guarantee that the implementation is as efficient as the Demaine et al.’s algorithm (O(l3)), we incorporate the sparse DP technique into a simultaneous alignment and folding (SAF) algorithm RNAscf and restrict its input to fixed RNA secondary structures (recall that the general edit-distance alignment algorithm is a restricted case of the SAF algorithm). Using this technique, we can reduce the original time complexity by reducing a factor from n2 to z, where n is the number of base pairs in the fixed RNA structures and n<z<n2. Under the assumption of the polymer-zeta property of RNA molecules , it is expected that z≪n2 and even z∈O(n). In this case, the new general edit-distance RNA structure-structure alignment algorithm will have a time complexity of O(zn2+zl2). The new time complexity has an expected cubic (z=O(n)=O(l)) growth behavior, and is the same as Demaime et al.’s algorithm . In addition, we also devise a novel online pruning technique to further speedup the new algorithm, which deletes obsolete candidates on-the-fly. By combining both speedup techniques, the new RNA structure alignment algorithm is capable of comparing RNA secondary structures efficiently and accurately.
We have implemented the proposed RNA structure alignment algorithm into a program called ERA (Efficient RNA Alignment). The benchmark results showed that ERA has the expected O(zl2) time complexity. We showed the O(zl2) time complexity of ERA through aligning Rfam  RNA structures that were carefully chosen to represent a wide rage of input sizes. We also used a BraliBase II  benchmark to compare tools ERA, LocARNA and RNAforester when aligning RNAs with known structures. Nearly identical alignment quality can be observed for the general edit-distance alignment tools ERA and LocARNA, while both of them are more accurate than the tree alignment algorithm RNAforester. Finally, we also concluded that ERA is efficiently implemented by observing an average of 10 fold speedup over LocARNA, and RNAforester in terms of real RNA structure alignments. Based on these results, we confirmed that the sparse DP technique and the online pruning technique are successfully incorporated into the original RNAscf algorithm. We also anticipate that ERA will become an important bioinformatics tool for comparative RNA structure analysis.
In this section, we will present a novel general edit-distance RNA structure alignment algorithm by incorporating the sparse DP technique into the RNAscf algorithm. RNAscf was originally designed to identify the consensus structure between two RNA sequences. It guides the DP process though stacks and has a time complexity of O(n4+n2l2). Comparing to LocARNA (which has a time complexity of O(l4+n2l2)), the indexing scheme used by RNAscf makes it easier to incorporate the sparse DP technique, which aims to reduce the size of n instead of l. In addition to the sparse DP technique, we will also present an online pruning technique, which tries to reduce the search space of the algorithm as the DP proceeds. Through combining these two speedup techniques, the novel algorithm will have an expected O(zl2) time complexity, where n<z≪n2.
The Methods section is organized as follows: In Section ‘Preliminaries and definitions’, we will give the basic definition of RNA structures and the RNA alignment problem. In Section ‘The original O(n4+n2l2) algorithm’, we will reintroduce the RNAscf algorithm as a basis to understand the novel algorithm that is developed in this work. In Section ‘Triangular inequality and optimal pair matchings’, we will present the triangular inequality in RNA alignment with necessary proofs, which serves as a theoretical foundation for the sparse DP technique. In Section ‘Detection of optimal pair matchings’, we will further discuss the implementation details of incorporating the sparse DP technique. In Section ‘A new algorithm with cubic time complexity’, we will present the novel RNA alignment algorithm with the incorporation of the sparse DP technique. In Section ‘Online pruning of optimal pair matchings’, we will present the online pruning technique as an additional speedup step to the novel algorithm. Finally, in Section ‘Pseudo-code’, we will summarize the new algorithm using pseudo-code that can be directly implemented.
Preliminaries and definitions
We will begin with the introduction of the basic symbols and notations. The secondary structure of an RNA A of length lA is represented by a set of base pairs in A, denoted as . A base pair is an interaction formed between two nucleotides in the sequence of A, whose positions are denoted by l(pA) and r(pA) (without loss of generality, we assume l(pA)<r(pA)). The base pair pA can also be represented as (l(pA),r(pA)). The base pairs are partially ordered by the increasing order of their ending nucleotides, i.e. if and only if . Since we do not consider RNA ensembles, no crossing base pair is allowed. That is, we do not allow . The two base pairs and are either enclosing or juxtaposing to each other. The base pair encloses if , denoted as . The base pair juxtaposes to and before if , and is denoted by .
We also define loop regions (i.e. hairpin loop, internal/bulge loop, and multi-branch loop) whose sequence similarities are assessed by the alignment. The loop regions can be viewed as the unpaired regions in the RNA sequence that are segregated by the paired nucleotides. Let A[ i…j] denote a continuous sequence region in RNA A, which begins with the ith nucleotide and ends with the jth nucleotide. Define L(pA) as the sequence A[ l(pA)+1…r(pA)-1] (hairpin loop). If , define as the sequence , and as the sequence (internal or bulge loop). If , define as the sequence (multi-branch loop).
The structure alignment between RNA A and B is the optimal matching between their base-pair sets and and the corresponding loop similarities. In other words, the alignment between RNAs A and B is a one-to-one binary relation on the base-pair sets and . To ensure that the alignment will not lead to conflicting base-pair matchings, for any and , either and , or and . Given the alignment , the matched base pairs in will partition the RNA sequences A and B into two sets of loop regions, and , respectively. The sequence similarity between these two sets of loop regions is added to compute the overall alignment score. The optimal alignment is the relation that maximizes overall alignment score M that combines both structure and sequence similarities:
Here, the first term is the summation of all structural similarities (Sstr) between the annotated base pairs. The structural similarity score for base-pair substitution is set using the RIBOSUM matrix , denoting such base-pair substitution matrix as R. We do not give penalty for base-pair deletion or insertion, as we may expect incorrectly predicted base pairs in the input RNA structures. The second term is the summation of the sequence similarities (Sseq) on all loop (unpaired) regions that are determined by base-pair matchings in. The sequence similarity between two sequence regions is computed as traditional sequence alignment, with D as a 4-by-4 matrix that accounts for nucleotide substitution (set using the RIBOSUM matrix), g as the gap opening penalty, and e as the gap extension penalty  (g and e are both set to negative values and g<e). The weights w1 and w2 are used to balance the structural and sequence contribution to the overall alignment score, and we set w1>w2 to emphasize structural similarity. To simplify the expressions, in the rest of this article, we assume that w1 has been multiplied to all structural similarity terms (R), and w2 has been multiplied to all sequence similarity terms (D, g, and e).
We will now define the matrices that are used by the DP algorithm. Denote M[ pA,pB] as the optimal structure alignment score between the regions enclosed by pA and pB, given that pA is matched with pB. Denote Mh[ pA,pB] as the optimal alignment score when the matching of pA and pB corresponds to a hairpin loop in the consensus structure. Similarly, Ml[ pA,pB] stores the optimal alignment score when the matching of pA and pB corresponds to an internal, a bulge, or a multi loop in the consensus structure. Assume that , and , Ml[ pA,pB] can be computed by referring to the matrix , which stores the optimal alignment score between the juxtaposed base-pair chains (each chain contains at least one base pair) that end with and , respectively. The optimal alignment between A and B can be retrieved from , where and are pseudo base pairs such that , , and .
The original O(n4+n2l2) algorithm
In this section, we briefly reintroduce the RNAscf algorithm for RNA consensus structure prediction as a basis for understanding the novel algorithm developed in this work. The recursive functions for the RNAscf algorithm are outlined as follows:
In these recursive functions, Sstr denotes the structural similarity between two base pairs pA and pB, Sseq denotes the sequence similarity between two unpaired regions, and G indicates the gap penalty for completely deleting the corresponding unpaired region. Note that G(|L|)=g+|L|∗e if |L|>0, and G(|L|)=0 otherwise. The base pair set contains all base pairs that are directly before and juxtaposed to . In other words, if , then there is no such base pair , such that . In most real scenarios, is considered as a constant [29,35]. This chaining technique based on the set enables us to handle the multi-loop case efficiently, by only considering cases when computing Mc.
Recall that the input RNA sequences have an average length of l and form an average of n base pairs. This algorithm can be computed with an expected time complexity of O(n4+n2l2). To see the time complexity, first note that all sequence similarity scores that are referred in the recursive functions can be computed within O(n2l2) time. Because all loop regions are segregated by base pairs, the number of loop regions is clearly bounded by O(n). Therefore, there are O(n2) combinations of loop matchings, and computing each matching requires O(l2) time using a standard sequence alignment algorithm . To this point, we assume all sequence similarities are computed using O(n2l2) time, and are stored in a matrix for constant-time lookup. Now, observe that this algorithm computes the optimal alignment by filling up the DP table M, which contains O(n2) values. Computing each value in the matrix M depends on the corresponding values of Mh, Ml, and Mc. The computation of values in matrix Mh can be finished in a constant time due to the pre-computed sequence similarities. The computation of Ml requires O(n2) time, as determined by the necessity of traversing all possible combinations i and i′ (see Equation 4). Finally, Mc can also be expected to be computed in a constant time, as is assumed to be a constant. In this case, the computation of matrix M requires O(n4) time. Adding up the time required to pre-compute all sequence similarities of the loops, the overall time complexity for this algorithm thus becomes O(n4+n2l2).
Triangular inequality and optimal pair matchings
The triangular inequality property servers as the theoretical foundation for the sparse DP technique, which saves search space while maintaining the global optimality. For computational RNA studies, this technique has been used in RNA folding , RNA consensus folding (SAF) [36,37], as well as RNA-RNA interaction prediction  applications. In this work, our aim is to bring this technique into the RNA structure alignment application, where fixed RNA structures are considered instead of RNA structure ensembles.
Consider the alignment between the RNA secondary structures within the two regions A[ i…j] and B[ i′…j′] (see Figure 2 (a)). Denote M[ i,j;i′,j′] as the optimal alignment score for such alignment. The triangular inequality can be summarized using the following inequality:
where i≤k<j and i′≤k′<j′. This is because the partitions of the regions A[ i…j] and B[ i′…j′] at positions k and k′, respectively, do not necessarily compatible with the optimal alignment.
Figure 2. Illustration of the triangular inequality property. (a) Triangular inequality property of RNA secondary structure alignment. The horizontal lines indicate RNA sequences A and B. The dashed arcs are the pseudo base pairs added to the specific nucleotides, while the shaded areas define the correspondence between regions that are being aligned. (b) Alternative paths that go through either pA and pB, or and . The two shadings (dark and light gray) along the arcs represent the two alternative paths.
To simplify the expression of the triangular inequality property, we define a number of pseudo base pairs to indicate specific regions of interest. A pseudo base pair is a void interaction, such that the structural similarity between any two pseudo base pairs is defined to be 0. For instance, let p and p′ be two arbitrary pseudo base pairs, we will have Sstr(p,p′)=0. The pseudo base pairs are only used for the sake of representational simplicity, and are not required for the implementation of the algorithm. Define a pseudo base pair pA=(i,j) and a pseudo base pair pB=(i′,j′). In this case, the optimal alignment score between the regions A[ i…j] and B[ i′…j′], i.e. M[ i,j;i′,j′], can be rewritten as M[ pA,pB]. Similarly, define pseudo base pairs , , , and (see Figure 2 (a)). The triangular inequality can be simplified using the following observation:
Using Observation 1, we can detect potential redundant computations in the original algorithm. Consider the structural configurations shown in Figure 2 (b), and assume that the base pairs pA and pB are being aligned at the current stage. Let and be arbitrary base pairs such that . Note that may also represent a pseudo base pair in order to consider an arbitrary subregion enclosed by pA. Define pseudo base pairs , , , , , and . Pseudo base pairs are also added to B symmetrically (see Figure 2 (b)). We can then prove Lemma 1 using Observation 1:
The first inequality is a direct application of Observation 1, and the second inequality is specified in the condition of Lemma 1.
Because and are arbitrary base pairs, Lemma 1 implies that the matching between pA and pB is guaranteed to be suboptimal. That is, the overall alignment score, given that pA matches with pB, is always lower than that when assuming they do not match (as the matching of pA and pB is conflicted with the matching of and , as well as the matching of and ). In this case, we can devise the DP algorithm to bypass the redundant references to the scenarios where pA matches pB. Conversely, for the implementation of this idea, the DP algorithm will refer to the scenarios of matching pA and pB only when the condition specified in Lemma 1 is NOT satisfied. These necessary base-pair matchings are called the Optimal Pair Matchings (OPMs). If the matching of pA and pB is an OPM, we denote this OPM as oA,B. Similarly, we represent the OPM formed by base pairs and as . The new RNA alignment algorithm will maintain an OPM list , which is modified online as the DP proceeds, so as to include newly identified OPMs and remove obsolete OPMs (which will be discussed in Section ‘Online pruning of optimal pair matchings’). If we assume that the RNA molecules have the polymer-zeta property , restricting the search space of the DP using the OPM list will reduce the time complexity of the RNA alignment algorithm to O(zl2) (as will be discussed in Section ‘A new algorithm with cubic time complexity’).
Detection of optimal pair matchings
In the previous section, we have proved that Lemma 1 can be used to detect the OPMs and save redundant computations. In this section, we will briefly discuss how it will be implemented. Lemma 1 states that if the alignment score assuming pA matches pB (M[ pA,pB]) is higher than the alignment score assuming pA does not match pB, the matching between pA and pB is an OPM. Therefore, to detect the OPMs, we need to compute two alignment scores, i.e. the one when assuming pA matches pB and the one when assuming pA does not match pB.
Based on previous definition, the first alignment score is computed as M[ pA,pB]. In this case, we only need to compute the second alignment score. However, computing the second alignment score (assuming pA does not match pB) is difficult. Instead, we can compute the overall alignment score without assuming any restrictions. Apparently, the overall alignment score includes both cases disregarding whether pA matches with pB. Therefore, if M[ pA,pB] is greater than or equal to such an overall optimal alignment, it is guaranteed to be greater than the alignment score when assuming pA does not match pB, and ipso facto the matching of pA and pB is an OPM.
Recall that the alignment score M[ pA,pB] corresponds to the case where pA matches with pB, and therefore it can be decomposed as the sum of two parts: the structure similarity between the two base pairs themselves Sstr(pA,pB), and the optimal alignment score between the regions A[ l(pA)+1…r(pA)-1] and B[ l(pB)+1…r(pB)-1] without any restrictions. In this case, define two pseudo base pairs and , then can also be decomposed as the sum of two parts: , and the optimal alignment score between the regions A[ l(pA)…r(pA)] and B[ l(pB)…r(pB)] without any restrictions. Note that and are both pseudo base pairs, and thus based on the definition, we have . Therefore, is exactly the overall alignment score we need to detect the OPMs.
In this case, based on Lemma 1, if , we will consider the matching of pA and pB as an OPM, and add the OPM oA,B to the OPM list . The overhead for detecting the OPM is that we need to double the computation for each combination of pA and pB. However, such overhead will not raise the time complexity, and it is worthy as it will lead to a more significant speedup of the algorithm. In the following section, we will devise a new algorithm by assuming that the OPM list is available.
A new algorithm with cubic time complexity
In this section, we introduce a new general edit-distance RNA structure alignment algorithm, which improves the original RNAscf algorithm based on Lemma 1 and has a time complexity of O(z(n2+l2)). Here, z is the size of the OPM list , and we expect that z∈O(n) when assuming polymer-zeta property . If we also assume O(n)=O(l) (with fixed input RNA structures or efficiently pruned RNA structure ensembles), the overall time complexity of the new algorithm becomes O(zl2).
The new algorithm is developed based on the RNAscf algorithm . Therefore, we adopt the same definition and notation as introduced in Section ‘Preliminaries and definitions’, as well as the similar recursive functions style used in Section ‘The original O(n4+n2l2) algorithm’. Because the computations of M[ pA,pB] and Mh[ pA,pB] are boundary cases for the algorithm and are directly computed without referring to previous alignment results, the recursive functions for computing them are exactly the same as in the original algorithm:
The computation of Ml[ pA,pB], on the other hand, refers to the previous alignment results that assumes matches (see Equation 4). Using Lemma 1, it is clear to see that instead of traversing all combinations of and , we only need to consider the cases when the matching of and is an OPM:
Similarly, for the computation of , we need to refer to the scenarios where matches and matches . The matching of and is guaranteed to be an OPM, as ensured by Equation 8. Therefore, we only need to modify Equation 5 to ensure that the matching of and is an OPM:
Recall that the time complexity of the original algorithm is O(n4+n2l2). The first term O(n4) results from O(n2) computations by traversing all combinations of pA and pB (see Equation 2) and O(n2) time for computing Ml (see Equation 4). In the new algorithm, we introduce the OPM constraint to Equation 8 and Equation 9, and thus reduce the time complexity for computing Ml from O(n2) to O(z). In this case, the first term O(n4) of the original time complexity can be reduced to O(zn2).
The second term O(n2l2) in the original time complexity results from computing the sequence similarities between all loop regions. Note that all loop similarities required for computing Ml (Equation 8) and Mc (Equation 9) are associated with OPMs. For example, in Equation 8, all the loops are defined according to and , whose matching is expected to be an OPM. And in Equation 9, all the loops are defined according to and , as well as and , where both of these matchings are assumed to be OPMs. In this case, we do not need to compute loop similarities for all O(n2) base-pair combinations, instead we only need to compute the loop similarities that are associated with the OPMs. In this case, the time complexity for computing the sequence similarities between all loops that are required by the computation of Ml and Mc can be finished in O(zl2) time.
The only exception for the sequence similarity computation is the hairpin loop similarity Sseq(L(pA),L(pB)), which is required for computing Mh (Equation 7). The computation of Mh is not constrained by the OPM list, and therefore O(n2l2) time is still required. To resolve this issue, we observe that most of the RNA structure alignment algorithms emphasize the structure similarity other than sequence similarity (w1>w2 in Equation 1). In this case, if there exist some base pairs within the regions enclosed by pA and pB to be matched, we can expect that Ml[ pA,pB]>Mh[ pA,pB] in Equation 6. In this case, to avoid the unnecessary computation of Mh[ pA,pB], we can derive an upper bound , which satisfies and can be estimated in unit time. Note that if , we are sure that Ml[ pA,pB]>Mh[ pA,pB] by transition, and thus can save the computation of Mh[ pA,pB]. The upper bound can be easily derived by assuming maximum number of nucleotide matchings and minimum number of gaps:
where dmax is the highest score in the 4-by-4 nucleotide substitution matrix D, and I is a boolean variable that is set to 1 if |L(pA)|≠|L(pB)| and set to 0 otherwise. For the computation of each M[ pA,pB], we first estimate the upper bound in a unit time, and then compute Ml[ pA,pB] in O(z) time. By comparing these two values, we will determine whether the computation of Mh[ pA,pB] is necessary. The computation of Mh[ pA,pB] is only necessary when there are only a few base pair enclosed by pA and pB to be matched. Such condition implies the scenarios that either pA or pB is a real hairpin loop in the RNA structures, whose number is bounded by O(n). Overall, the hairpin loop similarity matrix Mh can be computed in O(nl2) time, and the overall time complexity of this algorithm is thus O(z(n2+l2)).
Online pruning of optimal pair matchings
In the previous sections, we have presented our approaches for detecting OPMs and building an OPM list , as well as a more efficient algorithm that is developed based on . Time complexity analysis of the algorithm claims that O(z(n2+l2)) time is sufficient for this new algorithm. The size of the OPM list , i.e. z, thus becomes an important factor that determines the efficiency of the novel algorithm. Under the current algorithmic setup, as well as other similar works that implement a candidate list [30,37], z continuously grows as the algorithm proceeds. In this case, it is desirable to devise an online pruning technique, which can remove the obsolete OPMs from , and thus achieve further speedup of the algorithm.
In this section, we will present such an online pruning technique to reduce the size of the OPM list . The intuition of this online pruning technique comes from the following observation. The RNA structures are primarily stabilized by a number of helices, or perfectly stacked base pairs. If is perfectly stacked on , then , and . Consider the alignment between two helices, where each one of them contains m+1 perfectly stacked base pairs. Assume that the first helix contains base pairs , and the second helix contains base pairs . Based on Lemma 1, there will be at least m OPMs detected from such alignment, i.e. . Apparently, maintaining all these m OPMs is unnecessary, as these base pairs should be aligned together as two complete helices, rather than be aligned separately as two sets of individual base pairs. In this case, maintaining only one OPM, i.e. , is sufficient to represent such an alignment. The other m OPMs become obsolete as soon as the OPM is detected, and can be removed from the OPM list to improve computational efficiency. In the following paragraphs, we will extend this idea to consider all situations in addition to the perfectly stacked scenario, as well as give formal description of this technique and related proofs.
We will demonstrate the major idea of our novel online OPM pruning technique using Figure 2 (b). Imagine that at the current stage, M[ pA,pB] has just been computed and oA,B has been identified as an OPM, where is an arbitrary OPM that has been previously identified and is enclosed by oA,B ( and ). Our aim is to estimate whether the detection of the OPM oA,B will make obsolete. Let and be arbitrary base pairs such that and . The regions enclosed by and can be partitioned using at least one of the following ways: (which is indicated by dark gray in Figure 2 (b)) and (which is indicated by light gray in Figure 2 (b)). If the corresponding score for the first path is higher than the second, will not be referred to by any future matching between arbitrary base pairs and , and thus making the OPM obsolete. In this case, the OPM can be removed from .
which can be rewritten as:
To utilize such criterion, we need to have access to all values included in the above inequality. However, we only know the values at the left hand side of the inequality (M[ pA,pB] and ), while the other values at the right hand side are unknown. This is because the definitions of these pseudo base pairs are determined by and , which are arbitrary base pairs that have not yet been computed by the DP algorithm. To solve this issue, observe that the score is strongly related to the regions and , and is strongly related to the regions and . Note that the regions and can be determined when pA and are known, which makes the estimation of their impact on future alignments possible (similarly for the regions and ). In this case, we can develop two upper bounds and , such that:
Now, we can discuss the details for setting up the upper bounds and . Because and are defined symmetrically, we only discuss the computation of . Note that the upper bound needs to satisfy the condition . Clearly, the difference between directly comes from concatenating the region to the region , as well as concatenating the region to the region . The best case scenario for such an operation, is to assume that the concatenation of the regions and will result in as many new base-pair and nucleotide matches as possible.
Assume that there are base pairs that are annotated in the region , and base pairs that are annotated in the region . Also assume the maximum base-pair substitution score in the RIBOSUM matrix R is rmax. By concatenating the regions and , we introduce at most more base-pair matchings to the alignment indicated by . This implies the maximum structure alignment score increment of . Similarly, at most more nucleotide matches, or gap fill-ups, are possible, compared to the existing alignment indicated by the score . The corresponding alignment score for such case is thus: . To explicitly represent the upper bound using only the identified OPMs, we rename as (similarly, we rename as ). Therefore, and can be computed using the following equations:
The pseudo-code for the new RNA secondary structure alignment algorithm that implements both speedup techniques is summarized in Figure 3.
Figure 3. Pseudo-code for the implementation of the speedup techniques.
We implemented the proposed general edit-distance RNA structural alignment algorithm into a program called ERA (Efficient RNA Alignment) using GNU C++. In this section, we will show that (1) ERA has the expected O(zl2) time complexity; (2) ERA is as accurate as the other state-of-the-art RNA alignment tools; and (3) ERA runs much faster than the other RNA alignment tools. In addition to these goals, we have also benchmarked ERA to demonstrate its O(l2) space complexity. For details regarding the space complexity issues please refer to the Additional file 1: Section S1 (also see Figure S1, Figure S2, and Table S1).
Additional file 1. Supplementary information. This file contains four sections. In Section S1, we briefly discuss the space issue of ERA and provide related experimental results. In Section S2, we document the randomly selected RNA structures used for experiments mentioned in the main article. In Section S3, we evaluate the impact of the online OPM pruning technique in speeding up ERA. In Section S4, we give examples of the pruned base-pairing probability matrix for executing LocARNA.
Format: PDF Size: 684KB Download file
This file can be viewed with: Adobe Acrobat Reader
We benchmarked the ERA with two other state-of-the-art RNA alignment tools: LocARNA as a representative of the general edit-distance RNA structure alignment algorithms and RNAforester as a representative of the tree-based RNA structure alignment algorithms. Note that although LocARNA is developed to compare RNA structure ensembles, its flexible parameter setup makes it easy to prune its input RNA ensembles (see Section ‘Running LocARNA’ for more details). However, the readers should note that LocARNA is used in a restricted case for fair comparison with ERA, and more potential applications of LocARNA should be recognized. We do not compare ERA with its predecessor RNAscf, because RNAscf is implemented to find consensus helical configurations that do not include individual base pairs . Both LocARNA and RNAforester were invoked using their default parameters.
Note that LocARNA was originally developed to compare two RNA structure ensembles . Due to the recent technical advances in experimental RNA structure probing, we anticipate that RNA structures can be predicted with much higher accuracy. Therefore, we develop ERA to compare two fixed RNA structures. In this case, we need to prune the original inputs of LocARNA, so as to ensure that they only represent the fixed structures rather than any additional information.
The input RNA ensembles for LocARNA are represented using the base-pairing probability matrices, which can be computed using the McCaskill’s algorithm [40,41]. In a base-pairing probability matrix, each base pair (possibly crossing) is assigned with a probability to indicate its thermodynamic stability. Our goal is to prune such a base-pair probability matrix, such that it only contains information regarding the fixed RNA structure (in our experiment, we take the Rfam  annotation or the BraliBase II  annotation as the fixed structure for an RNA sequence). For each base pair in the matrix, if it is not presented in the annotated structure, its corresponding probability is reset to 0. On the other hand, if it is included in the annotated structure, its probability is reset to 1. In this case, the pruned base-pairing probability matrix contains only the information regarding the fixed RNA structure. We show an original and a pruned base-pairing probability matrix in Additional file 1: Figure S3 as an example. All LocARNA inputs for experiments mentioned in this article are preprocessed using this strategy.
In this section, we expect to show that the proposed sparsification is successfully implemented, and ERA has the expected O(zl2) time complexity. To show the O(zl2) time complexity, we chose a number of RNA families from Rfam that have a wide range of sequence lengths. We then randomly selected two individual RNA structures from each family (see Additional file 1: Table S2) to run ERA alignment. The running time for their alignments, versus n3 (note that n<l for annotated structures and O(n)=O(l)), is plotted in Figure 4 (a). We can clearly observe the expected O(zl2) time complexity from the figure. In addition, we are also able to show that the speedup ratio, when comparing to the O(l4+n2l2)LocARNA algorithm, is strongly correlated with the efficiency of pair matching reduction due to the sparse DP technique (the ratio n2/z, see Figure 4 (b)). The relatively large deviations are observed for biocoid_3UTR and snR86 RNA structures. This is because they contain a large number of base pairs and have a high base pair to sequence length ratio. In this case, the overhead for maintaining the OPM list becomes apparent and makes the speedup less significant. In summary, we have shown that the sparse DP technique is successfully implemented, ERA has an expected time complexity of O(zl2).
Figure 4. Time complexity and OPM reduction of ERA. (a) Running time versus n3, where n is the average number of base pairs in the RNA structures. (b) OPM reduction ratio versus running time speedup ratio. The OPM reduction ratio is computed by n2/z, where z is the number of OPMs.
In addition to time complexity improvement, we also expect to show that ERA is as accurate as the other state-of-the-art general edit-distance RNA structure alignment tools. We used BraliBase II  as the reference data set, and used its corresponding structure annotations as the fixed input structures. We adopted two measures to indicate the alignment quality, i.e., the Sum-of-Pair Score (SPS)  and the Structure Conservation Index (SCI) . The benchmark results are shown in Figure 5. The alignment qualities of ERA and LocARNA are nearly identical, since incorporating the sparse DP technique will not compromise global optimality. The benchmark results also show that ERA and LocARNA can produce more accurate alignments when compared to RNAforester. This is because ERA and the restricted version of LocARNA are both general edit-distance RNA alignment algorithms that are capable of flexibly handling incorrectly predicted base-pairs, while RNAforester is more sensitive to such errors, since it implements tree alignment.
Figure 5. Alignment quality comparison of ERA, LocARNA and RNAforester. The comparison of (a) Sum-of-Pair Score and (b) Structure Conservation Index between ERA, LocARNA and RNAforester on BraliBase II data set with fixed input structures. The sequence identity range is between 0.37 to 0.99. The curves are generated using LOWESS smoothing with a smoothing factor of 0.3.
Running time speedup
Finally, after benchmarking the time complexity and alignment accuracy of ERA, we also expect to show that ERA is an efficient implementation and can run faster than other state-of-the-art RNA alignment tools. We compared the real running time of ERA, LocARNA, and RNAforester on the selected RNA structures from Rfam. The benchmark results are summarized in Table 1. We can observe that ERA is capable of speeding up LocARNA by a minimum of 5.2 fold and a maximum of 91.5 fold. ERA can also speedup RNAforester by a minimum of 2.8 fold and a maximum of 242.6 fold, with only one exception in which RNAforester is 9.6 times faster than ERA. This is because the RNA structures being aligned (snR86) contain only one stem-loop structure; and in such a special case, the time complexity of RNAforester becomes O(l2) .
Table 1. Comparison on running time of ERA, LocARNA, and RNAforester
To further investigate the real running time speedup of ERA on randomly selected RNA structures, we compiled a much larger data set that contains 1,000 pairs of randomly selected RNA structures from Rfam. The benchmark results on this large data set are summarized in Figure 6. In Figure 6, we can see that ERA (blue triangle) runs much faster than LocARNA (red cross) and RNAforester (green star). In addition, we can also observe that the running time of ERA grows slower than those of LocARNA and RNAforester, which further confirms our previous time complexity analysis (see Figure 4 (a)). This speedup is significant, and renders ERA with the power of aligning long ncRNAs that are revealed by recent research advances. In summary, ERA is an efficient and accurate RNA structure alignment tool as compared to its state-of-the-art counterparts LocARNA and RNAforester.
Figure 6. Computational efficiency comparison between ERA, LocARNA and RNAforester on aligning randomly selected RNA structures from Rfam. The running time for ERA (blue triangles), LocARNA (red crosses) and RNAforester (green stars) on aligning 1,000 pairs of randomly selected RNA structures from the Rfam database. The x-axis corresponds to the average sizes of the RNA structures being aligned, which is computed as the product of their average length (l) and their average number of base pairs (n). The y-axis corresponds to the actual running time in the unit of second. We can see that ERA is significantly faster than the other two tools.
Discussion and conclusions
In this article, we have presented a novel algorithm for efficient alignment of RNA secondary structures by incorporating the sparse DP technique. The major theoretical contribution of this work lies in two parts. First, to our knowledge, this is the first application of the sparse DP technique to RNA structure-structure alignment. Second, the novel online OPM pruning technique can provide insights for future algorithm designs that need to maintain a candidate list. The implementation of this novel algorithm is a tool called ERA, which can run in O(zl2) time and O(l2). Such time and space complexity make ERA one of the most efficient RNA structure alignment tools that are currently available.
The online OPM pruning technique is newly developed from this work, which aims at deleting obsolete candidates as the DP proceeds. Although this technique cannot improve the computational complexity, it is efficient in reducing the real running time. In Additional file 1: Table S3, we summarized the running time of ERA in aligning individual RNA structures, with or without the online OPM pruning technique. We observed that by incorporating this technique, the running time of ERA was reduced by an average of 2.3 fold. Meanwhile, the speedup ratio is highly uniform (with 1.7 fold as the lowest and 3.1 fold as the highest) across RNA structures with different sizes, meaning that it reduces running time by a constant factor. The online OPM pruning technique can also be modified and incorporated into other related algorithms that implement the candidate list, such as the sparse DP algorithms for RNA folding , RNA consensus folding [36,37], and RNA-RNA interaction .
The speedup of ERA is most significant when the number of base pairs in the RNA structures is small. This is because the algorithm is indexed by base pairs and has a time complexity of O(z(n2+l2)). As n increases, the term O(zn2) will dominate the overall time complexity. In this case, an ideal application of ERA is to align fixed RNA structures, because it guarantees that n<l. Note that as a sparsified version of the SAF algorithm RNAscf, the new algorithm developed here is also capable of handling RNA structure ensemble alignments. However, we do not implement this feature into ERA, because one cannot guarantee n<l for RNA ensemble alignments. This would make the speedup of ERA less significant. Besides, there are other alternative tools [36,37] available for such a purpose.
With the completion of the ENCODE  and modENCODE  projects, more and more RNA transcripts will be experimentally revealed. At the same time, with the advance of high-throughput RNA structure probing techniques [13-15], the secondary structures of these RNA transcripts will also be predicted with a much higher accuracy. In this case, ERA, which can compare fixed RNA structure efficiently and accurately, becomes an ideal computational tool to evaluate the structural similarities of these RNA transcripts. ERA can be used to perform all-against-all alignments on these RNA transcripts, which will then be subsequently summarized as the distance matrix for clustering purposes. Various clustering algorithms [8,39] can then be applied to identify ncRNA families with similar secondary structures and infer their amazing cellular and molecular functionalities.
The authors declare that they have no competing interest.
SZ contributed with the conception of the research. SZ and CZ designed the algorithm. CZ implemented the algorithm and performed benchmark analysis. SZ and CZ wrote the manuscript. Both authors read and approved the final manuscript.
This work is supported by the National Institute of General Medical Sciences of the National Institutes of Health, USA (R01GM102515).
Tripathi V, Ellis JD, Shen Z, Song DY, Pan Q, Watt AT, Freier SM, Bennett CF, Sharma A, Bubulya PA, Blencowe BJ, Prasanth SG, Prasanth KV: The nuclear-retained noncoding RNA MALAT1 regulates alternative splicing by modulating SR splicing factor phosphorylation.
Zhong C, Andrews J, Zhang S: Discovering non-coding RNA elements in Drosophila 3’ un-translated regions. In Proceedings of the 2nd IEEE International Conference of Computational Advances in Bio and Medical Sciences.. IEEE; 2012:1-6.
Celniker SE, Dillon LA, Gerstein MB, Gunsalus KC, Henikoff S, Karpen GH, Kellis M, Lai EC, Lieb JD, MacAlpine DM, Micklem G, Piano F, Snyder M, Stein L, White KP, Waterston RH: Unlocking the secrets of the genome.
Lucks JB, Mortimer SA, Trapnell C, Luo S, Aviran S, Schroth GP, Pachter L, Doudna JA, Arkin AP: Multiplexed RNA structure characterization with selective 2’-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq).
J ACM 1979, 26:422-433. Publisher Full Text
SIAM J Comput 1989, 18:1245-1262. Publisher Full Text
Höchsmann M, Töller T, Giegerich R, Kurtz S: Local similarity in RNA secondary structures. In Proceedings of the 2nd IEEE Computer Society Bioinformatics Conference. Washington DC: IEEE Computer Society; 2003:159-168.
Bafna V, Muthukrishnan S, Ravi R: Computing similarity between RNA strings. In Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching. Berlin Heidelberg: Springer-Verlag; 1995:1-16.
Theor Comput Sci 2005, 337:217-239. Publisher Full Text
Comput Appl Biosci 1988, 4:11-17. PubMed Abstract
Ziv-Ukelson M, Gat-Viks I, Wexler Y, Shamir R: A faster algorithm for RNA co-folding. In Proceedings of the 8th Workshop on Algorithms in Bioinformatics. Berlin Heidelberg: Springer-Verlag; 2008:174-185.
J of Discrete Algorithms 2011, 9:12-31. Publisher Full Text
Salari R, Mȯhl M, Will S, Sahinalp SC, Backofen R: Time and space efficient RNA-RNA interaction prediction via sparse folding. In Proceedings of the 14th International Conference on Research in Computational Molecular Biology. Berlin Heidelberg: Springer-Verlag; 2010:473-490.
Monatsh Chem 1994, 125:167-188. Publisher Full Text