Abstract
Background
It is well known that the search for homologous RNAs is more effective if both sequence and structure information is incorporated into the search. However, current tools for searching with RNA sequencestructure patterns cannot fully handle mutations occurring on both these levels or are simply not fast enough for searching large sequence databases because of the high computational costs of the underlying sequencestructure alignment problem.
Results
We present new fast indexbased and online algorithms for approximate matching of RNA sequencestructure patterns supporting a full set of edit operations on single bases and base pairs. Our methods efficiently compute semiglobal alignments of structural RNA patterns and substrings of the target sequence whose costs satisfy a userdefined sequencestructure edit distance threshold. For this purpose, we introduce a new computing scheme to optimally reuse the entries of the required dynamic programming matrices for all substrings and combine it with a technique for avoiding the alignment computation of nonmatching substrings. Our new indexbased methods exploit suffix arrays preprocessed from the target database and achieve running times that are sublinear in the size of the searched sequences. To support the description of RNA molecules that fold into complex secondary structures with multiple ordered sequencestructure patterns, we use fast algorithms for the local or global chaining of approximate sequencestructure pattern matches. The chaining step removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our improved online algorithm is faster than the best previous method by up to factor 45. Our best new indexbased algorithm achieves a speedup of factor 560.
Conclusions
The presented methods achieve considerable speedups compared to the best previous method. This, together with the expected sublinear running time of the presented indexbased algorithms, allows for the first time approximate matching of RNA sequencestructure patterns in large sequence databases. Beyond the algorithmic contributions, we provide with RaligNAtor a robust and well documented opensource software package implementing the algorithms presented in this manuscript. The RaligNAtor software is available at http://www.zbh.unihamburg.de/ralignator webcite.
Background
Due to their participation in several important molecularbiological processes, ranging from passive carriers of genetic information (tRNAs) over regulatory functions (microRNAs) to proteinlike catalytic activities (Riboswitsches), noncoding RNAs (ncRNAs) are of central research interest in molceular biology [1]. NcRNAs, although synthesized as singlestranded molecules, present surprising complexity by being able to base pair with themselves and fold into numerous different structures. It is to a large extent the structure that allows them to interact with other molecules and hence to carry out various biological functions. This can also be observed in families of functionally related ncRNAs like the ones compiled in the Rfam database [2]. Here members of a family often share only few sequence features, but share by far more specific structural and functional properties. Consequently, methods for effective RNA homology search (i.e. finding new members of an RNA family) cannot rely on sequence similarity alone, but also have to take structural similarity into account.
In this paper, we address the problem of searching nucleotide databases for occurrences of RNA family members. Since for this task it is not sufficient to rely on pure sequence alignment, we briefly review search methods that employ sequence and structure information.
There exist various general sequencestructure alignment tools which determine structural similarities that are too diverse to be alignable at the sequence level. Such tools can roughly be divided into two classes. The first class consists of tools that align RNAs with given structures or determine a common structure during the alignment process. Tools like MARNA[3] and RNAforester[4] require an a priori known secondary structure for both input RNAs. However, they suffer from the low quality of secondary structure prediction. Addressing this problem, other tools implement variations of the Sankoff algorithm [5], which provides a general but computationally demanding solution to the problem of simultaneously computing an alignment and the common secondary structure of the two aligned sequences. Unfortunately, even tools with improved running times using variations of this algorithm (LocARNA[6], Foldalign[7,8], Dynalign[9,10]) or heuristics [11] are simply not fast enough for rapid searches in large nucleotide databases. Hence, in a second class we identify more specialized tools for searching RNA families in nucleotide databases. These tools use a model or motif descriptors (i.e. patterns) defining consensus sequence and secondary structure properties of the families to be searched for. For example, Infernal[12] and RSEARCH[13] infer a covariance model from a given multiple sequence alignment annotated with structure information. This model can then be used to search sequence databases for new family members. Another tool, ERPIN[14] is also based on automatically generated statistical secondary profiles. Although being very sensitive in RNA homology search, in particular Infernal and RSEARCH suffer from high computational demands. An alternative are tools like RNAMotif[15], RNAMOT[16], RNABOB[17], RNAMST[18], PatScan[19], PatSearch[20], or Palingol[21]. These methods use userdefined motif descriptors created from a priori knowledge about the secondary structure of the described RNA family. Another tool, Locomotif[22], generates a thermodynamic matcher program from a pattern drawn interactively by the user via a graphical interface. Although these tools based on motif descriptors are faster than the previously mentioned tools, they have a running time that scales at least linearly with the size of the target sequence database. This makes their application to large databases challenging. Previously, we addressed this problem by presenting Structator[23], an ultra fast indexbased bidirectional matching tool that achieves sublinear running time by exploiting base pair complementarity constraints for search space reduction.
Apart from running time constraints, another major disadvantage of all current tools that search for sequencestructure patterns is their limited capacity to find approximate matches to the patterns. Although variability in length of pattern elements is often allowed, this is constrained to certain pattern positions that must be specified by the user. This limitation also holds for our Structator tool. Also, variations (insertions, deletions, or replacements) in the sequence that lead to small structural changes, such as the breaking of a base pair, are not supported. This often hampers the creation of patterns that are specific but generalized enough to match all family members. An algorithm presented in [24] only partially alleviates this problem by finding approximate matches of a helix in a genome allowing edit operations on single bases, but not on the structure.
To overcome these issues, we present new fast indexbased and online algorithms for approximate matching of sequencestructure patterns, all implemented in an easytouse software package. Given one or more patterns describing any (branching, noncrossing) RNA secondary structure, our algorithms compute alignments of the complete patterns to substrings of the target sequence, i.e. semiglobal alignments, taking sequence and structure into account. For this, they apply a full set of edit operations on single bases and base pairs. Matches are reported for alignments whose sequencestructure edit cost and number of insertions and deletions do not exceed userdefined thresholds. Our most basic algorithm is a scanning variant of the dynamic programming algorithm for global pairwise sequencestructure alignment of Jiang et al.[25], for which no implementation was available. Because its running time is too large for database searches on a large scale, we present accelerated online and indexbased algorithms. All our new algorithms profit from a new computing scheme to optimally reuse the required dynamic programming matrices and a technique to save computation time by determining as early as possible whether a substring of the target sequence can contain a match. In addition, our indexbased algorithms employ the suffix array data structure compiled from the search space. This further reduces the running time.
As in [23], we also support the description of an RNA molecule by multiple ordered sequencestructure patterns. In this way, the molecule’s secondary structure is decomposed into a sequence of substructures described by independent sequencestructure patterns. These patterns are efficiently aligned to the target sequences using one of our new algorithms and the results are combined with fast global and local chaining algorithms [23,26]. This allows a better balancing of running time, sensitivity, and specificity compared to searching with a single long pattern describing the complete sequence and secondary structure.
Before we describe our algorithms, we formalize the approximate search problem with the involved sequencestructure edit operations. Then we present, step by step, two efficient online and two indexbased matching algorithms. We proceed with a short review of the approach for computing chains of matches. Finally, we present several benchmark experiments.
Methods
Preliminaries
An RNA sequenceS of length n = S over the set of bases
For a sequence S = S[1] S[2] … S[n] and 1 ≤ i ≤ j ≤ n, S[i..j] denotes the substring S[i] S[i + 1] … S[j] of S. For S = uv, u and
The secondary structure of an RNA molecule is formed by WatsonCrick pairing of complementary
bases and also by the slightly weaker wobble pairs. We say that two bases
Let Φ = {R, Y, M, K, W, S, B, D, H, V, N} be a set of characters. According to the
IUPAC definition, each character in Φ denotes a specific character class
Approximate matching of RNA sequencestructure patterns
To find in a long RNA sequence S approximate matches of an RSSP
We define the alignment of
• An arc breaking, with cost ω_{b}, occurs if (k_{1},l_{1}) ∈ A_{match} and (k_{2},l_{2}) ∈ A_{match} but bases S[l_{1}] and S[l_{2}] are not complementary. An additional base mismatch cost ω_{m} is caused if S[l_{1}] ∉ φ(P[k_{1}]) and another if S[l_{2}] ∉ φ(P[k_{2}]). To give an example, consider the semiglobal alignment in Figure
1. RSSP
• An arc altering, with cost ω_{a}, occurs if either (1) (k_{1},l_{1}) ∈ A_{match} and (k_{2},−) ∈ A_{gap} or (2) (k_{2},l_{2}) ∈ A_{match} and (k_{1},−) ∈ A_{gap}. Each case induces an additional base mismatch cost ω_{m} if S[l_{1}] ∉ φ(P[k_{1}]) or S[l_{2}] ∉ φ(P[k_{2}]). As an example, observe in the alignment shown in Figure
1 that there exist a base pair
• An arc removing, with cost ω_{r}, occurs if (k_{1},−) ∈ A_{gap} and (k_{2},−) ∈ A_{gap}. As an example, observe in the alignment in Figure
1 that there exist a base pair
Figure 1. Example of a semiglobal alignment of a sequencestructure pattern
With this set of edit operations on the sequence and structure we can now define the
cost of the alignment of
where
An alignment A of minimum cost between
In practice, one is often interested in finding substrings of an RNA sequence S having a certain degree of similarity to a given RSSP
• the cost
• the number of indels in the alignment should be at most d.
Thus, the approximate search problem for finding occurrences of an RSSP
We call every substring S[p..q] satisfying Equation (3) a match of
Online approximate RNA database search for RSSPs: ScanAlign
A straightforward algorithm to search for approximate matches of an RSSP
We begin the description of our algorithm by defining three functions required by the dynamic programming recurrences. Let T = S[p..q].
1. For computing base match and mismatch costs for positions i and j of the RSSP
2. To determine whether an arc breaking operation can occur, we must also be able
to check for base complementarity at positions i and j of T. Therefore, we define a function
3. For determining the correct row (of the dynamic programming matrices introduced
below) where certain operation costs must be stored we introduce a function
Intuitively, function row satisfies the following: (1) given the right index i of a base pair (i^{′},i), it returns the left index i^{′} if (i^{′},i) is preceded or followed by other structures; (2) given the left index i of a base pair (i,i^{′}), it returns 0 if the base at position i + 1 of
Using these three functions, our algorithm determines the sequencestructure edit
distance
1. If i = 0 or R[i] =. (unpaired base), then
2. If R[i] ≠. (paired base), then
(a) If R[i] =) where i forms base pair
(b) If (a) holds and either R[i^{′}−1] =. or R[i^{′}−1] =), compute in addition to Equation (8)
A natural way to compute these DP matrices is top down, checking whether case 1, 2(a), or 2(b) applies, in this order. Due to the matrix dependencies in cases 2(a) and (b), the matrices need to be computed simultaneously.
Note that for all j, 1 ≤ j ≤ m^{′}, clearly
Lemma 1
When sliding a window along S to compute
Proof
Let T[1..m^{′}] = S[p..q]. The computation of
Due to Lemma 1, our algorithm computes only the last column of the DP matrices for every shifted window substring (see the example in Figure 2) and just for the first window S[1..m^{′}] it computes every column. We call this algorithm ScanAlign. We note that during the reviewing process of this manuscript, Will et al.[27] submitted and published an algorithm for semiglobal sequencestructure alignment of RNAs. As our method, this algorithm saves computation time by reusing entries of dynamic programming tables while scanning the target sequence.
Figure 2. DP tables for the sequencestructure alignment computation of RSSP
Our ScanAlign algorithm has the following time complexity: computing DP_{k}(i,j) in cases 1 and 2(a) takes O(1) time and in case 2(b) it takes O(m^{′}) time. Now consider the two situations:
• For the first computed window substring S[1..m^{′}], cases 1 and 2(a) require O(mm^{′2}) time in total and case 2(b) requires O(mm^{′3}) time in total. This leads to an overall time of O(mm^{′3}).
• For one window shift, cases 1 and 2(a) require O(mm^{′}) time in total and case 2(b) requires O(mm^{′2}) time in total, leading to an overall time of O(mm^{′2}).
Since there are n − m^{′}−1 window shifts, the computation for all shifted windows takes O(mm^{′2}(n−m^{′})) = O(mm^{′2}n) time. We observe that the time needed by ScanAlign to compute all window shifts reduces to O(mm^{′}n) if recurrence case 2(b) is not required. This is the case if the structure of
Faster online alignment with earlystop computation: LScanAlign
Often, before completing the computation of the alignment between an RSSP
• We decompose the RSSP
• We compute the alignment of a given initial RSSP region and a substring of the current window S[p..q], progressively extending the alignment to other regions.
• If the cost of aligning an RSSP region to a substring of the window exceeds cost
threshold
Formally, a valid RSSP region
1.
2. Position y is unpaired and there is at least one base pair
3.
4. y forms a base pair
Figure 3. Regions of RSSP
Note that regions can be embedded in other regions but cannot partially overlap another.
Our progressive alignment computation of an RSSP
1. All rows in the interval [x..y] are computed by Equation (7).
2. One scans the structure of region
3. Row y is computed by recurrence (a) of Equation (8).
4. Row row(y) is computed by recurrence (b) of Equation (8).
The sequential computation of the rows belonging to each region naturally leads to
the computation of the entire alignment of
Our improvement of the ScanAlign algorithm is based on the following two observations.
• The standard dynamic programming algorithm for aligning two plain text sequences of lengths m and n requires an (m + 1) × (n + 1) matrix. Let i and j be indices of each of the matrix dimensions and a diagonal v be those entries defined by i and j such that j − i = v. Given that the cost of each edit operation is a positive value, the cost of the entries along a diagonal of the matrix are always nondecreasing [28].
• Moreover, one indel operation implies that an optimal alignment path including an
entry on diagonal v also includes at least one entry on diagonal v + 1 or v − 1. Now let v be the diagonal ending at the entry on the lowerright corner of the matrix and d be the number of allowed indels. One can stop the alignment computation as soon as
all the entries of one row in the matrix and along diagonals v + d^{′}, −d ≤ d^{′} ≤ d, exceed
For our improvement of algorithm ScanAlign, based on the following Lemma, we define a diagonal for each RSSP region instead of only one for the entire matrices.
Lemma 2
Assume an RSSP
Proof
If the RSSP region
• T[x..y], T[x..y − 1], or T[x..y + 1], requiring for all three alignments the computation of
• T[x − 1..y − 1], requiring the computation of
• T[x + 1..y + 1], requiring the computation of
The alignments with T[x..y], T[x..y + 1], and T[x..y−1] end in matrix DP_{x}. The alignments with T[x−1..y−1] end in matrix DP_{x−1}, and the alignments with T[x + 1..y + 1] end in matrix DP_{x+1}. Every minimizing path obtained for the entire alignment of
Let
When scanning the searched RNA sequence, a window can be shifted before all DP matrices entries are computed. Hence, a direct application of Lemma 1 is no longer
possible. To overcome this, we define an array Z in the range 1 to z, where z is the number of RSSP regions, and associate each region with an index r, 1 ≤ r ≤ z. Let p be the starting position of the window substring S[p..q] in the RNA sequence. We set Z[r] = p whenever all DP matrices rows and columns belonging to region r are computed. This occurs when the cost of aligning this region does not exceed cost
threshold
Our improved algorithm, hereinafter called LScanAlign, in the worst case needs to process every RSSP region for every window shift. Hence, it has the same time complexity as algorithm ScanAlign. However, as in many cases only a few RSSP regions are evaluated, it is much faster in practice as will be shown later. ScanAlign and LScanAlign are the basis for further improvements presented in the subsequent sections.
Indexbased search: LESAAlign
Suffix trees and enhanced suffix arrays are powerful data structures for exact string
matching and for solving other string processing problems
[29,30]. In the following we show how the use of enhanced suffix arrays leads to even faster
algorithms for searching for matches of an RSSP
The enhanced suffix array of a sequence S is composed of the suffix array suf and the longest common prefix array lcp. Let $, called terminator symbol, be a symbol not in
Figure 4. Enhanced suffix array of sequence S$ = CCACCCCCCACCCACCACCCUCUU$ consisting of the suffix array suf, longest common prefix array lcp, and inverse suffix array suf^{−1}. For the definition of suf^{−1}, see the section describing algorithm LGSlinkAlign.
Consider an RSSP
Our algorithm for the suffix array traversal and
1. If p_{i} ≤ lcp and
2. If p_{i} ≤ lcp and
Figure 5. DP tables for the sequencestructure alignment computation of RSSP
These situations allow LESAAlign to benefit from the enhanced suffix array in a second important way. That is, it
skips all suffixes S_{suf}_{[i]}, S_{suf}_{[i+1]},..., S_{suf}_{[j]} sharing a common prefix of at least length lcp with S_{suf}_{[i−1]}. To find the index j of the last suffix S_{suf}_{[j]} to be skipped, it suffices to look for the largest j such that min{lcp[i],lcp[i+1],...,lcp[j]} ≥ lcp. If the first situation above holds, there are matches of
We also incorporate in our indexbased algorithm the earlystop alignment computation
scheme of algorithm LScanAlign. This allows to skip suffixes S_{suf}[i] as soon as it becomes clear that the sequencestructure edit distance of RSSP
• Previously computed pattern regions are all regions whose index is strictly smaller than r. The alignment computation of these regions profits from the common prefix between S_{suf}_{[i]}[1..p_{i}] and S_{suf}_{[ j]}[1..p_{j}] by avoiding the recomputation of DP matrices columns as described above.
• Noncomputed pattern regions are all regions whose index is larger than or equal to r. In this case, all DP matrices columns of the respective pattern region need to be computed, even if S_{suf}_{[i]}[1..p_{i}] and S_{suf}_{[ j]}[1..p_{j}] share a common prefix.
We observe that longer ranges of suffixes not containing matches to
Additional file 1. Supplemental material. Additional file 1 contains additional experiments, figures, and tables.
Format: PDF Size: 308KB Download file
This file can be viewed with: Adobe Acrobat Reader
Enhanced indexbased search: LGSlinkAlign
Given an RSSP
• avoid recomputation of DP matrices columns due to a common prefix between suffixes of S; and
• skip long ranges of suffixes of the suffix array suf whose common prefix up to a required reading depth are known to match or not match
Therefore, LESAAlign exploits repetitions of substrings of S, i.e. substrings shared by different suffixes, and information of the lcp table to save computation time. However, the use of information of the lcp table alone does not necessarily lead to large speedups. Consider e.g. the DP matrices for the computation of the alignment of
Our next goal is to develop an algorithm traversing the enhanced suffix array of S that:
1. can skip more suffixes; and
2. improves the use of already computed DP matrices entries, reusing computed entries for as many suffixes as possible.
To address the first goal, we motivate our method by recalling the alignment computation
example in Figure
2. In this example, one of the regions of
Let S be an arbitrary RNA sequence and T[x..y] = S_{suf}_{[i]}[x..y] contain all substrings whose alignment cost to a region of an RSSP
Then, by letting j = link(i,x^{′}),
We remark that the described method for skipping suffixes can profit from a resorting
according to the order by which RSSP regions are aligned. In the alignment computation
example in Figure
2, determining
We now address the second goal, namely reusing computed DP matrices entries for as many suffixes as possible. Recall that computing the sequencestructure
edit distance
• were already computed once because T^{′} is lexicographically smaller than T, but were discarded to allow the processing of other suffixes until T was traversed; or
• are computed for the first time otherwise, but will not be reused to also allow the processing of other suffixes until T^{′} occurs in table suf as a prefix of a suffix itself.
In both cases, redundant computations occur. To avoid this, we optimize the use of
computed DP matrices by processing T^{′} directly after processing T for fixed k = 2, recalling that T = S_{suf}_{[i]}[1..p_{i}] and T^{′} = S_{suf}_{[j]}[1..p_{j}]. This value of k implies that S_{suf}_{[j]} does not contain the first character of S_{suf}_{[i]} and that we can locate S_{suf}_{[j]} in table suf by computing the suffix link j = link(i,1). Also, k = 2 implies that T^{′} only differs by its last character from T, aside from not beginning with character T[1]. Therefore, to determine
We call our algorithm incorporating the presented improvements LGSlinkAlign. For its pseudocode, see Section 1 of Additional file 1. LGSlinkAlign inherits all the improvements of the above presented algorithms. In summary, its improvements are as follows.
• LGSlinkAlign traverses the enhanced suffix array of the searched sequence S, i.e. the suffix array suf enhanced with tables lcp and suf^{−1}. During this traversal, it benefits from common prefixes shared among suffixes to
(1) avoid the computation of DP matrix columns and to (2) skip ranges of suffixes known to match or not match RSSP
• The suffix array traversal is predominantly top down, but noncontiguous suffixes are processed to optimize the use of computed DP matrices.
• LGSlinkAlign stops the alignment computation as early as the alignment cost of a region of RSSP
• Due to the earlystop computation scheme, suffixes sharing common prefixes shorter than m + d can be skipped, leading to larger ranges of skipped suffixes. The earlystop computation scheme also helps to identify and skip noncontiguous suffixes sharing a common substring which is not their prefix.
Example: searching for an RSSP with algorithm LGSlinkAlign
We elucidate the ideas of algorithm LGSlinkAlign with the following example. Consider the RSSP
RNA secondary structure descriptors based on multiple ordered RSSPs
RNAs with complex branching structures often cannot be adequately described by a single
RSSP due to difficulties in balancing sensitivity, specificity, and reasonable running
time of the used search algorithm. Although their description by a single short RSSP
specifying an unbranched fragment of the molecule might be very sensitive, it is often
too unspecific and likely to generate many spurious matches when searching for structural
homologs in large sequence databases or complete genomes. In contrast, using a single
long RSSP often requires a higher cost threshold
We solve this problem by applying the powerful concept of RNA secondary structure
descriptors (SSDs for short) recently introduced in
[23]. The underlying concept of SSDs is similar to the idea of PSSM family models
[38], which are successfully used for fast and sensitive protein homology search. SSDs
use the information of multiple ordered RSSPs derived from the decomposition of an
RNA’s secondary structure into stemloop like structural elements. In a first step,
approximate matches to the single RSSPs the SSD consists of are obtained using one
of the algorithms presented above. From these matches, either local or global highscoring
chains are computed with the efficient chaining algorithms described in
[23]. These algorithms take the chain’s score, i.e. the weights of the fragments in the
chain, into account (see
[23] for details). For chaining of approximate RSSP matches, we use the fragment weight
Results and discussion
Implementation and computational results
We implemented (1) the fast indexbased algorithms LESAAlign and LGSlinkAlign, (2) the online algorithms LScanAlign, ScanAlign, both operating on the plain sequence, and (3) the efficient global and local chaining algorithms described in [23]. In our experiments we use ScanAlign, which is the scanning version of the method proposed in [25], for reference benchmarking. All algorithms are included in the program RaligNAtor. The algorithms for index construction were implemented in the program sufconstruct, which makes use of routines from the libdivsufsort2 library (see http://code.google.com/p/libdivsufsort/ webcite) for computing the suf table in O(n log n) time. For the construction of table lcp we employ our own implementation of the linear time algorithm of [35]. All programs were written in C and compiled with the GNU C compiler (version 4.5.0, optimization option O3). All measurements are performed on a Quad Core Xeon E5620 CPU running at 2.4 GHz, with 64 GB main memory (using only one CPU core). To minimize the influence of disk subsystem performance, the reported running times are user times averaged over 10 runs. Allowed base pairs are canonical WatsonCrick and wobble, unless stated otherwise. The used sequencestructure operation costs are ω_{d} = ω_{m} = ω_{b} = ω_{a} = 1 and ω_{r} = 2.
Comparison of running times
In a first benchmark experiment we measure the running times needed by the four algorithms
to search with a single RSSP under different cost thresholds
Figure 6. Consensus secondary structure of the tRNA family (Acc.: RF00005) as drawn byVARNA[39] (top) and respective sequencestructure pattern tRNApat(bottom).
Figure 7. Running times (in minutes and log_{10} scale) needed by the different algorithms to search with an RSSP describing the tRNA
in RFAM10.1. In (1) the cost threshold
Scaling behavior of the online and indexbased algorithms
In a second experiment we investigate how the search time of algorithms ScanAlign, LScanAlign, LESAAlign, and LGSlinkAlign scales on random subsets of RFAM10.1 of increasing size. The searched RSSPs flg1, flg2, and flg3 were derived from the three stemloop substructures the members of family flgRhizobiales
RNA motif (Acc.: RF01736)
[40] fold into. These patterns differ in length, cost threshold
Figure 8. Consensus secondary structure of family flgRhizobiales RNA motif (Acc.: RF01736) showing its three stemloop substructures hp1,hp2, andhp3as drawn by VARNA[39]. The secondary structure descriptor (SSD) for this family, on the righthand side, consists of three RSSPs flg1, flg2, and flg3 derived from the stemloop substructures.
Figure 9. Scaling behavior of algorithms LGSlinkAlign , LESAAlign , LScanAlign , and ScanAlign when searching with RSSPs flg1,flg2, andflg3, in subsets of RFAM10.1, of different length. For details, see main text.
Influence of stem and loop lengths on the search time
When searching a database for matches of a given pattern, our algorithms compute the
required DP matrices using recurrences according to two main cases: either a row corresponds
to an unpaired or to a paired base of the pattern. To analyze the influence of the
used recurrence on the search time of each algorithm, we search RFAM10.1 for artificial stemloop patterns. Therefore we vary the number of bases in the loop
of pattern
Figure 10. Search times for different number of bases in the loop (lefthand side) and base pairs in the stem (righthand side) for given RSSPs.
RNA family classification by global chaining of RSSP matches
In the next experiment we show the effectiveness of global chaining when searching
with two SSDs built for Rfam families Cripavirus internal ribosome entry site (Acc.:
RF00458) and flgRhizobiales RNA motif (Acc.: RF01736)
[40]. These two families present only 53% and 69% sequence identity, respectively, much
below the average of ∼80% of the Rfam 10.1 families. This illustrates the importance of using both sequence
and structure information encoded in the SSDs of this experiment. The SSD of family
RF01736 comprises three RSSPs, denoted by flg1, flg2, and flg3 in Figure
8, derived from the three stemloop substructures the members of this family fold into.
The SSD of family RF00458 comprises five RSSPs, denoted by ires1, ires2, ires3, ires4, and ires5 in Figure S5 of Additional file
1, where the last four RSSPs describe the stemloop substructures the members of this
family fold into. ires1 describes a moderately conserved strand occurring in these members. Observe also
in Figures
8 and S5 the cost threshold
Searching with the SSD of family RF00458 in RFAM10.1 delivers 16,033,351 matches for ires1, 8,950,417 for ires2, 1,052 for ires3, 112 for ires4, and 1,222,639 for ires5. From these matches, RaligNAtor computes highscoring chains of matches, eliminating spurious matches and resulting in exactly 17 chains. Each chain occurs in one of the 16 sequence members of the family in the full alignment except in sequence AF014388, where two chains with equal score occur. The highest (lowest) chain score is 171 (162). Using ScanAlign, LScanAlign, LESAAlign, and LGSlinkAlign, the search for all five RSSPs requires 688.32, 585.59, 186.88, and 92.25 minutes, respectively, whereas chaining requires 13.66 seconds. See Table S8 of Additional file 1 for the time required to match each pattern using the different algorithms.
The same search is performed using the SSD of family RF01736. It results in 4,145 matches for flg1, 68,024 for flg2, and 67 for flg3. Chaining the matches leads to 15 chains occurring each in one of the 15 sequence members of the family in the full alignment. The highest (lowest) chain score is 163 (156). Using ScanAlign, LScanAlign, LESAAlign, and LGSlinkAlign, the search for all three RSSPs requires 755.48, 336.69, 133.58, and 52.86 minutes, respectively, whereas chaining requires 0.03 seconds. The time required to match each pattern using each algorithm is reported in Table S9 of Additional file 1.
We also show that the lack of the sequencestructure edit operations supported by RaligNAtor deteriorates sensitivity and specificity in the search for sequence members of families RF00458 and RF01736. For this, we report in Section 4 and Table S10 of Additional file 1 results obtained with the Structator tool [23]. Structator is much faster but, in contrast to RaligNAtor, does not support all sequencestructure edit operations.
Importance of structural constraints for RNA family classification
To assess the potential of using RSSPs for reliable RNA homology search on a broader
scale and to investigate the effect of using base pairing information, we evaluated
RaligNAtor on 35 RNA families taken from Rfam 10.1 with different degrees of sequence identity
and of different sizes. See Table
1 for more information about the selected families. In our experiment, we compared
(1) RaligNAtor results obtained by using RSSPs derived from Rfam seed alignments with (2) results
obtained for the same RSSPs ignoring base pairing information and (3) results obtained
by blastn[41] searches with the families’ consensus sequence. For each selected family, we automatically
compiled an RSSP
Table 1. Results of RaligNAtor and blastn database searches for members of RNA families of different degrees of sequence identity inRFAM10.1
Figure 11. Results of ROC analyses using RaligNAtor with and without base pairing information and blastn for the 35 selected Rfam families shown in Table1. ROC curves showing RaligNAtor’s classification performance using (ignoring) base pairing information are shown in green (blue). Blast performance results are shown in red. Subfigure (1) shows the performance results averaged over all selected families. (2) and (3) show each the ROC analysis for the family with the lowest and highest level of sequence identity.
In addition, we show in Figures 11(2) and (3) the results of the ROC analysis for the families with the lowest and highest degree of sequence identity. For the ROC curve of each selected family, see Figures S7 and S8 of Additional file 1. Clearly, by using base pairing information, RaligNAtor achieves a higher sensitivity with a reduced false positive rate compared to searches ignoring base pairing (compare columns “RaligNAtor” and “RaligNAtor (sequence only)” in Table 1). This is in particular evident when searching for families with a low degree of sequence identity. This can be explained by the small amount of information left in the RSSP for such a family, once the structural information is removed. Due to the high variability of bases in the columns of the multiple alignment of the family, the pattern contains a large number of wildcards. These symbols alone, without the constraints imposed by the base pairs, lead to unspecific patterns and therefore to a large number of false positives. We observe that, for families with sequence identity of up to 59%, the area under the curve (AUC) is considerably larger when base pairing information is taken into account. This difference decreases with increasing sequence identity (compare Figures 11(2) and (3)). Overall, the average AUC value over all families is, with a value of 0.93, still notably higher when base pairing information is considered compared to 0.89 if base pairing information is ignored (see Table 1). In this experiment, blastn only finds all members of those families whose sequence identity is at least 85%. This is due to the fact that blastn cannot appropriately handle IUPAC wildcard characters. Hence, by taking the most frequent symbol in an alignment column as consensus symbol, the heterogeneity of less conserved positions in the alignment cannot be adequately modeled. For the blastn searches, the average AUC value over all families is only 0.72.
RaligNAtor software package
RaligNAtor is an opensource software package for fast approximate matching of RNA sequencestructure patterns (RSSPs). It allows the user to search target RNA or DNA sequences choosing one of the new online or further accelerated indexbased algorithms presented in this work. The index of the sequence to be searched can be easily constructed with program sufconstruct distributed with RaligNAtor.
Searched RSSPs can describe any (branching, noncrossing) RNA secondary structure;
see examples in Figures
1,
6,
8, and S5 of Additional file
1. Bases composing the sequence information of RSSPs can be ambiguous IUPAC characters.
As part of the search parameters for RSSPs, the user can specify the cost of each
sequencestructure edit operation defined above, the cost threshold of possible matches,
and the number of allowed indels. The RSSPs, along with costs and thresholds per RSSP,
are specified in a simple text file using a syntax that is expressive but easy to
understand as shown in the mentioned figures. Another possibility is to provide the
same costs and thresholds for all searched patterns as parameters in the command line
call to RaligNAtor. To ensure maximal flexibility, the user can also define the base pairing rules from
an arbitrary subset of
For describing a complex RNA with our concept of secondary structure descriptor (SSD), i.e. with multiple RSSPs, the user specifies all RSSPs in one text file. The order of the RSSPs in the file will then specify the order of the RSSP matches used to build highscoring chains. The chain score directly depends on the score of each match occurring in the chain. This is inversely proportional to the sequencestructure edit distance of the RSSP and its matching substring in the target sequence. Hence, higher scores indicate sequences with a higher conservation which are probably more closely related to the sought RNA family.
Chaining of matches discards spurious matches not occurring in any chain. An additional filtering option eliminates matches overlapping another with a higher score for the same RSSP. This is particularly useful when indels lead to almost identical matches that are only shifted by a few positions in the target sequence.
The output of RaligNAtor includes not only matching positions to single RSSPs and chains, but their sequencestructure alignment to the matched substrings as well. In the RaligNAtor software package, all programs for searching patterns support multithreading to take advantage of computer systems with multiple CPU cores. There are two modes of parallelism. At first, different patterns are searched using multiple threads. Additionally, the search space (i.e. the sequence for the online algorithms and the index structure for the indexbased methods) is partitioned, processing each part using a different thread. Lastly, we remark that our software also provides an implementation of the original algorithm of Jiang et al. for global sequencestructure alignment [25], easily applicable by the user.
Conclusions
We have presented new indexbased and online algorithms for fast approximate matching of RNA sequencestructure patterns. Our algorithms, all implemented in the RaligNAtor software, stand out from previous search tools based on motif descriptors by supporting a full set of edit operations on single bases and base pairs. See Table 2 for an overview of the algorithms. In each algorithm, the application of a new computing scheme to optimally reuse the entries of the required dynamic programming matrices and an earlystop technique to avoid the alignment computation of nonmatching substrings led to considerable speedups compared to the basic scanning algorithm ScanAlign. Our experiments show superior performance of the indexbased algorithms LGSlinkAlign and LESAAlign, which employ the suffix array data structure and achieve running time sublinear in the length of the target database. When searching for approximate matches of biologically relevant patterns on the Rfam database, LGSlinkAlign (LESAAlign) was faster than ScanAlign and LScanAlign by a factor of up to 560 (1,323) and 17 (29), respectively (see Figure 7). Comparing the two indexbased algorithms, LESAAlign was faster than LGSlinkAlign when searching with tight cost threshold (i.e. sequencestructure edit distance) and no allowed indels, but became considerably slower when the number of allowed indels was increased. In this scenario, LGSlinkAlign was faster than LESAAlign by up to 4 times. In regard to the two online algorithms, LScanAlign was faster than ScanAlign by up to factor 45. In summary, LGSlinkAlign is the best performing algorithm when searching with diverse thresholds, whereas LScanAlign is a very fast and spaceefficient alternative. RaligNAtor also allows to use the powerful concept of RNA secondary descriptors [23], i.e. searching for multiple ordered sequencestructure patterns each describing a substructure of a larger RNA molecule. For this, RaligNAtor integrates fast global and local chaining algorithms. We further performed experiments using RaligNAtor to search for members of RNA families based on information from the consensus secondary structure. In these experiments, RaligNAtor showed a high degree of sensitivity and specificity. Compared to searching with primary sequence only, the use of secondary structure information considerably improved the search sensitivity and specificity, in particular for families with a characteristic secondary structure but low degree of sequence conservation. We remark that, up to now, RaligNAtor uses a relatively simple scoring scheme. By incorporating more fine grained scoring schemes like RIBOSUM [13] or energy based scoring [42], we believe that the performance of RaligNAtor for RNA homology search can be further improved. Beyond the algorithmic contributions, we provide with the RaligNAtor software distribution, a robust, welldocumented, and easytouse software package implementing the ideas and algorithms presented in this manuscript.
Table 2. Overview of the presented algorithms
Availability
The RaligNAtor software package including documentation is available in binary format for different operating systems and architectures and as source code under the GNU General Public License Version 3. See http://www.zbh.unihamburg.de/ralignator webcite for details.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
FM and MB developed the algorithms. FM implemented the algorithms. SK implemented the chaining algorithms. MB initiated the project and provided supervision and guidance. All three authors contributed to the manuscript. All authors read and approved the final manuscript.
Acknowledgements
This work was supported by basic funding of the University of Hamburg. We thank Steffen Dettmann for interesting discussions and algorithmic ideas that contributed to this work.
References

Mattick J: RNA regulation: a new genetics?
Nat Rev Genet 2004, 5(4):316323. PubMed Abstract  Publisher Full Text

Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, Nawrocki EP, Eddy SR, Gardner PP, Bateman A: Rfam 11.0: 10 years of RNA families.

Siebert S, Backofen R: MARNA: multiple alignment and consensus structure prediction of RNAs based on sequence structure comparisons.
Bioinformatics 2005, 21(16):33523359. PubMed Abstract  Publisher Full Text

Höchsmann M, Voss B, Giegerich R: Pure multiple RNA secondary structure alignments: a progressive profile approach.
IEEE/ACM Trans Comput Bio Bioinformatics 2004, 1:5362. Publisher Full Text

Sankoff D: Simultaneous solution of the RNA folding, alignment and protosequence problem.
SIAM J Appl Mathe 1985, 45:810825. Publisher Full Text

Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R: Inferring noncoding RNA families and classes by means of genomescale structurebased clustering.
PLoS Comput Biol 2007, 3(4):e65+. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Havgaard JH, Torarinsson E, Gorodkin J: Fast pairwise structural RNA alignments by pruning of the dynamical programming matrix.

Torarinsson E, Havgaard JH, Gorodkin J: Multiple structural alignment and clustering of RNA sequences.
Bioinformatics 2007, 23(8):926932. PubMed Abstract  Publisher Full Text

Mathews DH, Turner DH: Dynalign: an algorithm for finding the secondary structure common to two RNA sequences.
J Mol Biol 2002, 317(2):191203. PubMed Abstract  Publisher Full Text

Mathews DH: Predicting a set of minimal free energy RNA secondary structures common to two sequences.
Bioinformatics 2005, 21(10):22462253. PubMed Abstract  Publisher Full Text

Dalli D, Wilm A, Mainz I, Steger G: STRAL: progressive alignment of noncoding RNA using base pairing probability vectors in quadratic time.
Bioinformatics 2006, 22(13):15931599. PubMed Abstract  Publisher Full Text

Nawrocki EP, Kolbe DL, Eddy SR: Infernal 1.0: inference of RNA alignments.
Bioinformatics 2009, 25(10):13351337. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Klein R, Eddy S: RSEARCH: finding homologs of single structured RNA sequences.
BMC Bioinformatics 2003, 4:44. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Gautheret D, Lambert A: Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles.
J Mol Biol 2001, 313:100311. PubMed Abstract  Publisher Full Text

Macke T, Ecker D, Gutell R, Gautheret D, Case D, Sampath R: RNAMotif – A new RNA secondary structure definition and discovery algorithm.
Nucleic Acids Res 2001, 29(22):47244735. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gautheret D, Major F, Cedergren R: Pattern searching/alignment with RNA primary and secondary structures: an effective descriptor for tRNA.
Comput Appl Biosci 1990, 6(4):325331. PubMed Abstract

RNABOB: a program to search for RNA secondary structure motifs in sequence databases [ http://selab.janelia.org/software.html webcite]

Chang T, Huang H, Chuang T, Shien D, Horng J: RNAMST: efficient and flexible approach for identifying RNA structural homologs.
Nucleic Acids Res 2006, 34:W423W428. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dsouza M, Larsen N, Overbeek R: Searching for patterns in genomic data.
Trends Genet 1997, 13(12):497498. PubMed Abstract  Publisher Full Text

Grillo G, Licciulli F, Liuni S, SbisÃă E, Pesole G: PatSearch: A program for the detection of patterns and structural motifs in nucleotide sequences.
Nucleic Acids Res 2003, 31(13):36083612. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Billoud B, Kontic M, Viari A: Palingol: a declarative programming language to describe nucleic acids’ secondary structures and to scan sequence database.
Nucleic Acids Res 1996, 24(8):13951403. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Reeder J, Giegerich R: A graphical programming system for molecular motif search. In Proceedings of the 5th international Conference on Generative Programming and Component Engineering. New York: ACM Press; 2006:131140.

Meyer F, Kurtz S, Backofen R, Will S, Beckstette M: Structator: fast indexbased search for RNA sequencestructure patterns.
BMC Bioinformatics 2011, 12:214. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

ElMabrouk N, Raffinot M, Duchesne JE, Lajoie M, Luc N: Approximate matching of structured motifs in DNA sequences.
J Bioinform Comput Biol 2005, 3(2):317342. PubMed Abstract  Publisher Full Text

Jiang T, Lin G, Ma B, Zhang K: A general edit distance between RNA structures.
J Comput Biol 2002, 9(2):371388. PubMed Abstract  Publisher Full Text

Abouelhoda M, Ohlebusch E: Chaining algorithms for multiple genome comparison.

Will S, Siebauer M, Heyne S, Engelhardt J, Stadler P, Reiche K, Backofen R: LocARNAscan: incorporating thermodynamic stability in sequence and structurebased RNA homology search.
Algo Mol Biol 2013, 8:14. BioMed Central Full Text

Manber U, Myers E: Suffix arrays: a new method for online string searches.
SIAM J Comput 1993, 22(5):935948. Publisher Full Text

Abouelhoda M, Kurtz S, Ohlebusch E: Replacing suffix trees with enhanced suffix arrays.
J Discrete Algo 2004, 2:5386. Publisher Full Text

Kärkkäinen J, Sanders P: Simple linear work suffix array construction. In Proceedings of the 13th International Conference on Automata, Languages and Programming. Berlin  Heidelberg: Springer; 2003.

Puglisi SJ, Smyth W, Turpin A: The performance of linear time suffix sorting algorithms. In DCC ’05: Proceedings of the Data Compression Conference. Washington: IEEE Computer Society; 2005:358367. PubMed Abstract  Publisher Full Text

Manzini G, Ferragina P: Engineering a lightweight suffix array construction algorithm.
Algorithmica 2004, 40:3350. Publisher Full Text

Kasai T, Lee G, Arimura H, Arikawa S, Park K: Lineartime longestcommonprefix computation in suffix arrays and its applications. In Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching. Berlin  Heidelberg: Springer; 2001:181192.

Beckstette M, Homann R, Giegerich R, Kurtz S: Fast index based algorithms and software for matching position specific scoring matrices.
BMC Bioinformatics 2006, 7:389. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Ukkonen E: Online construction of suffix trees.
Algorithmica 1995, 14(3):249260. Publisher Full Text

Beckstette M, Homann R, Giegerich R, Kurtz S: Significant speedup of database searches with HMMs by search space reduction with PSSM family models.
Bioinformatics 2009, 25(24):32513258. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Darty K, Denise A, Ponty Y: VARNA: Interactive drawing and editing of the RNA secondary structure.
Bioinformatics 2009, 25(15):19741975. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Weinberg Z, Wang J, Bogue J, Yang J, Corbino K, Moy R, Breaker R: Comparative genomics reveals 104 candidate structured RNAs from bacteria, archaea, and their metagenomes.
Genome Biol 2010, 11(3):R31. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSIBLAST: a new generation of protein database search programs.
Nucleic Acids Res 1997, 25(17):33893402. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mathews DH, Turner DH: Prediction of RNA secondary structure by free energy minimization.
Curr Opin Struct Biol 2006, 16(3):270278. PubMed Abstract  Publisher Full Text