Abstract
Background
Searching for members of characterized ncRNA families containing pseudoknots is an important component of genome-scale ncRNA annotation. However, the state-of-the-art known ncRNA search is based on context-free grammars (CFGs), which cannot effectively model pseudoknots. Thus, existing CFG-based ncRNA identification tools usually ignore pseudoknots during search. As a result, these tools report dozens of sequences that do not contain the native pseudoknots. When pseudoknot structures are vital to the functions of the ncRNAs, these sequences may not be true members.
Results
In this work, we design a pseudoknot search tool using multiple simple substructures, which are derived from knot-free and bifurcation-free structural motifs in the underlying family. We test our tool on a contiguous 22Mb region of the Maize Genome. The experimental results show that our work competes favorably with other pseudoknot search methods.
Conclusions
Our substructure-based tool can conduct genome-scale search for pseudoknot-containing ncRNAs effectively and efficiently. It provides a pseudoknot search tool that is complementary to Infernal. The source code is available at http://www.cse.msu.edu/~chengy/knotsearch.
Background
Noncoding RNAs (ncRNAs), which function directly as RNAs without being translated into proteins, perform diverse and important biological functions [1]. Many types of ncRNAs function through both their sequences and secondary structures, which are defined by Watson-Crick and wobble base pairs. The pseudoknot is a functionally important structural motif in ncRNA secondary structures. In pseudoknots, bases in loop regions can form base pairs with bases outside the stem loop. In a graphical representation where arcs connect base pairs, pseudoknot-free secondary structures only contain parallel or nested base pairs, while pseudoknot structures allow "crossing" base pairs, as shown by the example in Figure 1.A.
Figure 1. Consensus secondary structure of tmRNA and the secondary structure described by SCFG (pseudoknots missing). A. Consensus secondary structure of RF00023 (tmRNA) in Rfam. Stacking base pairs in 1 are parallel to base pairs in 2 and 3. 1, 2, and 3 are nested in 4. 2 and 3 form a pseudoknot. B. Secondary structure described by SCFG (pseudoknots missing).
It is already known that pseudoknots play important roles in telomerase RNA, tmRNA, rRNA, some riboswitches, some protein-binding RNAs, viral ribosomal frameshifting signals, etc. [2]. Different research groups [3,4] have shown that the pseudoknot structure in the telomerase RNA is essential for telomerase activity. Gilley and Blackburn [3] experimentally demonstrated that disruptions of the pseudoknot base pairing within the telomerase RNA from Tetrahymena thermophila prevent the stable assembly in vivo of an active telomerase. They further concluded that the pseudoknot topology, rather than the sequence, is critical for an active telomerase. Similarly, biologists reported that the pseudoknots in tmRNA are highly important for protein binding, tmRNA maturation, and proper folding of the tRNA-like domain [5]. Currently, 26,704 sequences in 71 ncRNA seed families of Rfam 10.0 [6] contain pseudoknots. With the advances of sequencing technologies and structure prediction, more pseudoknot structures are expected to be revealed.
Because the functions of ncRNAs are determined by both sequence and structure, successful ncRNA homology search tools must consider both sequence and structural conservation. Existing ncRNA search tools can be divided into two categories. The first is commonly referred to as "known ncRNA search", which aims to detect homologs of ncRNAs with annotated secondary structures. The second category includes tools for identifying novel ncRNA genes. This work belongs to the first category and focuses on ncRNAs containing pseudoknots.
For pseudoknot-free ncRNAs, the state-of-the-art search method is based on stochastic context-free grammars (SCFGs), which can accurately model the evolutionary changes of both the sequences and structures of a group of homologous ncRNAs. Commonly used general and specialized known ncRNA search tools such as Infernal [7], RSEARCH [8], and tRNAscan-SE [9] are all based on SCFGs. In conjunction with the ncRNA family database Rfam, Infernal has been successfully applied to classify query sequences into different types of ncRNA. However, SCFGs are not able to model pseudoknots. Thus, the SCFG implementation in Infernal neglects pseudoknots in the structures. For example, although RF00023 (tmRNA) has four pseudoknots, its SCFG only models the knot-free structures, as shown in Figure 1.B. As a result, Infernal could misclassify sequences as members of families containing pseudoknots. In addition, Infernal has a high computational cost, limiting its usage on large-scale data sets, such as those generated by next-generation sequencing technologies.
More complicated grammars such as context-sensitive grammars (CSGs) [10] exist to faithfully model pseudoknots. However, the computational cost of the parsing algorithms of a CSG is even higher than that of a CFG. Besides CSGs, other grammars such as parallel communicating grammar systems [11], RNA pseudoknot grammars [12], tree adjoining grammars (TAGs) [13,14], and multiple context-free grammars [15] have been proposed to model pseudoknot structures. These works described the grammars and the associated parsing algorithms. However, they have not been widely used for pseudoknot search in large-scale databases. First, although the parsing algorithms are polynomial, their cubic or even higher time or memory complexity [15] limits their large-scale applications. Second, these methods were designed for and tested on secondary structure derivation rather than homology search. In order to conduct large-scale homology search, local parsing algorithms are needed. As no source code or executable implementations of these grammars are available, it is not clear whether they can be automatically applied to known ncRNA search including pseudoknots.
In this work, we design an efficient pseudoknot search algorithm for all types of pseudoknots. Our method is based on a set of carefully chosen simple substructures (or substructures for short), which do not contain pseudoknots or bifurcations. The time complexity of the parsing and probability computation algorithms for an SCFG, including the CYK, inside, and outside algorithms, is significantly reduced when the secondary structure does not contain any bifurcation [10,16]. Thus, these simple substructures can be searched efficiently using existing implementations of SCFGs. From the multiple substructures extracted from one ncRNA family, we choose a set of substructures according to their sizes and false positive (FP) rates in order to maximize the search performance. These chosen substructures are then used in a progressive search. Our experimental results show that our tool competes favorably with other pseudoknot search methods.
Related work
Brown and Wilson [17] proposed an RNA pseudoknot search method using intersections of SCFGs. Both their method and our approach decompose a pseudoknot into knot-free structures for SCFG modeling. There are two major differences. First, our substructures are not only knot-free but also bifurcation-free, which enables faster search. Second, while Brown and Wilson's method focused on the model construction and parsing algorithm, we focus on choosing an optimal set of substructures to optimize the search performance. The model construction and the parsing algorithms can be conveniently implemented using Infernal, which has gone through extensive testing.
Structural motifs similar to substructures have been used as filters to speed up Infernal. FastR [18] relies on stem-loops ((k, w)-stacks) that do not contain bulge or interior loops to search for ncRNAs. Weinberg et al. [19] use more flexible structural motifs based on sub-CMs and profile HMMs for ncRNA classification. Smith [16] used a decision tree to organize partial SCFG models for fast ncRNA search. Currently, these filters are only designed and tested for speeding up SCFG search.
Available pseudoknot search tools include RNAv [20] and RNATOPS [21]. RNATOPS designs a graph model for RNA pseudoknots and solves the structure-sequence alignment by graph optimization. RNAv is a profile-based RNA secondary structure variation search program that detects distant ncRNA structural homologs, which might be missed by RNATOPS.
The chain filter designed by Zhang et al. [22] consists of a collection of short conserved words in an ncRNA family. In our work, we use a collection of simple substructures for pseudoknot search. Similar to Zhang et al.'s work, we find that using a collection of simple structures can achieve a good tradeoff between sensitivity and false positive rate during search.
Approach
There are two components in the method. The first component is the design of a set of substructures to represent an ncRNA family. The second component is a progressive search strategy using the designed substructures. Different regions of an ncRNA sequence have different sequence and structural conservations. Well-conserved structural and sequence motifs tend to yield better search performance than poorly conserved motifs. Our approach sorts substructures extracted from different regions according to their lengths and predicted FP rates in order to choose a set of substructures with the optimal search performance.
For a chosen set of substructures, we conduct a progressive search according to a predetermined order. During the progressive search, one substructure is only applied to regions containing matches to all previous substructures. A sequence is classified into the pseudoknot family if and only if 1) it passes the score thresholds of all the chosen substructures; 2) the position relationship between matched substrings is consistent with the relationship between the substructures. Thus the false positive rate of the chosen set of substructures is bounded by the product of the false positive rates of all component substructures. The pipeline of the approach is illustrated in Figure 2.
Figure 2. The pipeline of the SCFG construction and the progressive search.
Substructure derivation
In order to use SCFG-based models for pseudoknot search, we decompose a pseudoknot structure into simple substructures. Each substructure contains at least one stem, which includes a set of stacking base pairs allowing short bulge and interior loops. A full secondary structure of an ncRNA family can be decomposed into multiple stems. Combinations of stems define different substructures. Figure 3 shows all five simple substructures derived from the given pseudoknot.
Figure 3. Five candidate substructures can be constructed from three stems in a pseudoknot structure. Each arc represents a stem containing nested base pairs and possible internal/bulge loops. Single-stranded regions are represented using solid lines.
We describe a method to systematically extract all simple substructures from a pseudoknot. In the first step, all stems are extracted and sorted in increasing order of their starting positions (i.e., the 5' end of the outermost base pair in the stem). Second, we build a bit table R of size N by N for the N stems extracted in the first step. For each cell R[i, j], if stem i and stem j are nested, R[i, j] = 1; otherwise, R[i, j] = 0. Table R tells us whether given stems can form one substructure. Given the stem set and the relationship table R, we use the pseudocode in Algorithm 1 to extract all simple substructures. In the pseudocode, H^{x} is the set of substructures containing x stems. Thus, the union of H^{x} for x = 1 to N consists of all simple substructures for a given secondary structure. The number of substructures depends on the number of nested stems. Suppose the average number of nested stems inside a stem is n. The total number of substructures is then O(N + N·2^{n}).
Algorithm 1 ExtractSubstructures
Input: a secondary structure containing pseudoknots
Output: all simple substructures
H^{1} = Ø
for each stem i = 1 to N do
    /* h: a substructure containing a set of stems */
    h = {i}
    H^{1} = H^{1} ∪ {h}
end for
for L = 2 to N do
    H^{L} = Ø
    for each substructure h ∈ H^{L-1} do
        for each stem i ∉ h do
            /* h[k] is the kth stem in a substructure h */
            if R[h[1], i] and R[h[2], i] and ... and R[h[L-1], i] then
                /* construct a new substructure h' */
                h' = h ∪ {i}
                H^{L} = H^{L} ∪ {h'}
            end if
        end for
    end for
end for
output all substructures H = H^{1} ∪ H^{2} ∪ ... ∪ H^{N}
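As an illustration only, the extraction procedure can be sketched in Python. The stem representation used here (a tuple holding the start and end of the 5' strand followed by the start and end of the 3' strand) and the helper names `nested`, `build_table`, and `extract_substructures` are assumptions made for this sketch; they are not taken from the released source code.

```python
def nested(a, b):
    """Return True if one stem lies entirely inside the loop enclosed by the other.

    Each stem is a hypothetical tuple (s5, e5, s3, e3): the 5' strand spans
    s5..e5 and the 3' strand spans s3..e3.
    """
    def inside(inner, outer):
        return outer[1] < inner[0] and inner[3] < outer[2]
    return inside(a, b) or inside(b, a)


def build_table(stems):
    """Build the N-by-N relationship table R (True where two stems are nested)."""
    n = len(stems)
    return [[i != j and nested(stems[i], stems[j]) for j in range(n)]
            for i in range(n)]


def extract_substructures(stems):
    """Enumerate all stem sets whose members are pairwise nested."""
    stems = sorted(stems)              # increasing 5' start position
    R = build_table(stems)
    n = len(stems)
    level = {frozenset([i]) for i in range(n)}   # H^1: single stems
    all_subs = set(level)
    while level:                       # build H^L from H^(L-1)
        next_level = set()
        for h in level:
            for i in range(n):
                if i not in h and all(R[k][i] for k in h):
                    next_level.add(h | {i})
        all_subs |= next_level
        level = next_level
    return stems, all_subs


# One outer stem enclosing two stems that cross each other (a pseudoknot).
stems, subs = extract_substructures(
    [(10, 14, 40, 44), (0, 4, 95, 99), (25, 29, 60, 64)])
print(sorted(sorted(h) for h in subs))
```

Storing each substructure as a frozenset means that the same stem set reached through different insertion orders is kept only once. For the three-stem example, the two crossing stems are never combined, so five substructures are produced.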
Algorithm 1 only outputs combinations of stems. For each stem (or stem set) in a substructure, we add loop and flanking regions using the following three rules. Let the 5' and 3' ends of the outermost base pair in a substructure be I_{5} and I_{3}, respectively. Thus, I_{5} < I_{3}.
• Rule 1: Add all single-stranded regions, including bulge and internal loops, between I_{5} and I_{3}.
• Rule 2: All base pairs other than those inside the chosen stems of a substructure are treated as single-stranded regions.
• Rule 3: Extend the flanking single-stranded regions to the left of I_{5} and to the right of I_{3} until the first base pair in other substructures.
Search performance of different substructures
Each substructure can be conveniently modeled by an SCFG. As different substructures are derived from regions with different sequence and structural conservations, their corresponding SCFGs perform differently in database search. In this section, we use an example to illustrate this. We built SCFGs for eight substructures derived from RF00373 (Ribonuclease P) and evaluated the sensitivity, FP rates, and running time of the eight SCFGs when applying them to a 22.5 Mb Maize genome data set (the data is described in "Experimental results"). The sensitivity and FP rates of different substructures from the same family can be compared using true positive (TP) hits and FP hits, respectively, because the condition positive and condition negative sets are the same for all substructures derived from the same family. For any SCFG π, let the set of matched sequences be M_{π}. Let the set of true pseudoknot sequences be S, which consists of the sequences in seed families containing pseudoknots in Rfam. The numbers of true positive and FP matches of a sub-SCFG π are |M_{π} ∩ S| and |M_{π} − S|, respectively. We summarize the TP hits and FP matches of the eight SCFGs under different score thresholds in Figure 4. In addition, the search times are included for the score thresholds corresponding to the highest sensitivity. It is clear that different SCFGs have highly different search performance. During a progressive search using a series of substructures, the number of matches of the preceding substructure determines the search space of the current substructure. Thus, the total search time depends on both the FP hits and the model running time, which is heavily affected by the model length. In order to maximize the search efficiency, it is important to sort all candidate substructures according to their FP rates. When the FP rates of two or more substructures are similar (of the same order), we prefer shorter models because they incur less search time.
Figure 4. Number of TP hits and FP matches of each substructure under different score thresholds. For each substructure, the length and the search time corresponding to the highest sensitivity is listed. Time format is hr:min:sec. Due to highly different number of FP hits, two substructures are plotted in the embedded figure.
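The TP and FP counts under different score thresholds can be tabulated with simple set operations. The following is a minimal sketch, assuming that each matched sequence is identified by an ID and carries a single best bit score; the function name `tp_fp_by_threshold` and the sample data are hypothetical.

```python
def tp_fp_by_threshold(hits, true_members, thresholds):
    """Count TP and FP hits of one SCFG at several score thresholds.

    hits: dict mapping a sequence ID to its best bit score (assumed format).
    true_members: IDs of the true pseudoknot sequences (the set S).
    """
    true_members = set(true_members)
    table = {}
    for t in thresholds:
        matched = {seq for seq, score in hits.items() if score >= t}
        table[t] = (len(matched & true_members),   # TP hits
                    len(matched - true_members))   # FP hits
    return table


hits = {"s1": 30.0, "s2": 22.5, "x1": 21.0, "x2": 12.0}
print(tp_fp_by_threshold(hits, {"s1", "s2", "s3"}, [10, 20, 25]))
```

Raising the threshold removes FP hits first in this toy example, but it eventually costs TP hits as well, which is the trade-off summarized per substructure in Figure 4.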
Sort substructures according to their E-values
There are two methods to calculate the FP rates of substructures. Theoretically, by assuming a background model for random sequences and applying the CYK algorithm [10], we can directly calculate the probability that a random sequence matches an SCFG model. Empirically, we can apply the SCFGs to a large annotated sequence database and record the number of FP matches. However, as it is more important to compare the FP rates of different substructures than to know their exact values, it is not necessary to directly calculate FP rates. By assuming that the SCFG alignment scores for random sequences follow an exponential distribution, as implemented by Infernal, we can use the E-values of the designed score cutoffs to sort all substructures.
For an alignment score and a database size, an E-value indicates how many random hits a user can expect to see with the same or better score in a random sequence database of similar size. Thus, the E-value indicates the number of FP hits when it can be computed accurately. Currently, we use the E-value calculation method provided by Infernal. Although the assumed score distribution is not accurate, we found that the estimated E-values allow us to compare the FP rates of different substructures with high accuracy. In order to estimate E-values, Infernal generates a set of N random sequences whose GC content depends on the covariance model. These N random sequences are then aligned against the model. In this process, all search results with score > 0 are considered hits. Scores of the top X hits are assumed to follow an exponential distribution with two parameters, μ and λ. The maximum likelihood approach is then taken to fit the scores of these hits to an exponential distribution. The E-value of a score sc is computed as
\[ E(sc) = db \cdot e^{-\lambda (sc - \mu)}, \]
where db is the adjusted database size and is defined as
\[ db = \frac{db_{target}}{db_{random}} \cdot randhit. \]
In the E-value computation, μ and λ are parameters trained in Infernal, and sc is the score for which one needs to calculate the E-value. db_{target} is the size of the target database. db_{random} is the number of random sequences generated for curve fitting. Finally, randhit is the number of random sequences found by the covariance model. We can directly obtain μ and λ from each calibrated covariance model, which is built for a substructure. With these two parameters available, we can use the above equation to compute E-values for given scores.
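For concreteness, the E-value computation can be written as a short Python sketch. It assumes the exponential-tail form E = db · exp(−λ(sc − μ)) with db = (db_{target} / db_{random}) · randhit, which is our reading of the definitions above; the function names and the numeric values are illustrative and are not part of Infernal's interface.

```python
import math


def adjusted_db_size(db_target, db_random, randhit):
    """Adjusted database size: target/random size ratio scaled by random hits."""
    return db_target / db_random * randhit


def e_value(score, mu, lam, db_target, db_random, randhit):
    """Expected number of random hits scoring at least `score`.

    Assumes an exponential tail with location mu and rate lam, both taken
    from a calibrated covariance model.
    """
    db = adjusted_db_size(db_target, db_random, randhit)
    return db * math.exp(-lam * (score - mu))


# Illustrative parameters only: a higher cutoff yields a smaller E-value.
for cutoff in (10.0, 20.0, 30.0):
    print(cutoff, e_value(cutoff, mu=10.0, lam=0.7,
                          db_target=2000, db_random=1000, randhit=50))
```

Because only the relative order of E-values is needed to sort substructures, inaccuracies in the assumed score distribution matter less than they would for reporting absolute FP counts.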
Our experiments show that although the change of E-values does not scale with the change of the FP rates, the order of E-values is highly consistent with the order of FP rates for all 71 families we tested. Only for SCFGs with similarly small FP rates can the E-values not accurately reflect their order. Table 1 presents an example. It is worth noting that we also considered using the average entropy to sort the substructures. However, our experiments show that there is no systematic relationship between entropy-based measurements and the FP rates of substructures.
Table 1. The order of E-values is highly consistent with the order of the number of FP hits.
Choose substructures for progressive search
During a progressive search based on multiple substructures, the final sensitivity is bounded by the lowest sensitivity of all substructures. The final search time and FP rate heavily depend on the order in which these substructures are applied. Let the final array of substructures be (h_{1}, h_{2}, ..., h_{k}), where h_{i} will be applied before h_{j} if i < j. Let the size of the original database be L. For a substructure h_{i}, let t_{i} and fp_{i} be its search time per hit and FP rate, respectively. The final FP rate is bounded by
\[ \prod_{i=1}^{k} fp_{i}. \]
The final search time is roughly
\[ T = \sum_{i=1}^{k} t_{i} \cdot L \prod_{j=1}^{i-1} fp_{j}, \]
where the factor L ∏_{j<i} fp_{j} is roughly the search space for the substructure h_{i}. Minimizing T requires the accurate computation of t_{i} or quantification of the relationship between t_{i} and fp_{i}, which is not known a priori. Although Infernal provides an estimated running time, it can be quite different from the true running time. According to the equations, it is clear that we should apply short substructures with small FP rates before long substructures with high FP rates. Thus, we develop a greedy algorithm to generate a set of substructures for progressive search based on our empirical observations.
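The effect of the ordering can be illustrated with a small numerical sketch. It assumes that the search space of each substructure is the database size multiplied by the FP rates of all substructures applied before it, so that the total time is the sum of t_{i} times that search space; the function name `estimated_search_time` and the numbers are hypothetical.

```python
def estimated_search_time(order, db_size):
    """Rough total time of a progressive search.

    order: list of (t_i, fp_i) pairs in the order the substructures are applied,
           where t_i is the time per unit of search space and fp_i the FP rate.
    """
    total = 0.0
    space = float(db_size)      # search space seen by the current substructure
    for t, fp in order:
        total += t * space
        space *= fp             # only matches are passed to the next model
    return total


short_low_fp = (1.0, 0.25)      # fast model that filters aggressively
long_high_fp = (4.0, 0.5)       # slow model that lets more through
print(estimated_search_time([short_low_fp, long_high_fp], 1000))
print(estimated_search_time([long_high_fp, short_low_fp], 1000))
```

With these illustrative values, applying the short, low-FP model first costs less than half of the reverse order, even though the final FP bound (the product of the two rates) is identical.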
We split the substructures into a short group and a long group, which contain short and long substructures, respectively. For each group, we sort the substructures according to their E-values and apply a greedy algorithm to choose a set of substructures for search. The main steps of the greedy algorithm are outlined below, starting from the short group:
1. In each iteration, choose the substructure with the smallest E-value. Remove it from its group and append it to the final substructure list.
2. Remove any remaining substructure in both groups that only contains stems in this substructure.
3. Repeat from the first step until all stems are covered by at least one chosen substructure or the E-values of all remaining substructures are bigger than a predetermined cutoff (default is 1).
If the final substructure list has not covered all stems, we apply the same process to the long group and append the chosen substructures to the list. We require all stems to be covered by the chosen substructures in order to ensure the representation of the annotated pseudoknot structure. It is possible that this constraint will exclude homologous ncRNAs that lack annotated stem-loop structures. Currently, we use a size of 150 as the threshold to divide substructures into the short and the long group.
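The greedy selection described above can be sketched as follows. The dictionary fields (`name`, `stems`, `length`, `evalue`), the function name `choose_substructures`, and the sample candidates are assumptions made for illustration; only the size threshold of 150 and the default E-value cutoff of 1 come from the text.

```python
def choose_substructures(candidates, all_stems, length_cutoff=150, evalue_cutoff=1.0):
    """Greedily pick substructures, short group first, then long group.

    candidates: list of dicts with hypothetical keys
                'name', 'stems' (frozenset), 'length', and 'evalue'.
    """
    all_stems = set(all_stems)
    short = [c for c in candidates if c["length"] < length_cutoff]
    long_group = [c for c in candidates if c["length"] >= length_cutoff]
    chosen, covered = [], set()
    for group in (short, long_group):
        pool = sorted(group, key=lambda c: c["evalue"])
        while pool and covered != all_stems:
            best = pool.pop(0)                    # smallest remaining E-value
            if best["evalue"] > evalue_cutoff:
                break                             # all remaining are worse
            if any(best["stems"] <= c["stems"] for c in chosen):
                continue                          # adds no stem beyond a chosen one
            chosen.append(best)
            covered |= best["stems"]
    return chosen, covered


candidates = [
    {"name": "A", "stems": frozenset({1, 2}), "length": 80, "evalue": 1e-6},
    {"name": "B", "stems": frozenset({1}), "length": 30, "evalue": 1e-3},
    {"name": "C", "stems": frozenset({3}), "length": 40, "evalue": 5.0},
    {"name": "D", "stems": frozenset({2, 3}), "length": 200, "evalue": 1e-2},
]
chosen, covered = choose_substructures(candidates, {1, 2, 3})
print([c["name"] for c in chosen], sorted(covered))
```

In this toy example, B is skipped because its only stem is already contained in A, C is rejected by the E-value cutoff, and the long substructure D is added to cover the remaining stem.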
Implementation
For each substructure, we train an SCFG-based model on the corresponding alignment in the training data using Infernal. Let the SCFGs trained from the n substructures of an ncRNA family be Π = {π_{1}, π_{2}, ..., π_{n}}, where π_{i} represents a single SCFG. A sequence can be classified into the corresponding family if the following conditions are satisfied. First, the sequence contains matches to all designed SCFGs in Π. An SCFG match is defined in the following text. Second, for every pair of strings that match two SCFGs, their position relationship must be consistent with the annotated relationship between the two SCFGs in the underlying ncRNA family. There are three types of position relationship between two substructures: parallel, nested, and crossover. Crossover indicates the existence of pseudoknots.
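The position check can be illustrated with a simplified sketch in which each matched substring is reduced to a single (start, end) interval. Treating matches as plain intervals, as well as the names `position_relationship` and `consistent`, are assumptions of this sketch rather than a description of the actual implementation.

```python
def position_relationship(a, b):
    """Classify two matched substrings given as inclusive (start, end) pairs."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start or b_end < a_start:
        return "parallel"                 # disjoint, side by side
    if (a_start <= b_start and b_end <= a_end) or \
       (b_start <= a_start and a_end <= b_end):
        return "nested"                   # one lies inside the other
    return "crossover"                    # partial overlap: pseudoknot


def consistent(matches, annotated):
    """Check every annotated pair against the observed relationship.

    matches: dict mapping an SCFG name to its matched (start, end) interval.
    annotated: dict mapping (name1, name2) to the expected relationship.
    """
    return all(position_relationship(matches[x], matches[y]) == rel
               for (x, y), rel in annotated.items())


matches = {"pi1": (0, 30), "pi2": (20, 50), "pi3": (60, 90)}
annotated = {("pi1", "pi2"): "crossover", ("pi1", "pi3"): "parallel"}
print(consistent(matches, annotated))
```

A candidate region is kept only if every annotated pair shows the expected relationship, so a sequence in which the two crossing substructures match as parallel or nested regions would be rejected.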
We determine an SCFG match using score thresholds. For each sequence in the training set, its alignment score with a given SCFG is computed. The minimum score of all the seed sequences is used as the score threshold. This score cutoff is similar to the TC (trusted cutoff) bit score thresholds used in HMMER [23] or Infernal. When the training data contains a good representation of the family member sequences, the computed score threshold can ensure a high sensitivity during homology search. If the training set only contains close homologs of this ncRNA family, the designed cutoff may be too high for remotely related homologs.
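A minimal sketch of the threshold rule is given below, assuming that the alignment scores of the seed sequences against each SCFG are already available as plain lists; the names `score_thresholds` and `passes_all` and the scores are hypothetical.

```python
def score_thresholds(seed_scores):
    """Per-SCFG cutoff: the minimum alignment score over the seed sequences.

    seed_scores: dict mapping an SCFG name to the list of seed scores.
    """
    return {name: min(scores) for name, scores in seed_scores.items()}


def passes_all(candidate_scores, thresholds):
    """True if the candidate reaches the cutoff of every designed SCFG."""
    return all(candidate_scores.get(name, float("-inf")) >= cutoff
               for name, cutoff in thresholds.items())


cutoffs = score_thresholds({"pi1": [31.2, 27.5, 40.1], "pi2": [18.0, 22.3]})
print(cutoffs)
print(passes_all({"pi1": 28.0, "pi2": 19.5}, cutoffs))
print(passes_all({"pi1": 28.0}, cutoffs))   # no match to pi2 at all
```

Because the cutoff is the lowest seed score, every training sequence passes by construction, which is why the rule favors sensitivity when the seed alignment represents the family well.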
Experimental results
In order to test the performance of our tool for pseudoknot search in sequence databases, we conducted two experiments. First, we examined the automatically classified pseudoknot sequences in Rfam. Second, we applied it to part of the Maize genome. On the same data set, we compared our tool with RNAv, RNATOPS, and Infernal.
Pseudoknot sequences in Rfam
Because CFGs cannot model pseudoknots, covariance models (CMs), the implementations of stochastic CFGs (SCFGs) used in Rfam, neglect pseudoknots in the structures. As a result, tools that use SCFGs for ncRNA search, such as Infernal, could misclassify sequences as members of pseudoknot families. Each Rfam family contains a seed sequence set and a full sequence set. While the seed sequence set contains manually validated homologous sequences, the full sets are automatically produced using SCFG-based search against the RFAMSEQ database [6]. Thus, some of the sequences in the full set may not contain the pseudoknot structures that are annotated in the seed sequences. We examined the full member sets of the 71 ncRNA families containing pseudoknots in Rfam using our tool. Many families contain dozens of sequences that lack the annotated pseudoknot structures. For all sequences that cannot be matched by our tool, we also used the Infernal alignments and an RNA stem-finding tool, RNAmotif [24], to double-check whether the base pairs in the pseudoknot structures are missing. The SCFG alignments output by Infernal contain annotations of all base pairs that do not form pseudoknots. By comparing the annotated base pairs and the consensus secondary structure of the seed alignments, we can extract the regions that should form pseudoknots. Then, we applied RNAmotif to output all stems of size at least two in the chosen regions. Failing to output any stems validated our finding that these sequences do not have the annotated pseudoknots. The results are summarized in Table 2. Although homologous ncRNAs may not share the same set of stems, simply ignoring pseudoknots without knowing their impact on the function can introduce a large number of false members. In particular, it was already experimentally shown that pseudoknot structures are vital to the functions of some types of ncRNAs [3-5]. For these well-studied pseudoknot structures, it is important to include them during homology search.
Table 2. Sequences that do not contain annotated pseudoknots and thus may not be real members.
Data set preparation
We created a simulated data set based on a contiguous 22Mb region of the Maize Genome [25]. The annotation of the 22Mb region does not contain any hit to the 71 pseudoknot families in Rfam. In order to evaluate the sensitivity of pseudoknot search tools, we randomly chose 1,586 out of the 26,704 seed sequences from the 71 pseudoknot families and inserted them into the 22Mb region. The remaining seed sequences are used as the training data. In order to examine the FP rate of SCFG-based tools, we also created 1,586 sequences without pseudoknots. Specifically, for each of the 1,586 seed sequences, we altered the bases to disrupt the base pairs that can form pseudoknots. RNAmotif was applied again to ensure that these sequences lose the annotated pseudoknot structure. These 1,586 modified sequences and the original 22Mb region of the Maize Genome constitute the negative test data. Any hit to them is an FP hit. Note that by changing the bases, the modified sequences might share lower sequence similarity with the trained model and thus pose an easier case for all tools. Even so, our experimental results still show that different tools exhibit highly different performance on this data set. Thus, we consider this data set a reasonable test set.
There are two major advantages of using this simulated data set for testing pseudoknot search tools. First, as the 22Mb region of the Maize genome does not harbor any reported ncRNA that contains pseudoknots, we can measure the empirical FP rates of pseudoknot search tools with higher reliability than when using simulated sequences, which are usually generated using a simple i.i.d. model or a low-order Markov model. In particular, the Maize genome contains a high percentage of repeats and low-complexity regions, which cannot be simply simulated and can pose a challenge for ncRNA search, as warned by the Rfam website (http://rfam.sanger.ac.uk/). Second, using thousands of seed members of the pseudoknot families provides us with adequate test data for evaluating the sensitivity.
Besides using the seed sequences of Rfam, we also considered another pseudoknot sequence database, Pseudobase [26]. This database contains 304 RNA sequences with pseudoknot structures. A majority of them are substrings of Rfam seed sequences. Thus, we chose to use Rfam seed sequences as the true labels.
Results and comparisons
In order to separate the training set and the test set, we removed the sequences that were inserted into the Maize genome from the seed alignments. For the alignments composed of the remaining sequences, we trained the full covariance model and the models for the substructures. We used the designed substructure sets for pseudoknot search. We evaluated the performance of pseudoknot search tools using three metrics: sensitivity, FP hits, and running time. For each ncRNA family represented by an SCFG π, let M_{π} be the set of sequences output by a search tool. Let S be the set of true pseudoknot sequences, which, in this data set, includes the seed sequences of each pseudoknot family. The sensitivity is thus defined as:
\[ \text{sensitivity} = \frac{|M_{\pi} \cap S|}{|S|} \]
Any output that does not overlap with true pseudoknot sequences is a false positive hit. The number of FP hits of a search tool on one family is computed as:
\[ \text{FP hits} = |M_{\pi} - S| \]
We report the FP hits instead of the FP rates for two reasons. First, the condition negative set is family specific and thus is the same for all search tools for a given family. Second, the size of the condition negative set is mainly determined by the size of the genome minus the size of all true pseudoknot sequences. For a large genomic sequence, the FP rate becomes very small and cannot reflect the difference between different tools.
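The two metrics can be computed from genomic coordinates as in the sketch below. It assumes that outputs and true pseudoknot sequences are given as inclusive (start, end) intervals on the same strand and that any overlap counts as a match; the minimum overlap actually required, the function names, and the coordinates are assumptions of this illustration.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def sensitivity_and_fp_hits(outputs, true_regions):
    """Sensitivity and number of FP hits for one family.

    outputs: intervals reported by a search tool.
    true_regions: intervals of the inserted true pseudoknot sequences.
    """
    found = sum(1 for s in true_regions
                if any(overlaps(s, m) for m in outputs))
    fp_hits = sum(1 for m in outputs
                  if not any(overlaps(m, s) for s in true_regions))
    return found / len(true_regions), fp_hits


outputs = [(100, 180), (500, 560), (900, 950)]
true_regions = [(90, 170), (300, 380)]
print(sensitivity_and_fp_hits(outputs, true_regions))
```

Here one of the two true regions is recovered and two outputs overlap no true region, so the family would be reported with a sensitivity of 0.5 and two FP hits.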
On the same data set, we ran RNAv, RNATOPS, and Infernal 1.0.2. Of the three, RNAv and RNATOPS are designed for pseudoknot search. For Infernal and the substructure-based tool, no hidden Markov model-based filtration was used in order to maximize the sensitivity. Other parameters were set to their defaults for Infernal. We used the default parameters to run RNAv and RNATOPS. All experiments were run on the main cluster of the High Performance Computing Center on campus (http://www.icer.msu.edu/?q=hpcc). Each experiment was allocated four CPU days at most. RNAv and RNATOPS failed on 65 and 31 families, respectively. The search jobs for those families were killed by the cluster after four CPU days, and no results were produced. Thus, we could not report the results for those families. RNATOPS output results for 22 families by the end of the allocated time.
The performance of these four tools is recorded in Table 3. The results show that our tool is significantly faster than RNATOPS and RNAv. For a majority of families, the running time is less than half an hour. A closer examination reveals that 99% of the running time is attributed to the first substructure, which is expected. All six families for which RNAv successfully generated outputs have a sensitivity of 1.0, equal to the sensitivity of the substructure-based search. Of the 40 families for which RNATOPS reported results, 14 have sensitivity equal to ours. One family yields slightly better sensitivity than ours, while 24 other families have significantly worse sensitivity. Thus, overall, our search achieves higher sensitivity than RNAv and RNATOPS. In addition, the substructure-based search tool incurs a lower FP rate than RNATOPS and RNAv. Table 3 shows that RNATOPS yields low FP hits. Of the 40 families, RNATOPS has the same number of FP hits as ours for only one family and significantly more FP hits for the rest. In particular, RNATOPS outputs over 1,000 hits for 9 families.
Table 3. Sensitivity, FP hits, and running time comparison between RNAv, RNATOPS, Infernal, and substructure.
We compared the sensitivity, FP hits, and running time of Infernal and our tool in Figure 5, Figure 6, and Figure 7 using XY scatter plots. As Infernal and our tool generate the same sensitivity or other metrics for some families, we use bubble plots to visualize the number of identical data points. As expected, Infernal is highly sensitive. However, it reported dozens of hits on the pseudoknot-free sequences that we inserted as false positive sequences. For all families, Infernal reported an equal or larger number of FP hits than our tool. In addition, it is generally slower than the substructure-based tool. Out of 71 RNA families, the substructure-based tool has a shorter running time on 66 families. For 14 families, it yields a 10x speedup over Infernal.
Figure 5. Sensitivity comparison on 71 families.
Figure 6. Comparison of false positive hits on 71 families.
Figure 7. Running time comparison. There are 4 families on which Infernal ran much longer than on other families. To keep an appropriate scale, their running times are not displayed in the figure.
There is no significant difference in sensitivity between Infernal and the substructure-based tool when the average sequence length in a family is not too long. Infernal has better sensitivity on longer and more complicated RNA families, including RF00010, RF00011, RF00023, and RF00030. The major reason behind our worse sensitivity on the long families is that we use substructures that cover every stem. Thus, we only classify sequences that have all characterized stems from the underlying structure. However, some remote homologs may lose base pairs in stems during evolution. Thus, while we are guaranteed to find sequences that have the same structures as the annotated pseudoknots, we can miss some homologs, leading to lower sensitivity for some families.
Conclusion
Although Infernal is highly sensitive in known ncRNA search, caution must be taken when applying Infernal to ncRNA families containing pseudoknots. In this work, we designed a pseudoknot search method based on a set of carefully chosen substructures. These substructures do not contain pseudoknots or bifurcations, so SCFGs can be conveniently built on them and searched with high efficiency. To minimize the overall FP rate and the running time, we sorted the substructures according to their lengths and their E-values for designed trusted cutoff (NC) bit score thresholds. We designed a greedy algorithm to choose a set of substructures and applied progressive search to minimize search time. Our experimental results showed that our tool competes favorably with RNAv and RNATOPS, both of which have been used for pseudoknot search in large databases. This work provides a complementary pseudoknot search tool to existing SCFG-based knot-free ncRNA search methods.
Currently, our tool reports only homologous ncRNAs with the same number of characterized stems as the training data. As a result, true homologs that have lost one or more stems will be missed. As part of future work, we plan to incorporate available RNA-seq data for remote homology search.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
YS proposed the original idea and algorithms. YS and CY both contributed to experiment design. CY conducted the experiments and implemented the algorithms. Both authors read and approved the final manuscript.
Declarations
The publication costs for this article were funded by NSF DBI-0953738 and IOS-1126998.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 2, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S2.
Acknowledgements
This work was partially supported by the NSF grants DBI-0953738 and IOS-1126998.
References

Griffiths-Jones S: Annotating Noncoding RNA Genes.
Annual Review of Genomics and Human Genetics 2007, 8:279-298.

Staple DW, Butcher SE: Pseudoknots: RNA Structures with Diverse Functions.
PLoS Biology 2005, 3(6):e213.

Gilley D, Blackburn EH: The telomerase RNA pseudoknot is critical for the stable assembly of a catalytically active ribonucleoprotein.
PNAS 1999, 96(12):6621-6625.

Chen JL, Greider CW: Functional analysis of the pseudoknot structure in human telomerase RNA.
PNAS 2005, 102(23):8080-8085.

Wower IK, Zwieb C, Wower J: Contributions of Pseudoknots and Protein SmpB to the Structure and Function of tmRNA in trans-Translation.
The Journal of Biological Chemistry 2004, 279(52):54202-54209.

Gardner P, Daub J, Tate J, Nawrocki E, Kolbe D, Lindgreen S, Wilkinson A, Finn R, Griffiths-Jones S, Eddy S, Bateman A: Rfam: updates to the RNA families database.
Nucleic Acids Research 2008, 37(Database):D136-D140.

Nawrocki EP, Kolbe DL, Eddy SR: Infernal 1.0: Inference of RNA alignments.
Bioinformatics 2009, 25:1335-1337.

Klein RJ, Eddy SR: RSEARCH: finding homologs of single structured RNA sequences.
BMC Bioinformatics 2003, 4:44.

Lowe T, Eddy SR: tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence.
Nucleic Acids Res 1997, 25:955-64.

Durbin R, Eddy SR, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. UK: Cambridge University Press; 1998.

Cai L, Malmberg RL, Wu Y: Stochastic modeling of RNA pseudoknotted structures: a grammatical approach.
Bioinformatics 2003, 19(Suppl. 1):i66-i73.

Rivas E, Eddy SR: The language of RNA: a formal grammar that includes pseudoknots.
Bioinformatics 2000, 16(4):334-340.

Uemura Y, Hasegawa A, Kobayashi S, Yokomori T: Tree adjoining grammars for RNA structure prediction.
Theoretical Computer Science 1999, 210(2):277-303.

Matsui H, Sato K, Sakakibara Y: Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures.
Bioinformatics 2005, 21(11):2611-2617.

Kato Y, Seki H, Kasami T: RNA Pseudoknotted Structure Prediction Using Stochastic Multiple Context-Free Grammar.

Smith JA: RNA Search with Decision Trees and Partial Covariance Models.
IEEE/ACM Trans Comput Biol Bioinform 2009, 6(3):517-527.

Brown M, Wilson C: RNA pseudoknot modeling using intersections of stochastic context free grammars with applications to database search.
Pac Symp Biocomput 1996, 109-125.

Zhang S, Haas B, Eskin E, Bafna V: Searching Genomes for Noncoding RNA Using FastR.
IEEE/ACM Trans Comput Biol Bioinform 2005, 2:366-79.

Weinberg Z, Ruzzo W: Exploiting conserved structure for faster annotation of noncoding RNAs without loss of accuracy.
Bioinformatics 2004, 20(suppl. 1):i334-40.

Huang Z, Malmberg R, Mohebbi M, Cai L: RNAv: Noncoding RNA secondary structure variation search via graph homomorphism. In CSB Conference Proceedings. CA, USA; 2010:56-69.

Huang Z, Wu Y, Robertson J, Feng L, Malmberg RL, Cai L: Fast and accurate search for noncoding RNA pseudoknot structures in genomes.
Bioinformatics 2008, 24(20):2281-2287.

Zhang S, Borovok I, Aharonowitz Y, Sharan R, Bafna V: A sequence-based filtering method for ncRNA identification and its application to searching for riboswitch elements.
Bioinformatics 2006, 22:e557-65.

Eddy S: HMMER - biosequence analysis using profile hidden Markov models. [Http://hmmer.janelia.org/]
2007.

Macke T, Ecker D, Gutell R, Gautheret D, Case D, Sampath R: RNAMotif - A new RNA secondary structure definition and discovery algorithm.
Nucleic Acids Research 2001, 29:4724-4735.

Wei F, Stein JC, Liang C, et al.: Detailed Analysis of a Contiguous 22-Mb Region of the Maize Genome.
PLoS Genet 2009, 5(11):e1000728.

van Batenburg FHD, Gultyaev AP, Pleij CWA, Ng J, Oliehoek J: PseudoBase: a database with RNA pseudoknots.
Nucleic Acids Research 2000, 28:201-204.