Department of Computer Science, The University of Hong Kong, Hong Kong

Abstract

Background

Predicting new non-coding RNAs (ncRNAs) of a family can be done by aligning the potential candidate with a member of the family with known sequence and secondary structure. Existing tools either only consider the sequence similarity or cannot handle local alignment with gaps.

Results

In this paper, we consider the problem of finding the optimal local structural alignment between a query RNA sequence (with known secondary structure) and a target sequence (with unknown secondary structure) under the affine gap penalty model. We provide an algorithm that solves this problem.

Conclusions

Based on an experiment, we show that for some ncRNA families, local structural alignment with a gap penalty model identifies real hits more effectively than global alignment or local alignment without a gap penalty model.

Background

A non-coding RNA (ncRNA) is an RNA molecule that is not translated into protein. It has been shown to be involved in many biological processes.

Most of the computational approaches are based on the observation that if two different ncRNA molecules are in the same family (with similar biological functions), they usually exhibit similar sequences as well as secondary structures. One common approach

Instead of using one member of a family, some other approaches

The core idea behind all comparative approaches is to compute the similarity between the query (known member(s)) and the target (each possible region in the genomic sequence to be investigated). Some only consider sequence similarity which may not work well for families in which members do not have high sequence similarity (e.g. members of RF00017 in Rfam 9.1

**Long gaps may exist in a conserved local region.** Multiple sequence alignment of some seed members of family RF01051 from the Rfam 9.1 database. Base-pair regions are highlighted in red and blue. All sequences are aligned according to their structures. If the two circled sequences are selected as query and target, the circled region is the conserved local region between them, and it contains a long gap.

We consider the following problem. Given a query sequence together with its secondary structure, we want to identify the substring of a given target sequence (with unknown secondary structure) that aligns to a substring of the query sequence with the highest structural similarity score under the affine gap model (see the next section for formal definitions). We assume that the secondary structures of the ncRNAs are regular, that is, they do not have pseudoknots (no two base pairs cross each other). This type of ncRNA is the most abundant in existing databases. We consider all possible substrings of the query sequence, including substrings that cover only one endpoint of some base pair in the structure.

Our result

We propose a local structural alignment algorithm with affine gap model which assumes the secondary structure of the query is known while that of the target sequence is unknown. The time complexity of our algorithm is ^{3}) which is the same as the best algorithm for global alignment for this problem where
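To make the phrase "local alignment with affine gap model" concrete, the following is a minimal sequence-only sketch in Python using the classic Smith-Waterman recurrences with Gotoh-style affine gaps. It is not the structural alignment algorithm of this paper, since it ignores base pairs entirely. The function name and the scoring parameters (`match`, `mismatch`, `gap_open`, `gap_extend`) are assumptions chosen for illustration. A gap of length k is assumed to cost `gap_open + k * gap_extend`.

```python
def local_affine_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score of strings a and b under an affine gap model.

    Illustrative scoring only: a gap of length k costs gap_open + k * gap_extend.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # H[i][j]: best score of a local alignment ending at a[i-1], b[j-1] (or empty).
    # E[i][j]: best score ending with b[j-1] aligned to a gap.
    # F[i][j]: best score ending with a[i-1] aligned to a gap.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either extend an existing gap or open a new one.
            E[i][j] = max(E[i][j - 1] + gap_extend,
                          H[i][j - 1] + gap_open + gap_extend)
            F[i][j] = max(F[i - 1][j] + gap_extend,
                          H[i - 1][j] + gap_open + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a local alignment start fresh at any cell.
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

With these illustrative parameters, aligning `AAAACCCC` against `AAAAGGCCCC` prefers one gap of length two (eight matches, one gap) over two mismatches, which is the behaviour an affine model is meant to encourage for long insertions.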

Preliminaries

An ncRNA molecule can be regarded as a sequence of four characters {

Formally speaking, let _{1}_{2} … _{m}_{i}_{i}_{j}_{i}_{j}_{i}_{j}_{x,y}_{x}s_{x}_{+1}…_{y}_{x,y}_{x,y}_{1}, _{1}), (_{2}, _{2}) ∈ _{1} ≠ _{2}, _{2} ≠ _{1}, and _{1} = _{2} if and only if _{1} = _{2}.

A regular structure is a structure in which no two base pairs cross each other. The formal definition is as follows:

**Definition 1**_{x,y} is a regular structure if there does not exist two base pairs_{x,y} such that i

Note that the empty set is also considered a regular structure.
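The non-crossing condition can be checked directly. The sketch below assumes the usual formalisation of crossing, namely that two base pairs (i, j) and (k, l) cross when i < k < j < l. Base pairs are represented as tuples of positions, and the function name `is_regular` is a hypothetical helper, not part of the original method.

```python
from itertools import combinations


def is_regular(pairs):
    """Return True if no two base pairs cross, i.e. there are no
    pairs (i, j) and (k, l) with i < k < j < l (assumed definition)."""
    # Normalise each pair so that the smaller position comes first,
    # then sort so that i <= k for every pair of pairs we compare.
    normalised = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for (i, j), (k, l) in combinations(normalised, 2):
        if i < k < j < l:
            return False
    return True
```

For example, the nested stem `[(1, 10), (2, 9), (3, 8)]` and the empty structure are both regular, while `[(1, 5), (3, 8)]` is not. The quadratic pairwise check is chosen for clarity; a stack-based scan would be faster for long sequences.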

Problem definition

Structural alignment with affine gap model

Let

where _{1},_{2}) and _{1}, _{2}, _{1}, _{2}) where _{1}, _{2}, _{1}, _{2} ∈ {

**Definition 2** An optimal global structural alignment

Let _{x,y}

**Definition 3** An optimal local structural alignment

Given

Results and discussion

The details of the algorithm are given in the Methods section. In this section, we evaluate the resulting algorithm and show that local structural alignment with the affine gap model locates ncRNAs more effectively in families whose members have hairpins, loops or stems of variable size when compared to using global alignment

To test the algorithm, we selected around twenty ncRNA families in which the members have variable sizes of hairpins, loops or stems. We constructed our test cases from real ncRNAs as follows. For each family, we first select a seed member (in the Rfam database, each family has a set of reliable members that are regarded as seed members) as the query sequence

Let

The details of the ncRNA families used in the experiments.

| Family | Query Sequence ID | Length | Number of members embedded |
|---|---|---|---|
| RF00014 | CP000468.1/2032552-2032638 | 87 | 96 |
| RF00021 | CP000851.1/113395-113522 | 128 | 100 |
| RF00022 | AAND01000021.1/495-707 | 213 | 100 |
| RF00027 | AAPE01289140.1/8905-8994 | 90 | 100 |
| RF00032 | S49118.1/1081-1106 | 26 | 100 |
| RF00033 | Y15844.1/450-543 | 94 | 100 |
| RF00034 | BX571867.1/288515-288628 | 114 | 100 |
| RF00038 | AJ132964.1/66-198 | 133 | 100 |
| RF00039 | AF370716.1/3603-3656 | 54 | 100 |
| RF00042 | X55895.1/474-565 | 92 | 100 |
| RF00043 | Z47410.1/1220-1294 | 75 | 21 |
| RF00044 | M11813.1/4883-5126 | 244 | 8 |
| RF00046 | AY013245.2/62208-62303 | 96 | 76 |
| RF00048 | AF504534.1/666-726 | 61 | 100 |
| RF00386 | AF363455.1/1-122 | 122 | 100 |
| RF00643 | AASG02000279.1/67999-67862 | 138 | 100 |
| RF00661 | AC154049.1/4734-4855 | 122 | 100 |
| RF01051 | AE014299.1/1112481-1112574 | 94 | 100 |

We compare our algorithm with the global structural alignment

Summary of comparison on results between global alignment, local alignment without gap penalty and local alignment with affine gap penalty when using the smallest threshold such that there is no false positive.

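| Family | Number of members | Gotohscan misses | Gotohscan % | Global misses | Global % | Local misses | Local % | Local with affine gap misses | Local with affine gap % |
|---|---|---|---|---|---|---|---|---|---|
| RF00014 | 96 | 2 | 2.1% | 0 | 0% | 0 | 0% | 0 | 0% |
| RF00021 | 100 | 10 | 10% | 5 | 5% | 5 | 5% | 2 | 2% |
| RF00022 | 100 | 59 | 59% | 20 | 20% | 19 | 19% | 4 | 4% |
| RF00027 | 100 | 100 | 100% | 15 | 15% | 9 | 9% | 2 | 2% |
| RF00032 | 100 | 59 | 59% | 4 | 4% | 1 | 1% | 0 | 0% |
| RF00033 | 100 | 29 | 29% | 27 | 27% | 27 | 27% | 25 | 25% |
| RF00034 | 100 | 71 | 71% | 11 | 11% | 22 | 22% | 7 | 7% |
| RF00038 | 100 | 88 | 88% | 0 | 0% | 0 | 0% | 0 | 0% |
| RF00039 | 100 | 100 | 100% | 1 | 1% | 1 | 1% | 1 | 1% |
| RF00042 | 100 | 10 | 10% | 0 | 0% | 0 | 0% | 0 | 0% |
| RF00043 | 21 | 3 | 14.3% | 0 | 0% | 0 | 0% | 0 | 0% |
| RF00044 | 8 | 1 | 12.5% | 0 | 0% | 0 | 0% | 0 | 0% |
| RF00046 | 76 | 9 | 11.8% | 2 | 2.6% | 1 | 1.3% | 0 | 0% |
| RF00048 | 100 | 17 | 17% | 0 | 0% | 0 | 0% | 0 | 0% |
| RF00386 | 100 | 88 | 88% | 63 | 63% | 62 | 62% | 6 | 6% |
| RF00643 | 100 | 98 | 98% | 4 | 4% | 13 | 13% | 0 | 0% |
| RF00661 | 100 | 100 | 100% | 87 | 87% | 77 | 77% | 30 | 30% |
| RF01051 | 100 | 100 | 100% | 91 | 91% | 85 | 85% | 52 | 52% |
| **average** | | | **53.9%** | | **18.4%** | | **17.9%** | | **7.2%** |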
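The miss counts in the table above can be derived from raw alignment scores once a threshold is fixed. The sketch below shows one way to do this, under the assumption that a region is reported as a hit when its score strictly exceeds the threshold, so that the smallest threshold with no false positive is the highest score among the false hits. The function name and the list-of-scores inputs are hypothetical.

```python
def misses_without_false_positives(real_scores, false_scores):
    """Return (threshold, misses) for the smallest threshold that
    reports no false hit, assuming hits are scores > threshold."""
    # The highest-scoring false hit must not be reported.
    threshold = max(false_scores, default=float("-inf"))
    # A real member is missed when its score does not exceed the threshold.
    misses = sum(1 for s in real_scores if s <= threshold)
    return threshold, misses


def miss_percentage(misses, members):
    """Percentage of missed members, rounded to one decimal place."""
    return round(100.0 * misses / members, 1)
```

As a check against the table, `miss_percentage(2, 96)` gives 2.1 (RF00014 with Gotohscan) and `miss_percentage(9, 76)` gives 11.8 (RF00046 with Gotohscan).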

Summary of comparison on results between global alignment, local alignment without gap penalty and local alignment with affine gap penalty when setting the threshold which allows 5% or 10% of false positives.

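Number of misses at false positive rates (FPR) of 5% and 10%:

| Family | Number of members | Gotohscan (FPR 5%) | Global (FPR 5%) | Local (FPR 5%) | Local with affine gap (FPR 5%) | Gotohscan (FPR 10%) | Global (FPR 10%) | Local (FPR 10%) | Local with affine gap (FPR 10%) |
|---|---|---|---|---|---|---|---|---|---|
| RF00014 | 96 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| RF00021 | 100 | 10 | 1 | 1 | 1 | 10 | 1 | 1 | 1 |
| RF00022 | 100 | 51 | 9 | 5 | 2 | 35 | 4 | 4 | 2 |
| RF00027 | 100 | 100 | 3 | 5 | 0 | 100 | 2 | 2 | 0 |
| RF00032 | 100 | 59 | 0 | 0 | 0 | 37 | 0 | 0 | 0 |
| RF00033 | 100 | 27 | 1 | 25 | 24 | 26 | 1 | 1 | 24 |
| RF00034 | 100 | 71 | 1 | 0 | 0 | 71 | 1 | 0 | 0 |
| RF00038 | 100 | 88 | 0 | 0 | 0 | 88 | 0 | 0 | 0 |
| RF00039 | 100 | 100 | 0 | 0 | 0 | 100 | 0 | 0 | 0 |
| RF00042 | 100 | 10 | 0 | 0 | 0 | 10 | 0 | 0 | 0 |
| RF00043 | 21 | 3 | 0 | 0 | 0 | 3 | 0 | 0 | 0 |
| RF00044 | 8 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| RF00046 | 76 | 9 | 0 | 0 | 0 | 9 | 0 | 0 | 0 |
| RF00048 | 100 | 11 | 0 | 0 | 0 | 11 | 0 | 0 | 0 |
| RF00386 | 100 | 88 | 58 | 56 | 1 | 88 | 48 | 38 | 1 |
| RF00643 | 100 | 98 | 1 | 4 | 0 | 98 | 0 | 2 | 0 |
| RF00661 | 100 | 100 | 87 | 66 | 23 | 100 | 81 | 52 | 14 |
| RF01051 | 100 | 100 | 79 | 85 | 47 | 100 | 79 | 81 | 39 |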
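Relaxing the threshold to tolerate a fixed fraction of false positives can be sketched as follows. This is an illustration under stated assumptions rather than the authors' evaluation script: a hit is taken to be any score strictly above the threshold, the false positive rate is taken to be the fraction of false hits reported, and the function name is hypothetical.

```python
def misses_at_fpr(real_scores, false_scores, fpr):
    """Return (threshold, misses) for the lowest threshold that reports
    at most a fraction `fpr` of the false hits (hits are scores > threshold)."""
    ranked = sorted(false_scores, reverse=True)
    # Number of false hits we are allowed to report; the small epsilon
    # guards against floating-point error in fpr * len(ranked).
    allowed = int(fpr * len(ranked) + 1e-9)
    if allowed < len(ranked):
        # Everything at or below the (allowed + 1)-th best false score is rejected.
        threshold = ranked[allowed]
    else:
        threshold = float("-inf")
    misses = sum(1 for s in real_scores if s <= threshold)
    return threshold, misses
```

Setting `fpr` to 0 reduces this to the no-false-positive threshold used in the earlier comparison, so the miss counts can only stay the same or decrease as `fpr` grows.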

Summary of the area (normalized) under ROC curve for false positive rate ≤ 10%

| Family | Gotohscan | Global | Local | Local with affine gap |
|---|---|---|---|---|
| RF00014 | 0.98 | 1.0 | 1.0 | 1.0 |
| RF00021 | 0.9 | 0.99 | 0.99 | 0.99 |
| RF00022 | 0.53 | 0.92 | 0.93 | 0.98 |
| RF00027 | 0.0 | 0.96 | 0.96 | 1.0 |
| RF00032 | 0.61 | 0.99 | 1.0 | 1.0 |
| RF00033 | 0.73 | 0.93 | 0.79 | 0.76 |
| RF00034 | 0.29 | 0.98 | 0.99 | 0.99 |
| RF00038 | 0.12 | 1.0 | 1.0 | 1.0 |
| RF00039 | 0.0 | 1.0 | 1.0 | 1.0 |
| RF00042 | 0.9 | 1.0 | 1.0 | 1.0 |
| RF00043 | 0.86 | 1.0 | 1.0 | 1.0 |
| RF00044 | 0.88 | 1.0 | 1.0 | 1.0 |
| RF00046 | 0.88 | 1.0 | 1.0 | 1.0 |
| RF00048 | 0.89 | 1.0 | 1.0 | 1.0 |
| RF00386 | 0.12 | 0.42 | 0.49 | 0.98 |
| RF00643 | 0.02 | 0.99 | 0.96 | 1.0 |
| RF00661 | 0.0 | 0.14 | 0.36 | 0.79 |
| RF01051 | 0.0 | 0.18 | 0.17 | 0.56 |
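A normalized area under the ROC curve restricted to low false positive rates can be computed as sketched below. The exact construction used for the table is not stated in the text, so this is one plausible reading: the ROC curve is built by sweeping the threshold over all observed scores, the area is integrated with the trapezoidal rule up to `max_fpr`, and the result is divided by `max_fpr` so that a perfect classifier scores 1.0. The function name and arguments are hypothetical, and both score lists are assumed to be non-empty.

```python
def normalized_partial_auc(real_scores, false_scores, max_fpr=0.1):
    """Area under the ROC curve for FPR <= max_fpr, divided by max_fpr."""
    labelled = [(s, True) for s in real_scores] + [(s, False) for s in false_scores]
    labelled.sort(key=lambda item: -item[0])
    n_real, n_false = len(real_scores), len(false_scores)

    # Build ROC points (fpr, tpr) by lowering the threshold step by step.
    points = [(0.0, 0.0)]
    tp = fp = 0
    idx = 0
    while idx < len(labelled):
        score = labelled[idx][0]
        # Tied scores cross the threshold together.
        while idx < len(labelled) and labelled[idx][0] == score:
            if labelled[idx][1]:
                tp += 1
            else:
                fp += 1
            idx += 1
        points.append((fp / n_false, tp / n_real))

    # Trapezoidal integration, clipping the last segment at max_fpr.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / max_fpr
```

Under this reading, a value of 0.0 (as reported for Gotohscan on RF00661) means that no real member outscores the top 10% of false hits, while 1.0 means every real member does.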

We also use RF00661 as an example and show the score distribution between the real hits and the false hits when using different algorithms in Figure

**Score distribution between the real hits and the false hits when using different algorithms for the family RF00661.** The figure shows the comparison on score distribution of real hits (i.e. real members) and false hits for the family RF00661 between different algorithms. It shows that the local structural alignment algorithm with affine gap penalty can increase the difference between the scores of real hits and the scores of false hits compared with the other methods, and so it has a higher distinguishing power to identify the real ncRNA members along the long genome sequence.

Our program takes around 15 seconds to perform local structural alignment with the affine gap model between a query and a target of around 150 bases, and around 30 seconds for 200 bases. We tested the program on a machine with a 2.4 GHz dual-core CPU and 8 GB of memory.

Conclusions

In this paper, we provided an algorithm that computes the optimal local structural alignment, under the affine gap model, for RNA with regular structure. Our experiments show that the approach is more effective than global alignment or local alignment without a gap model for some ncRNA families whose members have hairpins, loops or stems of varying sizes (which give rise to large gaps). We have not yet studied other types of gap penalty model or the effect of different gap penalty parameters. Other interesting directions include speeding up the algorithm and considering more complicated structures (e.g. structures with pseudoknots). In the meantime, we have completed an algorithm for computing local structural alignment for simple pseudoknot structures, and we are now in the process of performing experiments.

Methods

We develop a dynamic programming algorithm to solve the problem. Before we describe the method, we would like to define some variations of alignments which will be used in our algorithm. Let

**Definition 4** Optimal prefix-global structural alignment

**Definition 5** Optimal suffix-global structural alignment

**Definition 6** Optimal semi-global structural alignment

Let the affine gap model be _{e}_{≤}_{f}_{+1}

When considering any substring

Define _{1}(_{2}(_{3}(_{4}(

The value of _{1}(_{2}(_{3}(_{4}(

**Lemma 1**

The following subsections describe how to compute A_{1}, A_{2}, A_{3} and A_{4}.

Calculation of A_{1}

When considering the optimal global structural alignment (with affine gap model) between _{1}.

Define _{1}_{x}

The value of _{1}(

**Lemma 2**

We describe the calculation of A_{12}. A similar technique applies to the others (i.e. A_{11}, A_{13}, …, A_{19}).

Calculation of A_{12}

_{12}(

**Lemma 3**

Calculation of A_{2}

When considering the optimal prefix-global structural alignment (with affine gap model) between

Define _{2x}(_{2}(

**Lemma 4**

_{2}(_{21}[ _{22}[_{23}[

We describe the calculation of A_{22}. A similar technique applies to A_{21} and A_{23}.

Calculation of A_{22}

The following lemma gives the computation of A_{22}.

**Lemma 5**

_{22}(_{12}, there are also the same three situations. Situation I: when (

The calculations for A_{3} and A_{4} are similar. The following subsection describes the time complexity of the algorithm.

Time complexity

To fill the dynamic programming table, not all entries for all possible subrange of

Case 1: if (_{p,q}_{1}, _{2}, _{3}, _{4}, _{11}, …, etc.) can be computed from the entries for

Case 2: if ∃_{p,q}

Case 3: if ∄_{p,q}

Therefore, we define a function

We only need to fill in the entries for all the tables provided (^{2}) values of different (_{p,q}_{e≤f+1}{^{2}) time. Therefore the total time complexity = ^{3}) + ^{2}) = ^{3}).

**Theorem 1**^{3}).

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

TW and SY conceived the study. All authors refined the study. TW and BC came up with the algorithm and BC implemented it. All authors contributed to the analysis. TW, SY and TL participated in drafting the manuscript. All authors read and approved the final version.

Acknowledgements

The project is partially supported by the Seed Funding Programme for Basic Research (Project number: 200911159065) of the University of Hong Kong.

This article has been published as part of