College of Computer Science and Technology, Symbol Computation and Knowledge Engineering Lab of Ministry of Education, Jilin University, Changchun, China

Key Laboratory of Zoonoses of Ministry of Education, Jilin University, Changchun, China

Abstract

Background

In recent years, the important functional roles of RNAs in biological processes have been repeatedly demonstrated. Computing the similarity between two RNAs contributes to a better understanding of the functional relationship between them. However, because of the long-range correlations in RNA, many methods that detect protein similarity efficiently do not work well for RNA. To understand RNA function comprehensively, a similarity measure for RNAs should take their structural features (base pairs) into account. Current methods for RNA comparison can generally be classified as alignment-based or alignment-free.

Results

In this paper, we propose a novel wavelet-based method built on the RNA triple vector curve representation, named multi-scale RNA comparison. First, we designed a novel numerical representation of RNA secondary structure termed the RNA triple vector curve (TV-Curve). Second, we constructed a new similarity metric based on the wavelet decomposition of the TV-Curve of an RNA. Finally, we applied our algorithm to the classification of non-coding RNA and to RNA mutation analysis, and compared the results with two well-known RNA comparison tools, RNAdistance and RNApdist. The results show the potential of our method in RNA classification and RNA mutation analysis.

Conclusion

We provide a better visualization and analysis tool for RNA, named TV-Curve, which characterizes both sequence and structure features and is especially suited to long RNAs. Additionally, based on the TV-Curve representation, we propose a multi-scale similarity measure for RNA comparison that captures both local and global differences in the sequence and structure information of RNAs. Compared with well-known RNA comparison approaches, the proposed method proved effective in non-coding RNA classification and RNA mutation analysis. In the numerical experiments, the proposed method captured subtle relationships between RNAs more effectively.

Background

RNA was once considered merely the fundamental information medium in the central dogma of molecular biology. A number of studies have indicated that RNAs play a more active role and carry diverse functions in nature, including mediating the synthesis of proteins, regulating cellular activities, and exhibiting enzyme-like catalysis and post-transcriptional activities. Furthermore, many recent discoveries have shown that the number and biological significance of functional RNAs have been underestimated. In living cells, an RNA does not remain in a linear form; it folds into a secondary structure through base pairs, including the canonical A-U and G-C pairs and the G-U wobble pair. To understand RNA function, the alignment and similarity of RNAs should consider not only the primary structure (sequence) but also the secondary structure (base pairs).

Numerous approaches have been proposed to measure the similarity between RNA secondary structures. They can be broadly categorized into two classes: alignment based on a string or tree representation of the RNA secondary structure, and alignment-free comparison based on a numerical representation.

Most studies adopt dynamic programming algorithms and tree models. Some are based on the alignment of a string representation of the secondary structures, such as the dot-bracket representation, in which a score function or a distance function represents insertion, deletion and substitution of letters in the compared structures.

Others are based on the alignment of a tree representation of the RNA secondary structure elements or of the base pairing probability matrices.

Each tree model offers a more or less detailed view of an RNA structure. Given the tree representations of two RNA secondary structures, one way of comparing them computes the edit distance between the trees, while the other aligns the trees and uses the alignment score as a measure of the distance between them. Popular tools for optimal alignment of RNA secondary structures include RNAdistance

Because the above methods rely on dynamic programming algorithms, they are computationally intensive even if pseudoknots are ignored. For example, Sankoff's algorithm requires $O(n^4)$ memory and $O(n^6)$ time for two RNA sequences of length $n$, so these algorithms remain impractical for long RNA sequences. Recently, some alignment-free comparison algorithms have been proposed. Kin

Graphical representations of biosequences (protein, DNA and RNA) may lie outside the mainstream, but they offer a new view and tool for understanding and analyzing such biosequences. M. Randic

In

In this paper, motivated by DV curve representation of DNA sequences

Results and discussion

Similarities/dissimilarities among non-coding RNA from different families

We performed experiments on 100 RNA sequences to test the ability of the method to distinguish non-coding RNA families. We randomly chose 25 sequences from each of four RNA classes (5S rRNA, miRNA, RNaseP arch and tRNA) in the Rfam database.

First, the secondary structures of the 100 RNA sequences were predicted with the Vienna RNA folding prediction package. Second, their characteristic representations were constructed from the primary sequence and the predicted secondary structure. Third, the TV-Curves were obtained from the characteristic representations. We then computed the similarity between every pair of the 100 RNA sequences with the proposed multi-scale similarity measure based on TV-Curves, and arranged all the similarity values into a similarity matrix. To validate our algorithm, we also computed distance matrices using the RNApdist and RNAdistance tools.

For the comparison of our multi-scale similarity measure with the popular RNA comparison tools, the validation index used here is Hubert’s statistic

If

The Hubert's statistic represents the correlation between the matrices

where

The Hubert's statistics for the different similarity matrices are shown in Table

| **Method** | **RNApdist** | **RNAdistance** | **Multi-scale similarity based on TV-Curve** |
|---|---|---|---|
| Hubert statistic | 0.4095 | 0.1156 | 0.7205 |

The Hubert's statistic of our proposed multi-scale similarity measure is 0.7205, compared with 0.4095 for RNApdist and 0.1156 for RNAdistance. Our similarity measure therefore agrees more closely with the real data than RNApdist and RNAdistance do.
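The definition of the statistic is truncated in the text above, so the following is only a minimal sketch of one common normalized form of Hubert's statistic: the Pearson correlation between the upper-triangle entries of a proximity matrix and those of an ideal matrix that is 1 for same-family pairs and 0 otherwise. The function name `hubert_gamma`, the 0/1 ideal matrix and the toy data are assumptions for illustration and may differ from the form used in the paper.

```python
from math import sqrt

def hubert_gamma(sim, labels):
    """Normalized Hubert statistic between a similarity matrix and class labels.

    sim    : square matrix (list of lists) of pairwise similarities
    labels : class label of each item, in the same order as the rows of sim
    Assumes at least two classes and non-constant similarities; otherwise a
    standard deviation is zero and the statistic is undefined.
    """
    xs, ys = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):          # upper triangle only
            xs.append(sim[i][j])
            ys.append(1.0 if labels[i] == labels[j] else 0.0)
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / m
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / m)
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / m)
    return cov / (std_x * std_y)

# Toy data: two families, similarities that match the families exactly.
toy_labels = ["tRNA", "tRNA", "miRNA", "miRNA"]
toy_sim = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
toy_gamma = hubert_gamma(toy_sim, toy_labels)
```

With a similarity matrix, a larger value indicates closer agreement with the known families; with a distance matrix the sign of the relationship is reversed, so the ideal matrix or the sign convention would have to be adapted.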

In addition, to further compare the performance of our method with the RNApdist and RNAdistance, we reconstructed three phylogenetic trees (see Figure

The Phylogenetic tree by RNApdist using Unweighted Pair Group Method with Arithmetic Mean (UPGMA) for the four RNA classes (5S rRNA, miRNA, RNaseP arch and tRNA). **Figure S2:** The Phylogenetic tree by RNAdistance using Unweighted Pair Group Method with Arithmetic Mean (UPGMA) for the four RNA classes (5S rRNA, miRNA, RNaseP arch and tRNA). **Figure S3:** Largest structure mutation for 21 RNA Ribosomal sequences using RNAmscTV-Curve, RNAdistance and RNApdist.

**The Phylogenetic tree by multi-scale RNA comparison based on RNA triple vector curve representation using Unweighted Pair Group Method with Arithmetic Mean (UPGMA) for the four RNA classes (5S rRNA, miRNA, RNaseP arch and tRNA).**

Obviously, compared Additional file

Similarities/dissimilarities among the RNA secondary structures of nine viruses

To further illustrate the utility of our approach for subtle structure comparison, we examined the similarities/dissimilarities of a set of relatively similar RNA secondary structures at the 3'-terminus of nine different viruses. The nine viruses are alfalfa mosaic virus (ALMV), citrus leaf rugose virus (CiLRV), tobacco streak virus (TSV), citrus variegation virus (CVV), apple mosaic virus (APMV), prune dwarf ilarvirus (PDV), lilac ring mottle virus (LRMV), elm mottle virus (EMV) and asparagus virus II (AVII). The predicted secondary structures and the corresponding TV-Curves are given in Figure

**The secondary structures at the 3'-terminus of RNA 3 of nine viruses: Alfalfa Mosaic Virus (ALMV), Citrus Leaf Rugose Virus (CiLRV), Tobacco Streak Virus (TSV), Citrus Variegation Virus (CVV), Apple Mosaic Virus (APMV), Prune Dwarf Ilarvirus (PDV), Lilac Ring Mottle Virus (LRMV), Elm Mottle Virus (EMV) and asparagus virus II (AVII).**

**The TV-Curves at the 3'-terminus of RNA 3 of nine viruses: Alfalfa Mosaic Virus (ALMV), Citrus Leaf Rugose Virus (CiLRV), Tobacco Streak Virus (TSV), Citrus Variegation Virus (CVV), Apple Mosaic Virus (APMV), Prune Dwarf Ilarvirus (PDV), Lilac Ring Mottle Virus (LRMV), Elm Mottle Virus (EMV) and Asparagus Virus II (AVII).**

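| **Species** | **ALMV** | **CiLRV** | **TSV** | **CVV** | **APMV** | **LRMV** | **PDV** | **EMV** | **AVII** |
|---|---|---|---|---|---|---|---|---|---|
| ALMV | 1.0000 | 0.2596 | 0.2300 | 0.1281 | 0.1638 | 0.2606 | 0.4545 | 0.2688 | 0.3770 |
| CiLRV | | 1.0000 | 0.3259 | 0.4983 | 0.2678 | 0.5929 | 0.2007 | 0.4241 | 0.4337 |
| TSV | | | 1.0000 | 0.3828 | 0.2869 | 0.2888 | 0.3054 | 0.1652 | 0.1443 |
| CVV | | | | 1.0000 | 0.3947 | 0.6029 | 0.2755 | 0.3217 | 0.5566 |
| APMV | | | | | 1.0000 | 0.1912 | 0.7407 | 0.3245 | 0.1886 |
| LRMV | | | | | | 1.0000 | 0.1734 | 0.4963 | 0.6387 |
| PDV | | | | | | | 1.0000 | 0.3033 | 0.1187 |
| EMV | | | | | | | | 1.0000 | 0.4248 |
| AVII | | | | | | | | | 1.0000 |

The maximal similarity is 1.0000.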

To further present our result, we constructed a phylogenetic tree for the nine viruses with the UPGMA algorithm, using the multi-scale similarity measure based on TV-Curves, shown in Figure

**The phylogenetic tree for nine viruses by multi-scale RNA comparison based on RNA triple vector curve representation using Unweighted Pair Group Method with Arithmetic Mean (UPGMA).**
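As a rough illustration of how a UPGMA tree can be derived from the pairwise similarity table above, the following sketch clusters the nine viruses with a plain average-linkage procedure. Converting similarity to distance as `1 - similarity` is an assumption made here for illustration; the conversion used in the paper is not stated in this text, and the helper names are hypothetical.

```python
from itertools import combinations

NAMES = ["ALMV", "CiLRV", "TSV", "CVV", "APMV", "LRMV", "PDV", "EMV", "AVII"]
# Upper triangle of the similarity table (diagonal omitted), row by row.
UPPER = [
    [0.2596, 0.2300, 0.1281, 0.1638, 0.2606, 0.4545, 0.2688, 0.3770],
    [0.3259, 0.4983, 0.2678, 0.5929, 0.2007, 0.4241, 0.4337],
    [0.3828, 0.2869, 0.2888, 0.3054, 0.1652, 0.1443],
    [0.3947, 0.6029, 0.2755, 0.3217, 0.5566],
    [0.1912, 0.7407, 0.3245, 0.1886],
    [0.1734, 0.4963, 0.6387],
    [0.3033, 0.1187],
    [0.4248],
]

def similarity_to_distance(names, upper):
    """Map each unordered pair of names to 1 - similarity (assumed conversion)."""
    dist = {}
    for i, row in enumerate(upper):
        for k, sim in enumerate(row):
            dist[frozenset((names[i], names[i + 1 + k]))] = 1.0 - sim
    return dist

def upgma(names, dist):
    """Return the merge order as (cluster_a, cluster_b, average_distance)."""
    clusters = {name: (name,) for name in names}   # label -> member leaves
    merges = []

    def average(c1, c2):
        # UPGMA distance: mean over all leaf pairs between the two clusters.
        pairs = [(a, b) for a in clusters[c1] for b in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average(*p))
        height = average(c1, c2)
        clusters[f"({c1},{c2})"] = clusters.pop(c1) + clusters.pop(c2)
        merges.append((c1, c2, height))
    return merges

MERGES = upgma(NAMES, similarity_to_distance(NAMES, UPPER))
```

Under this conversion the first merge joins APMV and PDV, whose similarity of 0.7407 is the largest off-diagonal entry in the table.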

Observing Table

RNA mutation analysis

Mutations in RNA structure may impair function and result in disease, although RNA structure mutations can be beneficial in some situations. Consequently, it is very important to identify the most significant point mutations. Our proposed method finds significant point mutations very efficiently compared with the popular RNA mutation analysis tool RDMAS.
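The search for the most significant point mutation can be sketched as an exhaustive scan over all single-nucleotide substitutions. In the sketch below, `distance` is a hypothetical callable standing in for a structure comparison of wild type and mutant (for example, folding both and comparing the results); it is not the authors' implementation, and the toy distance at the end exists only to exercise the code.

```python
def point_mutants(seq, alphabet="ACGU"):
    """Yield (position, old_base, new_base, mutant) for every single substitution."""
    for i, base in enumerate(seq):
        for alt in alphabet:
            if alt != base:
                yield i, base, alt, seq[:i] + alt + seq[i + 1:]

def most_disruptive(seq, distance):
    """Return the point mutant that maximizes distance(wild_type, mutant).

    Ties are resolved in favour of the first mutant encountered.
    """
    return max(point_mutants(seq), key=lambda m: distance(seq, m[3]))

def toy_distance(wild, mutant):
    # Placeholder only: scores a mutation by its 0-based position.
    return sum(i for i, (a, b) in enumerate(zip(wild, mutant)) if a != b)

TOY_MUTANTS = list(point_mutants("ACGU"))
TOY_BEST = most_disruptive("ACGU", toy_distance)
```

A sequence of length n has 3n single-point mutants, so the scan costs 3n evaluations of the chosen distance.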

**Largest structure mutation for microRNA miR-21 precursor sequences using multi-scale RNA comparison based on RNA triple vector curve representation (RNAmscTV-Curve), RNAdistance and RNApdist.**

**Structural deleteriousness profiles analysis (A) Comparison of Structural deleteriousness profiles for microRNA miR-21 precursor sequences between RNAmscTV-Curve, RNAdistance and RNApdist; (B) Histograms of Structural deleteriousness profile for microRNA miR-21 precursor sequences based on RNAmsctriv, RNAdistance and RNApdist.**

Additionally, to further validate the efficiency of our method, we tested the 21 rRNA fragments of Thermus thermophilus from the Ribosomal data-set in

21 ribosomal RNA fragments of thermus thermophilus HB8. **Table S2:** The mutations with the largest difference from the wild types of 21 ribosomal RNA fragments using RNAmscTV-Curve, RNAdistance and RNApdist.

Conclusion

In this paper, we provide a better visualization and analysis tool, TV-Curve, which represents the sequence and secondary structure information of RNA, especially long RNA. Additionally, based on the TV-Curve representation of RNA, we propose a multi-scale similarity measure for RNA comparison that captures both local and global differences in the sequence and structure information of RNAs. Compared with popular RNA comparison approaches, the proposed method proved effective. However, because of the limitations of thermodynamic models, the native secondary structure of an RNA is often a suboptimal structure rather than the predicted minimum free energy (MFE) structure. Measuring structural similarity using multiple predicted suboptimal structures remains a challenge. In future research, we will focus on how to measure structural similarity by integrating multiple structures with different energy levels.

Method

The TV-Curve representation of RNA secondary structure

In this section, we describe the construction of TV-Curve of the secondary structure of RNA. Firstly, we give the characteristic representation of RNA based on the primary and secondary structure of RNA.

The characteristic representation of RNA secondary structure

In

Combining the information of the sequence and secondary structure, we give the corresponding characteristic sequence of the secondary structure of tRNA (U48228.1/7-166)
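As a toy illustration of combining sequence and structure into a characteristic sequence (not the tRNA example above), the sketch below marks every paired nucleotide with a prime. It assumes the secondary structure is supplied in dot-bracket notation without pseudoknots; the function name and the sample hairpin are hypothetical.

```python
def characteristic_sequence(seq, dot_bracket):
    """Return a list of symbols: X for an unpaired base, X' for a paired base.

    seq         : nucleotide string
    dot_bracket : structure string of the same length; '(' and ')' mark
                  paired positions, '.' marks unpaired positions
    """
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    symbols = []
    for base, mark in zip(seq.upper(), dot_bracket):
        if mark in "()":
            symbols.append(base + "'")      # paired nucleotide
        elif mark == ".":
            symbols.append(base)            # unpaired nucleotide
        else:
            raise ValueError(f"unsupported structure symbol: {mark!r}")
    return symbols

# Hypothetical hairpin: three G-C pairs closing a three-base loop.
TOY_CHAR_SEQ = characteristic_sequence("GGGAAACCC", "(((...)))")
```

The resulting symbols use the same eight-letter alphabet of unpaired and primed paired nucleotides on which the TV-Curve is built.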

Construction of TV-Curve

In this subsection, the construction of TV-Curve is given. As shown in Figure

**TV-Curve Representation: The numerical representations of four unpaired nucleotides (A, T, C and G) and four paired nucleotides (A', U', G' and C') of TV-Curve.**

TV-Curve can be obtained by connecting all the vectors one by one. We give two corresponding mathematical models of TV-Curve. Denote a characteristic sequence of RNA as $S = s_1 s_2 \cdots s_n$, where $s_i \in \{A, T, C, G, A', T', C', G'\}$ and $n$ is the length of this characteristic sequence. Define the corresponding TV-Curve as

which can be obtained by the following formulas:

For a given TV-Curve of RNA, we can retrieve its characteristic representation from equation (2).

For example, we give the secondary structures and the corresponding TV-Curves of tRNA (U48228.1/7-166) and 5S_rRNA

**TV-Curve representation: (A) The secondary structure of tRNA (U48228.1/7-166);**

The TV-Curve is a good visualization method for representing the primary and secondary structure information of an RNA molecule, especially for long RNA sequences. In addition, the TV-Curve is a numerical representation of RNA, which provides another view for understanding RNA. From the above construction, some properties of the TV-Curve follow easily:

(1). The TV-Curve extends 3 units along the X-axis to represent each unpaired nucleotide (A, T, C, G) and each paired nucleotide (A', T', C', G').

(2). From the TV-Curve, one can immediately read off the sequence and structure information of an RNA. A given TV-Curve determines a unique sequence and secondary structure, and a given RNA sequence and structure has a unique TV-Curve representation, so the correspondence between TV-Curves and RNA sequence and secondary structure information is one to one, with no loss of information. To find out whether the i-th nucleotide in the RNA sequence is paired, one only needs to examine the difference between the values of the TV-Curve at 3i-2 and 3i-3. Writing $y_k$ for the value of the TV-Curve at position $k$ on the X-axis, if $y_{3i-2} - y_{3i-3} = 1$, the i-th nucleotide is paired; if $y_{3i-2} - y_{3i-3} = -1$, the i-th nucleotide is unpaired.

(3). The X-axis end point $x_{end}$ of the TV-Curve indicates the length $n$ of the RNA sequence, i.e. $n = x_{end}/3$.
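Properties (2) and (3) can be sketched in a few lines. The sketch assumes the curve is given as a list of values sampled at x = 0, 1, ..., 3n, which is an illustrative data layout rather than the authors' own; only the first step of each nucleotide is interpreted, because the formulas that encode the base identity in the remaining steps are not available here. The toy values are hypothetical.

```python
def sequence_length(y):
    """Property (3): n = x_end / 3 for a curve sampled at x = 0, 1, ..., x_end."""
    x_end = len(y) - 1
    if x_end % 3:
        raise ValueError("curve must span a multiple of 3 units")
    return x_end // 3

def pairing_flags(y):
    """Property (2): True where the i-th nucleotide is paired, False otherwise."""
    flags = []
    for i in range(1, sequence_length(y) + 1):
        diff = y[3 * i - 2] - y[3 * i - 3]   # first step of the i-th nucleotide
        if diff == 1:
            flags.append(True)
        elif diff == -1:
            flags.append(False)
        else:
            raise ValueError(f"unexpected first step {diff} at nucleotide {i}")
    return flags

# Hypothetical curve for two nucleotides: the first paired, the second unpaired.
TOY_CURVE = [0, 1, 2, 1, 0, 1, 0]
TOY_LENGTH = sequence_length(TOY_CURVE)
TOY_FLAGS = pairing_flags(TOY_CURVE)
```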

Multi-scale similarity measure based on TV-Curves

In this section, based on TV-Curves of RNA, we propose a multi-scale similarity measure for RNA comparison in terms of the multi-scale property of wavelet transform.

We estimate RNA similarity using weighted correlations in the wavelet domain at different scales. The main characteristics of wavelet transforms are time-frequency localization and the multi-resolution property. A wavelet transform can capture the global and local properties of a signal simultaneously and can focus on any detail of a signal. In this sense, wavelets are referred to as a mathematical microscope. In the following, we briefly introduce the discrete wavelet transform.

The wavelet transform relies on the wavelet function

satisfies the following two-scale relation:

where $\{h_n\}$ is a low-pass filter (scaling filter).

The associated wavelet function constructed using scaling function satisfies the following equation:

where $\{g_n\}$ is a high-pass filter (wavelet filter).
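For reference, one common textbook form of the two relations described above is given below. The symbols $\phi$ (scaling function) and $\psi$ (wavelet function), the filter names $h_n$ and $g_n$, and the $\sqrt{2}$ normalization are assumptions here; other conventions absorb the constant into the filters.

```latex
\phi(x) = \sqrt{2} \sum_{n} h_n \, \phi(2x - n), \qquad
\psi(x) = \sqrt{2} \sum_{n} g_n \, \phi(2x - n)
```

For the Haar wavelet, for example, $h = (1/\sqrt{2},\, 1/\sqrt{2})$ and $g = (1/\sqrt{2},\, -1/\sqrt{2})$, so the first relation reduces to $\phi(x) = \phi(2x) + \phi(2x - 1)$ with $\phi$ the indicator function of $[0, 1)$.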

Given a signal $s$ of length $N$, the wavelet transform consists of at most $\log_2 N$ stages. The first stage produces two sets of coefficients from $s$: approximation coefficients $a_1$ and detail coefficients $d_1$. $a_1$ is obtained by convolving $s$ with the low-pass filter and then downsampling (keeping the even-indexed elements), and $d_1$ is obtained in the same way with the high-pass filter.

The wavelet decomposition at level two splits the approximation coefficients $a_1$ into two sets using the same scheme, replacing $s$ by $a_1$ and producing the approximation coefficients $a_2$ and detail coefficients $d_2$. The wavelet decomposition of the signal $s$ at level $j$ consists of the approximation coefficients $a_j$ and the detail coefficients $d_1, \ldots, d_j$ at the different levels. In Figure

**The flow chart of wavelet decomposition.**
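The filter-and-downsample scheme described above can be sketched as follows. The Haar filters are used purely for illustration, since the wavelet family used in the paper is not stated in this text; zero-padding of odd-length input is likewise an assumption, and all function names are hypothetical.

```python
from math import sqrt

# Haar analysis filters (an illustrative choice, not necessarily the paper's).
LOW_PASS = (1 / sqrt(2), 1 / sqrt(2))     # scaling filter
HIGH_PASS = (1 / sqrt(2), -1 / sqrt(2))   # wavelet filter

def convolve(signal, filt):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(signal) + len(filt) - 1)
    for i, s in enumerate(signal):
        for k, f in enumerate(filt):
            out[i + k] += s * f
    return out

def dwt_step(signal):
    """One stage: convolve with each filter, then keep every second sample."""
    s = [float(v) for v in signal]
    if len(s) % 2:
        s.append(0.0)                      # zero-pad odd-length input
    # Indices 1, 3, 5, ... are the even-indexed elements in 1-based counting.
    approx = convolve(s, LOW_PASS)[1::2]
    detail = convolve(s, HIGH_PASS)[1::2]
    return approx, detail

def haar_wavedec(signal, levels):
    """Return (a_L, [d_1, ..., d_L]) by repeatedly splitting the approximation."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = dwt_step(approx)
        details.append(detail)
    return approx, details

TOY_APPROX, TOY_DETAILS = haar_wavedec([4, 2, 6, 8], 2)
```

Because the Haar filters are orthonormal, the coefficients preserve the energy of the input: for the toy signal, the squared coefficients of `TOY_APPROX` and `TOY_DETAILS` sum to the same value as the squared samples.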

For any signal $s$, denote $a_0 = \{a_k^0\} = s$. The approximation coefficients $a_j = \{a_k^j\}$ and detail coefficients $d_j = \{d_k^j\}$ can be computed quickly by the Mallat algorithm

If $\{h_l\}_l$ and $\{g_l\}_l$ are orthogonal, then $g_l = (-1)^l h_{1-l}$. In the biorthogonal case there are four filters (two groups of filters): decomposition filters $\{h_l\}_l$, $\{g_l\}_l$, reconstruction filters

We applied the wavelet decomposition to the TV-Curves of tRNA (U05019.1/544-658) and 5S_rRNA (

**The wavelet decomposition of the TV-Curves: (A) the wavelet decomposition of tRNA (U05019.1/544-658)**

Based on the wavelet decomposition of the TV-Curves of RNA sequences, we design a novel similarity measure for RNA comparison that combines the Pearson correlation coefficient with the multi-resolution feature of wavelets, so that it captures local and global similarity at the same time. Two given RNA TV-Curves $T_1$ and $T_2$ can easily be extended to the same length $N$ by periodic extension or zero extension. The Pearson correlation between $T_1$ and $T_2$ is defined as:
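For reference, the standard Pearson correlation of two length-$N$ sequences can be written as follows. The names $u_k$ and $v_k$ for the samples of the two extended curves, and $\bar{u}$, $\bar{v}$ for their means, are notation assumed here.

```latex
r(u, v) = \frac{\sum_{k=1}^{N} (u_k - \bar{u})(v_k - \bar{v})}
               {\sqrt{\sum_{k=1}^{N} (u_k - \bar{u})^2} \, \sqrt{\sum_{k=1}^{N} (v_k - \bar{v})^2}},
\qquad
\bar{u} = \frac{1}{N} \sum_{k=1}^{N} u_k, \quad
\bar{v} = \frac{1}{N} \sum_{k=1}^{N} v_k
```

The value lies in $[-1, 1]$ and is undefined when either sequence is constant.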

We first decompose the two TV-Curves $T_1$ and $T_2$ with an $L$-level wavelet transform; here $L = 4$. After the four-level transform, we obtain the detail coefficients at each level. The similarity between $T_1$ and $T_2$ is then computed using each level's resolution proportion as weight, as follows:
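The exact weighting formula is not recoverable from this text, so the following is only a minimal sketch of the general idea: correlate the detail coefficients of two curves level by level and combine the correlations with per-level weights. The Haar wavelet, zero extension to a common length, and weights proportional to $2^{-j}$ at level $j$ are all assumptions made for illustration, and the function names are hypothetical; the authors' measure may differ in each of these choices.

```python
from math import sqrt

def haar_details(signal, levels):
    """Detail coefficients [d_1, ..., d_L] of a Haar decomposition."""
    approx = [float(v) for v in signal]
    details = []
    for _ in range(levels):
        if len(approx) % 2:
            approx.append(0.0)                       # zero-pad odd lengths
        evens, odds = approx[0::2], approx[1::2]
        details.append([(q - p) / sqrt(2) for p, q in zip(evens, odds)])
        approx = [(p + q) / sqrt(2) for p, q in zip(evens, odds)]
    return details

def pearson(u, v):
    """Pearson correlation; returns 0.0 when it is undefined."""
    n = len(u)
    if n == 0:
        return 0.0
    mean_u, mean_v = sum(u) / n, sum(v) / n
    num = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    den = sqrt(sum((a - mean_u) ** 2 for a in u) * sum((b - mean_v) ** 2 for b in v))
    return num / den if den > 0 else 0.0

def multiscale_similarity(curve1, curve2, levels=4):
    """Weighted sum of per-level correlations of detail coefficients."""
    n = max(len(curve1), len(curve2))
    x = list(curve1) + [0.0] * (n - len(curve1))     # zero extension
    y = list(curve2) + [0.0] * (n - len(curve2))
    dx, dy = haar_details(x, levels), haar_details(y, levels)
    raw = [2.0 ** -(j + 1) for j in range(levels)]   # assumed weights
    total = sum(raw)
    return sum(w / total * pearson(a, b) for w, a, b in zip(raw, dx, dy))

TOY = [k * k for k in range(32)]                     # hypothetical curve values
SELF_SIMILARITY = multiscale_similarity(TOY, TOY)
```

With normalized weights, a curve compared with itself scores 1 whenever its detail coefficients are non-constant at every level, which is consistent with the maximal similarity of 1.0000 reported for identical structures.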

Abbreviations

TV-curve: Triple vector curve; RNAmscTV-curve: RNA multi-scale comparison based on Triple vector curve.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

LY formulated the mathematical model and drafted the original manuscript. DM revised the manuscript and consulted on the experiments. LYC conceived the study and revised the manuscript. All authors contributed to the design and writing of the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This research was supported by the National Natural Science Foundation of China (11001106, 61073075 and 61272207), and the Science-Technology Development Project from Jilin Province of China (20120730). The authors would like to thank the editor and two anonymous reviewers for their numerous helpful suggestions and comments for this manuscript.