Department of Computer Science, University of Georgia, Athens, Georgia 30602, USA

Department of Plant Biology, University of Georgia, Athens, Georgia 30602, USA

Institute of Bioinformatics, University of Georgia, Athens, Georgia 30602, USA

Center for Simulational Physics, University of Georgia, Athens, Georgia 30602, USA

Abstract

Background

The computational identification of RNAs in genomic sequences requires the identification of signals of RNA sequences. Shannon base pairing entropy is an indicator for RNA secondary structure fold certainty in detection of structural, non-coding RNAs (ncRNAs). Under the Boltzmann ensemble of secondary structures, the probability of a base pair is estimated from its frequency across all the alternative equilibrium structures. However, such an entropy has yet to deliver the desired performance for distinguishing ncRNAs from random sequences. Developing novel methods to improve the entropy measure performance may result in more effective ncRNA gene finding based on structure detection.

Results

This paper shows that the measuring performance of base pairing entropy can be significantly improved with a constrained secondary structure ensemble in which only canonical base pairs are assumed to occur in energetically stable stems in a fold. This constraint actually reduces the space of the secondary structure and may lower the probabilities of base pairs unfavorable to the native fold. Indeed, base pairing entropies computed with this constrained model demonstrate substantially narrowed gaps of Z-scores between ncRNAs, as well as drastic increases in the Z-score for all 13 tested ncRNA sets, compared to shuffled sequences.

Conclusions

These results suggest the viability of developing effective structure-based ncRNA gene finding methods by investigating secondary structure ensembles of ncRNAs.

Background

Statistical signals in primary sequences for non-coding RNA (ncRNA) genes have been evasive

A predicted secondary structure can be characterized for its fold certainty using the Shannon base pairing entropy, which takes the form $-\sum_{i<j} p_{i,j} \log p_{i,j}$ over the base pairing probabilities $p_{i,j}$.

The diverse results of entropy measurement on different ncRNAs suggest that the canonical RNA secondary structure ensemble has yet to capture all structural characteristics of ncRNAs. For example, a Boltzmann ensemble enhanced with weighted equilibrium alternative structures has also resulted in higher accuracy in secondary structure prediction

In this paper, we present work that computes Shannon base pairing entropies based on a constrained secondary structure model. The results show substantial improvements in the Z-score of base pairing Shannon entropies on 13 ncRNA datasets

Results

We implemented the algorithm for Shannon base pairing entropy calculation into a program named TRIPLE. We tested it on ncRNA datasets and compared its performance on these ncRNAs with the performance achieved by the software NUPACK

Data preparation

We downloaded the 13 ncRNA datasets previously investigated in Table 1 of

The results from using these datasets were analyzed with 6 different types of measures, including Z-score and

For our tests, we also generated random sequences as control data. For every ncRNA sequence, we randomly shuffled it to produce two sets of 100 random sequences each; one set was based upon single nucleotide shuffling, the other was based upon di-nucleotide shuffling. In addition, all ncRNA sequences containing nucleotides other than A, C, G, T, and U were removed for the reason that NUPACK
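The two kinds of control sequences described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the procedure used to produce the reported data: all function names are hypothetical, the seed handling is arbitrary, and the di-nucleotide shuffle follows the general Altschul-Erickson idea of rebuilding the sequence as a random Eulerian walk over its di-nucleotide graph.

```python
import random

ALLOWED = set("ACGTU")  # sequences with any other symbol are discarded


def is_clean(seq):
    """True if the sequence contains only A, C, G, T, and U."""
    return set(seq.upper()) <= ALLOWED


def mono_shuffle(seq, rng):
    """Single nucleotide shuffle: preserves base composition only."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def _reaches(v, last, final_edge):
    """Follow the chosen final edges from v and check that they lead to `last`."""
    seen = set()
    while v != last and v not in seen:
        seen.add(v)
        v = final_edge[v]
    return v == last


def dinuc_shuffle(seq, rng):
    """Di-nucleotide shuffle: preserves the counts of all adjacent base pairs."""
    if len(seq) < 3:
        return seq
    last = seq[-1]
    while True:
        # successor lists: one entry per di-nucleotide occurrence
        succ = {}
        for a, b in zip(seq, seq[1:]):
            succ.setdefault(a, []).append(b)
        # pick a random final edge per base; retry until they all lead to `last`
        final_edge = {v: rng.choice(succ[v]) for v in succ if v != last}
        if all(_reaches(v, last, final_edge) for v in final_edge):
            break
    for v, edges in succ.items():
        if v in final_edge:
            edges.remove(final_edge[v])
            rng.shuffle(edges)
            edges.append(final_edge[v])  # the final edge must be used last
        else:
            rng.shuffle(edges)
    out, used = [seq[0]], {v: 0 for v in succ}
    while len(out) < len(seq):
        cur = out[-1]
        out.append(succ[cur][used[cur]])
        used[cur] += 1
    return "".join(out)


def make_controls(seq, n=100, seed=0):
    """Return (single nucleotide shuffles, di-nucleotide shuffles), n of each."""
    rng = random.Random(seed)
    mono = [mono_shuffle(seq, rng) for _ in range(n)]
    di = [dinuc_shuffle(seq, rng) for _ in range(n)]
    return mono, di
```

Calling `make_controls(seq)` on a filtered sequence yields the two sets of 100 shuffled sequences; the di-nucleotide variant is the stricter control because it also preserves nearest-neighbor composition.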

Shannon entropy distribution of random sequences

Two energy-model-based programs, NUPACK (with the pseudoknot function turned off) and RNAfold, and our program TRIPLE were used to compute base pairing probabilities on ncRNA sequences and on random sequences. In particular, for every ncRNA sequence $x$ and its associated randomly shuffled sequence set $S_x$, the Shannon entropies of these sequences were computed.

A Kolmogorov-Smirnov test (KS test)

Z-scores and comparisons

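For each ncRNA, the average and standard deviation of the Shannon entropies of its randomly shuffled sequences were estimated. The Z-score of the Shannon entropy $Q(x)$ of ncRNA sequence $x$ is defined as follows:

$$Z(x) = \frac{\mu(Q(S_x)) - Q(x)}{\sigma(Q(S_x))}$$

where $\mu(Q(S_x))$ and $\sigma(Q(S_x))$ denote the average and standard deviation of the Shannon entropies of the random sequences in the shuffled set $S_x$ of $x$.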
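As a rough sketch of this Z-score computation, the function below takes the entropy of one ncRNA and the entropies of its shuffled sequences. The function name is hypothetical, and two choices are assumptions rather than details confirmed by the text: the sign is oriented so that an entropy lower than the shuffled average gives a positive Z-score (consistent with the thresholds Z ≥ 0.5 to Z ≥ 2 used in the comparisons), and the sample standard deviation is used.

```python
import statistics


def entropy_z_score(entropy_x, shuffled_entropies):
    """Z-score of one entropy against the entropies of its shuffled set.

    Assumed convention: positive when entropy_x lies below the shuffled mean.
    """
    mu = statistics.mean(shuffled_entropies)
    sigma = statistics.stdev(shuffled_entropies)  # sample standard deviation
    if sigma == 0:
        raise ValueError("shuffled entropies have zero variance")
    return (mu - entropy_x) / sigma
```

For example, `entropy_z_score(1.0, [2.0, 3.0, 4.0])` evaluates `(3.0 - 1.0) / 1.0`.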

**Comparisons of averaged Z-score of Shannon base pairing entropies**. Comparisons of averaged Z-score of Shannon base pairing entropies computed by NUPACK, RNAfold, and TRIPLE for each of the 13 ncRNA datasets downloaded from

To examine how the Z-scores might have been improved by TRIPLE, we designated four thresholds for Z-scores, which are 2, 1.5, 1, and 0.5. The percentages of sequences of each dataset with Z-score greater than or equal to the thresholds were computed.
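The threshold summary described here amounts to a simple count per dataset. A minimal sketch, with a hypothetical function name and made-up Z-scores used purely for illustration:

```python
def percent_at_or_above(z_scores, thresholds=(2.0, 1.5, 1.0, 0.5)):
    """Percentage of sequences whose Z-score is >= each threshold."""
    if not z_scores:
        raise ValueError("no Z-scores given")
    n = len(z_scores)
    return {t: 100.0 * sum(z >= t for z in z_scores) / n for t in thresholds}


# Illustrative values only, not data from the paper.
summary = percent_at_or_above([2.5, 1.6, 0.2, 1.0])
```

Because the thresholds are nested, the percentages for one dataset are non-decreasing from Z ≥ 2 down to Z ≥ 0.5.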

Table

Comparisons of TRIPLE and NUPACK by the percentages of sequences falling in each Z-score category.

| **ncRNA** | **Method** | **Z ≥ 2.0** | **Z ≥ 1.5** | **Z ≥ 1.0** | **Z ≥ 0.5** |
|---|---|---|---|---|---|
| Hh1 | TRIPLE | 26.67 | 40.00 | 53.33 | 73.33 |
| Hh1 | NUPACK | 0.00 | 0.00 | 20.00 | 53.33 |
| sno_guide | TRIPLE | 14.43 | 24.45 | 38.39 | 58.19 |
| sno_guide | NUPACK | 0.73 | 8.80 | 27.63 | 45.23 |
| sn_splice | TRIPLE | 40.51 | 50.63 | 60.76 | 65.82 |
| sn_splice | NUPACK | 3.80 | 18.99 | 48.10 | 70.89 |
| SRP | TRIPLE | 35.06 | 44.16 | 59.74 | 67.53 |
| SRP | NUPACK | 3.90 | 36.36 | 72.73 | 85.71 |
| tRNA | TRIPLE | 29.56 | 51.33 | 70.97 | 86.02 |
| tRNA | NUPACK | 0.00 | 2.30 | 12.04 | 32.21 |
| intron | TRIPLE | 60.75 | 69.16 | 78.50 | 85.98 |
| intron | NUPACK | 1.87 | 19.63 | 61.68 | 85.05 |
| riboswitch | TRIPLE | 34.64 | 48.37 | 60.13 | 78.43 |
| riboswitch | NUPACK | 1.96 | 18.95 | 45.75 | 69.28 |
| miRNA | TRIPLE | 81.48 | 88.89 | 94.07 | 97.04 |
| miRNA | NUPACK | 0.00 | 12.59 | 68.15 | 97.78 |
| telomerase | TRIPLE | 29.41 | 35.29 | 41.18 | 58.82 |
| telomerase | NUPACK | 11.76 | 17.65 | 35.29 | 47.06 |
| RNase | TRIPLE | 50.70 | 70.42 | 81.69 | 92.25 |
| RNase | NUPACK | 5.63 | 23.94 | 48.59 | 72.54 |
| regulatory | TRIPLE | 22.41 | 24.14 | 32.76 | 56.90 |
| regulatory | NUPACK | 1.72 | 3.45 | 18.97 | 51.72 |
| tmRNA | TRIPLE | 18.64 | 32.20 | 45.76 | 55.93 |
| tmRNA | NUPACK | 1.69 | 8.47 | 27.12 | 37.29 |
| rRNA | TRIPLE | 36.16 | 50.62 | 70.87 | 83.06 |
| rRNA | NUPACK | 4.75 | 21.07 | 42.56 | 61.16 |

Random sequences were obtained with di-nucleotide shuffling of the real ncRNA sequences.

Comparisons of TRIPLE and NUPACK by the percentages of sequences falling in each Z-score category.

| **ncRNA** | **Method** | **Z ≥ 2** | **Z ≥ 1.5** | **Z ≥ 1** | **Z ≥ 0.5** |
|---|---|---|---|---|---|
| Hh1 | TRIPLE | 6.67 | 33.33 | 53.33 | 73.33 |
| Hh1 | NUPACK | 0.00 | 0.00 | 20.00 | 60.00 |
| sno_guide | TRIPLE | 14.91 | 25.43 | 41.10 | 57.95 |
| sno_guide | NUPACK | 0.98 | 9.05 | 28.85 | 45.72 |
| sn_splice | TRIPLE | 31.65 | 43.04 | 56.96 | 65.82 |
| sn_splice | NUPACK | 5.06 | 26.58 | 51.90 | 69.62 |
| SRP | TRIPLE | 32.47 | 45.45 | 55.84 | 68.83 |
| SRP | NUPACK | 3.90 | 37.66 | 72.73 | 87.01 |
| tRNA | TRIPLE | 24.07 | 45.31 | 64.25 | 79.47 |
| tRNA | NUPACK | 0.00 | 2.12 | 14.69 | 33.45 |
| intron | TRIPLE | 59.81 | 68.22 | 74.77 | 84.11 |
| intron | NUPACK | 1.87 | 22.43 | 66.36 | 85.98 |
| riboswitch | TRIPLE | 32.03 | 44.44 | 56.86 | 71.90 |
| riboswitch | NUPACK | 1.96 | 21.57 | 46.41 | 69.28 |
| miRNA | TRIPLE | 75.56 | 81.48 | 90.37 | 93.33 |
| miRNA | NUPACK | 0.00 | 9.63 | 70.37 | 98.52 |
| telomerase | TRIPLE | 23.53 | 29.41 | 41.18 | 58.82 |
| telomerase | NUPACK | 5.88 | 29.41 | 29.41 | 52.94 |
| RNase | TRIPLE | 38.03 | 56.34 | 72.54 | 87.32 |
| RNase | NUPACK | 10.56 | 26.06 | 52.11 | 76.06 |
| regulatory | TRIPLE | 18.97 | 25.86 | 31.03 | 51.72 |
| regulatory | NUPACK | 0.00 | 1.72 | 24.14 | 50.00 |
| tmRNA | TRIPLE | 15.25 | 27.12 | 38.98 | 57.63 |
| tmRNA | NUPACK | 3.39 | 6.78 | 27.12 | 42.37 |
| rRNA | TRIPLE | 34.09 | 47.31 | 64.88 | 79.96 |
| rRNA | NUPACK | 6.40 | 21.69 | 43.19 | 60.74 |

Random sequences were obtained with single nucleotide shuffling of the real ncRNA sequences.

The results of RNAfold using the default setting are given in Table

Comparisons of TRIPLE and RNAfold by the percentages of sequences falling in each Z-score category.

| **Dataset** | **Method** | **≥ 2 (%)** | **≥ 1.5 (%)** | **≥ 1 (%)** | **≥ 0.5 (%)** |
|---|---|---|---|---|---|
| Hh1 | TRIPLE | 26.67 | 40.00 | 53.33 | 73.33 |
| Hh1 | RNAfold | 0.00 | 0.00 | 20.00 | 53.33 |
| sno_guide | TRIPLE | 14.43 | 24.45 | 38.39 | 58.19 |
| sno_guide | RNAfold | 1.71 | 7.82 | 23.96 | 43.03 |
| sn_splice | TRIPLE | 40.51 | 50.63 | 60.76 | 65.82 |
| sn_splice | RNAfold | 6.33 | 21.52 | 54.43 | 69.62 |
| SRP | TRIPLE | 35.06 | 44.16 | 59.74 | 67.53 |
| SRP | RNAfold | 5.19 | 24.68 | 58.44 | 71.43 |
| tRNA | TRIPLE | 29.56 | 51.33 | 70.97 | 86.02 |
| tRNA | RNAfold | 0.18 | 4.25 | 24.78 | 47.96 |
| intron | TRIPLE | 60.75 | 69.16 | 78.50 | 85.98 |
| intron | RNAfold | 2.80 | 17.76 | 60.75 | 84.11 |
| riboswitch | TRIPLE | 34.64 | 48.37 | 60.13 | 78.43 |
| riboswitch | RNAfold | 0.65 | 17.65 | 47.06 | 70.59 |
| miRNA | TRIPLE | 81.48 | 88.89 | 94.07 | 97.04 |
| miRNA | RNAfold | 0.00 | 7.41 | 65.93 | 97.78 |
| telomerase | TRIPLE | 29.41 | 35.29 | 41.18 | 58.82 |
| telomerase | RNAfold | 0.00 | 23.53 | 41.18 | 58.82 |
| RNase | TRIPLE | 50.70 | 70.42 | 81.69 | 92.25 |
| RNase | RNAfold | 1.41 | 12.68 | 34.51 | 59.15 |
| regulatory | TRIPLE | 22.41 | 24.14 | 32.76 | 56.90 |
| regulatory | RNAfold | 0.00 | 6.90 | 27.59 | 63.79 |
| tmRNA | TRIPLE | 18.64 | 32.20 | 45.76 | 55.93 |
| tmRNA | RNAfold | 1.69 | 10.17 | 33.90 | 50.85 |
| rRNA | TRIPLE | 36.16 | 50.62 | 70.87 | 83.06 |
| rRNA | RNAfold | 1.45 | 15.70 | 35.33 | 56.82 |

Random sequences were obtained with di-nucleotide shuffling of the real ncRNA sequences.

Comparisons of TRIPLE and RNAfold by the percentages of sequences falling in each Z-score category.

| **Dataset** | **Method** | **≥ 2 (%)** | **≥ 1.5 (%)** | **≥ 1 (%)** | **≥ 0.5 (%)** |
|---|---|---|---|---|---|
| Hh1 | TRIPLE | 6.67 | 33.33 | 53.33 | 73.33 |
| Hh1 | RNAfold | 0.00 | 0.00 | 20.00 | 53.33 |
| sno_guide | TRIPLE | 14.91 | 25.43 | 41.10 | 57.95 |
| sno_guide | RNAfold | 1.47 | 7.33 | 24.21 | 44.01 |
| sn_splice | TRIPLE | 31.65 | 43.04 | 56.96 | 65.82 |
| sn_splice | RNAfold | 6.33 | 24.05 | 53.16 | 68.35 |
| SRP | TRIPLE | 32.47 | 45.45 | 55.84 | 68.83 |
| SRP | RNAfold | 5.19 | 29.87 | 59.74 | 77.92 |
| tRNA | TRIPLE | 24.07 | 45.31 | 64.25 | 79.47 |
| tRNA | RNAfold | 0.00 | 6.19 | 26.19 | 48.85 |
| intron | TRIPLE | 59.81 | 68.22 | 74.77 | 84.11 |
| intron | RNAfold | 1.87 | 16.82 | 58.88 | 85.98 |
| riboswitch | TRIPLE | 32.03 | 44.44 | 56.86 | 71.90 |
| riboswitch | RNAfold | 1.31 | 20.92 | 49.67 | 71.24 |
| miRNA | TRIPLE | 75.56 | 81.48 | 90.37 | 93.33 |
| miRNA | RNAfold | 0.74 | 10.37 | 69.63 | 97.78 |
| telomerase | TRIPLE | 23.53 | 29.41 | 41.18 | 58.82 |
| telomerase | RNAfold | 5.88 | 17.65 | 35.29 | 58.82 |
| RNase | TRIPLE | 38.03 | 56.34 | 72.54 | 87.32 |
| RNase | RNAfold | 1.41 | 15.49 | 35.92 | 61.27 |
| regulatory | TRIPLE | 18.97 | 25.86 | 31.03 | 51.72 |
| regulatory | RNAfold | 0.00 | 5.17 | 32.76 | 67.24 |
| tmRNA | TRIPLE | 15.25 | 27.12 | 38.98 | 57.63 |
| tmRNA | RNAfold | 0.00 | 11.86 | 35.59 | 45.76 |
| rRNA | TRIPLE | 34.09 | 47.31 | 64.88 | 79.96 |
| rRNA | RNAfold | 1.86 | 17.98 | 37.60 | 57.64 |

Random sequences were obtained with single nucleotide shuffling of the real ncRNA sequences.

When the "noLP" and "noCloseGU" options are specified for RNAfold, TRIPLE outperforms RNAfold on 13, 13, 12, and 11 of the di-nucleotide shuffling datasets, and on 13, 13, 13, and 11 of the single nucleotide shuffling datasets, at thresholds 2, 1.5, 1, and 0.5, respectively. If "noLP" and "noGU" are specified instead, our method performs better on all di-nucleotide shuffling and single nucleotide shuffling datasets at all four thresholds.

We also compared TRIPLE, NUPACK, and RNAfold on some real genome background tests. Several genome sequences from bacteria, archaea, and eukaryotes were retrieved from the NCBI database. Using these genome sequences, we created genome backgrounds for the 13 ncRNA datasets. In particular, for each RNA sequence from the 13 ncRNA datasets, 100 sequence segments of the same length were sampled from each genome sequence and used as the background against which the base pairing entropy and Z-score of the RNA sequence were calculated. With such genome backgrounds, the overall performance of TRIPLE on the 13 ncRNA datasets is mixed and is close to that of NUPACK and RNAfold (data not shown). This performance on real genomes indicates that there is still a gap between the ability of our method and successful ncRNA gene finding. Nevertheless, the test results suggest that the constrained "triple base pairs" model is necessary but not sufficient, and that incorporating further structural constraints may improve the effectiveness of ncRNA search on real genomes.
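The genome background construction can be sketched as below. The text does not say how segment positions were chosen, so uniform sampling of start positions with replacement is an assumption, and the function name and seed parameter are hypothetical.

```python
import random


def sample_background(genome, length, count=100, seed=0):
    """Sample `count` segments of the given length from one genome sequence.

    Assumption: start positions are drawn uniformly, with replacement.
    """
    if length > len(genome):
        raise ValueError("segment length exceeds genome length")
    rng = random.Random(seed)
    starts = [rng.randrange(len(genome) - length + 1) for _ in range(count)]
    return [genome[s:s + length] for s in starts]
```

The sampled segments then play the same role as the shuffled sequences: their entropies supply the mean and standard deviation for the Z-score of the ncRNA of matching length.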

To roughly evaluate the speed of the three tools, the running time for 101 sequences, comprising 1 real miRNA sequence and its 100 single nucleotide shuffled sequences, was measured on a Linux machine with an Intel dual-core CPU (E7500, 2.93 GHz). Each sequence has 100 nucleotides. TRIPLE, NUPACK, and RNAfold took 20.7 seconds, 36.2 seconds, and 3.4 seconds, respectively. We point out that TRIPLE has the potential to be optimized for each specific grammar to improve its efficiency.

Discussion

This work introduced a modified ensemble of ncRNA secondary structures, constrained so that only canonical base pairs occur and stems are energetically stable in all the alternative structures. Comparisons between our program TRIPLE and energy-model-based software (NUPACK and RNAfold), which is implemented on the canonical structure ensemble, demonstrate a significant improvement in the entropy measure for ncRNA fold certainty under our model. In particular, an improvement of the entropy Z-scores was shown across almost all 13 tested ncRNA datasets previously used to test various ncRNA measures

We note that there is only one exceptional case observed from Table

To ensure that the performance difference between TRIPLE and energy model based software (NUPACK and RNAfold) was

Since the entropy Z-score improvement by our method was not uniform across the 13 ncRNAs, one may want to look into additional factors that might have contributed to the under-performance on certain ncRNAs. For example, the averaged GC contents differ among these 13 datasets, with SRP RNAs having 58% GC and a standard deviation of 10.4%. A sequence with a high GC content is more likely to produce spurious alternative structures, possibly resulting in a higher base pairing entropy. However, since randomly shuffled sequences have the same GC content, it is very difficult to determine whether the entropies of these sequences have been considerably affected by the GC bias. Indeed, previous investigations

Technically the TRIPLE program was implemented with an SCFG that assumes stems to have at least three consecutive canonical base pairs. Yet, as we pointed out earlier, the performance results should hold for a constrained Boltzmann ensemble in which stems are required to be energetically stable. This constraint of stable stems was intended to capture the energetic stability of helical structures in the native tertiary fold
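To make the stem constraint concrete, the sketch below checks whether a given secondary structure lies inside the constrained space, i.e. whether every stem has at least three consecutive canonical base pairs. It is an illustration only: the dot-bracket input format and the function names are assumptions, the structure is assumed to be balanced and pseudoknot-free, and this is a membership check, not the SCFG that TRIPLE implements.

```python
CANONICAL = {"GC", "CG", "AU", "UA", "GU", "UG"}


def stems(structure):
    """Group the base pairs of a dot-bracket string into stems.

    Pairs (i, j) and (i + 1, j - 1) are stacked and belong to the same stem.
    """
    stack, pairs = [], []
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            pairs.append((stack.pop(), pos))
    pairs.sort()  # order by opening position
    result = []
    for i, j in pairs:
        if result and result[-1][-1] == (i - 1, j + 1):
            result[-1].append((i, j))
        else:
            result.append([(i, j)])
    return result


def satisfies_stem_constraint(seq, structure, min_pairs=3):
    """True if every stem has >= min_pairs stacked, canonical base pairs."""
    for stem in stems(structure):
        if len(stem) < min_pairs:
            return False
        if any(seq[i] + seq[j] not in CANONICAL for i, j in stem):
            return False
    return True
```

A hairpin such as `(((....)))` over `GGGAAAACCC` passes the check, whereas a two-pair stem or a stem containing a non-canonical pair does not.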

Conclusions

We present work developing structure measures that can effectively distinguish ncRNAs from random sequences. We compute Shannon base pairing entropies based on a constrained secondary structure model that favors tertiary folding. Experimental results indicate that our approach significantly improves the Z-score of base pairing Shannon entropies on 13 ncRNA datasets

Method and model

Our method to distinguish ncRNAs from random sequences is based on measuring the base pairing Shannon entropy

Energetically stable stems

A stem is the atomic, structural unit of the new secondary structure space. To identify the energy levels of stems suitable to be included in this model, we conducted a survey on the 51 sets of ncRNA seed alignments, representatives of the ncRNAs in Rfam

**Percentages of free-energy of stems**. Percentages of free-energy of stems from 51 Rfam datasets (percentages of stems with free-energy less than -12 are not given in this figure).

**Cumulative percentages of free-energy of stems**. Cumulative percentages of free-energy of stems from 51 Rfam datasets (cumulative percentages of stems with free-energy less than -12 are not given in this figure). Note the step at -3.4.

The peaks (with relatively high percentages) on the percentage curve of Figure

Based on this survey, we were able to identify two energy thresholds: -3.4 and -4.6 kcal/mol for

The RNA secondary structure model

In the present study, a secondary structure model is defined with a Stochastic Context Free Grammar (SCFG)

(1)

(4)

(7)

where capital letters are non-terminal symbols that define substructures and lowercase letters are terminals, each being one of the four nucleotides A, C, G, and U.

The starting non-terminal,

Probability parameter calculation

There are two sets of probability parameters associated with the induced SCFG. First, we used a simple scheme of probability settings for the unpaired bases and base pairs, with a uniform 0.25 probability for every base. The probability distribution of {0.25, 0.25, 0.17, 0.17, 0.08, 0.08} is given to the six canonical base pairs G-C, C-G, A-U, U-A, G-U, and U-G; a probability of zero is given to all non-canonical base pairs. Alternatively, probabilities for unpaired bases and base pairs may be estimated from available RNA datasets with known secondary structures

Second, we computed the probabilities for the production rules of the model as follows. To allow our method to be applicable to all structural ncRNAs, we did not estimate the probabilities based on a training data set. In fact, we believe that the probability parameter setting of an SCFG for the fold certainty measure should be different from that for fold stability measure (i.e., folding). Based on the principle of maximum entropy, we developed the following approach to calculate the probabilities for the rules in our SCFG model.

Let _{i }

Let

be the geometric average of the six base pair probabilities. According to the principle of maximum entropy, given we have no prior knowledge of a probability distribution, the assumption of a distribution with the maximum entropy is the best choice, since it will take the smallest risk
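For concreteness, the geometric average of the six base pair probabilities quoted above can be computed as in this small sketch. The variable names are hypothetical; only the probability values come from the text.

```python
import math

# Emission probabilities for the six canonical base pairs, as given in the text;
# every non-canonical pair has probability zero.
PAIR_PROB = {
    ("G", "C"): 0.25, ("C", "G"): 0.25,
    ("A", "U"): 0.17, ("U", "A"): 0.17,
    ("G", "U"): 0.08, ("U", "G"): 0.08,
}


def geometric_average(values):
    """Geometric mean: the k-th root of the product of k positive values."""
    values = list(values)
    return math.prod(values) ** (1.0 / len(values))


pair_geo_avg = geometric_average(PAIR_PROB.values())  # roughly 0.15
```

The six values sum to 1, and their geometric average (about 0.15) is lower than their arithmetic average of 1/6 because the distribution is non-uniform.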

From above equations, it follows that

Computing base pairing Shannon entropy

Based on the new RNA secondary structure model, we can compute the fold certainty of any given RNA sequence, which is defined as the Shannon entropy measured on base pairings formed by the sequence over the specified secondary structure space Ω. Specifically, let the sequence be $x = x_1 x_2 \ldots x_n$, and let $p_{i,j}$ be the probability that bases $x_i$ and $x_j$ form a base pair.

where _{i}_{j}_{i,j}
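Given base pairing probabilities, the entropy itself is a short computation. The sketch below uses the plain Shannon form, a sum of $-p \log p$ over all pairs $i < j$ with the natural logarithm; whether the measure is additionally normalized (for example by sequence length) is not recoverable here, so the unnormalized form is an assumption, as are the function name and the dictionary input format.

```python
import math


def base_pairing_entropy(pair_probs):
    """Shannon entropy over base pairing probabilities.

    pair_probs maps (i, j) with i < j to the probability that bases i and j
    pair. Terms with probability 0 contribute nothing (limit of p * log p).
    """
    return -sum(p * math.log(p) for p in pair_probs.values() if p > 0.0)
```

A sequence whose pairs are all certain (probability 1) or impossible (probability 0) has entropy 0, which corresponds to maximal fold certainty; probabilities spread across many alternative pairings raise the entropy.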

To compute the expected frequency of the base pairing, $p_{i,j}$

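i.e., the total probability for the sequence segment $x_i x_{i+1} \ldots x_j$. Let $S_0$ be the initial nonterminal symbol for the SCFG model. Then the inside probability of $S_0$ over positions $1$ through $n$ is the total probability of the whole sequence under the model.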

The

i.e., the total probability for the whole sequence $x_1 \ldots x_n$

**Illustration of the application of the generic production rule**. Illustration of the application of the generic production rule _{0 }derives _{1}_{2 }... _{i}_{-1}_{k}_{+1 }... _{n}

_{i,j}_{i}_{j}

where

in which variables

The efficiency to compute _{i,j}^{3}) for a model of
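To illustrate where the cubic running time of this kind of dynamic programming comes from, here is a sketch of an inside computation on a deliberately small toy grammar, S → ε | a S | a S b S, where a and b form a canonical pair. This is not the grammar used by TRIPLE (it has no minimum stem length), the rule probabilities are hypothetical placeholders, and only the inside pass is shown; the base and base pair emission probabilities are the ones quoted in the text.

```python
PAIR_PROB = {
    ("G", "C"): 0.25, ("C", "G"): 0.25,
    ("A", "U"): 0.17, ("U", "A"): 0.17,
    ("G", "U"): 0.08, ("U", "G"): 0.08,
}
BASE_PROB = 0.25  # uniform emission probability for an unpaired base

# Hypothetical rule probabilities for the toy grammar (must sum to 1).
RULE = {"end": 0.2, "unpaired": 0.5, "pair": 0.3}


def inside_total(seq):
    """Total probability that the toy grammar derives seq (inside algorithm).

    inside[i][j] is the total probability that S derives seq[i:j].
    """
    n = len(seq)
    inside = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        inside[i][i] = RULE["end"]  # S -> epsilon on an empty segment
    for length in range(1, n + 1):              # O(n) segment lengths
        for i in range(0, n - length + 1):      # O(n) start positions
            j = i + length
            # S -> a S : first base of the segment is unpaired
            total = RULE["unpaired"] * BASE_PROB * inside[i + 1][j]
            # S -> a S b S : first base pairs with position k
            for k in range(i + 1, j):           # O(n) pairing partners
                emit = PAIR_PROB.get((seq[i], seq[k]), 0.0)
                if emit:
                    total += (RULE["pair"] * emit
                              * inside[i + 1][k] * inside[k + 1][j])
            inside[i][j] = total
    return inside[0][n]
```

The three nested loops over segment length, start position, and pairing partner give the cubic dependence on sequence length; a complementary outside pass over the same table would be needed to turn these quantities into base pairing probabilities.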

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

YW contributed to grammar design, algorithm development, program implementation, data acquisition, tests, result analysis, and manuscript drafting. AM contributed to algorithm design and program implementation. PS and TIS contributed to data acquisition and tests. YWL participated in model discussion. RLM contributed to the supervision, data acquisition, results analyses, biological insights, and manuscript drafting. LC conceived the overall model and algorithm and drafted the manuscript. All authors read and approved the manuscript.

Acknowledgements

This research project was supported in part by NSF MRI 0821263, NIH BISTI R01GM072080-01A1 grant, NIH ARRA Administrative Supplement to NIH BISTI R01GM072080-01A1, and NSF IIS grant of award No: 0916250.

This article has been published as part of