Department of Biology, Boston College, Chestnut Hill, MA 02467, USA

Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud XI, 91405 Orsay cedex, France

Laboratoire d'Informatique (LIX), Ecole Polytechnique, 91128 Palaiseau, France

Department of Mathematics and Computer Science, Denison University, Granville, OH 43023-0810, USA

Abstract

Background

Since RNA molecules regulate genes and control alternative splicing by

Results

Given an arbitrary RNA secondary structure _{0 }for an RNA nucleotide sequence _{1},..., _{n}_{0}, if the base pair distance between _{0 }and _{0 }can be approximated with accuracy _{0 }and for all values 0 ≤ _{0}. Computation time is ^{3 }· ^{2}), and memory requirements are ^{2 }·

Conclusions

The approximation of

Background

RNA secondary structure conformational switches play an essential role in a number of biological processes, such as regulation of viral replication

Since current riboswitch detection algorithms do not attempt to predict the location of the expression platform, we have developed tools,

In previous work **s**, the secondary structure **s **is said to be a _{0}, for all values 0 ≤ _{0}. (Note that _{0}, and where the ^{-1 }K^{-1 }is the universal gas constant, and _{0 }is the minimum free energy structure, the existence of one or more 'peaks', or values _{k }_{0 }- i.e., a potentially distinct meta-stable structure, as shown in Figure

**Output of ****on the 27 nt bistable switch with nucleotide sequence CUUAUGAGGG UACUCAUAAG AGUAUCC and initial structure S_{0}, the minimum free energy (MFE) structure .......((((((((....)))))))), with free energy -10.3 kcal/mol**. The 16-neighbor of

In

The sum in the numerator is taken over all secondary structures of the given RNA sequence, that contain base pair (

Our motivation in developing both _{0}, a complex portion of the code in

In this paper, we begin by showing rigorously how to approximate the output of _{0}; i.e., that secondary structure which has maximum expected accuracy over all structures that differ from _{0 }by exactly

Since the detection of conformational switches remains an open problem, despite the success of some tools such as

Results and discussion

In this paper, we describe the following new results; the 'Methods' section discusses them in greater detail and defines the concepts left unexplained here.

1. We describe a Python script

2. We prove that for any desired accuracy 0

structures are sampled, then

for all 0 ≤ _{k}_{k}

3. We develop an algorithm, ^{3 }· ^{2}) and space ^{2 }· _{0}; i.e., for each 0 ≤ _{k }_{0 }by exactly

Moreover,

We now describe the 13 figures and 4 tables, corresponding to computational experiments performed with

Figure _{k }_{k}_{k }_{k}_{0}. A secondary structure

**Boltzmann density plot for ****, along with approximating relative frequency plots for ****and****for the 101 nt RNA sequence UACUUAUCAA GAGAGGUGGA GGGACUGGCC CGCUGAAACC UCAGCAACAG AACGCAUCUG UCUGUGCUAA AUCCUGCAAG CAAUAGCUUG AAAGAUAAGU U for the SAM riboswitch aptamer with GenBank accession code **_{0 }is the initial structure (taken here to be the minimum free energy structure). The script _{k }_{k }_{0}. Finally, we compute the relative frequency of

**GENE ON (left) and GENE OFF (right) secondary structures for the 148 nt** XPT guanine riboswitch from

**Given riboswitch sequence X83878/168-267 and initial structure S_{0}, the minimum free energy structure, a structure output by**

**For each RNA sequence in the seed alignment from Rfam family RF00167 of purine riboswitch aptamers, we retrieved downstream flanking residues from the appropriate EMBL files, to make it likely that the expression platform was included**. Then the following six programs were run:

For each RNA sequence in the seed alignment from Rfam family RF00167 of purine riboswitch

_{0}. Subsequently, we applied the program _{1}, such the _{1}) structure for that RNA has the greatest structural similarity with the XPT

Figure _{1}, _{n }_{0}, computes the ^{3 }· ^{2}) and space ^{2 }· ^{2}) algorithm to sample structures from the ensemble of structures having high MEA scores - a maximum expected accuracy analogue of the sampling algorithm

**Figure depicting the increasing divergence between ****and **_{0}. We computed the base pair distance between the

**Sample outputs from ****on a 143 nt TPP-riboswitch, AF269819/1811-1669 with sequence CUACUAGGGG AGCCAAAAGG CUGAGAUGAA UGUAUUCAGA CCCUUAUAAC CUGAUUUGGU UAAUACCAAC GUAGGAAAGU AGUUAUUAAC UAUUCGUCAU UGAGAUGUCU UGGUCUAACU ACUUUCUUCG CUGGGAAGUA GUU**. We took the TPP riboswitch aptamer from the Rfam database _{0 }= _{0 }=

**(Left) Free energy for all MEA(k) structural neighbors, 0 ≤ k ≤ 99, of the TPP-riboswitch, AF269819/1811-1669, described in the previous figure**. Clearly,

**Initial portion of pseudocode for ****algorithm, which continues in Figure 11**. Given RNA sequence **s **= _{1}, _{n }_{0 }of **s**, _{0}, which maximizes the value ^{5}) time with ^{3}) space.

**Pseudocode for ****algorithm**. Given RNA sequence **s **= _{1}, _{n }_{0 }of **s**, _{0}, which maximizes the value ^{3}) time with ^{3}) space.

**Pseudocode for the O(n^{2}) traceback computed by our**

**(Left) Pseudo-Boltzmann and uniform probabilities of structural neighbors MEA(k) for the 49 nt SECIS sequence fdhA, with nucleotide sequence CGCCACCCUG CGAACCCAAU AAUAAAAUAU ACAAGGGAGC AAGGUGGCG and where S_{0} is ((((((((((...(((.................))).))).)))))))**. Here, the (formal) parameter

We now briefly describe Tables _{k}

Number of samples needed for high-confidence approximation of Boltzmann probabilities

| P | K | ε | z | N |
|---|---|---|---|---|
| 0.05 | 1 | 0.01 | 1.45 | 9506 |
| 0.05 | 100 | 0.01 | 3.48 | 30276 |
| 0.05 | 1000000 | 0.01 | 5.45 | 74256 |
| 0.001 | 100 | 0.01 | 3.89 | 37830 |
| 0.000001 | 100 | 0.01 | 5.73 | 82082 |
| 0.05 | 1 | 0.001 | 1.45 | 950600 |
| 0.05 | 100 | 0.001 | 3.48 | 3027600 |

The number _{k }_{k}| < ε _{0}, and _{k }

Comparison of

**index**

**EMBL**

**UNAFold**

| index | EMBL | | | | | | |
|---|---|---|---|---|---|---|---|
| 0 | AL591981/205922-205823 | -9.0 | 5.0 | -9.0 | -8.5 | -9.0 | -9.0 |
| 1 | CP000764/271074-271175 | -43.5 | 5.0 | -37.5 | -44.5 | -23.0 | -53.0 |
| 2 | CP000764/308099-308200 | -27.0 | -18.0 | -24.5 | -31.5 | -25.5 | -22.0 |
| 3 | BA000028/760473-760574 | -25.5 | -0.5 | -36.0 | -38.5 | -24.5 | -31.0 |
| 4 | CP000557/252200-252301 | -9.5 | 8.5 | -9.5 | 8.5 | -10.0 | -12.0 |
| 5 | X83878/168-267 | 60.0 | 87.5 | 57.0 | 66.0 | 64.0 | 59.0 |
| 6 | BA000004/1593074-1592973 | 35.0 | 16.5 | -13.5 | -21.5 | -19.0 | -13.5 |
| 7 | AAOX01000023/19446-19345 | -15.0 | -2.0 | -13.0 | -18.5 | -13.5 | -15.5 |
| 8 | CP000416/1798040-1798138 | 5.5 | 1.5 | 1.5 | 12.0 | 4.5 | -4.5 |
| 9 | CP000721/398929-399026 | 26.0 | 24.5 | 16.5 | -20.0 | 21.5 | -32.0 |
| 10 | BA000028/1103943-1104044 | 1.0 | 1.5 | 2.0 | -0.5 | 0.5 | 0.5 |
| 11 | ABDQ01000002/251055-251152 | -16.0 | -2.5 | -16.5 | -21.5 | -17.5 | -22.5 |
| 12 | AAXV01000026/31334-31233 | 11.5 | 6.0 | -1.5 | -8.5 | 22.0 | -3.0 |
| 13 | AE016877/298774-298875 | -18.5 | 14.0 | -17.5 | -34.0 | -12.0 | -26.5 |
| 14 | BA000004/676475-676576 | -28.5 | -31.0 | -28.0 | -69.0 | -21.0 | -29.5 |
| 15 | AE017333/692981-693082 | -1.5 | 2.5 | -11.5 | -9.5 | -5.5 | -53.0 |
| 16 | AM180355/256217-256318 | -17.0 | -45.0 | -45.5 | -49.0 | -48.0 | -49.0 |
| 17 | AM406671/1321062-1320965 | -25.5 | -15.0 | -22.0 | -28.5 | -23.5 | -23.5 |
| 18 | CP000612/2598111-2598012 | -42.0 | -39.5 | -42.0 | -47.5 | -39.0 | -38.5 |
| 19 | CP000002/697032-697134 | -8.0 | -11.0 | -10.5 | -10.0 | -4.5 | -7.5 |
| 20 | CP000002/2295936-2295837 | 23.5 | 47.0 | 31.5 | 21.0 | 30.0 | 22.5 |
| 21 | AL596170/223345-223246 | -0.5 | 7.0 | 0.5 | -8.5 | -10.0 | -10.0 |
| 22 | ABDQ01000005/131908-131807 | -33.0 | -15.5 | -31.5 | -31.5 | -19.0 | -50.0 |
| 23 | AAOX01000052/9069-8968 | -13.5 | 1.5 | -14.0 | -21.0 | -15.5 | -14.5 |
| 24 | AE017333/4024324-4024425 | -29.5 | -26.5 | -33.5 | -24.0 | -23.5 | -36.0 |
| 25 | AP006627/1554717-1554818 | -31.5 | -1.5 | -37.0 | -44.5 | -28.5 | -43.5 |
| 26 | CP000024/1182948-1183043 | -0.5 | -18.5 | -9.0 | 4.0 | 2.0 | -19.0 |
| 27 | BA000028/786767-786867 | -18.0 | -41.5 | -48.0 | -46.5 | -49.0 | -44.5 |
| 28 | ABDP01000002/29688-29587 | -34.5 | -42.5 | -34.5 | -37.0 | -35.0 | -50.0 |
| 29 | BA000043/272473-272574 | -9.5 | 4.0 | -9.5 | -10.0 | -3.0 | -12.5 |
| 30 | CP000724/944285-944386 | -30.5 | -21.5 | -30.5 | -28.5 | -26.5 | -31.5 |
| 31 | CP000764/1409725-1409826 | 14.0 | -3.0 | -18.0 | -24.0 | -11.5 | -20.0 |
| 32 | AAEK01000017/86437-86538 | -44.5 | -44.0 | -41.5 | -52.0 | -35.0 | -49.0 |
| 33 | CP000764/357645-357544 | 11.0 | -13.5 | -33.0 | -26.0 | -18.5 | -36.0 |

Comparison of

**Index**

**EMBL**

**UNAFold**

| index | EMBL | | | | | | |
|---|---|---|---|---|---|---|---|
| 0 | AL591981/205922-205823 | 27.5 | 28.5 | 28.5 | 25.5 | 25.5 | 25.5 |
| 1 | CP000764/271074-271175 | 13.0 | 12.5 | 11.0 | 6.5 | 12.0 | 5.5 |
| 2 | CP000764/308099-308200 | 24.0 | 26.0 | 26.5 | 23.0 | 24.5 | 26.5 |
| 3 | BA000028/760473-760574 | 18.5 | 22.0 | 13.0 | 20.5 | 23.5 | 23.0 |
| 4 | CP000557/252200-252301 | 7.0 | 8.0 | 7.0 | 10.0 | 6.5 | 4.5 |
| 5 | X83878/168-267 | 143.0 | 143.5 | 143.0 | 141.0 | 143.0 | 141.0 |
| 6 | BA000004/1593074-1592973 | 41.0 | 39.0 | 41.0 | 36.0 | 38.0 | 41.0 |
| 7 | AAOX01000023/19446-19345 | 47.5 | 45.5 | 46.0 | 42.5 | 34.0 | 43.5 |
| 8 | CP000416/1798040-1798138 | 17.5 | 12.5 | 12.5 | 13.0 | 11.5 | 12.5 |
| 9 | CP000721/398929-399026 | 36.5 | 20.5 | 23.0 | -38.5 | 34.5 | -52.5 |
| 10 | BA000028/1103943-1104044 | 32.0 | 29.5 | 32.0 | 27.5 | 30.5 | 30.0 |
| 11 | ABDQ01000002/251055-251152 | 27.0 | 26.0 | 26.5 | 24.0 | 25.5 | 7.5 |
| 12 | AAXV01000026/31334-31233 | 37.5 | 38.5 | 38.0 | 32.5 | 35.0 | 36.0 |
| 13 | AE016877/298774-298875 | 24.0 | 25.5 | 23.0 | 19.0 | 23.0 | 22.5 |
| 14 | BA000004/676475-676576 | 9.0 | 4.5 | 6.5 | -35.5 | 5.0 | 9.0 |
| 15 | AE017333/692981-693082 | -30.0 | -9.5 | -23.5 | -25.5 | -17.0 | -70.5 |
| 16 | AM180355/256217-256318 | -23.5 | -24.0 | -25.0 | -27.0 | -23.5 | -27.0 |
| 17 | AM406671/1321062-1320965 | -0.5 | 3.5 | 1.0 | -10.0 | 1.0 | 0.5 |
| 18 | CP000612/2598111-2598012 | -12.0 | -9.0 | -8.0 | -8.5 | -9.5 | -9.0 |
| 19 | CP000002/697032-697134 | 16.5 | 7.0 | 12.0 | 14.0 | 16.5 | 7.5 |
| 20 | CP000002/2295936-2295837 | 75.0 | 73.0 | 75.5 | 71.0 | 72.0 | 69.5 |
| 21 | AL596170/223345-223246 | 30.5 | 31.5 | 30.5 | 28.5 | 29.5 | 29.5 |
| 22 | ABDQ01000005/131908-131807 | 12.5 | 3.0 | 13.0 | 10.5 | 13.5 | 4.5 |
| 23 | AAOX01000052/9069-8968 | 12.5 | 14.5 | 13.5 | 11.0 | 12.0 | 12.0 |
| 24 | AE017333/4024324-4024425 | -3.5 | 2.5 | 3.5 | 6.0 | -2.5 | -1.5 |
| 25 | AP006627/1554717-1554818 | 22.5 | 18.0 | 22.5 | 14.5 | 25.5 | 12.5 |
| 26 | CP000024/1182948-1183043 | 6.0 | 7.0 | 6.5 | 6.0 | 5.0 | 6.0 |
| 27 | BA000028/786767-786867 | -23.5 | -19.5 | -23.0 | -24.5 | -21.0 | -24.0 |
| 28 | ABDP01000002/29688-29587 | 3.0 | 1.0 | 2.5 | 1.0 | 4.5 | 0.5 |
| 29 | BA000043/272473-272574 | 17.5 | 12.5 | 12.5 | 13.5 | 12.5 | 11.5 |
| 30 | CP000724/944285-944386 | 10.0 | 11.0 | 10.5 | 7.0 | 12.0 | 9.5 |
| 31 | CP000764/1409725-1409826 | 32.5 | 36.0 | 32.0 | 26.5 | 35.0 | 30.5 |
| 32 | AAEK01000017/86437-86538 | 11.5 | 11.5 | 13.0 | 8.0 | 13.0 | 11.0 |
| 33 | CP000764/357645-357544 | 23.5 | 22.0 | 24.5 | 24.0 | 22.0 | 22.5 |

Number of times that the most similar structure was produced

| Method | Greatest similarity to gene on | Greatest similarity to gene off |
|---|---|---|
| | 18 | 11 |
| | 7 | 11 |
| | 2 | 8 |
| | 3 | 2 |
| | 5 | 8 |
| | 1 | 3 |

Number of times that the most similar structure to the

The figures and tables show, in summary, that

Conclusions

We have applied the notion of _{1}, _{n }_{0}. Our software _{BP}(_{0},

and

Here, the expected accuracy score

where the first sum is taken over all base pairs (_{i,j }[resp. _{i}

Our preliminary investigations have not indicated a clear application of the partition function analogue, though it may be construed to provide a representation of the temperature-dependent

Indeed, in 18 [resp. 11] out of 34 instances,

Methods

Preliminaries

Recall the definition of RNA secondary structure.

**Definition 1 **_{1}, _{n }is defined to be a set of ordered pairs

_{i}_{j}
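As an illustration of Definition 1, the following minimal sketch checks whether a set of ordered pairs is a secondary structure. It assumes the conditions usually attached to this definition (Watson-Crick or GU wobble pairs, no base triples, no pseudoknots, and a hairpin threshold `theta`, with `theta = 3` as an assumed default). The function names are hypothetical and are not part of any software described in this article. The example reuses the 27 nt bistable switch and its minimum free energy structure quoted earlier.

```python
# Assumed pairing rules: Watson-Crick pairs plus the GU wobble pair.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def parse_dot_bracket(structure):
    """Return the set of 1-indexed base pairs (i, j), i < j, of a dot-bracket string."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % pos)
            pairs.add((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError("unexpected symbol %r" % symbol)
    if stack:
        raise ValueError("unbalanced '('")
    return pairs


def is_secondary_structure(seq, pairs, theta=3):
    """Check the assumed conditions on a set of 1-indexed ordered pairs."""
    positions = [pos for pair in pairs for pos in pair]
    if len(positions) != len(set(positions)):        # no base triples
        return False
    for i, j in pairs:
        if not 1 <= i < j <= len(seq):               # ordered, within the sequence
            return False
        if (seq[i - 1], seq[j - 1]) not in CANONICAL:  # allowed base pair
            return False
        if j - i <= theta:                           # hairpin threshold
            return False
    for i, j in pairs:                               # no pseudoknots (crossing pairs)
        for k, l in pairs:
            if i < k < j < l:
                return False
    return True


# 27 nt bistable switch and its MFE structure .......((((((((....))))))))
SEQ = "CUUAUGAGGGUACUCAUAAGAGUAUCC"
S0 = "." * 7 + "(" * 8 + "." * 4 + ")" * 8
PAIRS = parse_dot_bracket(S0)
print(len(PAIRS), is_secondary_structure(SEQ, PAIRS))
```

The pseudoknot test is quadratic in the number of base pairs, which is adequate for an illustration; a stack-based pass would be preferable for long sequences.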

The preceding definition provides for an inductive construction of the set of all secondary structures for a given RNA sequence _{1}, _{n}_{i}_{i+d }is defined as follows. If 0 ≤ _{i}_{i+d }is the empty structure containing no base pairs (due to the requirement that all hairpins contain at least

_{i}_{i+d-1 }is a secondary structure for _{i}_{i+d}, in which _{i+d }is unpaired.

_{i}_{j }_{i+1}, _{i+d-1}, the structure _{i}_{i+d}.

_{r}_{j }_{i}_{r-1 }and any secondary structure _{r+1}, _{j-1}, the structure _{i}_{i+d}.
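The inductive construction translates directly into a recursion that counts all secondary structures of a sequence: the last position is either unpaired, or paired with some earlier position r, which splits the interval into two independent parts. The sketch below is an illustration under assumed conventions (canonical and wobble pairs only, hairpin threshold `theta = 3`), with hypothetical function names.

```python
from functools import lru_cache

# Assumed pairing rules: Watson-Crick pairs plus the GU wobble pair.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def count_structures(seq, theta=3):
    """Count the secondary structures of seq (0-indexed positions internally)."""

    @lru_cache(maxsize=None)
    def count(i, j):
        # Intervals too short to hold a base pair admit only the empty structure.
        if j - i <= theta:
            return 1
        total = count(i, j - 1)                  # case 1: position j is unpaired
        for r in range(i, j - theta):            # case 2: (r, j) is a base pair
            if (seq[r], seq[j]) in CANONICAL:
                total += count(i, r - 1) * count(r + 1, j - 1)
        return total

    return count(0, len(seq) - 1)


print(count_structures("GAAAC"))   # the empty structure and the single pair G-C
print(count_structures("CUUAUGAGGGUACUCAUAAGAGUAUCC"))
```

Memoization keeps the number of evaluated intervals quadratic in the sequence length, so the count is obtained in cubic time.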

Given two secondary structures _{BP }_{BP }
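A common way to define base pair distance, assumed in this sketch, is the number of base pairs that belong to exactly one of the two structures, that is, the size of the symmetric difference of the two base pair sets. The helper names are hypothetical; the example structures are built from the minimum free energy structure of the 27 nt bistable switch quoted earlier.

```python
def base_pairs(structure):
    """Return the set of 1-indexed base pairs of a balanced dot-bracket string."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def bp_distance(s, t):
    """Base pair distance: number of pairs found in exactly one structure."""
    return len(base_pairs(s) ^ base_pairs(t))


S0 = "." * 7 + "(" * 8 + "." * 4 + ")" * 8        # MFE structure of the switch
OPEN_CHAIN = "." * 27                              # structure without base pairs
INNER = "." * 8 + "(" * 7 + "." * 4 + ")" * 7 + "."  # S0 without its outermost pair

print(bp_distance(S0, S0), bp_distance(S0, INNER), bp_distance(S0, OPEN_CHAIN))
```

Under this definition, a structure is a k-neighbor of `S0` exactly when `bp_distance` returns k.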

RNAbor-Sample

In this section, we describe how sampling from the Boltzmann ensemble, using

Let _{1}, _{n}_{k }_{k }_{0 }is

Let _{1}, _{n }_{0}. Following _{k }_{0}; i.e.,
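The relative frequency estimate can be sketched as follows: given a list of sampled structures, count how many lie at each base pair distance k from the initial structure and divide by the number of samples. The sample list below is made up for illustration and does not come from a real sampler; base pair distance is assumed to be the size of the symmetric difference of the base pair sets, and all function names are hypothetical.

```python
from collections import Counter


def base_pairs(structure):
    """Return the set of 1-indexed base pairs of a balanced dot-bracket string."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def relative_frequencies(s0, samples):
    """Map each distance k to the fraction of samples that are k-neighbors of s0."""
    reference = base_pairs(s0)
    counts = Counter(len(reference ^ base_pairs(s)) for s in samples)
    return {k: counts[k] / len(samples) for k in sorted(counts)}


S0 = "." * 7 + "(" * 8 + "." * 4 + ")" * 8
INNER = "." * 8 + "(" * 7 + "." * 4 + ")" * 7 + "."
OPEN_CHAIN = "." * 27

# Hypothetical sample of 10 structures standing in for Boltzmann samples.
SAMPLES = [S0] * 6 + [INNER] * 3 + [OPEN_CHAIN]
FREQ = relative_frequencies(S0, SAMPLES)
print(FREQ)
```

In practice the sample would contain thousands of structures, and the resulting frequencies would be compared with the exact Boltzmann probabilities of the k-neighbors.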

As usual, let _{1}, _{n}

and let

Given a desired approximation accuracy of _{k }_{k }

Consider the value _{k}_{k}_{k }_{k }_{k}_{k}_{k }

Before starting, we mention that it will suffice for our intended application of _{k }_{k }

Temporarily, we fix _{k}_{0}. Provided that we sample a number _{k }

Let _{α/2 }is defined by

If _{α/2}) and right tail (_{α/2}, +∞) for the normal approximation of the binomial distribution, then by a well-known argument (e.g. equation (24.35) on p. 529 of

It follows that

provides a sufficient lower bound on the number of samples necessary to guarantee margin of error

We have just shown that for

The following is now a key step. If we have _{0}, i.e., those structures _{0 }is _{k }_{k }_{k}| > ε_{k }_{0}, after sampling

then

Putting everything together, we have shown that for given

we have

We have completed a more rigorous argument using Chernoff bounds, but prefer the exposition given here for simplicity. Note that the same argument, given

We can make some basic conclusions from the above analysis. The number of samples sufficient to ensure that _{k }_{k}| < ε

In the case of one bin, it is important to remember that the value ^{-6}, then at least on the order of one million samples are needed, just for a reasonable probability of sampling the bin once.
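The sample-size argument can be illustrated numerically. The sketch below assumes the standard worst-case bound for a binomial proportion, N ≥ z² / (4ε²), where z is the two-sided normal critical value, and it assumes a Bonferroni-style correction in which the allowed failure probability `p` is divided among `bins` simultaneous estimates. The function and parameter names are hypothetical. With p = 0.05, 100 bins and ε = 0.01 it yields z ≈ 3.48 and N close to 30,000, in line with the corresponding row of the table of sample sizes.

```python
import math
from statistics import NormalDist


def critical_value(p, bins=1):
    """Two-sided normal critical value after splitting p over `bins` estimates."""
    return NormalDist().inv_cdf(1.0 - p / (2.0 * bins))


def samples_needed(p, bins, eps):
    """Smallest N with z^2 / (4 eps^2) <= N (worst case, success probability 1/2)."""
    z = critical_value(p, bins)
    return math.ceil(z * z / (4.0 * eps * eps))


for p, bins, eps in [(0.05, 100, 0.01), (0.05, 1000000, 0.01), (0.05, 100, 0.001)]:
    print(p, bins, eps, round(critical_value(p, bins), 2), samples_needed(p, bins, eps))
```

The bound grows only logarithmically in the number of bins but quadratically in 1/ε, so tightening the margin of error is far more expensive than estimating more bins at once.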

Algorithm description

Given an RNA sequence _{1}, _{n}_{0 }of

where the first sum is taken over all base pairs (_{i,j }_{i}_{i,j}
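To make the scoring concrete, the sketch below evaluates an expected accuracy score for a given structure. It assumes the commonly used form in which each base pair (i, j) contributes 2·alpha·p(i, j) and each unpaired position i contributes beta·q(i), where q(i) is one minus the sum of the pair probabilities involving i. The weights `alpha` and `beta`, the function names, and the toy probabilities are hypothetical.

```python
def unpaired_probabilities(n, pair_prob):
    """q[i] = 1 - sum of pair probabilities involving position i (1-indexed)."""
    q = [1.0] * (n + 1)          # index 0 is unused
    for (i, j), prob in pair_prob.items():
        q[i] -= prob
        q[j] -= prob
    return q


def expected_accuracy(n, pairs, pair_prob, alpha=1.0, beta=1.0):
    """Score a structure given as a set of 1-indexed base pairs."""
    q = unpaired_probabilities(n, pair_prob)
    paired = {pos for pair in pairs for pos in pair}
    score = sum(2.0 * alpha * pair_prob.get(pair, 0.0) for pair in pairs)
    score += sum(beta * q[i] for i in range(1, n + 1) if i not in paired)
    return score


# Toy example: 5 positions, a single possible base pair with probability 0.8.
PAIR_PROB = {(1, 5): 0.8}
WITH_PAIR = expected_accuracy(5, {(1, 5)}, PAIR_PROB)
EMPTY = expected_accuracy(5, set(), PAIR_PROB)
print(WITH_PAIR, EMPTY)
```

In the toy example the structure containing the likely base pair scores higher than the empty structure, because the two paired positions are rarely unpaired in the ensemble.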

The dynamic programming computation of _{i}_{j}_{0}, _{1 }of numbers, such that the following _{0}[_{i}_{j }_{j }_{0 }= _{1 }= 0. _{0}[_{i}_{j }_{0 }= 0 and _{1 }= _{0}[_{i}_{j }_{0 }neighbor of _{1 }neighbor of _{0}, _{1 }will be used in computing the traceback, where the maximum expected accurate structure that is a _{j }_{i}_{j }_{0}-neighbor of _{1}-neighbor of _{0 }+ _{1 }= _{r}_{j }

Pseudocode for the algorithm _{0}[_{0}[_{0}[_{0}, _{1}, such that: _{0}[_{0}-neighbor of _{0}[_{0 }= _{0}, then leaving _{0}[_{1}-neighbor of _{0}[_{1 }= _{1}, then adding the enclosing base pair (_{0}[_{0}-neighbor of _{0}[_{1}-neighbor of _{0}[_{0}, _{1 }such that _{0 }+ _{1 }=
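For very short sequences, what the algorithm computes can be cross-checked by exhaustive enumeration. The sketch below is a brute-force reference, not the dynamic program described here: it lists every secondary structure, groups the structures by base pair distance k to the initial structure, and keeps the highest-scoring one in each group. As a stand-in for Boltzmann base pair probabilities, which require an energy model, it uses uniform probabilities (the fraction of all structures containing a pair). The score is assumed to be 2·p(i, j) per base pair plus q(i) per unpaired position, and all names are hypothetical.

```python
from collections import Counter

# Assumed pairing rules: Watson-Crick pairs plus the GU wobble pair.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def structures(seq, theta=3):
    """Yield every secondary structure of seq as a frozenset of 0-indexed pairs."""
    def rec(i, j):
        if j - i <= theta:                       # too short for any base pair
            yield frozenset()
            return
        yield from rec(i, j - 1)                 # position j unpaired
        for r in range(i, j - theta):            # position j paired with r
            if (seq[r], seq[j]) in CANONICAL:
                for left in rec(i, r - 1):
                    for inner in rec(r + 1, j - 1):
                        yield left | inner | {(r, j)}
    yield from rec(0, len(seq) - 1)


def mea_neighbors(seq, s0, theta=3):
    """Return {k: best-scoring structure at base pair distance k from s0}."""
    everything = list(structures(seq, theta))
    counts = Counter(pair for s in everything for pair in s)
    p = {pair: c / len(everything) for pair, c in counts.items()}  # uniform p(i, j)
    q = [1.0] * len(seq)                                           # unpaired q(i)
    for (i, j), prob in p.items():
        q[i] -= prob
        q[j] -= prob

    def score(s):
        paired = {pos for pair in s for pos in pair}
        return (sum(2.0 * p[pair] for pair in s)
                + sum(q[i] for i in range(len(seq)) if i not in paired))

    best = {}
    for s in everything:
        k = len(s ^ s0)                          # base pair distance to s0
        if k not in best or score(s) > score(best[k]):
            best[k] = s
    return best


TOY_SEQ = "GGGAAACCC"
TOY_S0 = frozenset({(0, 8), (1, 7), (2, 6)})
BEST = mea_neighbors(TOY_SEQ, TOY_S0)
for k in sorted(BEST):
    print(k, sorted(BEST[k]))
```

The enumeration is exponential in the sequence length, which is why a dynamic program is needed for sequences of realistic size; the brute-force version is useful only as a test oracle on toy inputs.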

In a manner similar to the pseudocode of Figures

We have graphed the Boltzmann probabilities _{1,n }is the total number of secondary structures. When

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Acknowledgements

Funding for the research of P. Clote and F. Lou was provided by the Digiteo Foundation for the project

This article has been published as part of