Saint Joseph College, 1678 Asylum Avenue, West Hartford, CT 06117, USA

SABiosciences Corporation, 6951 Executive Way, Frederick, MD 21703, USA

Abstract

Background

RNA interference (RNAi) is a cellular mechanism in which a short/small double stranded RNA induces the degradation of its sequence specific target mRNA, leading to specific gene silencing. Since its discovery, RNAi has become a powerful biological technique for gene function studies and drug discovery. The very first requirement of applying RNAi is to design functional small interfering RNA (siRNA) that can uniquely induce the degradation of the targeted mRNA. It has been shown that many functional synthetic siRNAs share some common characteristics, such as GC content limitation and free energy preferences at both terminals, etc.

Results

Our three-phase algorithm was developed to design siRNA on a whole-genome scale based on those identified characteristics of functional siRNA. When this algorithm was applied to design short hairpin RNA (shRNA), the validated success rate of shRNAs was over 70%, almost double the rate reported for the TRC library. This indicates that the designs of siRNA and shRNA may share the same concerns. Further analysis of the shRNA dataset of 444 designs reveals that high free energy states of the two terminals have the largest positive impact on shRNA efficacy. Enforcing these energy characteristics at both terminals can further improve the shRNA design success rate to 83.1%. We also found that the 3' terminals of functional shRNAs are less likely to be involved in mRNA secondary structure formation.

Conclusion

Functional shRNAs prefer high free energy states at both terminals, and these states were found to have the largest positive impact on shRNA efficacy. In addition, the accessibility of the 3' terminal is another key factor in shRNA efficacy.

Background

RNA interference (RNAi) is a cellular mechanism in which a short/small double stranded RNA induces the degradation of its sequence specific target mRNA, leading to specific gene silencing. Since its discovery, RNAi has become a powerful biological technique for gene function studies and drug discovery

Another challenge is that most existing datasets cover chemically synthesized siRNA sequences. Currently there are two approaches to introduce siRNA sequences into cells. One is to transfect chemically synthesized siRNA sequences into cells. Though this approach is more frequently used, its drawback is that it cannot offer long-term gene suppression, and some mammalian cell types are resistant to the transfection methods

As suggested by Matveeva et al

This article further extends our analysis of the available shRNA dataset generated by SABiosciences. Although this dataset is biased, as it was generated specifically by our three-phase algorithm, the analysis reveals useful information that may help confirm the effectiveness of the rules used in the algorithm, modify the existing rules or rearrange them for better prediction, and identify new rules.

Results and discussion

The two sets of biological experiments tested a total of 444 shRNAs targeting 125 human genes. Of the 444 shRNAs, 316 were found to be functional (71.2%). Because experimentally measured suppression efficacy varies, we removed shRNAs whose efficacy is near 70% to make the functional and nonfunctional classes more reliable. Two removal ranges were used, 60–75% and 55–80%; shRNAs whose efficacy falls inside the chosen range are excluded. With the 60–75% removal range, 351 shRNAs remain for analysis, of which 268 are considered functional (efficacy >= 75%). With the 55–80% range, only 289 shRNAs remain, of which 221 are considered functional (efficacy >= 80%).
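The removal-range filtering can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `apply_removal_range`, the choice to treat the range as lower-inclusive and upper-exclusive, and the sample efficacies are all assumptions made for the example.

```python
def apply_removal_range(efficacies, lower, upper):
    """Split knockdown efficacies (in percent) into functional and nonfunctional groups.

    Values inside [lower, upper) are dropped as ambiguous, values >= upper
    count as functional, and values < lower count as nonfunctional.
    The boundary handling is an assumption; the text only states the ranges.
    """
    functional = [e for e in efficacies if e >= upper]
    nonfunctional = [e for e in efficacies if e < lower]
    return functional, nonfunctional


# Hypothetical efficacies, for illustration only.
sample = [92.0, 78.0, 74.9, 61.0, 58.0, 40.0]
func_a, nonfunc_a = apply_removal_range(sample, 60, 75)
func_b, nonfunc_b = apply_removal_range(sample, 55, 80)
```

The wider 55–80% range discards more borderline measurements, which is why it leaves a smaller but cleaner dataset.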

By default, the three-phase algorithm sets a selection cutoff such that only shRNAs scoring at least 7 points are selected for biological experiments. However, earlier experiments showed that there were a few genes for which our algorithm failed to design enough shRNAs

To obtain 7 points, an shRNA must pass multiple phase II filters. Which filters or rules contribute to the statistical bias of the 7-point cutoff? After analyzing all the design rules, we found that scoring at least 7 points is perfectly correlated with passing the f-dga filter (R = 1.0). The f-dga filter in phase II awards 1 point to any shRNA whose free energy (ΔG) of the 5-mer at the 3' terminal is no less than -3.2. This free energy, ΔG_{as-5} (the subscript as denotes the 3' terminal, ss denotes the 5' terminal, and -5 means the first 5 nucleotides), shows some bias between functional and nonfunctional shRNAs: functional shRNAs favor ΔG_{as-5} no less than -3.2 and disfavor ΔG_{as-5} less than -3.2 (p = 0.0046). This shows that the statistical bias behind the 7-point cutoff is due to the ΔG_{as-5} values of shRNAs, and suggests that requiring ΔG_{as-5} of at least -3.2 increases the probability of obtaining functional shRNAs at the expense of missing some functional ones (shown in Table

Higher ΔG_{as-5} increases the probability that shRNAs are functional.

| shRNA ΔG_{as-5} | Outcome | Number | Functional Rate |
| --- | --- | --- | --- |
| >= -3.2 | Functional | 155 | 81.6% (155/190) |
| >= -3.2 | Not functional | 35 | |
| < -3.2 | Functional | 66 | 66.7% (66/99) |
| < -3.2 | Not functional | 33 | |
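The significance of the ΔG_{as-5} split can be checked from the four counts alone. The sketch below assumes a Pearson chi-square test on the 2 × 2 table without continuity correction; the source does not say which variant was used, and the function name `chi_square_2x2` is introduced here for illustration.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p value for the table [[a, b], [c, d]].

    One degree of freedom, no continuity correction (an assumption).
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: ΔG_as-5 >= -3.2 and < -3.2; columns: functional, not functional.
stat, p_value = chi_square_2x2(155, 35, 66, 33)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
```

Under this assumption the p value comes out close to the reported 0.0046, which suggests an uncorrected test is at least consistent with the published figure.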

The free energy of the 3' terminal is computed with the first 5 terminal base pairs, as suggested by Ui-Tei and Levenkova. ΔG_{as-7} is a better representation of the free energy than ΔG_{as-5}. ΔG_{ss-6} and ΔG_{ss-7} also show significant correlation with shRNA efficacy (p values are 0.00447 and 0.00546 respectively). Judging from the p values, the high energy state at the 3' terminal (5' end of the antisense strand) differentiates functional from nonfunctional shRNAs better than the high energy state at the 5' terminal (5' end of the sense strand) does. This confirms the strand bias discovered before. ΔG_{as-7}, ΔG_{ss-6} or both can be used to predict the efficacy of shRNAs. The above results are summarized in Table

Using either ΔG_{as-7}, ΔG_{ss-6} or both can better predict the efficacy of shRNAs.

| Energy profile | ΔG_{as-5} | ΔG_{as-7} | ΔG_{ss-6} | ΔG_{as-7} or ΔG_{ss-6} or both |
| --- | --- | --- | --- | --- |
| Energy criteria | >= -3.2 | >= -6.6 | >= -7.0 | |
| P value of Chi-test | 0.0046 | 0.0005 | 0.0045 | 0.0000001 |
| True positive | 155 | 155 | 115 | 192 |
| False positive | 35 | 32 | 22 | 39 |
| True negative | 33 | 36 | 46 | 29 |
| False negative | 66 | 66 | 106 | 29 |
| ROC specificity | 0.49 | 0.53 | 0.68 | 0.43 |
| ROC sensitivity | 0.70 | 0.70 | 0.52 | 0.87 |

ROC: Receiver Operating Characteristic. True positives (TP) are shRNAs that are experimentally tested to be functional and meet the energy criteria. False positives (FP) are shRNAs that meet the energy criteria but are experimentally tested to be nonfunctional. True negatives (TN) are shRNAs that are nonfunctional in experiments and fail the energy criteria. False negatives (FN) are shRNAs that fail the energy criteria but are functional experimentally. ROC specificity = TN/(TN+FP); ROC sensitivity = TP/(TP+FN).
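The two ROC formulas are easy to apply to the confusion counts. The sketch below uses the counts for the combined criterion (ΔG_{as-7} or ΔG_{ss-6} or both); the function name `roc_metrics` and the extra `success_rate` value, TP/(TP+FP), are introduced here for illustration and are not named in the source.

```python
def roc_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, success_rate) from confusion counts."""
    sensitivity = tp / (tp + fn)   # share of functional shRNAs that meet the criteria
    specificity = tn / (tn + fp)   # share of nonfunctional shRNAs that fail the criteria
    success_rate = tp / (tp + fp)  # share of selected shRNAs that are functional
    return sensitivity, specificity, success_rate


# Counts for the combined criterion.
sens, spec, rate = roc_metrics(tp=192, fp=39, tn=29, fn=29)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}, success rate = {rate:.1%}")
```

With these counts the success rate works out to 192/231, which matches the 83.1% design success rate quoted in the abstract. The gain in sensitivity comes at the cost of lower specificity than any single-terminal criterion.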

Without any removal range, all the shRNAs with ΔG_{as-7} >= -6.6, ΔG_{ss-6} >= -7.0, or both have an average suppression efficacy of 76.98%, while shRNAs with no high energy state at either end have an average suppression efficacy of 65.83%. A two-sample t-test shows that the difference between the two averages is highly significant (p = 0.0000047). This confirms that using ΔG_{as-7}, ΔG_{ss-6} or both could significantly improve the selection of functional shRNAs.

Recent research has shown that the siRNA sequence characteristics could be helpful in predicting siRNA efficacy

• The free energy of the first 7 base pairs of the antisense strand (if >= -6.6, the value is 1; otherwise the value is -1).

• The free energy of the first 6 base pairs of the sense strand (if >= -7.0, the value is 1; otherwise the value is -1).

• At each of the 21 positions, the nucleotide is one of A, C, G or T, and its presence is encoded by 4 values. For example, for A the 4 parameters are 1 0 0 0, while for G they are 0 0 1 0. So there are 84 parameters for the 21 nucleotides.

• For each pair of nearest neighbors, there are 4 × 4 = 16 parameters. For example, for AC they are 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0. So there are 16 × 20 = 320 parameters.
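The parameter list above adds up to 2 + 84 + 320 = 406 values per shRNA. A minimal sketch of such an encoding follows; the function name `encode_shrna`, the ordering of the blocks within the vector, and the sample sequence and free energies are assumptions for illustration, while the A, C, G, T ordering follows the examples given for A, G and AC.

```python
NUCLEOTIDES = "ACGT"  # order implied by the examples: A = 1 0 0 0, G = 0 0 1 0


def encode_shrna(seq, dg_as7, dg_ss6):
    """Encode a 21-nt sequence and two terminal free energies as 406 parameters."""
    if len(seq) != 21 or any(base not in NUCLEOTIDES for base in seq):
        raise ValueError("expected a 21-nt sequence over A, C, G, T")
    # Two energy flags: +1 if the terminal meets its threshold, otherwise -1.
    features = [1 if dg_as7 >= -6.6 else -1, 1 if dg_ss6 >= -7.0 else -1]
    # One-hot encoding of each position: 21 x 4 = 84 parameters.
    for base in seq:
        one_hot = [0] * 4
        one_hot[NUCLEOTIDES.index(base)] = 1
        features.extend(one_hot)
    # One-hot encoding of each nearest-neighbor pair: 20 x 16 = 320 parameters.
    for first, second in zip(seq, seq[1:]):
        pair = [0] * 16
        pair[4 * NUCLEOTIDES.index(first) + NUCLEOTIDES.index(second)] = 1
        features.extend(pair)
    return features


# Hypothetical sequence and free energies, for illustration only.
vector = encode_shrna("ACGTACGTACGTACGTACGTA", dg_as7=-5.9, dg_ss6=-8.1)
print(len(vector))
```

Because every position and every neighboring pair contributes exactly one nonzero entry, the encoding is very sparse, which may be one reason the fitted coefficients are unstable on a dataset of a few hundred shRNAs.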

For this analysis, the removal range of 60–75% was selected since it provides a larger sample size than the removal range of 55–80%. For each round of linear least-squares analysis, three-fourths of the shRNAs in the dataset were randomly selected as the experimental group to identify the best set of coefficients for the 406 parameters, while the remaining one-fourth were used to evaluate how well the parameters and the fitted coefficients predict shRNA efficacy. The experiment was repeated 8 times. The average prediction accuracy is about 68.55%, which cannot match the prediction obtained from the energy profiles alone. Nevertheless, across the 8 repetitions some parameters always showed a significant positive impact on shRNA efficacy while others always had a significant negative influence. These parameters are listed in Table

Sequence characteristics that consistently show either positive or negative impacts on shRNA efficacy in Least Square Analysis.

Positive Parameters

| Parameter | Position | Coefficient |
| --- | --- | --- |
| G | 3 | 8.01 |
| G | 5 | 9.11 |
| A | 11 | 7.59 |
| T | 14 | 6.16 |
| CA | 6–7 | 11.48 |
| CC | 8–9 | 11.63 |
| TG | 10–11 | 19.58 |
| GA | 13–14 | 31.32 |
| TA | 17–18 | 6.36 |
| TA | 18–19 | 24.11 |

Negative Parameters

| Parameter | Position | Coefficient |
| --- | --- | --- |
| TA | 4–5 | -35.04 |
| TA | 6–7 | -23.35 |
| GG | 8–9 | -21.00 |
| GC | 9–10 | -24.66 |
| GA | 17–18 | -15.08 |

Different parameter sets were also compiled in search of the best parameter set. For example, some parameter sets exclude energy profiles and some include GC ratio. It is worth noting that with different parameter sets, the coefficients change dramatically while the prediction accuracy does not improve. This makes us wonder whether the findings of the linear least-squares analysis are artifacts. However, we did notice that on average, nucleotide pairs have a larger impact than single nucleotides. For example, in Table

As there are reports that local RNA target structure influences siRNA efficacy

Initially we considered all possible palindromes and repeated sequences of length 7 or more that could involve any part of the 21 shRNA nucleotides. No significant results were found. We then considered only the possible palindromes and repeated sequences of length 8 or more that involve any part of the 6 base pairs from the 5' terminal or of the 7 base pairs from the 3' terminal. The number of possible palindromes of length 8 or more involving the 3' terminal shows statistical bias between functional and nonfunctional shRNAs: nonfunctional shRNAs tend to have more possible palindromes. This bias is most significant for palindromes of length 9 or more involving any part of the 7 base pairs from the 3' terminal. By Chi-Square test, the statistical significance is p = 0.011 with the removal range of 55–80% and p = 0.0001 with the removal range of 60–75%. Note that here we assumed that more possible palindromes imply a higher probability for the terminal to be involved in secondary structure formation. If the assumption is valid, then the above result implies that secondary structure involving the 3' terminal could negatively impact shRNA efficacy.

The above experiment targets the 7 base pairs at the 3' terminal. It is reasonable to ask whether other lengths of nucleotide sequence at the 3' terminal show similar statistical bias. Unsurprisingly, we found that all 3' terminal sequences of lengths 1 to 7 show similar statistical bias, i.e. for functional shRNAs the terminal sequences of lengths 1 to 7 are less likely to be involved in palindrome formation. The strongest statistical bias is found with the terminal sequence of length 6 (p < 0.000004 with the removal range of 60–75%, p = 0.0006 with the removal range of 55–80%). This discovery motivated us to combine it with the energy profile in order to further improve the efficacy prediction. Our investigation has yet to show that this statistical bias can further improve the prediction accuracy. This is not surprising, since a possible palindrome structure could affect the terminal energy state: the two variables, the terminal energy state and the possible secondary structure, are not independent of each other. A multivariate statistical analysis or a recursive partitioning approach might help shed more light on this in our future investigation.

Conclusion

The default setting of the three-phase algorithm is relatively stringent. Under this default setting, the algorithm cannot design shRNA sequences for approximately 8% of genes from the human RefSeq database

It has been confirmed by several studies that the free energy profile at the two terminals is the most critical factor relating to siRNA efficacy

Internal palindromes are one of several causes of secondary structure formation in RNA molecules. Our analysis found that shRNAs with more possible palindromes involving the 3' terminal tend to have lower efficacy, especially when the possible palindromes are of length 9 or more and involve the 7 terminal nucleotides. RNA secondary structure involving the terminal could limit the accessibility of the terminal, which might explain why secondary structure could negatively impact shRNA efficacy. However, our result is preliminary, since it is obtained with possible palindromes only and is purely statistical. Our future work will use the mfold software to elucidate more precisely the relationship between shRNA efficacy and RNA secondary structure. If more positive relationships are found, confirmation by biological experiments will follow.

Methods

Cell culture and shRNA delivery

293H cells (Invitrogen) were cultured in D-MEM supplemented with 10% FBS and 1× non-essential amino acids (Invitrogen) for no more than 15 passages. Gene specific shRNA sequences were designed using the three-phase algorithm

Real-time RT-PCR

cDNA was synthesized from total RNA using the ReactionReady™ First Strand cDNA Synthesis Kit. Real-time PCR was performed using RT2 SYBR Green qPCR Master Mixes on the Bio-Rad iCycler real-time PCR system or the Stratagene Mx3000 real-time PCR system.

Gene knockdown efficiency calculation

Detailed description of knockdown success rate and its calculation is given in

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

HZ carried out the statistical analysis of the data and drafted the manuscript. XZ participated in the biological data preparation and revised this manuscript critically. All authors read and approved the final manuscript.

Acknowledgements

This article has been published as part of