SIEE, China University of Mining and Technology, Xuzhou, China

Department of Electrical and Computer Engineering, University of Texas at San Antonio

Department of Pediatrics, University of Texas Health Science Center at San Antonio

Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio

Department of Epidemiology and Biostatistics, University of Texas Health Science Center at San Antonio

Abstract

Background

MicroRNAs (miRNAs) are single-stranded non-coding RNAs shown to play important regulatory roles in a wide range of biological processes and diseases. The functions and regulatory mechanisms of most miRNAs are still poorly understood, in part because of the difficulty in identifying miRNA regulatory targets. To this end, computational methods have evolved as important tools for genome-wide target screening. Although considerable work in the past few years has produced many target prediction algorithms, most of them are based solely on sequence, and their accuracy is still poor. In contrast, gene expression profiling from miRNA transfection experiments can provide additional information about miRNA targets. However, most existing research assumes that down-regulated mRNAs are targets. Given the fact that the primary function of miRNA is protein inhibition, this assumption is neither sufficient nor necessary.

Results

A novel Bayesian approach is proposed in this paper that integrates sequence-level prediction with expression profiling of miRNA transfection. This approach does not restrict the targets to be down-expressed and thus improves on the performance of existing target prediction algorithms. The proposed algorithm was tested on simulated data, proteomics data, and IP pull-down data and was shown to achieve better performance than existing approaches for target prediction. All the related materials including source code are available at

Conclusions

The proposed Bayesian algorithm properly integrates the sequence pairing data and mRNA expression profiles for miRNA target prediction. This algorithm is shown to have better prediction performance than existing algorithms.

Background

MicroRNAs (miRNAs) are single-stranded non-coding RNAs about 19 to 25 nucleotides in length. miRNAs are known to inhibit target translation or cleave target mRNA by binding to complementary sites in the 3’ untranslated region (UTR) of their targets. The importance of miRNA regulation lies in the fact that a miRNA is estimated to regulate hundreds of targets

Despite these efforts, the existing algorithms using sequence data alone still have poor prediction specificity and sensitivity

Microarray profiling of differential gene expression after miRNA transfection is a widely adopted approach to investigate the impact of miRNA regulation. Such gene expression profiles have been used in a variety of studies for predicting miRNA targets. However, the majority of existing research relies on the assumption that miRNA targets are down-expressed in the microarray and thus searches for potential targets within the intersection of sequence-level predictions and genes down-regulated in the microarray

To address the problem with the current practice in combining sequence prediction with microarray data, we present a novel Bayesian algorithm with the scheme shown in Figure


**Algorithm Block Diagram** The proposed algorithm consists of a sequence-based prediction module and an expression profile inference module. A Naïve Bayes model integrates the outputs of these two modules to generate the final prediction score.
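As a rough illustration of how a Naïve Bayes model can merge two evidence sources, the sketch below fuses a sequence-based probability and an expression-based probability under the assumption that the two measurements are conditionally independent given the target status and that both probabilities were derived with the same prior. The function name `combine_posteriors`, the prior value, and the example inputs are hypothetical and are not taken from the paper.

```python
def combine_posteriors(p_seq, p_expr, prior):
    """Naive Bayes fusion of two posteriors that share the same prior.

    With s and e conditionally independent given the target status t,
    P(t=1 | s, e) is proportional to P(t=1 | s) * P(t=1 | e) / P(t=1).
    """
    for name, value in (("p_seq", p_seq), ("p_expr", p_expr), ("prior", prior)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must lie strictly between 0 and 1")
    pos = p_seq * p_expr / prior
    neg = (1.0 - p_seq) * (1.0 - p_expr) / (1.0 - prior)
    return pos / (pos + neg)


# Illustrative (made-up) inputs: both modules rate the gene well above the prior.
print(combine_posteriors(0.30, 0.20, prior=0.01))
```

When one module is uninformative (its probability equals the prior), the fused value reduces to the other module's probability, which is a quick sanity check on the formula.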

Methods

Problem statement

For convenience, the mathematical definition of the problem is given first. For a given mRNA

where the second equality follows from the assumption that

Mapping of sequence level prediction scores to

Several target prediction algorithms based on sequence data exist. We adopt our own SVMicrO algorithm in this work since it has been shown to outperform other popular algorithms. Like most target prediction algorithms, SVMicrO produces a score

where _{0} and _{1} are the parameters to be trained.
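A minimal sketch of such a two-parameter logistic mapping is given below. The parameter names `b0` and `b1`, the gradient-ascent fitting routine, and the toy scores and labels are all assumptions made for illustration; the original symbols and the actual training procedure are not reproduced in this text.

```python
import math


def logistic(score, b0, b1):
    """Map a raw prediction score to a probability in (0, 1)."""
    z = b0 + b1 * score
    # Branch on the sign of z so that exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(scores, labels, lr=0.1, n_iter=5000):
    """Fit b0 and b1 by batch gradient ascent on the Bernoulli log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for s, y in zip(scores, labels):
            err = y - logistic(s, b0, b1)
            g0 += err
            g1 += err * s
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Toy, made-up scores: label 1 = target, 0 = non-target (classes overlap on purpose).
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(scores, labels)
print(b0, b1, logistic(1.0, b0, b1))
```

The toy labels overlap deliberately: with perfectly separable data the maximum-likelihood slope is unbounded, so a real fit would need regularisation or a held-out set.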

Training

The training data used for training SVMicrO were adopted here to train


Curve of

Microarray Data Source of Negative Samples

| miRNA | GEO accession | miRNA | GEO accession |
| --- | --- | --- | --- |
| hsa-let-7c | GSM156557 | hsa-miR-128 | GSM210902 |
| hsa-miR-15a | GSM156545 | hsa-miR-132 | GSM210904 |
| hsa-miR-16 | GSM156546 | hsa-miR-133a | GSM210906 |
| hsa-miR-17 | GSM156553 | hsa-miR-142-3p | GSM210908 |
| hsa-miR-192 | GSM156547 | hsa-miR-148b | GSM210910 |
| hsa-miR-20a | GSM156554 | hsa-miR-7 | GSM210896 |
| hsa-miR-215 | GSM156548 | hsa-miR-9 | GSM210898 |
| hsa-miR-192 | GSM328290 | hsa-miR-34a | GSM187633, GSM187631 |
| hsa-miR-215 | GSM328291 | hsa-miR-34b | GSM190765 |
| hsa-miR-122 | GSM210900 | hsa-miR-34c-5p | GSM190758 |

Gaussian mixture models of expression profile

The gene expression profile of a miRNA transfection experiment contains the expression of both positive and negative targets, both of which need to be properly modeled. To this end, the empirical distributions of expression were first examined. To obtain the expression of verified targets, the verified targets of human miRNAs recorded in miRecords


**Histograms of Gene Expression Profiles**

where $\mu$ and $\sigma^{2}$ are the mean and variance of the respective Gaussian components, the subscripts $+$ and $-$ denote the positive and negative targets, the mixture coefficients satisfy $\pi_{+} + \pi_{-} = 1$, and **θ** represents the collection of the model parameters. Given model (3), the goal is to uncover the mixture components from the expression data, which is equivalent to estimating the parameters from the expression data. Note that since the number of positive targets is only in the hundreds,

Bayesian estimation of the Gaussian mixture

Under the Bayesian framework, the goal of estimating the model parameters **θ** is to obtain the posterior distribution

** θ**|

where ** θ**|e

where **e**p_{}**e**_{+}

the informative prior can be shown to be

where

_{p}^{2} are the sample mean and variance of **e _{p}**, and all other parameters with subscript 0 are the same as those in (5), which define the noninformative prior. Next, for the noninformative priors in (5) and (6), the parameters are chosen as:

_{0} = 0, _{0} = 0.2, _{0} = 0.2, _{0} = 0.2.

Lastly, the parameters of the Dirichlet prior are chosen as _{+},0 = 200 and

Since the likelihood assumes the mixture model in (3), the posterior distribution cannot be obtained analytically. A Variational Bayes Expectation Maximization (VBEM) algorithm is applied to estimate the desired distributions.

Variational Bayes expectation maximization algorithm

Since the expression levels of the genes are assumed to be i.i.d. and to follow the Gaussian mixture (3), the parameters should be estimated from the gene expression profile of all genes **e** = {

where, as above, the inequality is due to Jensen's inequality,

**VBE Step:**

**VBM Step:**

where ** π**) and


**Weighted Distributions of Positive and Negative Components with Parameters Estimated from Data** The parameters of both the positive and negative components are estimated by the VBEM algorithm.
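To make the mixture-fitting step concrete, the sketch below fits a one-dimensional two-component Gaussian mixture with plain maximum-likelihood EM. This is a simpler stand-in and not the VBEM procedure described above: it uses no priors and returns point estimates rather than posterior distributions. The simulated component means, variances, and weights are hypothetical values chosen only so that the example runs.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(data, n_iter=100):
    """Plain EM for a 1-D two-component Gaussian mixture.

    Component 0 is initialised at the smallest observation and component 1
    at the largest. Returns (weights, means, variances).
    """
    n = len(data)
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    weights = [0.5, 0.5]
    means = [min(data), max(data)]
    variances = [grand_var, grand_var]
    for _ in range(n_iter):
        # E step: responsibility of component 0 for every observation.
        resp0 = []
        for x in data:
            a = weights[0] * normal_pdf(x, means[0], variances[0])
            b = weights[1] * normal_pdf(x, means[1], variances[1])
            total = a + b
            resp0.append(a / total if total > 0.0 else 0.5)
        # M step: re-estimate weight, mean and variance of each component.
        for k in (0, 1):
            r = resp0 if k == 0 else [1.0 - v for v in resp0]
            mass = sum(r)
            weights[k] = mass / n
            means[k] = sum(ri * x for ri, x in zip(r, data)) / mass
            var_k = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, data)) / mass
            variances[k] = max(var_k, 1e-6)  # floor avoids a collapsing component
    return weights, means, variances


# Simulated fold changes: a small shifted component plus a large one near zero.
random.seed(0)
data = [random.gauss(-2.0, 0.5) for _ in range(400)]
data += [random.gauss(0.0, 0.5) for _ in range(1600)]
weights, means, variances = em_two_gaussians(data)
print(weights, means, variances)
```

A variational treatment would replace the M-step point estimates with updates of the posterior hyperparameters, which is what lets prior knowledge about the small positive component stabilise the fit.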

Calculation of

With the estimated parameters,

where ^ represents the estimate of the corresponding parameter. Based on the parameters estimated by the VBEM algorithm,


**Curve of α(e) Obtained from the Gaussian Mixture Model from Real Data** The curve of
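One way to read α(e) is as the posterior probability that a gene with fold change e belongs to the positive component of the fitted mixture, obtained by applying Bayes' rule to the two weighted Gaussian densities. The sketch below assumes that reading; the function name `alpha` and every parameter value are hypothetical and are not the estimates reported in the paper.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def alpha(e, pi_pos, mu_pos, var_pos, mu_neg, var_neg):
    """Posterior probability of the positive component given fold change e."""
    pos = pi_pos * normal_pdf(e, mu_pos, var_pos)
    neg = (1.0 - pi_pos) * normal_pdf(e, mu_neg, var_neg)
    return pos / (pos + neg)


# Hypothetical parameter values, chosen only for illustration.
params = dict(pi_pos=0.02, mu_pos=-0.5, var_pos=0.2, mu_neg=0.0, var_neg=0.05)
for e in (-1.5, -1.0, -0.5, 0.0, 0.5):
    print(e, alpha(e, **params))
```

Because the score is a smooth function of e rather than a hard cut-off, a gene that is only mildly down-regulated still receives a non-zero probability instead of being discarded.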

Results and Discussion

Validation Based on Simulated Data

We first tested the proposed algorithm on a simulated data set. Specifically, we generated the sequence-level prediction scores of both positive and negative data from two Gaussian distributions, whose means and variances were chosen based on the prediction scores of SVMicrO on the real positive and negative targets. The expression fold change data were produced from the Gaussian mixture model, whose parameters were also chosen based on those fitted to the expression fold changes of real positive and negative targets. To reflect the imbalance between the positive and negative targets, 200 positive data and 19800 negative data were generated with distributions shown in Table

Distributions and parameters used to generate test data

| | sequence score | fold change | mixture coefficient |
| --- | --- | --- | --- |
| Positive | | | 1% |
| Negative | | | 99% |

Fitting of function

GMM parameters estimated by VBEM

| | fold change | mixture coefficient |
| --- | --- | --- |
| Positive | | 1.8% |
| Negative | | 98.2% |

Next, precision-recall (PR) curves were plotted to compare the performance of the combined method with algorithms relying only on either the sequence-level score or the expression fold change. Precision is the fraction of predicted targets that are true targets, while recall is the fraction of true targets that are recovered by the prediction. High precision often matters more to biologists because limited resources are best allocated to testing a set of predictions with a high chance of being true targets. However, recall is also important to ensure that all the true targets can be uncovered. Overall, the larger the area under the PR curve, the better the algorithm. As can be seen from Figure


**Precision Recall Curve Comparison Based on Simulated Data** This figure indicates that the performance of the proposed algorithm is better than that of methods using either sequence information or expression data alone.
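For readers who want to reproduce this kind of comparison, the sketch below computes the points of a PR curve from a ranked list of scores and summarises them with average precision, one common estimate of the area under the PR curve. The text does not state how the area was computed, so that choice is an assumption, as are the function names and the toy data.

```python
def precision_recall_points(scores, labels):
    """Return (precision, recall) after each cut-off in a score-ranked list.

    labels are 1 for true targets and 0 otherwise; tied scores keep input order.
    """
    n_true = sum(labels)
    if n_true == 0:
        raise ValueError("at least one true target is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points, hits = [], 0
    for k, (_, label) in enumerate(ranked, start=1):
        hits += label
        points.append((hits / k, hits / n_true))
    return points


def average_precision(points):
    """Sum of precision weighted by the increase in recall at each cut-off."""
    area, prev_recall = 0.0, 0.0
    for precision, recall in points:
        area += precision * (recall - prev_recall)
        prev_recall = recall
    return area


# Toy example with two true targets among five candidates.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 0, 0]
points = precision_recall_points(scores, labels)
print(points)
print(average_precision(points))
```

With strongly imbalanced classes, as in the 1% positive setting used here, PR curves are usually more informative than ROC curves because precision reacts directly to the number of false positives among the top predictions.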

Evaluation on real data

The proposed algorithm was applied to predict the targets of hsa-miR-1 and hsa-miR-124. The result was validated by the mass spectrometry data in

Sequence Score and Differential Expression Data Retrieval

3'UTR sequences of the human genome were downloaded from the UCSC Genome Browser MySQL database. Genome-wide prediction of the targets of hsa-miR-124 and hsa-miR-1 based on the sequence pairing data was carried out by SVMicrO. The prediction scores were recorded for each mRNA and then mapped to the APPs of being targets using the logistic function

Evaluation using Mass Spectrometry Data

To evaluate the performance, we first consulted the proteomics data of


**Cumulative Sum of Protein Fold Change of Top 50 Predictions of hsa-miR-124** This figure shows the result for the top 50 predictions, which indirectly reflects the prediction precision. In particular, the "Expression" approach simply uses mRNA expression as the score and ranks more strongly down-regulated genes higher in the list. We note from Figure


**Cumulative Sum of Protein Fold Change for Different Number of Top Ranked Predictions of hsa-miR-124** The cumulative sums for different numbers of top predictions are depicted for several algorithms. This figure shows that, beyond the top 50, the proposed algorithm has the largest downward fold change, which also suggests higher sensitivity for the proposed algorithm.
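The cumulative-sum evaluation can be sketched as follows: genes are ranked by prediction score, the top N are kept, and their protein fold changes are accumulated in rank order. The function name, the dictionary-based inputs, the gene identifiers, and the decision to skip top-ranked genes that lack a proteomics measurement are all assumptions made for this illustration.

```python
from itertools import accumulate


def cumulative_fold_change(scores, protein_log_fc, top_n):
    """Cumulative sum of protein log fold change over the top_n highest-scoring genes.

    scores and protein_log_fc map gene identifiers to numbers. Genes in the
    top_n that have no protein measurement are skipped.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    measured = [protein_log_fc[gene] for gene in ranked if gene in protein_log_fc]
    return list(accumulate(measured))


# Made-up scores and fold changes; gene "c" falls outside the top 3.
scores = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.5}
protein_log_fc = {"a": -0.5, "b": -0.25, "d": 0.125}
print(cumulative_fold_change(scores, protein_log_fc, top_n=3))
```

A more negative curve means that the top-ranked genes are, in aggregate, more strongly repressed at the protein level, which is why the curve serves as an indirect measure of precision.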


**Cumulative Sum of Protein Fold Change of Top 150 Predictions of hsa-miR-1** We note similar superior performance of the proposed approach as in Figure


**Cumulative sum of protein fold change for different number of top ranked predictions of hsa-miR-1** The cumulative sums for different numbers of top predictions are depicted for several algorithms. This figure shows that, beyond the top 100, the proposed algorithm has the largest downward fold change, which also suggests higher sensitivity for the proposed algorithm.

Precision-Recall (PR) Performance using IP pull-down data

Since the utility of the evaluation on proteomic data is limited by the coverage of the SILAC technology and the potential noise in protein quantification, we further validated the predictions for hsa-miR-1 and hsa-miR-124 using the immunoprecipitation (IP) pull-down data (Hendrickson, et al., 2008), which measure the potential targets recruited by Argonaute 2 (Ago2), an important component of the miRNA effector protein complexes. In this experiment, 59 and 388 genes were determined as high-confidence targets of hsa-miR-1 and hsa-miR-124, respectively, at a stringent FDR level of 0.01. We then treated these genes as the true targets and investigated the PR performance of different algorithms. The precision-recall curves of the proposed algorithm as well as SVMicrO, expression fold change, PicTar, miRanda, MirTarget, PITA and TargetScan were plotted as Figure


**Precision Recall Curves for the Predictions Tested on IP Pull-downs of hsa-miR-124** This figure shows a clear enhancement in both precision and recall compared to SVMicrO, the approach using expression data, and other sequence-based prediction algorithms. In addition, the overlapping method (black dot) improves the precision only slightly compared to SVMicrO and performs much worse than the proposed algorithm.


**Precision Recall Curves for the Predictions Tested on IP Pull-downs of hsa-miR-1** This figure again shows a performance improvement similar to that in Figure

Comparison with the Overlap Method

As mentioned before, most of the literature predicts targets from the overlap between sequence-level predictions and down-regulated mRNAs. The performance of this overlapping scheme was also evaluated. In Figure
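The overlap scheme can be sketched as a set intersection that yields a single operating point rather than a full PR curve. In the sketch below, the function names, both cut-off values, and the toy gene data are hypothetical; the thresholds used in the evaluation are not stated in this text.

```python
def overlap_prediction(seq_scores, expr_fold_change, score_cutoff, fc_cutoff):
    """Genes passing both the sequence-score cut-off and the down-regulation cut-off."""
    return {
        gene
        for gene, score in seq_scores.items()
        if score >= score_cutoff and expr_fold_change.get(gene, 0.0) <= fc_cutoff
    }


def precision_recall(predicted, true_targets):
    """Precision and recall of one predicted gene set against the true targets."""
    hits = len(predicted & true_targets)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(true_targets) if true_targets else 0.0
    return precision, recall


# Made-up data: g2 is a true target whose mRNA level is not reduced.
seq_scores = {"g1": 0.9, "g2": 0.8, "g3": 0.2, "g4": 0.7}
expr_fold_change = {"g1": -1.2, "g2": 0.1, "g3": -1.5, "g4": -0.8}
true_targets = {"g1", "g2"}

predicted = overlap_prediction(seq_scores, expr_fold_change, 0.5, -0.5)
print(sorted(predicted), precision_recall(predicted, true_targets))
```

In this toy case the intersection drops `g2`, a target with a high sequence score whose mRNA is not down-regulated, which is the kind of miss that a hard down-regulation filter makes unavoidable.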

Conclusions

In this paper, we presented a novel algorithm for miRNA target prediction that integrates sequence-level prediction results with microarray expression profiling of miRNA transfection. A Gaussian mixture model was designed to model the gene expression profiles of the positive and negative targets, and a Bayesian algorithm was devised to integrate the data. The validation results on both proteomics and IP pull-down data demonstrated the superior performance of the proposed algorithm.

Limitations and Future Work

Since our algorithm is designed to integrate sequence data with microarray measurements of miRNA transfection, target prediction can be carried out only for miRNAs for which both types of data are available. Since microarray measurements of miRNA transfection are not yet available for all miRNAs genome-wide, it is still infeasible to conduct genome-wide prediction using this algorithm. However, as miRNA transfection becomes increasingly popular and indispensable for miRNA target identification, integrating the two data types becomes highly desirable. In an effort to provide prediction results, we retrieved microarray data from around 20 miRNA over-expression experiments from the GEO database. The prediction results can be found in

Subsequent work will focus on two aspects: first, extending the predictions to more miRNAs once both types of data are accessible; and second, improving the mathematical model to further increase performance.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

HL, SJG, and YH conceived the idea. HL, YC, and YH worked out the detailed derivations. HL, DY, and LZ implemented the algorithm and performed the prediction. HL, DY, and YH wrote the paper.

Acknowledgements

Hui Liu and Lin Zhang are supported by the talent introduction project of China University of Mining and Technology, the set-sail project of China University of Mining and Technology, and the Fok Ying-Tung Education Foundation for Young Teachers (121066). Yidong Chen is supported by NCI Cancer Center grant P30 CA054174-17 and NIH CTSA 1UL1RR025767-01. Shou-Jiang Gao is supported by NIH grants CA096512 and CA124332. Yufei Huang is supported by NSF grant CCF-0546345 and NIH grant CA096512. Publication of this supplement was made possible with support from the International Society of Intelligent Biological Medicine (ISIBM).

This article has been published as part of