Cranfield Health, Cranfield University, Vincent Building, Cranfield, UK

Computational Biology, GlaxoSmithKline Medicine Research Centre, Gunnels Wood Road, Stevenage, UK

Abstract

Background

MicroRNA (miRNA) directed gene repression is an important mechanism of posttranscriptional regulation. Comprehensive analysis of how miRNAs influence biological processes requires paired miRNA-mRNA expression datasets. However, a review of both the GEO and ArrayExpress repositories revealed few such datasets, in stark contrast to the large number of messenger RNA (mRNA) only datasets. It is of interest that numerous primary miRNAs (precursors of microRNA) are known to be co-expressed with coding genes (host genes).

Results

We developed a miRNA-mRNA interaction analysis pipeline. The proposed solution is based on two miRNA expression prediction methods – a scaling function and a linear model. Additionally, miRNA-mRNA anti-correlation analyses are used to determine the most probable miRNA gene targets (

Conclusions

The MMpred pipeline requires only mRNA expression data as input and is independent of third party miRNA target prediction methods. The method passed extensive numerical validation based on the binding energy between the mature miRNA and the 3’ UTR region of the target gene. We report that MMpred is capable of generating results similar to those obtained using paired datasets. For the reported test cases we generated consistent output and predicted biological relationships that will help formulate further testable hypotheses.

Background

MicroRNAs are short non-coding RNAs that utilise the cellular RNA-induced silencing complex (RISC) to influence gene expression

The most significant changes of miRNA repression activity are observed during differentiation process

Combined, these observations imply that a miRNA expression profile is positively correlated with its host gene mRNA expression profile and anti-correlated with the expression profiles of its target genes. This simple functional model can be further extended to identify functional clusters of miRNA host genes. An intriguing application of this model is that mRNA expression data can be used to predict both miRNA expression and putative miRNA targets (Figure

**Simple overview of the Predictive Model assumptions.**

Performing functional analyses of miRNA-mRNA interactions using standard methodology would require measuring global expression of mRNA and miRNA using two different arrays or RNA-sequencing experiments. Such an approach requires a large quantity of purified RNA, increases processing and handling overhead, and incurs the additional cost of supporting two different array platforms. These impediments are reflected in the relatively small number of paired miRNA-mRNA datasets available in public repositories - (

**Summary of all miRNA datasets performed on popular platforms in GEO (represented by at least 25 arrays, data from July 2011).** Paired datasets are marked in green, with the mRNA array platform and number of samples stated. Description and remarks: All the data have been derived from NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/). The newest and most advanced third version of the Agilent Human miRNA array is represented by only 49 samples, gathered in 5 datasets. The older Agilent array (version 2.0, capable of measuring 723 microRNAs), currently listed as the most popular global miRNA expression assay in GEO, is represented by only 17 datasets containing 539 samples. Among the 10 major miRNA microarray platforms available in GEO (those represented by more than 25 arrays), only 23 experiments have been identified as paired miRNA-mRNA. The vast majority of those assays concern large cancer tissue expression studies, so the chance of finding a dataset on a different biological subject is relatively low.

In completing this investigation we have focused on paired Affymetrix Human Exon ST 1.0 – Agilent Human miRNA Microarray 2.0 datasets to build a prediction model, and data derived from the Affymetrix Human Genome U133 Plus 2.0 - Agilent Human miRNA Microarray 2.0 as validation sets.

The initial step of this process involved mapping all of the miRBase human miRNAs to Affymetrix probes. The paired datasets were then used to construct two independent, general predictors. A consensus method was developed to consolidate the predictors’ output and to correlate it with experimental mRNA expression data. This was used to identify putative miRNA interactions with coding genes (targets). Finally, overrepresentation of the predicted target genes in different ontologies was estimated using a hypergeometric test to determine functionally annotated clusters of miRNA-gene interactions. The model has been implemented in the R statistical environment and is accessible as a modular, user-friendly analysis pipeline for the prediction of microRNA regulatory mechanisms using HG-U133Plus2 microarray data as input.

Results

User input and pre-processing

Raw microarray intensity values are pre-processed using the Robust Multi-array Average (RMA) method

**The flowchart presenting the general structure of the pipeline.**

The mapping of microRNAs to protein coding “host genes”

The mapping between microRNAs and their host genes was completed using a simple method that utilizes genomic coordinates retrieved from miRBase

The resulting network comprised 690 mature miRNAs and 544 coding genes connected by 3992 edges. The large number of connections between the nodes supports the current opinion of a many-to-many relationship between miRNAs and host genes. Of the overlaps, 3653 (92%) involve intronic sites, while 208 (5%) involve the exons of coding genes. In addition, 97 and 34 (2% and <1%) involve the 5’UTRs and 3’UTRs respectively. Sorting the overlaps by DNA strand indicated that 3320 (83%) of the predicted interactions involve the coding strand and 672 (17%) the anti-sense strand.
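The coordinate-based mapping can be illustrated with a minimal sketch. It assumes that a miRNA is assigned to a host gene whenever their genomic intervals overlap on the same chromosome, and that the strand comparison decides between sense and antisense edges. The `Feature` type, the function names and the miRNA coordinates are hypothetical; only the gene interval is taken from the mapping table.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def overlaps(a: Feature, b: Feature) -> bool:
    """True when two features on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def map_hosts(mirnas, genes):
    """Return (miRNA, gene, orientation) edges for every overlapping pair."""
    edges = []
    for mir in mirnas:
        for gene in genes:
            if overlaps(mir, gene):
                orientation = "sense" if mir.strand == gene.strand else "antisense"
                edges.append((mir.name, gene.name, orientation))
    return edges


# Gene interval from the mapping table (strand assumed to be "+").
genes = [Feature("ENSG00000215386", "21", 17442842, 17982094, "+")]
# Hypothetical miRNA coordinates, chosen only to illustrate the overlap test.
mirnas = [
    Feature("mir-inside", "21", 17912000, 17912080, "+"),
    Feature("mir-outside", "21", 18100000, 18100080, "+"),
]
edges = map_hosts(mirnas, genes)
print(edges)
```

A production mapping would additionally classify each overlap as intronic, exonic, 5’UTR or 3’UTR, which requires transcript-level annotation that this sketch omits.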

The microarray platform specific mappings between Affymetrix gene/exon IDs and the mature miRNA identifiers represented on the chosen platforms were retrieved and directly incorporated into the pipeline. In the case of the Affymetrix Human Genome U133 Plus 2.0 mapping to the Agilent Human miRNA Microarray 2.0, 996 probesets corresponding to 483 host genes (1,600 Ensembl transcripts) were identified. A total of 4,857 edges connect the transcripts to 544 pre-microRNAs, which can be further processed to the 646 mature miRNAs represented on the Human miRNA microarray. The second mapping features the same miRNA array platform and the Affymetrix Human Exon 1.0 ST array. In this instance 996 probesets representing 14,191 exons (encoding 544 genes) have been identified as being in close proximity to pri-miRNA sequences. An estimated 16,851 edges associate these transcripts with 578 pre-microRNAs (representative of 646 mature miRNAs). Due to its increased genomic coverage and robust expression measurements, the HuEx-1.0ST mapping was used to calculate the predictors’ parameters and validate the model. However, because of the much larger number of HG-U133Plus2 experiments in GEO, this array was selected as the primary input platform for the pipeline.

The mapping is utilised as a binary file when the pipeline is executed. The mappings can be recalculated with new releases of the source databases. A representative section of the mapping table is illustrated in Table

**The complete mapping table in CSV format.** Filtered for sense intronic transcripts only.

| Mirbase_id | s. | Overlap | Evidence | Ensembl_gene_id | Ensembl_transcript_id | Affy_hg_u133_plus_2 | Affy_huex_1_0_st_v2 | Chromosome | Start_position | End_position | miR | miR\* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| hsa-let-7a-3 | + | exon | HGNC_automatic_transcript | ENSG00000197182 | ENST00000360737 | **232480_at** | 3948921 | 22 | 46449741 | 46509808 | **hsa-let-7a** | hsa-let-7a\* |
| hsa-let-7a-3 | + | exon | Vega_transcript | ENSG00000197182 | ENST00000360737 | **232480_at** | 3948949 | 22 | 46449741 | 46509808 | **hsa-let-7a** | hsa-let-7a\* |
| hsa-let-7b | + | exon | HGNC_automatic_transcript | ENSG00000197182 | ENST00000360737 | **232480_at** | 3948921 | 22 | 46449741 | 46509808 | **hsa-let-7b** | hsa-let-7b\* |
| hsa-let-7b | + | exon | Vega_transcript | ENSG00000197182 | ENST00000360737 | **232480_at** | 3948949 | 22 | 46449741 | 46509808 | **hsa-let-7b** | hsa-let-7b\* |
| hsa-let-7c | + | intron | HGNC_curated_transcript | ENSG00000215386 | ENST00000308787 | **1559901_s_at** | 3915214 | 21 | 17442842 | 17982094 | **hsa-let-7c** | hsa-let-7c\* |
| hsa-let-7c | + | intron | HGNC_curated_transcript | ENSG00000215386 | ENST00000308787 | **1559901_s_at** | 3915194 | 21 | 17442842 | 17982094 | **hsa-let-7c** | hsa-let-7c\* |
| hsa-let-7c | + | intron | HGNC_curated_transcript | ENSG00000215386 | ENST00000308787 | **1559901_s_at** | 3915317 | 21 | 17442842 | 17982094 | **hsa-let-7c** | hsa-let-7c\* |
| hsa-let-7c | + | intron | HGNC_curated_transcript | ENSG00000215386 | ENST00000308787 | **1559901_s_at** | 3915201 | 21 | 17442842 | 17982094 | **hsa-let-7c** | hsa-let-7c\* |
| hsa-let-7c | + | intron | HGNC_curated_transcript | ENSG00000215386 | ENST00000308787 | **1559901_s_at** | 3915291 | 21 | 17442842 | 17982094 | **hsa-let-7c** | hsa-let-7c\* |
| hsa-let-7c | + | intron | HGNC_curated_transcript | ENSG00000215386 | ENST00000308787 | **1559901_s_at** | 3915202 | 21 | 17442842 | 17982094 | **hsa-let-7c** | hsa-let-7c\* |
| hsa-let-7c | + | intron | HGNC_curated_transcript | ENSG00000215386 | ENST00000308787 | **1559901_s_at** | 3915257 | 21 | 17442842 | 17982094 | **hsa-let-7c** | hsa-let-7c\* |
| hsa-let-7c | + | intron | HGNC_curated_transcript | ENSG00000215386 | ENST00000308787 | **1559901_s_at** | 3915275 | 21 | 17442842 | 17982094 | **hsa-let-7c** | hsa-let-7c\* |
| hsa-let-7c | + | intron | HGNC_automatic_transcript | ENSG00000215386 | ENST00000308787 | **1559901_s_at** | 3915192 | 21 | 17442842 | 17982094 | **hsa-let-7c** | hsa-let-7c\* |
| hsa-let-7c | + | intron | Vega_transcript | ENSG00000215386 | ENST00000308787 | **1559901_s_at** | 3915318 | 21 | 17442842 | 17982094 | **hsa-let-7c** | hsa-let-7c\* |
| hsa-let-7c | + | intron | Vega_transcript | ENSG00000215386 | ENST00000308787 | **1559901_s_at** | 3915193 | 21 | 17442842 | 17982094 | **hsa-let-7c** | hsa-let-7c\* |
| hsa-let-7c | + | intron | Vega_transcript | ENSG00000215386 | ENST00000400178 | **1559901_s_at** | 3915214 | 21 | 17442842 | 17982094 | **hsa-let-7c** | hsa-let-7c\* |

This table is used by the mapping function, essential for both prediction methods.

Predictor I: Scaling function

Paired microRNA-mRNA dataset “

In contrast, when only those miRNAs that had been mapped to host gene transcripts were used, the correlation coefficients attained were 0.23 for Pearson’s, 0.22 for Spearman’s and 0.16 for Kendall’s method. This is a significant improvement over non-mapped interactions. The relatively higher value of the Pearson product–moment correlation suggests that the observed correlation in the dataset may be linear in nature. To determine whether the mapped genes represent a random sampling of the population of all genes, the Shapiro-Wilk test was performed. The null hypothesis that the sample is derived from a normally distributed population was rejected (p-value < 0.0001, α level 0.05).
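For readers who want to reproduce this kind of comparison, the three correlation measures can be sketched in plain Python. This is an illustrative implementation on made-up expression values, not the code used in the pipeline; the Kendall variant shown is tau-a, which ignores tied pairs.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    score, pairs = 0, 0
    for i, j in combinations(range(len(x)), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        score += (prod > 0) - (prod < 0)
        pairs += 1
    return score / pairs


# Hypothetical host gene and miRNA expression indexes across five samples.
host = [6.1, 7.4, 8.0, 9.2, 10.5]
mirna = [5.0, 5.9, 5.7, 7.1, 7.8]
print(pearson(host, mirna), spearman(host, mirna), kendall(host, mirna))
```

A Pearson value that exceeds the two rank-based coefficients, as reported above for the mapped miRNAs, is what one would expect when the relationship is close to linear.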

Consequently a scaling function was introduced to estimate the miRNA expression values from the corresponding host genes’ expression (Figure

**Predictor I – The scaling function.**

Validation of the model indicated that the mean correlation of overlapping miRNA with their host genes is only marginally improved by performing scaling. However, values on the right tail of the probability distribution plot, representing strongly correlated expressions (

Finally the predictor uses calculated expression values to build a pseudo-expression matrix. This matrix has exactly the same construction as expression sets obtained from real microarray experiments, but the values are generated in silico, using the linear predictor, rather than experimentally determined expression data.

Predictor II: Linear model

Despite the satisfactory performance of the scaling function predictor, several tests indicated that implementing a general linear model might further enhance the predictive power of the model. In this approach the coefficients are fitted to the paired data using the least squares method rather than being arbitrarily chosen. Furthermore, it is also feasible to introduce individual coefficient values for each miRNA to reflect biological dependencies more accurately.

To fit a linear model that correctly optimizes the linear function parameters for each microRNA, an appropriate training dataset was required. The

To pair miRBase IDs with their corresponding Affymetrix Human Exon Array host transcripts IDs, the previously used mapping array was extended using HuEx-1.0ST transcript IDs. Since the Human Exon chip is backward compatible with Affymetrix genome chips this operation proved feasible

In order to optimize the predictor power and avoid over-fitting, expression values were split into a training set (2/3 of the data) and a test set (1/3 of the data). To minimise any potential bias, the composition of both sets was randomized after pairing miRNA expression indexes with their respective mRNA expression values (Figure

**Predictor II – linear model.**
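The randomized 2/3 to 1/3 split described for Predictor II can be sketched as follows. The function name, the fixed seed and the toy miRNA-mRNA pairs are assumptions made for illustration; the source does not describe how the randomization was implemented.

```python
import random


def split_pairs(pairs, train_fraction=2 / 3, seed=0):
    """Shuffle paired (miRNA, mRNA) observations and split them into two sets."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical records: (miRNA id, host mRNA expression, miRNA expression).
pairs = [(f"mir-{i}", 5.0 + i, 4.0 + 0.5 * i) for i in range(9)]
train, held_out = split_pairs(pairs)
print(len(train), len(held_out))
```

Shuffling after the miRNA and mRNA values have been paired, as the text specifies, keeps each observation intact while removing any ordering bias from the original dataset.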

After maximizing the prediction power, the utility of generalizing predictions to different array experiments and platforms was assessed. On this occasion, the linear models were trained on all available data from the

Correlation analyses

Correlation between messenger RNA and microRNA is the cornerstone of the pipeline. A positive correlation indicates a host gene relationship, while a negative value suggests a target gene relationship. The pipeline utilizes both dependencies to extract genes predicted to be influenced by miRNA (

**The flow-chart presenting the idea of filtering putative miRNA target genes.** The correlation matrix based on the predictor output is filtered by a negative correlation cut-off in order to find putative miRNA-target interactions. These interactions are subsequently used for GO, KEGG and DOLite over-representation testing and for creating user-readable HTML output.
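A minimal sketch of the filtering step is given here, assuming Pearson correlation between predicted miRNA profiles and measured mRNA profiles. The cut-off of -0.9, the function names and the expression values are hypothetical and chosen only to make the example concrete.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def putative_targets(mirna_profiles, mrna_profiles, cutoff=-0.9):
    """Keep miRNA-gene pairs whose correlation is at or below the cut-off."""
    hits = []
    for mir, mir_expr in mirna_profiles.items():
        for gene, gene_expr in mrna_profiles.items():
            r = pearson(mir_expr, gene_expr)
            if r <= cutoff:
                hits.append((mir, gene, r))
    # Most strongly anti-correlated interactions first.
    return sorted(hits, key=lambda hit: hit[2])


# Hypothetical profiles across four samples.
mirna_profiles = {"mir-A": [1.0, 2.0, 3.0, 4.0]}
mrna_profiles = {
    "gene-down": [8.0, 6.0, 4.0, 2.0],  # falls as the miRNA rises
    "gene-up": [1.0, 2.0, 3.0, 5.0],    # rises with the miRNA
    "gene-flat": [3.0, 4.0, 3.0, 4.0],  # no clear dependence
}
hits = putative_targets(mirna_profiles, mrna_profiles)
print(hits)
```

Only `gene-down` survives the filter, which mirrors the idea that strongly anti-correlated pairs are carried forward to the over-representation tests.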

Final analyses – GO, KEGG, DOLite and user defined terms overrepresentation testing

Filtering the most anti-correlated expression values generates a list of microRNA – target gene interactions. Depending on the parameters defined by the user and the quality of the input data the length of this list may vary significantly. The pipeline generates three summary lists: (1) influenced genes, sorted by miRNA identified as inducer of coding transcript quantity change, (2) miRNAs sorted by genes they are influencing and (3) all interactions with significance score (

The Affymetrix probe IDs are transformed into user-friendly Entrez IDs, HGNC symbols and gene names, which are also easily integrated into third party tools. Each of the lists is available to the user in either CSV format, or displayed in an HTML report.

The final step of the pipeline performs analyses of gene ontology terms, KEGG pathways, DOLite disease ontology and user defined Entrez terms. In each case a hypergeometric test is applied to those genes predicted to be influenced by miRNA differential expression to evaluate the enrichment of each category. Subsequently, the corresponding table of terms with test statistics, a pie chart, a bar chart, a concept network of interactions and a heatmap of the most overrepresented genes featured in each of the ontology categories are generated. These tables and plots are incorporated into a final HTML report. The motivation for incorporating such analyses into the pipeline was to facilitate biological interpretation of the output. The lists of miRNAs and differentially repressed mRNAs may be very long; enrichment categories offer the user a consistent, compact output, simplify assessment of the biological significance of the predicted mRNA-miRNA interactions and help direct further validation studies.
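The hypergeometric test used for enrichment can be written out directly. The sketch below computes the upper-tail probability of observing at least a given number of annotated genes among the predicted targets; the function name and the gene counts are illustrative and are not taken from the pipeline.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, drawn, hits):
    """P(X >= hits) when `drawn` genes are sampled without replacement from a
    population of `population` genes, `annotated` of which carry the term."""
    upper = min(annotated, drawn)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(population, drawn)


# Toy example: 20 genes on the array, 5 annotated to a category,
# 4 predicted targets, 3 of which fall in the category.
p_value = hypergeom_enrichment_p(population=20, annotated=5, drawn=4, hits=3)
print(round(p_value, 4))
```

In practice the population is the set of genes measurable on the array, and the resulting p-values would normally be corrected for the number of categories tested.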

Examples of the pipeline results and sample HTML reports (

**Sample pipeline outputs in HTML format (compressed file).**

**Overview of the pipeline outputs (raw MMpred output for both case studies).**

The validation of expression based target prediction and pipeline’s general performance

We experimentally validated the predictive models by correlating the predicted miRNA expression values with those obtained from microarrays. To test whether strongly anti-correlated interactions between predicted miRNA and measured mRNA expression can identify putative target genes, we implemented a systematic numerical method based on the binding energy between the mature miRNA and the 3’ UTR region of the gene. The general pipeline performance was assayed by comparing the analyses presented in the GSE19350 validation dataset authors’ publication (Wang

**The short description of analysed case studies.**

**Detailed report on case study I: Toll-like 4 receptor activated by Lipopolysaccharide (LPS).**

**Detailed report on case study II: Comparison of miRNA regulation in human severe blunt trauma and severe burn injury.**

The miRNA-target binding energy based validation

The method we propose is a modified “energy walk” procedure described in the paper by Ritchie

**The distribution of calculated free energies in the validated target set derived from miRecords.** The calculation was obtained by sampling 3240 3’UTR sequences of human target genes for optimal miRNA binding free energy. The randomized sampling contains the same number of free energy calculations.

The study of lowest binding energy distributions revealed that using a fixed free energy cut-off (−20 kcal, Ritchie

To further assess the significance of the difference between actual and randomized energy calculations, the Welch Two Sample t-test was performed. The null hypothesis (the true difference in means between actual and randomized data is equal to 0) was rejected with p-value < 2.2e-16 for both randomizations. It should be noted that the means of the two randomized samples do not differ significantly (p-value = 0.9776).
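As a reminder of what the Welch test computes, the statistic and its Welch-Satterthwaite degrees of freedom can be sketched in a few lines. The energy values below are hypothetical; converting the statistic to a p-value requires the t-distribution, which is left to a statistics library.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Return the Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of the first mean
    vb = variance(b) / len(b)  # squared standard error of the second mean
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical minimum free energies for real and shuffled sequences.
actual = [-24.0, -22.5, -26.1, -23.4, -25.0]
randomized = [-15.2, -17.8, -14.1, -16.5, -18.9, -15.5]
t_stat, dof = welch_t(actual, randomized)
print(t_stat, dof)
```

Unlike the pooled-variance t-test, this form does not assume equal variances in the two groups, which suits a comparison between real and shuffled sequences.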

Further, we validated experimentally measured miRNA-mRNA expression anti-correlation as a target identification method using the paired microarray dataset “

Finally, to assess both the predictive power of the miRNA expression predictor and its target prediction capabilities, the full MMpred pipeline was run on GSE21687 mRNA expression data only, repeating the same free energy calculation procedure (distribution shown in Figure

**Systematic validation of target prediction by the similarity of binding free energy distribution with miRecords.**

**The distribution of calculated free energies in expression anti-correlation based target predictions.** The calculations were obtained by sampling 3’UTR sequences of 697 human target gene candidates for the experimental miRNA expression dataset and 26 for predictor-derived miRNA expression (full MMpred pipeline). The randomized sampling contains 697 free energy calculations.

**P-values obtained from Welch Two Sample t-test.** The cases where the null hypothesis has been rejected are marked in red (p-value cut-off equals 0.01), otherwise marked in green.

General performance and usability

We compared the analyses presented in Wang

Discussion

The primary objective of the reported model is to facilitate miRNA focussed analyses of the large body of mRNA expression data available in public repositories. Extensive, long term usage of microarray gene expression assays in clinical studies has produced a vast repository of extremely valuable, well-designed datasets. This is in contrast to the very limited miRNA expression datasets available in the public domain. Our model enables inexpensive hypothesis generation regarding miRNA regulatory events from this vast repository of mRNA expression datasets. The primary assumption implemented in the pipeline is that analyses of correlation between regulatory host genes and miRNAs can be used to predict miRNA regulatory networks. Since the majority of human microRNAs are co-expressed with host genes, we propose that expression of these miRNAs is positively correlated with that of their host transcripts. That is, over-expression of host genes indicates a positive fold change of miRNA copy number and vice versa. A further assumption is that such microRNAs are expressed in the same quantity and at the same time as their respective host genes (

In contrast, miRNAs promote target gene degradation, which is in turn detected as a lower expression signal on mRNA microarrays. These two dependencies were used to create a general mathematical model of miRNA expression prediction and to predict regulatory miRNA networks. The model was initially validated using numerical coherence between predicted and experimental data, achieving a significant degree of correlation. Subsequent functional hypothesis generation using model predictions was evaluated by completing case studies with three previously reported mRNA expression datasets (GSE11327

**Examples of MMpred predictions supported by experimental data and mapping against current databases.**

Possible applications of the pipeline include miRNA target prediction, construction of putative miRNA regulatory clusters, and cost-efficient generation of a large number of predicted differential miRNA expression profiles from the vast repository of human mRNA data in the public domain.

Methodology similar to MMpred has been reported previously. For example, several tools utilise miRNA-target anti-correlation to rank computational target predictions (usually sequence matching or homology based) and identify those most likely to be true biological hits. The validation is usually performed by experimental assays or by measuring the enrichment in overlap between top ranked predictions and validated miRNA targets. A noteworthy example is the HOCTAR method

However, before applying the model one must be aware of its limitations. In particular, the predictor does not determine whether genes connected within the functional category are suppressed by miRNA, or whether suppression normally existing in the control group has been alleviated. The pipeline does identify whether the expression of differentially regulated genes is significantly anti-correlated with the expression of one or more predicted miRNAs. The direction of regulation (

The functional analyses (

Although the gene ID method was chosen as the default pipeline’s mapping generator, other tested methods (

The validation of predictors indicated that for many intronic miRNAs the linear model predictor performed better, though in a few cases the scaling function performed best. For that reason we decided to implement both predictors in the pipeline. The number of miRNAs predicted to be significantly misregulated after applying the auto-generated cut-off may differ considerably between the predictors. In certain extreme cases there may be no miRNA found significantly over- or under-expressed by one or both predictors. If only one predictor returns significant miRNAs the pipeline will continue to execute. If both predictors return no significant result, further analysis is impossible and the process will terminate. In such scenarios the user should either adjust the cut-off parameter or re-evaluate the experiment design. A union of the predictions is used to report a consensus result. When using the linear model approach, fold change values are generally smaller and possibly more likely to reflect the experimental fold change. This is due to the specificity of this predictor: the linear model uses coefficients fitted to the experimental data, which makes its predictions more accurate. In contrast, the coefficients of the scaling function are chosen manually and the final coefficient is a product of their multiplication. This approach may overestimate the fold change value of genes/miRNAs with a high expression index. Besides linear predictors some higher order predicting methods (
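The consensus logic described in this paragraph (continue if at least one predictor returns significant miRNAs, stop if neither does, report the union otherwise) can be sketched as below. The function name and the use of an exception to signal termination are assumptions for illustration only.

```python
def consensus(scaling_hits, linear_hits):
    """Union of significant miRNAs from both predictors.

    Raises ValueError when neither predictor returns a significant miRNA,
    mirroring the case in which the pipeline terminates.
    """
    scaling_hits, linear_hits = set(scaling_hits), set(linear_hits)
    if not scaling_hits and not linear_hits:
        raise ValueError("no significant miRNA; adjust the cut-off or the design")
    return scaling_hits | linear_hits


# One predictor alone is enough for the analysis to proceed.
only_scaling = consensus(["mir-A"], [])
both = consensus(["mir-A"], ["mir-A", "mir-B"])

# With two empty result sets the sketch stops, as the text describes.
try:
    consensus([], [])
    stopped = False
except ValueError:
    stopped = True
print(only_scaling, both, stopped)
```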

The interactions derived from correlation analyses support the biological rationale of the predictors. Our first investigation is an assessment of the expression of the top 500 mRNA intronic transcripts (

**Correlation box. (A)** Top 500 mRNA transcripts ranked by p-value (X-axis) plotted against the corresponding miRNA expression values (Y-axis). The group showing good linear correlation is marked with a green regression line, while the group with no dependence between expression indexes is marked with a red one. **(B)** Messenger RNA transcript expression index (X-axis) plotted against the corresponding exonic miRNA transcript expression index (Y-axis). No dependence between expression indexes is visible on this plot.

Conclusions

We present details of MMpred, a novel and generally applicable mathematical model of miRNA-mRNA interactions predicted from mRNA expression data. The method enables cost and time efficient hypothesis building for both miRNA differential expression and miRNA-mRNA interactions using retrospective analyses of publicly available mRNA microarray datasets. A notable advantage of the model is the creation of case specific predictions of miRNA-mRNA signalling networks from mRNA datasets. In contrast to other miRNA target prediction tools, which aim to find all possible miRNA-target repression interactions, our minimalistic, case specific approach reduces the burden of numerous false discoveries. Additionally, the smaller number of significant targets returned by the prediction pipeline simplifies the associated functional analyses of the predicted networks.

The MMpred pipeline reports the functional enrichment categories of the most likely miRNA-mRNA relationships given the experimentally determined differential gene expression profile. The data are presented in a succinct manner to facilitate testable hypothesis generation of the predicted miRNA-mRNA interaction networks. For example, the comparative burn and blunt injuries case study indicates that miRNAs repressing immune system cells’ metabolic genes are down-regulated in order to relieve the metabolic lock of the inflammatory response, thus protecting the organism against infections and promoting the regeneration process (see Additional file

The MMpred model is implemented as an R package and is suitable for further community validation (details in the Supplementary materials). Our validation showed significant prediction power and ability to partially reproduce results obtained by analysing paired expression datasets. The reported case studies indicate that the method predicts biologically coherent miRNA-mRNA networks and that the approach will add value to current miRNA regulatory network analysis efforts. Consequently, we believe MMpred is a useful tool for mining the vast mRNA expression data resources and screening for potential miRNA targets and miRNA-mRNA functional modules.

Methods

The mathematical bases of the predictors and correlation analyses

The scaling function predictor can be summarized as a set of vector equations and implemented as required in the model:

Symbols: **E** (miRNA and mRNA expression vectors), **FC** (fold change), **k** (scaling coefficients for sense, overlap and evidence).

**Equation 1** - General formula for the scaling function predictor

**Equation 2** - Mapping function

**Equation 3** - Weight vector

**Equation 4** - Scaling coefficients determining weight vector elements

**Equation 1** represents the general form of the predictor, which calculates the estimated microRNA expression index by averaging elements of the experimentally observed mRNA expression vector multiplied by a weight vector. The expression vector is created by a mapping function, which selects expression values corresponding to host genes from the messenger RNA expression matrix (**Equation 2**). Simultaneously, a weight vector of the same length is created (**Equation 3**). Each value in this vector is calculated by multiplying the absolute fold change (FC), the reverse scaled p-value ($1 - p$) and three scaling coefficients ($k_{sense}$, $k_{overlap}$ and $k_{evidence}$) that combined describe the nature of the predicted edge between miRNA and mRNA (**Equation 4**). Values for these coefficients have been arbitrarily assigned, using biological knowledge and computational tests performed prior to building the function. For example, the intronic regions are excised from coding sequences during splicing, which theoretically makes them available to the Drosha enzyme. However, both the 3’UTR and 5’UTR are incorporated into mature mRNA, so they can only be processed to miRNA if the maturation process and transportation of mRNA out of the nucleus is interrupted. These scenarios dictate that the model preferentially promotes intronic sequences.
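The verbal description of Equations 1 to 4 can be turned into a short sketch. It assumes that the prediction is the mean of the host expression values each multiplied by its edge weight, and that the weight is the product of the absolute fold change, one minus the p-value and the three scaling coefficients. All coefficient values below are hypothetical: the source states that they were assigned manually but does not list them.

```python
# Hypothetical coefficient tables (illustrative values only).
K_SENSE = {"sense": 1.0, "antisense": 0.5}
K_OVERLAP = {"intron": 1.0, "exon": 0.5, "5UTR": 0.25, "3UTR": 0.25}
K_EVIDENCE = {"curated": 1.0, "automatic": 0.8}


def edge_weight(edge):
    """Weight of one miRNA-host edge: |FC| * (1 - p) * k_sense * k_overlap * k_evidence."""
    return (
        abs(edge["fold_change"])
        * (1.0 - edge["p_value"])
        * K_SENSE[edge["strand"]]
        * K_OVERLAP[edge["overlap"]]
        * K_EVIDENCE[edge["evidence"]]
    )


def predict_mirna_expression(edges):
    """Average of the weighted host expression values mapped to one miRNA."""
    weighted = [edge["expression"] * edge_weight(edge) for edge in edges]
    return sum(weighted) / len(weighted)


# Two hypothetical host transcripts mapped to the same miRNA.
edges = [
    {"expression": 8.0, "fold_change": 2.0, "p_value": 0.01,
     "strand": "sense", "overlap": "intron", "evidence": "curated"},
    {"expression": 6.0, "fold_change": -1.5, "p_value": 0.2,
     "strand": "antisense", "overlap": "exon", "evidence": "automatic"},
]
predicted = predict_mirna_expression(edges)
print(round(predicted, 2))
```

With these illustrative coefficients the sense, intronic, curated edge dominates the prediction, which is the behaviour the text describes when it says the model preferentially promotes intronic sequences.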

For the linear model predictor, the principal mathematical problem encountered while constructing the optimal regression formula was the variable number of independent values describing each dependent value. The mapping function assigned between 1 and 32 mRNA expression indexes to every miRNA. Parameters such as the p-value, fold change and genomic context of transcripts that were used successfully in the previous predictor were again incorporated into the linear model. In addition, the regression model includes further ordinal (categorical) and continuous descriptive parameters:

**e**

**FC**

**overlay** Categorical parameter of levels:

**strand** Categorical parameter of levels:

**evidence** Categorical parameter e.g.

The following equations describe how starting with the simplest scenario (

**Equation 5** - The regression formula when predicting the miRNA expression of 1 microRNA when dependent on 1 mRNA transcript

**Equation 6** - The regression formula when predicting the expression of a miRNA when dependent on 2 mRNA transcripts

**Equation 7** - A general regression formula for predicting the expression of a miRNA expression value when dependent on n transcripts

Implementing the iterative formula in the linear model is not possible within a single regression formula, because the number of terms differs between miRNAs. Instead, the model predicts miRNA expression from each transcript separately and then calculates the median value as the final prediction for each miRNA. However, using the model described by **Equation 5** with this method resulted in poor prediction power: the Pearson’s correlation coefficient between the measured values and our predictions was 0.324. As a solution, a factor containing the names of miRNAs was introduced into the model. This allowed the fitting function to select different linear equation coefficients for unique miRNAs (**Equation 8**).

**Equation 8** - The regression formula for predicting the expression value of a microRNA after introducing miRNAs’ name factor

This model achieved high performance, with an estimated correlation of 0.945 between the experimental values and our predicted values. Additional analyses indicated that miRNAs located on the antisense strand or in exonic, 3’UTR and 5’UTR regions are weakly correlated and may introduce noise rather than add signal to the model. Pre-filtering these transcripts marginally increased the correlation to 0.949. The ambiguous nature of the evidence (**Equation 9**) further increased prediction power to 0.956. A simplification of the model (**Equation 10**), based only on mRNA expression values and the miRNA ID factor, resulted in a correlation coefficient of 0.955. Despite its larger computational complexity, the best performing regression formula, described by **Equation 9**, was implemented in the pipeline (Figure

**Equation 9** - Final regression formula characterised by the highest prediction power and moderate resource consumption.

**Equation 10** - Simplified regression formula for the linear model predictor.

Finally the miRNA-mRNA correlation analyses can be simplified to the following formula:

**Equation 11** - The mathematical basis of the miRNA-mRNA correlation analyses.

R/Bioconductor implementation

Despite the complexity of the model, the R implementation (hereafter referred to as the pipeline) has been designed to be simple and user friendly. The pipeline takes as input raw Affymetrix CEL files and an experiment design vector (or a matrix in the case of more complicated ANOVA statistics), which distinguishes the biological replicates, time series etc. (e.g. sample versus control in the simplest case). The output is an HTML-formatted report. This includes the output of the predictors in tabular form, as well as quality assessment plots for the statistical pre-processing and the performance of the predictors. Functional analyses, presented as hypergeometric test result tables, are supported by pie charts, bar plots, interaction concept networks and annotated heatmaps (provided by the R/Bioconductor GeneAnswers library). The primary pipeline interface is a command-line R console; however, users with different requirements may use a graphical user interface (GUI) built with GTK+. Advanced users may benefit from the modular structure of the pipeline, which facilitates changing individual components and reusing single modules in third-party projects.

Detailed documentation explaining the interfaces, system requirements and implementation structure is available as Additional file

**The detailed description of software implementation in R language.**

**The R implementation of the presented method: MMpred.**

Pre-processing of raw array data in R

The expression matrices for both array types were obtained by performing the standard Robust Multi-chip Average (RMA) procedure

Correlation matrix

The idea of creating correlation matrices was inspired by a procedure used in regression analysis, in which the independent variables are correlated against each other to assess their independence. The important differences are that the regression procedure operates on vectors, creates square matrices and aims to minimize the absolute value of the correlation: a correlation close to 0 indicates that the independent variables do not describe each other. The method that we have developed operates on arrays, and can therefore be treated as reducing the dimensionality of the data. The basic assumption is that the expression matrices calculated from every paired dataset have the same number of columns (the same number of arrays must be used to assay miRNA and mRNA) and a different number of rows (there are many more coding genes than miRNAs). Every row of the miRNA expression matrix is correlated against each row of the mRNA expression matrix and the correlation coefficient is recorded. In this way the two matrices are collapsed into one, which has the same number of rows as the miRNA expression matrix and a number of columns equal to the number of rows in the mRNA expression matrix.
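The row-by-row correlation described above can be sketched in a few lines of Python with NumPy. This is an illustration only, not the pipeline's R code: the function name `cross_correlation` and the toy matrices are hypothetical, and Pearson correlation is assumed as the coefficient.

```python
import numpy as np

def cross_correlation(mirna, mrna):
    """Pearson correlation of every miRNA row against every mRNA row.

    mirna: (n_mirna, n_arrays) expression matrix
    mrna:  (n_mrna, n_arrays) expression matrix
    Returns an (n_mirna, n_mrna) matrix of correlation coefficients.
    Rows with zero variance would give a division by zero and must be
    filtered out beforehand.
    """
    mirna = np.asarray(mirna, dtype=float)
    mrna = np.asarray(mrna, dtype=float)
    if mirna.shape[1] != mrna.shape[1]:
        raise ValueError("both matrices must have the same number of arrays (columns)")
    # Standardise each row; with population standard deviations the mean
    # product of z-scores equals the Pearson correlation coefficient.
    mz = (mirna - mirna.mean(axis=1, keepdims=True)) / mirna.std(axis=1, keepdims=True)
    gz = (mrna - mrna.mean(axis=1, keepdims=True)) / mrna.std(axis=1, keepdims=True)
    return mz @ gz.T / mirna.shape[1]

# Toy data (made up): 2 miRNAs and 3 mRNAs measured on the same 4 arrays.
mirna_expr = [[1, 2, 3, 4],
              [4, 3, 2, 1]]
mrna_expr = [[2, 4, 6, 8],
             [8, 6, 4, 2],
             [1, 3, 2, 4]]

corr = cross_correlation(mirna_expr, mrna_expr)
```

In this toy case `corr` has 2 rows (one per miRNA) and 3 columns (one per mRNA). Strongly negative entries would mark candidate targets and strongly positive entries candidate host genes.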

The most comprehensive correlation investigation has been made on the

The design of microarrays used in our studies

Affymetrix HG-U133 Plus 2.0 and Human Exon 1.0 ST arrays measure messenger RNA expression by in situ oligonucleotide hybridization. The important difference between these platforms is that HuEx-1.0ST measures gene expression at the exon level: each probe set corresponds to a single exon rather than a gene. The older platforms, including the U133 arrays, used probes complementary to the 3’UTR regions only. The new approach requires the most current, high-density arrays, but should ensure higher precision of expression measurements and allows alternative splicing analyses. The manufacturer guarantees that at the genomic level HuEx-1.0ST arrays are fully backward compatible with the U133 family. Since gene mapping between these platforms is possible, numerous comparative studies have been performed. The high concordance between the HuEx-1.0ST and HG-U133Plus2 platforms has been confirmed by many independent research groups

| | **Affymetrix human genome U133 Plus 2.0** | **Affymetrix human Exon 1.0 ST** | **Agilent human miRNA Microarray 2.0** |
|---|---|---|---|
| **Total features per array** | ~1 million | >5.5 million | ~15,000 |
| **Probe sets** | >54,000 | 1.4 million | 821 |
| **Exon clusters / Transcripts / miRNAs** | ~47,400 | >1 million | 723 human + 76 viral |
| **Oligonucleotide probe length** | 25-mer | 25-mer | ~40–60 nucleotides |
| **Resolution** | 11 pairs/transcript, 16.1/gene | 5.8/exon, 44.8/gene | 20–40/sequence |
| **Feature size** | 11 μm | 5 μm | 65 μm |

**Agilent Human miRNA** microarrays utilize similar technology to Affymetrix GeneChips, but measure the abundance of mature microRNA transcripts (both dominant and minor transcripts). This platform contains probes complementary to 723 human microRNAs and 76 human viral microRNAs. The probe set design is based on miRBase version 10.1. The raw data are extracted as a text (.TXT) file, which can be further processed by Agilent's Feature Extraction software to a GeneView file or directly analysed with the Bioconductor AgiMicroRna library

Paired datasets

The paired datasets required for building and testing the model are publicly available and were obtained from the Gene Expression Omnibus (GEO) repository.

The miRNA-target binding energy based validation

The procedure uses the Vienna RNA Package version 1.8.5 to calculate the minimum free energy of miRNA binding. The 3’UTR sequences are scanned using a sliding window of 25 bp with a 5 bp step. Since the RNAfold algorithm only calculates the free energy of a single-stranded RNA molecule, each scanned 25 bp fragment of the 3’UTR was joined to the mature miRNA sequence using an 8 bp artificial linker sequence of 'X' bases that cannot be paired (as described by Enright
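The window scanning and linker construction can be sketched as follows. This Python fragment only builds the chimeric sequences and does not call RNAfold. The function name `chimeric_windows`, the toy sequences and the order of concatenation (3’UTR fragment, then linker, then miRNA) are assumptions made for illustration.

```python
def chimeric_windows(utr, mirna, window=25, step=5, linker_len=8):
    """Build one foldable sequence per 3'UTR window.

    Each window of the UTR is joined to the mature miRNA by a linker of 'X'
    bases that cannot pair, so that a single-strand folding program can
    estimate the energy of the duplex.
    Assumption: the fragment comes first, followed by the linker and the miRNA.
    """
    linker = "X" * linker_len
    chimeras = []
    # Only full-length windows are used; a trailing partial window is skipped.
    for start in range(0, len(utr) - window + 1, step):
        fragment = utr[start:start + window]
        chimeras.append((start, fragment + linker + mirna))
    return chimeras

# Toy sequences (made up, not real transcripts): a 40 nt UTR and a 22 nt miRNA.
toy_utr = "AUGC" * 10
toy_mirna = "ACGU" * 5 + "AC"

windows = chimeric_windows(toy_utr, toy_mirna)
```

For the 40 nt toy UTR the windows start at positions 0, 5, 10 and 15, and every chimeric sequence is 25 + 8 + 22 = 55 characters long. Each sequence could then be passed to a folding program that treats 'X' as unpairable.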

The validation has been implemented in R. Mature miRNA sequences were obtained from miRBase version 17.0 using the mirbase.db R library. 3’UTR sequences were downloaded from Ensembl via the biomaRt R interface. For genes with multiple 3’UTR transcripts the longest isoform was selected to ensure the sampling of all possible binding locations. Genes with 3’UTRs shorter than 100 bp were discarded from the analysis. The free energy calculations were executed using the GeneRfold R interface to the Vienna RNA library. The miRecords (version 3,
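The isoform selection and length filter can be expressed as a short Python sketch. It is illustrative only: the function name `select_utrs`, the input layout (a dictionary mapping a gene ID to a list of UTR sequences) and the toy gene names are hypothetical, and the retrieval from Ensembl is not shown.

```python
def select_utrs(utrs_by_gene, min_length=100):
    """Keep the longest UTR isoform per gene and drop genes whose longest
    isoform is shorter than min_length."""
    selected = {}
    for gene, isoforms in utrs_by_gene.items():
        if not isoforms:
            continue  # gene without any annotated UTR sequence
        longest = max(isoforms, key=len)
        if len(longest) >= min_length:
            selected[gene] = longest
    return selected

# Toy input (made-up sequences): lengths 50 and 120, 99, and 100.
toy_utrs = {
    "geneA": ["A" * 50, "C" * 120],
    "geneB": ["G" * 99],
    "geneC": ["U" * 100],
}

kept = select_utrs(toy_utrs)
```

Here `geneA` keeps its 120 nt isoform, `geneB` is discarded because its only isoform is shorter than 100 nt, and `geneC` is retained because it meets the threshold exactly.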

Abbreviations

**pri-miRNA**:

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

PS, MC and PW conceived and planned the project. PS developed, implemented and tested the algorithm, plus performed the analyses. PS, MC and PW co-authored the paper. All authors read and approved the final manuscript.

Acknowledgements

This work was funded by the GlaxoSmithKline Medicine Research Centre. We would like to thank GlaxoSmithKline for providing HPC resources vital to completing the project, as well as Cranfield University for substantial support. We also thank the authors of the microarray datasets used for providing publicly available, high-quality experimental data.