A correction for this article has been published in BMC Bioinformatics 2007, 8:248Inference of miRNA targets using evolutionary conservation and pathway analysis1 Biozentrum, University of Basel, Basel, Switzerland 2 Swiss Institute of Bioinformatics, Basel, Switzerland
BMC Bioinformatics 2007, 8:69doi:10.1186/1471-2105-8-69 The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/8/69
©
2007 Gaidatzis et al; licensee BioMed Central Ltd. AbstractBackgroundMicroRNAs have emerged as important regulatory genes in a variety of cellular processes and, in recent years, hundreds of such genes have been discovered in animals. In contrast, functional annotations are available only for a very small fraction of these miRNAs, and even in these cases only partially. ResultsWe developed a general Bayesian method for the inference of miRNA target sites, in which, for each miRNA, we explicitly model the evolution of orthologous target sites in a set of related species. Using this method we predict target sites for all known miRNAs in flies, worms, fish, and mammals. By comparing our predictions in fly with a reference set of experimentally tested miRNA-mRNA interactions we show that our general method performs at least as well as the most accurate methods available to date, including ones specifically tailored for target prediction in fly. An important novel feature of our model is that it explicitly infers the phylogenetic distribution of functional target sites, independently for each miRNA. This allows us to infer species-specific and clade-specific miRNA targeting. We also show that, in long human 3' UTRs, miRNA target sites occur preferentially near the start and near the end of the 3' UTR. To characterize miRNA function beyond the predicted lists of targets we further present a method to infer significant associations between the sets of targets predicted for individual miRNAs and specific biochemical pathways, in particular those of the KEGG pathway database. We show that this approach retrieves several known functional miRNA-mRNA associations, and predicts novel functions for known miRNAs in cell growth and in development. ConclusionWe have presented a Bayesian target prediction algorithm without any tunable parameters, that can be applied to sequences from any clade of species. The algorithm automatically infers the phylogenetic distribution of functional sites for each miRNA, and assigns a posterior probability to each putative target site. The results presented here indicate that our general method achieves very good performance in predicting miRNA target sites, providing at the same time insights into the evolution of target sites for individual miRNAs. Moreover, by combining our predictions with pathway analysis, we propose functions of specific miRNAs in nervous system development, inter-cellular communication and cell growth. The complete target site predictions as well as the miRNA/pathway associations are accessible on the ElMMo web server. BackgroundSince the initial discovery of the lin-4 miRNA [1], and then of the let-7 miRNA which is highly conserved in evolution [2], combined experimental and computational approaches have resulted in the identification of hundreds of miRNAs in animal genomes, some of the large-scale studies being [3-19]. In contrast, high-throughput approaches for experimental identification of miRNA targets are only in their infancy [20,21], and global properties of miRNA-dependent regulatory networks have mostly been inferred from computationally-predicted target sites [22-26]. Perhaps surprisingly, relatively little is known about the constraints on a functional miRNA target site. Mutational studies [27,28] confirmed initial observations of Lai [29] and Lewis et al. [30] that perfect base pairing between the 5' end of the miRNA and its target is essential. As a consequence, some of the computational methods for miRNA target prediction require [22,31] or can enforce the constraint [32] that 6–8 nucleotides at the 5' end of the miRNA, the so-called miRNA "seed", are perfectly base paired with its mRNA target, or give a higher weight to the base pairs formed in this region [25,33]. Since every 6mer occurs on average once every 4,096 nucleotides in random sequence, the number of target sites for each miRNA would be very large if matching of the seed were the only requirement for functional target sites. Although there are indications that miRNAs do have a large number of targets [20-22,28,31,34], experimental studies typically do not confirm that every seed match constitutes a functional target site. It seems therefore that additional factors contribute to the functionality of target sites. To improve the specificity of prediction of functional target sites, most computational studies make use of evolutionarily conservation [22,25,31,35] or at least flag conserved putative targets [32,33]. However, currently available methods generally use conservation statistics in an ad hoc manner. In particular, existing methods do not explicitly take the phylogenetic relationships into account when weighing the evidence of conservation between related species. In addition, current methods treat all miRNAs identically and ignore that the selection pressures for conserving functional target sites between related species may differ significantly between miRNAs. That is, functional target sites for one miRNA may be preferentially conserved in one subset of species, whereas the functional sites for another miRNA may be preferentially conserved in another subset of species. Incorporating conservation statistics in a general, rigorous and miRNA-dependent manner are the main features of the miRNA target prediction method that we present here. From the very early stages of miRNA target prediction it became clear that regulatory proteins such as transcription factors are preferentially subjected to miRNA-dependent regulation. Yet, beyond a few well-characterized miRNA-target interactions, there is still very little known about the place of individual miRNAs in the regulatory networks of cells and organs. Several groups [23,30,36] have used Gene Ontology categories in an attempt to characterize the biological roles of different miRNAs. Here we present a new analysis based on the association of targets for individual miRNAs with molecular pathways annotated in the KEGG database. This approach recovers some of the known miRNA-mRNA associations, and makes new predictions, in particular it predicts the for the involvement of specific miRNAs in nervous system development, inter-cellular communication, and cell growth. Results and DiscussionmiRNA-target interactions: the importance of different 'seed types'Several lines of evidence [22,27-29,37] suggest that complementarity of the target site to the first 8 bases at the 5' end of the miRNA are of crucial importance for target site recognition. Lewis et al. [22] have investigated the importance of the miRNA "seed", defined as the positions 2–7 of the miRNA, by comparing conservation statistics of mRNA segments that are complementary to miRNA seeds with those of randomized control sets. They concluded that conserved 3' UTR regions predicted to hybridize perfectly with positions 2–8 of the miRNA or with positions 2–7 of the miRNA, but having an A nucleotide flanking the seed match at the 3' end are likely to be miRNA target sites. Inspired by these methods we decided to re-investigate the conservation statistics of different "seed types" across different clades of organisms. In all cases, we only focused on the first 8 positions of the miRNA, and we analyzed the following 9 "seed types" (see Figure 1):
1. Perfect complementarity with Watson-Crick interactions between positions 1–8 of the miRNA and the mRNA target site. 2. Perfect complementarity at positions 1–7 but not at position 8. 3. Perfect complementarity at positions 2–8 but not at position 1. 4. Perfect complementarity at 2–7 but not at 1 and 8. 5. Complementarity at positions 1–8 with a single G-U pair occurring with the U in the miRNA (GUM). 6. Complementarity at positions 1–8 with a single G-U pair occurring with the U in the target (GUT). 7. Complementarity with a single bulged nucleotide on the miRNA side (BM). 8. Complementarity with a single bulged nucleotide on the target side (BT). 9. Complementarity with a single internal loop involving one nucleotide in both miRNA and target (LP). For each of these 9 seed types t, and each of the four clades (mammals plus chicken, fishes, flies, and worms), we determined the fraction ft of putative target sites that are perfectly conserved in all species of the clade (see Methods for details). We only considered miRNAs that were themselves conserved in all species of the clade. We also determined the "background" conservation fraction, of randomly chosen 3' UTR sequence segments of the same length as the respective seed types that are conserved in all species of the clade. Figure 1 shows the ratio of these two fractions, which we called "conservation fold enrichment". As expected, octameric sites show most evidence of functionality. The conservation fold enrichment decreases dramatically as the extent of complementarity that we require between miRNA and putative target site decreases. In particular, sites in which only the nucleotides 2–7 of the miRNA are predicted to form base pairs with the mRNA, as well as sites predicted to form G-U base pairs or to contain internal loops show relative little evidence of conservation enrichment. This is not to say that such sites are never functional. Indeed, functional target sites of this type are known, in particular in worms [38], for which, interestingly, we observe the strongest evidence of selection on these seed types. However, for our target prediction method we decided to focus on the three seed types that show strong evidence of conservation enrichment across all clades: those with perfect Watson-Crick complementarity with positions 1–7, 2–8, or 1–8 of the miRNA. As a result, we predict the same set of target sites for different miRNAs with the same first 8 nucleotides. In reality, even though the seed is probably most important for targeting, the 3' ends may also contribute to the target selection and this could differentiate the target sets for different miRNAs with the same seed. This possibility, which has been studied experimentally by Brennecke et al. [28], and is explicitly incorporated in other target prediction models [25], is not captured by our model. Note, however, that because the miRNA-mediated targeting depends on the expression of both miRNA and targets, distinct miRNAs that have the same seed sequence may still have different target sets simply due to differences in their expression profile, even though in principle they recognize the same set of target sites. Note also that we do not use a model in which the mRNA position corresponding to the first nucleotide in miRNA is an adenosine, because we did not find this constraint to consistently improve the conservation fold enrichment across all clades (not shown). Bayesian phylogenetic model for miRNA target sitesWe have developed a Bayesian probabilistic model for assigning, to each putative "site" in a 3' UTR that is complementary to a miRNA seed, a posterior probability that the site is a functional target site for the miRNA, meaning that the site has been selected in evolution for its ability to bind the miRNA. The details of the model are described in the Methods section, but the main ingredients are the following. For each miRNA and each seed type t we collect all putative sites in the 3' UTRs of the reference species of the clade in question, i.e. the 3' UTR sequence segments that are complementary to the given miRNA seed. For each of these putative sites we then determine the conservation pattern Generally, if conservation patterns with many ci = 1 are much more abundant among putative miRNA sites than among background sites, then we infer that a fraction of the putative target sites must be functional and that selection has maintained these sites in some of the species. However, conserved target sites need not be functional in all species in which they occur. The conservation pattern of a given site is typically the result of selection maintaining the site in some of the species in combination with chance conservation of the site in other species, in particular those that are evolutionarily close. Our model flexibly and explicitly takes this into account. The model considers all possible "selection patterns" For each miRNA we then determine the frequencies p( Using the estimated frequencies p( Phylogenetic distribution of functional target sites across miRNAsNote that the estimated distribution over selection patterns p( Additional File 1. Phylogenetic distribution of functional target sites. Inferred selection pattern distributions p( Format: PDF Size: 13KB Download file This file can be viewed with: Adobe Acrobat Reader
Because we infer different distributions p( One of the issues that has been extensively discussed in the miRNA literature is the question of the typical number of functional targets per miRNA, and the related question of what fraction of seed matches in 3' UTRs corresponds to functional target sites. Previous work has indicated that the number of targets per miRNA varies across miRNAs [23]. We believe that the ability of our method to infer species-specific miRNA targeting for each miRNA, allows for a more sensitive and accurate estimation of the total number of functional target sites of each miRNA. There are two independent contributions to the total number of functional target sites for a given miRNA. First, the total number of miRNA seed matches varies from miRNA to miRNA and second, the fraction of seed matches that correspond to functional sites may vary from miRNA to miRNA. The latter can be estimated from the conservation evidence. In particular, the inferred parameter ρ of the distribution p(
The number of predicted sites under selection does not appear to correlate with the breadth of miRNA expression, as among the miRNAs with the largest number of predicted target sites we find some that are highly tissue specific (miR-9 and miR-124 that are expressed in the nervous system [5], and miR-155 that is specific to lymphoid cells [5]) as well some that have broad expression (e.g. the families of miR-29 [9,10,41,42] and miR-30 [41,42]). miR-16, which is ubiquitously expressed [5,41] has an intermediate number of targets (Additional file 2). Additional File 2. Number of miRNA targets predicted to be under selection pressure for each miRNA. Format: XLS Size: 63KB Download file This file can be viewed with: Microsoft Excel Viewer Performance comparison with other methodsTo assess the quality of our predictions relative to other methods that have been published to date, we built on the results recently published by Stark et al. [25], who have performed a detailed comparison of the performance of most of the prediction methods that are currently in use on a relatively large set of experimentally tested miRNA-mRNA interactions. This experimental data set has been mostly obtained by the Cohen lab, with a small number of interactions having been tested by other groups. The issues concerning the biases involved with the assembly of this data set have already been discussed by Stark et al. [25], and we will not belabor them here. We will only caution the reader that the accuracy of various different methods on this data set should not be taken as an indication of their accuracy on a random set of miRNA-mRNA interactions. Unfortunately, this unbiased experiment has not been done. Since our method assigns a posterior probability to each predicted site, sets of predictions at different levels of confidence can be obtained by including only sites over a given posterior probability. We created such sets at different thresholds in posterior probability and computed the sensitivity (
The figure indicates that our method performs as well as the most accurate prediction methods available to date, while maintaining a very high specificity even for high sensitivities. We observe a large overlap between our predicted targets and those predicted by Stark et al. [25] and Grün et al. [23] although there are also substantial numbers of predicted sites that are either specific to our method or specific to one of the other methods. The significant overlap is most likely a reflection of the similarity in the definition of target sites: 7/8-nucleotide seed matches that are conserved across at least some of the other flies account for a large fraction of the predicted sites in all three methods. However, these other methods also consider putative sites with fewer matches in the seed region if they are sufficiently conserved [23] or compensated by matches to the 3' end of the miRNA [25]. The very good accuracy of our predictions indicates that appropriately weighing the evolutionary information enables us to achieve a good performance even with more restrictive definition of putative target sites compared to thee other methods. In particular, we note that from a total of 12,155 high confidence predicted sites (posterior probability p ≥ 0.5), a substantial proportion, namely 1,953 (16%), are not perfectly conserved in Drosophila pseudoobscura, but are conserved in many of the other flies. Such sites will be missed by methods that only consider strict conservation in D. pseudoobscura. In Additional file 3 we show a detailed comparison of our predicted target sets for fly miRNAs and those reported by Stark et al. [43] and Grün et al. [23]. We defined a UTR to be a predicted target of a specific miRNA if it had at least 0.5 probability of containing a functional site for the miRNA (see Methods). Because the UTR data sets used by different groups differs to some extent, we have used a conservative scheme of computing the overlap: we have assumed that whenever another method predicted a site in a splice variant of a given gene, all the variants would share the site. Thus, the numbers below represent upper bounds on the extent of overlap between the different methods. Note additionally that the total number of predictions made by other methods may not be the number of predictions reported in the respective studies, but include all the splice variants known to date. The overlap between our predictions and those of Stark et al. [43] and Grün et al. [23] varies significantly between miRNAs. For example, for the bantam miRNAs, which has shown to be involved in the regulation of cell growth [47,48], the overlap is quite large. We predict 140 targets of which 106 (76%) and 121 (86%) occur in the predictions of Stark et al. [25] and Grün et al. [23], respectively. The discrepancy is higher for miR-1, a miRNA required for muscle development [49]. We predict 362 targets of which 252 (70%) and 271 (75%) occur among the predictions of Stark et al. [25] and Grün et al. [23], respectively. Finally, for another microRNA, miR-281, we make only a total of 34 predictions of which only 13 (38%) and 17 (50%) occur among the predictions of Stark et al. [25] and Grün et al. [23], respectively. That is, at least half of our predicted targets are not predicted by the other two algorithms. Unfortunately, the data set of experimentally tested miRNA-mRNA interactions is too small to meaningfully compare the predictions of the different methods for individual miRNAs. Additional File 3. Detailed comparison of the overlap between the predictions provided by our method and the methods of Stark et al. [25] and Grün et al. [23]. Format: XLS Size: 34KB Download file This file can be viewed with: Microsoft Excel Viewer Location bias of predicted miRNA target sites in UTRsWe next turned to the high-confidence (posterior probability ≥ 0.5) subset of our predicted miRNA target sites and we asked whether we could identify a bias in the location of evolutionarily selected miRNA target sites in the 3' UTRs. Figure 5 shows a heat map representation of the location of these sites along the 3' UTRs in the different clades. In this plot, each predicted miRNA target site is represented as a dot with its x-coordinate being the total length of the 3'UTR in which it resides and its y-coordinate being the relative, normalized position of the site in the UTR. We infer that in all clades, the high-confidence sites tend to avoid the regions immediately after the stop codon as well as the end of the transcript. At the 3' end, this effect could be due to the presence of polyA tails in some of the Refseq transcripts. In human, where the UTRs are much longer than in the other species considered (3,300 of the 22,459 of the human UTRs in our data set were longer than 2 kb), conserved miRNA target sites also tend to avoid the regions in the middle of long UTRs. This pattern is mirrored in the conservation profile across long UTRs, i.e. long UTRs tend to be less conserved in the middle than toward their ends (data not shown). The pattern is also observed at the level of predicted target sites for individual miRNAs (Figure 6), i.e it is not caused by one or two miRNAs with an aberrant target site distribution.
A conceivable explanation for the observed pattern of enrichment of miRNA sites toward the start and end of long UTRs is that a non-negligible proportion of long Refseq UTRs erroneously contains introns. To test this hypothesis we obtained all EST sequences that overlap Refseq UTRs and calculated, for each UTR base, the fraction of all overlapping ESTs in which the base is intronic. As shown in Additional file 4, there is almost no difference between the intron-inclusion profiles for long and short 3' UTRs. That is, the observed enrichment of miRNA sites toward the ends of long 3' UTRs cannot be explained by intron inclusion. Additional File 4. Profile of the exon coverage of short (left panel) and long (right panel) 3' UTRs. We used the mappings of spliced ESTs from the UCSC database to determine, for each nucleotide in a 3' UTR in our data set, the fraction of times the nucleotide has been observed in an exon, as opposed to an intron. We only used ESTs that mapped uniquely with at least 95% identity to the genome. Genome gaps longer than 30 nucleotides were considered to be introns. The profiles of the computed exon coverage along relatively short (less than 2 kb, left panel) and relatively long (longer than 4 kb, right panel) 3' UTRs are shown in the plots with a continuous line. Also shown are the histograms of the relative positions of predicted sites (with posterior probability ≥ 0.5) in the same 3' UTRs. Format: PDF Size: 7KB Download file This file can be viewed with: Adobe Acrobat Reader The observed pattern is interesting because it has been argued [25,26] that miRNAs are a major factor driving the evolution of UTR lengths: ubiquitously-expressed genes have short UTRs, while genes whose expression is more restricted and regulated by miRNAs have longer UTRs. Our result suggests a more complicated scenario, in which more strongly conserved miRNA target sites, which have most likely emerged early, are located towards the boundaries of the 3' UTR, the stop codon and the polyadenylation site. This particular location of target sites may influence the likelihood of interaction between the miRNA-containing ribonucleoproteins and other complexes involved in RNA processing and regulation. Inference of miRNA function using pathway analysisTo analyze the role that individual miRNAs play in the regulatory networks in human, we have used the KEGG database in which a large fraction of the human genes are assigned to pathways. KEGG provides a mapping between genes and pathways, as well as a reference to the identifier of each of the genes in the Gene database of NCBI. Based on this mapping, as well as on the assignment of Refseq identifiers to Gene identifiers which we obtained from NCBI, we have constructed an assignment between putative miRNA targets and pathways. The resulting dataset consisted of 4, 011 human Refseq transcripts. Using putative target sites with posterior probability of ≥ 0.5, we have determined which pathways are significantly associated with each individual miRNA (see Methods). In particular, for each miRNA/pathway combination we calculated the log-likelihood ratio, given the observed data, of two models: one that assumes that pathway membership and being a predicted target of the miRNA are independent, and one that assumes that these are generally dependent properties. Figure 7 shows the results of this analysis for the subset of miRNAs that had at least one significant association (Additional file 5 shows the entire miRNA set). The color scale is centered around a log-likelihood ratio of 0 (white), and the intensity of the color is proportional to the posterior probability of the dependent model. Enrichment of targets in a pathway is shown in red, and depletion of targets in a pathway is shown in blue. Additional File 5. Pathway analysis for all miRNAs and all KEGG pathways. Representation of individual pathways among the predicted targets of a given miRNA. Each column corresponds to a KEGG pathway and each row to a group of miRNAs with the same seed sequence. Red indicates overrepresentation of the targets of a specific miRNA among the genes in the corresponding pathway, whereas blue indicates depletion. The intensity of the color indicates the posterior probability of the dependent model. Pathways have been grouped in larger functional categories according to the KEGG annotation. Format: PDF Size: 37KB Download file This file can be viewed with: Adobe Acrobat Reader
The first thing to note is that, as reported previously [25], genes that are ubiquitous and are involved in basic metabolic functions, tend to be depleted in miRNA target sites. Also noted before is that miRNAs tend to target genes involved in transcription regulation, intercellular communication, cell growth and death and development [22,23,25,30,33,36]. For example, we find that the targets of 19 of the 119 unique miRNA seeds are significantly enriched in the axon guidance pathway. This does not necessarily imply, however, that all these miRNAs are specifically involved in axon guidance. Many of the molecules involved in axon guidance are also involved in delivering spatial cues during the development of other systems, such as for example the cardiovascular system. So it is plausible that, whereas many miRNAs are associated with the axon guidance pathway, different miRNAs may act on different subsets of the mRNAs from this pathway in different tissues. Below we describe some of the most notable associations that we found between specific miRNAs and pathways. Our method yields the expected associations for miRNAs which are specifically expressed in certain tissues (and presumably regulate processes that are specific to these tissues), or for miRNAs for which targets are already known. miR-124a, whose expression is highly specific to the nervous system, is one of the miRNAs most significantly associated with the axon guidance pathway. Its corresponding targets in this pathway include players with known involvement in nervous system development such as the ephrins B1, B2, and B3, ephrin receptors A2, A3, and B4, semaphorins 5A, 6A, 6C, and 6D, and plexins A3 and B2. As miR-124 is highly expressed in mature neurons, it is possible that its function is to maintain previously established neuronal circuits. Our results also suggest an involvement of the miR-181 family of miRNAs in nervous system. These miRNAs, whose expression in zebrafish appears to be restricted to the nervous system, thymus and gills [13], have so far been shown to play a role in lymphocyte [50] and muscle [51] development. In our data, they have a set of high confidence targets in the long term potentiation pathway, among which glutamate receptors, calcium/calmodulin-dependent protein kinase II, adenylate cyclate 1, and calmodulin. In fact, calcium/calmodulin kinase II 2 appears to play a role in both memory performance [52] and in activation-induced T cell differentiation [53]. These results may explain the up-to-now puzzling expression pattern of these miRNAs. The let-7 miRNA, which was recently shown to regulate the let-60 gene in C. elegans and is presumed to regulate the human homologs of let-60, i.e. the Ras genes [35], is most significantly associated with the MAPK pathway, with the NRAS gene and the Ras guanyl releasing protein 1 RasGRP1 being predicted as high confidence targets. Additionally, let-7 is predicted to target several kinases and phosphatases in this pathway, and, importantly for the postulated involvement of let-7 in malignancy, the Fas ligand, TGFβ receptor I, nerve growth factor and fibroblast growth factor 11. miR-9 has been described as a brain-specific miRNA [5], and recent evidence suggests that its expression is highest in fetal brain and oligodendrogliomas [54]. The top pathway associated with this miRNA is that of glutamate metabolism, in which miR-9 appears to target glutamate decarboxylase, glutamate dehydrogenase, glutamase, glutamate-cysteine ligase, glutamic-oxaloacetic transaminase 1, as well as glucosamine-phosphate N-acetyltransferase 1, 4-aminobutyrate aminotransferase, and phosphoribosyl pyrophosphate amidotransferase. The second most significant association for miR-9 is with with the focal adhesion pathway, in which many more genes appear to be targeted, among which collagen V α1, collagen IV α2, integrin 6, tenascin C, talin, trombospondin 2, and vinculin. These targets suggest that miR-9 may be involved in regulating the intercellular communication in the brain and the function of neural circuits. Another group of miRNAs for which we suggest a role a development, in particular in the nervous system, is that of the embryonic miRNAs exemplified by miR-372, initially identified in a study of human embryonic stem cells [10]. These miRNAs appear to be primate-specific. However, the nucleotides at position 2–7, AAGUGC, are shared by the 5' ends of several other miRNAs that are embryonically-expressed and of restricted phylogenetic distribution such as the rodent miR-290 (AAAGUGCC 1–8), miR-291 (AAAGUGCU 1–8), miR-292 (AAAGUGCC 1–8), and miR-294 (AAAGUGCU 1–8), zebrafish miR-430's (U/AAAGUGCU at 1–8), as well as the human miR-302 group (UAAGUGCU at 1–8), miR-373 (GAAGUGCU at 1–8), and most miRNAs of the miR-520 group (AAAGUGCU at 1–8). Of these miRNAs, the study of [55] implicated miR-430 in the nervous system development in zebrafish, although in a subsequent study the authors showed that miR-430 plays a role in the clearance of maternal RNAs [56]. Our results speak to the first proposed role of this class of miRNAs, namely in nervous system development. The miR-372-related miRNAs (AAAGUGCU at 1–8) have a strong predicted association with the axon guidance pathway, where it is predicted to target, among others, the ephrin B2, ephrin receptors A4, A5 and A7, semaphorin 4B, LIM kinase 1, and p21-activated kinase 7. Moreover, at least some of the Smad genes that are part of the top pathway predicted to be targeted by these miRNAs, the TGFβ pathway, have been implicated in the growth of neurites [57]. Interestingly, the difference A vs. U or G at the first position between the miR-372 and other families mentioned above leads to quite different predictions of targeted pathways. For none of these other miRNAs have we found a pathway that appears to be significantly targeted. Finally, we were very interested in understanding the function of miR-16 (which shares its seed with the miR-15 group of miRNAs), a miRNA that appears to be ubiquitously expressed at least in mouse [5], and has been implicated in regulation of apoptosis [58] and of mRNA stability [59]. We find that the most significant association of miR-16 is with the mTOR signaling pathway [60], which integrates nutrient-derived signals and controls cell growth. miR-16 appears to target the rapamycin-insensitive companion of mTOR, several ribosomal protein kinases, components of the eukaryotic translation initiation factor 4 (B and E), insulin-like growth factor 1 and others. The second most significant association of this miRNA is with the Wnt pathway, in which it targets several Wnt (Wnt2B, Wnt3A, Wnt5B, Wnt7A), a Wnt inhibitor (WIF1) and cyclin (D1, D2, D3) proteins, and the third most significant association is focal adhesion, where miR-16 appears to target a large number of transcripts that have fundamental role in cell division and cell-cell communication. Some examples are again the cyclins D1, D2, and D3, cell division cycle 42, p21-activated kinase 7 (PAK7), v-akt murine thymoma viral oncogene homolog 3 (AKT3), v-crk sarcoma virus CT10 oncogene homolog (avian)-like (CRKL), mitogen-activated protein kinase kinase 1 (MAP2K1), laminin gamma 1, B-cell CLL/lymphoma 2 (BCL2), and others. These suggest a fundamental role of miR-16 in controlling cell growth and maintaining cell-cell interactions. These functions may explain the observed association between miR-16/miR-15a deletions and chronic lymphocytic leukemia [61], and the slower progression of CLL in mice treated with rapamycin [62]. ConclusionAs the number of miRNA genes has been growing steadily, especially through high-throughput cloning techniques, the number of experimentally validated targets has been lagging markedly behind. Recently, studies that take advantage of the fact that miRNAs appear to also induce partial degradation of their mRNA targets have used microarray methodology to identify genes whose expression changes upon over-expression or knock-down of individual miRNAs. Typically hundreds of putative targets are identified in such studies but there is only partial overlap between these sets of putative targets and those that are computationally predicted using comparative genomics methods. Computational modeling of miRNA-mRNA interaction and accurate prediction of miRNA target sites therefore remains an important and challenging problem in bioinformatics. In particular, it is still poorly understood what constraints beyond matching of the miRNA seed determine functionality of putative target sites. In this study, we developed a general method for miRNA target prediction that extends the already available methods in several ways. First, we treat the phylogenetic relationships between species in a rigorous and general way, without any tunable parameters. That is, the Bayesian procedure uniquely determines the posterior probabilities for each conservation pattern and seed type in terms of the observed conservation patterns of target sites for each miRNA. Thus, in contrast to many other target predictions methods which are specifically tailored to operate on a particular clade of species, our method can be applied to any clade of species, and the phylogenetic relations between the species will be automatically taken into account when assessing the significance of the site conservation patterns. This will, for example, enable us to easily update our predictions as more genomes become available, without the need of adapting the method. Note also that our Bayesian procedure for incorporating information from conservation statistics is generally independent from the "site" definition that we employ and can easily be applied to other target site definitions (see Methods for details). Thus, if a better definition of target sites is developed in the future, for example through a better understanding of the requirements on functional miRNA target sites, then we can easily adapt the method to incorporate conservation statistics in essentially the same way. Most generally put, given a binary function that distinguishes "sites" from "non-sites" in RNA sequences, and given a set of "background frequencies" p( Second, we estimate the evolution of selection pressures on target sites in a miRNA-specific manner. This enables us to correctly treat miRNAs that appeared at different stages in evolution, and whose targets may have undergone different selection pressures in different lineages. In particular, we show that different miRNAs show markedly different distributions of functional target sites across the phylogenetic tree and provide the first comprehensive picture of species-specific and clade-specific miRNA targeting. We have additionally shown that, especially in long 3' UTRs that occur in vertebrates, miRNA target sites show a significant bias toward occurrence near the start and end of the 3' UTR. This suggests the possibility that the choice of a distal polyadenylation site may reduce the activity of a miRNA target cassette in the center of the 3' UTR, while introducing other miRNA target sites close to the new polyA tail. With respect to the performance of our algorithm, we have shown that in fly, where extensive comparisons of the performance of target prediction algorithms have been done, our method performs at least as well as the most accurate methods available today, with a high specificity over a relatively large range of sensitivities. Finally, to more robustly infer the function of individual miRNAs, each of whom may target hundreds of transcripts, we developed a method for identifying biochemical pathways that are significantly enriched or depleted in targets of a specific miRNA. We showed that, for well-studied miRNAs, this approach recovers the known functional associations. In addition, this analysis predicts novel pathway associations for a significant number of miRNAs. MethodsConservation fold enrichment of different seed typesFor the data shown in Figure 1 we focused, for each clade, on all miRNAs that occur in all species of the clade. Given that the seed sequence is so important for our inference, we used small RNA cloning data in human to determine the most abundant form of each mature miRNA (Pfeffer et al. [63,64] and M.Zavolan & T.Tuschl, unpublished data), and we used this form in our prediction (Additional file 6). To determine which miRNAs are conserved in the clade we started with miRNA genes annotated in miRBase and searched the genomes of the other species for matches to the mature miRNA. Whenever the mature miRNA mapped with at most one mismatch we consider the mature miRNA conserved in that species. Since our inferences only uses the first 8 nucleotides of a miRNA, we then consider a miRNA seed to be conserved in a species if there exists at least one mature miRNA in that species with the corresponding seed. Additional File 6. MiRNA seed families. All miRNAs that have the same seed (positions 1–8) were clustered together. The table shows the representative miRNA as well as the members of each cluster. Format: XLS Size: 73KB Download file This file can be viewed with: Microsoft Excel Viewer For each seed type t we located all sites in the 3' UTRs of the reference species that are complementary to a seed of type t for any of the conserved miRNAs and then computed the fraction ct of these sites that are conserved in all other species of the clade. We also determined the "background" conservation frequencies bt for each seed type by scanning all 3' UTRs of the reference species and computing the fraction of all sequence segments of the same length as the seed that are conserved in all other species of the clade. Note that all seeds of the same length have the same background frequency bt. This is because we found that this frequency is largely independent of the number of occurrences of a particular sequence segment in the reference species. Finally, the conservation fold enrichment ft of seed type t is defined as the ratio of observed and background conservation rates: ft = ct/bt. Bayesian phylogenetic miRNA target identification algorithmFor each miRNA and each of the three seed types we identify putative target sites separately and assign a posterior probability to each target site as follows. First we find all "sites" that are complementary to the seed in the 3' UTRs of the reference species. Using pairwise alignments between the reference species and the other species we determine, for each putative site, which other species have the site conserved. An individual site was considered conserved if all the base pairs predicted to form between the miRNA and this site in the reference species could also be formed with the corresponding sites, extracted from the genome alignments, in the other species. This defines a "conservation pattern" for each site, which is a binary vector The fact that a putative target site is conserved does not necessarily imply that the site is functional. Especially for closely-related species short sequence segments, such as the 7-mers and 8-mers of miRNA seeds, can easily be conserved by chance. This evolutionary dependency between orthologous sites can be taken into account in a number of different ways. For example, in RNAhybrid [32] the p-values for orthologous target sites are combined by fitting an "effective" number of orthologous sequences to the observed p-value distribution for randomly generated miRNAs. Here we aim to incorporate the conservation statistics in a Bayesian framework that takes the phylogeny of the species explicitly into account and recognizes that a conserved site may be under selection in any of the subsets of species in which the site is conserved. That is, to infer how likely it is that a given putative site with conservation pattern To this end we first define a "background model" that gives the probabilities p( We next calculated how likely it is to observe different conservation patterns Thus, in general we consider all possible "selection patterns" for the site across the different species. Like the conservation pattern, a selection pattern Note that p( Finally, we need to quantify how likely it is a priori that a target site in the reference species will be under selection in a particular subset of the other species. That is, we need a prior probability distribution p( For each miRNA we need to estimate the prior probabilities p( The most general approach to estimating p( where S is the set of all selection patterns that are consistent with the miRNA gene conservation pattern, p( Given sufficient data, i.e. n(
Note that by using only conservation information, we cannot possibly distinguish sites that are only functional in the reference species from sites that are not functional at all. That is, we do not know what part of the fraction (1 - ρ) corresponds to sites that are functional, reference-specific sites and what fraction is nonfunctional. The inferred fraction ρ therefore provides a lower bound on the fraction of functional sites. For simplicity, we will make the conservative assumption that only the fraction ρ of sites is functional, and refer to these sites as the fraction of "functional" sites. For each miRNA we estimate the parameters ρ, and ρ11(k), ρ10(k) and ρ01(k) for each node k, by maximizing the likelihood of the distribution given the observed data, i.e. equation (3). Let ω denote one of the possible selection patterns for the two descending branches, i.e. ω ∈ {01, 10, 11}, and define the indicator function δ( Using this it is easy to show that L can be maximized with respect to the parameters ρω(k) by an expectation maximization (EM) procedure. If we define then the EM update equations are given by By iterating these equations we can determine the optimal ρω(k). Since, as can also be shown by taking second derivatives, the likelihood L is a concave function of the parameters ρω(k), the EM procedure is guaranteed to converge to the unique global optimum of the likelihood. Once all ρω(k) have been determined for a given miRNA we can calculate posterior probabilities of functionality for each putative target site as follows. As mentioned above, we consider a target site functional if it is under selection in the reference and at least one other species. The nonfunctional sites are then by definition those sites that are not under selection in any of the other species. We will denote this no-selection pattern as Note that the prior probability of no selection is simply (1 - ρ), i.e. p( We can thus also write the posterior as The parameter ρ thus corresponds to the estimated fraction of all putative target sites in the reference that are functional, i.e. under selection in at least on other species. Finally, note that the sums in equations (1) and (2) involve a number of terms that grows exponentially with the number of species in the clade. We believe that this will not cause any computational problems in clades with less than 20 or so species. For much larger sets of species these sums can become computationally prohibitive. In those circumstances one could reduce the number of species by choosing, for each set of closely-related species, only a single representative. For species that are so closely-related that most putative target sites are conserved between them, choosing a single representative per group would hardly affect the predictions. Sequence dataWe carried out miRNA target predictions for all available human, fly, fish and worm RefSeq transcripts present in the 17th release of the Refseq database. We mapped all transcripts to the corresponding genomes using the Spa cDNA-to-genome alignment program [65], and the genome assemblies hg17 (human), dm2 (fly), ceWB05 (worm) and danRer3 (fish) provided by the Genome Bioinformatics group at the University of California, Santa Cruz [66]. From the same source we also downloaded pairwise alignments of several genomes with the genome of reference species, as follows: for human we downloaded hg17-to-panTro1, hg17-to-rheMac2, hg17-to-canFam2, hg17-to-bosTau2, hg17-to-mm7, hg17-to-rn3, hg17-to-monDom1 and hg17-to-galGal2; for fly we used dm2-to-droSim1, dm2-to-droYak1, dm2-to-droAna1, dm2-to-dp3, dm2-to-droMoj1, dm2-to-droVir1; for fish we used danRer3-to-fr1 and danRer3-to-tetNig1. Finally, for worm we used the software Threaded Blockset Aligner (TBA) [67] to align C. briggsae and C. remanei to C. elegans. Pathway enrichment analysisWe used the KEGG database to infer pathways preferentially targeted by individual miRNAs. The KEGG database (ftp.genome.jp) contains mappings from NCBI Gene identifiers to pathway IDs (data files: [org]_ncbi-geneid.list, with [org] being the species code)), while the Gene database of NCBI (ftp://ftp.ncbi.nih.gov/gene/ webcite) provides mappings from Gene IDs to Refseq IDs (gene2refseq). By intersecting these data sets we obtained the mappings from Refseq IDs to pathways. We then used a Bayesian method to determine the significance of the overlap between the targets of each seed-equivalent set of miRNAs and each specific pathway. For a given pathway and miRNA let n01, n10, n00 and n11 denote respectively the number of predicted targets of the miRNA that are not part of the pathway, the number of genes in the pathway that are not targeted by the miRNA, the number of genes that are neither targets of the miRNA nor members of the pathway, and the number of genes in the pathway that are predicted to be targeted by the miRNA. While pathway membership is a simple boolean variable (a gene is either a member of a given pathway or it is not), we can only assign probabilities for a given gene to be a miRNA target. Assume that a given gene has n putative target sites for a given miRNA and let pi denote the posterior probability of the ith site. The probability that at least one of the sites is functional is then given by where Γ(x) is the gamma function, a dot indicates summation over the variable in question, i.e. n1. = n10 + n11 and n is the total number of genes. For the dependent model the likelihood is given by where the integral is over the simplex p00 + p10 + p01 + p11 = 1. The ratio of likelihoods Ldep/Lindep quantifies the amount of evidence for association between the miRNA targets and the pathway. This association can either be positive (miRNA targets are enriched in the pathway) or negative (miRNA targets are depleted in the pathway). In Figure 7 we plotted the quantity sign(n11n.. - n1.n.1)pdep, where Authors' contributionsDG, EvN, MZ contributed to all the stages of this project. JH implemented the web server. All authors read and approved the final manuscript. Additional FilesSupplementary Data – Lists of predicted miRNA target sites. Lists of predicted miRNA target sites in the sets of species that we studied, as well as miRNA/pathway associations are available from [68]. AcknowledgementsDG and JH were supported by SNF grants 205321-105945 and 3100A0-114001 to MZ, and by the Swiss Institute of Bioinformatics. References
Have something to say? Post a comment on this article! |




on Google Scholar







author email
corresponding author email
Figure 1.

Figure 2.
Figure 3.

Figure 4.
Figure 5.
Figure 6.
Figure 7.


Figure 8.









