Abstract
Background
Advances in biotechnology offer a fast growing variety of highthroughput data for screening molecular activities of genomic, transcriptional, posttranscriptional and translational observations. However, to date, most computational and algorithmic efforts have been directed at mining data from each of these molecular levels (genomic, transcriptional, etc.) separately. In view of the rapid advances in technology (new generation sequencing, highthroughput proteomics) it is important to address the problem of analyzing these data as a whole, i.e. preserving the emergent properties that appear in the cellular system when all molecular levels are interacting. We analyzed one of the (currently) few datasets that provide both transcriptional and posttranscriptional data of the same samples to investigate the possibility to extract more information, using a joint analysis approach.
Results
We use Factor Analysis coupled with preestablished knowledge as a theoretical base to achieve this goal. Our intention is to identify structures that contain information from both mRNAs and miRNAs, and that can explain the complexity of the data. Despite the small sample available, we can show that this approach permits identification of meaningful structures, in particular two polycistronic miRNA genes related to transcriptional activity and likely to be relevant in the discrimination between gliosarcomas and other brain tumors.
Conclusions
This suggests the need to develop methodologies to simultaneously mine information from different levels of biological organization, rather than linking separate analyses performed in parallel.
Background
Currently, it is possible to observe the activity (over, under expression, presence or absence of mutations) of almost all molecules of a given type (mRNA, miRNA, DNA) in a single screen using highdensity chips [1], or sequencing related techniques [2,3]. Lately, the number of studies using microarray platforms for analysis of mRNA are quickly being followed by similar analyses related to miRNAs [4,5]. Only recently both types of variables were analyzed simultaneously [68], while, typically, both types of data are analyzed in search for (i) molecules sharing similarity, using simply the expression available at the time (unsupervised approaches, [9]) e.g. clustering [10,11] and association networks [1214] or (ii) similarity with or dependency from other types of traits, providing for example clinical classes or other nonmolecular information on the samples (supervised approaches, [9]) i.e. Significant Analysis of Microarray (SAM [15]), Gene Set Enrichment Analysis (GSEA [16]). However, this approach implies to analyze separately different aspects of a system (e.g., transcriptional and/or posttranscriptional mechanisms) and the results may not be concordant with analyses of the system as a whole. For example, interactions among miRNAs and mRNAs may be underestimated or completely overlooked. This lack of information can be expressed as missing the emergent properties of the system. While the concept of emergent properties is well known in Systems Theory, it has only recently become an important concept in the area of life sciences, thanks to the relatively new approach of Systems Biology [1720]. Emergent properties arise from hierarchical integration of the individual components and organizational levels of complex systems, and, biologically, they are only manifest when the organism is considered in its entirety. Analogous to emergent properties in systems biology is the concept of latent variables in multivariate statistics. Latent variables are socalled hidden variables generated in certain types of multivariate analysis (e.g. factor analysis, see below) which are not evident in original observed data. Rather, these latent variables emerge from consideration of the covariance patterns when a large number of relevant variables are analyzed simultaneously. These latent variables may reflect a summarization of causal indicators underlying observed biological variability. Given the parallelism between biological systems' emergent properties and latent variables, we sought quite naturally to investigate the ability of latent variables to describe emergent properties, by applying multivariate analysis simultaneously to different parts of a biological system, and notably to transcriptional and posttranscriptional data. Previously, successful parallel multiplatform analyses were performed integrating genomic and transcriptional level, by using CGH arrays or SNPs and cDNA arrays [21,22]. This approach portend to explain variations observed at the transcriptional level, based on information at the genomic level. These approaches can annotate and map different types of probe IDs onto genomic coordinates [23], or add analyses at the translational level [24]. However, to date, simultaneous analysis of miRNA and mRNA from the same tissue have used only profile correlations [6]. Herein, we expand analyses of molecular covariation beyond correlation of expression profiles by using the multivariate statistical procedure of multiple or common Factor Analysis (FA, [25]). This procedure is widely used to reduce the dimensionality of multivariate data and to do so in a manner that elucidates the underlying or latent structure of the observed variation. Succinctly speaking, for a given set of molecular data, factor analysis partitions the observed pairwise correlations between variables into that molecular covariation that is common between the variables from that which is unique to the individual variables. Application of FA directly on biological data without any a priori hypothesis about latent variables is ideal for data reduction. With this approach FA was used extensively to cluster microarray data [2628]. The use of the a priori knowledge on how each sample maps on tumor classes to constrain the relation between the latent variables under study and the factors obtained permits further data interpretation. In other words we perform a FA that is driven by data (hypothesis) preestablished to find latent variables that could be investigated to obtain biological information [29]. To constrain the factor model we used Linear Discriminant Analysis (LDA, [25]), a technique used to classify a set of observations into categories (a dichotomy in our case). In particular, in the following we will describe the methodology and the results obtained from applying FA to mRNA and miRNA data simultaneously, with the goal to identify information that is not obvious when the analysis is performed on the 2 datasets separately, or when using other approaches. In particular, the identification of a set of colocalized miRNAs (cluster) with possible relevance for the molecular description of gliosarcomas, appears to emerge from this analysis only, showing the potential FA in the identification of emergent properties. Besides LDA, other classifiers (Support Vector Machine, Naive Bayes, Neural Network and kNearestNeighbours) were also tested and performances are listed in Table S9 of the Additional file 1. We only briefly mention here that most of the performances are identical for all the classifiers, and only for the Glioblastomas discrimination LDA shows slightly more accuracy. These results indicate that the classification analysis is robust and gives stable results independently from the choice of the classification algorithm. Factor analysis proceeds from a matrix of pairwise correlations to extract a small number of factors that describe the major patterns of common covariation. More formally, the common factor model is based on the equation D = LF + E, where D are the observed variables, L are the common factors, F are the coefficients or scores of the factors and E are the unique factors, under the assumptions that the unique factors are uncorrelated whith each other and that F and E are independent. Since only common variation is analyzed, these individual factors describe the latent structure underlying the major patterns of molecular covariation. The sign and magnitude of the factors coefficients reflect the extent and direction of the correlation between each variable and individual factor and describe the relative contribution of each variable to a particular pattern of multivariate changes. FA derives a set of factor scores that gives the relative location of each item in the reduced latent variable subspace. The resultant factors, coefficients and scores are interpreted in light of biological knowledge about the specific data under study. FA can define a biological model about the underlying nature of molecular covariation (e.g. number of patterns of covarying elements and their relative importance). These models are evaluated both biologically and statistically and subsequently used to explain the structure and dynamics of complex biological systems. FA and Principal Component Analysis (PCA, [30]) involve several of the same statistical components and are both useful for data reduction. Therefore few words on the rationale for choosing FA instead of PCA are necessary. PCA is an exact mathematical method that returns a single solution where each component is orthogonal and represents an element of variance in the samples (both biological and nonbiological). Therefore, although it is possible to force PCA in order to relax constraints like orthogonality we chose to apply FA since it is more a natural choice to analyzes the common or shared molecular variations and thus, describe the patterns of biological variation. Besides, the method commonly used to estimate the common or shared molecular variations are based on multiple regression and therefore, for most of the applications of FA, this standard approach is stable. There exist several approaches to perform data reduction and classification (see for example Bayesian classifiers [3133], Support Vector Machine [34], Knearestneighbor [35]), however, FA has already been used successfully in various applications related to molecular biology, like the identification of multidimensional patterns of molecular covariation able to describe proteins' structures [36]. More classical approaches have been designed for effective clustering in the analysis of cDNA microarrays and Expressed Sequences Tag (ESTs) [37], as well as in specific applications to identify genes and pathways related to biological categories that could be associated to relevant phenotypes in both yeast and humans [38] or to test and validate hypotheses on the association of gene expression to cisplatin resistance in ovarian cancer cell lines [39]. One of the advantages of this approach over hierarchical clustering is the possibility to include genes in more than one category. More recently, FA was used to filter informative and noninformative data from microarray for gene expression [40]. Variations of classical FA (Bayesian factor analysis) have been used to identify the latent structure that describes the relationship between transcription factors and genes, using microarray data [41]. Previously, this approach was used to perform gene network reconstruction in E. Coli taking advantage of literature information, DNA sequences and expression arrays [42]. We now propose to apply FA to the composite analysis of multilevel molecular data.
Additional file 1. Supplementary information.
Format: PDF Size: 144KB Download file
This file can be viewed with: Adobe Acrobat Reader
Results and Discussion
Because miRNAs and mRNAs are processed together, from now on, Factors will always be likely to include both mRNAs and miRNAs in their composition. To avoid confusion on the meaning of the word gene, we use the term coding genes to refer to mRNAs and the generic term genes to refer both to mRNAs and miRNAs. The interpretation of factors based on associating them to mRNAs/miRNAs (separately considering positive/negative scores) is a novelty of the presented approach, and will be discussed in details in the coming sections. In particular, in the following we will describe: how we identified the latent factors and we will give their interpretation, both using mRNA and miRNA (indirect) functionalities. Then, we will describe the biological structure emerging from this analyis, and we will speculate on its clinical meaning. Finally, we offer a comparison with the results of an analysis done in parallel, although more comparisons are provided in the Additional file 1.
Identification of Multilevel Latent Structures
We performed several Factor Analyses and obtained Models characterized by 1 to 5 factors (named here Model n, n = 1,..., 5). We have used Kaiser criterion [43] to identify the number of factors that show a large variance (common variance in each factor greater than a given threshold, t) and therefore carry a large amount of the information hidden in the data. Given t = 1 the number of informationrich factors appears to be 4. Therefore, FA was performed with a growing number of such factors, from the one with higher variance, up to 5, to test the appropriateness of the variance threshold. We then confirmed the validity of a subset of the Models using LDA to identify which factor (or linear combination of factors) was able to best classify tumor grade and histopathology, based on the statistical significance of Fisher exact test [44]. This test, suited for contingency tables where one or more expected frequencies are below 5, evaluates the null hypothesis associated with LDA that there are no statistically significant differences between the a priori clinically defined groups. The models for which the null hypothesis was rejected were retained (see Table 1 and Methods for details). Therefore, we performed 4 LDA, namely between a class and its complement: i.e. high/low grade, anaplastic/nonanaplastic, glioblastoma/nonglioblastoma and gliosarcoma/nongliosarcoma, following the original classification in [6]. We did not consider oligodendroglioma relevant, because of a single sample available. Model 3 appears to be the most suitable, since it is able to discriminate between anaplastic and nonanaplastic tumors with 100% accuracy (based only on Factor 2) and the other two types of tumors with ≃ 92% accuracy. Since anaplastic tumors are low grade tumors, Factor 2 is relevant in the identification of low grade tumors in general with ≃ 92% accuracy, since the only oligodendroglioma appears to be elusive. It is worth noting that Model 4 shows the same performance scores, but with a greater number of factors and Factor 4 does not appear to be involved in class identification.
Table 1. Model Selection  Discriminant Analysis
Interpretation of Multilevel Latent Structures mRNA
Functional Analysis
Working solely on Model 3, the mRNAs in each factor were processed to detect enriched Gene Ontology (GO, [45]) terms or UniProt (SP, [46]) keywords. The magnitude and sign of the factor scores (not the factor coefficients from the eigenvectors) give their relative relationship with the expression of miRNA and mRNA. Consequently, each row in the 3 factors score matrix (F1, F2 and F3) was split into positive and negative portions (F1^{+ }and F1^{−}; F2^{+ }and F2^{−}; F3^{+ }and F3^{−}) and analyzed separately. F1^{+ }is associated with GO terms related to response to stress and external stimuli. Terms from SP keywords like secreted and glycoprotein were also found in this subset. Thus this factor appears then to be related with cell functions that process signal from the external environment to the cell with membrane receptors involved to the signal transduction. F2^{− }is also involved in the signaling, including categories related to cell adhesion, it appears then to be related to functions like chemotaxis that are involved in inflammation processes. Finally, F3^{+ }contains coding genes that are related to the biological process that goes under the general term gene expression. Gene expression includes all the mechanisms such as transcription, translation, RNA maturation, proteins transport and ubiquitination by which information coded in the DNA is converted to a functional product. All results are summarized in Table 2.
Table 2. Functional Analysis
miRNA Indirect Functional Analysis
Since miRNAs are not included in any ontology database, we performed an indirect functional analysis by screening the functional terms associated with the experimentally validated target coding genes of the miRNAs, extracted from TarBase [47]. Once the target coding genes were identified, they were manually annotated via GO terms or SP keywords, as above (see Table 3).
Table 3. Indirect Functional Analysis
mRNa/miRNA Complex Functional Annotation
We then checked the functional classification's coherence between the indirect and direct functional analysis, within each significantly annotated factor (i.e. F2^{−}, F3^{+}, since no miRNA appeared in F1^{+}). Thus, globally speaking, F1^{+ }annotation is unchanged and related to functions that are responsible for signal transduction. In F2^{−}, 3 out of 7 target coding genes (CXCL12, TM6SF1 and AGTR1) are annotated with terms that can be associated to the categories significantly varied in the mRNA functional analysis: F2^{− }is then confirmed to be a factor involved in functions related with adhesion and/or chemotaxis. For the miRNAs in F3^{+}, 5 out of 8 target coding genes (ARID4B, MYLIP, HIPK3, E2F1 and NCOA3) are functionally related with the gene expression term found in the mRNA functional analysis. Interestingly, most of the terms (4/5) are related with mechanisms of transcription regulation and only one with protein ubiquitination. After direct and indirect annotation, 2 miRNAs and 31 human coding genes in F3^{+ }were selected as belonging to the same category (see Additional file 1, Table S5). Not surprisingly, most of the coding genes in this list are not predicted to be targets of the 2 miRNAs that appear in the factor. In fact, the biological meaning of the result is a set of genetic elements that share covariability in the expression pattern and we know that, e.g. in animals, most of the control on gene expression is performed by tuning translation. Therefore, the levels of miRNAs and the mRNAs of direct targets are not directly correlated [48]. As it is also suggested in [6] we can imagine that our list of coding genes contains the possible subset of indirect targets (functionally related with the regulation of the transcription) of two miRNAs: miR175p, and miR20b. Globally, F3^{+ }is confirmed to be associated with gene expression, with transcription regulation being the most common mechanism of expression.
Emergent Properties
Since the transcription regulation term (F3^{+}) appears to give the clearest biological information, coherent in mRNAs and miRNA, we focused our efforts on this part of the analysis. The total sets of mRNAs and miRNAs returned from this analysis are listed in Table S6 and S7 of the Additional file 1. Latent Structure Chromosomal Localization: Most of the miRNAs in F3^{+ }belong to two polycistronic miRNA genes where miRNAs are lying in close proximity on the chromosome. (The named clusters are given in italics throughout the paper to improve readability and avoid confusion with clusters emerging from supervised or functional analyses). These polycistronic miRNA genes are involved in cell proliferation, apoptosis suppression, tumor angiogenesis [49] and T cell leukemia [50]. The first polycistronic gene (miR1792) is composed by 7 miRNAs and maps on Chromosome 13 whereas the second one (miR106363) maps on Chromosome × and contains 6 miRNAs, details are shown in Figure 1. The two clusters are closely related, in fact, each miRNA on one cluster has at least one homologous in the other cluster except for miR173p and miR363 that do not share homology with the other miRNAs (shown in Figure 1). As further corroborating test, we observed that, when searching the target coding genes of homologous miRNAs (miR20a, miR175p and miR106a) the list of predicted targets (Targetscan, [51]) is identical for all miRNAs. Moreover, we notice that only two homologous groups of miRNAs in the cluster (miR18 and miR92) are not part of F3^{+}. If we look at their sequence in detail we observe that they are very similar to miR20a with only two mismatches: one in the loop (miR18a and 18b) and one after the supplementary pairing region (miR18b). This can represent a partial functional redundancy since all the known key regions in target recognition are identical. Conversely, miR92 does not share any significant homology with the other members of the cluster (except for the seed region with miR363). Taking into consideration all the redundancies in the clusters, most of the transcript targets in F3^{+ }are probably under the regulation effect of the expressed miRNAs. It is worth noting that a crosshybridization effect in miRNAs could be considered the mechanism responsible for these association in clusters. But, as reported by the authors of the dataset [6], each primer and probe contained zipcoded sequences specifically assigned to each miRNA to increase the specificity of each reaction so that even small differences in miRNA were amplified and detected. So, this artifact can be discarded as explanation for the emerging of clusters of miRNA. Statistical Relevance: Interestingly, in F3^{+}, only 2 miRNAs (hsamir9 and hsamir130b) out of 7 do not belong to any of these two clusters. Their role was shown respectively to be related to the molecular pathogenesis of ovarian cancer [52] as well as to schizophrenia and Human Tcell leukemia Virus1 (HTLV1) transformation [53,54]. Six more miRNAs (miR106a, miR18a and miR18b, miR20a, miR19b1 and miR19b2) that belong to these two clusters could not be part of our analysis, as they were not part of Liu's original dataset. Given the high density of miRNAs in these clusters, we used the hypergeometric distribution to compute the probability associated with the hypothesis that a random sampling would give the same result in terms of number of cluster members in cluster miR1792 (3 members out of 4 total), in cluster miR106363 (2 members out of 3 total) and in both (5 members out of 7 total). The reference group for computing the probability consists of the total number of detected miRNAs (93). The resultant probabilities were Bonferroni corrected and were equal to 3.6 × 10^{−3}, 0.045 and 2.3 × 10^{−7 }respectively. All three are statistically significant.
Figure 1. Organization of miRNA. clusters miR1792 and miR106363. Structure of the two polycistronic miRNA gene and the relations between miRNAs.
Speculations on MolecularClinical Implications
Ultimately, we speculated on how the two clusters that emerge in F3^{+ }can, along with the molecular analysis performed on F1, discriminate between gliosarcomas and nongliosarcomas. This choice is due to the fact that our analysis has shown that the combination of factors that carry the more coherent functional information (both from miRNAs and mRNAs signals) was the combination able to discriminate glioscarcomas from other tumors. Believing that such a coherence could hide strong biological meanings we focused on gliosarcomas the efforts to detect emergent properties. This complex task, that cannot be fully explained with the data and results in hand, can take advantage of intriguing observations emerging from the analysis. We notice, in fact, that the presence of the sarcomatous element, that derives from an endothelial hyperplasic lesion [55], is a characteristic of these kinds of tumor. The hyperplasic lesion is a proliferation of vesselwall components that contains endothelial cells, myofibroblast, smooth muscle cells and other components of the vascular endothelium [56]. In [49] it is also shown that cluster miR1792 is related to solid tumors angiogenesis. The finding of this cluster, and the homologous miR106363, in the factor that contributes to discriminate gliosarcomas, could then indicate an involvement in the development of the sarcomatous element.
Identification and Interpretation of Simple Latent Structures
In this Section we present results obtained from analyzing with FA and LDA the two datasets (mRNA and miRNA) separately. Our original hypothesis dealt with the ability of the complex analysis to identify emergent properties. To evaluate this hypothesis we produced a 3 factor model with factor analysis on the two expression matrices separately. Next, we analyzed the two series of factor scores using separate LDA. In this Section we identify with F_{mi}i Factor i obtained from the miRNA dataset and with F_{m}j Factor j from the mRNA dataset (Fk continues to identify Factor k from the joint dataset. Regarding the identification of the latent structures, as expected and given the larger size of the mRNA matrix, the results in terms of discrimination power among tumor classes and the functional analysis are unchanged. However, the situation is different for the miRNA data. As shown in Table 4 only high/low grade tumors and anaplastic/non analplastic categories are predicted with the same accuracy (and on the same factor, F_{mi}2). The accuracy is lower, 0.83 (p = 0.08) versus 0.92 (p = 0.015) for the glioblastoma/nonglioblastoma category. This occurs because one of the glioblastomas is predicted as a nonglioblastoma. Furthermore, the discrimination appears to be based on a linear model composed only by F_{mi}1 and not on a combination (see F1 and F2 in the complex analysis). The discrimination between gliosarcomas and its dual class is the worst, as accuracy drops to 0.75 (p = 0.23) and F_{mi}3 is not used in discrimination. For what concerns the interpretation of the latent structures, out of the 18 miRNAs selected, 9 are in common with the joint analysis and 9 represent a new set of miRNAs. Five of the miRNAs in the new set are associated with biological terms, and only one (hsamiR126) is shared by more than one factor (F_{mi}1 and F_{mi}2). F_{mi}1 contains 5 terms, F_{mi}2 2 terms (a subset of F_{mi}1) and F_{mi}3 2 terms (for details see Additional file 1, Table S8). These are related with the regulation of the transcription (in F_{mi}1 and F_{mi}3) and they show some overlap with the mRNAs Factors annotation. Namely, biological terms in F_{mi}1 overlap with all the three F_{m }whereas terms in F_{mi}2 overlap only with F_{m}2. Terms in F_{mi}3 are found both in F_{m}2 and F_{m}3. With respect to the comparison to the complex analysis, since these miRNAs are mostly clustered in homologous factors it is possible to associate F_{mi}3 with F1, F_{mi}2 with F2 and F_{mi}3 with F1). The miRNAs shared with the complex analysis and that return an annotation are in F_{mi}2 (both miR155 and miR23a) and F_{mi}3 (miR155). However, without the joint analysis there is no obvious rationale to associate miRNA factors with mRNA factors. This is because, crucially, the 18 miRNAs obtained are distribuited over factors that are decoupled from the factors returned from the simple mRNA data analysis. Therefore this approach does not suggest any obvious association between the two sets of factors. As a consequence, the interpretation of this latter (simple) analysis is limited to the indirect functional annotation of this small set of miRNA (Additional file 1, Table S8). Therefore, the activation of the polycistronic clusters miR1792 and miR106363 does not emerge when miRNAs are analysed separately. In summary, combining the two datasets and applying FA and LDA, provides an obvious way to associate the translational and posttranslational information. In particular, although the mRNA latent structure is the same in the simple and complex analysis, and consequently the functional annotation is the same, hidden signals present in the smaller dataset (miRNA set) appear to be amplified by the signals present in the larger dataset (mRNA set) thanks to their association in a common latent structure.
Table 4. Performances of Model 3 using only miRNA data.
Conclusions
The capability to discriminate between a priori defined classes can be achieved in a variety of ways (a comparison with supervised and unsupervised algorithms is provided in the Additional file 1). However, the capacity to generate factors explaining the complexity of the molecular interactions requires the ability to construct multilevel clusters. With the data at hand we showed that this cannot be achieved in parallel analysis (versus simultaneous or joint) of the two datasets (mRNA and miRNA) or with other approaches we evaluated. The interpretation of factors based on associating them to mRNA/miRNAs represents the major contribution of this work. Certainly, the study of [6] shows sample size limitations (12 patients enrolled) therefore our analyses must be considered as an exemplar of the factor analysis approach. Globally, based on this analysis, since the miRNAs in F3^{+ }belong to two redundant clusters of miRNA, we can speculate that: 1) one of the biological functions in which these clusters could be involved is the regulation of the transcription and 2) in some way, in brain tumors these two clusters are active whereas, in normal cells, only miR1792 appears to be constitutively expressed. Probably both clusters act on the same set of coding genes, but the two loci are regulated separately in normal cells [50]. Nevertheless, despite this strong relationship between the 2 clusters it is difficult to understand how this redundancy works effectively in cells. However, the finding of a possible activation of the polycistronic genes miR1792 and miR106363 represents an encouraging evidence that the factorization of the miRNA and mRNA data can reveal latent structure in the configuration of the expression levels in tumor samples. Despite obvious limitations, we believe our results clearly show that this approach is a very powerful one for the study of multilevel omic data, which in turn can bring more insight into understanding the complex mechanisms of the transmission of information in the cell as a whole.
Methods
In this work, we applied FA to the dataset from [6]. These data consist of 12 microarray samples (for mRNA genomewide expression, around 14,500 coding genes) and 12 realtime PCR (for the profile of 93 miRNAs), performed on the same 12 human primary brain tumor biopsies (details in Additional file 1, Table S1). On this test case dataset, we first identified the best FA model (i.e. the appropriate number of factors) based on the models' ability to explain the relevant clinical and histopathological information. Next, we characterized the factors based on 3 properties: 1) their ability to discriminate among tumor types this was done using Linear Discriminant Analysis (LDA, [25]), a supervised classifier able to find the linear combination of factors which best separates two predefined classes; 2) their functional biological characterization with the help of literature and databases; 3) their complex biological characterization, by searching novel properties emerging from the joint analysis of miRNA and mRNAs. The procedure is summarized in Figure 2.
Figure 2. Schematic view of the complex analysis performed jointly on mRNA and miRNAs from the same 12 tumor samples.
Data Preprocessing
Data from [6] were transformed by computing log_{2 }of the intensity value of mRNA expression (miRNA data come already in log_{2 }from realtime PCR). Quality selection filtering was performed removing every row (mRNA or miRNA expression across 12 experiments) with maximum fold change below 2.5; this reduced the dataset from 7182 IDs to 4966 IDs. The filtering was decided to select genetic elements with strong signal of variation. This criterion was selected as natural consequence of the filtering performed by the authors of the dataset [6] that used the same conditions to reduce the number of the IDs. Data were also normalized in different ways according to:
• , were M_{i }and m_{i }are the maximum and minimum values in the ith row, and x_{ij }is the expression of gene i on sample j.
• , where μ_{i}, is the average expression level in the ith row, and x_{ij }is the expression of gene i on sample j.
The two methods map the expression level in an interval comprised between 0 and 1 the first and μ_{i }and μ_{i }+ 1 the second (in order to introduce in the model also the difference in expression beween genes). The two normalizations give identical results in the Factor Analysis step as expected. In fact, expression signals obtained from qPCR are different from signals obtained from microarrays due to the extended dynamic range of the former. It is common [57,58], in order to validate a set of coding genes obtained by microarray, to express the mRNA level in each sample as a fraction of the expression level in the sample in which that mRNA is most abundant. So, from this point on, miRNA and mRNA expression data were analyzed together, as a single expression table with normalization .
Factor Analysis
The Factor Analysis model can be defined in matrix notation as: D = LF + ε, where D(m × n) represents the data matrix, L(m × l) is the factors loadings matrix, F(l × n) is the factors scores matrix and ε(m × n) is the unique factors matrix. Furthermore, m are the number of samples, n the number of genetic elements and l the number of factors. Our model assumes that F and ε are indipendent, E(F) = 0, and Cov(F) = I. Under these conditions Cov(D) = LL^{T }+ Cov(ε), for the sake of clarity LL^{T }is named communality and Cov(ε) uniqueness. Variability in a human tumor expression dataset arises from several sources besides tumor type, including human variability (sex, age, race) and experimental variability (systematic and stochastic errors). Available information is about tumor types, therefore, our model explicitly involves tumor types variability, and groups other causes within the ε term, showing the power of the FA method. In our work, we were interested in discovering the hidden or latent structure within tumor types, therefore FA is applied using the model D = X^{T}. The Rpackage HDMD developed by Lisa McFerrin at North Carolina State University was used to take advantage of the principal axes algorithm. Communalities were estimated by iteratively updating the diagonal of the correlation matrix and solving the eigenvector decomposition. Axes were rotated to simple structure using the Promax algorithm to improve their interpretability. The simple structure obtained after rotation meets the requirements proposed by Thurstone [59,60] to ensure the stability of FA results. The factor score matrix was analyzed for each of the 5 models (from 1 to 5 embedded factors). The scores associated to the genes within each factor were ranked in descending order. All 3 factors presented a similar scores distribution with average μ ≃ 0 and standard deviation σ ≃ 0.75. Selection has been performed by looking at the value distribution of each row of matrix F and then considering as genes associated with a factor only those whose corresponding score is outside the 2σ interval. In this way, only genes with a strong relation in the same factor were selected.
Discriminant Analysis
The factor loadings coefficients matrix of each model was used to perform LDA. Four dichotomous categories (given by a class and its negate, e.g. glioblastoma/nonglioblastoma etc.) were defined (Table 1). LDA was also performed to assess the most likely class of sample T18 which had an ambiguous classification (glioblastoma/gliosarcoma), see Additional file 1, Table S2. Rpackage MASS [61], function lda() configured to perform a classical crossvalidation classification (jackknife method, also known as leaveoneout validation) was used. In particular we used a stepwise greedy strategy, i.e. checking performances with one factor, and adding another factor, iteratively. All possible equivalent combination of factors were tested, and the most performant with the smallest number of factors involved was chosen.
Model Selection
To evaluate the performances of each factor model on the four tumor classes, we evaluated the contingency table obtained from the discriminant analysis by Fisher's exact test. The null hypothesis assuming that the discrimination between two tumor classes is due to chance was rejected for p < 0.05. For models with similar prediction scores we kept the one with fewer factors.
Functional Classification
On both FA and clustering (used as alternative method to our approach, see Additional file 1) functional analysis was performed using the online tool DAVID [62,63] using GO terms, Kegg pathways terms, SP keywords and features and InterPro terms. The whole list of 4876 probe ID was used as background population. In order to reduce the number of non significant associations, a resulting functional cluster was further analyzed if and only if it contained at least one category with Benjamin score < 0.05. The indirect functional analysis performed to describe miRNAs relevance was performed by searching manually in TarBase [47] all the known coding genes that are target of the miRNAs identified by the FA and clustering. Then for each gene a list with all the associated GO terms was compiled. Due to the small number of targets obtained no pvalue could be associated to any GO term.
Authors' contributions
RF analyzed the data with the help of MT and provided the biological interpretation. WRA provided strong theoretical support for the study, CN ideated the study and wrote the paper with the contribution of WRA and RF. All authors have read and approved the final manuscript.
Acknowledgements
The authors would like to thank prof Casadio for organizing and actively taking part in the exchange between the University of Bologna and PICB, for her enthusiasm and knowledge. The authors would like to thank prof Cavalcanti for actively and ethusiastically contributing to the exchange as well, and they would particularly like to acknowledge him, since, to the regret of all who knew him, has unexpectedly passed away since the exchange was set up. This work is funded by the SinoSwiss Science and Technology Cooperation Project (Grant no.:GJHZ0911). WRA's participation was supported by a CAS Distinguished International Professorship. RF and MT are Fellows of the Official Exchange Agreement between the University of Bologna and MPGCAS PICB.
References

Guiducci C, Nardini C: High Parallelism, Portability and Broad Accessibility: Technologies for Genomics.

Holt RA, Jones SJ: The new paradigm of flow cell sequencing.
Genome Res 2008, 18(6):839846. PubMed Abstract  Publisher Full Text

Wang Z, Gerstein M, Snyder M: RNASeq: a revolutionary tool for transcriptomics.
Nat Rev Genet 2009, 10:5763. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Volinia S, Calin GA, Liu CG, Ambs S, Cimmino A, Petrocca F, Visone R, Iorio M, Roldo C, Ferracin M, Prueitt RL, Yanaihara N, Lanza G, Scarpa A, Vecchione A, Negrini M, Harris CC, Croce CM: A microRNA expression signature of human solid tumors defines cancer gene targets.
Proc Natl Acad Sci USA 2006, 103(7):22572261. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yang N, Kaur S, Volinia S, Greshock J, Lassus H, Hasegawa K, Liang S, Leminen A, Deng S, Smith L, Johnstone CN, Chen XM, Liu CG, Huang Q, Katsaros D, Calin GA, Weber BL, Bützow R, Croce CM, Coukos G, Zhang L: MicroRNA microarray identifies Let7i as a novel biomarker and therapeutic target in human epithelial ovarian cancer.
Cancer Res 2008, 68(24):1030710314. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Liu T, Papagiannakopoulos T, Puskar K, Qi S, Santiago F, Clay W, Lao K, Lee Y, Nelson SF, Kornblum HI, Doyle F, Petzold L, Shraiman B, Kosik KS: Detection of a microRNA signal in an in vivo expression set of mRNAs.
PLoS One 2007, 2(8):e804. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lanza G, Ferracin M, Gaf`a R, Veronese A, Spizzo R, Pichiorri F, gong Liu C, Calin GA, Croce CM, Negrini M: mRNA/microRNA gene expression profile in microsatellite unstable colorectal cancer.
Mol Cancer 2007, 6:54. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Panguluri SK, Bhatnagar S, Kumar A, McCarthy JJ, Srivastava AK, Cooper NG, Lundy RF, Kumar A: Genomic profiling of messenger RNAs and microRNAs reveals potential mechanisms of TWEAKinduced skeletal muscle wasting in mice.
PLoS One 2010., 5 Publisher Full Text

Butte A: The use and analysis of microarray data.
Nature Reviews Drug Discovery 2002, 1:951960. PubMed Abstract  Publisher Full Text

Quackenbush J: Computationa Analysis of Micorarray Data.
Nat Rev Genet 2001, 2(6):418427. PubMed Abstract  Publisher Full Text

Madeira SC, Oliveira AL: Biclustering Algorithms for Biological Data Analysis: A Survey.
IEEE/ACM Transactions on Computational Biology and Bioinformatics 2004, 1:2445. Publisher Full Text

Margolin AA, Nemenman I, Basso K, Klein U, Wiggins C, Stolovitzky G, Favera RD, Califano A: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context.

Meyer PE, Lafitte F, Bontempi G: minet: A R/Bioconductor package for inferring large transcriptional networks using mutual information.
BMC Bioinformatics 2008, 9:461461. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Neretti N, Remondini D, Tatar M, Sedivy JM, Pierini M, Mazzatti D, Powell J, Franceschi C, Castellani GC: Correlation analysis reveals the emergence of coherence in the gene expression dynamics following system perturbation.
BMC Bioinformatics 2007., 8(Suppl 1) PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response.
Proc Natl Acad Sci 2001, 98(9):51165121. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstråle M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: PGC1αresponsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes.
Nat Genet 2003, 34(3):267273. PubMed Abstract  Publisher Full Text

Kitano H: Systems Biology: A Brief Overview.
Science 2002, 295(5560):16621664. PubMed Abstract  Publisher Full Text

Hocquette JF: Where are we in genomics?
Journal of Physiology and Pharmacology 2005, 56(3):3770. PubMed Abstract  Publisher Full Text

Ahn AC, Tewari M, Poon CS, Phillips RS: The Limits of Reductionism in Medicine: Could Systems Biology Offer an Alternative?
PLoS Medicine 2006, 3(6):e208. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ahn AC, Tewari M, Poon CS, Phillips RS: The Clinical Applications of a Systems Approach.
PLoS Medicine 2006, 3(7):e209. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yao J, Weremowicz S, Feng B, Gentleman RC, Marks JR, Gelman R, Brennan C, Polyak K: Combined cDNA array comparative genomic hybridization and serial analysis of gene expression analysis of breast tumor progression.
Cancer Res 2006, 66(8):40654078. PubMed Abstract  Publisher Full Text

LindbladToh K, Tanenbaum DM, Daly MJ, Winchester E, Lui WO, Villapakkam A, Stanton SE, Larsson C, Hudson TJ, Johnson BE, Lander ES, Meyerson M: Lossofheterozygosity analysis of smallcell lung carcinomas using singlenucleotide polymorphism arrays.
Nat Biotechnol 2000, 18(9):10011005. PubMed Abstract  Publisher Full Text

Yang TP, Chang TY, Lin CH, Hsu MT, Wang HW: ArrayFusion: a web application for multidimensional analysis of CGH, SNP and microarray data.
Bioinformatics 2006, 22(21):26972698. PubMed Abstract  Publisher Full Text

Mijalski T, Harder A, Halder T, Kersten M, Horsch M, Strom TM, Liebscher HV, Lottspeich F, de Angelis MH, Beckers J: Identification of coexpressed gene clusters in a comparative analysis of transcriptome and proteome in mouse tissues.
Proc Natl Acad Sci USA 2005, 102(24):86218626. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Johnson RA, Wichern DW: Applied Multivariate Statistical Analysis. Upper Saddle River, NJ: Prentice Hall; 2002.

Sherlock G: Analysis of largescale gene expression data.
Brief Bioinform 2001, 2(4):35062. PubMed Abstract  Publisher Full Text

Peterson LE: Factor analysis of clusterspecific gene expression levels from cDNA microarrays.
Comput Methods Programs Biomed 2002, 69(3):17988. PubMed Abstract  Publisher Full Text

Lozano JJ, Soler M, Bermudo R, Abia D, Fernandez PL, Thomson TM, Ortiz AR: Dual activation of pathways regulated by steroid receptors and peptide growth factors in primary prostate cancer revealed by Factor Analysis of microarray data.
BMC Genomics 2005, 6:109. PubMed Abstract  BioMed Central Full Text

Crijns APG, Gerbens F, Plantinga AED, Meersma GJ, de Jong S, Hofstra RMW, de Vries EGE, van der Zee AGJ, de Bock GH, te Meerman GJ: A biological question and a balanced (orthogonal) design: the ingredients to eciently analyze twocolor microarrays with Confirmatory Factor Analysis.
BMC Genomics 2006, 7:232. PubMed Abstract  BioMed Central Full Text

Jollie T: Principal Component Analysis. New York: SpringerVerlag New York Inc; 1986.

Langley P, Iba W, Thompson K: An analysis of Bayesian classifiers.

Friedman N: The bayesian structural em algorithm.
Proceedings of the Conference on Uncertainty in Artificial Intelligence 1998, 98:129138.

Persson O, Krogh M, Saal LH, Englund E, Liu J, Parsons R, Mandahl N, Borg A, Widegren B, Salford LG: Microarray analysis of gliomas reveals chromosomal positionassociated gene expression patterns and identifies potential immunotherapy targets.
J Neurooncol 2007, 85:1124. PubMed Abstract  Publisher Full Text

Furey TS, Cristianini N, Duy N, Bednarski DW, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray expression data.
Bioinformatics 2000, 16(10):90614. PubMed Abstract  Publisher Full Text

Theilhaber J, Connolly T, RomanRoman S, Bushnell S, Jackson A, Call K, Garcia T, Baron R: Finding genes in the C2C12 osteogenic pathway by knearestneighbor classification of expression data.
Genome Res 2002, 12:16576. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Atchley WR, Zhao J, Fernandes AD, Drüke T: Solving the protein sequence metric problem.
Proc Natl Acad Sci USA 2005, 102(18):63956400. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Peterson LE: Factor analysis of clusterspecific gene expression levels from cDNA microarrays.
Comput Methods Programs Biomed 2002, 69(3):179188. PubMed Abstract  Publisher Full Text

Lozano JJ, Soler M, Bermudo R, Abia D, Fernandez PL, Thomson TM, Ortiz AR: Dual activation of pathways regulated by steroid receptors and peptide growth factors in primary prostate cancer revealed by Factor Analysis of microarray data.
BMC Genomics 2005, 6:109109. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Crijns AP, Gerbens F, Plantinga AE, Meersma GJ, de Jong S, Hofstra RM, de Vries EG, van der Zee AG, de Bock GH, te Meerman GJ:
BMC Genomics. 2006, 7:232232. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Kasim A, Lin D, Van Sanden S, Clevert DA, Bijnens L, Göhlmann H, Amaratunga D, Hochreiter S, Shkedy Z, Talloen W: Informative or Noninformative Calls for Gene Expression: A Latent Variable Approach.
Statistical Applications in Genetics and Molecular Biology 2010, 9:Article 4. PubMed Abstract  Publisher Full Text

Pournara I, Wernisch L: Factor analysis for gene regulatory networks and transcription factor activity profiles.
BMC Bioinformatics 2007, 8:6161. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Sabatti C, James GM: Bayesian sparse hidden components analysis for transcription regulation networks.
Bioinformatics 2006, 22(6):739746. PubMed Abstract  Publisher Full Text

von Zerssen D: Psychiatric syndromes from a clinical and a biostatistical point of view.
Psychopathology 1985, 18(23):8897. PubMed Abstract  Publisher Full Text

Consortium TGO: Creating the Gene Ontology Resource: Design and Implementation.
Genome Res 2001, 11(8):14251433. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LSL: The Universal Protein Resource (UniProt).
Nucleic Acids Res 2005, (33 Database):D1549. PubMed Abstract  PubMed Central Full Text

Papadopoulos GL, Reczko M, Simossis VA, Sethupathy P, Hatzigeorgiou AG: The database of experimentally supported targets: a functional update of TarBase.
Nucleic Acids Res 2009, (37 Database):D1558. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Williams AE: Functional aspects of animal microRNAs.
Cell Mol Life Sci 2008, 65(4):54562. PubMed Abstract  Publisher Full Text

Mendell JT: miRiad roles for the miR1792 cluster in development and disease.
Cell 2008, 133(2):21722. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Landais S, Landry S, Legault P, Rassart E: Oncogenic potential of the miR106363 cluster and its implication in human Tcell leukemia.
Cancer Res 2007, 67(12):5699707. PubMed Abstract  Publisher Full Text

Lewis BP, Burge CB, Bartel DP: Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets.
Cell 2005, 120:1520. PubMed Abstract  Publisher Full Text

Laios A, O'Toole S, Flavin R, Martin C, Kelly L, Ring M, Finn SP, Barrett C, Loda M, Gleeson N, D'Arcy T, McGuinness E, Sheils O, Sheppard B, O' Leary J: Potential role of miR9 and miR223 in recurrent ovarian cancer.
Mol Cancer 2008, 7:35. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Burmistrova OA, Goltsov AY, Abramova LI, Kaleda VG, Orlova VA, Rogaev EI: MicroRNA in schizophrenia: genetic and expression analysis of miR130b (22q11).
Biochemistry (Mosc) 2007, 72(5):57882. PubMed Abstract  Publisher Full Text

Yeung ML, ichirou Yasunaga J, Bennasser Y, Dusetti N, Harris D, Ahmad N, Matsuoka M, Jeang KT: Roles for microRNAs, miR93 and miR130b, and tumor protein 53induced nuclear protein 1 tumor suppressor in cell growth dysregulation by human Tcell lymphotrophic virus 1.
Cancer Res 2008, 68(21):897685. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Feigin IH, Gross SW: Sarcoma arising in glioblastoma of the brain.
Am J Pathol 1955, 31(4):63353. PubMed Abstract  PubMed Central Full Text

Kishikawa M, Tsuda N, Fujii H, Nishimori I, Yokoyama H, Kihara M: Glioblastoma with sarcomatous component associated with myxoid change. A histochemical, immunohistochemical and electron microscopic study.
Acta Neuropathol 1986, 70:4452. PubMed Abstract  Publisher Full Text

Wu H, Neilson JR, Kumar P, Manocha M, Shankar P, Sharp PA, Manjunath N: miRNA profiling of naive, effector and memory CD8 T cells.
PLoS One 2007, 2(10):e1020. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wang H, Ach RA, Curry B: Direct and sensitive miRNA profiling from lowinput total RNA.
RNA 2007, 13:1519. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Thurstone LL: A single plane method of rotation.
Psychometrika 1946, 11:719. PubMed Abstract  Publisher Full Text

Thurstone LL: Factorial analysis of body measurements.
Am J Phys Anthropol 1947, 5:1528. PubMed Abstract  Publisher Full Text

Venables WN, Ripley BD: [http://www.stats.ox.ac.uk/pub/MASS4] webcite
Modern Applied Statistics with S. fourth edition. New York: Springer; 2002.
ISBN 0387954570

Dennis GJ, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery.
Genome Biol 2003, 4(5):P3. PubMed Abstract  BioMed Central Full Text

Huang DW, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources.
Nat Protoc 2009, 4:4457. PubMed Abstract  Publisher Full Text