Computer Architecture Department, Facultad de Ciencias Físicas, Universidad Complutense de Madrid, 28040, Spain

BioComputing Unit, National Center of Biotechnology, Campus Universidad Autónoma de Madrid, 28049, Spain

The KEY Institute for Brain-Mind Research, University Hospital of Psychiatry. Lenggstr. 31, CH-8029 Zurich, Switzerland

Abstract

Background

In the bioinformatics field, non-negative matrix factorization (NMF) has attracted a great deal of interest because it can provide new insights and relevant information about the complex latent relationships in experimental data sets. This method, and some of its variants, have been successfully applied to gene expression, sequence analysis, functional characterization of genes and text mining. Although interest in this technique within the bioinformatics community has increased during the last few years, few simple standalone tools are available to perform these types of data analysis in an integrated environment.

Results

In this work we propose a versatile and user-friendly tool that implements the NMF methodology in different analysis contexts to support some of the most important reported applications of this new methodology. These include clustering and biclustering of gene expression data, protein sequence analysis, text mining of biomedical literature and sample classification using gene expression. The tool, named bioNMF, also provides a user-friendly graphical interface to explore results interactively and thereby facilitate exploratory data analysis.

Conclusion

bioNMF is a standalone versatile application which does not require any special installation or libraries. It can be used for most of the multiple applications proposed in the bioinformatics field or to support new research using this method. This tool is publicly available at

Background

The development of "omics" technologies has represented a revolution in biomedical research allowing the study of biological systems from a global perspective. These high-throughput techniques generate vast amounts of data which have required the development and application of sophisticated statistical and machine learning methodologies in order to analyze and extract biological knowledge.

Matrix factorization techniques have become well established methods for the analysis of such datasets. These methods can be applied to the analysis of multidimensional datasets in order to reduce the dimensionality, discover patterns and aid in the interpretation of the data. Among the most popular, Principal Component Analysis (PCA), Singular Value Decomposition (SVD) or Independent Component Analysis (ICA) have been successfully used in a broad range of contexts such as transcriptomics

In 1999 Lee and Seung developed a novel matrix factorization technique named Non-Negative Matrix Factorization (NMF)

Despite the increasing use of NMF in Bioinformatics, most of its implementations are only available as MATLAB (Mathworks, Natick, MA) toolboxes, command line programs

Implementation

bioNMF has been implemented as a single standalone application for the Microsoft Windows platform. The application is written in Borland Delphi version 7 and does not require any special installation or libraries; bioNMF is self-contained in a single application file. Analysis using bioNMF can be executed in three steps:

1)

2)

3)

Non-negative matrix factorization model

NMF is a matrix factorization algorithm, originally introduced by Lee and Seung, that approximates **V ≈ WH**, where **V** ∈ ℝ^{m×n} is a positive data matrix, the columns of **W** ∈ ℝ^{m×k} are the reduced basis vectors, and **H** ∈ ℝ^{k×n} contains the coefficients of the linear combinations of the basis vectors needed to reconstruct the original data (also known as encoding vectors). The main difference between NMF and other classical factorization models lies in the non-negativity constraints imposed on both the basis **W** and the encoding vectors **H**. In this way, only additive combinations are possible. The number of factors (

In the case of gene expression analysis, for example, the expression data matrix **V **can be represented as a gene-experiment matrix, where **W**, therefore, will have the dimension of a single array (**H **are known as encoding vectors and are in one-to-one correspondence with a single experiment of the gene expression data matrix. Consequently, each row of **H **has the dimension of a single gene (
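The factorization model above can be illustrated with a minimal sketch. It assumes the Euclidean-distance multiplicative update rules attributed to Lee and Seung; the function names `nmf` and `relative_error`, the iteration counts, the small constant `eps` and the toy matrix are illustrative assumptions and are not taken from the bioNMF source code.

```python
import numpy as np

def nmf(V, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate a non-negative V (m x n) as W (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Random non-negative starting point (hypothetical initialization scheme).
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates: every factor is a ratio of non-negative
        # terms, so W and H can never become negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def relative_error(V, W, H):
    """Frobenius norm of the residual, relative to the norm of V."""
    return np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Toy data: a non-negative matrix that has an exact rank-2 factorization.
data_rng = np.random.default_rng(1)
V = data_rng.random((6, 2)) @ data_rng.random((2, 5))

W_short, H_short = nmf(V, k=2, n_iter=5)
W, H = nmf(V, k=2, n_iter=500)
err_short = relative_error(V, W_short, H_short)
err_long = relative_error(V, W, H)
print(W.shape, H.shape, err_short, err_long)
```

Because both runs start from the same seed, the longer run continues the shorter one, and its reconstruction error should not be larger. Only multiplications and divisions of non-negative quantities are involved, which is what keeps the combinations purely additive.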

Results and discussion

The main window of the bioNMF application is divided into three groups: data input, data transformation and analysis (see Figure

**Main bioNMF window**. The tool is divided into three main functional modules: data input, data transformation and analysis modules (standard NMF, bicluster analysis and sample classification). Standard NMF implements the classical Lee and Seung NMF algorithm. Bicluster analysis uses a sparse variant of the NMF model while Sample classification implements an unsupervised classification method that uses NMF to classify experimental samples.

The following sections describe in detail the three main analysis modules implemented in bioNMF. More information and step-by-step examples are included on the project web site.

Standard NMF

This module performs the classical NMF factorization using the algorithm proposed by Lee and Seung

Kim and Tidor applied NMF for gene expression data analysis in yeast **V **as an

Another NMF application proposed in the context of data analysis in biology is text mining **V **can be modeled as a gene-document collection represented in a vector space model, where **V **is an

Given a factorization rank (**W**, **H**). Each factor (column) in the matrix **W **corresponds to a semantic feature (described as weighted sum of terms) while each column in **H **corresponds to the new representation for a gene as a linear combination of semantic factors (gene semantic profile). To provide both a more comprehensive representation of the genes, and a more robust clustering, Chagoyen **H **matrices from different random runs. E.g. clustering of genes according to similar semantic profiles.

NMF analysis has also been used for the identification of sequence patterns conserved in subgroups of proteins in diverse superfamilies **V** corresponds to a generalized sequence space (**W** are the basis vectors of the reduced space, and **H** is the encoding in the new basis. The coefficients of the attributes in a basis vector (column in **W**) reflect the frequency of particular residues in the corresponding protein set. In **W** matrix obtained at different ranks (namely

Finally, NMF has been described to perform functional categorization of genes **V **of size **W **describes the loadings of the genes on the **W**, providing in this way insights about the most prominent functional categories for each gene.

Due to the non-deterministic nature of NMF, results might differ from one run to another. To minimize this effect and to select the best factorization, it is crucial to repeat the process using different random initializations for matrices **W** and **H**. The Standard NMF module provides this functionality using two methods: 1) repeat the process a predetermined number of times and select the best solution (the one that maximizes the explained variance); 2) combine different random runs in a single output file, as proposed in

**Standard NMF module**. This functional module implements the classical NMF algorithm. Different random runs can be executed and results can be either combined in a single output file or saved independently. The application selects the best run based on the minimum error of the model.

Standard NMF is therefore a wide-ranging analysis module that is not focused on any particular analysis but is oriented to any potential application that might use this factorization method.
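The repeated-runs strategy described for this module (several random initializations, keeping the run that maximizes the explained variance) might be sketched as follows. The definition of explained variance used here, one minus the residual sum of squares over the total sum of squares around the mean of **V**, is an assumption, as are the helper names `nmf`, `explained_variance` and `best_of_runs` and the Euclidean update rules; bioNMF may compute these quantities differently.

```python
import numpy as np

def nmf(V, k, rng, n_iter=200, eps=1e-9):
    """One NMF run from a random start drawn from the generator rng."""
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def explained_variance(V, W, H):
    """Assumed definition: 1 - residual SS / total SS around the mean of V."""
    residual = np.sum((V - W @ H) ** 2)
    total = np.sum((V - V.mean()) ** 2)
    return 1.0 - residual / total

def best_of_runs(V, k, n_runs=10, seed=0):
    """Run NMF n_runs times and return the best run plus all scores."""
    rng = np.random.default_rng(seed)  # each run draws a fresh random start
    runs = []
    for _ in range(n_runs):
        W, H = nmf(V, k, rng)
        runs.append((explained_variance(V, W, H), W, H))
    best = max(runs, key=lambda run: run[0])
    return best, [run[0] for run in runs]

# Toy non-negative data with an exact rank-3 structure.
data_rng = np.random.default_rng(1)
V = data_rng.random((8, 3)) @ data_rng.random((3, 6))

(best_score, W_best, H_best), scores = best_of_runs(V, k=3, n_runs=5)
print(scores, best_score)
```

Selecting by minimum reconstruction error, as mentioned in the module description, would give the same ranking under this definition, since the total sum of squares is the same for every run.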

Gene expression bicluster analysis

One of the main goals in the analysis of large and heterogeneous gene expression datasets is to identify groups of genes that are co-expressed in subsets of experimental conditions. The identification of these local structures plays a key role in understanding the biological events associated with different physiological states, as well as in identifying gene expression signatures. Classical one-way clustering techniques, especially hierarchical clustering, have been commonly applied to cluster genes and samples separately in order to identify these types of local patterns. In the last few years, many authors have proposed the use of two-way clustering methods (also known as biclustering algorithms) to identify gene-experiment relationships. For a review see

bioNMF estimates biclusters using a novel method based on a modified variant of the Non-negative Matrix Factorization algorithm, which produces a suitable decomposition as a product of three matrices that are constrained to have non-negative elements. The new methodology, denoted as Non-smooth Non-negative Matrix Factorization (

Similarly to NMF, the non-smooth non negative matrix factorization model is used to approximately reproduce a gene expression matrix **V **with **W**, **H **and **S **(**V = WSH**), with dimensions **W **have the dimension of a single array (**H **has the dimension of a single gene (**S**, on the other hand, is denoted as smoothing matrix and its task is to demand sparseness in both **W **and **H**.

For details of the algorithm see **S**, each factor obtained by **H **are used to determine the set of experimental conditions highly associated to these modules. In other words, the set of genes and experimental conditions that show high values in the same basis experiment (i^{th }column of **W**) and its corresponding basis gene (i^{th }row of **H**), respectively, are highly related in only a sub-portion of the data and constitute a gene expression bicluster.
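To make the role of the smoothing matrix more concrete, the sketch below builds **S** in the form used in the published nsNMF formulation as we understand it, S = (1 − θ)I + (θ/k)11ᵀ, where θ is a smoothing parameter between 0 and 1. The function name `smoothing_matrix` and the toy matrix are illustrative assumptions, and the sketch shows only the smoothing step, not the full nsNMF update loop.

```python
import numpy as np

def smoothing_matrix(k, theta):
    """S = (1 - theta) * I + (theta / k) * ones(k, k), with 0 <= theta <= 1."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return (1.0 - theta) * np.eye(k) + (theta / k) * np.ones((k, k))

# A perfectly sparse toy encoding matrix with k = 2 factors.
H = np.array([[4.0, 0.0],
              [0.0, 2.0]])

S_none = smoothing_matrix(2, 0.0)   # identity: plain NMF, no smoothing
S_half = smoothing_matrix(2, 0.5)   # partial smoothing
S_full = smoothing_matrix(2, 1.0)   # every column replaced by its mean

smoothed = S_half @ H
print(smoothed)
print(S_full @ H)
```

Multiplying by **S** spreads each column's weight across the factors while preserving column sums. Because the product **WSH** must still reproduce **V**, this smoothing has to be compensated by sparser **W** and **H**, which is the sense in which **S** demands sparseness in both matrices.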

Once the factorization has been completed, results can be explored using a graphical user interface (see Figure

**Graphical User Interface for biclustering application**. Each factor is used to sort the original data matrix to emphasize the clustering structure of the data. Biclusters can be browsed in textual and graphical format. Thresholds to select the biclusters of interest can be set interactively.
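The sorting and thresholding idea described for the biclustering interface could look roughly like the following. The thresholding rule (keep entries above a fraction of the factor's maximum), the function name `extract_bicluster` and the hand-written toy matrices are hypothetical; in bioNMF the thresholds are chosen interactively by the user.

```python
import numpy as np

def extract_bicluster(W, H, factor, gene_frac=0.5, cond_frac=0.5):
    """Genes and conditions with high values in one factor.

    Genes are ranked by the factor's column of W and conditions by the
    factor's row of H. The fraction-of-maximum cut-offs are hypothetical.
    """
    w = W[:, factor]
    h = H[factor, :]
    genes = [int(g) for g in np.argsort(-w) if w[g] >= gene_frac * w.max()]
    conds = [int(c) for c in np.argsort(-h) if h[c] >= cond_frac * h.max()]
    return genes, conds

# Toy sparse factors: 4 genes, 4 conditions, 2 factors.
W = np.array([[0.90, 0.00],
              [0.80, 0.10],
              [0.05, 0.70],
              [0.00, 0.90]])
H = np.array([[1.0, 0.9, 0.0, 0.1],
              [0.0, 0.1, 0.8, 1.0]])

genes0, conds0 = extract_bicluster(W, H, factor=0)
genes1, conds1 = extract_bicluster(W, H, factor=1)
print(genes0, conds0)
print(genes1, conds1)
```

Indices are returned in descending order of their factor value, which mirrors the sorted view of the data matrix described in the caption.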

Similarly to the standard NMF module, the biclustering process allows the multiple execution of the **W **and **H**. The solution that best reproduces the original data matrix is then selected for the analysis.

Regarding processing time, this algorithm takes one minute and twenty seconds on a 2.1 GHz Pentium M processor to run 1000 iterations on a data set containing 4585 genes and 46 experimental conditions.

Sample classification

This module implements the approach proposed in

To determine the most suitable number of meaningful clusters for a given dataset, a model selection method that exploits the stochastic nature of the NMF algorithm was also implemented in bioNMF, as proposed in

This model selection method is based on the idea of consensus clustering **H**, contain a maximum value in the same factor (same row). The entries of the consensus matrix then range from 0 to 1 and reflect the probability that samples

The samples (rows and columns) of the consensus matrix are then reordered using the average linkage method to provide visual insights of the clustering stability (Figure
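The consensus matrix described above can be sketched in a few lines. Each run assigns a sample to the factor (row of **H**) in which it has its maximum value, and the consensus entry for a pair of samples is the fraction of runs in which both fall in the same factor. The function names and the hand-written **H** matrices are illustrative assumptions. The average-linkage reordering and the cophenetic correlation coefficient are omitted here.

```python
import numpy as np

def connectivity(H):
    """1.0 where two samples peak in the same factor (row of H), else 0.0."""
    labels = np.argmax(H, axis=0)  # one cluster label per sample (column)
    return (labels[:, None] == labels[None, :]).astype(float)

def consensus(H_runs):
    """Average the connectivity matrices of several NMF runs."""
    return np.mean([connectivity(H) for H in H_runs], axis=0)

# Three hypothetical runs with k = 2 factors and 4 samples.
H_runs = [
    np.array([[0.9, 0.8, 0.1, 0.2],
              [0.1, 0.2, 0.9, 0.8]]),   # labels 0, 0, 1, 1
    np.array([[0.2, 0.1, 0.7, 0.9],
              [0.8, 0.9, 0.3, 0.1]]),   # same grouping, factors swapped
    np.array([[0.9, 0.4, 0.2, 0.1],
              [0.1, 0.6, 0.8, 0.9]]),   # labels 0, 1, 1, 1
]

C = consensus(H_runs)
print(C)
```

Note that the second run swaps the factor order but yields the same grouping, and the connectivity matrix is unaffected. This is why consensus clustering is robust to the arbitrary ordering of factors across random runs.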

**Graphical User Interface for the sample classification module**. This panel shows the reordered consensus matrix and cophenetic correlation coefficient computed for each rank (

bioNMF fully implements this methodology using the divergence-based update equations

It is important to mention that the cophenetic correlation coefficient has also been used to estimate the optimum factorization rank (value of

This methodology has an important computational efficiency drawback due to the fact that a large number of runs per factorization rank (

Conclusion

The non-negative matrix factorization method has gained wide popularity in the bioinformatics field due to its potential to provide new insights about the complex relationships in large data sets. Although this algorithm is conceptually simple, its use by the scientific community still demands a certain level of programming skill to fully exploit it. The bioNMF application aims to fill this gap by providing the research community with a tool containing the functionality needed either to run a simple exploratory analysis or to answer more complex analysis questions in an easy-to-use environment.

The current implementation of bioNMF includes basic functionality for running the original NMF algorithm, which can easily be used with any data set. To demonstrate the usefulness of this method, we described different types of analysis that have been proposed in different experimental contexts. These include applications for finding functional gene modules

More specific applications of NMF have also been included in the bioNMF tool. For example, gene expression biclustering, which has been incorporated in this application using a new sparse variant of NMF

There are still open problems, however, that require more detailed study. One is the choice among the available methods for making the data non-negative, in particular for the gene expression applications described in this work. In this application we have implemented four methods to cope with this problem. Nevertheless, we believe that no single method is best for all applications, and results depend strongly on the data and the problem. A full comparison of methods to transform gene expression data into positive data sets would be welcome and represents an interesting topic of research.

bioNMF will also be systematically updated to support new functionalities and applications that might help in the analysis of biological information using this methodology or some of its variants. In this way we expect that this tool will help researchers in this field use a method that is conceptually simple and powerful for data analysis.

Availability and requirements

**Project home page**:

**Source code availability**:

**Operating system**: Microsoft Windows (98, Me, 2000, or XP)

**Programming language**: Delphi Pascal v.7

**Other requirements**: 1024 × 768 resolution

**License**: GPL

**Any restrictions to use by non-academics**: none

Abbreviations

**NMF** – Non-negative Matrix Factorization

**PCA** – Principal Component Analysis

**SVD** – Singular Value Decomposition

**ICA** – Independent Component Analysis

**nsNMF** – Non-Smooth Non-negative Matrix Factorization

**AML** – Acute Myelogenous Leukemia

**ALL** – Acute Lymphoblastic Leukemia

**GUI** – Graphical User Interface

**GO** – Gene Ontology

Authors' contributions

APM, RDPM and PCS conceived the study. APM and RDPM designed and developed the software. PCS, JMC and MC developed the tests and documentation. FT developed the computational optimization of the method. APM, JMC and RDPM managed and coordinated the project. All authors participated in writing and revising the final manuscript.

Acknowledgements

This work has been partially funded by the Spanish grants CICYT BFU2004-00217/BMC, GEN2003-20235-c05-05, CYTED-505PI0058, TIN2005-5619, PR27/05-13964-BSCH and a collaborative grant between the Spanish CSIC and the Canadian NRC (CSIC-050402040003). PCS is the recipient of a grant from CAM. APM acknowledges the support of the Spanish Ramón y Cajal program.