Computational Biology Group, Department of Clinical Laboratory Sciences, Faculty of Health Sciences, University of Cape Town, South Africa

School of Mathematics, Statistics and Applied Mathematics, National University of Ireland Galway, Ireland

Abstract

Background

Nonnegative Matrix Factorization (NMF) is an unsupervised learning technique that has been applied successfully in several fields, including signal processing, face recognition and text mining. Recent applications of NMF in bioinformatics have demonstrated its ability to extract meaningful information from high-dimensional data such as gene expression microarrays. Developments in NMF theory and applications have resulted in a variety of algorithms and methods. However, most NMF implementations have been on commercial platforms, while those that are freely available typically require programming skills. This limits their use by the wider research community.

Results

Our objective is to provide the bioinformatics community with an open-source, easy-to-use and unified interface to standard NMF algorithms, as well as with a simple framework to help implement and test new NMF methods. For that purpose, we have developed a package for the R/BioConductor platform. The package ports public code to R, and is structured to enable users to easily modify and/or add algorithms. It includes a number of published NMF algorithms and initialization methods and facilitates the combination of these to produce new NMF strategies. Commonly used benchmark data and visualization methods are provided to help in the comparison and interpretation of the results.

Conclusions

The NMF package helps realize the potential of Nonnegative Matrix Factorization, especially in bioinformatics, providing easy access to methods that have already yielded new insights in many applications. Documentation, source code and sample data are available from CRAN.

Background

Non-negative Matrix Factorization

The factorization of matrices representing complex multidimensional datasets is the basis of several commonly applied techniques for pattern recognition and unsupervised clustering. Like principal component analysis (PCA) or independent component analysis (ICA), non-negative matrix factorization (NMF) aims to explain the observed data using a limited number of basis components which, when combined, approximate the original data as accurately as possible. The distinguishing features of NMF are that both the matrix of basis components and the matrix of mixture coefficients are constrained to have non-negative entries, and that no orthogonality or independence constraints are imposed on the basis components. This leads to a simple and intuitive interpretation of the factors in NMF, and allows the basis components to overlap.
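To make the idea of a constrained low-rank approximation concrete, the following is a minimal sketch in plain Python. It assumes the widely used multiplicative update rules for the Euclidean distance and is not the implementation shipped with the package; the function names (`nmf_multiplicative`, `squared_error`) and the small constant `eps` are illustrative choices only.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def nmf_multiplicative(x, rank, n_iter=200, seed=0, eps=1e-9):
    """Approximate a non-negative n x p matrix x by w (n x rank) times h (rank x p)."""
    rng = random.Random(seed)
    n, p = len(x), len(x[0])
    # Random strictly positive starting point (one of many possible seedings).
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(p)] for _ in range(rank)]
    for _ in range(n_iter):
        # Update the mixture coefficients: h <- h * (w^T x) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(p)]
             for i in range(rank)]
        # Update the basis components: w <- w * (x h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


def squared_error(x, w, h):
    """Sum of squared differences between x and the product w h."""
    approx = matmul(w, h)
    return sum((a - b) ** 2 for ra, rb in zip(x, approx) for a, b in zip(ra, rb))
```

Because the updates only multiply non-negative quantities, both factors stay non-negative throughout, which is what makes the resulting components directly interpretable.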

Applications and motivations

Since its formal definition in

The popularity of the NMF approach derives essentially from three properties that distinguish it from standard decomposition techniques.

Firstly, the matrix factors are by definition nonnegative, which allows their intuitive interpretation as real underlying components within the context defined by the original data. The basis components can be directly interpreted as parts or basis samples, present in different proportions in each observed sample. In the context of gene expression microarrays, Brunet et al.

Secondly, NMF generally produces sparse results, which means that the basis and/or mixture coefficients have only a few non-zero entries. This provides a more compact and local representation, emphasizing even more the parts-based decomposition of the data
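As an illustration of how sparseness can be quantified, the sketch below assumes a measure commonly attributed to Hoyer, which compares the L1 and L2 norms of a vector. The text above does not fix a particular definition, so the formula and the function name `sparseness` should be read as assumptions rather than as the measure used by any specific method.

```python
import math


def sparseness(values):
    """Assumed Hoyer-style sparseness: 1 for a single non-zero entry, 0 for equal entries."""
    n = len(values)
    l1 = sum(abs(v) for v in values)
    l2 = math.sqrt(sum(v * v for v in values))
    if n < 2 or l2 == 0:
        raise ValueError("sparseness needs at least two entries and a non-zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)
```

Under this definition a basis component in which a single gene carries all the weight scores 1, while a component spread evenly over all genes scores 0.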

Finally, unlike other decomposition methods such as SVD or ICA, NMF does not aim at finding components that are orthogonal or independent, but instead allows them to overlap. This unique feature is particularly interesting in the context of gene expression microarrays, where overlapping metagenes could identify genes that belong to multiple pathways or processes

Formal definition

In this section we provide a mathematical formulation of the general NMF approach. Let

where

Equation (1) states that each column of

The main approach to NMF estimates matrices

where

Existing implementations

Several algorithms to perform NMF have been published and implemented. See ^{® }files. Hoyer provided a package that implements five different algorithms ^{®}, a proprietary software, limits access to these packages within the wider bioinformatics community. Some C/C++ implementations are also available

To help realize the potential of NMF, especially in bioinformatics, we have implemented a free, open-source package that allows users to run existing NMF algorithms and to implement new ones.

Implementation

We implemented our package using the R/BioConductor platform

Implemented methods

Algorithms

Six published algorithms are implemented, either directly or by porting available code to R. We ported to R the standard NMF algorithms with multiplicative updates of

Seeding methods

The general NMF procedure is to run the algorithm with several random initializations for matrices

Stopping criteria

Although differing in the way the solution is updated at each iteration, some of the implemented algorithms share a common iterative schema, with common stopping criteria. The NMF package implements three standard criteria: fixed number of iterations, invariance of the consensus matrix

Flexibility and extensibility

While implementing all the possible NMF algorithms is beyond the scope of this work, one of the main objectives of our package is to provide a flexible framework for using and developing NMF algorithms in R. Our implementation is based on the

Results

To illustrate the capabilities of the NMF package, we provide an example of analysis on a real dataset. We used the Golub dataset as referenced in Brunet et al.

Running NMF algorithms

Particular care was taken to provide the user with a lean and intuitive programmatic interface. We organized our package around a

Comparing methods

A typical task in data analysis or algorithm development is to compare how different methods perform on a given data set. The package provides functionality to compare different NMF runs, based on a set of quality measures that have been proposed in the literature to evaluate NMF performance.

Standard measures for evaluating algorithms are the final error between the target matrix and its estimate, or the CPU time required to perform the factorization. Hoyer _{ij} are the entries of the target matrix
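The relationship between the residual error and the explained variance (the evar column reported for the compared methods) can be sketched as follows. The definition used here, one minus the residual sum of squares (RSS) divided by the sum of squared entries of the target matrix, is an assumption based on common usage, and the function name `explained_variance` is hypothetical.

```python
def explained_variance(target, estimate):
    """Assumed definition: 1 - RSS / (sum of squared entries of the target matrix)."""
    rss = sum((t - e) ** 2
              for t_row, e_row in zip(target, estimate)
              for t, e in zip(t_row, e_row))
    total = sum(t * t for t_row in target for t in t_row)
    if total == 0:
        raise ValueError("the target matrix must contain at least one non-zero entry")
    return 1.0 - rss / total
```

With this definition a perfect reconstruction gives 1, and an all-zero estimate gives 0.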

Pascual-Montano et al. studied the deterioration of the explained variance as a function of sparseness for different methods, to show that their method maintained a good fit over a wide range of achieved sparseness. Note that users should be cautious about using the explained variance as the basis for comparing the performance of different methods: it is closely related to the objective function of methods based on the Euclidean distance, but not to that of methods based on the Kullback-Leibler divergence, and would therefore a priori favor the former. On the other hand, the results from

Kim and Park

In Table

Comparison of NMF methods

| method | seed | metric | rank | evar | sparseness W/H | purity | entropy | niter | CPU time (seconds) |
|---|---|---|---|---|---|---|---|---|---|
| lee | nndsvd | euclidean | 3 | 0.75 | 0.65/0.75 | 0.89 | 0.25 | 690 | 11.24 |
| snmf/r | nndsvd | euclidean | 3 | 0.75 | 0.65/0.75 | 0.97 | 0.10 | 130 | 4.31 |
| brunet | nndsvd | KL | 3 | 0.73 | 0.64/0.80 | 0.95 | 0.16 | 1110 | 23.60 |
| nsNMF | nndsvd | KL | 3 | 0.70 | 0.73/0.74 | 0.87 | 0.29 | 450 | 10.37 |

Comparison of different NMF algorithms applied to the Golub dataset, using the non-negative double SVD seeding method (NNDSVD). The

Estimating the factorization rank

A critical parameter in NMF is the factorization rank

The most common approach is to use the cophenetic correlation coefficient. Brunet et al.

The NMF package implements the above mentioned procedures and provides functions to generate plots for the different quality measures. To illustrate this functionality, we reproduce Brunet et al.'s estimation of the optimal factorization rank. Figure

**Cophenetic correlation coefficient**. Each point on the graph was obtained from 50 runs of Brunet et al.'s algorithm

**Heatmap of the metagene expression profiles matrix**. The metagene expression profile matrix was obtained from the factorization that achieved the lowest approximation error across 200 random runs of Brunet et al.'s algorithm on the Golub dataset. Each column corresponds to a sample. The top colored row shows the phenotypic class to which each sample belongs. Columns were scaled to sum to one and ordered by cluster; the clusters are highlighted on the second row by colours that map them to their associated metagene.
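The column scaling and cluster assignment described in this caption can be sketched as follows. The sketch assumes that each sample is assigned to the metagene with the largest coefficient in its column, which is a common convention but is not spelled out in the caption; the function names are hypothetical.

```python
def scale_columns(h):
    """Scale each column of h (metagenes x samples) so that it sums to one."""
    sums = [sum(col) for col in zip(*h)]
    return [[v / s if s > 0 else 0.0 for v, s in zip(row, sums)] for row in h]


def dominant_metagene(h):
    """Assumed rule: assign each sample (column) to the metagene with the largest coefficient."""
    return [max(range(len(col)), key=lambda k: col[k]) for col in zip(*h)]


def order_by_cluster(h):
    """Return sample indices grouped by their assigned metagene."""
    clusters = dominant_metagene(h)
    return sorted(range(len(clusters)), key=lambda j: clusters[j])
```

Scaling the columns does not change which metagene dominates a sample, so the assignment can be computed before or after scaling.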

This approach does not always provide a clear and consistent cut-off for the choice of

Computational speed

Performing a single NMF run on large-scale data requires intensive computations. Moreover, a typical NMF analysis involves performing several runs for different values of the rank (~ 30-50 runs), before running the final factorization using the estimated rank (~ 200 runs). The whole procedure is therefore highly time-consuming.

Since R can call external compiled libraries, one way to speed up the computations is to implement optimized versions of the algorithms in C/C++ or Fortran. For instance, the NMF package implements optimized C++ versions of the multiplicative updates from

As an example, we provide here the computation times achieved when running 100 factorizations of the 5000 × 38 gene expression matrix from the Golub dataset, using Brunet et al.'s algorithm with

Visualizing results

R includes a wide range of powerful plotting utilities. However, producing interpretable plots often requires tuning several function arguments, which can act as a distraction from the main analysis task. To help in interpreting and evaluating the estimated factorization, our package implements a collection of functions pre-configured to visualize the results from NMF runs. Each visualization method provides insights about specific characteristics of the result or the method used.

Sparse parts-based representation

One of the main properties of NMF is its ability to produce metagenes or metagene expression profiles that have a sparse structure. This feature is exploited in practice to simultaneously define and characterize clusters of genes and samples

**Heatmap of the metagene matrix**. The metagene matrix was obtained from the same factorization used in Figure 2. Each row corresponds to a gene. The most metagene-specific genes were selected using Kim and Park's scoring and filtering method, which resulted in the selection of 635 genes. Rows were scaled to sum to one and ordered by hierarchical clustering based on the Euclidean distance and average linkage.

In Figure

The metagene matrix,

- s

- s

Cluster stability

In the context of sample clustering, the consensus matrix provides information about the stability of the clusters defined by the metagene expression profiles

**Consensus matrix**. The consensus matrix was obtained from 200 random runs of Brunet et al.'s algorithm on the Golub dataset. Values range from 0 to 1. Rows and columns were ordered by hierarchical clustering based on the Euclidean distance with average linkage.
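A consensus matrix of this kind is usually built by averaging, over all runs, a binary connectivity matrix that records which pairs of samples fall in the same cluster. The following sketch assumes that construction; the function names and the representation of each run as a list of cluster labels are illustrative.

```python
def connectivity(labels):
    """Binary matrix with 1 where two samples share a cluster label in one run."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


def consensus(runs):
    """Average the connectivity matrices of several runs (one label list per run)."""
    n = len(runs[0])
    total = [[0] * n for _ in range(n)]
    for labels in runs:
        conn = connectivity(labels)
        for i in range(n):
            for j in range(n):
                total[i][j] += conn[i][j]
    return [[value / len(runs) for value in row] for row in total]
```

Entries close to 0 or 1 indicate pairs of samples that are consistently separated or consistently grouped across runs, whereas intermediate values point to unstable assignments. Note that only co-membership matters, so the arbitrary numbering of clusters within each run has no effect.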

Convergence speed

Finally, when developing new algorithms or comparing results, the graph of the residual approximation error provides information about the convergence speed and efficiency of each method. The NMF package provides built-in functionality to track the objective value along the iterative optimization process. Figure

**Plot of the residual approximation error**. Each curve reports the trajectory of the approximation residuals, computed with the algorithm's loss function. Each track is normalized separately over its maximum value, and stops at the number of iterations required to achieve the convergence criterion.
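The per-track normalization mentioned in this caption amounts to dividing each residual trajectory by its own maximum, so that curves computed with different loss functions share a common scale. A minimal sketch, with a hypothetical function name, is:

```python
def normalize_track(residuals):
    """Divide a trajectory of residuals by its maximum so that it peaks at 1."""
    peak = max(residuals)
    if peak <= 0:
        raise ValueError("the track must contain at least one positive residual")
    return [r / peak for r in residuals]
```

After normalization only the shape of each curve is comparable across methods; the absolute error levels are not.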

Conclusions

Nonnegative Matrix Factorization has several advantages over classical approaches to extracting meaningful information from high-dimensional data. Its successful application in many fields, notably in bioinformatics, has resulted in the development of several algorithms and methodologies. However, the implementations available for these algorithms often depend on commercial software or require technical skills. We implemented the NMF package to provide free and simple access to standard methods to perform Nonnegative Matrix Factorization in R/BioConductor. The package also provides a flexible framework that allows the rapid development, testing and benchmarking of novel NMF algorithms.

Availability and requirements

**Operating system:** Any

**Dependencies:** R (≥ 2.10)

**Optionally:** BioConductor (≥ 2.5)

**Programming Language:** R, C++

**License:** GPL

**Web:**

List of abbreviations used

ALL: Acute Lymphoblastic Leukemia; AML: Acute Myeloid Leukemia; ICA: Independent Component Analysis; NMF: Nonnegative Matrix Factorization; nsNMF: Non-smooth NMF; OS: Operating System; PCA: Principal Component Analysis; RSS: Residual Sum of Squares; SVD: Singular Value Decomposition.

Authors' contributions

RG designed and implemented the software and drafted the manuscript. CS instigated the study, and participated in its design and coordination. Both authors read and approved the final manuscript.

Acknowledgements

**Funding** This work was funded by the South African National Bioinformatics Network. CS is funded through Science Foundation Ireland (Grant number 07/SK/M1211b).