Department of the Physics of Complex Systems, Eötvös Loránd University, Pázmány Péter sétány 1/A, 1117, Budapest, Hungary

Department of Mathematics, Stanford University, Stanford, CA 94305, USA

Department of Animal Hygiene, Herd Health and Veterinary Ethology, Szent István University, István utca, 1078, Budapest, Hungary

Abstract

Background

Many methods for dimensionality reduction of large data sets such as those generated in microarray studies boil down to the Singular Value Decomposition (SVD). Although singular vectors associated with the largest singular values have strong optimality properties and can often be quite useful as a tool to summarize the data, they are linear combinations of up to all of the data points, and thus it is typically quite hard to interpret those vectors in terms of the application domain from which the data are drawn. Recently, an alternative dimensionality reduction paradigm,

Results

We present an implementation of CUR matrix decompositions in the form of a freely available, open source R package called rCUR. This package will help users to perform CUR-based analysis on large-scale data, such as those obtained from different high-throughput technologies, in an interactive and exploratory manner. We show two examples that illustrate how CUR-based techniques make it possible to significantly reduce the number of probes while maintaining the major trends in the data and keeping the same classification accuracy.

Conclusions

The package rCUR provides functions for users to perform CUR-based matrix decompositions in the R environment. In gene expression studies, it offers an additional way to analyze differential expression and to select discriminant genes, based on statistical leverage scores. These scores, which have been used historically in diagnostic regression analysis to identify outliers, can be used by rCUR to identify the most informative data points, in terms of which the remaining data points are expressed.

Background

In many modern data analysis applications, the user is faced with data matrices with a huge number of columns and/or rows. Such matrices arise in disciplines ranging from astronomy through genomics and social sciences to zoology. As a specific example, let us consider gene expression microarray data. In a typical study, hundreds of thousands of probe expressions are measured for a large number of samples. This methodology has had a significant impact on gene expression research, but the publication of studies with dissimilar or contradictory results has raised concerns about the reliability of this technology, especially when all the individual values of gene expressions are requested. On the other hand, when the goal is more modest,

In such cases, it is common to employ one of several dimensionality reduction methods in order to identify low-dimensional features for use by a downstream analyst. Many popular methods,

To address these and other issues, Mahoney and Drineas

where

for all

The basic algorithm for choosing columns from a matrix—call it

1. First, compute ^{1},…,^{k} (the top

2. Second, keep the

3. Third, return the matrix
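The column selection steps above can be sketched in base R as follows. This is a minimal illustration, not the rCUR implementation: the function name `column_select` and the argument `n_cols` are hypothetical, and the leverage score of column j is assumed to be the squared Euclidean norm of the j-th row of the matrix of top k right singular vectors, divided by k, so that the scores sum to one.

```r
# Sketch of randomized column selection driven by leverage scores.
# column_select and n_cols are illustrative names, not part of rCUR.
column_select <- function(A, k, n_cols) {
  # Top k right singular vectors of A, stored as the columns of V.
  V <- svd(A, nu = 0, nv = k)$v
  # Normalized leverage score of each column of A (scores sum to 1).
  leverage <- rowSums(V^2) / k
  # Keep column j independently with probability min(1, n_cols * leverage[j]).
  keep <- runif(ncol(A)) < pmin(1, n_cols * leverage)
  list(C = A[, keep, drop = FALSE], index = which(keep), leverage = leverage)
}

# Synthetic data, used only to exercise the function.
set.seed(1)
A <- matrix(rnorm(200), nrow = 20, ncol = 10)
sel <- column_select(A, k = 3, n_cols = 5)
```

Because each column is kept independently, the number of selected columns varies between runs; `n_cols` is only an upper bound on its expected value.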

In some applications, this restricted CUR decomposition, _{C} ^{+} ^{+} denotes a Moore-Penrose generalized inverse of the matrix ^{a}
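As a small numerical illustration of a column-only decomposition, the sketch below projects a matrix onto the span of a few of its own columns using a Moore-Penrose generalized inverse. The matrix, the chosen column indices, and the variable names are made up for the example, and the form `C %*% ginv(C) %*% A` is an assumption about how the restricted approximation is written; `ginv` comes from the MASS package.

```r
library(MASS)  # ginv(): Moore-Penrose generalized inverse

# Synthetic 10 x 6 matrix and an arbitrary choice of three columns.
set.seed(2)
A <- matrix(rnorm(60), nrow = 10, ncol = 6)
cols <- c(1, 3, 4)
C <- A[, cols, drop = FALSE]

# Project A onto the column space of C.
A_hat <- C %*% ginv(C) %*% A

# Relative approximation error in the Frobenius norm.
rel_err <- norm(A - A_hat, "F") / norm(A, "F")
```

The selected columns are reproduced exactly by the projection, while the remaining columns are approximated by linear combinations of the selected ones.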

In other applications, one wants a CUR matrix decomposition in terms of columns and rows simultaneously. The basic algorithm for this performs the following.

1. Run

2. Run ^{T} with ^{T}) and construct the matrix

3. Define the matrix ^{+} ^{+}.
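A minimal sketch of the final step, assuming the middle matrix is formed from generalized inverses of the selected columns and rows (written here as `ginv(C) %*% A %*% ginv(R)`). The low-rank test matrix and the indices are invented for illustration, and the columns and rows are picked by hand rather than sampled.

```r
library(MASS)  # ginv(): Moore-Penrose generalized inverse

set.seed(3)
# Illustrative matrix of exact rank 2 (10 rows, 8 columns).
A <- matrix(rnorm(20), 10, 2) %*% matrix(rnorm(16), 2, 8)

C <- A[, c(2, 5, 7), drop = FALSE]     # three actual columns of A
R <- A[c(1, 4, 6, 9), , drop = FALSE]  # four actual rows of A

# Assumed form of the middle matrix, built from generalized inverses.
U <- ginv(C) %*% A %*% ginv(R)
A_cur <- C %*% U %*% R

# Largest entrywise deviation between A and its CUR reconstruction.
max_abs_err <- max(abs(A - A_cur))
```

In this constructed case the selected columns and rows span the column and row spaces of the rank 2 matrix, so the reconstruction matches it up to rounding error. For real data of higher rank, the product is only an approximation.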

Thus, in contrast to PCA and the SVD, where the low-dimensional basis consists of singular vectors that are linear combinations of all the data vectors, here the matrices

In this paper, we describe the rCUR package, which is a freely available, open source R implementation of the CUR matrix decomposition method. We will summarize functionality and features of the package that allow the user to obtain the statistical leverage scores and the matrices

Finally, it should be emphasized that this CUR approach is very different from the classical statistical perspective, where statistical leverage scores have been used in diagnostic regression analysis to identify outliers and errors

Implementation

The rCUR package was developed to allow users to easily perform CUR matrix decompositions. For this purpose, an easy-to-use primary function, called CUR, was implemented. The input of the function CUR is a two-dimensional matrix with column and row names. If any column or row names are missing, the indices of that dimension are assigned automatically. From the matrix

Several other column selection methods are also implemented in rCUR. These can be selected by the parameter "method".

**random** the original method described in

**exact.num.random** like the default method, but it is guaranteed that exactly as many rows and columns are selected as requested. (With the random method, the requested number is only the expected value.)

**top.scores** the rows and columns with the highest leverage scores are returned deterministically.

**ortho.top.scores** columns and rows are selected iteratively, based on a factor that combines the leverage score with the orthogonality of the next vector to the already selected subspace.

**highest.ranks** rows and columns with the highest rank of leverage score for some rank parameter are selected. Every possible value is tried up to the value of

These methods are considered experimental, and they provide roughly the same precision as the default method. For certain problems with highly correlated columns/rows, one method (ortho.top.scores) seems to be very promising: it avoids selecting multiple similar columns/rows that contain no new information, and hence the necessary number of columns/rows can be reduced.
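For comparison with the randomized default, deterministic selection by top scores can be sketched as below. This is an illustration in base R rather than the package code; `top_score_columns` and its arguments are hypothetical names, and the leverage scores are assumed to be computed from the top k right singular vectors.

```r
# Return the indices of the n_cols columns with the largest leverage scores.
# top_score_columns is an illustrative helper, not an rCUR function.
top_score_columns <- function(A, k, n_cols) {
  V <- svd(A, nu = 0, nv = k)$v
  leverage <- rowSums(V^2) / k
  head(order(leverage, decreasing = TRUE), n_cols)
}

# Synthetic data; two calls on the same matrix give the same indices.
set.seed(5)
A <- matrix(rnorm(300), nrow = 30, ncol = 10)
idx1 <- top_score_columns(A, k = 2, n_cols = 4)
idx2 <- top_score_columns(A, k = 2, n_cols = 4)
```

Unlike random sampling, this variant always returns exactly the requested number of columns, at the cost of possibly picking several nearly identical columns when they all score highly.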

To extract the matrices

To improve efficiency, the computation of components that are not used can be switched off. In particular, if the restricted CUR decomposition is required, the parameter

In addition, the function plot.leverage plots the statistical leverage scores themselves directly from CURobj, highlighting the largest values and indicating the uniform level.

For users who would like to test the functionalities of the package on published, real world data sets we incorporated the data used by paper

Results and discussion

We illustrate the benefits of CUR matrix decompositions and dimension reduction with the rCUR package by comparing it with two different previously published case studies. In the first, we show that feature selection based on leverage scores can differentiate classes with a performance similar to that of the entire gene set of a microarray. In the second, we show that CUR performs well not only in the separation of classes, but also yields comparable results in trend analysis with a fraction of the full feature set. We provide all the code that is necessary to reproduce these results as

**rCUR package:** The R package rCUR (version 1.1) with functions for CUR decomposition.

**rCUR_Case_Studies.R:** R-script file containing all the sources necessary to reproduce the results presented in the paper.

Case study 1: soft tissue tumor discrimination

Here our goal is to check if it is possible to separate groups with genes filtered by CUR and obtain a performance similar to that with the total gene set. In this example we use a soft tissue tumor dataset, which is incorporated in the package as mentioned above (STTm, STTa). By using the rCUR package, we repeated the analysis that was performed in the paper publishing the CUR method

**Feature selection using leverage score.** Normalized leverage scores (grey bars) are presented for each of the 5520 genes in the dataset, ordered by row number. The 27 highest leverage scores are marked with black dots.

**PCA plots are presented using the first two principal components.** The top plot shows PC1 and PC2 using all genes; the bottom plot shows the results based on the 27 selected features. Genes filtered by leverage scores give discriminative performance similar to that of the whole dataset.
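The kind of comparison shown in these plots can be sketched on synthetic data as follows. Everything here is an assumption made for illustration: the data are random numbers with an artificial class signal (not the soft tissue tumor data), genes are taken to be rows and samples columns, gene leverage scores are computed from the top k left singular vectors, and the counts (200 genes, 10 selected features) are arbitrary.

```r
set.seed(4)
n_genes <- 200
n_samples <- 12

# Synthetic expression matrix with genes in rows and samples in columns.
A <- matrix(rnorm(n_genes * n_samples), n_genes, n_samples)
# Artificial class signal: the first 10 genes are shifted in samples 7 to 12.
A[1:10, 7:12] <- A[1:10, 7:12] + 5

# Normalized leverage score of each gene from the top k left singular vectors.
k <- 2
U <- svd(A, nu = k, nv = 0)$u
leverage <- rowSums(U^2) / k

# Keep the 10 genes with the highest leverage scores.
top <- head(order(leverage, decreasing = TRUE), 10)

# PCA of the samples using all genes and using only the selected genes.
pca_all <- prcomp(t(A))
pca_top <- prcomp(t(A[top, , drop = FALSE]))
```

Plotting the first two columns of `pca_all$x` and `pca_top$x`, colored by class, gives a visual comparison of the full and the reduced feature sets.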

According to the biplots, one can conclude visually that, using CUR as a feature selection method, we can discriminate the classes with far fewer variables (0.5

Case study 2: discrimination and trends

One of the major problems of microarray studies is that the individual probe values are not always well correlated with the expression of the corresponding gene. On the other hand, it has been shown

**Trends from principal components.** Based on 5,372 human microarray data Lukk et al.

To make the quality of the classification more quantitative, we apply the following metric to measure the separation of the classes. For every pair of groups, we calculated the Euclidean distance between every pair of points drawn from the two groups, and we measure the separation of the two groups by the median of these distances. These medians were then summed over all group pairs to give a total separation measure. In Figure
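The separation measure described above can be sketched in R as follows. The function name `separation`, its arguments, and the toy coordinates are hypothetical; the rows of `X` are assumed to be samples in whatever coordinates the distances are measured in (for example, principal component scores), and distances are taken between points of different groups.

```r
# Sum, over all pairs of groups, of the median between-group Euclidean distance.
# separation is an illustrative helper, not part of rCUR.
separation <- function(X, groups) {
  D <- as.matrix(dist(X))  # pairwise Euclidean distances between rows of X
  labels <- unique(groups)
  group_pairs <- combn(labels, 2, simplify = FALSE)
  medians <- vapply(group_pairs, function(p) {
    median(D[groups == p[1], groups == p[2]])
  }, numeric(1))
  sum(medians)
}

# Toy example: three groups of two identical points each.
X <- rbind(c(0, 0), c(0, 0), c(3, 4), c(3, 4), c(6, 8), c(6, 8))
groups <- rep(c("a", "b", "c"), each = 2)
total <- separation(X, groups)  # medians are 5, 10 and 5
```

Using the median rather than the mean keeps a few outlying samples from dominating the measure for a group pair.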

**Measure of separation against the reduced number of genes at different ****values.** The sum of median distance measure as a function of genes and

Conclusions

The package rCUR provides functions for users to perform CUR matrix decompositions in the R environment. In gene expression studies, it may offer an additional way to analyze differential expression and to select discriminant genes, based on statistical leverage scores. The approach proposed

Availability and requirements

**Project name:** rCUR

**Project home page:**

**Operating system(s):** Platform independent

**Programming language:** R

**Other requirements:** packages MASS, methods, Matrix

**License:** GNU GPL

**Any restrictions to use by non-academics:** none

End notes

^{a}If

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

AB and NS implemented the functions of rCUR. NS constructed the package, examples and performed all the analysis. ICS, MM and NS wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgments

We want to thank Gábor Tusnády, who suggested the use of the method for our work. The function CUR in the package uses the function ginv from package

This work was supported by the National Office for Research and Technology, Hungary (NKTH TECH08:3dhist08 grant) and the project TÁMOP 4.2.1.B-11/2/KMR-2011-0003.