Department of Genome Sciences, University of Washington, Seattle, Washington, USA

Department of Computer Science, University of British Columbia, Vancouver, BC, Canada

Faculty of Information Technology, Monash University, VIC, Australia

Medical Genetics, University of British Columbia, Vancouver, BC, Canada

Abstract

One challenge in applying bioinformatic tools to clinical or biological data is the high number of features that might be provided to the learning algorithm without any prior knowledge about which ones should be used. In such applications, the number of features can drastically exceed the number of training instances, which is often limited by the number of samples available for the study. The Lasso is one of many regularization methods that have been developed to prevent overfitting and improve prediction performance in high-dimensional settings. In this paper, we propose a novel algorithm for feature selection based on the Lasso; our hypothesis is that defining a scoring scheme that measures the "quality" of each feature can provide a more robust feature selection method. Our approach is to generate several samples from the training data by bootstrapping, determine the best relevance-ordering of the features for each sample, and finally combine these relevance-orderings to select highly relevant features. In addition to a theoretical analysis of our feature scoring scheme, we provide empirical evaluations on six real datasets from different fields that confirm the superiority of our method in exploratory data analysis and prediction performance. For example, we applied FeaLect, our feature scoring algorithm, to a lymphoma dataset, and according to a human expert, our method selected more meaningful features than those commonly used in the clinic. This case study builds a basis for discovering interesting new criteria for lymphoma diagnosis. Furthermore, to facilitate the use of our algorithm in other applications, the source code that implements our algorithm has been released as FeaLect, a documented R package in CRAN.

Introduction

To build a robust classifier, the number of training instances is usually required to exceed the number of features. In many real-life applications such as bioinformatics, natural language processing, and computer vision, a high number of features might be provided to the learning algorithm without any prior knowledge about which ones should be used. The number of features can therefore drastically exceed the number of training instances, and the model is prone to overfitting the training data. Many regularization methods have been developed to prevent overfitting and to improve the generalization error bound of the predictor in this learning situation.

Most notable is the Lasso, an ℓ1-regularization technique for linear regression, which has attracted much attention in machine learning and statistics. The same approach is useful in classification because any binary classification problem can be reduced to a regression problem by treating the class labels as real numbers and taking the sign of the model's prediction as the predicted class label. The features selected by the Lasso depend on the regularization parameter, and the set of solutions for all values of this free parameter is provided by the regularization path.

In this paper, we propose a novel algorithm for feature selection based on the Lasso; our hypothesis is that defining a scoring scheme that measures the "quality" of each feature can provide a more robust feature selection method. Our approach is to generate several samples from the training data by bootstrapping, determine the best relevance-ordering of the features for each sample, and finally combine these relevance-orderings to select highly relevant features. In addition to the theoretical analysis of our feature scoring scheme, we provide empirical evaluations using a real-life lymphoma dataset as well as several UCI datasets, which confirm the superiority of our method in exploratory data analysis and prediction performance.

Background and previous work

Lasso is an ℓ1-regularization technique for least-squares linear regression:

$$\hat{\beta} \;=\; \operatorname*{arg\,min}_{\beta \in \mathbb{R}^{d}} \; \frac{1}{2} \sum_{i=1}^{n} \left( y_i - \beta^{\top} x_i \right)^{2} \;+\; \lambda \, \lVert \beta \rVert_{1} \qquad (1)$$

where the response random variable Y ∈ ℝ and the covariate vector X ∈ ℝ^d are drawn from a joint distribution P_{XY}, and λ ≥ 0 is the regularization parameter. The ℓ1-regularization term shrinks many components of the solution to zero, and thus performs feature selection.
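To illustrate how the ℓ1 penalty performs feature selection, here is a minimal sketch (illustrative only, not the paper's implementation, which uses Lars in R): a cyclic coordinate-descent solver for the Lasso objective applied to synthetic data in which only the first three features carry signal.

```python
# Minimal coordinate-descent Lasso sketch on synthetic data, showing that
# the l1 penalty drives most irrelevant coefficients exactly to zero.
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of the l1 norm."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for 0.5*||y - X b||^2 + lam*||b||_1."""
    n, d = X.shape
    b = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)               # per-coordinate curvature
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ b + X[:, j] * b[j]      # residual excluding feature j
            b[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -1.5, 1.0]                # only 3 relevant features
y = X @ beta_true + 0.1 * rng.normal(size=50)
b = lasso_cd(X, y, lam=10.0)
print("non-zero coefficients:", (b != 0).sum())
```

With a sufficiently large λ, the recovered coefficient vector is sparse: the relevant coordinates keep their signs while most irrelevant ones are exactly zero.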

A common practice is to find the best value for λ by cross-validation to maximize the prediction accuracy. Having found the best value for the regularization parameter, the features are selected based on the non-zero components of the global and unique minimizer of the training objective in equation (1). However, recent research on the model-selection consistency of the Lasso suggests that this practice does not reliably recover the set of truly relevant features.

Various decaying schemes for the regularization parameter have been studied.

**Proposition 1**. Suppose the joint distribution P_{XY} satisfies some mild assumptions and let the regularization parameter decay as λ = λ_0 n^{-1/2}. Let **J** be the set of features selected by Bolasso from m bootstrap replicates of n training instances, and **J**_true the set of truly relevant features. Then

$$\Pr\left( \mathbf{J} \neq \mathbf{J}_{\mathrm{true}} \right) \;\le\; m A_{1} e^{-A_{2} n} \;+\; A_{3} \frac{\log n}{\sqrt{n}} \;+\; A_{4} \frac{\log m}{m},$$

where the A_i's are positive constants.

Now, if we send both n and m to infinity, the right-hand side vanishes and **J** contains, with probability tending to one, exactly the relevant features. Proposition 1 guarantees the performance of Bolasso only asymptotically, i.e. when the number of training instances grows without bound.
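Bolasso's combination rule is simple to sketch in isolation. In the illustration below (a hypothetical interface, not the paper's R code), `supports` stands for the Lasso support sets computed on each bootstrap replicate; plain Bolasso intersects them, while the softer Bolasso-S keeps features selected in at least a given fraction of replicates.

```python
# Sketch of Bolasso's combination rule over per-replicate Lasso supports.
from collections import Counter

def bolasso_select(supports, freq=1.0):
    """Keep features selected in at least `freq` fraction of replicates.
    freq=1.0 is plain Bolasso (intersection); freq < 1 is Bolasso-S."""
    counts = Counter(f for s in supports for f in set(s))
    m = len(supports)
    return {f for f, c in counts.items() if c >= freq * m}

supports = [{0, 1, 2, 7}, {0, 1, 2}, {0, 1, 2, 5}, {0, 1, 3}]
print(sorted(bolasso_select(supports)))             # intersection -> [0, 1]
print(sorted(bolasso_select(supports, freq=0.75)))  # softer threshold -> [0, 1, 2]
```

The example shows why the hard intersection can be brittle: feature 2 is dropped because it misses a single replicate, while the 75% threshold retains it.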

Previous studies have shown that there is room for improving Bolasso.

Our contributions

In this paper, we develop the FeaLect algorithm, which is softer than Bolasso in the following three directions:

• For each bootstrap sample, Bolasso considers only one model, the minimizer of the training objective in equation (1); in contrast, we consider all models on the regularization path of each sample,

• Instead of making a binary decision of inclusion or exclusion, we compute a score value for each feature that can help the user to select the more relevant ones,

• While Bolasso-S relies on a threshold, our theoretical study of the behaviour of irrelevant features leads to an analytical criterion for feature selection without using any pre-defined parameter.

We compared the performance of Bolasso, FeaLect, and Lars algorithms for feature selection on six real datasets in a systematic manner. The source code that implements our algorithm was released as FeaLect, a documented R package in CRAN.

Feature scoring and mathematical analysis

In this section, we describe our novel algorithm that scores the features based on their performance on samples obtained by bootstrapping. Afterwards, we present the mathematical analysis of our algorithm, which builds the theoretical basis for its proposed automatic thresholding in feature selection.

The FeaLect algorithm

Our feature selection algorithm is outlined in Figure

Overview of bootstrapping performed by FeaLect

**Overview of bootstrapping performed by FeaLect**. A row and a column of the gray data matrix correspond to a feature and a case, respectively. 1000 models are trained, each fitted to a random subset that contains

For each feature

The above randomized procedure is repeated several times for various random subsets

Total feature scores in the log-scale

**Total feature scores in the log-scale**. The middle part of the curves is linear and represents the scores of the irrelevant features (see the analysis section). The scores in diagrams (a) and (b) are computed from 1000 and 5000 samples, respectively. The low variance between the diagrams indicates fast convergence and stability of the score definition. Data are from the lymphoma dataset.

**Algorithm 1 **Feature Scoring

**1: for **t = 1 to T **do**

**2:** Sample (without replacement) a random subset X_t of the training instances

**3:** Run Lars on X_t

**4:** Compute the models along the regularization path

**5:** **for **each model on the path **do**

**6:** Update the feature scores for all features selected by that model

**7:** **end for**

**8: end for**

**9:** Fit a 3-segment spline (s_{1}(.), s_{2}(.), s_{3}(.)) on the log-scale feature score curve (see the text for more information)

**10: return **features corresponding to s_{3} as informative features
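The scoring loop of Algorithm 1 can be sketched as follows. This is illustrative only: `rank_features` is a stand-in for running Lars on each bootstrap subset, and crediting the top-ranked features of each subset is a simplification of the package's actual scoring scheme.

```python
# Structural sketch of the FeaLect scoring loop (Algorithm 1, simplified).
import numpy as np

def rank_features(X, y):
    """Stand-in for the Lars path: order features by |correlation| with y."""
    corr = np.abs(X.T @ (y - y.mean()))
    return np.argsort(-corr)                    # most relevant first

def feature_scores(X, y, n_boot=200, frac=0.75, top_k=5, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(d)
    for _ in range(n_boot):
        # sample a random subset without replacement (one bootstrap round)
        idx = rng.choice(n, size=int(frac * n), replace=False)
        order = rank_features(X[idx], y[idx])
        scores[order[:top_k]] += 1.0            # credit the top-ranked features
    return scores

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + 0.2 * rng.normal(size=60)
s = feature_scores(X, y)
print("highest-scoring features:", np.argsort(-s)[:2])
```

Features that are consistently ranked highly across bootstrap rounds accumulate large total scores, while irrelevant features are credited only sporadically.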

Before describing the rest of the algorithm, let us have a look at the feature scores for our lymphoma classification problem (the task and dataset are described in detail in the Experiment section). Figure

The final step of our feature selection algorithm is to fit a 3-segment spline model to the feature score curve: the first quadratic lower part captures the low-score features, the linear middle part captures the irrelevant features, and the last quadratic upper part captures the high-score informative features. As discussed below, the middle linear part provides an analytic threshold for the score of relevant features: the features with score above this threshold are reported as informative features, which can be used for training the final predictor and/or exploratory data analysis.
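The thresholding step can be sketched by brute-force segmentation of the sorted log-score curve into quadratic, linear, and quadratic pieces (an illustrative stand-in for the package's spline fit; the breakpoint search and polynomial degrees are assumptions).

```python
# Sketch: split a sorted log-score curve into three polynomial segments
# (quadratic / linear / quadratic) by brute-force breakpoint search.
import numpy as np

def segment_sse(x, y, deg):
    """Sum of squared errors of a degree-`deg` polynomial fit."""
    coeffs = np.polyfit(x, y, deg)
    return ((np.polyval(coeffs, x) - y) ** 2).sum()

def split_curve(log_scores):
    y = np.sort(log_scores)
    x = np.arange(len(y), dtype=float)
    best = None
    for i in range(3, len(y) - 6):              # first breakpoint
        for j in range(i + 3, len(y) - 3):      # second breakpoint
            sse = (segment_sse(x[:i], y[:i], 2)
                   + segment_sse(x[i:j], y[i:j], 1)   # linear middle part
                   + segment_sse(x[j:], y[j:], 2))
            if best is None or sse < best[0]:
                best = (sse, i, j)
    return best[1], best[2]   # indices bounding the linear (irrelevant) part

# Synthetic score curve: quadratic low tail, linear middle, quadratic head
tail = -0.05 * np.arange(10, 0, -1) ** 2 - 2    # low-score features
middle = np.linspace(-2, 2, 30)                 # irrelevant features
head = 0.2 * np.arange(1, 9) ** 2 + 2.2         # informative features
i, j = split_curve(np.concatenate([tail, middle, head]))
print("linear segment spans indices", i, "to", j)
```

Features above the second breakpoint (the start of the upper quadratic segment) would be reported as informative.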

The analysis

The aim of this analysis is to provide a mathematical explanation for the linearity of the middle part of the scoring function (Figure

**Proposition 2**. Let **f** = f_i be the score assigned to feature i by the Lasso in some stage of our feature selection method in Algorithm 1. Then, the probability distribution of the random variable **f**

Since we have not imposed any prior assumption, we put a uniform distribution on

The following definition formalizes the idea that irrelevant features depend only on a specific subset of the whole data set.

**Definition 3**. For a feature f_i, we say that f_i over-fits on U if:

In words, f_i

**Lemma 4**.

The first line of the above proof relies on the assumption that the members of the random set

The following theorem concludes our argument for the exponential behaviour of the total score of irrelevant features. It relates the probability of selecting a feature f_i

**Theorem 5**. If f_i over-fits on a set of samples U with size r, then:

The last equation was proved in Lemma 4, and the preceding one follows from Definition 3. □
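The exponential decay in question can be made concrete by elementary counting (a sketch under the stated sampling scheme, not a reproduction of the paper's lemma): if each bootstrap round draws $m$ of the $n$ training instances uniformly without replacement, the probability that a fixed set $U$ of $r$ instances is entirely contained in the drawn subset $S$ is

$$\Pr[U \subseteq S] \;=\; \frac{\binom{n-r}{m-r}}{\binom{n}{m}} \;=\; \prod_{j=0}^{r-1} \frac{m-j}{n-j} \;\le\; \left(\frac{m}{n}\right)^{r}.$$

Hence a feature that is selected only when all of $U$ is present is chosen with probability at most $(m/n)^r$, which is exponentially small in $r$; summed over the bootstrap rounds, the expected score of such a feature is linear in $r$ on a log scale, matching the linear middle segment of the score curve.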

Although we presented the above arguments for the Lasso, they should also apply to any other feature selection algorithm that exhibits linearity in its feature score curve. That is, the features corresponding to the linear part of the scoring curve are indeed the irrelevant features for that algorithm, and therefore, the features on the non-linear upper part should be considered informative. Obviously, the features on the non-linear lower part are not interesting for any prediction task because their scores are even lower than those of the irrelevant features. We speculate that these features do not exhibit a linear behaviour because not only are they not relevant to the outcome, but they are also not associated with any particular set

Experiment with real data

We applied FeaLect to several datasets to test the performance of our feature selection algorithm under real-life conditions.

Lymphoma

Lymphoma is a cancer that begins in the lymphatic cells of the immune system and presents as a solid tumor of lymphoid cells

Data preparation and feature extraction

The blood sample of each patient was divided into 7 portions, and each portion was examined in a different tube by the cytometer. Each tube yields five-dimensional data on 20,000-70,000 blood cells. In the first analysis step, we used a spectral clustering approach to cluster the cells in each tube into cell populations. It was not possible to directly apply classical spectral clustering

SamSPECTRAL performs a specific sampling stage called faithful sampling:

1. Set all points to be unregistered and fix a neighbourhood radius h.

2. Pick a random unregistered point and register all unregistered points within distance h of it.

3. Put all of these points in a set called a community.

4. Repeat the above two steps until no unregistered points are left.

After the above steps, the similarity between two communities is defined by summing up the similarities between their members, and the resulting similarity matrix is passed to a classical spectral clustering algorithm. Because this matrix is much smaller than the original similarity matrix (3000-by-3000 instead of 20,000-by-20,000 in our experiments), its eigenvectors can be computed efficiently.
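The faithful sampling stage can be sketched on one-dimensional toy data as follows (illustrative only; the real method operates on multidimensional flow cytometry events, and the neighbourhood radius `h` is an assumed parameter).

```python
# Sketch of faithful sampling: repeatedly pick a random unregistered point
# and group its h-neighbourhood into a community.
import numpy as np

def faithful_sampling(points, h, seed=0):
    rng = np.random.default_rng(seed)
    unregistered = set(range(len(points)))
    communities = []
    while unregistered:
        p = rng.choice(sorted(unregistered))        # random unregistered point
        near = {q for q in unregistered
                if abs(points[q] - points[p]) <= h} # its h-neighbourhood
        communities.append(near)                    # one community
        unregistered -= near                        # register its members
    return communities

# Toy data: two well-separated 1-D "cell populations"
points = np.concatenate([np.random.default_rng(1).normal(0, 0.1, 500),
                         np.random.default_rng(2).normal(5, 0.1, 500)])
comms = faithful_sampling(points, h=0.5)
print(len(comms), "communities instead of", len(points), "points")
```

The communities returned here play the role of the reduced nodes whose pairwise similarities (sums of member similarities) are passed to classical spectral clustering.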

Each cluster computed by SamSPECTRAL was regarded as a "cell population" that could potentially have information about the lymphoma type. Without imposing any

Feature selection and classification

Since the number of features was considerably larger than the number of training samples, we first applied the Lasso ℓ1-regularization technique; however, it was not by itself enough to prevent overfitting. Reducing the regularization parameter did not improve the results, as we observed that some of the features that were known to be biologically and clinically interesting were ignored. We also applied Bolasso

Next, we applied our feature selection algorithm. In our experiment, we set

To select the informative features, we fitted a 3-segment spline model to each curve. The features corresponding to the middle linear segment were considered irrelevant and were ignored for the rest of the analysis. Features with scores higher than those of these irrelevant features were selected as informative. We observed that, unlike the pure Lasso, our approach selected all features that were known to be biologically and clinically interesting. Prediction accuracy was also improved, confirming the efficiency of our feature selection method. We used our selected features to build a linear classifier with precision, recall and F-measure of 98%, 94% and 96%, respectively, while the best results we obtained with the pure Lasso were 93%, 82% and 87%, respectively.
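The reported F-measures follow from the precision and recall via F = 2PR/(P + R); a quick check:

```python
# Harmonic mean of precision and recall (F-measure / F1 score).
def f_measure(p, r):
    return 2 * p * r / (p + r)

print(round(f_measure(0.98, 0.94), 2))  # FeaLect-selected features -> 0.96
print(round(f_measure(0.93, 0.82), 2))  # pure Lasso -> 0.87
```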

For further evaluation in an exploratory data analysis setting, we interrogated the

Additional real datasets

In addition to our lymphoma flow cytometry data, we validated the performance of FeaLect on five other datasets including the well-known colon gene expression (Table

Comparison of the area under the ROC curve (AUC) between FeaLect, lars, and Bolasso on six different datasets. Columns marked (20) and (40) report AUC with 20 and 40 training samples, respectively.

| **Dataset** | **Total samples** | **# of features** | **Bolasso (20)** | **lars (20)** | **FeaLect (20)** | **Bolasso (40)** | **lars (40)** | **FeaLect (40)** | **Reference** |
|---|---|---|---|---|---|---|---|---|---|
| Lymphoma | 258 | 505 | 0.62 | 0.81 | 0.84 | 0.67 | 0.87 | 0.88 | current |
| Colon | 62 | 2000 | 0.50 | 0.57 | 0.65 | 0.47 | 0.64 | 0.75 | |
| Arcene | 100 | 10000 | 0.51 | 0.59 | 0.64 | 0.50 | 0.66 | 0.72 | |
| SECOM | 208 | 590 | 0.51 | 0.57 | 0.61 | 0.52 | 0.61 | 0.64 | |
| Connectionist | 208 | 60 | 0.63 | 0.76 | 0.78 | 0.67 | 0.78 | 0.79 | |
| ISOLET | 479 | 617 | 0.90 | 0.99 | 1.00 | 0.91 | 1.00 | 1.00 | |

Variation of the area under the ROC curve when different numbers of features are used

**Variation of the area under the ROC curve when different numbers of features are used**. The features are sorted by applying FeaLect to 20 random training samples. Then, the training samples and the highly scored features are used to build linear classifiers by lars. The best AUC is reported by testing on a set of validation samples disjoint from the training set. For both the lymphoma and colon datasets, the performance of the optimum classifier decreases if all features are provided to lars. This observation demonstrates, in practice, the advantage of using a limited number of highly scored features over pure lars.
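The AUC values reported in these experiments can be computed from classifier scores as the Mann-Whitney statistic, i.e. the probability that a randomly chosen positive case outscores a randomly chosen negative one. A minimal sketch:

```python
# AUC as the Mann-Whitney statistic over all positive/negative pairs.
def auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0]))  # perfect ranking -> 1.0
print(auc([0.3, 0.4, 0.8, 0.1], [1, 0, 1, 0]))  # one inversion -> 0.75
```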

Comparing ROC curves between FeaLect and lars

**Comparing ROC curves between FeaLect and lars**. The blue curve represents the ROC curve of the best Lasso model trained on 20 random samples using all available features, and the red curve shows the performance of the best Lasso model when only the top 61 and 36 features are provided from the colon and lymphoma datasets, respectively. While FeaLect always performs better than pure lars, the difference is more pronounced for the colon dataset than for the lymphoma dataset.

Improvements in the area under the ROC curves by increasing the number of training samples

**Improvements in the area under the ROC curves by increasing the number of training samples**. Except for Bolasso on the colon dataset, the average performance increases as more training samples are provided. While FeaLect and lars converge to a common asymptotic performance on the lymphoma dataset, FeaLect is consistently superior to pure lars on the colon dataset because the number of training samples is very limited. Table 1 shows a similar superiority for the other datasets with relatively few instances.

Conclusion

We have presented FeaLect, a novel feature selection algorithm based on the Lasso (Figure

Furthermore, we provided empirical and quantitative evaluations on five other real-world datasets (from different fields) to confirm the superiority of our method in prediction performance compared to the baselines.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

AG and RB supervised the project and motivated the study by providing scientific insight. HZ developed the idea of scoring features and performed the experiments. HZ and GH designed the mathematical analysis. RB provided data and computing facilities. All authors read, edited and approved the final manuscript.

Declarations

The research and publication costs for this article were funded by NIH grants 1R01EB008400 and 1R01EB005034, the Michael Smith Foundation for Health Research, the National Science and Engineering Research Council and the MITACS Network of Centres of Excellence.

This article has been published as part of

Acknowledgements

The authors would like to thank Andrew Weng and Randy Gascoyne for providing data and valuable clinical insight, and Nima Aghaeepour for his scientific comments.