College of Life Science and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China

Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China

Shanghai Center for Bioinformation Technology, Shanghai 200235, China

Department of Medical Microbiology and Parasitology, Institutes of Medical Sciences, Shanghai Jiao Tong University School of Medicine, Shanghai 200240, China

State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, the First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou, Zhejiang 310003, China

Department of Cardiology, Gansu Provincial Hospital, Lanzhou 730000, China

Abstract

Background

Bacterial 16S ribosomal RNA (rRNA) profiling has been widely used in the classification of microbiota-associated diseases. Dimensionality reduction is one of the keys to mining high-dimensional 16S rRNA expression data. High levels of sparsity and redundancy are common in 16S rRNA gene microbial surveys. Traditional feature selection methods are generally restricted to measuring correlated abundances, and are limited in discrimination when so few microbes are actually shared across communities.

Results

Here we present a Feature Merging and Selection algorithm (FMS) to deal with 16S rRNA expression data. By integrating the Linear Discriminant Analysis (LDA) method, FMS reduces the feature dimension with higher accuracy while preserving the relationships between different features. Two 16S rRNA expression datasets, from pneumonia and dental decay patients, were used to test the validity of the algorithm. Combined with SVM, FMS discriminated the different classes of both pneumonia and dental caries better than other popular feature selection methods.

Conclusions

FMS projects data into a lower dimension while preserving enough features, and thus improves the intelligibility of the result. The results showed that FMS is a more valid and reliable method for feature reduction.

Background

The biogeography of microbiota in the human body is intimately linked with aspects of host metabolism, physiology and susceptibility to disease

The ability to successfully distinguish between disease classes using gene expression data is an important aspect of disease classification. Discrimination methods include nearest-neighbor, linear discriminant analysis, and classification trees, etc

In contrast to feature selection, feature transformation methods create a new feature space with an optimal subset of predictive features derived from those measured in the original data. Some traditional feature transformation methods, such as principal component analysis (PCA) and linear discriminant analysis (LDA), output a combination of the original features. PCA converts a set of possibly correlated variables into a set of orthogonal factors that efficiently explain the variance of the observations. LDA transforms the original features to k-1 dimensions if there are k categories of training data. These traditional methods are fast and easy to compute, but there are some weaknesses

Previous surveys showed that taxon relative abundance vectors from 16S rRNA gene expression provide a baseline for studying the role of bacterial communities in disease states

Results

Feature Merging and Selection algorithm

Two statistical methods were considered to handle the continuous and sparse 16S rRNA expression data. The Fisher statistic was used to test the classification ability of features, and the Pearson correlation coefficient was used to describe the redundancy between features. We developed a new method, the Feature Merging and Selection (FMS) algorithm, which incorporates Linear Discriminant Analysis (LDA) to learn linear relationships between different features. Classical LDA requires the total scatter matrix to be nonsingular. However, in gene expression data analysis, all scatter matrices in question can be singular, since the data points come from a very high-dimensional space and the sample size generally does not exceed this dimension. To deal with the singularity problem, classical LDA was modified so that a unit diagonal matrix with small weights was added to the within-class scatter matrix. The procedure continued until the resulting matrix became nonsingular.
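As a rough illustration of the regularization described above, the following sketch adds a small multiple of the identity matrix to the within-class scatter matrix before solving for a discriminant direction. The function name `regularized_lda_direction`, the ridge weight `eps`, and the restriction to two features and two classes are assumptions chosen to keep the example short; they are not taken from the FMS implementation.

```python
def regularized_lda_direction(class_a, class_b, eps=1e-3):
    """Two-class LDA direction for 2-D samples: w ~ (S_W + eps*I)^-1 (mu_a - mu_b).

    class_a, class_b: lists of (x, y) tuples, one list per class.
    eps: hypothetical small weight on the identity matrix (an assumption).
    """
    def mean(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    def scatter(points, mu):
        sxx = sum((p[0] - mu[0]) ** 2 for p in points)
        sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
        syy = sum((p[1] - mu[1]) ** 2 for p in points)
        return sxx, sxy, syy

    mu_a, mu_b = mean(class_a), mean(class_b)
    a = scatter(class_a, mu_a)
    b = scatter(class_b, mu_b)

    # Within-class scatter S_W plus eps on the diagonal.
    sxx = a[0] + b[0] + eps
    sxy = a[1] + b[1]
    syy = a[2] + b[2] + eps

    # S_W is positive semi-definite, so S_W + eps*I has a positive determinant.
    det = sxx * syy - sxy * sxy
    dx, dy = mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]

    # Explicit 2x2 inverse applied to the difference of class means.
    wx = (syy * dx - sxy * dy) / det
    wy = (-sxy * dx + sxx * dy) / det

    norm = (wx * wx + wy * wy) ** 0.5
    return (wx / norm, wy / norm) if norm > 0 else (0.0, 0.0)
```

With a sparse feature that is zero in every sample, the unregularized within-class scatter matrix is singular, whereas the regularized version still yields a direction along the informative feature.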

FMS algorithm consists of two parts: feature merging and feature deletion. Feature merging is the main part of the algorithm. The procedure is described below (see Figure

**FMS algorithm flowchart**.

Step 1: Initialization: set the weights of all features to 1 and the counter to 0; label the features from 1 to n, where n is the total number of features.

Step 2: Repeat steps 3 to 7 until the counter equals n-1.

Step 3: Delete features of zero variance, and add the total number of deleted features to the counter.

Step 4: Compute the pairwise relationships of the remaining features using the modified LDA, and preserve the feature combinations with maximal Fisher statistics. The Fisher statistic of a feature is the ratio of its between-class variance to its within-class variance.

Step 5: Measure the merging ability of each feature pair by combining the Fisher statistic and the Pearson correlation coefficient: merging value = (new value of Fisher statistic) × (Pearson correlation coefficient) / (geometric mean of the Fisher statistics of the original features).

Step 6: Select and merge the feature pair with the greatest merging value, save the original labels, and multiply the weight by previously trained weight.

Step 7: Normalize the weight; add 1 to the counter.

Step 8: Re-compute the weight of each combination using LDA until the original feature number is less than two. Preserve the combination with maximal Fisher statistics value and normalize the weights.
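The merging value in Step 5 can be sketched as follows. The exact form of the Fisher statistic is an assumption here (between-class over within-class sum of squares), as are the function names `fisher_statistic`, `pearson` and `merging_value`; the Pearson coefficient is used with its sign, exactly as the step is worded.

```python
from math import sqrt


def fisher_statistic(values, labels):
    """Between-class over within-class sum of squares for one feature (assumed form)."""
    overall = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for c in set(labels):
        group = [v for v, l in zip(values, labels) if l == c]
        mu = sum(group) / len(group)
        between += len(group) * (mu - overall) ** 2
        within += sum((v - mu) ** 2 for v in group)
    return between / within if within > 0 else float("inf")


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant feature."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def merging_value(f_merged, r, f_a, f_b):
    """Step 5: new Fisher statistic times correlation over the geometric mean
    of the Fisher statistics of the two original features."""
    return f_merged * r / sqrt(f_a * f_b)
```

In a full merging loop, the pair with the greatest `merging_value` would be merged at each iteration, as described in Step 6.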

After feature merging, the resulting combinations reveal the relationships between the original features. As more features are deleted, the linear bias increases but the variance decreases, and vice versa. To compromise between the bias and variance criteria, we selected the dimension reduction ratios by 5-fold proportional cross validation

To simplify the model, features were deleted based on the resulting combinations after feature merging and cross validation. Values of Fisher statistics were multiplied by the weight of each combination. Features were sorted in ascending order by the absolute value of their weights and deleted one by one, and the error rate was obtained by 5-fold proportional cross validation. For classification performances with equal error rates, the resulting combination with lower dimension or fewer features was preserved. Unimportant features were thus deleted to simplify the model. In summary, FMS determines the final dimensionality, and thus the optimal number of features, as the one that yields the lowest cross-validation error rate. FMS is a dimensionality reduction method and should be used in combination with a classifier.
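The fold construction can be sketched as below, assuming that "proportional" cross validation means each fold preserves the class proportions of the training data (often called stratified cross validation). The function name `proportional_folds` and the round-robin assignment are illustrative assumptions.

```python
def proportional_folds(labels, n_folds=5):
    """Split sample indices into folds that keep class proportions roughly equal.

    Assumption: "proportional" is read as stratified by class label.
    """
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Deal the samples of each class to the folds in round-robin order.
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds
```

Each fold would serve once as the validation set while the remaining folds are used for training, and the error rates would then be averaged over the folds.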

The Fisher method has a high classification ability on datasets with low noise, but its performance can be reduced by noisy data. To address this weakness, the mutual information method was used for feature deletion instead of the Fisher statistic when dealing with noisy data. Under Occam's razor

Examples of FMS algorithm

We first tested the FMS algorithm on the 16S rRNA expression profiles obtained from pneumonia samples belonging to three classes: 101 patients with hospital-acquired pneumonia (HAP), 43 patients with community-acquired pneumonia (CAP), and 42 normal persons as controls

The feature merging algorithm was then performed on the whole training data based on the obtained degrees of feature merging and feature deletion; the output reflected the relationships between combinations of features. The classifier was then used to classify the test data, and the error rate was obtained. The k-nearest neighbor algorithm (kNN) and the support vector machine (SVM) are widely used tools for classification. SVM was selected as the classifier to accompany the algorithm because of its lower error rate on the pneumonia training data. Four widely used feature selection methods, the mRMR method, Information Gain, the Kruskal-Wallis test and the χ^{2} statistic

Two types of classification were considered: a three-class problem and a two-class problem. The former outputs three classes, i.e. HAP, CAP and normal; the latter outputs two classes,

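For balanced training data, the error rate obtained from the whole training data is suitable for measuring classification ability. However, it is not suitable for imbalanced data. Therefore, the mean error rate $\bar{e} = \frac{1}{k}\sum_{i=1}^{k} e_{i}$ was used, where $e_{i}$ is the error rate of the i-th class and k is the number of classes.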
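A minimal sketch of such a class-averaged error rate is given below, assuming it is the unweighted mean of the per-class error rates; the function name `mean_class_error` is hypothetical.

```python
def mean_class_error(y_true, y_pred):
    """Unweighted mean of per-class error rates (assumed definition)."""
    per_class = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        wrong = sum(1 for i in idx if y_pred[i] != c)
        per_class.append(wrong / len(idx))
    return sum(per_class) / len(per_class)
```

For example, a classifier that labels every sample as pneumonia on a set of 2 normal and 8 pneumonia samples has an overall error rate of 0.2 but a mean class error rate of 0.5, which exposes the failure on the minority class.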

**Learning curves of FMS algorithm for feature merging in 3-class problem (a), feature deletion in 3-class problem (b), feature merging in 2-class problem (c) and feature deletion in 2-class problem (d)**.

Combined with either the SVM or the kNN classifier, the FMS algorithm has the lowest mean error rate in both the 3-class and 2-class problems compared with four other widely used feature selection methods: the χ^{2} statistic, mRMR, Information Gain and the Kruskal-Wallis test

Classification ability on pneumonia data in 3-class problem

| **Method** | **Error rate on training data** | **Error rate on test data** | **Dimension** | **Feature number** | **Note** |
| --- | --- | --- | --- | --- | --- |
| svm/FMS | 0.1895 | 0.2637 | 29 | 129 | |
| svm/mRMR | 0.2267 | 0.3103 | 38 | 38 | |
| svm/KruskalWallis | 0.1984 | 0.3816 | 107 | 107 | |
| svm/InformationGain | 0.2425 | 0.3684 | 28 | 28 | |
| svm/χ2 statistic | 0.2127 | 0.4308 | 125 | 125 | |
| svm | 0.2841 | 0.4017 | 137 | 137 | |
| kNN/FMS | 0.2013 | 0.3406 | 112 | 133 | k = 1 |
| kNN/mRMR | 0.2635 | 0.3774 | 130 | 130 | k = 1 |
| kNN/KruskalWallis | 0.2492 | 0.3795 | 134 | 134 | k = 1 |
| kNN/InformationGain | 0.2635 | 0.3774 | 130 | 130 | k = 1 |
| kNN/χ2 statistic | 0.2537 | 0.4128 | 124 | 124 | k = 1 |
| kNN | 0.2635 | 0.3774 | 137 | 137 | k = 1 |

Classification ability on pneumonia data in 2-class problem.

| **Method** | **Error rate on training data** | **Error rate on test data** | **Dimension** | **Feature number** | **Note** |
| --- | --- | --- | --- | --- | --- |
| svm/FMS | 0.0922 | 0.1279 | 42 | 123 | |
| svm/mRMR | 0.1313 | 0.1977 | 36 | 36 | |
| svm/KruskalWallis | 0.1081 | 0.1628 | 62 | 62 | |
| svm/InformationGain | 0.1456 | 0.186 | 54 | 54 | |
| svm/χ2 statistic | 0.1561 | 0.186 | 127 | 127 | |
| svm | 0.1611 | 0.1977 | 137 | 137 | |
| kNN/FMS | 0.1279 | 0.2393 | 20 | 130 | k = 1 |
| kNN/mRMR | 0.2532 | 0.3372 | 54 | 54 | k = 1 |
| kNN/KruskalWallis | 0.1861 | 0.3343 | 25 | 25 | k = 4 |
| kNN/InformationGain | 0.2248 | 0.3256 | 107 | 107 | k = 1 |
| kNN/χ2 statistic | 0.336 | 0.4535 | 107 | 107 | k = 1 |
| kNN | 0.346 | 0.4419 | 137 | 137 | k = 1 |

**Supplementary materials, pdf format**.

A heatmap is a frequently used matrix of pairwise sample correlations in which anti-correlation or correlation is indicated by a color scale,

**The expression profiles of the original pneumonia data for the 3-class problem (a), the data after FMS treatment for the 3-class problem (b), the original pneumonia data for the 2-class problem (c) and the data after FMS treatment for the 2-class problem (d)**. Rows are microbiota and columns are disease classes. From left to right are 30 normal, 32 CAP and 71 HAP samples for the 3-class problem, and 30 normal and 103 pneumonia samples for the 2-class problem.

Combinations of features were sorted by their Fisher statistics, which indicated the discrimination ability. The microbiota signatures with best discrimination ability enabled us to identify low- and high-risk patients with distinct pneumonia classes (Additional file

**Phylogenetic relationship of microbiota signatures in 3-class problem**. The microbiota signatures with the best discrimination ability are labeled with a green star.

**Phylogenetic relationship of microbiota signatures in 2-class problem**. The microbiota signatures with the best discrimination ability are labeled with a green star.

The FMS algorithm was also tested on 16S rRNA profiles from dental decay patients. The samples were collected from saliva and dental plaque separately. For the 16S rRNA expression levels from dental plaque samples, the training data contained 23 dental decay patient samples and 20 normal samples, and the test data contained 9 dental decay patient samples and 8 normal samples. For the 16S rRNA expression levels from saliva samples, the training data contained 23 dental decay patient samples and 19 normal samples, and the test data contained 10 dental decay patient samples and 8 normal samples. As these dental decay datasets are noisy, the mutual information method was used for feature deletion instead of the Fisher statistic. On these noisy data, FMS also performed better than the mRMR method

Conclusions

In this work, we introduced the FMS algorithm to address the high levels of sparsity and redundancy in 16S rRNA gene microbial surveys, thereby identifying combinations of 16S rRNA genes that give the best discrimination of sample groups. The FMS method has several distinct advantages that make it useful to researchers: 1) FMS reduces the feature dimension with higher accuracy and preserves the relationships between different features, thus improving the intelligibility of the result. 2) FMS processes features into sets of combinations, which distinguish among classes more efficiently and meaningfully than individual features; this is in line with the observation that particular combinations of specific bacteria are associated with individual symptoms and signs

In conclusion, we developed a new feature merging and selection algorithm for 16S rRNA expression data that reduces feature dimensionality while retaining enough important features. The improved method retains some advantages of both LDA and other feature selection methods, and reduces dimensions much more effectively. As the classification examples showed, the FMS algorithm reduced the dimensionality of the data effectively without losing important features, which made the results more intelligible. FMS performed well and will be useful in human microbiome projects for identifying biomarkers for disease or other physiological conditions.

Data and method

Data

We obtained the 16S rRNA expression profiles of pneumonia patients from Zhou et al.,

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a typical variable transformation method to reduce dimensions

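LDA finds a direction that maximizes the separation between the projected class means while minimizing the within-class variance in this direction. To avoid a singular within-class scatter matrix $S_{W}$, a unit diagonal matrix with small weights was added to $S_{W}$, which makes $S_{W}$ nonsingular.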

Support vector machine algorithm

The support vector machine (SVM) algorithm is one of the most popular supervised learning methods, based on the concept of the maximal margin hyperplane

k-nearest neighbor algorithm

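The k-nearest neighbor algorithm (kNN) is a nonparametric method of supervised classification based on a distance function $d(x_{q}, x_{i})$ between a query sample $x_{q}$ and each training sample $x_{i}$: $x_{q}$ is assigned to the class most common among the k training samples $x_{i}$ closest to it.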
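A minimal kNN classifier can be sketched as follows. The Euclidean distance and the majority vote are assumptions about details not spelled out in this section, and `knn_predict` is a hypothetical name.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k=1):
    """Label a query sample by majority vote among its k nearest training samples.

    Assumption: Euclidean distance; ties in the vote go to the label that
    appears first among the nearest neighbors.
    """
    ranked = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```

With k = 1, as noted for most kNN rows in the tables, the query simply takes the label of its single nearest training sample.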

k-means clustering method

k-means clustering is an unsupervised classification method for finding clusters and cluster centers. The method works in three steps: (1) select the first k samples as the seed means; (2) assign each sample to the nearest mean and update the means; (3) end the loop when the mean values no longer change. We used Euclidean distance as the distance function. The program can be downloaded from
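The three steps can be sketched as below. This is an illustrative implementation, not the downloadable program mentioned above; the function name `k_means`, the iteration cap `max_iter`, and the handling of empty clusters are assumptions, and k is assumed not to exceed the number of samples.

```python
from math import dist


def k_means(samples, k, max_iter=100):
    """k-means with the first k samples as seed means and Euclidean distance."""
    means = [tuple(s) for s in samples[:k]]  # step 1: seed means
    assignment = []
    for _ in range(max_iter):
        # Step 2: assign every sample to its nearest mean.
        assignment = [min(range(k), key=lambda c: dist(s, means[c])) for s in samples]
        new_means = []
        for c in range(k):
            members = [s for s, a in zip(samples, assignment) if a == c]
            if members:
                new_means.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_means.append(means[c])  # keep the old mean of an empty cluster
        # Step 3: stop when the means no longer change.
        if new_means == means:
            break
        means = new_means
    return means, assignment
```

Seeding with the first k samples makes the result deterministic but dependent on sample order.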

Mutual information

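Mutual information measures the mutual dependence between two variables based on information theory. The mutual information of two continuous variables X and Y is defined as:

$$I(X;Y) = \int_{Y}\int_{X} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy$$

where $p(x,y)$ is the joint probability density of X and Y, and $p(x)$ and $p(y)$ are the marginal densities.

In the case of discrete variables, mutual information is defined as:

$$I(X;Y) = \sum_{y \in Y}\sum_{x \in X} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$$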
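For discrete variables, the mutual information can be estimated from observed frequencies, as in the sketch below. The base-2 logarithm (result in bits) and the function name `mutual_information` are assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) of two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # Only observed (x, y) pairs contribute, so every logarithm is finite.
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )
```

A binary feature that perfectly determines a balanced binary class has a mutual information of 1 bit with that class, and a feature independent of the class has a mutual information of 0.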

We sorted the class mean values of each feature, computed the average of each pair of adjacent values, and discretized each feature according to these averages; the mutual information was then calculated. Datasets with mutual information below a threshold of 0.03 were considered noisy, in which case the mutual information method was used instead of the Fisher statistic in the feature deletion step.

To measure classification ability on noisy data, we discretized features according to the median value of classes for each feature, and then computed the mutual information.

Minimum Redundancy Maximum Relevance

Minimum Redundancy Maximum Relevance (mRMR) method is widely used for feature selection such as gene selection

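The Minimum Redundancy is defined as:

$$\min R(S), \qquad R(S) = \frac{1}{|S|^{2}}\sum_{x_{i},\,x_{j} \in S} I(x_{i};x_{j})$$

where S is the set of selected features and $I(x_{i};x_{j})$ is the mutual information between features $x_{i}$ and $x_{j}$.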

The mRMR feature set is obtained by optimizing the Maximum Relevance and Minimum Redundancy simultaneously. Optimization of both conditions requires combining them into a single criterion function. In this paper, the m-th feature was selected according to the value of Maximum Relevance divided by Minimum Redundancy
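The quotient criterion can be sketched as a greedy selection, shown below. This is not the downloadable mRMR program; the function names, the small constant that guards against division by zero, and the use of base-2 logarithms are assumptions for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) of two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def mrmr_select(features, target, n_select):
    """Greedy mRMR: pick the feature maximizing relevance / redundancy.

    features: dict mapping a feature name to its list of discrete values.
    """
    selected = []
    remaining = list(features)
    while remaining and len(selected) < n_select:
        def score(name):
            relevance = mutual_information(features[name], target)
            if not selected:
                return relevance  # first feature: relevance only
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance / (redundancy + 1e-12)  # guard is an assumption
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In the test data used for this sketch, an exact duplicate of an already selected feature scores lower than an equally relevant but independent feature, which is the behavior the redundancy term is meant to produce.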

The mRMR method requires the training data to be discretized before running. Considering the sparseness of the data, we assigned 1 to features with expression information and 0 to features without expression. The mRMR program can be downloaded from web site:

Kruskal-Wallis test

Kruskal-Wallis test is a non-parametric method for testing whether samples originate from the same distribution
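As an illustration, the Kruskal-Wallis H statistic can be computed from pooled ranks as sketched below. The sketch uses average ranks for tied values but omits the tie correction factor, which matters for sparse data with many zeros; the function name `kruskal_wallis_h` is hypothetical.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic without tie correction.

    groups: list of lists, one list of feature values per class.
    """
    pooled = sorted(v for g in groups for v in g)
    # Tied values share the mean of the ranks they occupy.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n = len(pooled)
    total = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
```

A larger H indicates that the feature's values differ more strongly between classes, so features can be ranked by H when the test is used for feature selection.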

Information Gain

Information Gain measures the classification ability of each feature with respect to the relevance with the output class, which is defined as Information Gain = H(S)-H(S|x)
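The definition above can be sketched for a discrete feature as follows, taking H as the Shannon entropy in bits; the function names `entropy` and `information_gain` are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(S) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """H(S) - H(S|x) for a discrete feature x."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional
```

A feature that splits the samples into pure classes removes all uncertainty about the class, while a feature unrelated to the class yields a gain of 0.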

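χ^{2} statistic

The Chi-squared (χ^{2}) method uses the χ^{2} statistic to discretize numeric attributes and achieves feature selection via discretization. The χ^{2} value is defined as

$$\chi^{2} = \sum_{i}\sum_{j}\frac{(A_{ij} - E_{ij})^{2}}{E_{ij}}, \qquad E_{ij} = \frac{R_{i}\,C_{j}}{N}$$

where $A_{ij}$ is the number of samples in the i-th interval and the j-th class, $R_{i}$ is the number of samples in the i-th interval, $C_{j}$ is the number of samples in the j-th class, and N is the total number of samples. Features were sorted by their χ^{2} statistic values: the larger the value, the more important the feature.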
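For an already discretized feature, the χ^{2} value can be computed from the contingency table of intervals and classes, as in the sketch below. The discretization step itself is not shown, and the function name `chi_squared` is hypothetical.

```python
def chi_squared(feature, labels):
    """Chi-squared statistic of a discretized feature against class labels."""
    n = len(labels)
    rows = sorted(set(feature))  # intervals of the discretized feature
    cols = sorted(set(labels))   # classes
    observed = {(r, c): 0 for r in rows for c in cols}
    for f, l in zip(feature, labels):
        observed[(f, l)] += 1
    row_total = {r: sum(observed[(r, c)] for c in cols) for r in rows}
    col_total = {c: sum(observed[(r, c)] for r in rows) for c in cols}
    chi2 = 0.0
    for r in rows:
        for c in cols:
            # Expected count under independence; positive because every
            # interval and class is observed at least once.
            expected = row_total[r] * col_total[c] / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    return chi2
```

Ranking features by this value in descending order places the features most strongly associated with the class first.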

Abbreviations

FMS: Feature Merging and Selection algorithm; PCA: Principal Component Analysis; LDA: Linear Discriminant Analysis; HAP: hospital-acquired pneumonia; CAP: community-acquired pneumonia; kNN: K-nearest neighbor algorithm; SVM: Support vector machine; mRMR: Minimum Redundancy Maximum Relevance.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

YW performed algorithm design and wrote the manuscript. YZ, ZL and YZ collected the data. YL and XG designed and sponsored the study. HS contributed and edited the manuscript. All authors read and approved the manuscript.

Acknowledgements

This work was supported by the National '973' Basic Research Program (2010CB529200, 2011CB910204, 2011CB510100, 2010CB529206, 2010CB912702), Research Program of CAS (KSCX2-EW-R-04, KSCX2-YW-R-190, 2011KIP204), National Natural Science Foundation of China (30900272, 31070752), Chinese Ministry for Science and Technology Grant 2008BAI64B01, Chinese High-Tech R&D Program (863) (2009AA02Z304, 2009AA022710), and Shanghai Committee of Science and Technology (09ZR1423000).

This article has been published as part of