Alignment-free supervised classification of metagenomes by recursive SVM
1 Department of Automation, Bioinformatics Division/Center for Synthetic & Systems Biology, TNLIST, MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China
2 School of Life Sciences and School of Medicine, Tsinghua University, Beijing 100084, China
BMC Genomics 2013, 14:641 doi:10.1186/1471-2164-14-641Published: 22 September 2013
Comparison and classification of metagenome samples is one of the major tasks in the study of microbial communities of natural environments or niches on human bodies. Bioinformatics methods play important roles on this task, including 16S rRNA gene analysis and some alignment-based or alignment-free methods on metagenomic data. Alignment-free methods have the advantage of not depending on known genome annotations and therefore have high potential in studying complicated microbiomes. However, the existing alignment-free methods are all based on unsupervised learning strategy (e.g., PCA or hierarchical clustering). These types of methods are powerful in revealing major similarities and grouping relations between microbiome samples, but cannot be applied for discriminating predefined classes of interest which might not be the dominating assortment in the data. Supervised classification is needed in the latter scenario, with the goal of classifying samples into predefined classes and finding the features that can discriminate the classes. The effectiveness of supervised classification with alignment-based features on metagenomic data have been shown in some recent studies. The application of alignment-free supervised classification methods on metagenome data has not been well explored yet.
We developed a method for this task using k-tuple frequencies as features counted directly from metagenome short reads and the R-SVM (Recursive SVM) for feature selection and classification. We tested our method on a simulation dataset, a real dataset composed of several known genomes, and a real metagenome NGS short reads dataset. Experiments on simulated data showed that the method can classify the classes almost perfectly and can recover major sequence signatures that distinguish the two classes. On the real human gut metagenome data, the method can discriminate samples of inflammatory bowel disease (IBD) patients from control samples with high accuracy, which cannot be separated when comparing the samples with unsupervised clustering approaches.
The proposed alignment-free supervised classification method can perform well in discriminating of metagenomic samples of predefined classes and in selecting characteristic sequence features for the discrimination. This study shows as an example on the feasibility of using metagenome sequence features of microbiomes on human bodies to study specific human health conditions using supervised machine learning methods.