Department of Epidemiology, College of Public Health and Health Professions and College of Medicine, Emerging Pathogens Institute, University of Florida, 32610, Gainesville, FL, USA

Department of Electrical and Computer Engineering, University of Florida, 32610, Gainesville, FL, USA

Department of Microbiology and Immunology, Center of Excellence in Bioinformatics & Life Sciences, Department of Computer Science and Engineering, The State University of New York at Buffalo, 14214, Buffalo, NY, USA

Abstract

Background

Binning 16S rRNA sequences into operational taxonomic units (OTUs) is a crucial initial step in analyzing the large sequence datasets generated to determine microbial community composition in various environments, including the human gut. Various methods have been developed, but most either suffer from inaccuracies or cannot handle the millions of sequences generated in current studies. Furthermore, existing binning methods usually require

Results

We present a novel modularity-based approach (M-pick) to address the aforementioned problems. The new method utilizes ideas from community detection in graphs, where sequences are viewed as vertices on a weighted graph, each pair of sequences is connected by an imaginary edge, and the similarity of a pair of sequences represents the weight of the edge. M-pick first generates a graph based on pairwise sequence distances and then applies a modularity-based community detection technique on the graph to generate OTUs to capture the community structures in sequence data. To compare the performance of M-pick with that of existing methods, specifically CROP and ESPRIT-Tree, sequence data from different hypervariable regions of 16S rRNA were used and binning results were compared.

Conclusions

A new modularity-based clustering method for OTU picking of 16S rRNA sequences is developed in this study. The algorithm does not require a predetermined cut-off level, and our simulation studies suggest that it is superior to existing methods that require specified distance levels to define OTUs. The source code is available at

Background

Recent advances in high-throughput sequencing technologies have contributed to an explosion in sequence data from studies of microbial composition in various environments that harbor complex microbial communities. As one of the most commonly used approaches for such studies, 16S rRNA sequences are analyzed to estimate species composition and diversity.

An initial requirement for downstream analyses of 16S rRNA sequences is binning them into operational taxonomic units (OTUs) that contain similar sequences. The existing methods can be divided into two classes: taxonomy-dependent methods and taxonomy-independent (TI) methods

In TI methods, pairwise sequence distances are computed either by multiple sequence alignment (MSA) or pairwise sequence alignment (PSA) and several clustering algorithms can then be applied to form OTUs. These clustering algorithms include hierarchical clustering algorithms such as DOTUR

One of the critical problems with existing TI methods is the need to set an appropriate distance threshold to retrieve the optimal OTU binning at a distinct taxonomic level such as species. Applying different thresholds leads to inconsistent binning results. Furthermore, appropriate distance levels appear to vary depending on the chosen hypervariable region

Some efforts have been made recently to address this issue. In

In this study, a modularity-based clustering method was developed for OTU picking. By viewing an OTU as a collection of related sequences with similar densities in a sequence space, we applied a community detection method and treated OTU picking as a community structure detection problem.

Methods

Modularity-based clustering

We herein refer to community structure as the occurrence of groups of vertices in a graph that are more densely connected with each other than with the rest of the graph. Modularity-based methods are popular in community detection; they are derived from the intuition that a graph has community structure if the number of edges within groups is significantly greater than expected by chance
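For orientation, the weighted modularity widely used in the community detection literature can be written as below. The notation is an assumption and may differ from the authors' Equation (1): $A_{ij}$ is the weight of the edge between vertices $i$ and $j$, $k_i=\sum_j A_{ij}$ is the total weight of the edges attached to vertex $i$, $m=\frac{1}{2}\sum_{i,j}A_{ij}$ is the total edge weight, and $c_i$ is the community to which vertex $i$ is assigned.

```latex
\[
Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)
\]
```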

where $\delta(c_i, c_j)=1$ if vertices $i$ and $j$ are assigned to the same community ($c_i = c_j$), and otherwise $\delta(c_i, c_j)=0$. The term

Modularity itself is also a quality function that indicates whether a partitioning of a graph can reveal the community structure on the graph if such structure exists. The maximum value of modularity is 1; a large value implies good partitioning. The maximum

Several algorithms have been developed to efficiently optimize modularity. Among them, the algorithm in

In the context of OTU picking, a weighted graph is formed by: i) viewing sequences as vertices, where each pair of sequences is connected by an imaginary edge, and ii) viewing the similarity of a pair of sequences as the weight on the edge connecting these two sequences. Thus, the modularity of a partition of sequences can be computed using Equation (1); the best clustering result is the one that maximizes the modularity. In such a result, each cluster represents an OTU with high internal homogeneity, that is, similarities between sequences within an OTU are greater than those between sequences in different OTUs. Using this approach, OTUs are defined by homogeneity of edge densities and not by distance between neighboring clusters, circumventing the need to choose distance levels.
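As a minimal sketch of how the modularity of a given partition of sequences could be evaluated from a similarity matrix, the function below computes the standard weighted modularity. The function name `modularity`, the nested-list input, and the choice to ignore self-similarity on the diagonal are assumptions made for illustration; they are not taken from the M-pick source code.

```python
def modularity(sim, labels):
    """Standard weighted modularity of a partition.

    sim    : symmetric matrix (list of lists) of pairwise similarities
    labels : labels[i] is the cluster (OTU) assigned to sequence i
    """
    n = len(sim)
    # Weighted degree of each vertex; the diagonal (self-similarity) is ignored.
    k = [sum(sim[i][j] for j in range(n) if j != i) for i in range(n)]
    two_m = sum(k)  # twice the total edge weight
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:  # only pairs in the same cluster count
                a_ij = sim[i][j] if i != j else 0.0
                q += a_ij - k[i] * k[j] / two_m
    return q / two_m


# Toy graph: two pairs of identical sequences with no similarity across pairs.
toy = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
q_split = modularity(toy, [0, 0, 1, 1])   # the two natural clusters
q_single = modularity(toy, [0, 0, 0, 0])  # everything in one cluster
```

On this toy graph the natural two-cluster partition scores 0.5, while placing every sequence in one cluster scores 0, which matches the statement that a single-cluster partition has zero modularity.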

A toy example comparing the modularity-based method and average linkage based hierarchical clustering is shown in Figure

**M-pick outperforms hierarchical clustering when clusters have different sizes.** Clusters are represented in different colors. (**a**) Ground truth generated from three Gaussian distributions. (**b**) Clustering results of M-pick. (**c**) Clustering results of average linkage based hierarchical clustering.

Our modularity-based approach includes three steps. (1) Pairwise sequence distances are computed using the alignment module of ESPRIT

In the first step, we generate a pairwise distance matrix, viewable as a fully connected graph. However, the fully connected graph cannot be directly used to perform clustering because of i) prohibitive computational costs and ii) the resolution limit problem which states that modularity-based methods may fail to acquire clusters smaller than a scale depending on the total size of the graph

Due to the resolution limit problem, which often generates big clusters, it is not desirable to perform the clustering only once. Thus, we recursively evaluate each formed cluster to determine the need for further partitioning. The maximum modularity detected on a graph can indicate the presence of community structure in the graph. While a single cluster partitioning has modularity 0, partitions on a highly homogeneous graph (i.e., a graph with limited community structure) have modularity values close to 0. On the other hand, if multiple communities exist on a graph, some partitions will have large modularity values. Thus, the maximum modularity obtained on a graph can be used as a homogeneity criterion, suggesting the existence of multiple communities. Here we recursively apply clustering to sub-graphs exhibiting large modularity values, with the final sub-graphs or clusters having a maximum value less than a threshold

**Flowchart of M-pick.** (**a**) The overall process. (**b**) The recursive clustering process.
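The recursive clustering process can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: `best_bipartition` is a hypothetical brute-force stand-in for a real modularity optimizer and is only usable on toy inputs, the sketch works on a full similarity matrix instead of the graph that M-pick actually constructs, and `q_min` is a hypothetical name for the modularity threshold below which a cluster is no longer split.

```python
from itertools import combinations


def subgraph_modularity(sim, groups):
    """Weighted modularity of `groups` on the sub-graph induced by their union."""
    verts = [v for g in groups for v in g]
    # Weighted degree of each vertex, counted inside the sub-graph only.
    k = {i: sum(sim[i][j] for j in verts if j != i) for i in verts}
    two_m = sum(k.values())
    if two_m == 0:
        return 0.0
    q = 0.0
    for g in groups:
        for i in g:
            for j in g:
                a_ij = sim[i][j] if i != j else 0.0
                q += a_ij - k[i] * k[j] / two_m
    return q / two_m


def best_bipartition(sim, verts):
    """Hypothetical brute-force optimizer: best two-way split of `verts`."""
    best_groups, best_q = [list(verts)], 0.0  # a single cluster has modularity 0
    first, rest = verts[0], verts[1:]
    for r in range(len(rest)):  # r < len(rest) keeps the second part non-empty
        for extra in combinations(rest, r):
            left = [first, *extra]
            right = [v for v in rest if v not in extra]
            q = subgraph_modularity(sim, [left, right])
            if q > best_q:
                best_groups, best_q = [left, right], q
    return best_groups, best_q


def recursive_pick(sim, verts, q_min):
    """Split `verts` recursively while the best modularity found is at least q_min."""
    groups, q = best_bipartition(sim, verts)
    if len(groups) == 1 or q < q_min:
        return [sorted(verts)]  # homogeneous enough: report as one cluster
    clusters = []
    for g in groups:
        clusters.extend(recursive_pick(sim, g, q_min))
    return clusters


# Toy data: three pairs of near-identical sequences, weakly similar across pairs.
sim = [[1.0 if i // 2 == j // 2 else 0.05 for j in range(6)] for i in range(6)]
otus = recursive_pick(sim, list(range(6)), q_min=0.3)
```

With `q_min=0.3` the toy graph is first split in two and the larger part is split again, yielding the three pairs; with a much higher threshold such as 0.9 no split is accepted and all six sequences stay in one cluster.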

Clustering results validation

Different clustering results are frequently obtained for the same sequence data set by applying different clustering methods and/or different parameter settings. The lack of a ground truth complicates an objective comparison of clustering methods. Generally, there are two types of clustering validation methods

Normalized mutual information (NMI) is a well-known external criterion previously used for validating OTU picking; it measures the difference of a clustering result from a perceived ground truth
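A minimal sketch of how NMI can be computed from two label vectors is given below. Several normalizations are in use; dividing the mutual information by the geometric mean of the two entropies is an assumption here and may not be the variant used in this study. The function name `nmi` is likewise illustrative.

```python
from collections import Counter
from math import log, sqrt


def nmi(truth, pred):
    """Normalized mutual information between two labelings of the same items."""
    n = len(truth)
    count_t = Counter(truth)             # cluster sizes in the ground truth
    count_p = Counter(pred)              # cluster sizes in the clustering result
    joint = Counter(zip(truth, pred))    # sizes of the pairwise overlaps

    # Mutual information I(T; P) from the contingency counts.
    mi = sum(
        c / n * log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )
    # Entropies of each labeling.
    h_t = -sum(c / n * log(c / n) for c in count_t.values())
    h_p = -sum(c / n * log(c / n) for c in count_p.values())
    if h_t == 0 or h_p == 0:
        # At least one labeling is a single cluster; define NMI as 1 only
        # when both are.
        return 1.0 if h_t == h_p else 0.0
    return mi / sqrt(h_t * h_p)


same = nmi([0, 0, 1, 1], [1, 1, 0, 0])       # identical up to relabeling
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # statistically independent
```

Because NMI depends only on the contingency counts, a clustering that reproduces the ground truth with different label names still scores 1, whereas an independent labeling scores 0.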

Another popular external criterion is the F-score, which jointly considers precision and recall. However, each cluster $k$ in the ground truth is judged only by its best-matched cluster in the clustering result. Thus, other small clusters that match cluster $k$ cannot affect the F-score, so the F-score overestimates the correlation when many small clusters are present.
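The best-match behavior of the F-score can be illustrated with a short sketch. The size-weighted average over ground-truth clusters used here is the common clustering F-measure and is an assumption about the exact definition; the function name `best_match_f_score` is hypothetical.

```python
from collections import Counter


def best_match_f_score(truth, pred):
    """Clustering F-score: each true cluster is scored by its best-matched cluster."""
    n = len(truth)
    size_t = Counter(truth)
    size_p = Counter(pred)
    overlap = Counter(zip(truth, pred))

    best = {}
    for (t, p), c in overlap.items():
        precision = c / size_p[p]   # fraction of predicted cluster p drawn from t
        recall = c / size_t[t]      # fraction of true cluster t recovered by p
        f = 2 * precision * recall / (precision + recall)
        best[t] = max(best.get(t, 0.0), f)  # only the best match is kept

    # Average over true clusters, weighted by cluster size.
    return sum(size_t[t] / n * best[t] for t in size_t)


# One true cluster of eight sequences; both results recover five of them in
# one cluster but differ in how the remaining three are fragmented.
truth = [0] * 8
one_leftover = [0, 0, 0, 0, 0, 1, 1, 1]
three_leftovers = [0, 0, 0, 0, 0, 1, 2, 3]
f_a = best_match_f_score(truth, one_leftover)
f_b = best_match_f_score(truth, three_leftovers)
```

Both results receive the same score because only the five-member best match is evaluated; the additional fragmentation in the second result goes unpenalized.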

Internal validation indices such as Silhouette width

Results

16S rRNA sequences of different hypervariable regions were used to compare M-pick with ESPRIT-Tree and CROP.

We first constructed a reference database from the RDP-II database

**Procedures to generate ground truth for 16S datasets.**

Case study 1 - V2 variable region

We used published sequences previously generated to study the association between obesity and the composition of human gut microbiota

ESPRIT-Tree was applied to each test subset using distance levels between 0.01-0.1 (incremented by 0.01) and the peak NMI score was chosen. Similarly, CROP was applied to each test dataset using different cut-off settings (1%, 2%, 3%, 5%, and 8%) as described in

**Performance validation for Case study 1.** (**a**) Peak NMI scores of CROP and ESPRIT-Tree compared with NMI scores of M-pick. (**b**) Boxplots of NMI scores of CROP (boxes, at cut-offs of 0.01, 0.02, 0.03, 0.05, and 0.08) and ESPRIT-Tree (filled boxes, at cut-offs ranging from 0.01 to 0.1 in increments of 0.01).

While ESPRIT-Tree and CROP can achieve NMI scores greater than 0.9 at their optimum distance level, results are sensitive to the chosen distance level (which is not known

In addition to the NMI scores, we also checked if the three methods could accurately estimate the number of species in the test datasets (Table

| | **CROP** | **ESPRIT-Tree** | **M-pick** |
|---|---|---|---|
| # OTU (mean, std) | 55.5 (19.5) | 45.3 (10.8) | 56.6 (3.1) |
| Best distance level | 2%–3% | 4%–5% | N/A |

In order to evaluate the impact of parameter selection (

**NMI scores of M-pick using different ε and δ.**

Case study 2 - V9 variable region

To confirm the observation described above and to generalize our findings, we performed additional studies using different datasets covering various 16S rRNA hypervariable regions. Results from a second case study are presented below. This study was performed on a dataset retrieved from a soil microbial diversity study

As in the first case study, we initially performed a BLAST search of the sequences against the annotated RDP-II database and filtered the sequences using the previously described criteria. We then randomly extracted 10 test subsets, each containing 1000 sequences from the 100 most abundant species in the ground truth. The proposed M-pick algorithm was applied by setting

**Peak NMI scores of CROP and ESPRIT-Tree compared with NMI scores of M-pick.**

**NMI scores of M-pick using different ε and δ.**

Case study 3 - V3 variable region

For ease of presentation, we used only the top 50 or 100 species in the previous case studies, which may not give a complete picture of how M-pick performs on a complete real dataset.

In this case study, we used a dataset from our sepsis study designed to investigate the association between sepsis and intestinal microbiota in infants with very low birth weight. The dataset contains 110,000 sequences from the V3 region. ESPRIT-Tree and M-pick were applied to obtain clustering results for the whole dataset.

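| | **ESPRIT-Tree (0.01)** | **ESPRIT-Tree (0.02)** | **ESPRIT-Tree (0.03)** | **ESPRIT-Tree (0.04)** | **M-pick** |
|---|---|---|---|---|---|
| # OTUs | 8823 | 2338 | 1356 | 944 | 921 |
| NMI | 0.846 | 0.870 | 0.859 | 0.831 | 0.879 |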

Case study 4 - simulated dataset

In the above case studies, the ground truth was generated by keeping only the sequences that closely matched the RDP-II database under stringent criteria. However, this way of generating the ground truth could be questioned. To address this concern, we included a simulated dataset from

Additional case studies

Additional case studies are provided in the Additional file

**Results of case studies not included in the main text.**

Discussion

We herein developed a novel modularity-based clustering method, M-pick, for binning 16S rRNA sequences into OTUs. M-pick is based on graph partitioning, and does not require a predetermined distance level to generate OTUs, which is a challenging requirement for many other OTU picking methods.

M-pick is based on a concept from graph partitioning. It initially creates a similarity based graph composed of all the sequences in a dataset. The algorithm first computes the pairwise sequence distances, and then implicitly creates an

We used multiple sequence datasets from different hypervariable regions of 16S rRNA to compare the performance of M-pick with two other commonly used algorithms, CROP and ESPRIT-Tree. Both are thought to generate accurate clustering results if the optimal distance level is known. However, the optimal distance level, which is not known

Two parameters are required by M-pick.

The computational cost is composed of two parts. (1) $O(N^{2})$ is consumed in computing pairwise sequence distances, where

Conclusions

We developed M-pick, a new modularity-based clustering method, for OTU picking of 16S rRNA sequences. The algorithm does not require a predetermined cut-off value, and our simulation studies suggest that it is superior to the methods that require specified distance levels to define OTUs. M-pick appears to offer a viable alternative for binning similar sequences into OTUs.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

XW, YS and VM designed the study. XW and JY performed the simulations. All authors discussed the results, read and approved the manuscript.

Acknowledgment

We thank the editor and reviewers for their comments and suggestions, which significantly improved the quality of this article. This work was supported in part by the National Science Foundation under grant No. DBI-1062362.