Abstract
Background
The development of effective environmental shotgun sequence binning methods remains an ongoing challenge in algorithmic analysis of metagenomic data. While previous methods have focused primarily on supervised learning involving extrinsic data, a firstprinciples statistical model combined with a selftraining fitting method has not yet been developed.
Results
We derive an unsupervised, maximumlikelihood formalism for clustering short sequences by their taxonomic origin on the basis of their kmer distributions. The formalism is implemented using a Markov Chain Monte Carlo approach in a kmer feature space. We introduce a space transformation that reduces the dimensionality of the feature space and a genomic fragment divergence measure that strongly correlates with the method's performance. Pairwise analysis of over 1000 completely sequenced genomes reveals that the vast majority of genomes have sufficient genomic fragment divergence to be amenable for binning using the present formalism. Using a highperformance implementation, the binner is able to classify fragments as short as 400 nt with accuracy over 90% in simulations of lowcomplexity communities of 2 to 10 species, given sufficient genomic fragment divergence. The method is available as an open source package called LikelyBin.
Conclusion
An unsupervised binning method based on statistical signatures of short environmental sequences is a viable standalone binning method for low complexity samples. For medium and high complexity samples, we discuss the possibility of combining the current method with other methods as part of an iterative process to enhance the resolving power of sorting reads into taxonomic and/or functional bins.
Background
Metagenomics, the study of the combined genomes of communities of organisms, is a rapidly expanding area of genome research. The field is driven by environmental shotgun sequencing (ESS), a technique of applying highthroughput genome sequencing to nonclonal DNA purified directly from an environmental sample. This removes the requirement to isolate and cultivate clonal cultures of each species, allowing an unprecedented broad view of microbial communities.
Thus far, environments such as acid mine drainage [1], Scottish soil [2], open ocean [3], termite gut [4], human gut [5], and neanderthal [6] have been sequenced, to name a few. Attention has been directed to bacterial and viral fractions of these communities, with eukaryotic metagenomics pioneered by projects such as the marine protist census [7]. Complexity of these communities varies greatly from 5 to several thousand identifiable bacterial species. These projects have uncovered vast amounts of previously unobserved genetic diversity [8,9]. For example, "deep sequencing" using 454 pyrosequencing suggests that possibly tens of thousands of species coexist in a single ml of seawater [10].
Given this wealth of genomic data it is becoming possible to make increasingly precise biological inferences regarding the structure and functioning of microbial communities [1113]. As but one example, the discovery of a novel proteorhodopsin gene was the first step in uncovering a previously unknown, yet apparently dominant, mechanism for phototrophy in the oceans [14]. Characterization of functional diversity is limited by our ability to classify sequences into distinct groups that reflect a desired taxonomic or functional resolution.
Shotgun metagenomic DNA is sequenced in fragments of 50 to 1000 nucleotides, then possibly assembled into longer sequences (contigs). Phylogenetic binning, the task of classifying these sequences into bins by taxonomic origin, then becomes critical to separate metagenomic data into coherent subsets plausibly belonging to separate organisms. This task is challenging due to the short length of available fragments. Bacterial communities of very high complexity, with thousands of species present, further complicate the task.
While methods such as 16S bacterial community censuses [15] and functional or sequencebased screening surveys are the forerunners of modern metagenomics, indiscriminate wholegenome shotgun sequencing may be the defining approach of the discipline today. This approach has recently generated vast amounts of data, facilitated by continual capacity increases and quality improvements at major sequencing centers and the emergence of cost effective very high throughput Next Generation sequencing (NGS) (454 pyrosequencing [16], Illumina [17] and SOLiD [18]). At the highest diversity levels, the reads may not be assembled at all due to the sparseness of even the highest throughput sequencing methods and the danger of chimeric assemblies, arising from sampling so many organisms at once, leaving the binner with raw reads. Binning methods therefore aim to be able to operate on very short read lengths provided by nextgeneration sequencing, although most, including the present approach, are only able to go down to 454 pyrosequencing read length (about 400 nt) and not to microread length (30 to 100 nt).
Classic approaches to phylogenetic determination of species identities from environmental sequences rely on identifying variants of highly conserved genes, like 16S rRNA or recA [19]. This approach is not applicable on a full metagenomic scale for two reasons: first, ribosomal or marker gene sequences comprise a small fraction of the bacterial genome, so most shotgun sequences do not contain them and cannot be classified this way; and second, organisms with identical or closely related 16S genes have been shown to exhibit variations in essential physiological functions [20]. Other approaches are broadly divided into sequence similarity based classifiers such as MEGAN [21], which rely on BLAST or other alignments, and sequence composition based classifiers, which rely on statistical patterns of oligonucleotide distributions. Many solutions integrate the task of phylogenetic assignment (labeling) together with that of binning per se (clustering) of genomic fragments. However, with unsupervised methods, like the one presented here, labeling is not possible as part of the algorithm and has to be performed by other means, like analyzing the correspondence of generated clusters to known phylogenies.
Sequence classification based on oligonucleotide distributions has been the basis for gene finding applications since the early 1990s. In 1995, Karlin and Burge [22] noted that dinucleotide distribution is relatively constant within genomes but varies between genomes. Since then, this property has been extensively studied and generalized to other oligonucleotide lengths [23]. With the advent of ESS, several binning methods have used oligonucleotide distributions of various orders to build supervised and semisupervised classifiers. These include PhyloPythia [24], CompostBin [25], and selforganizing map (SOM) based methods [2628].
Machine learningbased classification algorithms like those used for binning are categorized into supervised, semisupervised, and unsupervised classes. Supervised algorithms accept a training set of labeled data used to build their models, which are then applied to the query data. In case of binning, this training set consists of genomic sequences labeled according to the species they originate from. Semisupervised algorithms use both training set data and query data to build their models. Unsupervised algorithms use no training data and derive their models directly from the query input. While methods described above have achieved considerable success in classifying short anonymous genomic fragments, their supervised nature makes them reliant on previously sequenced data. For example, BLASTbased methods are completely dependent on the presence of sequences related to the query in the database. While semisupervised clustering methods can have significant generalizing power, their accuracy still depends on similarity of input data to their training set.
To our knowledge, two approaches to unsupervised metagenomic binning have been published. TETRA [29,30] explores the applications of kmer frequency statistics to metagenomic data. The authors state that their method is suitable as a "fingerprinting technique" for longer DNA fragments, though not as a generalpurpose binning method for singleread 454 pyrosequenced or Sanger fragments, and an application of methods including TETRA to binning of fosmidsized DNA is used in [31]. Abe et al. [26] used selforganizing maps (SOM) in combination with principal component analysis (PCA) on 1 and 10Kb fragments, and this method was evaluated and enhanced in [27] using growing selforganizing maps (GSOM), an extension of SOM, on 8 and 10kb fragments.
Given the apparent diversity of metagenomic samples and the significant fraction of the full bacterial phylogeny with no sequenced representatives [20,32], as well as possible undiscovered diversity of the tree of life, binning methods must perform well on previously unseen data. Semisupervised methods may be able to extrapolate on this data, but if not, unsupervised clustering will be a necessary part of a combinedmethod binning approach. We present LikelyBin, a new statistical approach to unsupervised classification of metagenomic reads based on an explicit likelihood model of short genomic fragments [33]. The rest of this paper is organized as follows. The Methods section introduces a formal definition of the binning problem, the application of the Markov Chain Monte Carlo (MCMC) formalism, and the feature space and likelihood model used. We discuss numerical methods used in the implementation, including a novel coordinate transformation which achieves dimension reduction for the feature space of kmer frequencies, and the genomic fragment divergence measure D_{n}, a novel statistical measure we developed for performance evaluation of our algorithm. The Results section presents performance evaluations of our method on mixtures of 2 to 10 species compiled from completed genomes available in GenBank, with fragment lengths starting at 400 nt, as well as accuracy trends over different fragment lengths and mixing ratios. We also present results on the FAMeS [34] dataset and compare the current method to a semisupervised binning method based on kmer distributions [25]. The Conclusion section explains the applicability of our method, its speed and availability, as well as important future directions for improvement.
Methods
The binning problem
We state the problem as follows: given a collection of N short sequence reads from M complete genomes, how can we predict which sequences derive from the same genome? In our model, we represent a genome as a string of characters deriving from a stochastic model with parameters Θ, referred to here as a master distribution. We make the simplifying assumption that the oligonucleotide distribution is uniform across the bacterial chromosome. This assumption is not satisfied biologically; genecoding, RNAcoding, and noncoding regions, leading and lagging strands of replication, and genomic islands resulting from horizontal gene transfer can all exhibit distinct oligonucleotide distributions. Accurate classification of these regions in metagenomic fragments is an open problem which requires complex statistical models that we have yet to incorporate into our framework, and which are targets for subsequent model development. Nonetheless we have found that clustering of short reads using the above assumption is sufficiently accurate for use in low complexity metagenome samples.
Given this assumption of statistical homogeneity, we model a collection of sequences from a single genome as realizations of a single stochastic process. Similarly, we model a collection of sequences from multiple genomes as realizations of multiple stochastic processes, one per genome, each with its own master distribution. We are interested in determining which sequences in a metagenomic survey are likely to have been drawn from the same genome and, consequently, the statistical distributions of oligonucleotides within each of the master distributions. If the number of master distributions is unknown, then we must include some prior estimate to close the model. Thus, even in cases where due to insufficient coverage it is impossible to assemble disparate segments of a consensus genome together, a binning algorithm should still be able to group reads together based on their statistical distribution of oligonucleotides.
The simplest model of a genome would be a random collection of letters, A, T, C, and G. The master distribution of a single genome can then be represented as a single probability, p_{A}, denoting the fraction of As in the genome. Base complementarity requires p_{A }= p_{T }and p_{C }= 1/2  p_{A }= p_{G}. A more complex representation would be to assume that genomes are random collections of kmers. When k = 1, each nucleotide is independent of the previous. When k = 2, the genomes are random collections of dimers and so on. However, when k ≥ 2, inherent symmetries are present in this representation since all but the first letters of the current kmer are also contained in the next kmer. In a metagenomic dataset, each short fragment derives from a single master distribution, θ_{i}, which is represented a fraction f_{i }of times. How then can we infer the most likely Θ ≡ (θ_{1}, θ_{2}, ..., θ_{M}) and F ≡ (f_{1}, f_{2}, ..., f_{M}) given a set of N sequences S ≡ (s_{1}, s_{2}, ..., s_{N})? To do so, we must calculate the likelihood ℒ(SΘ, F) of observing the sequences S given the parameters Θ and F. Then, we must estimate the values of Θ and F that maximize the likelihood ℒ. Below, we demonstrate the use of a MCMC algorithm to perform this task.
MCMC framework
FIGURE 1 We are interested in finding the values of Θ and F that maximize the likelihood, ℒ. The MCMC approach has been described in detail elsewhere [35]. Given an initial parameter setting and a metagenomic data set, we implement the following MetropolisHastings algorithm to MCMC maximum likelihood estimation: (i) Determine the likelihood of the dataset ℒ(Θ, FS); (ii) Choose some Φ = Θ + dΘ, and G = F + dF and determine its likelihood, ℒ' (Φ, G), such that both Φ and G exist in the same highdimensional simplex as Θ and F respectively; (iii) Accept the new value given a probability 1 if ℒ' (Φ, G) > ℒ'(Θ, F) and with probability ℒ' (Φ, G)/ℒ(Θ, F) otherwise; (iv) Repeat, and after a burnin period determine the values and which maximize ℒ(SΘ, F). We can then utilize the resulting model of sequence parameters to classify sequences and estimate the most likely oligonucleotide distribution of each of the originating master distributions. The iterative process, together with key stages of the entire binning algorithm, is illustrated in Figure 1. Some technical details necessary for the implementation follow.
Figure 1. Binning diagram. Diagram of binning data pathways and main MCMC iteration loop.
Likelihood model
Consider a nucleotide sequence s = c_{1 }c_{2 }c_{3}, ..., c_{ℓ}. We would like to know the probability of observing such a sequence given some underlying model. We assume that our sequence is selected from broken pieces of doublestranded DNA, and thus that complementary nucleotide sequences have the same probability: i.e., L(s) = L(s'), where , and is the nucleotide complementary to the nucleotide c_{i}. We assume that the probability of our sequence is determined by a set of 2^{k }kmer probabilities .
That is, we write:
Assuming we know probabilities for all of our kmers, we have probabilities for k  1mers as marginals.
Thus we can write:
As an example, the probability of a sequence given a set of known dimer frequencies is:
Note that we assume the marginal probabilities are well defined: i.e., that we get the same marginal probability if we collapse a kmer to a k  1mer by summing over the first, or the last, nucleotide. The likelihood of observing N sequences given M master distributions is
where P_{m}(s_{i}) is the probability of generating the ith sequence given the mth master distribution.
A simple example of likelihood computation according to the described model is given in the Appendix.
The space of kmer frequencies
Given the assumption of uniformity of the kmer (oligonucleotide) distribution across each genome, we can impose three kinds of constraints on the kmer frequency space. This space is a subspace of , subject to three kinds of constraints: all kmer frequencies sum to 1, e.g.
each kmer has the same frequency as its complement; and all marginal probabilities are consistent over all margins, e.g.
We then derive a transformation of the original kmer frequency vector, x = [p_{A}, p_{T}, p_{G}, p_{C}, p_{AA}, p_{AT}, p_{AG}, p_{AC}, p_{TA}, ...], into the independent coordinate space. To generalize and automate the process, we perform it for each case from 1mers (4 dimensions before removing redundancies) to 5mers (1364 dimensions before removing redundancies) by generating all equations governing the constraints above. We use the notation [Ab] to denote the matrices of the constraint equation Ax = b by generating rows for each constraint type. For example, for k = 2, we write the summation, complementarity and marginality constraints as follows:
We find the nullspace of the resulting matrix A and use it to perform the transformation. The resulting number of independent dimensions is shown in Table 1. The MCMC simulation then performs the search in the independent coordinate space. For k > 6, the matrix A becomes too big to compute its nullspace using a nonparallelized algorithm. Even for k = 6, the number of independent dimensions is so large that the MCMC simulation takes an intractable amount of time. Therefore, we only generalize our algorithm up to k = 5.
Table 1. Redundancies in oligonucleotide dimension space
Initial conditions
The choice of initial conditions can dramatically alter the speed of convergence of a MCMC solver. We used the same initial conditions for comparison of model results, specified by the frequencies of kmers in the entire dataset provided as input (i.e., the weighted average of all sources' contributions to the dataset). Other possibilities, implemented but not chosen as the default, include taking uniformly distributed frequencies, randomizing the starting condition, or using principal components analysis with Kmeans clustering to obtain initial cluster centroids. We verified that convergence, when it did occur, did not depend sensitively on initial conditions (Additional files 1 and 2).
Additional file 1. Convergence dynamics. Figure 1: Convergence dynamics for good accuracy, Mycoplasma capricolum subsp. capricolum ATCC 27343 vs. Campylobacter jejuni subsp. jejuni 81176 (D_{3 }= 2.8). A single MCMC simulation was completed for this pair of genomes as described in Methods. kmer order 3 model was used with 30000 steps, and expected nucleotide frequencies in accepted models were plotted over time for all independent mono and dinucleotides in the model. Two starting conditions were compared: uniform initial frequencies (solid line) and frequencies at dataset mean (dashed line). Dotted lines indicate true average frequencies in the constituent species' fragment datasets. Convergence was observed to be substantially the same, demonstrating robustness of the algorithm to initial starting conditions. Final model accuracy was ≈ 95% in both cases.
Format: PDF Size: 156KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 2. Convergence dynamics. Figure 2: Convergence dynamics for poor accuracy, Granulibacter bethesdensis CGDNIH1 vs. Gluconobacter oxydans 621H (D_{3 }= 0.45). Details are identical to Additional file 1, but final model accuracy was ≈ 60% in both cases.
Format: PDF Size: 170KB Download file
This file can be viewed with: Adobe Acrobat Reader
Finding the maximum likelihood model
Once the predefined number of timesteps has elapsed, the model with the largest log likelihood is selected. Note that the MCMC framework is amenable to a Bayesian approach, which we implemented as an alternative. Once the equilibrium state has been reached we calculate the autocorrelation of frequencies and estimate a window over which frequencies show no significant autocorrelations. Given a specified prior distribution p(Θ, F) for the master distribution and frequencies, the MetropolisHastings approach will converge to the true posterior distribution of π (Θ, FS) ∝ ℒ (SΘ, F) p(Θ, F). In our case we used an uninformed prior distribution so long as positivity and all other specified constraints among kmer probabilities were preserved. We then sample from the equilibrium state to find π (Θ, F). Averages of master distributions in the posterior distribution also preserve the constraint conditions because of the linearity of the averaging operator. Accuracy of the model was similar whether using the maximum likelihood model or the average of the posterior distribution (Additional file 3). Full posterior distributions of kmer models could be used to estimate posterior distributions of binning accuracy.
Additional file 3. Accuracydivergence dependencies for Bayesian sampling. Figure 3: Pairs and triples of genomes were sampled randomly from a set of 1055 completed bacterial chromosomes, and experiments were conducted using Bayesian posterior distribution sampling on the stationary distribution of the MCMC simulation. The results were found to not be significantly different from those for maximum likelihood sampling (Figure 4).
Format: PDF Size: 139KB Download file
This file can be viewed with: Adobe Acrobat Reader
Numerical details
Precision
Due to precision limitations of the machine double precision floating point format, the model likelihood calculation is performed in log space. Denote the old model under consideration as M = {M_{1}, M_{2}, ... M_{m}}, and the new (perturbed) model as . The log likelihood of a single model is
and note that the innermost fraction contains higherorder terms when working with Markov chain orders higher than 2. The innermost product term is a product of on the order of 1000 terms of magnitude ≈ 1/4. However, 1/4^{n }exceeds double floating point precision at n ≈ 540. To prevent underflow, we find the P_{m}(s_{i}) of highest magnitude and divide the inner sum by it. This allows log space evaluation of the highest magnitude term and ensures that any terms whose precision is lost are at least ≈ 1e300 times smaller. The model log likelihood ratio is then . If this term exceeds 0, the new model is more likely to be observed than the old.
The MCMC iteration loop was implemented with the MetropolisHastings criterion. From an initial model, a perturbed model M_{N }is generated. The new model's probability is evaluated as above and compared to that of the currently selected model M_{C}. If higher, the new model is selected; otherwise, the new model is selected with probability p = exp (log ℒ(M_{N}S)  log ℒ(M_{C}S)). The step is repeated N times (N is fixed at 40000 for the experiments described). Each selected model is stored in a model record for later sampling.
Computing the perturbation
The statistical model consists of submodels for each source. The perturbation step is performed for every submodel independently. Every submodel consists of a complete kmer frequency vector, {p_{A}, p_{T}, p_{G}, p_{C}, p_{AA}...}. It is perturbed by scaling each vector of the basis matrix A by a random number r_{i }drawn from a Gaussian distribution with mean 0 and constant variance (computed as described below), then adding each scaled vector in succession to the frequency vector. The basis matrix A is precomputed for each kmer model order from 2 to 5 and supplied with the program. The computation is performed by generating a system of equations representing the base complementarity, marginal, and summation constraints and using the standard nullspace algorithm supplied with GNU Octave.
The perturbation step variance must be calibrated independently for each dataset. An excessive variance will result in too many suboptimal perturbations as well as perturbations placing the frequency vector outside the unit hypercube (those perturbations are rejected). A variance that is too small can result in an inability to escape local maxima in the model search space and an inability to reach the stationary phase before the predetermined number of steps is taken. To calibrate the variance, the MCMC iteration is started independently for a reduced number of steps, and different variances ranging from 1e  3 down to 1e  8 are tried. With each trial, the number of new model acceptances is recorded. We consider the fraction . Once the variance yielding f closest to 0.234 is found (a heuristic level of acceptances that has become standard [35], p. 504), we use this variance for the main run. Convergence to the stationary phase occurred after 40,000 iterations in all cases of interest.
Computing the prediction
To derive the final model prediction, the model with the overall maximum log likelihood is selected. The full MCMC simulation is repeated a selected number of times (to increase performance, the classifier was run in parallel on an 8core machine; each core was assigned to run one MCMC simulation for a total of 8 restarts). Final model predictions are compared between different runs, and the best overall prediction is selected according to its model likelihood (described above).
The classifier then assigns a putative source to each sequence fragment it was initially queried with. For every fragment, its likelihood according to each submodel in the final predicted model is computed, and the submodel supplying the highest likelihood is selected. Since the sources are anonymous, they are referred to simply by indices from 1 to n corresponding to each submodel's index in the final predicted model. Figure 2 illustrates the log likelihood comparison process for all fragments in a given dataset, according to the best model selected as a result of this process.
Figure 2. Fragment likelihood separation. Log likelihood values of fragments from pairs of species according to models fitted by the classifier. Points' positions on the two axes represent log likelihoods of each fragment according to the first and second model, respectively. A, Helicobacter acinonychis vs. Vibrio fischeri, good separation (98% accuracy, D = 1.31); B, Streptococcus pneumoniae vs. Streptococcus pyogenes, poor separation (57% accuracy, D = 0.22). Fragment length was 800 in both cases. 500 fragments per species were supplied.
Testing methodology
Simulated metagenomic datasets were created by selecting two or more genomic sequences as source DNA. Sequence fragments were selected at random positions within source sequences; overlaps were allowed to occur. Fragment size was fixed for all fragments for each experiment. The total number of fragments per source was selected either according to overall source length or at specified frequency ratios (e.g., 2:1, 10:1:1). The number of sources in each testing dataset was supplied to the classifier.
Accuracy of the classifier is calculated as follows. Every possible matching of source genomic sequence names to classifier output indices is considered, e.g. {seq1 → 1, seq2 → 2}, {seq1 → 2, seq2 → 1}. The number of correct assignments made by the classifier is then counted for each matching and the matching with the highest number of correct assignments is selected. Accuracy is then given as . To evaluate separability of the randomly generated datasets according to the classifier's model, we also define and compute the genomic fragment divergence between two sources' kmer distributions. First, we compute the mean, μ, and standard deviation, σ, of each kmer frequency for each source across fragments originating from that source. The genomic fragment divergence of kmer order n is then given by
Generalizing to M species, let {S} = {S_{1}, S_{2}, ... S_{m}}. Then we define.
Figure 3 illustrates the distribution of genomic fragment divergences between completed bacterial genomes. A different formula for intergenomic difference, called the average absolute dinucleotide relative abundance difference is [36]: , where . This formula encompasses dinucleotides and pairwise comparisons of entire sequences only, and uses dimer frequency biases instead of absolute frequencies and their deviations in a hierarchical fashion.
Figure 3. Pairwise genome divergence distributions. Cumulative distributions of pairwise divergences (D_{n}) between all completed bacterial genomes retrieved from GenBank. Fragment lengths of 400 to 1000 were used to compute D_{n}. Divergences based on kmer order 2, 3, and 4 are represented in panels A, B, and C, respectively. The vertical cutoff line at D = 1 indicates an empirical boundary above which the binning algorithm works with high accuracy. For fragment length 400, over 80% of all randomly selected pairs are observed to have divergences above this line.
Results and Discussion
The accuracy and applicability of the present method in binning short sequence fragments from low complexity communities (210 species) was systematically analyzed using a variety of species, varying fragment lengths, and varying ratios of fragment representation.
First, a set of 1055 completed bacterial chromosomes was retrieved from GenBank. This set was randomly sampled for sets of 2, 3, 5, 10 genomes at a time, representative of various genomic fragment kmer distribution divergences. Binning results for nearly 1800 simulated communities comprised of 2 or 3 genomes at a time are summarized in the top panels of Figure 4. There is a strong positive correlation between genomic fragment divergence and average performance. Classification accuracy was consistently above 85% for fragment divergences when D_{3 }> 2. Results for Bayesian posterior distribution sampling were not substantially different (Additional file 3).
Figure 4. Algorithm accuracy vs. fragment divergence. Sets of 2, 3, 5, 10 genomes were sampled randomly from a set of 1055 completed bacterial chromosomes, and experiments were conducted as described in Materials and Methods. Trials were conducted with 400 and 800nt long fragments. Classification accuracy for the majority of genome pairs above overall divergence 1 is in the high performance range (accuracy > 0.9), while above divergence 3 accuracy is above 0.9 for over 95% of the trials. Results for Bayesian posterior distribution sampling were not significantly different (Additional file 3).
Accuracy of binning simulated communities of 510 species was consistent with the results from 23 species communities. The accuracy of binning was strongly positively correlated with genomic fragment divergence with accuracies consistently above 85% for D_{3 }> 2. Note that accurate binning was possible when fragment length was either L = 400 nt or L = 800 nt (middle and bottom panels of Figure 4 respectively). For 5 and 10 species, a total of 1815 simulated communities were tested in the L = 400 nt case and a total of 425 simulated communities were tested in the L = 800 nt case.
Next, we evaluated the robustness of our binning method to changes in fragment length and to changes in fragment ratios using five distinct genome pairs from the preceding experiment (see Table 2). The pairs were selected based on their relatively low genomic fragment divergence, D_{3 }≈ 1, given a fragment length of L = 400 nt. Binning results on these 2species tests were evaluated using sequence fragments whose lengths ranged from 40 to 1000 nt. The results are shown in Figure 5. Performance stabilizes close to its optimal value at fragment length 400. Again, results for Bayesian posterior distribution sampling were not substantially different than the maximum likelihood approach (Table 3).
Figure 5. Algorithm accuracy vs. fragment length. Fragment lengthdependent performance on 2species datasets. Same trials as in Figure 4 were performed on a subset of pairs of genomes while varying simulated fragment size from 40 to 1000. The species' characteristics are given in Table 2.
Table 2. Summary of species' characteristics, including all independent monomer and dimer frequencies, in the subset of trials on 5 pairs of genomes performed in Figures 5 and 6.
Table 3. Summary of algorithm performance on JGI FAMeS data.
For the same five pairs as in Figure 5, we performed a test of fragment ratiodependent contributions to accuracy (Figure 6). The binner successfully classifies mixtures with species' fractional content of 20% and above. Although robust to moderate variation in fragment ratios, these results indicate that binning relatively rare species may require modifications to the present likelihood formalism.
Figure 6. Algorithm accuracy vs. source ratio. Fragment ratiodependent performance on 2species datasets. Same trials as in Figure 4 were performed on a subset of pairs of genomes while varying species' contributions to the dataset from 2% to 98%. Fragment sizes were fixed at 400 nt (A) and 1000 nt (B). The species' characteristics are given in Table 2.
We also tested our method using subsets of the JGI FAMeS [34,37] simulated lowcomplexity dataset (simLC). We took 5 genomic sources at a time, using 500 fragments, each of length L = 400 nt. The accuracy results for binning these simulated low complexity communities are summarized in Table 4. The binning method has approximately 80% accuracy for a fivespecies community despite the genomic divergence, D_{3}, being approximately 1.5 (an indicator of a community with similar kmer distributions).
Table 4. Performance comparison of LikelyBin and CompostBin on pairs of genomes analyzed in Figures 5, 6, Table 2.
We also compared our method to CompostBin [25], a semisupervised algorithm that utilizes a PCA method to bin fragments based on their kmer distributions (Table 5). We performed comparisons on pairs of genomes with fragment divergence D_{3 }≈ 1 using the same dataset analyzed in Figures 5, 6 and Table 2. The results indicated that our method performs on par with or better than CompostBin, even though CompostBin required a fraction of input fragments to be labeled to initialize its clustering algorithm. Run time and memory performance was comparable between the two methods.
Table 5. The method of sampling the posterior distribution of the MCMC chain by averaging random accepted models from the steady state was compared to the method of selecting the model with the overall maximum log likelihood.
The algorithm is implemented in portable Perl and C code that can be compiled and run on any platform supporting a Perl interpreter. Both memory use and run time scale linearly with the number of fragments and species, and sublinearly with fragment length. Memory complexity scales quadratically with the number of dimensions in the search space, or exponentially with k (as shown in Table 1). We selected k = 3 as the default kmer length, with userdefined options for 2, 4, or 5 available. We have not yet formalized convergence time performance as a function of k. In practice, a 3species dataset of 1000 fragments per species, with kmer order set to 3, takes approximately 2 minutes to run on an Intel Core 2 Duoclass processor.
Conclusion
We developed an unsupervised, maximum likelihood approach to the binning problem  called LikelyBin. LikelyBin uses a MCMC framework to estimate the set of master distributions and relative frequencies most likely to give rise to an observed collection of short reads. The likelihood approach is based on kmer distributions, for which we developed an index of separability of any pair of genomes, which we termed the genomic fragment divergence measure, D_{n}. We found that the vast majority of genomes have sufficient divergence to be distinguished using the present method (Figure 3).
Using a highperformance implementation, LikelyBin can be used to cluster sequences with high accuracy (in some cases, > 95%) even when the mononucleotide content of the original genomes is essentially identical (Figure 4). The method does as well or better than a comparable semisupervised method (CompostBin [25]) that also uses kmer distributions as the statistical basis for binning (Table 5).
Performance of LikelyBin is consistently good for synthesized lowcomplexity datasets (210 species) with fragments of length as low as 400 nt, which corresponds to the characteristic singleread length of a 454 pyrosequencing FLX machine. Microread sequencing technologies such as Solexa and SOLiD are currently out of reach of any nonalignmentbased binning method when applied to single reads, which range from 30 to 50 base pairs with these technologies.
The unsupervised nature of our approach makes it potentially useful for classifying mixtures of novel sequences for which supervised learningbased methods may have difficulties. A future direction for our work is to combine our statistical formalism with alignment and supervised compositionbased models. For example, we could develop a feature selection framework that would transform the input fragments' features such as kmer statistics, coding frame information, and variablelength motifs into a lowerdimensional space. We could then feed these features to an unsupervised MCMCbased classifier in tandem with an alignmentbased classifier that can partially label fragments based on known taxonomic information, then compare and combine their results.
A number of challenges remain to broaden the scope and applicability of the current method. At present, our method is scalable for kmer length from k = 2 to k = 5. We intend to expand the method's ability to capture longer motif frequencies by using dimension transformation or feature selection in a future work. Intragenomic heterogeneity of oligonucleotide distributions is another topic that is yet to be addressed. A confidence measure that serves as a performance selfcheck is already available as part of our method but we have not incorporated it into the program's output yet.
Further, applying the current method in an environmental context requires an estimation of the number of bins. The problem of identifying the necessary number of distinct models, or groups thereof, to represent all components of a given genome, is related to the problem of identifying the number of distinct genomes in the mixture. A combination of jump diffusion and grouped models is our currently planned solution. In this respect, the use of phylogenetic markers to estimate the number of bins will provide important prior information.
In summary, the unsupervised method we proposed is based on a maximum likelihood formalism and can bin short fragments (L = 400 nt) of low complexity communities (210 species) with high accuracy (in some cases, > 95%) given sufficient genomic divergence. The maximum likelihood formalism and its MCMC implementation make the current approach amenable to extension and incorporation into other packages. The MCMC binner application is provided as an opensource downloadable package, LikelyBin [33], that can be installed on any platform that supports Perl and C and is fully automated to facilitiate use in genome processing pipelines. Version 0.1 of the source code is provided in Additional files 4.
Additional file 4. LikelyBin version 0.1 archive. This archive contains the source and executable files for the binner application.
Format: ZIP Size: 3.9MB Download file
Authors' contributions
AK developed code, performed all statistical analyses, and wrote the manuscript. SB developed code and performed preliminary statistical analyses. JD developed the mathematical method and supervised the statistical analysis. JSW developed the mathematical method, supervised the computational and statistical analysis, and wrote the manuscript. All authors read and approved the final manuscript.
Appendix
Example application of likelihood model
Suppose we have two source genomes, G_{1 }and G_{2}, with two fragments from each: G_{1 }→ {ATGTTA, TGTAAT}, G_{2 }→ {CCTGTC, AGGCCTC}. We wish to evaluate the likelihood of observing these sequences according to a dimer model of 2 sources, M = {S_{1}, S_{2}}, which we have generated. Assume the model's source frequency vector is F = [0.6, 0.4], its monomer frequencies are
{S_{1 }: {p_{A }= 0.3, p_{T }= 0.3, p_{G }= 0.2, p_{C }= 0.2}, S_{2 }: {p_{A }= 0.2, p_{T }= 0.2, p_{G }= 0.3, p_{C }= 0.3}} and its dimer frequencies are
S_{1 }: {p_{AA }= 0.09, p_{AT }= 0.09, p_{AG }= 0.06, p_{AC }= 0.06, p_{TA }= 0.07, p_{TT }= 0.09, p_{TG }= 0.06, p_{TC }= 0.08 p_{GA }= 0.08, p_{GT }= 0.06, p_{GG }= 0.04, p_{GC }= 0.02p_{CA }= 0.06, p_{CT }= 0.06, p_{CG }= 0.04, p_{CC }= 0.04}, S_{2 }: {p_{AA }= 0.02, p_{AT }= 0.04, p_{AG }= 0.08, p_{AC }= 0.06, p_{TA }= 0.04, p_{TT }= 0.02, p_{TG }= 0.06, p_{TC }= 0.08, p_{GA }= 0.08, p_{GT }= 0.06, p_{GG }= 0.07, p_{GC }= 0.09p_{CA }= 0.06, p_{CT }= 0.08, p_{CG }= 0.09, p_{CC }= 0.07}}
Then the likelihoods of observing the first fragment, ATGTTA, given master distributions S_{1 }and S_{2}, respectively, are
where superscripts S_{1 }and S_{2 }denote the master distribution. Similarly,
The overall posterior likelihood of the model is then
Acknowledgements
We are pleased to acknowledge the support of the Defense Advanced Research Projects Agency under grant HR00110510057. Joshua S. Weitz, Ph.D., holds a Career Award at the Scientific Interface from the Burroughs Wellcome Fund. The authors would like to thank Jonathan Eisen for many inspiring discussions. The authors would also like to thank Amol Shetty, Michael RaghibMoreno, Sourav Chatterji, Luca Giuggoli, and Simon Levin for their suggestions on a preliminary version of the present model, and thank three anonymous reviewers for their helpful suggestions on the paper. The authors are grateful to Sourav Chatterji, Jonathan Eisen, and Ichitaro Yamazaki for their help in the utilization of CompostBin.
References

Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF: Community structure and metabolism through reconstruction of microbial genomes from the environment.
Nature 2004, 428(6978):3743. PubMed Abstract  Publisher Full Text

Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM: Comparative Metagenomics of Microbial Communities.
Science 2005, 308(5721):554557. PubMed Abstract  Publisher Full Text

Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, Wu D, Eisen JA, Hoffman JM, Remington K, Beeson K, Tran B, Smith H, BadenTillson H, Stewart C, Thorpe J, Freeman J, AndrewsPfannkoch C, Venter JE, Li K, Kravitz S, Heidelberg JF, Utterback T, Rogers YH, Falcón LI, Souza V, BonillaRosso G, Eguiarte LE, Karl DM, Sathyendranath S, Platt T, Bermingham E, Gallardo V, TamayoCastillo G, Ferrari MR, Strausberg RL, Nealson K, Friedman R, Frazier M, Venter CJ: The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific.
PLoS Biology 2007, 5(3):e77. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Warnecke F, Luginbühl P, Ivanova N, Ghassemian M, Richardson TH, Stege JT, Cayouette M, Mchardy AC, Djordjevic G, Aboushadi N, Sorek R, Tringe SG, Podar M, Martin HG, Kunin V, Dalevi D, Madejska J, Kirton E, Platt D, Szeto E, Salamov A, Barry K, Mikhailova N, Kyrpides NC, Matson EG, Ottesen EA, Zhang X, Hernández M, Murillo C, Acosta LG, Rigoutsos I, Tamayo G, Green BD, Chang C, Rubin EM, Mathur EJ, Robertson DE, Hugenholtz P, Leadbetter JR: Metagenomic and functional analysis of hindgut microbiota of a woodfeeding higher termite.
Nature 2007, 450(7169):560565. PubMed Abstract  Publisher Full Text

Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, Gordon JI, Relman DA, FraserLiggett CM, Nelson KE: Metagenomic Analysis of the Human Distal Gut Microbiome.
Science 2006, 312(5778):13551359. PubMed Abstract  Publisher Full Text

Noonan JP, Coop G, Kudaravalli S, Smith D, Krause J, Alessi J, Chen F, Platt D, Pääbo S, Pritchard JK, Rubin EM: Sequencing and analysis of Neanderthal genomic DNA.
Science 2006, 314(5802):11131118. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Not F, Gausling R, Azam F, Heidelberg JF, Worden AZ: Vertical distribution of picoeukaryotic diversity in the Sargasso Sea.
Environmental Microbiology 2007, 9(5):12331252. PubMed Abstract  Publisher Full Text

Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, Carlson C, Chan AM, Haynes M, Kelley S, Liu H, Mahaffy JM, Mueller JE, Nulton J, Olson R, Parsons R, Rayhawk S, Suttle CA, Rohwer F: The marine viromes of four oceanic regions.
PLoS Biol 2006., 4(11) PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Huse SM, Dethlefsen L, Huber JA, Welch DM, Relman DA, Sogin ML: Exploring Microbial Diversity and Taxonomy Using SSU rRNA Hypervariable Tag Sequencing.
PLoS Genet 2008, 4(11):e1000255. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, Neal PR, Arrieta JM, Herndl GJ: Microbial diversity in the deep sea and the underexplored " rare biosphere".
Proceedings of the National Academy of Sciences 2006, 103(32):1211512120. Publisher Full Text

Handelsman J: Metagenomics: application of genomics to uncultured microorganisms.
Microbiol Mol Biol Rev 2004, 68(4):669685. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, Remington K, Eisen JA, Heidelberg KB, Manning G, Li W, Jaroszewski L, Cieplak P, Miller CS, Li H, Mashiyama STT, Joachimiak MP, van Belle C, Chandonia JM, Soergel DA, Zhai Y, Natarajan K, Lee S, Raphael BJ, Bafna V, Friedman R, Brenner SE, Godzik A, Eisenberg D, Dixon JE, Taylor SS, Strausberg RL, Frazier M, Venter JC: The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families.
PLoS Biol 2007., 5(3) Publisher Full Text

Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, Brulc JM, Furlan M, Desnues C, Haynes M, Li L, McDaniel L, Moran MAA, Nelson KE, Nilsson C, Olson R, Paul J, Brito BRR, Ruan Y, Swan BK, Stevens R, Valentine DL, Thurber RVV, Wegley L, White BA, Rohwer F: Functional metagenomic profiling of nine biomes.
Nature 2008, 452(7187):629632. PubMed Abstract  Publisher Full Text

Béjà O, Spudich EN, Spudich JL, Leclerc M, DeLong EF: Proteorhodopsin phototrophy in the ocean.
Nature 2001, 411(6839):786789. PubMed Abstract  Publisher Full Text

Muyzer G, de Waal EC, Uitterlinden AG: Profiling of complex microbial populations by denaturing gradient gel electrophoresis analysis of polymerase chain reactionamplified genes coding for 16S rRNA.
Appl Environ Microbiol 1993, 59(3):695700. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer MLI, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, Mcdade KE, Mckenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM: Genome sequencing in microfabricated highdensity picolitre reactors.
Nature 2005, 437(7057):376380. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bentley DR: Wholegenome resequencing.
Current Opinion in Genetics & Development 2006, 16(6):545552. PubMed Abstract  Publisher Full Text

Shendure J, Porreca GJ, Reppas NB, Lin X, Mccutcheon JP, Rosenbaum AM, Wang MD, Zhang K, Mitra RD, Church GM: Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome.
Science 2005, 309(5741):17281732. PubMed Abstract  Publisher Full Text

Lane DJ, Pace B, Olsen GJ, Stahl DA, Sogin ML, Pace NR: Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses.
Proceedings of the National Academy of Sciences 1985, 82(20):69556959. Publisher Full Text

Ward BB: How many species of prokaryotes are there?
Proc Natl Acad Sci USA 2002, 99(16):1023410236. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Huson DH, Auch AF, Qi J, Schuster SC: MEGAN analysis of metagenomic data.
Genome Res 2007, 17(3):377386. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kariin S, Burge C: Dinucleotide relative abundance extremes: a genomic signature.
Trends in Genetics 1995, 11(7):283290. PubMed Abstract  Publisher Full Text

Deschavanne PJ, Giron A, Vilain J, Fagot G, Fertil B: Genomic signature: characterization and classification of species assessed by chaos game representation of sequences.
Mol Biol Evol 1999, 16(10):13911399. PubMed Abstract  Publisher Full Text

Mchardy AC, Martín HG, Tsirigos A, Hugenholtz P, Rigoutsos I: Accurate phylogenetic classification of variablelength DNA fragments.
Nature Methods 2006, 4:6372. PubMed Abstract  Publisher Full Text

Chatterji S, Yamazaki I, Bai Z, Eisen J: CompostBin: A DNA compositionbased algorithm for binning environmental shotgun reads. In Research in Computational Molecular Biology, 12th Annual International Conference, RECOMB 2008, Singapore, March 30  April 2, 2008. Proceedings, Lecture Notes in Computer Science. Volume 4955. Springer; 2008.

Abe T, Kanaya S, Kinouchi M, Ichiba Y, Kozuki T, Ikemura T: Informatics for unveiling hidden genome signatures.
Genome research 2003, 13(4):693702. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chan CK, Hsu AL, Tang SL, Halgamuge SK: Using growing selforganising maps to improve the binning process in environmental wholegenome shotgun sequencing.
Journal of biomedicine & biotechnology 2008, 2008:513701. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chan CKK, Hsu AL, Halgamuge SK, Tang SL: Binning sequences using very sparse labels within a metagenome.
BMC Bioinformatics 2008, 9:215. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Teeling H, Meyerdierks A, Bauer M, Amann R, Glöckner FO: Application of tetranucleotide frequencies for the assignment of genomic fragments.
Environ Microbiol 2004, 6(9):938947. PubMed Abstract  Publisher Full Text

Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner FO: TETRA: a webservice and a standalone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences.
BMC Bioinformatics 2004, 5:163. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Woyke T, Teeling H, Ivanova NN, Huntemann M, Richter M, Gloeckner FO, Boffelli D, Anderson IJ, Barry KW, Shapiro HJ, Szeto E, Kyrpides NC, Mussmann M, Amann R, Bergin C, Ruehland C, Rubin EM, Dubilier N: Symbiosis insights through metagenomic analysis of a microbial consortium.
Nature 2006, 443(7114):950955. PubMed Abstract  Publisher Full Text

A Genomic Encyclopedia of Bacteria and Archaea (GEBA) [http://www.jgi.doe.gov/programs/GEBA/index.html] webcite

LikelyBin webpage [http://ecotheory.biology.gatech.edu/likelybin] webcite

Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy AC, Rigoutsos I, Salamov A, Korzeniewski F, Land M, Lapidus A, Grigoriev I, Richardson P, Hugenholtz P, Kyrpides NC: Use of simulated data sets to evaluate the fidelity of metagenomic processing methods.
Nature methods 2007, 4(6):495500. PubMed Abstract  Publisher Full Text

Sorensen D, Gianola D: Likelihood, Bayesian and MCMC Methods in Quantitative Genetics. Springer; 2007.

Campbell A, Mrázek J, Karlin S: Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA.
Proceedings of the National Academy of Sciences of the United States of America 1999, 96(16):91849189. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

FAMeS: Fidelity of Analysis of Metagenomic Samples [http://fames.jgipsf.org/] webcite