INRIA Rennes - Bretagne Atlantique/IRISA, EPI GenScale, Rennes, France

ENS Cachan/IRISA, EPI GenScale, Rennes, France

Abstract

Background

Nowadays, metagenomic sample analyses are mainly achieved by comparing them with

Methods

This work introduces Compareads, a

Results

We show that Compareads retrieves biological information while scaling to huge datasets. Its time and memory footprint make Compareads usable on read sets each composed of more than 100 million Illumina reads, within a few hours and using 4 GB of memory, and thus usable on today's personal computers.

Conclusion

Using a new data structure, Compareads is a practical solution for comparing

Introduction

The past five years have seen the arrival of High Throughput Sequencing (HTS), also known as Next-Generation Sequencing (NGS). These technologies drastically lowered sequencing costs and increased sequencing throughput. They radically changed molecular biology and computational biology, as data generation is no longer a bottleneck. In fact, nowadays a major challenge is the analysis and interpretation of sequencing data

Metagenomics, also known as "environmental genomics", provides an alternative to traditional single-genome studies for exploring the microbial world. Most microorganisms (up to 99% of Bacteria

HTS technologies provide sequence fragments (called reads) a few hundred base pairs long, without any information about the locus or the orientation on the molecule they come from. In the metagenomic context, an additional difficulty is that each read may belong to any species.

Nowadays, it is difficult to assemble complex metagenomes (such as soil or water metagenomes) into longer consensus sequences, because reads from different species may be merged into one chimeric sequence. Mende and colleagues

Comparative metagenomics usually deals with many aspects, such as sequence composition,

To the best of our knowledge, there is no software designed to compare two or more metagenomic samples at the read level,

Here, we introduce a time- and memory-efficient method for extracting similar reads between two metagenomic datasets. The similarity is based on shared

This manuscript presents two main contributions: (**I**) a new algorithm, called Compareads, which computes the similarity measure between two metagenomic datasets; (**II**) a new simple but extremely efficient data structure based on the Bloom filter for storing the presence/absence of

Methods

**Preliminaries and definitions** A

**Overview of Compareads** Compareads is designed for finding similar sequences between two read sets. This basic operation may appear extremely simple. However, it has to be highly efficient, in terms of computation time and memory footprint, in order to scale to huge metagenomic datasets.

To perform this operation efficiently, Compareads indexes

**Definition 1 (shared k-mer)**

**Definition 2 (Similar sequences)**
_{1 }
_{2 }

In a few words, given two read sets

Computing

Compareads computes **The indexing** step consists in storing in memory all **query** step processes reads from set **A** one by one. For a read
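The indexing and query steps can be illustrated with a naive, exact sketch that uses a plain Python set in place of the probabilistic index. This is an illustration under assumptions and not the Compareads implementation: the function names, the threshold parameter `t`, and the rule that shared k-mers are counted by start position (so they may overlap) are all hypothetical choices made for the example.

```python
def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(reads, k):
    """Indexing step: store all k-mers of a read set in an exact set."""
    index = set()
    for read in reads:
        index.update(kmers(read, k))
    return index


def similar_reads(query_reads, index, k, t=1):
    """Query step: keep the reads that have at least t k-mers in the index.

    Assumption: k-mers are counted per start position and may overlap.
    """
    kept = []
    for read in query_reads:
        shared = sum(1 for km in kmers(read, k) if km in index)
        if shared >= t:
            kept.append(read)
    return kept


# Toy data: index set B, then process the reads of set A one by one.
set_a = ["ACGTACGTAA", "TTTTTTTTTT"]
set_b = ["GGACGTACGG", "CCCCCCCCCC"]
index_b = build_index(set_b, k=5)
print(similar_reads(set_a, index_b, k=5, t=2))
```

An exact set makes the logic easy to follow, but its memory cost grows with the number of distinct k-mers, which is the motivation for a more compact index.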

**Limiting the indexing space** To control the approximation error (see Section "

**Time complexity **Let _{A }
_{B }
_{B}

Ad hoc data structure

The index data structure we use is based on a Bloom filter, specifically designed for efficiently storing a huge set of

Bloom filter

A Bloom filter is a probabilistic data structure designed to test the membership of elements in a set

This data structure is probabilistic in nature, as false positives are possible. Even if an element is not in the set, its bits in the array may still be all set to one. This is because the bits associated with an element may independently be associated with other elements. Hence, the Bloom filter returns a wrong answer with non-zero probability. This probability is the ^{m/n}
_{2 }
_{2}(1
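The behaviour described above (no false negatives, possible false positives) can be seen in a minimal classical Bloom filter. This is a sketch under assumptions: the class name, the parameters `num_bits` and `num_hashes`, and the use of salted SHA-256 digests as hash functions are illustrative choices and are not taken from Compareads.

```python
import hashlib


class BloomFilter:
    """Classical Bloom filter: one shared bit array, several hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: simulate independent hash functions by salting SHA-256
        # with the function index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Set every bit associated with the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if all associated bits are set; bits set by other
        # elements can make this True for an item that was never added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter(num_bits=1024, num_hashes=3)
for kmer in ["ACGTA", "CGTAC", "GTACG"]:
    bf.add(kmer)
print("ACGTA" in bf)
```

Every inserted element is always reported as present. An element that was never inserted is usually reported as absent, but not always, which is the false positive case discussed in the text.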

The Bloom Data Structure index

In this article, we consider a slightly different variation of Bloom filters: instead of using a single array of bits, each hash function corresponds to a distinct array, disjoint from all other functions. In terms of performance, with uniform hash functions, this variation is asymptotically equivalent to the original definition
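The variation described here, with one disjoint bit array per hash function, can be sketched as follows. The class name, the parameters and the salted SHA-256 hashing are assumptions for illustration only; the hash functions actually used in the BDS are different.

```python
import hashlib


class PartitionedBloomFilter:
    """Bloom filter variant with one separate bit array per hash function."""

    def __init__(self, bits_per_array, num_hashes):
        self.bits_per_array = bits_per_array
        self.arrays = [
            bytearray((bits_per_array + 7) // 8) for _ in range(num_hashes)
        ]

    def _position(self, i, item):
        # Assumption: the i-th hash function is SHA-256 salted with i.
        digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.bits_per_array

    def add(self, item):
        # The i-th hash function only ever writes to the i-th array.
        for i, array in enumerate(self.arrays):
            pos = self._position(i, item)
            array[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # The item is reported present only if every array has its bit set.
        for i, array in enumerate(self.arrays):
            pos = self._position(i, item)
            if not array[pos // 8] & (1 << (pos % 8)):
                return False
        return True


pbf = PartitionedBloomFilter(bits_per_array=512, num_hashes=3)
pbf.add("ACGTA")
print("ACGTA" in pbf)
```

Keeping the arrays disjoint means that each hash function can be sized and evaluated independently of the others.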

**Particular hash functions** The hash functions used in this framework are a specific family of functions, which can be efficiently computed on consecutive _{1}, _{2 }and _{3}, are said to be

One important property of these functions is that there is a simple relationship between the hash values of two consecutive
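To illustrate the general idea of deriving the value for one k-mer from the value of the preceding one, here is a sketch based on a plain 2-bit nucleotide encoding. It is a generic rolling scheme chosen for the example and not the hash family defined in this article; the names `CODE`, `encode` and `rolling_codes` are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Encode a k-mer as an integer, using 2 bits per nucleotide."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def rolling_codes(seq, k):
    """Yield the encoding of each k-mer of seq, updated from the previous one.

    Moving one position to the right shifts in the new base and masks out
    the base that left the window, so each update takes constant time.
    """
    if len(seq) < k:
        return
    mask = (1 << (2 * k)) - 1
    value = encode(seq[:k])
    yield value
    for base in seq[k:]:
        value = ((value << 2) | CODE[base]) & mask
        yield value


print(list(rolling_codes("ACGTAC", 3)))
```

With such a relationship, processing all k-mers of a read costs time proportional to the read length rather than to the read length times k.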

The Compareads pipeline

Computing

**Similarity measure** While comparing read sets

Dealing with false positives

Our approach may generate false positives for two reasons, described in the next two sections, which also present solutions for limiting these effects.

False positives due to k-mers shared between a read and a dataset

Using

This issue can be mitigated by performing the following steps to compute both

1. Compute

2. Compute

3. Compute

In a few words, the two output datasets

The example presented in Figure

**The Compareads pipeline**. Representation of the three steps while comparing symmetrically read sets

Note that in practice, the last set

As outlined in the example Figure

Bloom filter false positives

As exposed in Section "

**FP probability for each function** Assuming the nucleotide composition of the indexed _{FP}
_{i}, k, n_{i }

**Theoretical details for the false positive rate**. Details about how theoretical false positive results were obtained.

We have plotted in Figure ^{k }

**BDS false positive rate w.r.t. hash functions (a) and k-value (b)**. FP rate as a function of the number of indexed

**FP probability for a combination of functions** One important property of the balanced hash functions is that there do not exist two distinct

This "independence" property also implies that combining these 3 functions in our BDS is a very efficient strategy to reduce the FP rate, as can be seen in Figure

Concerning the unbalanced functions, this property does not hold, since it is possible to find pairs of distinct

**Empirical estimation of false positive rate**. Details about how empirical false positive results were obtained.

**Choice of parameters** The comparison of these FP curves led us to choose the combination of the three balanced functions plus an unbalanced one. This choice is motivated by the fact that unbalanced functions are not essential, as they have a limited effect on the FP rate (Figure ^{k }
^{k }
^{k }
^{k }
^{
k-1 }bytes).

For the chosen combination of functions, we plotted the FP rate as a function of

For

Results

Practical performance of the BDS, comparison with other data structures

We propose here a comparative analysis of the BDS with other data structures. In the following, we show that classical non-probabilistic data structures result in worse time and memory performance, while in Section "

Comparison with non-probabilistic data structures: suffix array and hash table

Indexing ^{9 }bytes,

A hash table can be used to store an exact set of

Comparison with other hash functions and with a classical Bloom filter

**Time comparison with other hash functions** The hash functions defined for BDS were designed with speed in mind. In this paragraph, we compare them with a popular and fast hash function (Jenkins hash, specifically

**FP rate comparison with other hash functions** We can see in Figure

**Jenkins versus BDS false positive rate**. Comparison of FP rates between classical hash functions and the functions we used in the BDS. FP rate is plotted as a function of the number of indexed

**Comparison with a classical Bloom filter** A classical Bloom filter requires a fixed amount of memory to index

Comparison with a classical approach using BLAST

Our approach is a heuristic based on shared

Both BLAST

**Comparison between Compareads and BLAST.**

| Method | Total Time (min) | Mean Time for one intersection (s) | Reads Found |
|---|---|---|---|
| BLAST | 7200 | 3600 | 33 400 091 |
| Compareads 1 ∗ 33 | 238 | 119 | 35 898 023 |
| Compareads 4 ∗ 33 | 230 | 115 | 31 997 243 |
| Compareads 10 ∗ 33 | 228 | 114 | 21 350 268 |

CPU time per intersection and global CPU time using a single core of an Intel® Xeon® CPU X5550 at 2.67 GHz. **Reads Found** corresponds to the total number of similar reads in all the 120 intersections.

For each experiment, samples were hierarchically clustered based on their pairwise similarity scores and then drawn as a dendrogram. As shown in Figure **(a)** is slightly different but the three main branches are the same as with the Compareads approach **(b)**. Interestingly, these branches discriminate three groups of samples corresponding to the three different biological conditions indicated by 1, 10 and 40 in the sample names: 1 corresponds to the addition of Carbon to the water, 10 stands for normal conditions and 40 for the introduction of Nitrogen. Notably, all dendrograms based on the Compareads approach **(b, c, d)** show a similar organization. Increasing the number of shared
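As an illustration of how a matrix of pairwise similarity scores can be turned into a tree, the sketch below performs a simple agglomerative clustering. Several points are assumptions: the linkage criterion (average linkage) is not stated in the text, the function name is hypothetical, and the sample names and score values are made-up toy inputs rather than results from this study.

```python
def average_linkage(names, similarity):
    """Agglomerative clustering; returns the merge tree as nested tuples.

    similarity[i][j] is a symmetric pairwise score (higher = more similar).
    Assumption: average linkage, i.e. the score between two clusters is the
    mean score over all pairs of their members.
    """
    # cluster id -> (subtree, list of member indices)
    clusters = {i: (name, [i]) for i, name in enumerate(names)}
    next_id = len(names)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a = clusters[ids[x]][1]
                b = clusters[ids[y]][1]
                total = sum(similarity[i][j] for i in a for j in b)
                score = total / (len(a) * len(b))
                if best is None or score > best[0]:
                    best = (score, ids[x], ids[y])
        _, p, q = best
        tree_p, members_p = clusters.pop(p)
        tree_q, members_q = clusters.pop(q)
        # Merge the two most similar clusters into a new node.
        clusters[next_id] = ((tree_p, tree_q), members_p + members_q)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Toy, made-up inputs: four hypothetical samples from two conditions.
names = ["S1a", "S1b", "S10a", "S10b"]
scores = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
print(average_linkage(names, scores))
```

The nested tuples encode the branching order of the dendrogram: samples merged first are the most similar to each other.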

**Clustering based on Compareads and BLAST results**. Representation of hierarchical clustering based on pairwise intersections between all samples using BLAST

Applying Compareads to Global Ocean metagenomic samples

We tested Compareads on a larger, well-known public dataset from the Global Ocean Sampling (The Sorcerer II expedition) ® Xeon® CPU X5550 at 2.67 GHz. Results presented in Figure

**Heatmap of intersections in Global Ocean Sampling**. Similarity matrix resulting from the comparison of 44 samples from The

These results show that Compareads can also be used on Sanger reads and delivers reliable biological conclusions. Indeed, despite false positives and the simple definition of similarity, we were able to retrieve the classification of metagenomes according to their geographical origin.

Conclusion

Motivated by ® Xeon® CPU X5550 at 2.67 GHz. This would have been infeasible with any other existing tool known to us (based on results Section "

Compareads was designed to be parallelizable at both fine- and coarse-grained levels. Future work will consist of implementing a parallel version exploiting multi-core and GPU chips. Compareads is released under the CeCILL license and can be freely downloaded from

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

DL and PP initiated the work. RC and CL provided expertise on Bloom filter data structures and their statistical aspects. NM and PP made the implementations. NM, CL, DL and PP performed the experiments. All authors participated in writing the manuscript and approved the final version.

Acknowledgements

This work was supported by the French ANR-2010-COSI-004 MAPPI Project. The authors warmly thank O. Jaillon from the Genoscope and P. Vandenkoornhuyse from the Ecobio UMR for providing their biological expertise and metagenomic datasets. Additionally, we thank G. Rizk for help and comments on the data structure and F. Gauthier for the

This article has been published as part of