Table 4

Assembly results for three metagenomic datasets

| Library | Assembly run | # Reads | # Contigs (> 500 bp) | Average contig length (> 500 bp) | Contig N50¹ (bp) | # Concatenated tag sequences allowing 3 mismatches |
|---|---|---|---|---|---|---|
| LIB019 | A | 42,825 | 136 (25) | 329.91 (703.08) | 423 | 10 |
| LIB019 | B | 34,778 | 73 (25) | 390.04 (694.04) | 605 | 5 |
| LIB019 | C | 35,426² | 50 (26) | 510.94 (768.92) | 663 | 0 |
| LIB020 | A | 17,129 | 89 (6) | 246.40 (557.33) | 306 | 4 |
| LIB020 | B | 14,208 | 55 (13) | 292.85 (655.85) | 510 | 3 |
| LIB020 | C | 14,366² | 52 (12) | 312.54 (726.33) | 547 | 0 |
| LIB021 | A | 49,282 | 305 (15) | 238.54 (682.00) | 276 | 29 |
| LIB021 | B | 41,126 | 186 (18) | 264.12 (691.67) | 302 | 16 |
| LIB021 | C | 42,495² | 165 (20) | 282.39 (782.00) | 303 | 0 |

The GS De Novo Assembler Software version 2.3 (Roche, Branford, CT) was used to assemble three metagenomic libraries (LIB019, LIB020 and LIB021) to illustrate how TagCleaner can improve metagenomic and other high-throughput studies. The assembly parameters were set to 95% identity over at least 35 bp. Assemblies were generated for three different parameter sets for each of the metagenomic libraries: (A) raw data; (B) tag sequences trimmed allowing three mismatches; (C) tag sequences trimmed allowing three mismatches with additional splitting of the fragment-to-fragment concatenations and continuous end tag trimming. For B and C, the minimum sequence length was set to 40 bp, sequence duplicates were removed and all other parameters were kept at their default values.

1 The N50 contig size is a weighted median that is defined as the length of the smallest contig C in the sorted list of all contigs where the cumulative length from the largest contig to contig C is at least 50% of the total length (sum of contig lengths).

2 Increased number of reads due to splitting of the fragment-to-fragment concatenations.
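
The N50 definition in footnote 1 can be illustrated with a short sketch. This is not the assembler's own code; the function name `n50` and the example lengths are hypothetical and serve only to show the calculation on a plain list of contig lengths in bp.

```python
def n50(contig_lengths):
    """Return the N50 (in bp) for an iterable of contig lengths.

    Illustrative helper only (hypothetical name, not part of any tool
    mentioned in the table): sort contigs from largest to smallest and
    return the length of the first contig at which the running total
    reaches at least 50% of the summed contig length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")

    half_total = sum(lengths) / 2
    cumulative = 0
    for length in lengths:
        cumulative += length
        if cumulative >= half_total:
            return length


# Made-up contig lengths, not taken from the table.
example = [100, 200, 300, 400, 500]
print(n50(example))
```

For the made-up lengths above, the total is 1,500 bp; the two largest contigs (500 and 400 bp) together reach 900 bp, which is the first point at which the running total passes 750 bp, so the N50 is 400 bp.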

Schmieder et al. BMC Bioinformatics 2010 11:341 doi:10.1186/1471-2105-11-341
