Table 2

Datasets used for performance evaluation

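| Data Set       | String Type | Mean Length | Database Count | Query Count | Alphabet Size | k-mer Length | Total Database k-mers |
|----------------|-------------|-------------|----------------|-------------|---------------|--------------|-----------------------|
| 16S (a)        | DNA         | 1350        | 188,073        | 2000        | 4             | 7            | 16,384                |
| Pyro (b)       | DNA         | 150         | 501,532        | 500         | 4             | 6            | 4,096                 |
| ITS (c)        | DNA         | 627         | 212,367        | 2000        | 4             | 6            | 4,096                 |
| Shuffle (d)    | DNA         | 687         | 1,000,000      | 1000        | 4             | 7            | 16,384                |
| gpI (e)        | RNA         | 398         | 20,085         | 5000        | 4             | 7            | 16,360                |
| GP120 (f)      | Protein     | 175         | 68,119         | 2000        | 20            | 4            | 98,695                |
| Institutes (g) | Text        | 121         | 23,768         | 1000        | 47/61         | 4            | 67,287                |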


a Greengenes 16S rRNA gene collection (DeSantis, 2006)

b Roche-454 pyrosequences from gastrointestinal contents (Ochman, 2010)

c Internal Transcribed Spacer region from eukaryotic ribosomal genes.

d Derived from random repetitive shuffling of Ralstonia solanacearum strain UW486 endoglucanase precursor, DQ657652 (Castillo and Greenberg, 2007)

e Group I catalytic introns, RFAM RF00028 (Griffiths-Jones et al., 2003)

f HIV Envelope glycoprotein PFAM PF00516 (Finn, 2008)

g Institute names as displayed in GenBank records. For BLAST and SSAHA2, all non-alphanumeric characters were interpreted as a space, giving a total alphabet size of 47; for Simrank, no substitution was performed for any of the 61 unique characters.
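
The relationship between alphabet size, k-mer length, and the k-mer totals in Table 2 can be checked with a short calculation. The sketch below is illustrative only and is not taken from the Simrank source code. It assumes that "total database k-mers" counts the distinct k-mers observed in each database, so the value can never exceed (alphabet size)^k. The function names and the `TABLE_2` dictionary are hypothetical, and the use of the 61-character alphabet for the Institutes row is likewise an assumption.

```python
def max_kmers(alphabet_size: int, k: int) -> int:
    """Upper bound on the number of distinct k-mers over an alphabet."""
    return alphabet_size ** k


def distinct_kmers(sequences, k: int) -> set:
    """Collect every distinct substring of length k from the sequences."""
    seen = set()
    for seq in sequences:
        # A sequence shorter than k contributes nothing (empty range).
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return seen


# Values copied from Table 2: (alphabet size, k-mer length, reported k-mers).
# The Institutes row uses 61, assuming the total refers to the unsubstituted
# alphabet; that reading is an assumption, not stated in the table.
TABLE_2 = {
    "16S": (4, 7, 16384),
    "Pyro": (4, 6, 4096),
    "ITS": (4, 6, 4096),
    "Shuffle": (4, 7, 16384),
    "gpI": (4, 7, 16360),
    "GP120": (20, 4, 98695),
    "Institutes": (61, 4, 67287),
}


def saturation(name: str) -> float:
    """Fraction of the possible k-mer space reported for a data set."""
    alphabet_size, k, reported = TABLE_2[name]
    return reported / max_kmers(alphabet_size, k)


if __name__ == "__main__":
    for name in TABLE_2:
        print(f"{name}: {saturation(name):.4f}")
```

Under this reading, the four DNA data sets report every possible k-mer (4^7 = 16,384 and 4^6 = 4,096), the gpI set falls just short of 4^7, and the protein and text data sets cover only part of their much larger k-mer spaces.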

DeSantis et al. BMC Ecology 2011 11:11   doi:10.1186/1472-6785-11-11
