Table 2

Rules for Tokenization before Lucene Indexing

Rule  Regular Expression         Replacement
1     ([A-Z]{2,})([a-z]{2,})     $1␣$2
2     ([a-z]{2,})([A-Z]{2,})     $1␣$2
3     [\w\_&&[^\.]]              ␣
4     ([\d\.]+)                  $1
5     \s+                        ␣

␣ is used to represent a space character.
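The rules in Table 2 read as an ordered chain of regular-expression replacements applied to the text before it is passed to Lucene. The following Java sketch illustrates that reading; it is not the authors' implementation. The class and method names (`TokenizationSketch`, `tokenize`) are hypothetical, and three points are assumptions rather than facts from the table: rules 1 and 2 insert a space between the two captured groups; rule 3 targets non-word characters and underscores other than the period (written here with `\W`, since the printed `\w` would blank out every letter and digit); and rule 4 pads each number with a space on both sides, since a bare `$1` replacement would change nothing.

```java
final class TokenizationSketch {

    // Ordered {pattern, replacement} pairs modeled on Table 2.
    // Assumptions: the space placement in rules 1, 2, and 4, and the
    // use of \W (non-word) instead of \w in rule 3.
    private static final String[][] RULES = {
        {"([A-Z]{2,})([a-z]{2,})", "$1 $2"}, // rule 1: upper-case run followed by lower-case run
        {"([a-z]{2,})([A-Z]{2,})", "$1 $2"}, // rule 2: lower-case run followed by upper-case run
        {"[\\W_&&[^.]]", " "},               // rule 3: punctuation and '_' except '.' become a space
        {"([\\d.]+)", " $1 "},               // rule 4: isolate runs of digits and periods
        {"\\s+", " "},                       // rule 5: collapse runs of whitespace
    };

    // Applies the rules in table order, then trims the ends
    // (trimming is a convenience added for this sketch).
    static String tokenize(String text) {
        String out = text;
        for (String[] rule : RULES) {
            out = out.replaceAll(rule[0], rule[1]);
        }
        return out.trim();
    }

    public static void main(String[] args) {
        System.out.println(tokenize("HIVinfected cells, antiTNF_alpha 2.5mg"));
    }
}
```

Under these assumptions, the input `HIVinfected cells, antiTNF_alpha 2.5mg` becomes `HIV infected cells anti TNF alpha 2.5 mg`. The order matters: the case-boundary rules run before punctuation is stripped, and whitespace is collapsed last so that spaces introduced by earlier rules do not accumulate.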

Kuo et al. BMC Bioinformatics 2011 12(Suppl 8):S6   doi:10.1186/1471-2105-12-S8-S6
