Table 1

Orthographic features.

Orthographic Feature    Reg. Exp.

Init Caps               [A-Z].*
Init Caps Alpha         [A-Z][a-z]*
All Caps                [A-Z]+
Caps Mix                [A-Za-z]+
Has Digit               .*[0-9].*
Single Digit            [0-9]
Double Digit            [0-9][0-9]
Natural Number          [0-9]+
Real Number             [-0-9]+[.,]+[0-9.,]+
Alpha-Num               [A-Za-z0-9]+
Roman                   [ivxdlcm]+ or [IVXDLCM]+
Has Dash                .*-.*
Init Dash               -.*
End Dash                .*-
Punctuation             [,.;:?!-+'"']


This defines the complete set of orthographic predicates used by the system. The observation list for each token includes a predicate for every regular expression that the token matches.
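
As an illustration of how a token's observation list could be built from these patterns, the following is a minimal Python sketch, not the authors' implementation. It assumes each pattern must match the whole token (hence `re.fullmatch`), and the predicate names and the function `orthographic_predicates` are hypothetical. The Real Number and Punctuation rows are left out because their printed patterns are ambiguous in this copy of the table.

```python
import re

# Patterns transcribed from Table 1. Predicate names are illustrative only.
# "Real Number" and "Punctuation" are omitted (ambiguous in this copy).
ORTHOGRAPHIC_FEATURES = {
    "InitCaps": r"[A-Z].*",
    "InitCapsAlpha": r"[A-Z][a-z]*",
    "AllCaps": r"[A-Z]+",
    "CapsMix": r"[A-Za-z]+",
    "HasDigit": r".*[0-9].*",
    "SingleDigit": r"[0-9]",
    "DoubleDigit": r"[0-9][0-9]",
    "NaturalNumber": r"[0-9]+",
    "AlphaNum": r"[A-Za-z0-9]+",
    "Roman": r"[ivxdlcm]+|[IVXDLCM]+",  # the table's "or" written as alternation
    "HasDash": r".*-.*",
    "InitDash": r"-.*",
    "EndDash": r".*-",
}

# Compile once so repeated token lookups do not re-parse the patterns.
COMPILED = {name: re.compile(p) for name, p in ORTHOGRAPHIC_FEATURES.items()}


def orthographic_predicates(token):
    """Return the names of all features whose pattern matches the whole token.

    Whole-token matching is an assumption; the table does not state it.
    """
    return [name for name, rx in COMPILED.items() if rx.fullmatch(token)]


if __name__ == "__main__":
    for tok in ["IL-2", "p53", "IV"]:
        print(tok, orthographic_predicates(tok))
```

Under these assumptions a single token can fire several predicates at once; for example, `IL-2` matches the Init Caps, Has Digit, and Has Dash patterns.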

McDonald and Pereira BMC Bioinformatics 2005 6(Suppl 1):S6   doi:10.1186/1471-2105-6-S1-S6
