Table 4

Algorithms and features used by systems mostly based on machine learning methods.

| De-identification system | Machine learning algorithm | Lexical/morphological features | Syntactic features | Semantic features |
|---|---|---|---|---|
| Aramaki | CRF | Word, surrounding words (5-word window), capitalization, word length, regular expressions (date, phone), sentence position and length | POS (word + 2 surrounding words) | Dictionary terms (names, locations) |
| Gardner | CRF | Word lemma, capitalization, numbers, prefixes/suffixes, 2-3 character n-grams | POS (word) | None |
| Guo | SVM | Word, capitalization, prefixes/suffixes, word length, numbers, regular expressions (date, ID, phone, age) | POS (word) | Entities extracted by ANNIE (doctors, hospitals, locations) |
| Hara | SVM | Word, lemma, capitalization, regular expressions (phone, date, ID) | POS (word) | Section headings |
| Szarvas | Decision Tree | Word length, capitalization, numbers, regular expressions (age, date, ID, phone), token frequency | None | Dictionary terms (first names, US locations, countries, cities, diseases, non-PHI terms), section heading |
| Taira | Maximum Entropy | Capitalization, punctuation, numbers, regular expressions (prefixes, physician and hospital name, syndrome/disease/procedure) | POS (word) | Semantic lexicon, dictionary terms (proper names, prefixes, drugs, devices), semantic selectional restrictions |
| Uzuner | SVM | Word, lexical bigrams, capitalization, punctuation, numbers, word length | POS (word + 2 surrounding words), syntactic bigrams (link grammar) | MeSH ID, dictionary terms (names, US and world locations, hospital names), section headers |
| Wellner | CRF | Word unigrams/bigrams, surrounding words (3-word window), prefixes/suffixes, capitalization, numbers, regular expressions (phone, ID, zip, date, locations/hospitals) | None | Dictionary terms (US states, months, general English terms) |

CRF = Conditional Random Fields; SVM = Support Vector Machine; POS = Part-of-speech

Meystre et al. BMC Medical Research Methodology 2010 10:70   doi:10.1186/1471-2288-10-70
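
To make the lexical/morphological feature column more concrete, the following is a minimal Python sketch of how per-token features of the kind listed in Table 4 (word, capitalization, numbers, word length, prefixes/suffixes, regular expressions, surrounding words) might be computed before being passed to a CRF or SVM. It is not taken from any of the listed systems: the function name `token_features`, the feature names, the window size, and both regular expressions are assumptions made for illustration.

```python
import re

# Hypothetical patterns for illustration only; the table does not give
# the actual regular expressions used by any of the systems.
DATE_RE = re.compile(r"^\d{1,2}/\d{1,2}(/\d{2,4})?$")
PHONE_RE = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")


def token_features(tokens, i, window=2):
    """Return a feature dict for tokens[i] (hypothetical helper)."""
    word = tokens[i]
    feats = {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(c.isdigit() for c in word),
        "length": len(word),
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
        "looks_like_date": bool(DATE_RE.match(word)),
        "looks_like_phone": bool(PHONE_RE.match(word)),
    }
    # Surrounding words within the window; positions outside the
    # sentence are padded with a placeholder token.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if inside else "<PAD>"
    return feats


# Made-up example sentence, already tokenized.
tokens = ["Seen", "by", "Dr.", "Smith", "on", "03/14/2005"]
features = [token_features(tokens, i) for i in range(len(tokens))]
```

Each dictionary in `features` describes one token. Syntactic features (POS tags) and semantic features (dictionary lookups, section headings) would be added as further keys; they are omitted here because they depend on external taggers and term lists.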
