
Development and evaluation of an open source software tool for deidentification of pathology reports

Bruce A Beckwith*, Rajeshwarri Mahaadevan, Ulysses J Balis and Frank Kuo

BMC Medical Informatics and Decision Making 2006, 6:12  doi:10.1186/1472-6947-6-12


Comparing de-identification methods

Jules Berman   (2006-03-31 12:02)  Association for Pathology Informatics email

Bruce Beckwith and colleagues have made an important contribution to the field of data scrubbing and data de-identification. To their credit, they made all their source code and Java files publicly available and free. The paper is well written and data-driven.

I have discussed the issue of data scrubbing strategies with Dr. Beckwith on several occasions. Basically, there seem to be two published approaches. One approach is to parse the text and remove all the identifying words. This is the approach that Bruce Beckwith recommends.

The second way is to parse the text and remove EVERY WORD EXCEPT words from an approved list of non-identifying words. That's the strategy that I have previously published:

Berman JJ. Concept-match medical data scrubbing. How pathology text can be used in research. Arch Pathol Lab Med. 2003 Jun;127(6):680-6.

There are advantages and disadvantages to both methods.

Basically, if you write regex rules to extract identifying words (the method of Beckwith et al), you'll miss some identifiers. If you have a large corpus of text, you'll miss a lot of identifying words. Because it allows some identifiers to slip through, the method of Beckwith et al is best suited for limited data use agreements. In addition, the regex rules will need to change for different sets of records (radiology, surgical pathology, op notes, hospital A formats, hospital B formats). To accommodate changes in style and format, the list of regex rules will need to grow and grow, and the software will become increasingly complex and S L O W.
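To make the trade-off concrete, here is a minimal sketch of the "remove identifiers" strategy, written in Python for brevity. These rules are purely illustrative and are NOT the actual rules from the Beckwith et al software; the point is that each class of identifier needs its own rule, and anything not anticipated slips through.

```python
import re

# Illustrative blacklist rules (hypothetical, not Beckwith et al's rule set):
# each regex targets one class of identifier, and every new record
# format tends to demand additional rules.
RULES = [
    re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.?\s+[A-Z][a-z]+"),   # titled names
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),           # numeric dates
    re.compile(r"\b(?:January|February|March|April|May|June|July|August|"
               r"September|October|November|December)\s+\d{1,2},?\s+\d{4}\b"),
]

def scrub_blacklist(text):
    # Replace every match of every rule with an asterisk.
    for rule in RULES:
        text = rule.sub("*", text)
    return text

print(scrub_blacklist("Mr. Brown was seen on 3/14/1985."))
# -> "* was seen on *."
```

An untitled surname, an unusual date format, or a medical record number written in an unanticipated style would pass through these rules untouched, which is exactly the failure mode described above.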

If you use the concept-match method, the algorithm is simple, fast, and virtually foolproof.

No identifiers will be present in the output because the only words in the output come from an approved list of terms. But the output of the algorithm will be hard to read. Identifying words in the original text will be replaced by an asterisk, and the text may consist predominantly of asterisks if it contains many words and terms that are not present in the "approved" word list. This was noted, correctly, by Beckwith et al.

I have written a much-improved version of my concept-match software that uses doublets (2-grams). The algorithm is now simpler than ever. There is an external list of "approved" word doublets (about 80,000 of them), chosen to contain no identifying terms. My current list of doublets was derived from two open source medical vocabularies. The algorithm is simple: the text is parsed, and all the doublets in the text that match a term in the approved list are retained; everything else is replaced by an asterisk. It works fast (1 Mbyte per second on my 1.6 GHz CPU) and doesn't allow any unlisted doublets to slip through. It retains more words from the text than the original concept-match algorithm.

The value of using doublets (instead of approved single words) is that a single, seemingly innocuous word (like "No") can be a person's name ("Dr. No is in the hospital"). Because the doublets "Dr. No" and "No is" are not included in the approved doublet list, the identifying text will be excluded. On the other hand, accepted doublets, like "no way" or "no food", would be saved if they were included in the list of approved doublets.
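The doublet logic (print the first word of each approved doublet, carry its second word forward, and flush an asterisk on any miss) can be mirrored in Python as follows. This is a paraphrase of the Perl script below, with a toy doublet list standing in for the ~80,000-entry approved list:

```python
# Toy stand-in for the external database of ~80,000 approved doublets.
APPROVED_DOUBLETS = {"basal cell", "cell carcinoma", "has a", "a basal",
                     "no way", "no food"}

def scrub_doublets(text):
    words = text.lower().replace(".", "").split()
    out, lastword = [], "*"
    for i in range(len(words)):
        doublet = " ".join(words[i:i + 2])  # successive overlapping 2-grams
        if doublet in APPROVED_DOUBLETS:
            out.append(words[i])       # emit the first word of the doublet
            lastword = words[i + 1]    # carry its second word forward
        else:
            out.append(lastword)       # flush the carried word (or "*")
            lastword = "*"
    return " ".join(out)

print(scrub_doublets("Dr. No is in the hospital"))
# -> "* * * * * *"
print(scrub_doublets("Mr Brown has a basal cell carcinoma"))
# -> "* * has a basal cell carcinoma"
```

Neither "dr no" nor "no is" is an approved doublet, so "No" is blocked even though the word itself looks harmless.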

The method can be scripted in under 20 Perl command lines.


#Copyright (C) 2006 Jules J. Berman
#
#This program is free software; you can redistribute it and/or modify
#it under the terms of the GNU General Public License as published by
#the Free Software Foundation; either version 2 of the License, or
#(at your option) any later version.
#
#The GNU license is available at:
#
#This program is distributed in the hope that it will be useful,
#but WITHOUT ANY WARRANTY; without even the implied warranty of
#MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
#GNU General Public License for more details.
#
#As a courtesy, users of this script should cite the following
#publication:
#
#Berman JJ. Concept-match medical data scrubbing. How pathology text


use Fcntl; #Perl needs the standard Fcntl (file control) module for this script
use SDBM_File; #Perl needs the standard SDBM_File (simple database) module for this script
tie %doubhash, "SDBM_File", 'scrub', O_RDWR, 0644; #ties the external doublet database
#undef($/); #undefine the line separator so we can slurp a text file in one reading
#open (TEXT, "scrub.txt")||die"Can't open file"; #open an external file to hold scrubbed text
print "What would you like to scrub?\n";
$line = <STDIN>;
print "Scrubbed text.... ";
$line = lc($line); #convert text to lowercase
$line =~ s/[a-z]+\'[s]/possessive word/g; #neutralize possessive forms
$line =~ s/[^a-z0-9 \-]/ /g; #replace non-alphanumerics with a space
@hoparray = split(/ +/,$line); #create an ordered array from the text words
$lastword = "\*";
for ($i=0;$i<(scalar(@hoparray));$i++) #step through the array
  {
  $doublet = "$hoparray[$i] $hoparray[$i+1]"; #successive overlapping word doublets
  if (exists $doubhash{$doublet}) #check whether the doublet is in the database of allowed doublets
    {
    print " $hoparray[$i]"; #print the first word of the doublet
    $lastword = " $hoparray[$i+1]"; #save the second word of the doublet
    }
  else
    {
    print $lastword; #doublet not in database, so print the second word of the last matching doublet
    $lastword = " \*"; #load an asterisk into the variable that holds the last matching word
    }
  }
print "\n";

The external SDBM files (scrub.dir and scrub.pag) containing the approved doublets are available for download from the API resources page.

Some actual output is shown:


What would you like to scrub?

Basal cell carcinoma, margins involved

Scrubbed text.... basal cell carcinoma margins involved


What would you like to scrub?

Rhabdoid tumor of kidney

Scrubbed text.... rhabdoid tumor of kidney


What would you like to scrub?

Mr Brown has a basal cell carcinoma

Scrubbed text.... * * has a basal cell carcinoma


What would you like to scrub?

Mr. Brown was born on Tuesday, March 14, 1985

Scrubbed text.... * * * * * * * * *


What would you like to scrub?

The doctor killed the patient

Scrubbed text.... * * * * *

My opinion is that the method of Beckwith et al may be the best option if you're planning to share pathology records through a limited data use agreement.

My doublet variant of the concept match method may be the best option if you're working with agnostic text (that doesn't fit any particular format) or if you're preparing data for public distribution.

I plan to expand this work and to eventually publish it. If Bruce Beckwith and colleagues would like to work with me to produce a combined study wherein both methods are used on the same corpus of text, with the advantages and disadvantages of either method described in a controlled study, that would be most welcome.

Competing interests

I declare that I have no competing interests.

