Automatic extraction of candidate nomenclature terms using the doublet method
Correspondence: Jules J Berman jjberman@alum.mit.edu
Cancer Diagnosis Program, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
BMC Medical Informatics and Decision Making 2005, 5:35 doi:10.1186/1472-6947-5-35
Published: 18 October 2005
Additional files
Additional file 1:
Neoclxml.gz is the compressed version of neocl.xml, the XML file containing the developmental lineage classification and taxonomy of neoplasms. This version of neocl.xml supersedes prior published versions [6,7]. Because neocl.xml exceeds 9 megabytes when uncompressed, a gzipped version of the file is provided (neoclxml.gz). After downloading the file from the BioMed Central site, add a .gz suffix to the filename if the downloaded name lacks one. After decompressing the file, rename it "neocl.xml". The file can be opened in current web browsers, but experience has shown that many browsers lack sufficient memory to display the entire file. Alternatively, the file can be viewed in a word processor or an ASCII text editor.
Format: GZ Size: 670KB
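The decompress-and-rename step described for Additional file 1 can be sketched in a few lines of Python. This is only an illustration, not part of the published files: the function name decompress_nomenclature and its default paths are assumptions, and any gzip-capable utility would serve equally well.

```python
import gzip
import shutil
from pathlib import Path


def decompress_nomenclature(src="neoclxml.gz", dest="neocl.xml"):
    """Decompress a gzipped file and write it under the expected name.

    The default file names follow the description of Additional file 1;
    the function itself is a hypothetical helper.
    """
    # Stream the data in chunks so the 9+ MB file is never held in memory whole.
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return Path(dest)
```

Calling decompress_nomenclature() in the download directory would leave neocl.xml next to the original archive.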
Additional file 2:
Doubuniq.pl is a Perl script that parses the reference nomenclature (neocl.xml) and outputs a list of terms that contain one or more unique doublets (i.e., terms that contain a doublet that is not found in any other term from the same nomenclature).
Format: PL Size: 3KB
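The notion of a unique doublet can be illustrated with a short Python sketch. It is not a port of doubuniq.pl: it assumes the nomenclature terms have already been extracted from neocl.xml into a list of strings, that a doublet is a pair of consecutive whitespace-separated words, and that matching is case-insensitive. The function names are hypothetical.

```python
from collections import defaultdict


def doublets(term):
    """Return the consecutive word pairs (doublets) in a term."""
    words = term.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def terms_with_unique_doublets(terms):
    """Return the terms that hold at least one doublet found in no other term."""
    # Map each doublet to the set of terms in which it occurs.
    owners = defaultdict(set)
    for term in terms:
        for doublet in doublets(term):
            owners[doublet].add(term)
    # A doublet is unique when exactly one term owns it.
    return [
        term
        for term in terms
        if any(len(owners[doublet]) == 1 for doublet in doublets(term))
    ]
```

For example, given "renal cell carcinoma", "clear cell carcinoma" and "cell carcinoma", the first two terms are kept because "renal cell" and "clear cell" each occur in a single term, while "cell carcinoma" is shared by all three.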
Additional file 3:
Doubuniq.txt is the output file of doubuniq.pl, and consists of 6,305 terms from the reference nomenclature (neocl.xml) that contain one or more unique doublets (i.e., terms that contain a doublet that is not found in any other term from the same nomenclature).
Format: TXT Size: 177KB
Additional file 4:
Getdoub.pl is a Perl script that implements the doublet method for automatic extraction of candidate nomenclature terms. It requires an external plain-text corpus and a reference nomenclature.
Format: PL Size: 8KB
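A minimal Python sketch of the idea behind the doublet method follows. It is not the getdoub.pl implementation. It assumes that candidate terms are maximal runs of corpus words in which every consecutive pair is a doublet from the reference nomenclature, that runs must not cross sentence punctuation, and that runs already present in the nomenclature are skipped. The min_words threshold, the tokenization, and all function names are assumptions made for illustration.

```python
import re


def doublet_set(terms):
    """Collect every consecutive word pair occurring in the nomenclature terms."""
    found = set()
    for term in terms:
        words = term.lower().split()
        found.update(zip(words, words[1:]))
    return found


def candidate_terms(text, nomenclature, min_words=3):
    """Return corpus phrases built from chained nomenclature doublets."""
    known = {" ".join(term.lower().split()) for term in nomenclature}
    pairs = doublet_set(nomenclature)
    candidates = []

    def flush(run):
        # Keep a run only if it is long enough, new, and not yet recorded.
        phrase = " ".join(run)
        if len(run) >= min_words and phrase not in known and phrase not in candidates:
            candidates.append(phrase)

    # Assumption: candidate terms never span sentence-ending punctuation.
    for sentence in re.split(r"[.;:!?\n]+", text.lower()):
        words = re.findall(r"[a-z0-9'-]+", sentence)
        run = []
        for prev, word in zip(words, words[1:]):
            if (prev, word) in pairs:
                # Extend the current run, or start one with both words.
                run = run + [word] if run else [prev, word]
            else:
                flush(run)
                run = []
        flush(run)
    return candidates
```

With a nomenclature containing "renal cell carcinoma" and "carcinoma of lung", the sentence "The biopsy showed renal cell carcinoma of lung origin" yields the candidate "renal cell carcinoma of lung", since each adjacent word pair in that phrase occurs somewhere in the nomenclature while the full phrase does not.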
Additional file 5:
Tumoram.out is a plain-text file containing the output of getdoub.pl, the Perl script that extracts terms consisting of concatenated doublets found in the reference nomenclature.
Format: OUT Size: 49KB
