<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1472-6947-2-9</ui>
   <ji>1472-6947</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>Preparation of name and address data for record linkage using hidden Markov models</p>
         </title>
         <aug>
            <au id="A1" ce="yes" ca="yes">
               <snm>Churches</snm>
               <fnm>Tim</fnm>
               <insr iid="I1"/>
               <email>tchur@doh.health.nsw.gov.au</email>
            </au>
            <au id="A2" ce="yes">
               <snm>Christen</snm>
               <fnm>Peter</fnm>
               <insr iid="I2"/>
               <email>peter.christen@anu.edu.au</email>
            </au>
            <au id="A3">
               <snm>Lim</snm>
               <fnm>Kim</fnm>
               <insr iid="I1"/>
               <email>klim@doh.health.nsw.gov.au</email>
            </au>
            <au id="A4">
               <snm>Zhu</snm>
               <mnm>Xi</mnm>
               <fnm>Justin</fnm>
               <insr iid="I2"/>
               <email>u3167614@student.anu.edu.au</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Centre for Epidemiology and Research, Public Health Division, New South Wales Department of Health, Locked Mail Bag 961, North Sydney 2059, Australia</p>
            </ins>
            <ins id="I2">
               <p>Department of Computer Science, Australian National University, Canberra, Australia</p>
            </ins>
         </insg>
         <source>BMC Medical Informatics and Decision Making</source>
         <issn>1472-6947</issn>
         <pubdate>2002</pubdate>
         <volume>2</volume>
         <issue>1</issue>
         <fpage>9</fpage>
         <url>http://www.biomedcentral.com/1472-6947/2/9</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="doi">10.1186/1472-6947-2-9</pubid>
               <pubid idtype="pmpid">12482326</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>29</day>
               <month>10</month>
               <year>2002</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>13</day>
               <month>12</month>
               <year>2002</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>13</day>
               <month>12</month>
               <year>2002</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2002</year>
         <collab>Churches et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.</collab>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Record linkage refers to the process of joining records that relate to the same entity or event in one or more data collections. In the absence of a shared, unique key, record linkage involves the comparison of ensembles of partially-identifying, non-unique data items between pairs of records. Data items with variable formats, such as names and addresses, need to be transformed and normalised in order to validly carry out these comparisons. Traditionally, deterministic rule-based data processing systems have been used to carry out this pre-processing, which is commonly referred to as "standardisation". This paper describes an alternative approach to standardisation, using a combination of lexicon-based tokenisation and probabilistic hidden Markov models (HMMs).</p>
            </sec>
            <sec>
               <st>
                  <p>Methods</p>
               </st>
               <p>HMMs were trained to standardise typical Australian name and address data drawn from a range of health data collections. The accuracy of the results was compared to that produced by rule-based systems.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Training of HMMs was found to be quick and did not require any specialised skills. For addresses, HMMs produced equal or better standardisation accuracy than a widely-used rule-based system. However, accuracy was worse when HMMs were used with simpler name data. Possible reasons for this poorer performance are discussed.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Lexicon-based tokenisation and HMMs provide a viable and effort-effective alternative to rule-based systems for pre-processing more complex variably formatted data such as addresses. Further work is required to improve the performance of this approach with simpler data such as names. Software which implements the methods described in this paper is freely available under an open source license for other researchers to use and improve.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <sec>
            <st>
               <p>Introduction</p>
            </st>
            <p>Record linkage refers to the process of joining records that relate to the same entity or event in one or more data collections <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. The entity is often a person, in which case record linkage may be used for tasks such as building a longitudinal health record <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>, or relating genotypic information to phenotypic information <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>. In other settings, the aim may be to link several sources of information about the same event, such as police, accident investigation, ambulance, emergency department and hospital admitted patient records which all relate to the same motor vehicle accident <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. Record linkage (originally known as "medical record linkage") is now widely used in research &#8211; in October 2002, a search of the biomedical literature via PubMed for "medical record linkage" as a Medical Subject Heading returned over 1,300 references <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>.</p>
            <p>The process of record linkage is trivial where the records that relate to the same entity or event all share a common, unique key or identifier &#8211; an SQL "equijoin" operation, or its equivalent in other data management environments, can be used to link records. However, often there is no unique key which is shared by all the data collections which need to be linked, particularly when these data collections are administered by separate organisations, possibly operated for quite different purposes in disparate subject domains.</p>
            <p>In these settings, more specialised record linkage techniques need to be used. These techniques can be broadly divided into two groups: deterministic, or rule-based techniques, and probabilistic techniques. A full description of these techniques is beyond the scope of this paper. A number of recent reviews of this topic are available <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>. However, all of these techniques rely on an element-wise comparison between pairs of records each comprising an ensemble of non-unique, partially identifying personal (or event) attributes. These attributes commonly include name, residential address, date of birth (or age at a particular date), sex (or gender), marital status, and country of birth.</p>
            <p>For example, consider the fictitious personally-identified records in Table <tblr tid="T1">1</tblr>.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Some illustrative, fictitious, personally-identified records</p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c ca="center">
                        <p>
                           <b>Record Number</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Name</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Sex</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Street address</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Locality</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Age in years</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="left">
                        <p>Gwen Palfree</p>
                     </c>
                     <c ca="center">
                        <p>F</p>
                     </c>
                     <c ca="left">
                        <p>Flat 17 23&#8211;25 Knitting Street</p>
                     </c>
                     <c ca="left">
                        <p>West Wishbone 2987 New South Wales</p>
                     </c>
                     <c ca="center">
                        <p>42</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="left">
                        <p>Angie Tantivitiyapitak</p>
                     </c>
                     <c ca="center">
                        <p>F</p>
                     </c>
                     <c ca="left">
                        <p>Wat Paknam Saint George Ave</p>
                     </c>
                     <c ca="left">
                        <p>Old Putney NSW 2345</p>
                     </c>
                     <c ca="center">
                        <p>27</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="left">
                        <p>Gwendolynne Palfrey</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="left">
                        <p>17/23 Knitting St</p>
                     </c>
                     <c ca="left">
                        <p>Wishbone West NSW 2987</p>
                     </c>
                     <c ca="center">
                        <p>42</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="left">
                        <p>Tontiveetiyapitak, Angela</p>
                     </c>
                     <c ca="center">
                        <p>Female</p>
                     </c>
                     <c ca="left">
                        <p>C/- Paknam Monastery, 245 St George St</p>
                     </c>
                     <c ca="left">
                        <p>Putney 2345</p>
                     </c>
                     <c ca="center">
                        <p>28</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="left">
                        <p>Palfrey, Lyn</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="left">
                        <p>Corner of Knitting and Cro</p>
                     </c>
                     <c ca="left">
                        <p>chet Streets, Wishbone New Sth Wales</p>
                     </c>
                     <c ca="center">
                        <p>43</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Note: The "bleeding" of street address data into the locality column in record 4 is deliberate, and typical of real-life data captured by information systems with fixed-length data fields.</p>
               </tblfn>
            </tbl>
            <p>The evident variability in the formatting and encoding of these records is quite typical of data collections which have been assembled from multiple sources. This variability tends to frustrate naive attempts at automated linkage of these records. To a human, it is obvious that records 0 and 2 represent the same person. It is quite likely, but not certain, that records 1 and 3 also represent the same person. The status of record 4 with respect to records 0 and 2 is far less clear &#8211; could this be Gwendolynne's spouse, Evelyn, or is this Gwendolynne with her sex and age wrongly recorded?</p>
            <p>Regardless of the method used to automate such decisions, it is clear that transformation of the source data into a normalised form is required before valid and reliable comparisons between pairs of records can be made. Such transformation and normalisation is usually called "data standardisation" in the medical record literature, and "data cleaning" or "data scrubbing" in the computer science literature. We will refer to the process as "standardisation" henceforth, which should not be confused with the epidemiological technique of "age-sex standardisation" of incidence or prevalence rates.</p>
            <p>Standardisation of scalar attributes such as height or weight involves transformation of all quantities into a common set of units, such as from British imperial to SI units. Categorical attributes such as sex are usually transformed to a common set of representations through simple look-up tables or mapping of various encodings &#8211; for example, both "Female" and "2" might be mapped to "F", and "Male" and "1" to "M", in order to provide a consistent encoding of the sex attribute for each record. Such transformations do not present a major challenge. However, standardisation of attributes which are recorded in highly variable formats, such as names or residential addresses, is far less straightforward, and it is with this task that this paper is concerned.</p>
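            <p>As a minimal sketch of such a look-up table, the sex encodings that appear in Table <tblr tid="T1">1</tblr> could be mapped to a single representation as follows. The table contents and the function name standardise_sex are illustrative assumptions and are not taken from any particular software package.</p>

```python
# Hypothetical look-up table: raw encoding (lower-cased) -> canonical code.
SEX_LOOKUP = {
    "f": "F", "female": "F", "2": "F",
    "m": "M", "male": "M", "1": "M",
}


def standardise_sex(raw):
    """Return the canonical sex code, or None if the encoding is unknown."""
    return SEX_LOOKUP.get(raw.strip().lower())


# The Sex column of Table 1, in record order (records 0 to 4).
print([standardise_sex(value) for value in ["F", "F", "2", "Female", "1"]])
# ['F', 'F', 'F', 'F', 'M']
```

            <p>Returning None for an unrecognised encoding, rather than guessing, leaves the decision about unknown values to a later step.</p>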
            <p>This standardisation task can itself be decomposed into two steps: segmentation of the data into specific, atomic data elements; and the transformation of these atomic elements into their canonical forms. In some cases, a third step, the imputation of missing or blank data items, and a fourth step, the enhancement of the original data with known alternatives, may also be required.</p>
            <p>Some examples of the first two steps will make this clearer. Table <tblr tid="T2">2</tblr> shows the segmented and transformed forms of the name, address and sex attributes of the illustrative records introduced in Table <tblr tid="T1">1</tblr>.</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Segmented and transformed versions of the records from Table <tblr tid="T1">1</tblr></p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c ca="center">
                        <p>
                           <b>Data element</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Record 0</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Record 1</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Record 2</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Record 3</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Record 4</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Given names</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>gwen</p>
                     </c>
                     <c ca="center">
                        <p>angie</p>
                     </c>
                     <c ca="center">
                        <p>gwendolynne</p>
                     </c>
                     <c ca="center">
                        <p>angela</p>
                     </c>
                     <c ca="center">
                        <p>lyn</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Surnames</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>palfree</p>
                     </c>
                     <c ca="center">
                        <p>tantivitiyapitak</p>
                     </c>
                     <c ca="center">
                        <p>palfrey</p>
                     </c>
                     <c ca="center">
                        <p>tontiveetiyapitak</p>
                     </c>
                     <c ca="center">
                        <p>palfrey</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Sex</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>female</p>
                     </c>
                     <c ca="center">
                        <p>female</p>
                     </c>
                     <c ca="center">
                        <p>female</p>
                     </c>
                     <c ca="center">
                        <p>female</p>
                     </c>
                     <c ca="center">
                        <p>male</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Institution names</b>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>paknam</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>paknam</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Institution types</b>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>monastery</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>monastery</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Unit types</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>flat</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Unit identifiers</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>17</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>17</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Wayfare numbers</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>23,25</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>23</p>
                     </c>
                     <c ca="center">
                        <p>245</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Wayfare names</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>knitting</p>
                     </c>
                     <c ca="center">
                        <p>saint george</p>
                     </c>
                     <c ca="center">
                        <p>knitting</p>
                     </c>
                     <c ca="center">
                        <p>saint george</p>
                     </c>
                     <c ca="center">
                        <p>knitting, crochet</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Wayfare types</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>street</p>
                     </c>
                     <c ca="center">
                        <p>avenue</p>
                     </c>
                     <c ca="center">
                        <p>street</p>
                     </c>
                     <c ca="center">
                        <p>street</p>
                     </c>
                     <c ca="center">
                        <p>street</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Wayfare qualifier</b>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>corner</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Locality name</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>wishbone</p>
                     </c>
                     <c ca="center">
                        <p>putney</p>
                     </c>
                     <c ca="center">
                        <p>wishbone</p>
                     </c>
                     <c ca="center">
                        <p>putney</p>
                     </c>
                     <c ca="center">
                        <p>wishbone</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Locality qualifiers</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>west</p>
                     </c>
                     <c ca="center">
                        <p>old</p>
                     </c>
                     <c ca="center">
                        <p>west</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Territories</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>nsw</p>
                     </c>
                     <c ca="center">
                        <p>nsw</p>
                     </c>
                     <c ca="center">
                        <p>nsw</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>nsw</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Postcodes</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>2987</p>
                     </c>
                     <c ca="center">
                        <p>2345</p>
                     </c>
                     <c ca="center">
                        <p>2987</p>
                     </c>
                     <c ca="center">
                        <p>2345</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>Once the original data have been segmented and standardised in this way, further enhancement of the data is possible. For example, missing postal codes and territories can be automatically filled in from reference tables, and alternate, canonical forms of names can be added where informal, anglicised or other known variations are found, such as "Angie" (Angela, Angelique) or "Lyn" (Evelyn, Lyndon).</p>
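            <p>The enhancement step can be sketched in a few lines of Python, assuming that each standardised record is held as a dictionary. The field names, the two reference tables and the function name enhance are hypothetical and chosen only for illustration; the name variants are those given in the example above.</p>

```python
# Hypothetical reference tables; real ones would be far larger.
POSTCODE_TO_TERRITORY = {"2345": "nsw", "2987": "nsw"}
NAME_VARIANTS = {
    "angie": ["angela", "angelique"],
    "lyn": ["evelyn", "lyndon"],
}


def enhance(record):
    """Return a copy of a standardised record with gaps filled and variants added."""
    enhanced = dict(record)
    if not enhanced.get("territory"):
        # Fill in a missing territory from the postcode, if the postcode is known.
        enhanced["territory"] = POSTCODE_TO_TERRITORY.get(enhanced.get("postcode"), "")
    # Attach known alternative forms of the given name (empty list if none).
    enhanced["given_name_variants"] = NAME_VARIANTS.get(enhanced.get("given_name"), [])
    return enhanced


# Records 3 and 4 from Table 2, reduced to the fields used here.
record_3 = {"given_name": "angela", "postcode": "2345", "territory": ""}
record_4 = {"given_name": "lyn", "postcode": "", "territory": "nsw"}

print(enhance(record_3)["territory"])            # nsw
print(enhance(record_4)["given_name_variants"])  # ['evelyn', 'lyndon']
```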
         </sec>
         <sec>
            <st>
               <p>Related work</p>
            </st>
            <p>The terms data cleaning (or data cleansing), data standardisation, data scrubbing, data pre-processing and ETL (extraction, transformation and loading) are used synonymously to refer to the general tasks of transforming source data into clean and consistent sets of records suitable for loading into a data warehouse, or for linking with other data sets. A number of commercial software products are available which address this task, and a complete review is beyond the scope of this paper &#8211; a summary can be found in <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. Name and address standardisation is also closely related to the more general problem of extracting structured data, such as bibliographic references, from unstructured or variably structured texts, such as scientific papers.</p>
            <p>The most common approach for name and address standardisation is the manual specification of parsing and transformation rules. A well-known example of this approach in biomedical research is <it>AutoStan</it>, which was the companion product to the widely-used <it>AutoMatch </it>probabilistic record linkage software <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>.</p>
            <p><it>AutoStan </it>first parses the input string into individual words, and each word is then mapped to a token of a particular class. The choice of class is determined by the presence of that word in user-supplied, class-specific lexicons (look-up tables), or by the type of characters found in the word (such as all numeric, alphanumeric or alphabetical). An ordered set of regular expression-like patterns is then evaluated against this sequence of class tokens. If a class token sequence matches a pattern, a corresponding set of actions for that pattern is performed. These actions might include dynamically changing the class of one or more tokens, removing particular tokens from the class token sequence, or modifying the value of the word associated with that token. The remaining patterns are then evaluated against the now modified class token sequence &#8211; in other words, the pattern matcher is re-entrant, and the actions associated with more than one pattern may act on any given token sequence. When the evolving token sequence for a particular record has been tested against all the available patterns, the words in the input string are output into specific fields corresponding to the final class of the tokens associated with each word.</p>
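            <p>The general scheme of class tokens, ordered patterns and actions can be illustrated with a toy Python sketch. It is not a reproduction of the rule syntax or behaviour of any real product: the lexicon, the class token names (T, N, A, X), the single rule and all function names are invented for this example, and fixed token sequences stand in for regular expression-like patterns.</p>

```python
# Hypothetical lexicon mapping words to a class token (T = wayfare type).
LEXICON = {"st": "T", "street": "T", "ave": "T"}


def classify(word):
    """Assign a class token from the lexicon, else from the character types."""
    if word in LEXICON:
        return LEXICON[word]
    if word.isdigit():
        return "N"  # all numeric
    if word.isalpha():
        return "A"  # all alphabetic
    return "X"      # anything else, for example alphanumeric


def st_means_saint(words, classes, i):
    """Action: a leading 'st' in 'st NAME TYPE' is read as 'saint'."""
    words[i] = "saint"
    classes[i] = "A"


# Ordered rules: (pattern over class tokens, action to perform on a match).
RULES = [(("T", "A", "T"), st_means_saint)]


def apply_rules(words, classes):
    """Evaluate each rule in order against the evolving class token sequence."""
    for pattern, action in RULES:
        n = len(pattern)
        for i in range(len(classes) - n + 1):
            if tuple(classes[i:i + n]) == pattern:
                action(words, classes, i)
    return words, classes


words = ["245", "st", "george", "st"]
classes = [classify(word) for word in words]  # ['N', 'T', 'A', 'T']
apply_rules(words, classes)
print(words)    # ['245', 'saint', 'george', 'st']
print(classes)  # ['N', 'A', 'A', 'T']
```

            <p>Because each action changes the class token sequence in place, any later rule in the list is evaluated against the modified sequence, which is the re-entrant behaviour described above.</p>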
            <p>Such approaches necessarily require both an initial and an ongoing investment in rule programming by skilled staff. In order to mitigate this requirement for skilled programming, some investigators have recently described systems which automatically induce rules for information extraction from unstructured text. These include <it>Whisk </it><abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, <it>Nodose </it><abbrgrp><abbr bid="B12">12</abbr></abbrgrp> and <it>Rapier </it><abbrgrp><abbr bid="B13">13</abbr></abbrgrp>.</p>
            <p>Probabilistic methods are an alternative to these deterministic approaches. Statistical models, particularly hidden Markov models, have been used extensively in the computer science fields of speech recognition and natural language processing to help solve problems such as word-sense disambiguation and part-of-speech tagging <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. More recently, hidden Markov and related models have been applied to the problem of extracting structured information from unstructured text <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp>.</p>
            <p>This paper describes an implementation of lexicon-based tokenisation with hidden Markov models for name and address standardisation &#8211; an approach strongly influenced by the work of Borkar <it>et al. </it><abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. This implementation is part of a free, open source <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> record linkage package known as <it>Febrl </it>(Freely extensible biomedical record linkage) <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. <it>Febrl </it>is written in the free, open source, object-oriented programming language <it>Python </it><abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. Other aspects of the <it>Febrl </it>project will be described in subsequent papers.</p>
         </sec>
         <sec>
            <st>
               <p>Cleaning and tokenisation</p>
            </st>
            <p>The following steps are used to clean and tokenise the raw name or address input string. First, all letters are converted to lower case. Various sub-strings in the input string, such as " c/- " or " c.of ", are then converted to their canonical form, such as "care_of ", based on a user-specified and domain-specific substitution table. Similarly, punctuation marks are regularised &#8211; for example, all forms of quotation marks are converted to a single character (a vertical bar). The cleaned string is then split into a vector of words, using white space and punctuation marks as delimiters.</p>
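            <p>The cleaning steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the Febrl implementation: the names SUBSTITUTIONS, QUOTE_CHARS and clean_and_split are hypothetical, the substitution table holds only the two examples from the text, and the keep_punctuation switch is an assumed convenience for choosing whether punctuation marks survive as separate words.</p>

```python
import re

# Hypothetical substitution table; in practice it is user-specified
# and domain-specific.
SUBSTITUTIONS = [
    (" c/- ", " care_of "),
    (" c.of ", " care_of "),
]

# Assumed set of quotation characters that are mapped to a vertical bar.
QUOTE_CHARS = "\"'`"


def clean_and_split(raw, keep_punctuation=False):
    """Lower-case, canonicalise sub-strings, regularise quotes, then split."""
    text = " " + raw.lower() + " "  # pad so sub-strings at the edges match
    for variant, canonical in SUBSTITUTIONS:
        text = text.replace(variant, canonical)
    for ch in QUOTE_CHARS:
        text = text.replace(ch, "|")
    if keep_punctuation:
        # Each punctuation mark becomes a word of its own.
        pattern = r"[a-z0-9_]+|[^\sa-z0-9_]"
    else:
        # White space and punctuation act purely as delimiters.
        pattern = r"[a-z0-9_]+"
    return re.findall(pattern, text)


words = clean_and_split("C/- 17 Macquarie Fields Road, Northmead NSW 2345")
```

            <p>With the default setting, the comma in the example address only separates words and does not appear in the resulting vector.</p>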
            <p>Using look-up tables and some hard-coded rules, the words in this input vector are assigned one or more tokens, to which we will refer as "observation symbols" henceforth. The hard-coded rules include, for example, the assignment of the AN (alphanumeric) observation symbol to all words which are a mixture of alphabetic and numeric characters. However, the majority of observation symbols are assigned by searching for words, or sub-sequences of words, in various look-up tables. A list of the observation symbols currently supported by the <it>Febrl </it>package is given in Table <tblr tid="T3">3</tblr>. For example, one of the look-up tables may be a list of locality names. If a word (or contiguous group of words) is found in the locality table, then the LN (locality name) observation symbol is assigned to that word (or group). This look-up uses a "greedy" matching algorithm, in which the longest matching group of words takes precedence. For example, the wayfare name look-up table might contain a record for "macquarie", the locality qualifier look-up table might contain a record for "fields" and the locality name look-up table might contain a record for "macquarie fields". If the first word in the input vector is "macquarie" and the second word is "fields", these first two words will be coalesced (into "macquarie_fields") and will be assigned an LN (locality name) observation symbol, rather than the first word being assigned a WN (wayfare name) symbol and the second word an LQ (locality qualifier) symbol.</p>
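            <p>The greedy look-up can be sketched in Python as follows. This is a minimal illustration, not the Febrl code: LOOKUP, coded_rule and tokenise are hypothetical names, the table entries are invented to mirror the "macquarie fields" example, all look-up tables are merged into one dictionary for brevity, and each word receives exactly one symbol.</p>

```python
# Hypothetical merged look-up table, keyed by tuples of words.
LOOKUP = {
    ("macquarie",): "WN",
    ("fields",): "LQ",
    ("macquarie", "fields"): "LN",
    ("northmead",): "LN",
    ("road",): "WT",
    ("nsw",): "TR",
    ("2345",): "PC",
}
MAX_WORDS = max(len(key) for key in LOOKUP)


def coded_rule(word):
    """Fallback symbol based only on the characters in the word."""
    if word.isdigit():
        return "N4" if len(word) == 4 else "NU"
    if word.isalnum() and not word.isalpha():
        return "AN"  # mixture of letters and digits
    return "UN"


def tokenise(words):
    """Return (coalesced_words, symbols) using longest-match-first look-up."""
    out_words, symbols = [], []
    i = 0
    while i != len(words):
        # Try the longest candidate group first, then shorter ones.
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in LOOKUP:
                out_words.append("_".join(key))
                symbols.append(LOOKUP[key])
                i += n
                break
        else:
            # No table entry matched, so fall back to the coded rules.
            out_words.append(words[i])
            symbols.append(coded_rule(words[i]))
            i += 1
    return out_words, symbols


example = ["17", "macquarie", "fields", "road", "northmead", "nsw", "2345"]
coalesced, observation_symbols = tokenise(example)
```

            <p>Because the two-word key is tried before either single word, "macquarie" and "fields" are coalesced into "macquarie_fields" and given the LN symbol.</p>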
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Observation symbols currently supported by the <it>Febrl </it>package</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="center">
                        <p>
                           <b>Symbol</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Description</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Usage</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Based on</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>LQ</p>
                     </c>
                     <c ca="left">
                        <p>Locality qualifier words</p>
                     </c>
                     <c ca="left">
                        <p>Addresses</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>LN</p>
                     </c>
                     <c ca="left">
                        <p>Locality (town, suburb) names</p>
                     </c>
                     <c ca="left">
                        <p>Addresses</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>TR</p>
                     </c>
                     <c ca="left">
                        <p>Territory (state, region) names</p>
                     </c>
                     <c ca="left">
                        <p>Addresses</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CR</p>
                     </c>
                     <c ca="left">
                        <p>Country names</p>
                     </c>
                     <c ca="left">
                        <p>Addresses</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>IT</p>
                     </c>
                     <c ca="left">
                        <p>Types of institution</p>
                     </c>
                     <c ca="left">
                        <p>Addresses</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>IN</p>
                     </c>
                     <c ca="left">
                        <p>Names of institutions</p>
                     </c>
                     <c ca="left">
                        <p>Addresses</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PA</p>
                     </c>
                     <c ca="left">
                        <p>Type of postal address</p>
                     </c>
                     <c ca="left">
                        <p>Addresses</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PC</p>
                     </c>
                     <c ca="left">
                        <p>Postal (zip) codes</p>
                     </c>
                     <c ca="left">
                        <p>Addresses</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>UT</p>
                     </c>
                     <c ca="left">
                        <p>Types of housing unit (eg flat, apartment)</p>
                     </c>
                     <c ca="left">
                        <p>Addresses</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>WN</p>
                     </c>
                     <c ca="left">
                        <p>Wayfare names</p>
                     </c>
                     <c ca="left">
                        <p>Addresses</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>WT</p>
                     </c>
                     <c ca="left">
                        <p>Wayfare types (eg street, road, avenue)</p>
                     </c>
                     <c ca="left">
                        <p>Addresses</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>TI</p>
                     </c>
                     <c ca="left">
                        <p>Title words (eg Dr, Prof, Ms)</p>
                     </c>
                     <c ca="left">
                        <p>Names</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>SN</p>
                     </c>
                     <c ca="left">
                        <p>Surnames</p>
                     </c>
                     <c ca="left">
                        <p>Names</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GF</p>
                     </c>
                     <c ca="left">
                        <p>Female given names</p>
                     </c>
                     <c ca="left">
                        <p>Names</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GM</p>
                     </c>
                     <c ca="left">
                        <p>Male given names</p>
                     </c>
                     <c ca="left">
                        <p>Names</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PR</p>
                     </c>
                     <c ca="left">
                        <p>Name prefixes</p>
                     </c>
                     <c ca="left">
                        <p>Names</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>SP</p>
                     </c>
                     <c ca="left">
                        <p>Name qualifiers (eg aka, also known as)</p>
                     </c>
                     <c ca="left">
                        <p>Names</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>BO</p>
                     </c>
                     <c ca="left">
                        <p>"baby of" and similar strings</p>
                     </c>
                     <c ca="left">
                        <p>Names</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>NE</p>
                     </c>
                     <c ca="left">
                        <p>"Nee", "born as" or similar</p>
                     </c>
                     <c ca="left">
                        <p>Names</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>II</p>
                     </c>
                     <c ca="left">
                        <p>One letter words (initials)</p>
                     </c>
                     <c ca="left">
                        <p>Names</p>
                     </c>
                     <c ca="left">
                        <p>Coded rule</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>ST</p>
                     </c>
                     <c ca="left">
                        <p>Saint names (eg Saint George, San Angelo)</p>
                     </c>
                     <c ca="left">
                        <p>Both</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CO</p>
                     </c>
                     <c ca="left">
                        <p>Comma, semi-colon, colon</p>
                     </c>
                     <c ca="left">
                        <p>Both</p>
                     </c>
                     <c ca="left">
                        <p>Coded rule</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>SL</p>
                     </c>
                     <c ca="left">
                        <p>Slash "/" and back-slash "\"</p>
                     </c>
                     <c ca="left">
                        <p>Both</p>
                     </c>
                     <c ca="left">
                        <p>Coded rule</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>N4</p>
                     </c>
                     <c ca="left">
                        <p>Numbers with four digits</p>
                     </c>
                     <c ca="left">
                        <p>Addresses</p>
                     </c>
                     <c ca="left">
                        <p>Coded rule</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>NU</p>
                     </c>
                     <c ca="left">
                        <p>Other numbers</p>
                     </c>
                     <c ca="left">
                        <p>Both</p>
                     </c>
                     <c ca="left">
                        <p>Coded rule</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>AN</p>
                     </c>
                     <c ca="left">
                        <p>Alphanumeric words</p>
                     </c>
                     <c ca="left">
                        <p>Both</p>
                     </c>
                     <c ca="left">
                        <p>Coded rule</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>VB</p>
                     </c>
                     <c ca="left">
                        <p>Brackets, braces, quotes</p>
                     </c>
                     <c ca="left">
                        <p>Both</p>
                     </c>
                     <c ca="left">
                        <p>Coded rule</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>RU</p>
                     </c>
                     <c ca="left">
                        <p>Rubbish</p>
                     </c>
                     <c ca="left">
                        <p>Both</p>
                     </c>
                     <c ca="left">
                        <p>Look-up table</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>UN</p>
                     </c>
                     <c ca="left">
                        <p>Unknown (none of the above)</p>
                     </c>
                     <c ca="left">
                        <p>Both</p>
                     </c>
                     <c ca="left">
                        <p>Coded rule</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>Such lexicon-based tokenisation allows readily available lists of postal codes, locality names, states and territories, as typically published by postal authorities or government gazetteers, to be leveraged to provide the probabilistic model used in the next stage with the maximum number of "hints" about the semantic content of the input string. Note that these probabilistic models are able to cope with situations in which incorrect observation symbols are assigned to particular words in the input string &#8211; the only requirement is that the symbols are assigned in a consistent fashion. For example, the input string "17 macquarie fields road, northmead nsw 2345" might be tokenised as "NU-LN-WT-LN-TR-PC" (number-locality name-wayfare type-locality name-territory-postal code). The first LN symbol is wrong in this context because "macquarie fields" is a wayfare name, not a locality name. The hidden Markov models described in the next section are readily able to accommodate such incorrect tokenisation.</p>
         </sec>
         <sec>
            <st>
               <p>Hidden Markov models</p>
            </st>
            <p>A hidden Markov model (HMM) is a probabilistic finite state machine comprising a set of observable facts or observation symbols (also known as output symbols), a finite set of discrete, unobserved (hidden) states, a matrix of transition probabilities between those hidden states, and a matrix of the probabilities with which each hidden state emits an observation symbol <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. This "emission matrix" is sometimes also called the "observation matrix".</p>
            <p>In the case of residential addresses, we posit that hidden states exist for each segment of an address, such as the wayfare (street) number, the wayfare name, the wayfare type, the locality and so on. We treat the tokenised input address as an ordered sequence of observation symbols, and we assume that each observation symbol has been emitted by one of the hidden address states. In other words, we first replace individual words with tokens which represent a guess (based on look-up tables and simple rules) about the part of the name or address which that word represents. These tokens are our observable facts (observation symbols). We then try to determine by statistical induction which of a large number of possible arrangements of hypothetical "emitters" is most likely to have produced the observed sequence. These hypothetical emitters of observation symbols are the hidden states in our model.</p>
            <p>Training data are representative samples of the input records which have been tokenised into sequences of observation symbols as described above, and then tagged with the hidden state which the trainer thought was most likely to have been responsible for emitting each observation symbol. Maximum likelihood estimates (MLEs) are derived for the HMM transition and emission probability matrices by accumulating frequency counts for each type of state transition and observation symbol from the training records. The probability of making the transition from state <it>i </it>to state <it>j </it>is the number of transitions from state <it>i </it>to state <it>j </it>in the training data divided by the total number of transitions from state <it>i </it>to a subsequent state. Similarly, the probability of observing symbol <it>k </it>given an underlying (hidden) state <it>j </it>is the number of times, in the training data, that symbol <it>k </it>was emitted by state <it>j </it>divided by the total number of symbol emissions by state <it>j</it>. Because of the use of frequency-based MLEs, it is important that the records in the training data set are reasonably representative of the data sets to be standardised. However, as reported below, the HMMs appear to be quite robust with respect to the training set used and quite general with respect to the data sources with which they can be used. As a result, it is quite feasible to add training records which are archetypes of unusual name or address patterns, without compromising the performance of the HMMs on more typical source records.</p>
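            <p>The frequency-count estimates described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: train_hmm and the state names are hypothetical, the two training records are invented, and no smoothing is applied, so any transition or emission that does not occur in the training data receives zero probability.</p>

```python
from collections import Counter, defaultdict


def train_hmm(tagged_records):
    """Estimate transition and emission probabilities from frequency counts.

    Each record is a list of (hidden_state, observation_symbol) pairs.
    Virtual "start" and "end" states are added around every record.
    """
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for record in tagged_records:
        previous = "start"
        for state, symbol in record:
            trans_counts[previous][state] += 1
            emit_counts[state][symbol] += 1
            previous = state
        trans_counts[previous]["end"] += 1

    def normalise(counts):
        # Divide each count by the total for its row (the "from" state
        # or the emitting state).
        return {
            row: {col: n / sum(cols.values()) for col, n in cols.items()}
            for row, cols in counts.items()
        }

    return normalise(trans_counts), normalise(emit_counts)


# Two invented training records, tagged by hand.
records = [
    [("wayfare_number", "NU"), ("wayfare_name", "UN"),
     ("wayfare_type", "WT"), ("locality_name", "LN")],
    [("wayfare_number", "NU"), ("wayfare_name", "LN"),
     ("wayfare_type", "WT"), ("locality_name", "LN"),
     ("postal_code", "PC")],
]
transitions, emissions = train_hmm(records)
```

            <p>In this toy data set the wayfare name state emits UN once and LN once, so each of those emission probabilities is estimated as 0.5.</p>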
            <p>The trained HMM can then be used to determine which sequence of hidden states was most likely to have emitted the observed sequence of symbols. In an ergodic (fully connected) HMM, in which each state can be reached from every other state, if there are <it>N </it>states and <it>T </it>observation symbols in a given sequence, then there are <it>N</it><sup><it>T </it></sup>different paths through the model. Even with quite simple models and input sequences, it is computationally infeasible to evaluate the probability of every path to find the most likely one. Fortunately, the Viterbi algorithm <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> provides an efficient method for pruning the number of probability calculations needed to find the most likely path through the model.</p>
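            <p>A compact Python sketch of the Viterbi algorithm is given here for illustration. It is not the Febrl implementation: the function and state names are hypothetical, the transition probabilities are the non-zero entries of Table 4, and the emission probabilities in EMIT are invented for this example because they are not taken from the paper.</p>

```python
STATES = ["wayfare_number", "wayfare_name", "wayfare_type",
          "locality_name", "territory", "postal_code"]

# Non-zero transition probabilities from Table 4.
TRANS = {
    "start": {"wayfare_number": 0.9, "wayfare_name": 0.08, "locality_name": 0.02},
    "wayfare_number": {"wayfare_number": 0.05, "wayfare_name": 0.95},
    "wayfare_name": {"wayfare_name": 0.03, "wayfare_type": 0.95, "locality_name": 0.02},
    "wayfare_type": {"locality_name": 0.95, "territory": 0.03, "postal_code": 0.02},
    "locality_name": {"locality_name": 0.02, "territory": 0.4, "postal_code": 0.4, "end": 0.18},
    "territory": {"postal_code": 0.8, "end": 0.2},
    "postal_code": {"territory": 0.1, "end": 0.9},
}

# Invented emission probabilities (assumption, for illustration only).
EMIT = {
    "wayfare_number": {"NU": 0.9, "AN": 0.1},
    "wayfare_name": {"UN": 0.6, "LN": 0.3, "WN": 0.1},
    "wayfare_type": {"WT": 0.95, "UN": 0.05},
    "locality_name": {"LN": 0.9, "UN": 0.1},
    "territory": {"TR": 0.95, "UN": 0.05},
    "postal_code": {"PC": 0.9, "N4": 0.1},
}


def viterbi(symbols, states, trans, emit):
    """Return (probability, path) for the most likely hidden state sequence.

    symbols must be non-empty. Missing table entries count as zero.
    "start" and "end" are virtual states that emit nothing.
    """
    def p(table, row, col):
        return table.get(row, {}).get(col, 0.0)

    # best[s] holds (probability, path) of the best path ending in state s.
    best = {
        s: (p(trans, "start", s) * p(emit, s, symbols[0]), [s])
        for s in states
    }
    for symbol in symbols[1:]:
        new_best = {}
        for s in states:
            # Keep only the best predecessor of s; all others are pruned.
            prob, path = max(
                ((best[r][0] * p(trans, r, s), best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * p(emit, s, symbol), path + [s])
        best = new_best
    # Close every surviving path with a transition to the virtual end state.
    return max(
        ((best[s][0] * p(trans, s, "end"), best[s][1]) for s in states),
        key=lambda item: item[0],
    )


probability, path = viterbi(["NU", "LN", "WT", "LN", "TR", "PC"],
                            STATES, TRANS, EMIT)
```

            <p>Because only the best path into each state is retained at every step, the work grows in proportion to <it>T </it>multiplied by <it>N</it><sup>2</sup> rather than <it>N</it><sup><it>T</it></sup>. With these illustrative probabilities, the first LN symbol is attributed to the wayfare name state and the second to the locality name state.</p>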
            <p>Once found, the most likely path through the HMM can then be used to associate each word in the original input string with a hidden state, and this information is then used to segment the input string into atomic data elements like those illustrated in Table <tblr tid="T2">2</tblr>. This approach can also be used with names or other variably formatted text, using different sets of hidden states and observation symbols, and different transition and emission matrices.</p>
            <p>Figure <figr fid="F1">1</figr> shows a simplified HMM for addresses with eight states. The <it>start </it>and <it>end </it>states are both virtual states as they do not emit any observation symbols. The probabilities of transition from one state to another are shown by the arrows (transitions with zero probabilities are omitted for the sake of clarity). The illustrative transition and emission probability matrices for this model are shown in Tables <tblr tid="T4">4</tblr> and <tblr tid="T5">5</tblr>.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Graph of a simplified, illustrative HMM for addresses with eight states</p>
               </caption>
               <text>
                  <p><b>Graph of a simplified, illustrative HMM for addresses with eight states </b>Rectangular nodes denote hidden states. Numbers indicate the probabilities of transitions between states, represented by the edges (arrowed lines). Transitions with zero probability are not shown in the interests of clarity.</p>
               </text>
               <graphic file="1472-6947-2-9-1"/>
            </fig>
            <tbl id="T4">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Transition probability matrix for simplified, illustrative model</p>
               </caption>
               <tblbdy cols="9">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="8" ca="center">
                        <p>
                           <b>To state</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>From state</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Start</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Wayfare Number</p>
                     </c>
                     <c ca="left">
                        <p>Wayfare Name</p>
                     </c>
                     <c ca="left">
                        <p>Wayfare Type</p>
                     </c>
                     <c ca="left">
                        <p>Locality Name</p>
                     </c>
                     <c ca="left">
                        <p>Territory</p>
                     </c>
                     <c ca="left">
                        <p>Postal Code</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>End</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>Start</it>
                        </p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0.9</p>
                     </c>
                     <c ca="right">
                        <p>0.08</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0.02</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Wayfare Number</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0.05</p>
                     </c>
                     <c ca="right">
                        <p>0.95</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Wayfare Name</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0.03</p>
                     </c>
                     <c ca="right">
                        <p>0.95</p>
                     </c>
                     <c ca="right">
                        <p>0.02</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Wayfare Type</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0.95</p>
                     </c>
                     <c ca="right">
                        <p>0.03</p>
                     </c>
                     <c ca="right">
                        <p>0.02</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Locality Name</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0.02</p>
                     </c>
                     <c ca="right">
                        <p>0.4</p>
                     </c>
                     <c ca="right">
                        <p>0.4</p>
                     </c>
                     <c ca="right">
                        <p>0.18</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Territory</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0.8</p>
                     </c>
                     <c ca="right">
                        <p>0.2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Postal Code</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0.1</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0.9</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>End</it>
                        </p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Table cells contain probabilities of transition from the state listed at the left of each row to the state identified at the top of each column.</p>
               </tblfn>
            </tbl>
            <tbl id="T5">
               <title>
                  <p>Table 5</p>
               </title>
               <caption>
                  <p>Emission probability matrix for a simplified, illustrative model</p>
               </caption>
               <tblbdy cols="9">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="8" ca="center">
                        <p>
                           <b>State</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Observation Symbol</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Start</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Wayfare Number</p>
                     </c>
                     <c ca="left">
                        <p>Wayfare Name</p>
                     </c>
                     <c ca="left">
                        <p>Wayfare Type</p>
                     </c>
                     <c ca="left">
                        <p>Locality Name</p>
                     </c>
                     <c ca="left">
                        <p>Territory</p>
                     </c>
                     <c ca="left">
                        <p>Postal Code</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>End</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>NU</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="right">
                        <p>0.9</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.1</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>WN</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.5</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.1</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>WT</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.92</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>LN</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.1</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.8</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>TR</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.07</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.94</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PC</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="right">
                        <p>0.04</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.85</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>UN</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="right">
                        <p>0.02</p>
                     </c>
                     <c ca="right">
                        <p>0.31</p>
                     </c>
                     <c ca="right">
                        <p>0.03</p>
                     </c>
                     <c ca="right">
                        <p>0.06</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="right">
                        <p>0.01</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Table cells contain probabilities that the state identified at the top of each column will emit the observation symbol listed at the left of each row.</p>
               </tblfn>
            </tbl>
            <p>Notice that the probabilities in each row of the transition matrix (apart from the terminal <it>End </it>state, which has no outgoing transitions) and in each column of the emission matrix add up to one. Also notice that none of the probabilities in the emission matrix are zero. In practice, it is common for some combinations of state and observation symbol not to appear in the training data, resulting in a maximum likelihood estimate of zero for that element of the emission matrix. Such zero probabilities can cause problems when the model is presented with new data, because any state sequence that requires an unseen combination is itself assigned a probability of zero, so smoothing techniques are used to assign small probabilities (in this case 0.01) to all unencountered observation symbols for all states. Traditionally, Laplace smoothing is used <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>, but Borkar <it>et al. </it>have also described the use of absolute discounting as an alternative when there is a large number of distinct observation symbols <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. The <it>Febrl </it>package offers both types of smoothing.</p>
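            <p>As a rough illustration of add-one (Laplace) smoothing, the following Python sketch converts the observation counts for a single hidden state into emission probabilities. The function name, the <it>alpha </it>pseudo-count parameter and the training counts are assumptions made for this example; they are not taken from the <it>Febrl </it>source code, and the round value of 0.01 used in Table <tblr tid="T5">5</tblr> is illustrative rather than the output of this formula.</p>

```python
from collections import Counter

# Observation symbols of the simplified address model (rows of Table 5).
SYMBOLS = ["NU", "WN", "WT", "LN", "TR", "PC", "UN"]


def laplace_emissions(observed, symbols=SYMBOLS, alpha=1.0):
    """Return smoothed emission probabilities for one hidden state.

    observed: observation symbols seen with this state in the training data.
    alpha:    pseudo-count added to every symbol (1.0 gives add-one smoothing).
    """
    counts = Counter(observed)
    total = sum(counts[s] for s in symbols) + alpha * len(symbols)
    return {s: (counts[s] + alpha) / total for s in symbols}


# Hypothetical training tags for the Postal Code state: 93 x PC and 7 x NU.
probs = laplace_emissions(["PC"] * 93 + ["NU"] * 7)
print(round(probs["PC"], 4))            # most of the mass stays on PC
print(round(probs["WT"], 4))            # never observed, yet non-zero
print(round(sum(probs.values()), 9))    # the column still sums to one
```

            <p>With these hypothetical counts, every symbol that was never observed with the state receives a probability of 1/107, and the smoothed values still sum to one.</p>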
            <p>Now consider an example address: "17 Epping St Smithfield New South Wales 2987". This would first be cleaned and tokenised into a list of words, and each word would then be assigned an observation symbol, giving the following two lists.</p>
            <p>['17', 'epping', 'street', 'smithfield', 'nsw', '2987']</p>
            <p>['NU', 'LN', 'WT', 'LN', 'TR', 'PC']</p>
            <p>Note that Epping is a suburb of the city of Sydney in the state of New South Wales, Australia, hence the word "epping" in the input string is assigned an LN (locality name) observation symbol even though to a human observer it is clearly a wayfare name in this context. This does not matter because we are ultimately not interested in the types of the observed symbols but rather in the underlying hidden states which were most likely to have generated them.</p>
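            <p>The cleaning, tokenising and tagging steps for this address can be sketched in Python as follows. This is a minimal sketch only: the correction list, the look-up table and the rule that a four-digit number is tagged as a postal code are hypothetical stand-ins chosen to reproduce the two lists above, not the tables or rules shipped with <it>Febrl</it>.</p>

```python
# Hypothetical correction list: (pattern, replacement) pairs applied to the
# lower-cased, space-padded input. Longer phrases are listed first.
CORRECTIONS = [
    ("new south wales", "nsw"),
    (" st ", " street "),
]

# Hypothetical look-up table mapping known words to observation symbols.
LOOKUP = {
    "street": "WT",
    "epping": "LN",
    "smithfield": "LN",
    "nsw": "TR",
}


def clean_and_tokenise(address):
    """Lower-case the address, apply the corrections and split on spaces."""
    text = " " + " ".join(address.lower().split()) + " "
    for pattern, replacement in CORRECTIONS:
        text = text.replace(pattern, replacement)
    return text.split()


def tag(tokens):
    """Assign one observation symbol to each token."""
    symbols = []
    for token in tokens:
        if token in LOOKUP:
            symbols.append(LOOKUP[token])
        elif token.isdigit() and len(token) == 4:
            symbols.append("PC")   # assumption: four digits look like a postal code
        elif token.isdigit():
            symbols.append("NU")   # any other number
        else:
            symbols.append("UN")   # unknown word
    return symbols


tokens = clean_and_tokenise("17 Epping St Smithfield New South Wales 2987")
print(tokens)
print(tag(tokens))
```

            <p>Because "epping" is found in the locality look-up table, it is tagged LN regardless of its position in the address; resolving that ambiguity is left to the HMM.</p>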
            <p>Even in this very simple model there are 8<sup>6 </sup>= 262,144 possible combinations of hidden states which could have generated this observed sequence of symbols &#8211; such as the following sequence of states (with the corresponding observation symbols in brackets):</p>
            <p><it>Start </it>-> Wayfare Name (NU) -> Locality Name (LN) -> Postal Code (WT) -> Territory (LN) -> Postal Code (TR) -> Territory (PC) -> <it>End</it></p>
            <p>Common sense tells us that this sequence of hidden states is a very unlikely explanation for the observed symbols. From our HMM, the probability of this sequence is indeed rather small (emission probabilities are underlined):</p>
            <p>0.08 &#215; <ul>0.01</ul> &#215; 0.02 &#215; <ul>0.8</ul> &#215; 0.4 &#215; <ul>0.01</ul> &#215; 0.1 &#215; <ul>0.01</ul> &#215; 0.8 &#215; <ul>0.01</ul> &#215; 0.1 &#215; <ul>0.01</ul> &#215; 0.2 = 8.19 &#215; 10<sup>-17</sup></p>
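            <p>This product can be checked with a short Python sketch that walks along the state sequence and multiplies each transition probability by the corresponding emission probability, finishing with the transition into <it>End</it>. Only the matrix entries quoted in the product are entered; the dictionary layout and the function name are assumptions made for this example.</p>

```python
# Transition probabilities quoted in the product above: (from, to) -> value.
TRANSITION = {
    ("Start", "Wayfare Name"): 0.08,
    ("Wayfare Name", "Locality Name"): 0.02,
    ("Locality Name", "Postal Code"): 0.4,
    ("Postal Code", "Territory"): 0.1,
    ("Territory", "Postal Code"): 0.8,
    ("Territory", "End"): 0.2,
}

# Emission probabilities quoted in the product above: (state, symbol) -> value.
EMISSION = {
    ("Wayfare Name", "NU"): 0.01,
    ("Locality Name", "LN"): 0.8,
    ("Postal Code", "WT"): 0.01,
    ("Territory", "LN"): 0.01,
    ("Postal Code", "TR"): 0.01,
    ("Territory", "PC"): 0.01,
}


def path_probability(states, symbols):
    """Joint probability of a hidden state path and the observed symbols."""
    prob = 1.0
    previous = "Start"
    for state, symbol in zip(states, symbols):
        prob *= TRANSITION[(previous, state)] * EMISSION[(state, symbol)]
        previous = state
    # Every path is closed by a transition into the End state.
    return prob * TRANSITION[(previous, "End")]


states = ["Wayfare Name", "Locality Name", "Postal Code",
          "Territory", "Postal Code", "Territory"]
symbols = ["NU", "LN", "WT", "LN", "TR", "PC"]
p = path_probability(states, symbols)
print(f"{p:.2e}")   # 8.19e-17
```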
            <p>The following sequence of hidden states is a more plausible explanation for the observed symbols:</p>
            <p><it>Start </it>-> Wayfare Number (NU) -> Wayfare Name (LN) -> Wayfare Type (WT) -> Locality Name (LN) -> Territory (TR) -> Postal Code (PC) -> <it>End</it></p>
            <p>In fact, according to our simple HMM, this sequence has the greatest probability of all 262,144 possible sequences of hidden states for the observed symbols, and is therefore the most likely explanation for the input sequence of observation symbols:</p>
            <p>0.9 &#215; <ul>0.9</ul> &#215; 0.95 &#215; <ul>0.1</ul> &#215; 0.95 &#215; <ul>0.92</ul> &#215; 0.95 &#215; <ul>0.8</ul> &#215; 0.4 &#215; <ul>0.94</ul> &#215; 0.8 &#215; <ul>0.85</ul> &#215; 0.9 = 1.18 &#215; 10<sup>-2</sup></p>
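            <p>The most probable state sequence does not have to be found by enumerating every candidate. A standard dynamic programming technique, the Viterbi algorithm, keeps only the best partial path into each state at each position. The Python sketch below applies it to the simplified model. The emission values are those of Table <tblr tid="T5">5</tblr>. For the transition rows of <it>Start</it>, Wayfare Number and Wayfare Name, only the entries quoted in the two worked products are entered and the small remaining probability mass of those rows is left out, which is an assumption of this sketch; all names in the code are likewise illustrative and do not come from <it>Febrl</it>.</p>

```python
STATES = ["Wayfare Number", "Wayfare Name", "Wayfare Type",
          "Locality Name", "Territory", "Postal Code"]
SYMBOLS = ["NU", "WN", "WT", "LN", "TR", "PC", "UN"]

# Transition probabilities; pairs that are not listed are treated as zero.
# ASSUMPTION: the first three rows hold only the values quoted in the text.
TRANS = {
    "Start": {"Wayfare Number": 0.9, "Wayfare Name": 0.08},
    "Wayfare Number": {"Wayfare Name": 0.95},
    "Wayfare Name": {"Wayfare Type": 0.95, "Locality Name": 0.02},
    "Wayfare Type": {"Locality Name": 0.95, "Territory": 0.03,
                     "Postal Code": 0.02},
    "Locality Name": {"Locality Name": 0.02, "Territory": 0.4,
                      "Postal Code": 0.4, "End": 0.18},
    "Territory": {"Postal Code": 0.8, "End": 0.2},
    "Postal Code": {"Territory": 0.1, "End": 0.9},
}

# Emission probabilities (columns of Table 5), listed in SYMBOLS order.
EMIT_ROWS = {
    "Wayfare Number": [0.9, 0.01, 0.01, 0.01, 0.01, 0.04, 0.02],
    "Wayfare Name": [0.01, 0.5, 0.01, 0.1, 0.07, 0.01, 0.31],
    "Wayfare Type": [0.01, 0.01, 0.92, 0.01, 0.01, 0.01, 0.03],
    "Locality Name": [0.01, 0.1, 0.01, 0.8, 0.01, 0.01, 0.06],
    "Territory": [0.01, 0.01, 0.01, 0.01, 0.94, 0.01, 0.01],
    "Postal Code": [0.1, 0.01, 0.01, 0.01, 0.01, 0.85, 0.01],
}
EMIT = {state: dict(zip(SYMBOLS, row)) for state, row in EMIT_ROWS.items()}


def viterbi(symbols):
    """Return (probability, state path) of the best explanation of symbols."""
    # best[state] holds the best (probability, path) that ends in that state.
    best = {}
    for state in STATES:
        prob = TRANS["Start"].get(state, 0.0) * EMIT[state][symbols[0]]
        best[state] = (prob, [state])
    for symbol in symbols[1:]:
        new_best = {}
        for state in STATES:
            new_best[state] = max(
                ((prob * TRANS[prev].get(state, 0.0) * EMIT[state][symbol],
                  path + [state])
                 for prev, (prob, path) in best.items()),
                key=lambda item: item[0],
            )
        best = new_best
    # Close each surviving path with its transition into the End state.
    return max(
        ((prob * TRANS[state].get("End", 0.0), path)
         for state, (prob, path) in best.items()),
        key=lambda item: item[0],
    )


prob, path = viterbi(["NU", "LN", "WT", "LN", "TR", "PC"])
print(" -> ".join(path))
print(f"{prob:.2e}")   # 1.18e-02
```

            <p>Under these assumptions the sketch returns the same state sequence and probability as the worked product, while examining only a few dozen partial paths per token instead of all 262,144 complete sequences.</p>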
            <p>It is then a simple matter to use this information to segment the cleaned version of the input string into address elements and output them, as shown in Table <tblr tid="T6">6</tblr>.</p>
            <tbl id="T6">
               <title>
                  <p>Table 6</p>
               </title>
               <caption>
                  <p>Example address elements output by a simplified, illustrative model</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="center">
                        <p>
                           <b>Original Words</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Observation Symbol</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Hidden State</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Output Value</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>17</p>
                     </c>
                     <c ca="center">
                        <p>NU</p>
                     </c>
                     <c ca="center">
                        <p>Wayfare Number</p>
                     </c>
                     <c ca="center">
                        <p>17</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Epping</p>
                     </c>
                     <c ca="center">
                        <p>LN</p>
                     </c>
                     <c ca="center">
                        <p>Wayfare Name</p>
                     </c>
                     <c ca="center">
                        <p>epping</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>St</p>
                     </c>
                     <c ca="center">
                        <p>WT</p>
                     </c>
                     <c ca="center">
                        <p>Wayfare Type</p>
                     </c>
                     <c ca="center">
                        <p>street</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Smithfield</p>
                     </c>
                     <c ca="center">
                        <p>LN</p>
                     </c>
                     <c ca="center">
                        <p>Locality Name</p>
                     </c>
                     <c ca="center">
                        <p>smithfield</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>New South Wales</p>
                     </c>
                     <c ca="center">
                        <p>TR</p>
                     </c>
                     <c ca="center">
                        <p>Territory</p>
                     </c>
                     <c ca="center">
                        <p>nsw</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>2987</p>
                     </c>
                     <c ca="center">
                        <p>PC</p>
                     </c>
                     <c ca="center">
                        <p>Postal Code</p>
                     </c>
                     <c ca="center">
                        <p>2987</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
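            <p>The segmentation step summarised in Table <tblr tid="T6">6</tblr> amounts to grouping the cleaned tokens by the hidden state assigned to each of them. A minimal Python sketch is given below; the function name and the use of the state names as output field names are assumptions made for illustration, and the output record produced by <it>Febrl </it>may be organised differently.</p>

```python
def segment(tokens, states):
    """Group cleaned tokens by hidden state, keeping their original order."""
    fields = {}
    for token, state in zip(tokens, states):
        fields.setdefault(state, []).append(token)
    # Tokens that share a state are joined into a single output value.
    return {state: " ".join(words) for state, words in fields.items()}


tokens = ["17", "epping", "street", "smithfield", "nsw", "2987"]
states = ["Wayfare Number", "Wayfare Name", "Wayfare Type",
          "Locality Name", "Territory", "Postal Code"]

fields = segment(tokens, states)
for state, value in fields.items():
    print(f"{state}: {value}")
```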
            <p>Further details of the way in which HMMs are implemented in the <it>Febrl </it>package are available in the associated documentation <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. The hidden states used in the name and address HMMs are shown in Tables <tblr tid="T7">7</tblr> and <tblr tid="T8">8</tblr> respectively. These hidden states, and the observation symbols listed in Table <tblr tid="T3">3</tblr>, were derived heuristically from <it>AutoStan </it>tokens and rules developed previously by two of the authors (TC and KL) for use with Australian names and residential addresses. Figures <figr fid="F2">2</figr> and <figr fid="F3">3</figr> show directed graphs of these models. Currently, the observation symbols and hidden states are "hard-coded" into the <it>Febrl </it>software package, although they can be altered by editing the freely available source code. Future versions of the package will use "soft-coded" observation symbols and hidden states, allowing users in other countries to adapt the HMMs for other types of name and address information, or indeed for quite different information extraction tasks, without the need for Python programming skills.</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Graph of the name standardisation HMM evaluated in this study</p>
               </caption>
               <text>
                  <p><b>Graph of the name standardisation HMM evaluated in this study </b>Rectangular nodes denote hidden states. Numbers indicate the probabilities of transitions between states, represented by the edges (arrowed lines). States which were not used and transitions which had a zero probability in the evaluation have been suppressed in the interests of clarity. Prepared with the Graphviz tool <url>http://www.research.att.com/sw/tools/graphviz/</url>.</p>
               </text>
               <graphic file="1472-6947-2-9-2"/>
            </fig>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Graph of the address standardisation HMM evaluated in this study</p>
               </caption>
               <text>
                  <p><b>Graph of the address standardisation HMM evaluated in this study </b>Rectangular nodes denote hidden states. Numbers indicate the probabilities of transitions between states, represented by the edges (arrowed lines). States which were not used and transitions which had a zero probability in the evaluation have been suppressed in the interests of clarity. Prepared with the Graphviz tool <url>http://www.research.att.com/sw/tools/graphviz/</url>.</p>
               </text>
               <graphic file="1472-6947-2-9-3"/>
            </fig>
            <tbl id="T7">
               <title>
                  <p>Table 7</p>
               </title>
               <caption>
                  <p>Hidden states for name standardisation currently supported by the <it>Febrl </it>package</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="left">
                        <p>
                           <b>Hidden State</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Description</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>titl</p>
                     </c>
                     <c ca="left">
                        <p>Title (<it>Mr, Ms, Dr etc</it>) state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>baby</p>
                     </c>
                     <c ca="left">
                        <p>State for <it>baby of</it>, <it>son of </it>or <it>daughter of</it></p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>knwn</p>
                     </c>
                     <c ca="left">
                        <p>State for <it>known as</it></p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>andor</p>
                     </c>
                     <c ca="left">
                        <p>State for <it>and </it>or <it>or</it></p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>gname1</p>
                     </c>
                     <c ca="left">
                        <p>First given name state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>gname2</p>
                     </c>
                     <c ca="left">
                        <p>Second given name state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>ghyph</p>
                     </c>
                     <c ca="left">
                        <p>Given name hyphen state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>gopbr</p>
                     </c>
                     <c ca="left">
                        <p>Given name opening bracket or quote state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>gclbr</p>
                     </c>
                     <c ca="left">
                        <p>Given name closing bracket or quote state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>agname1</p>
                     </c>
                     <c ca="left">
                        <p>First alternative given name state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>agname2</p>
                     </c>
                     <c ca="left">
                        <p>Second alternative given name state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>coma</p>
                     </c>
                     <c ca="left">
                        <p>State for commas, semi-colons etc</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>sname1</p>
                     </c>
                     <c ca="left">
                        <p>First surname state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>sname2</p>
                     </c>
                     <c ca="left">
                        <p>Second surname state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>shyph</p>
                     </c>
                     <c ca="left">
                        <p>Surname hyphen state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>sopbr</p>
                     </c>
                     <c ca="left">
                        <p>Surname opening bracket or quote state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>sclbr</p>
                     </c>
                     <c ca="left">
                        <p>Surname closing bracket or quote state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>asname1</p>
                     </c>
                     <c ca="left">
                        <p>First alternative surname state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>asname2</p>
                     </c>
                     <c ca="left">
                        <p>Second alternative surname state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>pref1</p>
                     </c>
                     <c ca="left">
                        <p>First name prefix state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>pref2</p>
                     </c>
                     <c ca="left">
                        <p>Second name prefix state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>rubb</p>
                     </c>
                     <c ca="left">
                        <p>State for residual elements</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <tbl id="T8">
               <title>
                  <p>Table 8</p>
               </title>
               <caption>
                  <p>Hidden states for address standardisation currently supported by the <it>Febrl </it>package</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="center">
                        <p>
                           <b>Hidden state</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Description (examples in bold italics)</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>wfnu</p>
                     </c>
                     <c ca="left">
                        <p>Wayfare number state (<b><it>23 </it></b>Sherlock Holmes Street, Potingu West NSW 2876)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>wfna1</p>
                     </c>
                     <c ca="left">
                        <p>First wayfare name state (23 <b><it>Sherlock </it></b>Holmes Street, Potingu West NSW 2876)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>wfna2</p>
                     </c>
                     <c ca="left">
                        <p>Second wayfare name state (23 Sherlock <b><it>Holmes </it></b>Street, Potingu West NSW 2876)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>wfql</p>
                     </c>
                     <c ca="left">
                        <p>Wayfare qualifier state (23 Sherlock Holmes Street <b><it>South</it></b>, Potingu West NSW 2876)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>wfty</p>
                     </c>
                     <c ca="left">
                        <p>Wayfare type state (23 Sherlock Holmes <b><it>Street</it></b>, Potingu West NSW 2876)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>unnu</p>
                     </c>
                     <c ca="left">
                        <p>Unit number state (Flat <b><it>5 </it></b>23 Sherlock Holmes Street, Potingu West NSW 2876)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>unty</p>
                     </c>
                     <c ca="left">
                        <p>Unit type state (<b><it>Flat </it></b>5 23 Sherlock Holmes Street, Potingu West NSW 2876)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>prna1</p>
                     </c>
                     <c ca="left">
                        <p>First property name state (<b><it>Emoh </it></b>Ruo, Patonga Road, Potingu West NSW 2876)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>prna2</p>
                     </c>
                     <c ca="left">
                        <p>Second property name state (Emoh <b><it>Ruo</it></b>, Patonga Road, Potingu West NSW 2876)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>inna1</p>
                     </c>
                     <c ca="left">
                        <p>First institution name state (<b><it>Lost </it></b>Dogs Home, Patonga Road, Potingu West NSW 2876)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>inna2</p>
                     </c>
                     <c ca="left">
                        <p>Second institution name state (Lost <b><it>Dogs </it></b>Home, Patonga Road, Potingu West NSW 2876)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>inty</p>
                     </c>
                     <c ca="left">
                        <p>Institution type state (Lost Dogs <b><it>Home</it></b>, Patonga Road, Potingu West NSW 2876)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>panu</p>
                     </c>
                     <c ca="left">
                        <p>Postal address number state (Roadside Mailbox <b><it>234</it></b>, Patonga Road, Potingu West NSW 2876)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>paty</p>
                     </c>
                     <c ca="left">
                        <p>Type of postal address state (<b><it>Roadside Mailbox </it></b>234, Patonga Road, Potingu West NSW 2876)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>hyph</p>
                     </c>
                     <c ca="left">
                        <p>Hyphen state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>sla</p>
                     </c>
                     <c ca="left">
                        <p>Slash state (5<b><it>/</it></b>23 Sherlock Holmes Street, Potingu West NSW 2876)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>coma</p>
                     </c>
                     <c ca="left">
                        <p>Comma, semi-colon etc state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>opbr</p>
                     </c>
                     <c ca="left">
                        <p>State for opening bracket or quote</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>clbr</p>
                     </c>
                     <c ca="left">
                        <p>State for closing bracket or quote</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>loc1</p>
                     </c>
                     <c ca="left">
                        <p>First locality name state (5/23 Sherlock Holmes Street, <b><it>Potingu </it></b>West NSW 2876)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>loc2</p>
                     </c>
                     <c ca="left">
                        <p>Second locality name state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>locql</p>
                     </c>
                     <c ca="left">
                        <p>Locality qualifier state (5/23 Sherlock Holmes Street, Potingu <b><it>West </it></b>NSW 2876)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>pc</p>
                     </c>
                     <c ca="left">
                        <p>Postal code state (5/23 Sherlock Holmes Street, Potingu West NSW <b><it>2876</it></b>)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>ter1</p>
                     </c>
                     <c ca="left">
                        <p>First territory name state (5/23 Sherlock Holmes Street, Potingu West <b><it>NSW </it></b>2876)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>ter2</p>
                     </c>
                     <c ca="left">
                        <p>Second territory name state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>cntr1</p>
                     </c>
                     <c ca="left">
                        <p>First country name state (5/23 Sherlock Holmes Street, Potingu West NSW 2876, <b><it>Australia</it></b>)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>cntr2</p>
                     </c>
                     <c ca="left">
                        <p>Second country name state</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>rubb</p>
                     </c>
                     <c ca="left">
                        <p>State for residual elements</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
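            <p>As an illustration of how a decoded state sequence could be turned into output fields, the following minimal sketch groups the tokens of one of the example addresses by state. The state names are those listed in the table above; the output field names, the <it>STATE_TO_FIELD </it>mapping and the <it>assign_fields </it>function are assumptions made for this example and are not taken from the <it>Febrl </it>implementation.</p>

```python
from collections import defaultdict

# Assumed mapping from HMM states (as named in the table above) to output
# fields. The field names are hypothetical. States not listed here, such as
# the punctuation states hyph, sla and coma, are simply dropped in this sketch.
STATE_TO_FIELD = {
    "unty": "unit_type", "unnu": "unit_number",
    "wfnu": "wayfare_number", "wfna1": "wayfare_name", "wfna2": "wayfare_name",
    "wfty": "wayfare_type", "wfql": "wayfare_qualifier",
    "loc1": "locality_name", "loc2": "locality_name",
    "locql": "locality_qualifier",
    "ter1": "territory", "ter2": "territory", "pc": "postcode",
}

def assign_fields(tokens, states):
    """Group tokens into output fields according to their decoded states."""
    if len(tokens) != len(states):
        raise ValueError("tokens and states must have the same length")
    fields = defaultdict(list)
    for token, state in zip(tokens, states):
        field = STATE_TO_FIELD.get(state)  # None for unmapped states
        if field is not None:
            fields[field].append(token)
    return {name: " ".join(parts) for name, parts in fields.items()}

# "5/23 Sherlock Holmes Street, Potingu West NSW 2876", tagged as in the table
tokens = ["5", "/", "23", "sherlock", "holmes", "street", ",",
          "potingu", "west", "nsw", "2876"]
states = ["unnu", "sla", "wfnu", "wfna1", "wfna2", "wfty", "coma",
          "loc1", "locql", "ter1", "pc"]
record = assign_fields(tokens, states)
```

            <p>In this sketch the two wayfare name states feed a single output field, so that multi-word street names are reassembled in their original order.</p>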
         </sec>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <p>We evaluated the performance of the approach described above with typical Australian residential address data using two data sources.</p>
         <p>The first source was a set of approximately 1 million addresses taken from uncorrected electronic copies of death certificates as completed by medical practitioners and coroners in the state of New South Wales (NSW) in the years 1988 to 2002. The majority of these data were entered from hand-written death certificate forms. The information systems into which the data were entered underwent a number of changes during this period.</p>
         <p>The second data set was a random sample of 1,000 records of residential addresses drawn from the NSW Inpatient Statistics Collection for the years 1993 to 2001 <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. This collection contains abstracts for every admission to a public- or private-sector acute care hospital in NSW. Most of the data were extracted from a variety of computerised hospital information systems, with a small proportion entered from paper forms.</p>
         <p>Accuracy measurements for name standardisation were conducted using a subset of the NSW Midwives Data Collection (MDC) <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. This subset contained 962,776 records for women who had given birth in New South Wales, Australia, over a ten-year period (1990&#8211;2000). Most of these data were entered from hand-written forms, although some of the data for the later years were extracted directly from computerised obstetric information systems.</p>
         <p>Access to these data sets for the purpose of this project was approved by the Australian National University Human Research Ethics Committee and by the relevant data custodians within the NSW Department of Health. The data sets used in this project were held on secure computing facilities at the Australian National University and the NSW Department of Health head offices. In order to minimise the invasion of privacy which is necessarily associated with almost all research use of identified data, the medical and health status details were removed from the files used in this project. Thus, for this project the investigators had access to files of names and addresses, but not to any of the medical or other details for the individuals identified in those files, other than the fact that they had died or had given birth.</p>
         <sec>
            <st>
               <p>Address standardisation</p>
            </st>
            <p>Training of HMMs for residential address standardisation was performed by a process of iterative refinement.</p>
            <p>An initial hidden Markov model (HMM) was trained using 100 randomly selected death certificate (DC) records. Annotating these records with state and observation symbol information took less than one person-hour. The resulting model was used to process 1,100 randomly chosen DC records. These records then became a second-stage training set, with each record already annotated with states and observation symbols derived from the initial model. This annotation was manually checked and corrected where necessary, which took about 5 person-hours. An HMM derived from this second training set was then used to standardise 50,000 randomly chosen DC records, and records with unusual patterns of observation symbols (with a frequency of six or less) were examined, corrected and added to the training set if the results produced by the second-stage HMM were incorrect. A new HMM was then derived from this augmented training set and the process repeated a further three times, resulting in the addition of approximately 250 "atypical" training records (bringing the total number of training records to 1,450). The HMM which emerged from this process, designated HMM1, was used to standardise 1,000 randomly chosen DC test records and the accuracy of the standardisation was assessed. Laplace smoothing was used in this and all subsequent address standardisation evaluations. Approximately ten person-hours of training time were required to reach this point.</p>
            <p>HMM1 was then used to standardise 1,000 randomly chosen Inpatient Statistics Collection (ISC) test records, and the accuracy assessed. In other words, an HMM trained using one data source (DC) was used to standardise addresses from a different data source (ISC) without any retraining of the HMM.</p>
            <p>An additional 1,000 randomly chosen address training records derived from the Midwives Data Collection (MDC) were then added to the 1,450 training records described above, and this larger training set was used to derive HMM2. HMM2 was then used to re-standardise the same sets of randomly chosen test records described in the first and second steps above, and the results were assessed.</p>
            <p>A further 60 training records, based on archetypes of those records which were incorrectly standardised in all of the preceding tests, were then added to the training set to produce HMM3. HMM3 was then used to re-standardise the same DC and ISC test sets. Thus, HMM3 could be considered as an "overfitted" model for the particular records in the two test sets, although in practice researchers are likely to use such overfitting to maximise standardisation accuracy for the particular data sets used in their studies. The total training time for all address standardisation models was not more than 20 person-hours.</p>
            <p>Finally, by way of comparison, the same two 1,000 record test data sets were standardised using <it>AutoStan </it>in conjunction with a rule set which had been developed and refined by two of the investigators (TC and KL) over several years for use with ISC (but not DC) address data, representing a cumulative investment of at least several person-weeks of programming time.</p>
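            <p>The derivation of an HMM from annotated training records, as used throughout this subsection, amounts to counting state transitions (and, analogously, observation symbols per state) and normalising the counts. The following minimal sketch shows this for the transition matrix with Laplace (add-one) smoothing. The function name, the <it>smoothing </it>parameter and the toy training sequences are assumptions made for illustration, not the code used in the evaluation.</p>

```python
from collections import Counter

def transition_probabilities(tagged_records, states, smoothing=1.0):
    """Frequency-based estimates of P(next state | state).

    tagged_records is a list of state sequences, one per training record.
    smoothing=1.0 gives Laplace (add-one) smoothing; smoothing=0.0 gives the
    unsmoothed maximum likelihood estimate.
    """
    counts = Counter()
    for record in tagged_records:
        for current, following in zip(record, record[1:]):
            counts[(current, following)] += 1
    probs = {}
    for s in states:
        total = sum(counts[(s, t)] for t in states) + smoothing * len(states)
        if total == 0:
            # State never left in the training data and no smoothing applied
            probs[s] = {t: 0.0 for t in states}
        else:
            probs[s] = {t: (counts[(s, t)] + smoothing) / total for t in states}
    return probs

# Two toy annotated records using state names from the address state table
states = ["wfnu", "wfna1", "wfna2", "wfty"]
training = [
    ["wfnu", "wfna1", "wfty"],
    ["wfnu", "wfna1", "wfna2", "wfty"],
]
smoothed = transition_probabilities(training, states)
unsmoothed = transition_probabilities(training, states, smoothing=0.0)
```

            <p>With smoothing, transitions never seen in training keep a small non-zero probability, which is what allows the model to process records whose patterns differ from those in the training set.</p>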
         </sec>
         <sec>
            <st>
               <p>Name standardisation</p>
            </st>
            <p>To assess the accuracy of name standardisation, a subset of 10,000 records with non-empty name components was selected from the MDC data set (approximately a one per cent sample). This sample was split into ten test sets each containing 1,000 records. A ten-fold cross-validation study was performed, with each of the folds having a training set of 9,000 records and the remaining 1,000 records being the test set. The training records were marked up with state and observation symbol information in about 10 person-hours using the iterative refinement method described above. HMMs were then trained without smoothing, and with Laplace and absolute discount smoothing, resulting in 30 different HMMs. We found that smoothing had a negligible effect on performance, and only the results from the unsmoothed HMMs are reported here.</p>
            <p>The performance of HMMs for name standardisation was compared with a deterministic rule-based standardisation algorithm which is also implemented in the <it>Febrl </it>package &#8211; details of this algorithm can be found in the associated documentation <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>.</p>
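            <p>The ten-fold design described in this subsection can be sketched as follows. The function name is hypothetical, and the shuffling step and fixed seed are assumptions: the text does not state how records were allocated to folds.</p>

```python
import random

def ten_fold_splits(records, n_folds=10, seed=0):
    """Yield (training set, test set) pairs for k-fold cross-validation.

    Records are shuffled once (an assumption of this sketch), then each
    contiguous block in turn serves as the test set while the remaining
    records form the training set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    fold_size = len(shuffled) // n_folds
    for i in range(n_folds):
        start = i * fold_size
        stop = start + fold_size
        test = shuffled[start:stop]
        train = shuffled[:start] + shuffled[stop:]
        yield train, test

# 10,000 record identifiers stand in for the sampled name records
folds = list(ten_fold_splits(range(10000)))
```

            <p>With 10,000 records this yields ten folds of 9,000 training and 1,000 test records each, and every record appears in exactly one test set.</p>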
         </sec>
         <sec>
            <st>
               <p>Evaluation criteria</p>
            </st>
            <p>For all tests, records were judged to be accurately standardised when all of the elements present in the input address string, with the exception of punctuation, were allocated to the correct output field, and the values in each output field were correctly transformed to their canonical form where required. Thus, a record was judged to have been incorrectly standardised if any element of the input string was not allocated to an output field, or if any element was allocated to the wrong output field. Due to resource constraints, the investigators were not blind to the nature of the standardisation process (HMM versus <it>AutoStan</it>) used. Exact binomial 95 per cent confidence limits for the proportion of correctly standardised records were calculated using the method given in <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>.</p>
            <p>In the records which were standardised incorrectly, not every data element was assigned to the wrong output field. For each of these address records, the proportions (and corresponding 95 per cent confidence limits) of data elements which were assigned to the wrong output field, or which were not assigned to an output field at all, were calculated. These quantities were not calculated for names due to the much simpler form of the name data.</p>
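            <p>For readers who wish to reproduce the confidence limits, the following minimal sketch computes an exact binomial interval by bisection on the binomial tail probabilities. It assumes the Clopper-Pearson construction, which is the usual meaning of "exact binomial" limits; the method in the cited reference may be formulated differently. All function names are hypothetical.</p>

```python
import math

def log_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of exactly k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def tail_at_least(x, n, p):
    """Probability of x or more successes out of n trials."""
    return sum(math.exp(log_pmf(k, n, p)) for k in range(x, n + 1))

def solve_increasing(f, target, iterations=50):
    """Bisection on the open interval (0, 1) for an increasing function f."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > target:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def exact_binomial_ci(x, n, alpha=0.05):
    """Clopper-Pearson limits for x successes out of n trials."""
    # Lower limit: the p at which P(X >= x) equals alpha / 2
    lower = 0.0 if x == 0 else solve_increasing(
        lambda p: tail_at_least(x, n, p), alpha / 2)
    # Upper limit: the p at which P(X >= x + 1) equals 1 - alpha / 2
    upper = 1.0 if x == n else solve_increasing(
        lambda p: tail_at_least(x + 1, n, p), 1 - alpha / 2)
    return lower, upper

# 957 correctly standardised records out of 1,000
lower, upper = exact_binomial_ci(957, 1000)
```

            <p>The limits obtained for 957 correct records out of 1,000 can be compared with the corresponding entry in Table <tblr tid="T9">9</tblr>.</p>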
         </sec>
         <sec>
            <st>
               <p>Computational performance</p>
            </st>
            <p>Indicative run times for the training and application of the HMMs described above were recorded on two computing platforms. Name standardisation was run on a lightly-loaded Sun Enterprise 450 computer with four 480 MHz Ultra-SPARC II processors and 4 gigabytes of main memory, running the Sun Solaris (64-bit Unix) operating system. Address standardisation was performed on a single-user 1.5 GHz Pentium 4 personal computer with 512 MB of main memory, running the 32-bit Microsoft Windows 2000 operating system. Python version 2.2 was used in both cases. Times were averaged over ten runs.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Address standardisation</p>
            </st>
            <p>Results are shown in Table <tblr tid="T9">9</tblr>.</p>
            <tbl id="T9">
               <title>
                  <p>Table 9</p>
               </title>
               <caption>
                  <p>Results of the address standardisation evaluation</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="4" ca="center">
                        <p>
                           <b>
                              <it>HMM/Method</it>
                           </b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Test Data Set (1000 records)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>HMM1</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>HMM2</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>HMM3</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>AutoStan</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Death Certificates</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.957 (0.943 &#8211; 0.969)</p>
                     </c>
                     <c ca="left">
                        <p>0.968 (0.955 &#8211; 0.978)</p>
                     </c>
                     <c ca="left">
                        <p>0.976 (0.964 &#8211; 0.985)</p>
                     </c>
                     <c ca="left">
                        <p>0.915 (0.896 &#8211; 0.932)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Inpatient Statistics Collection</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.957 (0.943 &#8211; 0.969)</p>
                     </c>
                     <c ca="left">
                        <p>0.959 (0.945 &#8211; 0.970)</p>
                     </c>
                     <c ca="left">
                        <p>0.974 (0.962 &#8211; 0.983)</p>
                     </c>
                     <c ca="left">
                        <p>0.953 (0.938 &#8211; 0.965)</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Table cells contain the proportion of correctly standardised address records for each of the two data sources listed. Ninety-five per cent confidence limits for the proportions are given in brackets.</p>
               </tblfn>
            </tbl>
            <p>The mean proportions of data items in each incorrectly standardised address which were assigned to the incorrect output field, or which were not assigned to any output field, are shown in Table <tblr tid="T10">10</tblr>.</p>
            <tbl id="T10">
               <title>
                  <p>Table 10</p>
               </title>
               <caption>
                  <p>Mean proportion of data items in each address which were assigned to the incorrect output field</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="4" ca="center">
                        <p>
                           <b>
                              <it>HMM/Method</it>
                           </b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>
                           <b>HMM1</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>HMM2</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>HMM3</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>AutoStan</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Death Certificates</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.31 (0.25 &#8211; 0.37)</p>
                     </c>
                     <c ca="left">
                        <p>0.31 (0.24 &#8211; 0.38)</p>
                     </c>
                     <c ca="left">
                        <p>0.33 (0.23 &#8211; 0.42)</p>
                     </c>
                     <c ca="left">
                        <p>0.29 (0.26 &#8211; 0.32)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Inpatient Statistics Collection</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.23 (0.18 &#8211; 0.28)</p>
                     </c>
                     <c ca="left">
                        <p>0.23 (0.18 &#8211; 0.28)</p>
                     </c>
                     <c ca="left">
                        <p>0.21 (0.15 &#8211; 0.26)</p>
                     </c>
                     <c ca="left">
                        <p>0.19 (0.17 &#8211; 0.22)</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Table cells contain the mean proportion of data items in each address which were assigned to the incorrect output field, or to no output field. Ninety-five per cent confidence limits for the proportions are given in brackets.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Name standardisation</p>
            </st>
            <p>Results of the ten-fold cross-validation of name standardisation, with 1,000 names of mothers in each test fold, are shown in Table <tblr tid="T11">11</tblr>.</p>
            <tbl id="T11">
               <title>
                  <p>Table 11</p>
               </title>
               <caption>
                  <p>Results of name standardisation evaluation</p>
               </caption>
               <tblbdy cols="12">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="11" ca="center">
                        <p>
                           <b>
                              <it>Folds</it>
                           </b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="11">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>8</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>Mean</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="12">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>HMM</p>
                     </c>
                     <c ca="center">
                        <p>0.966</p>
                     </c>
                     <c ca="center">
                        <p>0.921</p>
                     </c>
                     <c ca="center">
                        <p>0.852</p>
                     </c>
                     <c ca="center">
                        <p>0.970</p>
                     </c>
                     <c ca="center">
                        <p>0.966</p>
                     </c>
                     <c ca="center">
                        <p>0.938</p>
                     </c>
                     <c ca="center">
                        <p>0.831</p>
                     </c>
                     <c ca="center">
                        <p>0.920</p>
                     </c>
                     <c ca="center">
                        <p>0.954</p>
                     </c>
                     <c ca="center">
                        <p>0.884</p>
                     </c>
                     <c ca="center">
                        <p>0.920</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Rules</p>
                     </c>
                     <c ca="center">
                        <p>0.997</p>
                     </c>
                     <c ca="center">
                        <p>0.983</p>
                     </c>
                     <c ca="center">
                        <p>0.991</p>
                     </c>
                     <c ca="center">
                        <p>0.983</p>
                     </c>
                     <c ca="center">
                        <p>0.975</p>
                     </c>
                     <c ca="center">
                        <p>0.976</p>
                     </c>
                     <c ca="center">
                        <p>0.985</p>
                     </c>
                     <c ca="center">
                        <p>0.981</p>
                     </c>
                     <c ca="center">
                        <p>0.976</p>
                     </c>
                     <c ca="center">
                        <p>0.971</p>
                     </c>
                     <c ca="center">
                        <p>0.982</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Computational performance</p>
            </st>
            <p>In all cases it took under 15 seconds to train the various HMMs, once the training data files had been prepared (as described earlier).</p>
            <p>HMM standardisation of 10<sup>3</sup>, 10<sup>4</sup> and 10<sup>5</sup> name records on the Sun platform took an average of 67 seconds, 525 seconds and 5,133 seconds (86 minutes) respectively, indicating that performance scales approximately as O(n) &#8211; that is, linearly with the number of records to be processed. HMM standardisation of one million address records on the PC platform took 14,061 seconds (234 minutes), or 5,832 seconds (97 minutes) with the Psyco just-in-time Python compiler enabled <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. <it>AutoStan </it>took 1,849 seconds (31 minutes) to standardise the same one million address records on the same computer.</p>
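            <p>The scaling claim can be checked with simple arithmetic on the timings reported above. The short sketch below converts them to throughput and speed ratios; the variable names are arbitrary and only the reported figures are used.</p>

```python
# Reported mean run times on the Sun platform: records -> seconds
sun_timings = {1_000: 67, 10_000: 525, 100_000: 5133}
records_per_second = {n: n / seconds for n, seconds in sun_timings.items()}

# Reported run times for one million address records on the PC platform
pc_seconds = {"hmm": 14061, "hmm_psyco": 5832, "autostan": 1849}
psyco_speedup = pc_seconds["hmm"] / pc_seconds["hmm_psyco"]
autostan_ratio = pc_seconds["hmm"] / pc_seconds["autostan"]
```

            <p>Throughput is roughly 15, 19 and 19.5 name records per second for the three run sizes, so the cost per record is close to constant once start-up overhead is amortised. On the PC platform, Psyco gave a speed-up of about 2.4, and <it>AutoStan </it>was about 7.6 times faster than the HMM approach without Psyco.</p>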
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <sec>
            <st>
               <p>Address standardisation</p>
            </st>
            <p>The overall address standardisation results indicate that for typical Australian addresses captured by a variety of information systems, the HMM approach described in this paper performs at least as well as a widely-used rule-based system when used with the data source for which that system's rules were developed, and better when used on a different data set. In other words, HMMs trained on a particular data source appear to be more general than a rule-based system using rules developed for the same data.</p>
            <p>In addition, the improvements in performance observed with HMM2 and HMM3 suggest that, although frequency-based maximum likelihood estimates are used to derive the probability matrices, the resulting HMMs are fairly indifferent to the source of their training data, and their performance can even be improved by the addition of a small number of "atypical" training records which do not "fit" the HMM very well.</p>
            <p>It is probable that some of the observed generality of the HMMs stems from the use of lexicon-based tokenisation as implemented in the <it>Febrl </it>package, which enables exhaustive but readily available place name and other lists to be leveraged. In contrast, Borkar <it>et al. </it><abbrgrp><abbr bid="B20">20</abbr></abbrgrp> replaced each word in each input address with a symbol based on a simple regular expression grouping, e.g. 3-digit number, 5-digit number, single character, multi-character word, mixed alphanumeric word. These symbols contain much less semantic information than the lexicon-based symbols used in <it>Febrl</it>, although they have the advantage of not requiring look-up tables (lexicons). Borkar <it>et al</it>. also used nested HMMs to achieve acceptable accuracy on more complex addresses <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. At least for Australian addresses, which are of similar complexity to North American addresses, but less complex than most European and Asian addresses, we have not found nested models to be necessary. This may be because the lexicon-based tokenisation used in <it>Febrl </it>preserves more information from the source string for use by the HMM, at the expense of a more complex model. However, the computational performance of these models is satisfactory. Future attempts at optimisation, by re-writing parts of the code, such as the Viterbi algorithm, in C are expected to yield significant increases in speed. In addition, the standardisation of each record is completely independent of other records, and hence can readily be performed in parallel on clusters of workstations (COWs).</p>
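            <p>The difference between the two tokenisation styles discussed above can be illustrated with a minimal sketch. The pattern classes follow the groupings named in the text; the tiny lexicon, the symbol names and both functions are hypothetical and are not the tables or tags used by <it>Febrl </it>or by Borkar <it>et al</it>.</p>

```python
import re

# Pattern-based symbols: each token is reduced to its surface shape only
PATTERNS = [
    ("3digit", re.compile(r"\d{3}")),
    ("5digit", re.compile(r"\d{5}")),
    ("number", re.compile(r"\d+")),
    ("char", re.compile(r"[a-z]")),
    ("word", re.compile(r"[a-z]+")),
    ("alnum", re.compile(r"[a-z0-9]+")),
]

def pattern_symbol(token):
    """Return the first pattern class that matches the whole token."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return name
    return "other"

# Lexicon-based symbols: a toy look-up table (hypothetical entries and names)
LEXICON = {
    "street": "wayfare_type", "road": "wayfare_type",
    "west": "locality_qualifier", "nsw": "territory", "flat": "unit_type",
}

def lexicon_symbol(token):
    """Return the lexicon class of a token, falling back to generic symbols."""
    if token in LEXICON:
        return LEXICON[token]
    return "number" if token.isdigit() else "unknown"

tokens = "23 sherlock holmes street potingu west nsw 2876".split()
by_pattern = [pattern_symbol(t) for t in tokens]
by_lexicon = [lexicon_symbol(t) for t in tokens]
```

            <p>Under the pattern-based scheme every alphabetic token in this address receives the same symbol, whereas the lexicon-based scheme distinguishes the wayfare type, locality qualifier and territory before the HMM is applied.</p>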
            <p>Standardisation is not an all-or-nothing transformation, and both the rule-based and HMM approaches appear to degrade gracefully when the model or rules make errors. In the address records which were not accurately standardised by the HMMs, at least two-thirds of all data elements present in the input record were allocated to the correct output fields. Thus, even these incorrectly standardised records would have considerable discriminatory power when used for record linkage purposes. In only two test records (out of 2000) were all of the address elements wrongly assigned, and both of these were foreign addresses in non-English speaking countries. The performance of our <it>AutoStan </it>rule set was similar in this respect.</p>
            <p>A significant proportion of incorrectly standardised addresses were of the form "Penryth Downs St Blackstump NSW 29