Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, USA

Regenstrief Institute and Indiana University School of Medicine, Indianapolis, IN, USA

Abstract

Background

Methods for linking real-world healthcare data often use a latent class model, where the latent, or unknown, class is the true match status of candidate record-pairs. This commonly used model assumes that agreement patterns among multiple fields within a latent class are independent. When this assumption is violated, various approaches, including the most commonly proposed loglinear models, have been suggested to account for conditional dependence.

Methods

We present a step-by-step guide to identify important dependencies between fields through a correlation residual plot and demonstrate how they can be incorporated into loglinear models for record linkage. This method is applied to healthcare data from the patient registry for a large county health department.

Results

Our method can be readily implemented using standard software (code is supplied) and produced a better overall model fit, as measured by BIC and deviance. Finding the most parsimonious model is known to reduce bias in parameter estimates.

Conclusions

This novel approach identifies and accommodates conditional dependence in the context of record linkage. The conditional dependence model is recommended for routine use due to its flexibility for incorporating conditional dependence and easy implementation using existing software.

Background

Health information exchanges (HIEs), with highly heterogeneous data, are becoming increasingly important sources of integrated clinical data supporting many healthcare tasks and health-related research. HIE data are captured from different independent databases with different patient identifiers, and best practices for implementing and operating HIEs are needed. Specifically with respect to data integration and patient matching, in its formal recommendations to the Director of the Office of the National Coordinator for Health Information Technology (HIT) in 2011, the HIT Policy Committee recognized the need to develop and disseminate best practices for patient matching.

Many methodologies have been proposed to identify records in two or more databases that relate to the same entity. Deterministic approaches are based on ad hoc rules, which classify a pair of records as a match if the two records satisfy certain conditions. Although straightforward to implement, deterministic approaches are often too conservative, with unacceptably high false negative (missed-match) rates, especially when data are noisy.

Distance-based methods that can handle numerical or categorical fields, as described in

Another alternative to deterministic linkage is the class of probabilistic methods. A common probabilistic record linkage method was proposed by Fellegi and Sunter in 1969.

This conditional independence assumption is often violated in real-world record linkage scenarios.

Various methods have been proposed to address the lack of conditional independence in latent class models for record linkage. For example, Tromp et al. incorporated conditional dependence between two fields by combining them into one field with four nominal levels of agreement.

Latent class models with conditional independence can be equivalently formulated using a loglinear framework.

Similarly, loglinear models have been applied to record linkage. Using survey data whose record pairs had known match status, Thibaudeau identified fields with conditional dependence using a loglinear model with selected interactions.

Many previous record linkage studies focused largely on maximum likelihood (ML) estimation, where the parameter estimates of the loglinear model were obtained using an Expectation-Maximization (EM) algorithm. For situations such as the latent class model, where incomplete data (unobserved classes) are involved, the EM algorithm is a powerful tool for estimating model parameters.
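To make the EM idea concrete, the following is a minimal sketch of EM for a two-class latent class model with conditional independence, written in Python rather than the SAS used by the authors. The toy parameter values, the starting values, and all function names are illustrative assumptions and are not taken from the paper.

```python
import math
from itertools import product


def class_prob(pattern, probs):
    """P(pattern | class) when fields agree independently within the class."""
    out = 1.0
    for y, p in zip(pattern, probs):
        out *= p if y == 1 else 1.0 - p
    return out


def em_step(counts, pi, m, u):
    """One EM iteration on pattern counts.

    Returns the updated (pi, m, u) and the log-likelihood evaluated at the
    parameters that were passed in.
    """
    total = sum(counts.values())
    post, loglik = {}, 0.0
    # E-step: posterior probability that each agreement pattern is a match.
    for pat, n in counts.items():
        pm = pi * class_prob(pat, m)
        pu = (1.0 - pi) * class_prob(pat, u)
        post[pat] = pm / (pm + pu)
        loglik += n * math.log(pm + pu)
    # M-step: posterior-weighted agreement proportions in each class.
    w_match = sum(n * post[pat] for pat, n in counts.items())
    w_non = total - w_match
    k = len(m)
    new_m = [sum(n * post[p] * p[i] for p, n in counts.items()) / w_match
             for i in range(k)]
    new_u = [sum(n * (1.0 - post[p]) * p[i] for p, n in counts.items()) / w_non
             for i in range(k)]
    return w_match / total, new_m, new_u, loglik


# Toy pattern counts for three fields, generated from assumed parameters.
true_pi, true_m, true_u = 0.1, [0.9, 0.8, 0.85], [0.1, 0.2, 0.15]
counts = {
    pat: 10000 * (true_pi * class_prob(pat, true_m)
                  + (1 - true_pi) * class_prob(pat, true_u))
    for pat in product((0, 1), repeat=3)
}

# Arbitrary starting values (assumed), then iterate.
pi, m, u = 0.2, [0.7, 0.7, 0.7], [0.3, 0.3, 0.3]
loglik_trace = []
for _ in range(200):
    pi, m, u, ll = em_step(counts, pi, m, u)
    loglik_trace.append(ll)
```

The log-likelihood trace should never decrease from one iteration to the next, which is a convenient check when adapting a sketch like this to real agreement-pattern counts.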

Even though loglinear models have been proposed by multiple authors for handling conditional dependence in HIE, implementation of such models requires customized programs and the process for choosing pairwise interactions in these models has not been specified. We therefore describe and evaluate a method for identifying conditional dependence among fields, which are subsequently incorporated as interactions in a loglinear model fitted using standard software. To illustrate the methodology, we use an application linking a client list of a county health department to itself for de-duplication. The step-by-step method described is supplemented by sample code which can be readily modified for linking any two data sets using standard statistical software.

Methods

We first describe a loglinear formulation of the extended F-S model with conditional dependence. Let

where y_{i} = 1 if the i^{th} field agrees and 0 otherwise. The match prevalence is defined as the proportion of vector patterns belonging to the true match record class and is π =

where the

To more effectively accommodate conditional dependence, the traditional F-S model can be reparameterized using a loglinear formulation, where the mean number of record pairs with agreement pattern

With K fields, there are 2^{K} possible different agreement patterns, indexed by d. Each agreement pattern d has an associated frequency count, the number of record pairs displaying that pattern.

where the marginal probability of observing agreement pattern d is

The match score for a specific agreement pattern d is defined as

The loglinear formulation has been shown to be equivalent to the classical F-S probabilistic formulation of the conditional independence latent class model.
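As an illustration of the conditional independence model, the sketch below computes the marginal probability, the posterior match probability, and a log-likelihood-ratio match score for an agreement pattern. It uses the textbook F-S form, which may differ in detail (for example, in the log base) from the definitions in this paper. The numerical values are the Model 0 estimates from the results table, read here as m (agreement probability among matches, first block) and u (agreement probability among non-matches, second block); that reading and all function names are assumptions.

```python
import math
from itertools import product

# Model 0 estimates from the results table (assumed mapping to m and u).
# Field order: year of birth, SSN, day of birth, telephone, zip code, sex,
# month of birth.
PI = 0.037
M = [0.581, 0.025, 0.572, 0.173, 0.409, 0.710, 0.716]
U = [0.026, 6e-6, 0.032, 5e-4, 0.037, 0.661, 0.082]


def class_prob(pattern, probs):
    """P(pattern | class) under conditional independence."""
    out = 1.0
    for y, p in zip(pattern, probs):
        out *= p if y == 1 else 1.0 - p
    return out


def marginal_prob(pattern, pi=PI, m=M, u=U):
    """Mixture probability of observing the agreement pattern."""
    return pi * class_prob(pattern, m) + (1.0 - pi) * class_prob(pattern, u)


def posterior_match(pattern, pi=PI, m=M, u=U):
    """P(match | pattern) by Bayes' rule."""
    return pi * class_prob(pattern, m) / marginal_prob(pattern, pi, m, u)


def match_score(pattern, m=M, u=U):
    """Natural-log likelihood ratio of match versus non-match."""
    score = 0.0
    for y, mi, ui in zip(pattern, m, u):
        score += math.log(mi / ui) if y == 1 else math.log((1 - mi) / (1 - ui))
    return score


all_agree = (1,) * 7
all_disagree = (0,) * 7
score_agree = match_score(all_agree)
score_disagree = match_score(all_disagree)
total_prob = sum(marginal_prob(p) for p in product((0, 1), repeat=7))
```

Because every m exceeds the corresponding u in this table, each agreeing field raises the score and each disagreeing field lowers it.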

To incorporate conditional dependence in the loglinear model setting, we add the appropriate interaction terms to the model. For example, if there is dependence between fields

The above loglinear model with interaction terms is easy to fit in standard statistical software such as SAS (example code is provided in an Additional file). Model fit is assessed using the deviance G^{2} and the Bayesian Information Criterion (BIC). We use deviance to compare nested models. A model with lower deviance provides a better fit to the data and hence will be preferred. For models that are not nested within each other, BIC is the most commonly used criterion for latent class modeling as it takes into account the sample size.
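For readers who want to check the fit statistics by hand, the following sketch computes a deviance and a BIC in Python. It assumes the usual likelihood-ratio form G^{2} = 2 Σ n_d log(n_d / e_d) over agreement patterns and the form BIC = −2 log L + p log N; some latent class software reports BIC on a different but equivalent scale, so the exact definition used by the authors is not confirmed here. The cell counts are invented for illustration.

```python
import math


def deviance(observed, expected):
    """Likelihood-ratio statistic G^2; empty observed cells contribute zero."""
    return 2.0 * sum(
        n * math.log(n / e) for n, e in zip(observed, expected) if n > 0
    )


def bic(log_likelihood, n_params, n_obs):
    """BIC = -2 log L + p log N (smaller is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


# Hypothetical observed and model-expected counts for four patterns.
observed = [50, 30, 15, 5]
expected = [48, 32, 14, 6]

g2_model = deviance(observed, expected)
g2_saturated = deviance(observed, observed)  # perfect fit gives zero
```

When comparing two nested models, the difference in their deviances is the quantity of interest; BIC additionally penalizes the number of parameters by log N.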

**MCHD Loglinear Models.sas.** SAS program that uses the loglinear approach described in this paper to fit the MCHD dataset. Requires SAS® software available from SAS Institute Inc., Cary, NC, USA.


In what follows, we describe a series of steps to fit a loglinear model with appropriate interactions. Specifically, we follow a six-step procedure by identifying the pairwise dependencies between fields using the correlation residual plot proposed by Qu, Tan, and Kutner

Step 1

Fit a loglinear model with no interactions using the observed agreement vectors. This is simply the F-S model formulated as a loglinear model, which provides initial parameter estimates for the next model. Obtain deviance and BIC of this conditional independence model. See Additional files

**corr macro.sas.** SAS macro to compute the correlation residual. Requires SAS® software available from SAS Institute Inc., Cary, NC, USA.


**MCHD data.csv.** MCHD data supplied in standard csv format.


Step 2

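Compute the observed pairwise correlations. The observed pairwise correlation between fields j and l is

$$ r_{jl} = \frac{p_{jl} - p_{j}\,p_{l}}{\sqrt{p_{j}(1 - p_{j})\,p_{l}(1 - p_{l})}} $$

where p_{j}, p_{l}, and p_{jl} are the observed proportions of record pairs that agree on field j, on field l, and on both fields, respectively.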
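The observed correlations of Step 2 can be sketched as follows, assuming the usual Pearson (phi) correlation between two binary agreement indicators, computed from pattern frequency counts. The function name and the toy counts are hypothetical.

```python
import math
from itertools import combinations


def pairwise_correlations(counts, n_fields):
    """Correlation between agreement indicators for every pair of fields.

    counts maps an agreement pattern (tuple of 0/1) to its frequency.
    """
    total = sum(counts.values())
    # Marginal agreement proportion for each field.
    p = [sum(n * pat[j] for pat, n in counts.items()) / total
         for j in range(n_fields)]
    corr = {}
    for j, l in combinations(range(n_fields), 2):
        # Proportion of record pairs agreeing on both fields.
        p_jl = sum(n * pat[j] * pat[l] for pat, n in counts.items()) / total
        denom = math.sqrt(p[j] * (1 - p[j]) * p[l] * (1 - p[l]))
        corr[(j, l)] = (p_jl - p[j] * p[l]) / denom
    return corr


# Toy counts for three fields: fields 0 and 1 tend to agree together.
toy_counts = {
    (1, 1, 0): 40,
    (0, 0, 0): 40,
    (1, 0, 1): 10,
    (0, 1, 1): 10,
}
corr = pairwise_correlations(toy_counts, 3)
```

In this toy table, fields 0 and 1 each agree in half of the pairs but agree jointly in 40% of them, so their correlation is positive, while field 2 is uncorrelated with both.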

Step 3

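Substitute the parameter estimates from the fitted model to obtain the expected probability of each agreement pattern d and match status. Calculate the expected marginal probability of agreement pattern d using Equation (2) and the expected cell count, and from these compute the expected pairwise correlation for each pair of fields in the same way as in Step 2.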

Step 4

Compute the correlation residual, which is equal to the difference between the observed correlation and the expected correlation for each pair of fields. Plot the residuals across the different pairs of fields. A correlation residual which is much different from zero would imply dependence for the corresponding pair of fields.
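A minimal sketch of the correlation residual calculation is given below. It assumes a fitted conditional independence model with hypothetical parameter values and builds toy observed counts by moving 200 record pairs between cells so that fields 0 and 1 agree together more often than the model expects. All names and numbers are illustrative and are not taken from the MCHD analysis.

```python
import math
from itertools import combinations, product


def class_prob(pattern, probs):
    """P(pattern | class) under conditional independence."""
    out = 1.0
    for y, p in zip(pattern, probs):
        out *= p if y == 1 else 1.0 - p
    return out


def pairwise_corr(cells, n_fields):
    """Correlation of agreement indicators from pattern counts or probabilities."""
    total = sum(cells.values())
    p = [sum(w * pat[j] for pat, w in cells.items()) / total
         for j in range(n_fields)]
    out = {}
    for j, l in combinations(range(n_fields), 2):
        p_jl = sum(w * pat[j] * pat[l] for pat, w in cells.items()) / total
        denom = math.sqrt(p[j] * (1 - p[j]) * p[l] * (1 - p[l]))
        out[(j, l)] = (p_jl - p[j] * p[l]) / denom
    return out


# Hypothetical fitted conditional independence model for three fields.
pi, m, u = 0.1, [0.9, 0.8, 0.85], [0.1, 0.2, 0.15]
n_pairs = 10000
expected = {
    pat: n_pairs * (pi * class_prob(pat, m) + (1 - pi) * class_prob(pat, u))
    for pat in product((0, 1), repeat=3)
}

# Toy observed counts: shift pairs so that fields 0 and 1 co-agree more often
# while every single-field agreement rate stays the same.
observed = dict(expected)
observed[(1, 0, 0)] -= 200
observed[(1, 1, 0)] += 200
observed[(0, 1, 0)] -= 200
observed[(0, 0, 0)] += 200

obs_corr = pairwise_corr(observed, 3)
exp_corr = pairwise_corr(expected, 3)
residuals = {pair: obs_corr[pair] - exp_corr[pair] for pair in obs_corr}
```

Only the residual for the pair (0, 1) departs from zero in this construction, which is the pattern a correlation residual plot is meant to reveal.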

Step 5

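Incorporate the conditional dependence between the pair of fields identified in Step 4 by adding the corresponding interaction term to the loglinear model. Refit the model, recompute the expected probability of each agreement pattern d by substituting the new parameter estimates, and repeat Steps 3 and 4.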
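To show what an interaction term does, the sketch below builds within-class pattern probabilities from a loglinear model with main effects and an optional pairwise interaction. It assumes 0/1 coding of agreement and a single class; the paper's own parameterization (for example, effect coding) may differ, and all names and values are hypothetical.

```python
import math
from itertools import product


def loglinear_class_probs(main, interactions, n_fields):
    """Pattern probabilities within one class.

    log P(y | class) = const + sum_i main[i] * y_i
                             + sum_(j,l) interactions[(j, l)] * y_j * y_l
    """
    weights = {}
    for y in product((0, 1), repeat=n_fields):
        eta = sum(a * yi for a, yi in zip(main, y))
        eta += sum(b * y[j] * y[l] for (j, l), b in interactions.items())
        weights[y] = math.exp(eta)
    z = sum(weights.values())  # normalizing constant
    return {y: w / z for y, w in weights.items()}


def logit(p):
    return math.log(p / (1.0 - p))


main = [logit(0.9), logit(0.8), logit(0.85)]

# No interaction: reduces to independent fields with the given probabilities.
indep = loglinear_class_probs(main, {}, 3)

# Positive interaction between fields 0 and 1 (value chosen arbitrarily).
dep = loglinear_class_probs(main, {(0, 1): 1.0}, 3)
p0 = sum(p for y, p in dep.items() if y[0] == 1)
p1 = sum(p for y, p in dep.items() if y[1] == 1)
p01 = sum(p for y, p in dep.items() if y[0] == 1 and y[1] == 1)
excess_joint_agreement = p01 - p0 * p1
```

With the interaction set to zero the model reproduces the conditional independence probabilities; a positive interaction makes joint agreement on the two fields more likely than the product of their marginal agreement rates.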

Step 6

To classify individual pairs as match, non-match or uncertain matches, we use the final model parameter estimates to calculate the match score for each agreement pattern. Record pairs are then declared as matches or non-matches based on these match scores.
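The classification in Step 6 can be sketched as a pair of thresholds on the match score. The threshold values, parameter values, and function names below are hypothetical; the paper does not state the cut-offs in the surviving text, and in practice they would be chosen from the fitted model, for instance using the estimated match prevalence.

```python
import math


def match_score(pattern, m, u):
    """Natural-log likelihood ratio of match versus non-match."""
    score = 0.0
    for y, mi, ui in zip(pattern, m, u):
        score += math.log(mi / ui) if y == 1 else math.log((1 - mi) / (1 - ui))
    return score


def classify(pattern, m, u, lower=-3.0, upper=3.0):
    """Three-way decision based on two (assumed) score thresholds."""
    score = match_score(pattern, m, u)
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "uncertain"


# Hypothetical final-model estimates for three fields.
m_hat = [0.9, 0.8, 0.85]
u_hat = [0.1, 0.2, 0.15]

decisions = {
    pattern: classify(pattern, m_hat, u_hat)
    for pattern in [(1, 1, 1), (1, 1, 0), (0, 0, 0)]
}
```

Setting the lower and upper thresholds equal removes the uncertain category and gives the two-way match or non-match decision used in the results.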

Approval to perform this study was obtained from the Indiana University Institutional Review Board: approval number 1010002784 (0909–68). De-identified data for the HIE example described in the next section is provided as Additional file

Results

Description of the HIE dataset

We applied the above steps to de-duplicate the client registry for the Marion County Health Department (MCHD). De-duplication is a class of record linkage where a data set is linked to itself to identify potential duplicate records. MCHD is a member of the Indiana Network for Patient Care, the nation’s largest and longest tenured HIE.

The MCHD client registry contains 779,466 patient records gathered from multiple public health service areas. These data are therefore highly heterogeneous, and the method of input for a given field may be any combination of standardized electronic entry, paper entry, or manual entry. Since the total number of potential record pairs is extremely large (3 × 10^{11} potential pairs), the data were first blocked to reduce the search space for potential matches. The MCHD client registry was blocked on last name and first name, so only record pairs agreeing on these two fields are included in the analysis. This reduced the number of potential record pairs to 618,213. The remaining fields in this dataset include day, month, and year of birth, social security number (SSN), telephone number, zip code, and gender. The level of missingness varies across fields, from 0% for day, month, and year of birth to as high as 95% for specific identifiers such as SSN. Missing values were coded as disagreements. We then applied the six-step process described above to all pairs from this block.
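The size of the unblocked search space and the effect of blocking can be illustrated with a short Python sketch. The pair count follows from n(n − 1)/2 for a file linked to itself; the blocking function, field names, and toy records are hypothetical and are not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations


def n_candidate_pairs(n_records):
    """Number of distinct record pairs when a file is linked to itself."""
    return n_records * (n_records - 1) // 2


def blocked_pairs(records, keys):
    """Yield index pairs of records that agree exactly on all blocking keys."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[tuple(rec[k] for k in keys)].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)


# 779,466 records give roughly 3.04e11 pairs, in line with the 3 x 10^11
# quoted in the text.
total_pairs = n_candidate_pairs(779_466)

# Toy records (invented) blocked on last name and first name.
toy_records = [
    {"last": "SMITH", "first": "ANN", "zip": "46202"},
    {"last": "SMITH", "first": "ANN", "zip": "46204"},
    {"last": "JONES", "first": "BOB", "zip": "46202"},
    {"last": "SMITH", "first": "AMY", "zip": "46202"},
]
pairs = list(blocked_pairs(toy_records, ["last", "first"]))
```

Only the first two toy records share a block, so a single candidate pair remains instead of the six pairs that an unblocked comparison would generate.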

Application to HIE dataset

As described in the previous section, we first fit the conditional independence model (Model 0), which yielded a deviance of G^{2} = 8852.9. Assuming independence, the observed and expected pairwise correlations were calculated.

Loglinear model results for MCHD data blocked on last name and first name (number of record pairs = 618,213). All parameters are statistically significant (p < .001) for all three models, except for u_{2}, which is not significant for the conditional independence model (p = .468) or Loglinear Model I (p = .143).

| Field | Parameter | Model 0 (Conditional independence) Estimate | Model 0 Std Error | Model I Estimate | Model I Std Error | Model II Estimate | Model II Std Error |
|---|---|---|---|---|---|---|---|
| | π | 0.037 | 0.0004 | 0.035 | 0.0004 | 0.035 | 0.0004 |
| Year of birth | m_{1} | 0.581 | 0.0047 | 0.615 | 0.0048 | 0.615 | 0.0048 |
| SSN | m_{2} | 0.025 | 0.0011 | 0.027 | 0.0011 | 0.026 | 0.0011 |
| Day of birth | m_{3} | 0.572 | 0.0046 | 0.608 | 0.0048 | 0.608 | 0.0048 |
| Telephone | m_{4} | 0.173 | 0.0029 | 0.141 | 0.0026 | 0.140 | 0.0026 |
| Zip code | m_{5} | 0.409 | 0.0041 | 0.363 | 0.0040 | 0.362 | 0.0040 |
| Sex | m_{6} | 0.710 | 0.0037 | 0.695 | 0.0038 | 0.694 | 0.0038 |
| Month of birth | m_{7} | 0.716 | 0.0044 | 0.768 | 0.0044 | 0.769 | 0.0045 |
| Year of birth | u_{1} | 0.026 | 0.0002 | 0.026 | 0.0002 | 0.026 | 0.0002 |
| SSN | u_{2} | 6E-06 | 8E-06 | 1E-05 | 9E-06 | 5E-05 | 1E-05 |
| Day of birth | u_{3} | 0.032 | 0.0003 | 0.031 | 0.0003 | 0.031 | 0.0003 |
| Telephone | u_{4} | 5E-04 | 0.0001 | 0.002 | 0.0001 | 0.002 | 0.0001 |
| Zip code | u_{5} | 0.037 | 0.0003 | 0.039 | 0.0003 | 0.039 | 0.0003 |
| Sex | u_{6} | 0.661 | 0.0006 | 0.661 | 0.0006 | 0.661 | 0.0006 |
| Month of birth | u_{7} | 0.082 | 0.0004 | 0.081 | 0.0004 | 0.081 | 0.0004 |
| G^{2} | | 8852.9 | | 2974.26 | | 2881.45 | |

**Correlation residual plots for last name/first name block.** Pairwise correlation residuals for Model 0 (Panel A), Model I (Panel B), and Model II (Panel C).

The seven fields in this particular dataset yield 21 pairwise correlations. The difference between the 21 observed and expected pairwise correlations from this model ranged from −0.027 to 0.155. The majority of the correlation residuals from the conditional independence model fluctuate between −0.03 and 0.05. However, the correlation residual between the fields telephone number and zip code is much larger than the others (almost 5-fold difference), indicating a violation of the conditional independence assumption for this pair of variables.

To accommodate the conditional dependence between telephone number and zip code, we followed

The parameter estimates of the match prevalence and

After calculating the expected correlations under Model I (repeat

Although the correlation residual plot did not reveal a substantial departure from conditional independence for the other pairs of fields, the correlation residual between SSN and telephone number (0.047) was more than twice the magnitude of the remaining residuals. To examine whether it is appropriate to consider conditional dependence between this pair of fields, we repeated the

Model II provided a better fit to the data than Model I, with a smaller deviance and BIC, as well as smaller correlation residuals (Panel C of the correlation residual figure). However, the deviance of Model II was G^{2} = 2881.45, only slightly less than the deviance of Model I (G^{2} = 2974.26). As a result, parameter estimates under Model II did not differ much from those of the less complex Model I. Because the SSN/telephone residual (0.047) also fell below our general guideline of .05 for correlation residuals, we chose Model I as our final model.
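The model comparison above can be checked with simple arithmetic. The sketch below uses the deviances reported in the results table and the two residuals quoted in the text. It assumes that the largest reported residual (0.155) belongs to the telephone/zip code pair and that the .05 guideline is applied to the absolute residual; both are readings of the text rather than statements confirmed by it.

```python
# Deviances (G^2) reported for the three fitted models.
deviance = {"Model 0": 8852.9, "Model I": 2974.26, "Model II": 2881.45}

drop_0_to_I = deviance["Model 0"] - deviance["Model I"]
drop_I_to_II = deviance["Model I"] - deviance["Model II"]

# Relative reductions in deviance from each added interaction.
rel_drop_0_to_I = drop_0_to_I / deviance["Model 0"]
rel_drop_I_to_II = drop_I_to_II / deviance["Model I"]

# Correlation residuals quoted in the text (pair labels partly assumed).
residuals = {
    ("telephone", "zip code"): 0.155,
    ("SSN", "telephone"): 0.047,
}
GUIDELINE = 0.05  # general guideline for residuals worth modeling

flagged = [pair for pair, r in residuals.items() if abs(r) > GUIDELINE]
```

Adding the telephone/zip code interaction removes about two thirds of the deviance, whereas the SSN/telephone interaction removes only about 3% more, which is consistent with retaining Model I.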

Patient records were then classified as match or non-match based on the estimated match prevalence from these three models. The conditional independence model classified 1,152 record pairs as matches. Model I classified 1,082 matches, and Model II results were almost identical to Model I with 1,081 matches. Thus, not accounting for the conditional dependence yielded the largest number of declared duplicates, likely producing falsely merged records and, in turn, lost patient records. Since it was not the purpose of our study to assess the accuracy of the different models, we did not manually ascertain the true match status of the records. We refer the readers to the literature

Discussion

For many record linkage applications, the assumption of conditional independence for field agreement is often violated and ignoring the conditional dependence may lead to a suboptimal record matching accuracy. To optimize matching accuracy, it is important to examine whether conditional dependence exists and to incorporate such dependence in the model in a proper way.

In this paper, we presented a step-by-step procedure to identify and incorporate conditional dependence among fields using a loglinear latent class model. This stepwise method can be implemented using standard statistical software. In contrast to previous studies where loglinear latent class models were estimated using the iterative EM algorithm, we proposed estimating parameters using the readily available SAS procedure NLMIXED. Our step-by-step process was applied to the de-duplication of the MCHD client registry. The results indicated that conditional dependence can be readily identified using a graphical approach and the model with appropriate conditional dependence provided a much better model fit than the conditional independence model.

Although a stepwise variable-selection strategy was previously proposed by Zhu et al.

In addition to loglinear models, latent class models with conditional dependence have been extensively studied and widely used in other domains. For example, in diagnostic testing, latent class models with random effects

A potential limitation of our approach is that it is more labor intensive, because it requires an understanding of how to fit loglinear models. Additionally, the loglinear model framework requires a parameterization that is not as readily understood by practitioners. The approach is iterative and is therefore also more computationally intensive. However, these challenges are mitigated by the example code provided as additional files.

Conclusions

We have proposed a novel and practical approach to identify and incorporate conditional dependence in record linkage. Compared to the commonly used F-S model, the conditional dependence model provides substantially better fit to the data when conditional dependence exists. Given that some fields commonly used for linking health records often have correlated agreement patterns, we recommend the routine use of our proposed methods to avoid model misfit. Our approach can be easily followed using the step-by-step instructions and the sample code provided.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JKD drafted the manuscript and prepared the tables, figures, and supplemental material. HX performed the analysis, was involved in drafting the manuscript, and created the SAS macro. SLH reformatted and also wrote portions of the manuscript. REG provided knowledge on record linkage. This work was supported by SJG’s grant, and SJG provided additional knowledge on record linkage as well as the data. All authors read and approved the final manuscript.
