Suppose we have two separate data sets containing demographic information for the same population and we would like to merge these records, but we lack unique identifiers to base the merge on. As digital record keeping and data collection have proliferated, this scenario has become increasingly common. Fellegi and Sunter (1969) introduced the first record linkage method that enabled matching entities (e.g. people, companies, or products) across databases in the absence of unique identifiers. R. C. Steorts, Hall, and Fienberg (2016) introduced one of the first Bayesian methods for record linkage, which provided an intuitive approach to quantifying uncertainty around the matches.
Record linkage is often a preliminary step in a more complex analysis: for example, we may have covariates in one data set and outcome data in another, and we need to merge the records in order to perform some predictive task. R. C. Steorts, Tancredi, and Liseo (2018) extended the Bayesian framework of R. C. Steorts, Hall, and Fienberg (2016) to handle the joint task of record linkage and linear regression. Their approach had the advantage of exact error propagation: any uncertainty from the linkage step is reflected in the regression estimates. The assumption of Gaussian data allowed R. C. Steorts, Tancredi, and Liseo (2018) to marginalize out the latent covariates in their regression and work only with the observed data, but it rules out a limited dependent variable. In this post, I introduce a model based on R. C. Steorts, Tancredi, and Liseo (2018) that does not require explicit assumptions on the data generating process for the covariates. I will use this model to jointly link one database with a binary outcome to another containing potential covariates, then estimate a downstream logistic regression.
Suppose that we have two databases \(\mathcal{D}_1\) and \(\mathcal{D}_2\) that consist of records \(R_{ij}\), where \(i\in\{1,2\}\) indexes the database and \(j\in\{1,\ldots,N_i\}\) indexes the record. Each record has \(K\) overlapping fields that can be used to link records across databases; we denote this subset of \(R_{ij}\) as \(V_{ij} = \{V_{ijk}\}_{k=1}^K\). Each \(V_{ijk}\) is assumed to be categorical with \(m\in\{1,\ldots,M_k\}\) representing the possible labels for the \(k^{\text{th}}\) field. If \(i = 1\), then \(R_{ij}\) has an additional field \(Y_{j}\in\{0,1\}\) that represents a binary response. Similarly, if \(i=2\), then \(R_{ij}\) has \(P\) additional fields \(\{X_{jp}\}_{p=1}^P\) that represent covariates. Our aim is to determine which records in \(\mathcal{D}_1\) match to records in \(\mathcal{D}_2\), then conditional on that matching, regress \(\{Y_{j}\}_{j=1}^{N_1}\) on \(\{X_{j}\}_{j=1}^{N_2}\).
We assume that there are \(N_{\text{pop}}\) latent entities indexed by \(\ell\in\{1,\ldots,N_{\text{pop}}\}\). The latent entities have an associated set of true field values \(\{(\widetilde{V}_\ell,\widetilde{Y}_\ell, \widetilde{X}_\ell)\}_{\ell=1}^{N_\text{pop}}\). The records we observe are (potentially) distorted versions of these true underlying values. To keep track of which latent entity generates which observed records, we define a linkage structure, \[\Lambda = \left\{\{\lambda_{ij}\}_{j=1}^{N_i} : i = 1,2\right\}\] where \(\lambda_{ij} = \ell\) implies that the \(j^{\text{th}}\) record in the \(i^{\text{th}}\) database is generated by the \(\ell^{\text{th}}\) latent entity. We assume that it is possible for \(V_{ijk} \neq \widetilde{V}_{\lambda_{ij}k}\). In this case we say that there is distortion, which we model with the latent indicator \(Z_{ijk}\), so that \(Z_{ijk} = 1\) if the \(k^{\text{th}}\) field of the \(ij^{\text{th}}\) record is distorted.
Assume that each of the \(K\) latent field values is generated independently across latent entities and across each field. Given a vector of probabilities \(\theta_k\) over the \(M_k\) values appearing in the \(k^{\text{th}}\) field we have \[\begin{align*} \widetilde{V}_{\ell k} \;|\; \theta_k &\overset{\text{i.i.d.}}{\sim} \text{Multinomial}(1, \theta_k) \\ \widetilde{V}_{\ell k} &\;\perp \widetilde{V}_{\ell k'} \text{ if } k\neq k' \\ \theta_k &\overset{\text{ind.}}{\sim} \text{Dirichlet}(\pmb{\eta}_k). \end{align*}\] We also assume that distortion occurs independently both across fields and records, so that given the distortion probability \(\alpha_k\) for the \(k^{\text{th}}\) field we have \[\begin{align*} Z_{ijk} \;|\; \alpha_k &\overset{\text{i.i.d.}}{\sim} \text{Bernoulli}(\alpha_k) \\ Z_{ijk} &\;\perp Z_{i'j'k} \text{ if } (i,j) \neq (i',j') \\ \alpha_k &\overset{\text{ind.}}{\sim} \text{Beta}(a_k, b_k). \end{align*}\] Conditional on latent field values, the distortion indicators, and the linkage structure, we assume that the observed records are independently generated by the following “hit-and-miss” model, \[\begin{align*} V_{ijk} \;|\; \lambda_{ij}, \widetilde{V}_{\lambda_{ij}k}, Z_{ijk}, \theta_k &\sim \begin{cases} \delta_{\widetilde{V}_{\lambda_{ij}k}}, & \text{if } Z_{ijk} = 0, \\ \text{Multinomial}(1, \theta_k), & \text{if } Z_{ijk} = 1. \end{cases} \\ V_{ijk} &\perp V_{i'j'k'} \text{ if } (i,j,k) \neq (i',j',k') \end{align*}\] Finally, we assume that \(p(\Lambda) \propto 1\), which captures the assumption that each legitimate configuration of \(\Lambda\) is equally likely a priori.
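To make the hit-and-miss mechanism concrete, here is a minimal sketch in R that simulates a single matching field under this model; the number of labels, records, and hyperparameters are made up for illustration (the distortion probability is exaggerated so that distortions actually appear).
library(MCMCpack)  # for rdirichlet

set.seed(1)
M_k     <- 4                                      # possible labels for this field
theta_k <- c(rdirichlet(1, rep(1, M_k)))          # field-value probabilities
alpha_k <- rbeta(1, 1, 99)                        # distortion probability
N_pop   <- 10                                     # latent entities
N_rec   <- 15                                     # observed records
v_true  <- sample(1:M_k, N_pop, replace = TRUE, prob = theta_k)   # latent field values
lambda  <- sample(1:N_pop, N_rec, replace = TRUE)                 # entity behind each record
z       <- rbinom(N_rec, 1, alpha_k)                              # distortion indicators
# hit-and-miss: copy the latent value when undistorted, otherwise redraw from theta_k
v_obs   <- ifelse(z == 0, v_true[lambda], sample(1:M_k, N_rec, replace = TRUE, prob = theta_k))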
We adopt the following logistic regression model from Robert and Casella (2005), \[\begin{align*} \widetilde{Y}_{\ell} \;|\; \widetilde{X}_{\ell}, \beta_0, \beta &\overset{\text{ind.}}{\sim} \text{Bernoulli} \left(\frac{\exp\left(\beta_0 + \widetilde{X}'_{\ell}\beta\right)}{1 + \exp\left(\beta_0 + \widetilde{X}'_{\ell}\beta\right)}\right) \\ \beta_0 &\;\sim \frac{\exp(\beta_0)}{b_0} \exp\left(-\frac{\exp(\beta_0)}{b_0}\right) \\ \beta &\;\sim \mathcal{N}_P(\mu_\beta, \Sigma_\beta) \\ \beta_0 &\;\perp \beta \perp \widetilde{X}_\ell \end{align*}\] We also assume that, conditional on \(\widetilde{Y}_{\lambda_{1j}}\) and \(\widetilde{X}_{\lambda_{2j}}\) respectively, \(Y_{j}\) and \(X_{j}\) are generated independently within their respective databases, and finally that \(Y_{j} \perp V_{1j}\) and \(X_{j} \perp V_{2j}\), so there is no relation between the fields we are using for matching and those we are using for regression.
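One useful way to read the prior on \(\beta_0\): it is the density of \(\log E\) where \(E\) is exponentially distributed with mean \(b_0\), so a draw only requires transforming an exponential draw. This is exactly how the implementation at the end of the post samples its intercept proposals, with \(b_0 = \exp(a_0 - \psi(1))\) so that the prior mean of \(\beta_0\) equals \(a_0\). A quick sketch, with an arbitrary \(b_0\):
b0 <- 2                                          # arbitrary prior scale for illustration
beta0_draws <- log(rexp(10000, rate = 1 / b0))   # draws from the prior on beta_0
mean(beta0_draws)                                # approximately log(b0) - 0.5772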
We make two simplifying assumptions for this model: i) there is no duplication within databases, i.e. if \(j \neq j'\), then \(\lambda_{ij} \neq \lambda_{ij'}\), and ii) there is no distortion in the regression data, so that conditional on \(\lambda_{ij}\), we have \[\begin{align*} Y_{j} &= \widetilde{Y}_{\lambda_{1j}} \\ X_{j} &= \widetilde{X}_{\lambda_{2j}}. \end{align*}\] One subtlety of this approach is that we may end up with a configuration of \(\Lambda\) for which there exists a latent entity indexed by \(\ell\) such that \[\{i : \exists\, j \text{ s.t. } \lambda_{ij} = \ell\} = \{1\} \text{ or } \{i : \exists\, j \text{ s.t. } \lambda_{ij} = \ell\} = \{2\}.\] In other words, there may be a latent entity that points to an individual that appears in one data set, but not both. In this case the density \(p(Y_{j}, X_{j'} \;|\; \lambda_{1j} = \lambda_{2j'} = \ell)\) will collapse to the marginal density of either \(Y_{j}\) or \(X_{j'}\). Furthermore, because there is no distortion in this regression model, all of the information we have about \((\widetilde{Y}_\ell, \widetilde{X}_\ell)\) is captured by \(\{(Y_{j}, X_{j'}) : \lambda_{1j} = \lambda_{2j'} = \ell\}\), so we can effectively drop \(\widetilde{Y}_\ell\) and \(\widetilde{X}_\ell\) from the model.
The full posterior distribution for this model is given by \[\begin{align*} p\left(\Lambda, \widetilde{\mathbf{V}}, \mathbf{Z}, \pmb{\alpha}, \pmb{\theta}, \beta \;|\; \mathbf{V}, \mathbf{Y}, \mathbf{X}\right) &\propto p(\mathbf{V} \;|\; \Lambda, \widetilde{\mathbf{V}}, \mathbf{Z}, \pmb{\alpha}, \pmb{\theta}) p(\widetilde{\mathbf{V}} \;|\; \Lambda, \pmb{\theta})p(\pmb{\theta}) p(\mathbf{Z}\;|\; \pmb{\alpha})p(\pmb{\alpha}) \tag{Matching Model} \\ &\quad \times p(\mathbf{Y}, \mathbf{X} \;|\; \Lambda,\beta)p(\beta) \tag{Regression Model} \\ &\quad \times p(\Lambda) \end{align*}\] Focusing on the term \(p(\mathbf{Y}, \mathbf{X} \;|\; \Lambda,\beta)\), we can define the sets \[\begin{align*} S_y &= \{\ell : \exists j \text{ s.t. } \lambda_{1j} = \ell \text{ and } \nexists j' \text{ s.t. } \lambda_{2j'} = \ell\} \\ S_x &= \{\ell : \nexists j \text{ s.t. } \lambda_{1j} = \ell \text{ and } \exists j' \text{ s.t. } \lambda_{2j'} = \ell\} \\ S_{xy} &= \{1, \ldots, N_\text{pop}\} \setminus \{S_y \cup S_x\}. \end{align*}\] Then we have \[\begin{align*} p(\mathbf{Y}, \mathbf{X} \;|\; \Lambda,\beta) &= p(\widetilde{\mathbf{Y}}, \widetilde{\mathbf{X}} \;|\; \Lambda, \beta) \\ &= \prod_{\ell \in S_y} p(\widetilde{Y}_\ell \;|\;\beta) \prod_{\ell \in S_x} p(\widetilde{X}_\ell) \prod_{\ell \in S_{xy}} p(\widetilde{Y}_\ell \;|\; \widetilde{X}_\ell, \beta) p(\widetilde{X}_\ell) \tag{$\widetilde{X}_\ell \perp \beta$}\\ &= \prod_{\ell \in S_y} \int_{X_\ell} \left(\frac{\exp\left(\beta_0 + \widetilde{x}'_{\ell}\beta\right)}{1 + \exp\left(\beta_0 + \widetilde{x}'_{\ell}\beta\right)}\right)^{\widetilde{Y}_\ell}\left(\frac{1}{1 + \exp\left(\beta_0 + \widetilde{x}'_{\ell}\beta\right)}\right)^{1 - \widetilde{Y}_\ell} p(\widetilde{x}_\ell) \; d \widetilde{x}_\ell \\ &\quad \times \prod_{\ell \in S_{xy}} \left(\frac{\exp\left(\beta_0 + \widetilde{X}'_{\ell}\beta\right)}{1 + \exp\left(\beta_0 + \widetilde{X}'_{\ell}\beta\right)}\right)^{\widetilde{Y}_\ell}\left(\frac{1}{1 + \exp\left(\beta_0 + \widetilde{X}'_{\ell}\beta\right)}\right)^{1 - \widetilde{Y}_\ell} \\ &\quad \times \prod_{\ell \in S_x \cup S_{xy}} p(\widetilde{X}_\ell). \end{align*}\] Note that \(S_x \cup S_{xy}\) is just the set of latent entities pointing to an observed covariate record in \(\mathcal{D}_2\), i.e. \[\{\widetilde{X}_\ell : \ell \in S_x \cup S_{xy}\} = \mathbf{X},\] so the last line above simplifies to \(p(\mathbf{X})\). Rewriting \(p(\mathbf{Y}, \mathbf{X} \;|\; \Lambda,\beta)\) in terms of the observed data (and using the assumption that the data generating process for \(X_\ell\) is the same across \(\ell\)), we have \[\begin{align*} p(\mathbf{Y}, \mathbf{X} \;|\; \Lambda,\beta) &\underset{\Lambda, \beta}{\propto} \prod_{j\;:\;\lambda_{1j} \in S_y} \int_{X} \left(\frac{\exp\left(\beta_0 + x'\beta\right)}{1 + \exp\left(\beta_0 + x'\beta\right)}\right)^{Y_{j}}\left(\frac{1}{1 + \exp\left(\beta_0 + x'\beta\right)}\right)^{1 - Y_{j}} p(x) \; dx \\ &\quad \times \prod_{\substack{(j,j') \;:\; \lambda_{1j}=\lambda_{2j'}}} \left(\frac{\exp\left(\beta_0 + X'_{j'}\beta\right)}{1 + \exp\left(\beta_0 + X'_{j'}\beta\right)}\right)^{Y_{j}}\left(\frac{1}{1 + \exp\left(\beta_0 + X'_{j'}\beta\right)}\right)^{1 - Y_{j}}. \end{align*}\] Plugging in the rest of our parametric forms, we have \[\begin{align*} p\left(\Lambda, \widetilde{\mathbf{V}}, \mathbf{Z}, \pmb{\alpha}, \pmb{\theta}, \beta \;|\; \mathbf{V}, \mathbf{Y}, \mathbf{X}\right) &\propto \prod_{i,j,k,m} \left[(1-Z_{ijk})\delta_{\widetilde{V}_{\lambda_{ij}k}}(V_{ijk}) + Z_{ijk}\theta_{km}^{\mathbf{1}\{V_{ijk} = m\}}\right] \\ &\quad \times \prod_{k,m} \theta_{km}^{\eta_{km} -1 + \sum_{\ell = 1}^{N_\text{pop}}\mathbf{1}\{\widetilde{V}_{\ell k} = m\}}\\ &\quad \times \prod_{k} \alpha_k^{a_k - 1 +\sum_{i=1}^2\sum_{j=1}^{N_i} Z_{ijk}} (1-\alpha_k)^{b_k - 1 +\sum_{i=1}^2\sum_{j=1}^{N_i} (1-Z_{ijk})} \\ &\quad \times \prod_{j\;:\;\lambda_{1j} \in S_y} \int_{X} \left(\frac{\exp\left(\beta_0 + x'\beta\right)}{1 + \exp\left(\beta_0 + x'\beta\right)}\right)^{Y_{j}}\left(\frac{1}{1 + \exp\left(\beta_0 + x'\beta\right)}\right)^{1 - Y_{j}} p(x) \; d x \\ &\quad \times \prod_{\substack{(j,j') \;:\; \lambda_{1j}=\lambda_{2j'}}} \left(\frac{\exp\left(\beta_0 + X'_{j'}\beta\right)}{1 + \exp\left(\beta_0 + X'_{j'}\beta\right)}\right)^{Y_{j}}\left(\frac{1}{1 + \exp\left(\beta_0 + X'_{j'}\beta\right)}\right)^{1 - Y_{j}}. \end{align*}\]
We simulate the posterior distribution via Gibbs sampling. We can directly draw from the full conditional distributions for \(\pmb{\alpha}\) and \(\pmb{\theta}\) \[\begin{align*} \alpha_k \;|\; \Lambda, \widetilde{\mathbf{V}}, \mathbf{Z}, \pmb{\theta}, \beta, \mathbf{V}, \mathbf{Y}, \mathbf{X} &\overset{\text{ind.}}{\sim} \text{Beta}\left(a_k +\sum_{i=1}^2\sum_{j=1}^{N_i} Z_{ijk},\; b_k +\sum_{i=1}^2\sum_{j=1}^{N_i} (1-Z_{ijk})\right) \\ \theta_k \;|\; \Lambda, \widetilde{\mathbf{V}}, \mathbf{Z}, \pmb{\alpha}, \beta, \mathbf{V}, \mathbf{Y}, \mathbf{X} &\overset{\text{ind.}}{\sim} \text{Dirichlet}(\pmb{\eta}_k + \mathbf{c}_k) \end{align*}\] where \(\mathbf{c}_k = \{c_m\}_{m=1}^{M_k}\) and \[c_m = \left[\sum_{\ell=1}^{N_\text{pop}} \mathbf{1}\{\widetilde{V}_{\ell k} = m\}\right] + \left[\sum_{i=1}^2\sum_{j=1}^{N_i} \mathbf{1}(V_{ijk}=m)Z_{ijk}\right].\]
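As a minimal sketch of these two conjugate updates for a single field \(k\) (with made-up counts and hyperparameters standing in for the quantities defined above):
library(MCMCpack)  # for rdirichlet

a_k   <- 1; b_k <- 9999                 # Beta hyperparameters for the distortion probability
eta_k <- rep(1, 4)                      # Dirichlet hyperparameters for a field with 4 labels
Z_k   <- c(0, 1, 0, 0, 1, 0)            # distortion indicators across all records (made up)
c_k   <- c(3, 2, 4, 1)                  # counts c_k as defined above (made up)

alpha_k <- rbeta(1, a_k + sum(Z_k), b_k + sum(1 - Z_k))   # Beta full conditional
theta_k <- c(rdirichlet(1, eta_k + c_k))                  # Dirichlet full conditional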
Next, note that if \(Z_{ijk} = 0\) and \(\lambda_{ij} = \ell\), then we know \(\widetilde{V}_{\ell k} = V_{ijk}\). On the other hand, if \(Z_{ijk} = 1\), then \(\widetilde{V}_{\ell k}\) is just a draw from \(\text{Multinomial}(1, \theta_k)\), so we have \[\begin{align*} \widetilde{V}_{\ell k} \;|\; \Lambda, \mathbf{Z}, \pmb{\alpha}, \pmb{\theta}, \beta, \mathbf{V}, \mathbf{Y}, \mathbf{X} &\overset{\text{ind.}}{\sim} \begin{cases} \delta_{V_{ijk}}, & \text{if } \exists\; (i,j) \text{ s.t. } \lambda_{ij} = \ell \text{ and } Z_{ijk} = 0, \\ \text{Multinomial}(1, \theta_k), &\text{ otherwise.} \end{cases} \end{align*}\] Similarly, if \(V_{ijk} \neq \widetilde{V}_{\lambda_{ij}k}\), then we know \(Z_{ijk} = 1\). On the other hand, if \(V_{ijk} = \widetilde{V}_{\lambda_{ij}k}\), then we need to calculate \[\begin{align*} P(Z_{ijk} = 1 \;|\; \Lambda, \widetilde{\mathbf{V}}, \pmb{\alpha}, \pmb{\theta}, \beta, \mathbf{V}, \mathbf{Y}, \mathbf{X}) = \frac{P(\Lambda, \widetilde{\mathbf{V}}, \pmb{\alpha}_{-k}, \pmb{\theta}, \beta, \mathbf{V}, \mathbf{Y}, \mathbf{X} \;|\; Z_{ijk} = 1, \alpha_k)P(Z_{ijk} = 1 \;|\; \alpha_k) }{\sum_{z\in\{0,1\}} P(\Lambda, \widetilde{\mathbf{V}}, \pmb{\alpha}_{-k}, \pmb{\theta}, \beta, \mathbf{V}, \mathbf{Y}, \mathbf{X} \;|\; Z_{ijk} = z, \alpha_k)P(Z_{ijk} = z\;|\;\alpha_k)} \end{align*}\] But by the previous conditional independence results, we have \[\begin{align*} p(\Lambda, \widetilde{\mathbf{V}}, \pmb{\alpha}_{-k}, \pmb{\theta}, \beta, \mathbf{V}, \mathbf{Y}, \mathbf{X} \;|\; Z_{ijk}, \alpha_k)p(Z_{ijk} \;|\; \alpha_k) &\underset{Z_{ijk}}{\propto} \prod_{m=1}^{M_k} \left[(1-Z_{ijk})\delta_{\widetilde{V}_{\lambda_{ij}k}}(V_{ijk}) + Z_{ijk}\theta_{km}^{\mathbf{1}\{V_{ijk} = m\}}\right] \\ &\quad \times \alpha_k^{a_k - 1 + Z_{ijk}} (1-\alpha_k)^{b_k - Z_{ijk}} \\ &= \prod_{m=1}^{M_k} \left[(1-Z_{ijk}) + Z_{ijk}\theta_{km}^{\mathbf{1}\{V_{ijk} = m\}}\right] \alpha_k^{a_k - 1 + Z_{ijk}} (1-\alpha_k)^{b_k - Z_{ijk}} \tag{assumed $\widetilde{V}_{\lambda_{ij}k} = V_{ijk}$} \end{align*}\] which gives \[\begin{align*} P(Z_{ijk} = 1 \;|\; \Lambda, \widetilde{\mathbf{V}}, \pmb{\alpha}, \pmb{\theta}, \beta, \mathbf{V}, \mathbf{Y}, \mathbf{X}) = \frac{\alpha_k \prod_{m=1}^{M_k} \theta_{km}^{\mathbf{1}\{V_{ijk} = m\}}}{(1-\alpha_k) + \alpha_k \prod_{m=1}^{M_k} \theta_{km}^{\mathbf{1}\{V_{ijk} = m\}}} \end{align*}\] and \[\begin{align*} Z_{ijk} \;|\; \Lambda, \widetilde{\mathbf{V}}, \pmb{\alpha}, \pmb{\theta}, \beta, \mathbf{V}, \mathbf{Y}, \mathbf{X} &\overset{\text{ind.}}{\sim} \text{Bernoulli}(\pi_{ijk}) \\ \pi_{ijk} &= \begin{cases} 1, & \text{if } V_{ijk} \neq \widetilde{V}_{\lambda_{ij}k}, \\ \frac{\alpha_k \prod_{m=1}^{M_k} \theta_{km}^{\mathbf{1}\{V_{ijk} = m\}}}{(1-\alpha_k) + \alpha_k \prod_{m=1}^{M_k} \theta_{km}^{\mathbf{1}\{V_{ijk} = m\}}}, & \text{if } V_{ijk} = \widetilde{V}_{\lambda_{ij}k}. \end{cases} \end{align*}\] We do not have a closed form solution for the posterior distribution of \(\Lambda\), but we can use a Metropolis step to sample new configurations of the linkage structure, with the caveat that the Metropolis step must reject any proposal that would give \(Z_{ijk} = 0\) and \(V_{ijk} \neq \widetilde{V}_{\lambda_{ij}k}\) or \(\lambda_{ij} = \lambda_{ij'}\) for some \(j \neq j'\). In practice we use the SMERE algorithm first introduced in R. Steorts, Hall, and Fienberg (2014) to generate posterior samples of \(\Lambda\).
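The distortion-indicator update is easy to express directly. Here is a small sketch for one record/field pair, with made-up values of \(\alpha_k\), \(\theta_k\), and the observed and latent field values:
alpha_k  <- 0.01
theta_k  <- c(a = 0.5, b = 0.3, c = 0.2)   # field-value probabilities
v_ijk    <- "b"                            # observed value
v_latent <- "b"                            # current latent value for this record's entity

if (v_ijk != v_latent) {
  pi_ijk <- 1                              # disagreement forces Z_ijk = 1
} else {
  theta_km <- theta_k[[v_ijk]]
  pi_ijk   <- alpha_k * theta_km / ((1 - alpha_k) + alpha_k * theta_km)
}
Z_ijk <- rbinom(1, 1, pi_ijk)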
We can also use a Metropolis step to draw samples from the full
conditional distribution of \(\beta\).
However, for this step there is the added wrinkle that we need to
calculate the integral \[\begin{align*}
\int_{X} \left(\frac{\exp\left(\beta_0 + x'\beta\right)}{1 +
\exp\left(\beta_0 + x'\beta\right)}\right)^{Y_j}\left(\frac{1}{1 +
\exp\left(\beta_0 + x'\beta\right)}\right)^{1 - Y_j} p(x) \; d x.
\end{align*}\] We do not have a closed form solution for this
integral, so we propose using kernel density estimation to generate an
approximation \(\hat{p}(x)\) which we
can then use to calculate this integral numerically. In particular, we
use the ks
package in R to estimate \(\hat{p}(x)\) (Wang 2020).
A full implementation of the Gibbs sampler is given at
the end of this post.
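As a rough sketch of that approximation (mirroring the kde/predict calls used in the implementation below, but with made-up covariates and coefficients), the weights are the normalized density estimates at the observed covariate values and the integral is replaced by a weighted average:
library(ks)

set.seed(1)
X_obs <- matrix(rnorm(200), ncol = 2)      # observed covariates from database 2 (made up)
beta0 <- 0.5; beta <- c(2, -5)             # coefficients (made up)

kdens <- kde(X_obs)                        # kernel density estimate of p(x)
wts   <- predict(kdens, x = X_obs)         # estimated density at each observed point
wts   <- wts / sum(wts)                    # normalize to weights

eta <- beta0 + X_obs %*% beta
p1  <- exp(eta) / (1 + exp(eta))           # P(Y = 1 | x, beta)
int_y1 <- sum(wts * p1)                    # approximation of the integral when Y_j = 1
int_y0 <- sum(wts * (1 - p1))              # and when Y_j = 0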
As in R. C. Steorts, Tancredi, and Liseo
(2018), we can test the proposed method using data generated from
the RL500
data set found in the blink
package
developed by R. C. Steorts (2015).
RL500
is a synthetic data set that contains variables on
which we can perform the matching operations, including variables for
first and last name, and birth day, month, and year. Of the 500 records,
50 are duplicates, meaning that the 500 records refer to a total of 450
unique individuals. Each duplicate contains at least one typographical
error. Some errors are meant to represent typing mistakes, while others
are alternate spellings of homophones. These errors are what make
matching the duplicated records to each other difficult.
I augment the RL500
data set with the covariates, \(X_\ell\), and the response variable, \(Y_\ell\), for the logistic regression task.
Next, we can fix a \(\beta\) vector at
\((2, -5)^T\) and sample the covariates
from a multivariate normal distribution \[\begin{align*}
X_\ell \overset{\text{i.i.d.}}{\sim} N_2\left( \begin{bmatrix} 0 \\
0\end{bmatrix},\, \begin{bmatrix} 1 & -0.3 \\ -0.3 & 1
\\ \end{bmatrix} \right).
\end{align*}\] The response variables \(Y_\ell\) are then sampled conditional on
\(X_\ell\) according to the logistic
model.
Note that in this sampling design, each of the RL500
duplicates has an identical value for \(X_\ell\) and \(Y_\ell\). After augmenting the matching
data with the simulated regression data, the data set is split into two
equally sized databases. Each database receives 200 unique individuals,
with no match in the other database, and 50 of the 100 total duplicates.
The first database receives the matching variables along with the
response variable, while the second database receives the matching
variables and the covariates.
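A sketch of this augmentation step is below. I do not reproduce the RL500 identity information here; the id vector is a made-up stand-in that labels which of the 450 individuals each of the 500 records refers to, and the intercept is taken to be zero.
library(mvtnorm)

set.seed(1)
# 500 records referring to 450 individuals, 50 of whom appear twice (stand-in for RL500)
id <- c(1:450, sample(1:450, 50))

beta    <- c(2, -5)
Sigma_x <- matrix(c(1, -0.3, -0.3, 1), 2, 2)

X_ind <- rmvnorm(450, mean = c(0, 0), sigma = Sigma_x)    # one covariate pair per individual
p_ind <- exp(X_ind %*% beta) / (1 + exp(X_ind %*% beta))  # logistic model probabilities
Y_ind <- rbinom(450, 1, p_ind)

X_rec <- X_ind[id, ]   # records generated by the same individual share X ...
Y_rec <- Y_ind[id]     # ... and Y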
We have two goals. First, correctly match the duplicates between the two databases, while rejecting matches for observations that do not have a true link. Second, perform a logistic regression conditional on these matches to estimate the posterior distribution of the beta coefficients.
To benchmark the performance of our method, we can also link our
constructed data sets using a modified version of the method first
introduced by Fellegi and Sunter (1969), as
it is implemented in the RecordLinkage
package in R (Sariyar and Borg 2010). The Fellegi-Sunter
method is a Frequentist probabilistic method that assigns weights to
each possible match, with more likely matches getting higher weights.
Since our method enforces injective correspondences between the two
databases, we choose the most likely match proposed by Fellegi-Sunter
and designate any other links across databases as non-matches. If more
than one observation is matched to a single observation in the other database,
we again pick the more likely of the two matches.
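A sketch of this benchmark using RecordLinkage is below; db1 and db2 stand for the two constructed databases, the id columns hold the true individual labels (available only because the data are synthetic), and the greedy de-duplication at the end is a simple approximation of the injective matching described above.
library(RecordLinkage)

key_vars <- c("fname_c1", "lname_c1", "by", "bm", "bd")   # assumed matching-field names

pairs <- compare.linkage(db1[, key_vars], db2[, key_vars],
                         identity1 = db1$id, identity2 = db2$id)
pairs <- emWeights(pairs)                        # Fellegi-Sunter style weights via EM
res   <- emClassify(pairs, threshold.upper = 0)  # accept pairs with weight above the cut-off

# keep the highest-weight accepted link for each record
cand <- data.frame(res$pairs[, c("id1", "id2")], weight = res$Wdata)
cand <- cand[res$prediction == "L", ]
cand <- cand[order(-cand$weight), ]
cand <- cand[!duplicated(cand$id1) & !duplicated(cand$id2), ]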
We must choose a cutoff for accepting or rejecting each proposed
match. This threshold is chosen via cross-validation. Using the larger
RL10000
data set, I randomly construct data sets with 500
observations and 20% duplication as analogues to RL500. I
construct 1,000 such data sets, estimate the linkage structure with our
modified Fellegi-Sunter procedure, and calculate the average F1-score
along a grid of cut-offs. The cross-validation curve with 95% quantiles
is depicted in Figure 1. We see that the F1-score no longer increases
after zero, so heuristically it makes sense to place the threshold here
and avoid the risk of over-fitting that comes with a higher threshold.
That being said, the matching results on RL500
are improved
by choosing a higher cut-off near 12, and at this cut-off the regression
on the Fellegi-Sunter matches will outperform our model.
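The threshold search itself is straightforward once we have weights and (in the simulated analogues) true match labels for each candidate pair; the sketch below computes the F1-score along a grid of cut-offs for a single made-up set of weights, whereas in the post this curve is averaged over the 1,000 simulated data sets.
set.seed(1)
truth   <- c(rep(TRUE, 50), rep(FALSE, 450))                   # made-up candidate pairs
weights <- ifelse(truth, rnorm(500, 8, 3), rnorm(500, -4, 3))  # made-up FS weights

f1_at <- function(cutoff, weights, truth) {
  pred <- weights >= cutoff
  tp   <- sum(pred & truth)
  prec <- tp / sum(pred)
  rec  <- tp / sum(truth)
  2 * prec * rec / (prec + rec)
}

cutoffs  <- seq(-10, 20, by = 0.5)
f1_curve <- sapply(cutoffs, f1_at, weights = weights, truth = truth)
best_cut <- cutoffs[which.max(f1_curve)]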
For the Bayesian model introduced above, I chose uniform, uninformative priors over each of the model parameters with the exception of the distortion probabilities \(\alpha_k\). The choice of \(a_k\) and \(b_k\) is the only subjective choice in this model. R. C. Steorts (2014) previously showed that without accounting for string distance or more nuanced forms of distortion, the posterior number of true entities is extremely sensitive to the choice of \(a_k\) and \(b_k\), but that a choice of \(a_k = 1\) and \(b_k = 9999\) (which corresponds to a prior distortion probability of 1 in 10,000) results in a posterior number of latent entities that is stable around the truth. I choose this parameter setting in our simulation as well. I ran this MCMC algorithm for 10,000 Gibbs iterations, with 5 Metropolis proposals per iteration and 100 split-merge proposals per Metropolis step (a total of 5,000,000 split-merge proposals), and removed the first 2,000 Gibbs samples as burn-in, leaving a final chain of 8,000 iterations.
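For reference, a call to the logitRL function defined at the end of this post that matches these settings would look roughly as follows; db1 and db2 are again the two constructed databases, and the matching-field names are assumed.
key_vars <- c("fname_c1", "lname_c1", "by", "bm", "bd")

fit <- logitRL(y_data = db1, x_data = db2, key_vars = key_vars,
               S_g = 10000, S_m = 5, S_t = 100,
               a = rep(1, length(key_vars)),       # prior distortion probability of 1/10,000
               b = rep(9999, length(key_vars)),
               weighting = "kde")

keep      <- 2001:10000                            # drop the first 2,000 samples as burn-in
beta_post <- fit$beta[keep, ]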
To compare the quality of our matching results with Fellegi-Sunter, I chose to declare two records as linked if their posterior probability of being linked is greater than half, i.e. we link \(R_{1j}\) and \(R_{2j'}\) if \[\begin{align*} \frac{1}{G} \sum_{g=1}^G \mathbf{1}\{\lambda^{(g)}_{1j} = \lambda^{(g)}_{2j'}\} \geq 0.5. \end{align*}\] For both cut-off values of Fellegi-Sunter, we link records that have weights greater than the cut-off. We can then calculate the precision, sensitivity, and F1-score based on these matches. From the results in Table 1, we can see that our method and the hand-picked Fellegi-Sunter have very similar precision and sensitivities with nearly identical F1-scores, while the cross-validation Fellegi-Sunter had the best sensitivity but very low precision.
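For completeness, these posterior link probabilities can be read off directly from the lambda chain returned by logitRL in the sketch above (rows index Gibbs iterations; columns index records, ordered as the response database followed by the covariate database):
Ny <- nrow(db1)
Nx <- nrow(db2)

link_prob <- matrix(0, nrow = Ny, ncol = Nx)
for (j in 1:Ny) {
  for (jp in 1:Nx) {
    # share of retained iterations in which the two records point to the same latent entity
    link_prob[j, jp] <- mean(fit$lambda[keep, j] == fit$lambda[keep, Ny + jp])
  }
}
declared_links <- which(link_prob >= 0.5, arr.ind = TRUE)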
Next, for each Frequentist method, using the response and covariate variables from the matched observations, we calculate the MLE and standard errors for the logistic regression model. In Table 2, we compare the true values of the coefficients with the posterior means of our MCMC chains, the MLE values for Fellegi-Sunter with the cross-validation and hand-picked cut-offs, and the perfectly matched MLE, which uses just the observations that are truly linked. In this table, we can see that the perfectly matched MLE is the closest to the truth for the non-intercept coefficients, followed by the hand-picked Fellegi-Sunter and our MCMC posterior mean. The Fellegi-Sunter with the cross-validation cut-off performs very poorly. This makes sense in light of the low precision of the method, since it is effectively fitting the model to many observations with independent covariate and response values.
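For the Frequentist fits, the MLE and standard errors come from an ordinary glm fit to the matched pairs; a small self-contained sketch with simulated stand-in data:
set.seed(1)
matched <- data.frame(x1 = rnorm(50), x2 = rnorm(50))           # stand-in matched covariates
matched$y <- rbinom(50, 1, plogis(2 * matched$x1 - 5 * matched$x2))

fit_mle <- glm(y ~ x1 + x2, family = binomial, data = matched)
summary(fit_mle)$coefficients                                   # estimates and standard errors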
Finally, we want to compare our MCMC estimate of the posterior distribution with the estimated asymptotic normal distributions of the MLE for the Frequentist methods. In Figures 2-4 we can see the comparison of these distributions. Based on this comparison, the perfect match MLE seems to have the best distribution, having a mean value closest to the truth while also reflecting the uncertainty in only having 50 observations. By these same criteria, the hand-picked cut-off Fellegi-Sunter seems to perform second-best, our method third, and the cross-validation Fellegi-Sunter fourth.
Based on these results, it appears that the Fellegi-Sunter method with a hand-tuned cut-off performs similarly to our method in the matching task and outperforms our method in the regression task. However, in most real world record linkage scenarios, we do not have access to unique identifying information, so it is not possible to tune a model like Fellegi-Sunter. We are more likely to have to rely on a heuristic cut-off like the one we found via cross-validation. We see that our unsupervised approach to linking records strongly outperforms Fellegi-Sunter in this case, both in terms of providing better point estimates and in properly capturing the uncertainty around the parameters, which Fellegi-Sunter completely misses by treating the estimated linkage structure as certain.
library(RecordLinkage)
library(mvtnorm)
library(dplyr)
library(MCMCpack)
library(ks)
emp_dist <- function(data, key_vars){
  # empirical frequency of each value appearing in each matching field
  emp_freqs <- list()
  for (key_var in key_vars) {
    emp_freq <- data %>%
      dplyr::select(key_var) %>%
      table %>%
      prop.table %>%
      data.frame
    names(emp_freq) <- c(paste0(key_var), "freq")
    emp_freqs[[key_var]] <- emp_freq
  }
  return(emp_freqs)
}
grab_x <- function(link, data, cov_names) {
  # covariates for the record in the x database assigned to this link label
  data[data$link == link & data$file == "x", cov_names]
}
grab_y <- function(link, data) {
  # response for the record in the y database assigned to this link label
  data[data$link == link & data$file == "y", "y"]
}
logitRL <- function(y_data, x_data, key_vars, S_g, S_m = 1, S_t = 1,
                    eta = NULL, a = NULL, b = NULL, mu = NULL, Sigma = NULL,
                    weighting = "empirical", a0 = 0) {
  # augmenting data
  x_data$intercept <- 1
  cov_names <- names(x_data)[!(names(x_data) %in% key_vars)]
  y_data[cov_names] <- NA
  y_data$file <- "y"
  x_data$y <- NA
  x_data$file <- "x"
  full_data <- rbind(y_data, x_data)
  X <- as.matrix(x_data[,cov_names])
  # model parameters
  K <- length(key_vars)
  P <- length(cov_names)
  N <- nrow(full_data)
  Nx <- nrow(x_data)
  Ny <- nrow(y_data)
  # weights used to approximate the integral over p(x)
  if (weighting == "empirical") {
    wts <- rep(1/Nx, Nx)
  } else if (weighting == "kde") {
    kdens <- kde(x_data[,cov_names[!(cov_names %in% "intercept")]])
    wts <- predict(kdens, x = x_data[,cov_names[!(cov_names %in% "intercept")]])
    wts <- wts / sum(wts)
  } else if (weighting == "normal") {
    # TODO: need to update weights within sampler
  }
  # initialize linkage structure
  full_data$link <- as.character(1:nrow(full_data))
  linkage_chain <- matrix(nrow = S_g, ncol = N)
  # default priors
  b0 <- exp(a0 - digamma(1))
  if (is.null(a)) a <- rep(1, K)
  if (is.null(b)) b <- rep(1, K)
  if (is.null(mu)) mu <- rep(0, P)
  if (is.null(Sigma)) Sigma <- diag(100, P)
  if (all(diag(diag(Sigma)) == Sigma)) {
    log_beta_prior <- function(x, mu, Sigma) {
      sum(mapply(dnorm, x = x, mean = mu, sd = sqrt(diag(Sigma)), log = TRUE))
    }
  } else {
    log_beta_prior <- function(x, mu, Sigma) {
      dmvnorm(x, mean = mu, sigma = Sigma, log = TRUE)
    }
  }
  if (is.null(eta)) {
    eta <- list()
    for (var in key_vars) {
      eta[[var]] <- rep(1, length(unique(full_data[[var]])))
    }
  }
  # initialize thetas
  thetas <- emp_dist(full_data, key_vars)
  theta_chain <- list()
  for (var in key_vars) {
    theta_chain[[var]] <- matrix(nrow = S_g, ncol = nrow(thetas[[var]]))
  }
  # initialize true field values
  V_true <- list()
  for (v in full_data$link) {
    V_true[[v]] <- full_data[full_data$link == v, key_vars]
  }
  V_df <- do.call(rbind, V_true)
  V_counts <- list()
  for (var in key_vars) {
    var_count <- data.frame(count = colSums(outer(V_df[[var]], unique(full_data[[var]]), `==`)))
    var_count[[var]] <- unique(full_data[[var]])
    V_counts[[var]] <- var_count[order(var_count[[var]]),]
  }
  # initialize distortion
  alphas <- data.frame(key_vars = key_vars, alpha = rep(0, K))
  alpha_chain <- matrix(nrow = S_g, ncol = K)
  Z <- data.frame(matrix(FALSE, nrow = N, ncol = K))
  names(Z) <- key_vars
  # initialize beta
  betas <- rep(0, P)
  beta_chain <- matrix(nrow = S_g, ncol = P)
  colnames(beta_chain) <- cov_names
  # progress bar
  pb <- txtProgressBar(min = 1, max = S_g, initial = 1, style = 3)
  # sampler
  for (s in 1:S_g) {
    setTxtProgressBar(pb, s)
    # theta gibbs update
    for (var in key_vars) {
      cnt <- colSums(outer(full_data[[var]][Z[[var]]], V_counts[[var]][[var]], `==`))
      c_k <- V_counts[[var]][["count"]] + cnt
      thetas[[var]]$freq <- c(rdirichlet(1, eta[[var]] + c_k))
      theta_chain[[var]][s,] <- thetas[[var]]$freq
    }
    # alpha gibbs update
    a_post <- a + apply(Z, 2, sum)
    b_post <- b + apply((1-Z), 2, sum)
    alphas$alpha <- mapply(function(a,b) rbeta(n = 1, a, b), a_post, b_post)
    alpha_chain[s,] <- alphas$alpha
    # regression data
    linked <- full_data$link[duplicated(full_data$link)]
    if (length(linked) > 0) {
      X_comp <- lapply(linked, grab_x, data = full_data, cov_names = cov_names) %>%
        do.call(rbind, .) %>%
        as.matrix
      y_comp <- sapply(linked, grab_y, data = full_data)
    }
    y_miss <- full_data$y[full_data$file == "y" & !(full_data$link %in% linked)]
    for (m in 1:S_m) {
      # beta metropolis update
      for (p in 1:P) {
        betas_prop <- betas
        if (cov_names[p] == "intercept") {
          betas_prop[p] <- log(rexp(1, 1 / b0))
          log.r <- 0
        } else {
          betas_prop[p] <- rnorm(1, mean = betas[p])
          log.r <- log_beta_prior(betas_prop, mu, Sigma) - log_beta_prior(betas, mu, Sigma)
        }
        XB <- X %*% betas
        XB_prop <- X %*% betas_prop
        pr <- exp(XB) / (1 + exp(XB))
        pr_prop <- exp(XB_prop) / (1 + exp(XB_prop))
        log_mean_p1 <- log(sum(wts * pr))
        log_mean_p0 <- log(sum(wts * (1 - pr)))
        log_mean_p1_prop <- log(sum(wts * pr_prop))
        log_mean_p0_prop <- log(sum(wts * (1 - pr_prop)))
        if (length(linked) > 0) {
          XB_comp <- X_comp %*% betas
          pr_comp <- exp(XB_comp) / (1 + exp(XB_comp))
          XB_comp_prop <- X_comp %*% betas_prop
          pr_comp_prop <- exp(XB_comp_prop) / (1 + exp(XB_comp_prop))
          log.r <- log.r + sum(y_comp * (XB_comp_prop - XB_comp) + log(1 + exp(XB_comp)) - log(1 + exp(XB_comp_prop)))
        }
        log.r <- log.r + sum(y_miss) * (log_mean_p1_prop - log_mean_p1) + (length(y_miss) - sum(y_miss)) * (log_mean_p0_prop - log_mean_p0)
        if (log(runif(1)) < log.r) {
          betas <- betas_prop
          log_mean_p0 <- log_mean_p0_prop
          log_mean_p1 <- log_mean_p1_prop
        }
      }
      # V update
      for (var in key_vars) {
        for (i in 1:length(V_true)) {
          v <- names(V_true)[i]
          if (all(Z[[var]][full_data$link == v])) {
            V_true[[v]][[var]] <- sample(thetas[[var]][[var]], 1, prob = thetas[[var]]$freq)
          } else {
            V_true[[v]][[var]] <- full_data[[var]][full_data$link == v & !Z[[var]]][1]
          }
        }
      }
      # Z update
      for (var in key_vars) {
        alpha_k <- alphas[alphas$key_vars == var, "alpha"]
        for (i in 1:N) {
          v_ijk <- full_data[[var]][i]
          if (v_ijk != V_true[[full_data$link[i]]][[var]]) {
            Z[[var]][i] <- TRUE
          } else {
            theta_km <- thetas[[var]]$freq[thetas[[var]][[var]] == v_ijk]
            p_ijk <- alpha_k * theta_km / (1-alpha_k + alpha_k * theta_km)
            Z[[var]][i] <- as.logical(rbinom(1, 1, p_ijk))
          }
        }
      }
      # linkage update
      for (t in 1:S_t) {
        log.r <- 0
        # sample two records to link or split
        while (TRUE) {
          idy <- sample(1:Ny, 1)
          idx <- sample((Ny+1):N, 1)
          linking <- full_data$link[idy] != full_data$link[idx]
          if (!linking) break # ok if splitting
          # if linking need to check no other entities linked to the sampled records
          if (sum(full_data$link %in% full_data$link[c(idy,idx)]) == 2) break
        }
        XB_sub <- as.matrix(full_data[idx, cov_names]) %*% betas
        Z_y_prop <- Z[idy,]
        Z_x_prop <- Z[idx,]
        Z_y_prop[1,] <- TRUE
        Z_x_prop[1,] <- TRUE
        # if linking, randomly select fields from sampled records to create new entity
        if (linking) {
          dex <- sample(c(TRUE, FALSE), size = K, replace = TRUE)
          v_prop <- full_data[idx, key_vars]
          v_prop[,dex] <- full_data[idy, key_vars][dex]
        }
        for (var in key_vars) {
          alpha_k <- alphas[alphas$key_vars == var, "alpha"]
          theta_y <- thetas[[var]]$freq[thetas[[var]][[var]] == full_data[[var]][idy]]
          theta_x <- thetas[[var]]$freq[thetas[[var]][[var]] == full_data[[var]][idx]]
          p_y <- alpha_k * theta_y / (1-alpha_k + alpha_k * theta_y)
          p_x <- alpha_k * theta_x / (1-alpha_k + alpha_k * theta_x)
          if (!linking) {
            Z_y_prop[[var]] <- as.logical(rbinom(1, 1, p_y))
            Z_x_prop[[var]] <- as.logical(rbinom(1, 1, p_x))
            theta_m <- thetas[[var]]$freq[thetas[[var]][[var]] == V_true[[full_data$link[idy]]][[var]]]
            log.r <- log.r + log(theta_y) + log(theta_x) - log(theta_m)
          } else {
            if (full_data[[var]][idy] == v_prop[[var]]) Z_y_prop[[var]] <- as.logical(rbinom(1, 1, p_y))
            if (full_data[[var]][idx] == v_prop[[var]]) Z_x_prop[[var]] <- as.logical(rbinom(1, 1, p_x))
            theta_m_y <- thetas[[var]]$freq[thetas[[var]][[var]] == V_true[[full_data$link[idy]]][[var]]]
            theta_m_x <- thetas[[var]]$freq[thetas[[var]][[var]] == V_true[[full_data$link[idx]]][[var]]]
            theta_m <- thetas[[var]]$freq[thetas[[var]][[var]] == v_prop[[var]]]
            log.r <- log.r + log(theta_m) - log(theta_m_y) - log(theta_m_x)
          }
          Z_y_diff <- Z_y_prop[[var]] - Z[[var]][idy]
          Z_x_diff <- Z_x_prop[[var]] - Z[[var]][idx]
          log.r <- log.r + Z_y_diff * log(theta_y) + Z_x_diff * log(theta_x)
          log.r <- log.r + (Z_y_diff + Z_x_diff) * (log(alpha_k) - log(1-alpha_k))
        }
        if (full_data$y[idy] == 1) {
          log.r <- log.r + (1 - 2*linking) * (XB_sub - log(1 + exp(XB_sub)) - log_mean_p1)
        } else {
          log.r <- log.r + (1 - 2*linking) * (-log(1 + exp(XB_sub)) - log_mean_p0)
        }
        if (log(runif(1)) < log.r) {
          if (linking) {
            V_true[[full_data$link[idy]]] <- v_prop
            V_true[[full_data$link[idx]]] <- NULL
            full_data$link[idx] <- full_data$link[idy]
          } else {
            V_true[[full_data$link[idy]]] <- full_data[idy, key_vars]
            full_data$link[idx] <- as.character(min((1:N)[!(1:N %in% full_data$link)]))
            V_true[[full_data$link[idx]]] <- full_data[idx, key_vars]
          }
          Z[idx,] <- Z_x_prop
          Z[idy,] <- Z_y_prop
        }
      }
    }
    beta_chain[s,] <- betas
    linkage_chain[s,] <- full_data$link
    # recompute counts of each value among the latent true field values
    V_df <- do.call(rbind, V_true)
    for (var in key_vars) {
      V_counts[[var]][["count"]] <- colSums(outer(V_df[[var]], V_counts[[var]][[var]], `==`))
    }
  }
  return(list(theta = theta_chain,
              alpha = alpha_chain,
              beta = beta_chain,
              lambda = linkage_chain))
}
This post was adapted from a project for STATS 201B at UCLA that I worked on with my colleague Kyle McEvoy. Thanks to Kyle for his help!