Skip to content
/ BDD Public

Duplicate detection in R using a Bayesian partitioning approach

License

Notifications You must be signed in to change notification settings

cleanzr/BDD

Repository files navigation

BDD: Duplicate detection using a Bayesian partitioning approach

This R package implements a Bayesian model for duplicate detection as proposed in (Sadinle 2014). It resembles the classic Fellegi-Sunter model (Fellegi and Sunter 1969) in that coreference decisions are based on pairwise attribute-level comparisons. However, unlike the Fellegi-Sunter mdoel, this model targets a partition of the records, rather than pairwise coreference predictions, thereby ensuring transitivity. Inference is conducted using a Gibbs sampler, with parts implemented in Rcpp.

Installation

BDD is not currently available on CRAN. The latest development version can be installed from source using devtools as follows:

library(devtools)
install_github("cleanzr/BDD")

Example

We demonstrate how to perform duplicate detection using the RLdata500 test data set included with the package. RLdata500 contains 500 synthetic personal records, 50 of which are duplicates with randomly-generated errors. In the code block, below we load the package and examine the first few rows of RLdata500.

library(magrittr)    # pipe operator (must be loaded before BDD)
library(comparator)  # normalized Levenshtein distance
library(clevr)       # evaluation functions
library(BDD)

RLdata500[['rec_id']] <- seq.int(nrow(RLdata500))
head(RLdata500)
#>   fname_c1 fname_c2 lname_c1 lname_c2   by bm bd rec_id
#> 1  CARSTEN     <NA>    MEIER     <NA> 1949  7 22      1
#> 2     GERD     <NA>    BAUER     <NA> 1968  7 27      2
#> 3   ROBERT     <NA> HARTMANN     <NA> 1930  4 30      3
#> 4   STEFAN     <NA>    WOLFF     <NA> 1957  9  2      4
#> 5     RALF     <NA>  KRUEGER     <NA> 1966  1 13      5
#> 6  JUERGEN     <NA>   FRANKE     <NA> 1929  7  4      6

The model uses agreement levels between attributes in order to assess whether a pair of records is a match (referring to the same entity) or not. To compute the agreement levels, we’ll first compute real-valued scores between the attributes, then maps the scores to discrete levels of agreement. In the code block below, we specify scoring functions for the five attributes we will use for matching (we don’t use fname_c2 and lname_c2 as more than 90% of the values are missing). Note that the scoring functions must accept a pair of attribute vectors x and y as arguments and return a vector of scores.

scoring_fns <- list(
  fname_c1 = Levenshtein(normalize = TRUE),
  lname_c1 = Levenshtein(normalize = TRUE),
  by = function(x, y) abs(x - y),
  bm = function(x, y) abs(x - y),
  bd = function(x, y) abs(x - y)
)

For each scoring function above, we provide a breaks vector which specifies the discrete levels of agreement (from ‘high’ agreement to ‘low’).

scoring_breaks <- list(
  fname_c1 = c(-Inf, .05, .2, .4, Inf),
  lname_c1 = c(-Inf, .05, .2, .4, Inf),
  by = c(-Inf, 0, 1, 3, Inf),
  bm = c(-Inf, 0, 1, 3, Inf),
  bd = c(-Inf, 0, 2, 7, Inf)
)

Now we are ready to compute the agreement levels for the record pairs. Since this is a small data set, we consider all pairs using the pairs_all function. For larger data sets, blocking/indexing is recommended using pairs_hamming, pairs_fuzzyblock or a custom indexing function.

pairs <- pairs_all(RLdata500$rec_id) %>%
  compute_scores(RLdata500, scoring_fns, id_col = 'rec_id') %>% 
  discretize_scores(scoring_breaks)

To speed up inference, we only consider a subset of the pairs as candidate matches. Specifically, we consider pairs that have a strong agreement on name (accounting for missing names).

pairs[['candidate']] <- (pairs$fname_c1 < 4) & (pairs$lname_c1 < 4) | 
                            is.na(pairs$fname_c1) | is.na(pairs$lname_c1) 

Next we specify the priors on the m* and u* probabilities for each attribute and agreement level. lambda specifies the lower truncation points for the truncated Beta priors on the m* probabilities. By default, a uniform prior is used over the truncated interval; a non-uniform prior can be specified using the alpha1 and beta1 arguments to BDD below. The Beta priors on the u* probabilities are also uniform by default and can be adjusted using the alpha0 and beta0 arguments.

lambda <- list(
  fname_c1 = c(0.8, 0.85, 0.99),
  lname_c1 = c(0.8, 0.85, 0.99),
  by = c(0.8, 0.85, 0.99),
  bm = c(0.8, 0.85, 0.99),
  bd = c(0.8, 0.85, 0.99)
)

Finally, we initialize the model and run inference using Markov chain Monte Carlo.

model <- BDD(pairs, lambda, id_cols = c("rec_id.x", "rec_id.y"), 
             candidate_col = "candidate")
fit <- run_inference(model, 100, thin_interval = 10, burnin_interval = 100)
#> Completed sampling in 1.464495 secs

The posterior samples of the linkage structure can be accessed by calling extract(fit, "links"). However, these samples only cover the records that were considered as candidate pairs. We can obtain samples of the complete linkage structure (for all records) using the following function.

links_samples <- complete_links_samples(fit, RLdata500$rec_id)

Since RLdata500 comes with ground truth, we can evaluate the predicted linkage structure using supervised metrics. We find that the model performs well on this data set, achieving high pairwise precision and recall.

n_records <- nrow(RLdata500)
true_pairs <- membership_to_pairs(identity.RLdata500)
metrics <- apply(links_samples, 1, function(links) {
  pred_pairs <- membership_to_pairs(links)
  pr <- precision_pairs(true_pairs, pred_pairs)
  re <- recall_pairs(true_pairs, pred_pairs)
  c(Precision = pr, Recall = re)
}) %>% t()
summary(metrics)
#>    Precision          Recall      
#>  Min.   :0.8772   Min.   :0.9600  
#>  1st Qu.:0.9259   1st Qu.:1.0000  
#>  Median :0.9434   Median :1.0000  
#>  Mean   :0.9386   Mean   :0.9968  
#>  3rd Qu.:0.9615   3rd Qu.:1.0000  
#>  Max.   :0.9804   Max.   :1.0000

Licence

GPL-3

References

Fellegi, Ivan P., and Alan B. Sunter. 1969. “A Theory for Record Linkage.” Journal of the American Statistical Association 64 (328): 1183–1210. https://doi.org/10.1080/01621459.1969.10501049.

Sadinle, Mauricio. 2014. “Detecting Duplicates in a Homicide Registry Using a Bayesian Partitioning Approach.” Ann. Appl. Stat. 8 (4): 2404–34. https://doi.org/10.1214/14-AOAS779.