This R package implements a Bayesian model for duplicate detection proposed by Sadinle (2014). It resembles the classic Fellegi-Sunter model (Fellegi and Sunter 1969) in that coreference decisions are based on pairwise attribute-level comparisons. However, unlike the Fellegi-Sunter model, this model targets a partition of the records rather than pairwise coreference predictions, thereby ensuring transitivity. Inference is conducted using a Gibbs sampler, with parts implemented in Rcpp.
BDD is not currently available on CRAN. The latest development version can be installed from source using devtools as follows:
library(devtools)
install_github("cleanzr/BDD")
We demonstrate how to perform duplicate detection using the RLdata500 test data set included with the package. RLdata500 contains 500 synthetic personal records, 50 of which are duplicates with randomly-generated errors. In the code block below, we load the package and examine the first few rows of RLdata500.
library(magrittr) # pipe operator (must be loaded before BDD)
library(comparator) # normalized Levenshtein distance
library(clevr) # evaluation functions
library(BDD)
RLdata500[['rec_id']] <- seq.int(nrow(RLdata500))
head(RLdata500)
#> fname_c1 fname_c2 lname_c1 lname_c2 by bm bd rec_id
#> 1 CARSTEN <NA> MEIER <NA> 1949 7 22 1
#> 2 GERD <NA> BAUER <NA> 1968 7 27 2
#> 3 ROBERT <NA> HARTMANN <NA> 1930 4 30 3
#> 4 STEFAN <NA> WOLFF <NA> 1957 9 2 4
#> 5 RALF <NA> KRUEGER <NA> 1966 1 13 5
#> 6 JUERGEN <NA> FRANKE <NA> 1929 7 4 6
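The true entity identifiers are also available as identity.RLdata500 (we use them again for evaluation at the end), so we can verify the amount of duplication up front.
# identity.RLdata500 maps each record to its true entity; records that share
# an identifier are duplicates of one another
sum(duplicated(identity.RLdata500)) # number of duplicated records (50 here)
table(table(identity.RLdata500))    # distribution of true cluster sizes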
The model uses agreement levels between attributes to assess whether a pair of records is a match (i.e., refers to the same entity) or not. To compute the agreement levels, we first compute real-valued scores between the attributes, then map the scores to discrete levels of agreement. In the code block below, we specify scoring functions for the five attributes we will use for matching (we don’t use fname_c2 and lname_c2, as more than 90% of their values are missing). Note that the scoring functions must accept a pair of attribute vectors x and y as arguments and return a vector of scores.
scoring_fns <- list(
  fname_c1 = Levenshtein(normalize = TRUE),
  lname_c1 = Levenshtein(normalize = TRUE),
  by = function(x, y) abs(x - y),
  bm = function(x, y) abs(x - y),
  bd = function(x, y) abs(x - y)
)
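As a quick sanity check, each scoring function can be applied directly to a pair of attribute vectors, since the comparator measures behave as functions of x and y; a normalized Levenshtein score of 0 indicates an exact match, while values near 1 indicate strong disagreement. For example:
# Apply the first-name scoring function to a few illustrative name pairs;
# expect a small score for MEIER/MEYER and a larger score for GERD/CARSTEN
lev <- Levenshtein(normalize = TRUE)
lev(c("MEIER", "GERD"), c("MEYER", "CARSTEN"))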
For each scoring function above, we provide a breaks vector which specifies the discrete levels of agreement (from ‘high’ agreement to ‘low’).
scoring_breaks <- list(
  fname_c1 = c(-Inf, .05, .2, .4, Inf),
  lname_c1 = c(-Inf, .05, .2, .4, Inf),
  by = c(-Inf, 0, 1, 3, Inf),
  bm = c(-Inf, 0, 1, 3, Inf),
  bd = c(-Inf, 0, 2, 7, Inf)
)
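For intuition, each breaks vector partitions the score range into numbered agreement levels, with level 1 the strongest agreement and level 4 the weakest. The actual binning is performed by discretize_scores below; base R's cut is used here purely for illustration, assuming the same right-closed intervals.
# Example: map some normalized Levenshtein scores to agreement levels 1-4
# using the fname_c1 breaks (0 -> level 1, 0.1 -> level 2, 0.3 -> level 3, 0.6 -> level 4)
cut(c(0, 0.1, 0.3, 0.6), breaks = scoring_breaks$fname_c1, labels = FALSE)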
Now we are ready to compute the agreement levels for the record pairs. Since this is a small data set, we consider all pairs using the pairs_all function. For larger data sets, blocking/indexing is recommended using pairs_hamming, pairs_fuzzyblock or a custom indexing function.
pairs <- pairs_all(RLdata500$rec_id) %>%
  compute_scores(RLdata500, scoring_fns, id_col = 'rec_id') %>%
  discretize_scores(scoring_breaks)
To speed up inference, we only consider a subset of the pairs as candidate matches. Specifically, we consider pairs that have a strong agreement on name (accounting for missing names).
pairs[['candidate']] <- (pairs$fname_c1 < 4) & (pairs$lname_c1 < 4) |
  is.na(pairs$fname_c1) | is.na(pairs$lname_c1)
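It is worth checking how aggressively this filter prunes the comparison space, since only the candidate pairs are treated as potential matches:
# Proportion and number of record pairs retained as candidates
mean(pairs$candidate)
sum(pairs$candidate)  # out of choose(500, 2) = 124750 pairs in total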
Next we specify the priors on the m* and u* probabilities for each attribute and agreement level. lambda specifies the lower truncation points for the truncated Beta priors on the m* probabilities. By default, a uniform prior is used over the truncated interval; a non-uniform prior can be specified using the alpha1 and beta1 arguments to BDD below. The Beta priors on the u* probabilities are also uniform by default and can be adjusted using the alpha0 and beta0 arguments.
lambda <- list(
  fname_c1 = c(0.8, 0.85, 0.99),
  lname_c1 = c(0.8, 0.85, 0.99),
  by = c(0.8, 0.85, 0.99),
  bm = c(0.8, 0.85, 0.99),
  bd = c(0.8, 0.85, 0.99)
)
Finally, we initialize the model and run inference using Markov chain Monte Carlo.
model <- BDD(pairs, lambda, id_cols = c("rec_id.x", "rec_id.y"),
             candidate_col = "candidate")
fit <- run_inference(model, 100, thin_interval = 10, burnin_interval = 100)
#> Completed sampling in 1.464495 secs
The posterior samples of the linkage structure can be accessed by calling extract(fit, "links"). However, these samples only cover the records that were considered as candidate pairs. We can obtain samples of the complete linkage structure (for all records) using the following function.
links_samples <- complete_links_samples(fit, RLdata500$rec_id)
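Each row of links_samples is one posterior sample of the membership vector over all 500 records (the layout assumed by the evaluation code below), so we can, for example, inspect the posterior distribution of the number of distinct entities.
dim(links_samples)  # posterior samples x records
# Posterior number of distinct entities among the 500 records
n_entities <- apply(links_samples, 1, function(links) length(unique(links)))
summary(n_entities)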
Since RLdata500 comes with ground truth, we can evaluate the predicted linkage structure using supervised metrics. We find that the model performs well on this data set, achieving high pairwise precision and recall.
n_records <- nrow(RLdata500)
true_pairs <- membership_to_pairs(identity.RLdata500)
metrics <- apply(links_samples, 1, function(links) {
  pred_pairs <- membership_to_pairs(links)
  pr <- precision_pairs(true_pairs, pred_pairs)
  re <- recall_pairs(true_pairs, pred_pairs)
  c(Precision = pr, Recall = re)
}) %>% t()
summary(metrics)
#> Precision Recall
#> Min. :0.8772 Min. :0.9600
#> 1st Qu.:0.9259 1st Qu.:1.0000
#> Median :0.9434 Median :1.0000
#> Mean :0.9386 Mean :0.9968
#> 3rd Qu.:0.9615 3rd Qu.:1.0000
#> Max. :0.9804 Max. :1.0000
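Since precision and recall are available for every posterior sample, a pairwise F1 score (their harmonic mean) can be summarized in the same way if a single figure of merit is preferred.
# Pairwise F1 score (harmonic mean of precision and recall) per posterior sample
f1 <- 2 * metrics[, "Precision"] * metrics[, "Recall"] /
  (metrics[, "Precision"] + metrics[, "Recall"])
summary(f1)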
This package is licensed under GPL-3.
Fellegi, Ivan P., and Alan B. Sunter. 1969. “A Theory for Record Linkage.” Journal of the American Statistical Association 64 (328): 1183–1210. https://doi.org/10.1080/01621459.1969.10501049.
Sadinle, Mauricio. 2014. “Detecting Duplicates in a Homicide Registry Using a Bayesian Partitioning Approach.” The Annals of Applied Statistics 8 (4): 2404–34. https://doi.org/10.1214/14-AOAS779.