Replies: 8 comments
-
This can happen.

Are you confusing probability with odds? We are talking about odds here. Probability is bounded between 0 and 100%; odds are totally unbounded. Have you read the glossary?

Consider your level rates for postal code: it looks like 100% of pairs fall into the EXACT level, which leads to an odds of (100% / 100%) = 1. In other words, if you see a pair matching this level, it gives you no useful info as to whether the pair is a match. It looks like 0 pairs fall into the MAJOR and ELSE levels. Again, this should lead to an odds of (0% / 0%) = 1, but it looks like it is ending up with an odds of ~800. This could be a bug on mismo's part. Can you tell me the exact number of pairs that fall into each level?

Ideally, what I am looking for in these match charts is, first, at least 100 pairs in each bin, to avoid small-sample-size issues. Second, an "inverse relationship" as you go between levels. Your address-lines jaccard comparison has this property: ELSE is common among non-matches and rare among true matches, while the other levels are common among true matches and rare among non-matches. This leads to odds that actually give you evidence towards a match/non-match.

So, I think you need to adjust your level criteria to better "spread out" the record pairs between the different levels.
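The arithmetic behind those per-level odds can be sketched like this. This is a toy illustration of the m/u ratio with made-up numbers, not mismo's actual implementation:

```python
# Toy sketch (hypothetical numbers, not from this dataset) of how a level's
# odds are formed in the Fellegi-Sunter model: the fraction of true matches
# that land in the level, divided by the fraction of non-matches.

def level_odds(m: float, u: float) -> float:
    """P(level | match) / P(level | non-match)."""
    if u == 0:
        # An empty level gives no evidence; treat 0/0 as odds of 1.
        return 1.0 if m == 0 else float("inf")
    return m / u

# 100% of pairs in EXACT for both classes -> odds 1, i.e. no information:
assert level_odds(1.0, 1.0) == 1.0
# Common among matches, rare among non-matches -> evidence toward a match:
assert level_odds(0.5, 0.125) == 4.0
```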
Yes it is; that probably needs to get tweaked in the altair definitions. Would love a PR if you are willing! I probably will not get to it anytime soon.
-
That makes sense. Just to double-check my understanding: a comparer where, for each level, the proportion of matches is about the same as the proportion of non-matches is quite useless, right?
I am aware of the difference, but since the chart is centered between 0.01 and 100, I was just wondering whether the resulting odds should mostly be contained in that interval.
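For reference, the probability-odds relationship looks like this (a quick sketch, independent of mismo; the 0.01-100 span on the chart is just a display window, not a bound):

```python
# Probability lives in [0, 1]; odds = p / (1 - p) live in [0, infinity).

def prob_to_odds(p: float) -> float:
    return p / (1.0 - p)

def odds_to_prob(o: float) -> float:
    return o / (1.0 + o)

assert prob_to_odds(0.5) == 1.0                  # 50% probability = even odds
assert odds_to_prob(10_000_000_000) > 0.999999   # huge odds ~ near certainty
```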
It seems that both levels contain no pairs. Here is my code:

```python
# `compared` contains all address pairs that I am taking into account

# Pairs in MAJOR level
c = compared.filter([
    _.postal_code_l != _.postal_code_r,
    _.postal_code_l.substr(0, 2) == _.postal_code_r.substr(0, 2),
])
c.count()  # returns 0

# Pairs in ELSE level
c = compared.filter([
    _.postal_code_l != _.postal_code_r,
    _.postal_code_l.substr(0, 2) != _.postal_code_r.substr(0, 2),
])
c.count()  # returns 0
```

And here is some more information available in the chart:
Noted!
I'll have a look at altair. Thank you!
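As an aside, the level bucketing in the snippet above can be sketched in plain Python (hypothetical postal codes, not from the real dataset; the real code runs on ibis tables):

```python
# Bucket each pair of postal codes into one of the three levels and count.
from collections import Counter

def postal_level(left: str, right: str) -> str:
    if left == right:
        return "EXACT"
    if left[:2] == right[:2]:
        return "MAJOR"  # same first two digits
    return "ELSE"

pairs = [("10115", "10115"), ("10115", "10243"), ("10115", "80331")]
counts = Counter(postal_level(l, r) for l, r in pairs)
assert counts == {"EXACT": 1, "MAJOR": 1, "ELSE": 1}
```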
-
Exactly! However, it could SOMETIMES be useful, in an indirect way. Consider if in ~90% of records the name field is NULL. Pairs are equally likely to be NULL amongst matches and non-matches, so if you see a NULL in a pair, it doesn't give you any evidence of match or non-match. Consider level configuration A:
vs configuration B:
Yes, it looks like a very tiny number of pairs falls into each category, and so the calculated odds of 599.99... is probably not very accurate. As you understand now, aim for 100+ per bin.

EDIT: this actually looks like a large change, it would require changing the

Your screenshot reveals that we don't currently show the number of pairs. That would be good to do. I will try to fix this, or if you want to submit a PR for that, it would also be appreciated!
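To make the NULL point concrete, here is a small sketch with toy numbers (not from the thread's dataset): a level that matches and non-matches hit at the same rate contributes odds of exactly 1, while a rare-among-non-matches level carries the evidence.

```python
# P(level | match) and P(level | non-match) for a mostly-NULL name field.
m = {"NULL": 0.90, "EXACT": 0.08, "ELSE": 0.02}    # among true matches
u = {"NULL": 0.90, "EXACT": 0.001, "ELSE": 0.099}  # among non-matches

odds = {level: m[level] / u[level] for level in m}
assert odds["NULL"] == 1.0         # NULL carries no evidence either way
assert round(odds["EXACT"]) == 80  # but an exact name hit is strong evidence
```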
-
This thread is super useful for figuring out pain points, thank you for your effort here. Ideally I would like this info to be exposed
-
"Just" realized that the presence of empty comparer levels is largely (but not only) due to the fact that some of them look at the same type of information that I also use in the blocking step. E.g. in the blocking phase I am using the rule `mismo.block.KeyBlocker("libpostal_fingerprint", name="Fingerprint")`, which is responsible for the generation of nearly half of all comparisons (see the histogram in the first post). Then, in the comparison phase, I use a comparer whose levels are AT_LEAST_ONE or ELSE, depending on whether or not two addresses in a given pair share at least one fingerprint among those built by libpostal.

@NickCrews: not sure if this observation was implicit (or explicit) in your suggestions. If not, could you confirm the above?
-
Also, it is probably better to move this "issue" to the Discussions section, as it makes little sense to close it at some point, and it will likely be easier for people coming to mismo to find it (do you agree @NickCrews?).
-
Yes, it is important to think about how the blocking rules and the comparisons interact. If you are not careful, the blocking rule will filter out a lot of the possible match levels! The solution varies. Sometimes you need to make your blocking rules less strict, because your comparison is never seeing pairs that you actually want to consider. Or, you may want to adjust your match levels: you can remove some of the redundant levels, and possibly split the other levels apart to be more fine-grained.
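A toy sketch of this interaction (names are illustrative, not mismo's API): if the only blocking rule requires a shared fingerprint, every surviving pair lands in AT_LEAST_ONE, and the comparer's ELSE level is empty by construction.

```python
records = [
    {"id": 1, "fingerprints": {"a", "b"}},
    {"id": 2, "fingerprints": {"b"}},
    {"id": 3, "fingerprints": {"c"}},
]

# Blocking: keep only pairs sharing at least one fingerprint.
blocked = [
    (l, r)
    for i, l in enumerate(records)
    for r in records[i + 1:]
    if l["fingerprints"] & r["fingerprints"]
]

# Comparison level computed on the SAME signal the blocker already used:
levels = [
    "AT_LEAST_ONE" if l["fingerprints"] & r["fingerprints"] else "ELSE"
    for l, r in blocked
]
assert levels == ["AT_LEAST_ONE"]  # ELSE can never occur after this blocker
```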
-
Yes, I will migrate to a discussion when I'm at a computer. But this thread is still a great reference, and inspiration for possible warnings/guardrails I/we should put in place. This topic is full of easy mistakes to make! I want to remove the sharp edges.
-
I am playing with mismo to deduplicate postal addresses in a set of about 10k entries. After the expectation-maximization step, the odds of half of the record pairs are equal to `10_000_000_000`, hence choosing the threshold to distinguish true matches from false ones is quite hard.

As I cannot share the underlying dataset, I will try to completely avoid sharing my (messy) code and rather describe what I am doing. Hopefully it will be enough to get some hints from you.
Deduplication Steps
Each record has the following fields:
- `record_id`
- `recipients`: a string with the name of the recipient
- `recipients_metaphone`: the result of `double_metaphone()` on the field `recipients`
- `recipients_tokens`: a sequence of tokens obtained by splitting the `recipients` field on white space
- `address_lines`: a string
- `address_lines_tokens`: a sequence of tokens obtained by splitting the `address_lines` field on white space and discarding terms that appear in more than 5% of the dataset (using `mismo.arrays.array_filter_isin_other()` and `mismo.sets.rare_terms()`)
- `full_address`: the whole address (not including the recipient's name)
- `libpostal_address`: the address parsed by pypostal using `mismo.lib.geo.postal_parse_address()`
- `libpostal_fingerprint`: the list of fingerprints returned by pypostal using `mismo.lib.geo.postal_fingerprint_address()`
Blocking using the following rules:
The corresponding upset chart is as follows
- on the `postal_code` field with levels: EXACT, MAJOR (same first two digits), ELSE
- on `recipients_tokens` with the `jaccard` function and levels: JACCARD_50 (>= 0.5), JACCARD_25 (>= 0.25), JACCARD_10 (>= 0.1), JACCARD_02 (>= 0.02), ELSE
- on `recipients_metaphone` with the `jaccard` function and levels: EXACT (jaccard >= 0.50), MAYBE (jaccard >= 0.2), ELSE
- on `address_lines_tokens` with the `jaccard` function and levels: JACCARD_50 (>= 0.5), JACCARD_25 (>= 0.25), JACCARD_10 (>= 0.1), JACCARD_02 (>= 0.02), ELSE
- on `libpostal_fingerprint` with levels: AT_LEAST_ONE, ELSE

The weights after running `mismo.fs.train_using_em(comparers, t, t, max_pairs=1_000_000)` are as follows:

I am not an expert on the Fellegi-Sunter model, but I suspect that there shouldn't be levels where both proportions of pairs are high (e.g. the EXACT level for Postal Code and the ELSE level in 'Recipients Metaphone').
As you can see, the pairs in the left half of the chart all have odds equal to `10_000_000_000`. Also, the smallest value is `1819`, which is way above the "expected" (?) range between `0.01` and `100`.

Am I doing something obviously wrong?
P.S.: it seems that darker cells in the match levels chart correspond to the highest match levels. Isn't that a bit counterintuitive?
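Coming back to the thresholding difficulty above, here is a sketch of why the reported numbers leave so little room for a threshold (assuming the chart's odds can be read as posterior odds):

```python
# Convert odds to a match probability: p = o / (1 + o).
def odds_to_prob(odds: float) -> float:
    return odds / (1.0 + odds)

# 10_000_000_000 is effectively probability 1, so half of the pairs are
# indistinguishable from one another at the top of the scale:
assert odds_to_prob(10_000_000_000) > 0.999_999_999
# Even the smallest reported value, 1819, is already above 99.9%:
assert odds_to_prob(1819) > 0.999
```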