Replies: 8 comments
-
This can happen.

Are you confusing probability with odds? We are talking about odds here. Probability is bounded between 0 and 100%; odds are totally unbounded. Have you read the glossary?

Consider your level rates for postal code: it looks like 100% of pairs fall into the EXACT level, which leads to an odds of (100% / 100%) = 1. In other words, if you see a pair matching this level, it gives you no useful info as to whether the pair is a match. It looks like 0 pairs fall into the MAJOR and ELSE levels. Again, this should lead to an odds of (0% / 0%) = 1, but it looks like it is ending up with an odds of ~800. This could be a bug on mismo's part. Can you tell me the exact number of pairs that fall into each level?

Ideally, what I am looking for in these match charts is, first, at least 100 pairs in each bin, to avoid small-sample-size issues. Second, an "inverse relationship" as you go between levels. Your address-lines jaccard comparison has this property: ELSE is common among non-matches and rare among true matches, while the other levels are common among true matches and rare among non-matches. This leads to odds that actually give you evidence towards a match/non-match.

So, I think you need to adjust your level criteria to better "spread out" the record pairs between the different levels.
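The arithmetic behind those per-level odds can be sketched like this. This is a toy illustration of the m/u ratio with made-up numbers, not mismo's actual implementation:

```python
# Toy sketch (hypothetical numbers, not from this dataset) of how a level's
# odds are formed in the Fellegi-Sunter model: the fraction of true matches
# that land in the level, divided by the fraction of non-matches.

def level_odds(m: float, u: float) -> float:
    """P(level | match) / P(level | non-match)."""
    if u == 0:
        # An empty level gives no evidence; treat 0/0 as odds of 1.
        return 1.0 if m == 0 else float("inf")
    return m / u

# 100% of pairs in EXACT for both classes -> odds 1, i.e. no information:
assert level_odds(1.0, 1.0) == 1.0
# Common among matches, rare among non-matches -> evidence toward a match:
assert level_odds(0.5, 0.125) == 4.0
```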
Yes it is; that probably needs to get tweaked in the altair definitions. Would love a PR if you are willing! I probably will not get to it anytime soon.
-
That makes sense. Just to double-check my understanding: a comparer where, for each level, the proportion of matches is about the same as the proportion of non-matches is quite useless, right?
I am aware of the difference, but since the chart is centered between 0.01 and 100, I was just wondering whether the resulting odds should mostly be contained in that interval.
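For reference, the probability-odds relationship looks like this (a quick sketch, independent of mismo; the 0.01-100 span on the chart is just a display window, not a bound):

```python
# Probability lives in [0, 1]; odds = p / (1 - p) live in [0, infinity).

def prob_to_odds(p: float) -> float:
    return p / (1.0 - p)

def odds_to_prob(o: float) -> float:
    return o / (1.0 + o)

assert prob_to_odds(0.5) == 1.0                  # 50% probability = even odds
assert odds_to_prob(10_000_000_000) > 0.999999   # huge odds ~ near certainty
```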
It seems that both levels contain no pairs. Here is my code:

```python
# `compared` contains all address pairs that I am taking into account

# Pairs in MAJOR level
c = compared.filter([
    _.postal_code_l != _.postal_code_r,
    _.postal_code_l.substr(0, 2) == _.postal_code_r.substr(0, 2),
])
c.count()  # returns 0

# Pairs in ELSE level
c = compared.filter([
    _.postal_code_l != _.postal_code_r,
    _.postal_code_l.substr(0, 2) != _.postal_code_r.substr(0, 2),
])
c.count()  # returns 0
```

And here is some more information available in the chart:
Noted!
I'll have a look at altair. Thank you!
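As an aside, the level bucketing in the snippet above can be sketched in plain Python (hypothetical postal codes, not from the real dataset; the real code runs on ibis tables):

```python
# Bucket each pair of postal codes into one of the three levels and count.
from collections import Counter

def postal_level(left: str, right: str) -> str:
    if left == right:
        return "EXACT"
    if left[:2] == right[:2]:
        return "MAJOR"  # same first two digits
    return "ELSE"

pairs = [("10115", "10115"), ("10115", "10243"), ("10115", "80331")]
counts = Counter(postal_level(l, r) for l, r in pairs)
assert counts == {"EXACT": 1, "MAJOR": 1, "ELSE": 1}
```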
-
Exactly! However, it could SOMETIMES be useful, in an indirect way. Consider if in ~90% of records the name field is NULL. Pairs are equally likely to be NULL amongst matches and non-matches, so if you see a NULL in a pair, it doesn't give you any evidence of match or non-match. Consider level configuration A:
vs configuration B:
Yes, it looks like a very tiny number of pairs falls into each category, and so the calculated odds of 599.99... is probably not very accurate. As you understand now, aim for 100+ per bin.

EDIT: this actually looks like a large change, it would require changing the

Your screenshot reveals that we don't currently show the number of pairs. That would be good to do. I will try to fix this, or if you want to submit a PR for that, it would also be appreciated!
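To make the NULL point concrete, here is a small sketch with toy numbers (not from the thread's dataset): a level that matches and non-matches hit at the same rate contributes odds of exactly 1, while a rare-among-non-matches level carries the evidence.

```python
# P(level | match) and P(level | non-match) for a mostly-NULL name field.
m = {"NULL": 0.90, "EXACT": 0.08, "ELSE": 0.02}    # among true matches
u = {"NULL": 0.90, "EXACT": 0.001, "ELSE": 0.099}  # among non-matches

odds = {level: m[level] / u[level] for level in m}
assert odds["NULL"] == 1.0         # NULL carries no evidence either way
assert round(odds["EXACT"]) == 80  # but an exact name hit is strong evidence
```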
-
This thread is super useful for figuring out pain points, thank you for your effort here. Ideally I would like this info to be exposed
-
"Just" realized that the presence of empty comparer levels is largely (but not only) due to the fact that some of them look at the same type of information that I also use in the blocking step. E.g. in the blocking phase I am using the rule `mismo.block.KeyBlocker("libpostal_fingerprint", name="Fingerprint")`, which is responsible for the generation of nearly half of all comparisons (see the histogram in the first post). Then, in the comparison phase, I use a comparer whose levels are AT_LEAST_ONE or ELSE, depending on whether or not two addresses in a given pair share at least one fingerprint among those built by libpostal.

@NickCrews: not sure if this observation was implicit (or explicit) in your suggestions. If not, could you confirm the above?
-
Also, it is probably better to move this "issue" to the Discussions section, as it makes little sense to close it at some point, and it will likely be easier for people coming to mismo to find it (do you agree @NickCrews?).
-
Yes, it is important to think about how the blocking rules and the comparisons interact. If you are not careful, the blocking rule will filter out a lot of the possible match levels! The solution varies. Sometimes you need to make your blocking rules less strict, because your comparison is never seeing pairs that you actually want to consider. Or, you may want to adjust your match levels: you can remove some of the redundant levels, and possibly split the other levels apart to be more fine-grained.
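A toy sketch of this interaction (names are illustrative, not mismo's API): if the only blocking rule requires a shared fingerprint, every surviving pair lands in AT_LEAST_ONE, and the comparer's ELSE level is empty by construction.

```python
records = [
    {"id": 1, "fingerprints": {"a", "b"}},
    {"id": 2, "fingerprints": {"b"}},
    {"id": 3, "fingerprints": {"c"}},
]

# Blocking: keep only pairs sharing at least one fingerprint.
blocked = [
    (l, r)
    for i, l in enumerate(records)
    for r in records[i + 1:]
    if l["fingerprints"] & r["fingerprints"]
]

# Comparison level computed on the SAME signal the blocker already used:
levels = [
    "AT_LEAST_ONE" if l["fingerprints"] & r["fingerprints"] else "ELSE"
    for l, r in blocked
]
assert levels == ["AT_LEAST_ONE"]  # ELSE can never occur after this blocker
```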
-
Yes, I will migrate to a discussion when I'm at a computer. But this thread is still a great reference, and inspiration for possible warnings/guardrails I/we should put in place. This topic is full of easy mistakes to make! I want to remove the sharp edges.
-
I am playing with mismo to deduplicate postal addresses in a set of about 10k entries. After the expectation-maximization step, the odds of half of the record pairs are equal to `10_000_000_000`, hence choosing the threshold to distinguish true matches from false ones is quite hard.

As I cannot share the underlying dataset, I will try to completely avoid sharing my (messy) code and rather describe what I am doing. Hopefully it will be enough to get some hints from you.
Deduplication Steps
Each record has the following fields:
- `record_id`
- `recipients`: a string with the name of the recipient
- `recipients_metaphone`: the result of `double_metaphone()` on the field `recipients`
- `recipients_tokens`: a sequence of tokens obtained by splitting the `recipients` field on white space
- `address_lines`: a string
- `address_lines_tokens`: a sequence of tokens obtained by splitting the `address_lines` field on white space and discarding terms that appear in more than 5% of the dataset (using `mismo.arrays.array_filter_isin_other()` and `mismo.sets.rare_terms()`)
- `full_address`: the whole address (not including the recipient's name)
- `libpostal_address`: the address parsed by pypostal using `mismo.lib.geo.postal_parse_address()`
- `libpostal_fingerprint`: the list of fingerprints returned by pypostal using `mismo.lib.geo.postal_fingerprint_address()`
Blocking using the following rules:
The corresponding upset chart is as follows
- on the `postal_code` field with levels: EXACT, MAJOR (same first two digits), ELSE
- on `recipients_tokens` with the `jaccard` function and levels: JACCARD_50 (>= 0.5), JACCARD_25 (>= 0.25), JACCARD_10 (>= 0.1), JACCARD_02 (>= 0.02), ELSE
- on `recipients_metaphone` with the `jaccard` function and levels: EXACT (jaccard >= 0.50), MAYBE (jaccard >= 0.2), ELSE
- on `address_lines_tokens` with the `jaccard` function and levels: JACCARD_50 (>= 0.5), JACCARD_25 (>= 0.25), JACCARD_10 (>= 0.1), JACCARD_02 (>= 0.02), ELSE
- on `libpostal_fingerprint` with levels: AT_LEAST_ONE, ELSE

The weights after running `mismo.fs.train_using_em(comparers, t, t, max_pairs=1_000_000)` are as follows:

I am not an expert on the Fellegi-Sunter model, but I suspect that there shouldn't be levels where both proportions of pairs are high (e.g. the EXACT level for Postal Code and the ELSE level in 'Recipients Metaphone').
As you can see, the pairs in the left half of the chart all have odds equal to `10_000_000_000`. Also, the smallest value is `1819`, which is way above the "expected" (?) range between `0.01` and `100`.

Am I doing something obviously wrong?
P.S.: it seems that darker cells in the match levels chart correspond to the highest match levels. Isn't that a bit counterintuitive?
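Coming back to the thresholding difficulty above, here is a sketch of why the reported numbers leave so little room for a threshold (assuming the chart's odds can be read as posterior odds):

```python
# Convert odds to a match probability: p = o / (1 + o).
def odds_to_prob(odds: float) -> float:
    return odds / (1.0 + odds)

# 10_000_000_000 is effectively probability 1, so half of the pairs are
# indistinguishable from one another at the top of the scale:
assert odds_to_prob(10_000_000_000) > 0.999_999_999
# Even the smallest reported value, 1819, is already above 99.9%:
assert odds_to_prob(1819) > 0.999
```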