in_target_language can count words twice and return ratios above 1.0 #112

osma · 2023-08-11T07:32:21Z

I noticed that in the current main version, the in_target_language function can return ratios above 1.0 when it is given more than one language and words in the input match several of those languages.

Example:

>>> in_target_language('It was a true gift', lang='en')
1.0
>>> in_target_language('It was a true gift', lang='de')
0.6666666666666666
>>> in_target_language('It was a true gift', lang=('en','de'))
1.6666666666666665

Simplemma 0.9.1 doesn't have this problem:

>>> in_target_language('It was a true gift', lang='en')
1.0
>>> in_target_language('It was a true gift', lang='de')
0.6666666666666666
>>> in_target_language('It was a true gift', lang=('en','de'))
1.0

It's not just a question of capping the scores at 1.0. The problem is that a single word can count more than once if it happens to match multiple languages. Below is an example that demonstrates the problem. I added nonsense words that don't match any language, but their presence is compensated by words that are counted twice.

Current main version:

>>> in_target_language('It was a true gift xxx yyy', lang=('en','de'))
1.0

Simplemma 0.9.1:

>>> in_target_language('It was a true gift xxx yyy', lang=('en','de'))
0.6
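
The arithmetic behind these numbers can be reproduced with a toy model. This is a minimal sketch, not simplemma's implementation: the vocabularies and the three-token list are hypothetical, chosen only so that the ratios match the ones reported above (it assumes that only "was", "true" and "gift" are scored, which is an inference from the reported values). It contrasts summing per-language ratios with counting each token at most once.

```python
# Hypothetical toy vocabularies; NOT simplemma's real dictionaries.
TOY_VOCAB = {
    "en": {"was", "true", "gift"},
    "de": {"was", "gift"},
}


def summed_proportions(tokens, langs):
    """Add up per-language hit ratios; a shared word counts once per language."""
    return sum(
        sum(tok in TOY_VOCAB[lang] for tok in tokens) / len(tokens)
        for lang in langs
    )


def union_proportion(tokens, langs):
    """Count each token at most once, however many languages it matches."""
    hits = sum(any(tok in TOY_VOCAB[lang] for lang in langs) for tok in tokens)
    return hits / len(tokens)


tokens = ["was", "true", "gift"]
padded = tokens + ["xxx", "yyy"]  # nonsense words that match no language

print(summed_proportions(tokens, ("en", "de")))  # above 1.0: shared words counted twice
print(union_proportion(tokens, ("en", "de")))
print(summed_proportions(padded, ("en", "de")))  # double counting masks the nonsense words
print(union_proportion(padded, ("en", "de")))
```

In this toy model the summed score for the padded input comes out at about 1.0 while the union score is 0.6, which mirrors the difference between main and 0.9.1 shown above.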

adbar · 2023-08-11T11:07:53Z

Good catch! That's indeed not the idea here and it's a regression.

adbar · 2023-08-11T11:44:03Z

This probably has to do with how proportion_in_target_languages() computes the sum:

return sum(
    percentage
    for (
        lang_code,
        percentage,
    ) in self.proportion_in_each_language(text).items()
    if lang_code != "unk"
)

There is also a performance regression. Here is the old code; note the break, which is absent from the new code and whose absence makes the new code less efficient:

for l in LANG_DATA:
    candidate = _return_lemma(token, l.dict, greedy=True, lang=l.code)
    if candidate is not None:
        in_target += 1
        break
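
To illustrate why the break matters for both correctness and speed, here is a self-contained sketch of the same loop shape. It assumes plain Python sets in place of the language dictionaries and a simple membership test in place of _return_lemma; count_in_target and its lookup counter are hypothetical names added for illustration only.

```python
def count_in_target(tokens, vocabularies):
    """Return (matched_tokens, lookups) using an early-exit scan per token.

    `vocabularies` is a list of sets standing in for per-language dictionaries.
    """
    in_target = 0
    lookups = 0
    for token in tokens:
        for vocab in vocabularies:
            lookups += 1
            if token in vocab:
                in_target += 1
                break  # first match wins: the token is counted once, later languages are skipped
    return in_target, lookups


# Toy data (hypothetical): the first set plays the role of "en", the second "de".
vocabularies = [{"was", "true", "gift"}, {"was", "gift"}]
matched, lookups = count_in_target(["was", "true", "gift", "xxx"], vocabularies)
print(matched, lookups)
```

With the break, a token that matches the first language costs a single lookup and can never be counted twice; only unmatched tokens are checked against every language.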

edit: removed erroneous information

adbar · 2023-09-05T11:08:23Z

@juanjoDiaz Could you please have a look at this issue?

adbar · 2023-09-15T11:10:33Z

The code should now work. Can you confirm, @osma?

osma · 2023-09-15T11:37:20Z

I can confirm that it works.

There seems to be an extra line in the test; I commented on that here: https://github.com/adbar/simplemma/pull/114/files#r1327158689

Also, there could be a unit test for this case:

>>> in_target_language('It was a true gift xxx yyy', lang=('en','de'))
0.6

Although the current (i.e. just fixed) implementation doesn't have this problem, a naive fix (capping the score at 1) could have fixed the original case without fixing this one. But maybe that is too speculative.
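
The concern about a naive fix can be made concrete with a toy model. This is a sketch under stated assumptions, not simplemma code: the vocabularies and token lists are hypothetical, and capped_sum is an illustrative name for the "sum, then clamp to 1.0" approach.

```python
# Hypothetical toy vocabularies; NOT simplemma's real dictionaries.
TOY_VOCAB = {
    "en": {"was", "true", "gift"},
    "de": {"was", "gift"},
}


def capped_sum(tokens, langs):
    """Naive fix: sum the per-language ratios, then clamp the total to 1.0."""
    total = sum(
        sum(tok in TOY_VOCAB[lang] for tok in tokens) / len(tokens)
        for lang in langs
    )
    return min(1.0, total)


tokens = ["was", "true", "gift"]

# Original report: the clamp hides the overshoot, so this case looks fixed.
print(capped_sum(tokens, ("en", "de")))

# Padded input: the result is still far above the expected 0.6, because the
# double-counted words compensate for the two nonsense tokens.
print(capped_sum(tokens + ["xxx", "yyy"], ("en", "de")))
```

A test that only checks the first input would pass with the clamp in place, while a test on the padded input would fail, which is why the second case is the more useful regression test.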

osma mentioned this issue Aug 11, 2023

Add README section on advanced usage via classes #113

Merged

adbar added the bug label Aug 11, 2023

adbar added this to the v1.0 milestone Aug 11, 2023

juanjoDiaz mentioned this issue Sep 12, 2023

fix: proportion_in_target_languages not considering tokens present in… #114

Merged

adbar closed this as completed in #114 Sep 15, 2023

osma mentioned this issue Sep 16, 2024

in_target_language can return ratios above 1.0 (again) #149

Closed
