Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

in_target_language can count words twice and return ratios above 1.0 #112

Closed
osma opened this issue Aug 11, 2023 · 5 comments · Fixed by #114
Closed

in_target_language can count words twice and return ratios above 1.0 #112

osma opened this issue Aug 11, 2023 · 5 comments · Fixed by #114
Labels
bug Something isn't working
Milestone

Comments

@osma
Copy link
Contributor

osma commented Aug 11, 2023

I noticed that in current main version, the in_target_language function can return ratios above 1.0 when given more than one language and the words in the input are matching several languages.

Example:

>>> in_target_language('It was a true gift', lang='en')
1.0
>>> in_target_language('It was a true gift', lang='de')
0.6666666666666666
>>> in_target_language('It was a true gift', lang=('en','de'))
1.6666666666666665

Simplemma 0.9.1 doesn't have this problem:

>>> in_target_language('It was a true gift', lang='en')
1.0
>>> in_target_language('It was a true gift', lang='de')
0.6666666666666666
>>> in_target_language('It was a true gift', lang=('en','de'))
1.0

It's not just a question of capping the scores at 1.0. The problem is that a single word can count more than once if it happens to match multiple languages. Below is an example that demonstrates the problem. I added nonsense words that don't match any language, but their presence is compensated by words that are counted twice.

Current main version:

>>> in_target_language('It was a true gift xxx yyy', lang=('en','de'))
1.0

Simplemma 0.9.1:

>>> in_target_language('It was a true gift xxx yyy', lang=('en','de'))
0.6
@adbar adbar added the bug Something isn't working label Aug 11, 2023
@adbar
Copy link
Owner

adbar commented Aug 11, 2023

Good catch! That's indeed not the idea here and it's a regression.

@adbar adbar added this to the v1.0 milestone Aug 11, 2023
@adbar
Copy link
Owner

adbar commented Aug 11, 2023

This probably has to do with how proportion_in_target_languages() computes the sum:

return sum(
    percentage
    for (
        lang_code,
        percentage,
    ) in self.proportion_in_each_language(text).items()
    if lang_code != "unk"
)

There is also a performance regression, here is the old code, notice the break which is absent from the new code which makes it less efficient:

for l in LANG_DATA:
    candidate = _return_lemma(token, l.dict, greedy=True, lang=l.code)
    if candidate is not None:
        in_target += 1
        break

edit: removed erroneous information

@adbar
Copy link
Owner

adbar commented Sep 5, 2023

@juanjoDiaz Could you please have a look at this issue?

@adbar
Copy link
Owner

adbar commented Sep 15, 2023

The code should now work, do you confirm @osma ?

@osma
Copy link
Contributor Author

osma commented Sep 15, 2023

I can confirm that it works.

There seems to be an extra line in the test; I commented on that here: https://github.com/adbar/simplemma/pull/114/files#r1327158689

Also, there could be an unit test for this case:

>>> in_target_language('It was a true gift xxx yyy', lang=('en','de'))
0.6

Although the current (i.e. just fixed) implementation doesn't have this problem, a naive fix (capping the score at 1) could have fixed the original case without fixing this one. But maybe that is too speculative.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants