in_target_language can count words twice and return ratios above 1.0 #112
Comments
Good catch! That's indeed not the idea here and it's a regression.
This probably has to do with how
There is also a performance regression. Here is the old code; note the break statement, whose absence from the new code makes the new code less efficient:
Edit: removed erroneous information.
@juanjoDiaz Could you please have a look at this issue?
The code should now work. Can you confirm, @osma?
I can confirm that it works. There seems to be an extra line in the test; I commented on that here: https://github.com/adbar/simplemma/pull/114/files#r1327158689. Also, there could be a unit test for this case:
Although the current (i.e. just fixed) implementation doesn't have this problem, a naive fix (capping the score at 1) could have fixed the original case without fixing this one. But maybe that is too speculative.
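To make the point about capping concrete, here is a minimal sketch in plain Python. It does not use simplemma or reproduce its code: the `LEXICONS` dictionaries and the three `ratio_*` functions are hypothetical stand-ins, assumed only for illustration. It shows that a score can land at exactly 1.0 through double counting even when half the tokens are nonsense, so capping alone would not catch it.

```python
# Toy lexicons, purely hypothetical; they stand in for real language data.
LEXICONS = {
    "en": {"the", "hotel", "taxi", "is", "open"},
    "de": {"das", "hotel", "taxi", "ist", "offen"},
}


def ratio_double_counting(tokens, langs):
    """Buggy variant: a token scores once per language it matches."""
    if not tokens:
        return 0.0
    hits = sum(1 for lang in langs for tok in tokens if tok in LEXICONS[lang])
    return hits / len(tokens)


def ratio_capped(tokens, langs):
    """Naive fix: cap the buggy score at 1.0 without changing the counting."""
    return min(1.0, ratio_double_counting(tokens, langs))


def ratio_count_once(tokens, langs):
    """Each token scores at most once, however many languages it matches."""
    if not tokens:
        return 0.0
    # any() stops at the first matching language, similar in spirit to
    # leaving a lookup loop early with break.
    hits = sum(1 for tok in tokens if any(tok in LEXICONS[lang] for lang in langs))
    return hits / len(tokens)


langs = ("en", "de")
shared_only = ["hotel", "taxi"]                     # both words match both languages
with_nonsense = ["hotel", "taxi", "xyzzy", "qwrt"]  # two nonsense tokens added

print(ratio_double_counting(shared_only, langs))    # 2.0
print(ratio_capped(shared_only, langs))             # 1.0
print(ratio_double_counting(with_nonsense, langs))  # 1.0
print(ratio_capped(with_nonsense, langs))           # 1.0
print(ratio_count_once(with_nonsense, langs))       # 0.5
```

In the second input, the cap never triggers because the inflated score is exactly 1.0, yet only half of the tokens are known words. A unit test built on such an input distinguishes a real fix from a capped one.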
I noticed that in the current main version, the in_target_language function can return ratios above 1.0 when given more than one language and the words in the input match several languages. Example:
Simplemma 0.9.1 doesn't have this problem:
It's not just a question of capping the scores at 1.0. The problem is that a single word can count more than once if it happens to match multiple languages. Below is an example that demonstrates the problem. I added nonsense words that don't match any language, but their presence is compensated by words that are counted twice.
Current main version:
Simplemma 0.9.1: