divvunspell fails to find suggestions hfst-ospell does #20
Comments
Output of the previous commands,
There is no timeline on when this might be resolved. I have tested the behaviour with both a ZHFST and a BHFST file and there is no difference (which is good).
Sorry, I want to clarify: you were able to reproduce the bug, and the behavior is the same with both file types? Or were you unable to reproduce the bug at all? If the former, would you be willing to outline where in the code I should look?
I can reproduce the bug, but I have absolutely no suggestion as to where the issue might be coming from or how you might go about debugging it, sorry.
As far as I can tell, divvunspell uses characters (or grapheme clusters in the newest versions) and does no multichar tokenisation, at least for the input. I think this is probably the right way to go: surface levels of finite-state morphologies shouldn't contain arbitrary multicharacter sequences, since they cause hard-to-debug bugs more often than they are useful.
I agree with @flammie: multichars should be all and only characters/grapheme clusters, and divvunspell should do no multichar tokenisation. This will make error modelling and debugging much easier. What needs to be ensured is that grapheme clusters are always defined as multichars in the FSTs; I am not sure that is always the case. For the acceptor, this is already handled automatically (especially since the tokeniser FSTs have the opposite requirement: no multichars at all on the surface level). So the part to investigate is the error model.
It seems that hfst-ospell does a better job of considering all possible
tokenisations of a word; divvunspell fails to offer some suggestions when there
are multiple tokenisations (due to multichar symbols).
After a lot of work, I derived this explanation by finding a minimal failing
example; I hope this effort helps with fixing the bug!
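
To illustrate what "multiple tokenisations" means here, the sketch below enumerates every way to split a word into symbols when the alphabet contains a multichar symbol. It is purely illustrative and is not taken from divvunspell or hfst-ospell; the function name `tokenisations` and the toy alphabet are assumptions made for this example.

```python
def tokenisations(word, symbols):
    """Yield every split of `word` into a sequence of symbols from `symbols`."""
    if not word:
        # The empty remainder has exactly one tokenisation: no symbols.
        yield []
        return
    for sym in symbols:
        # Skip empty symbols so the recursion always consumes input.
        if sym and word.startswith(sym):
            for rest in tokenisations(word[len(sym):], symbols):
                yield [sym] + rest


# Hypothetical alphabet: two single characters plus one multichar symbol "ab".
alphabet = {"a", "b", "ab"}

# Sort the results, because set iteration order is not deterministic.
for split in sorted(tokenisations("aab", alphabet)):
    print(split)
```

For the input `aab` this toy alphabet gives two tokenisations, one using only single characters and one using the multichar symbol `ab`. A speller that follows only one of them can miss suggestions reachable through the other.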
Script to reproduce (run in an empty directory):