divvunspell fails to find suggestions hfst-ospell does #20

Open
nlhowell opened this issue May 12, 2020 · 6 comments

@nlhowell

It seems that hfst-ospell does a better job of considering all possible
tokenisations of a word; divvunspell fails to offer some suggestions when there
are multiple tokenisations (due to multichar symbols).
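
To make "multiple tokenisations" concrete, here is a minimal sketch in plain shell (no HFST tools involved) that enumerates every way to split an input into single characters and the multichar symbols used in the reproduction script. The function name `tokenisations` and the `SYMBOLS` variable are hypothetical helpers for illustration only; they are not part of hfst-ospell or divvunspell.

```shell
# Multichar symbols, copied from the `multichar` file in the reproduction script.
SYMBOLS="cat ca cb cbt"

# Print every way to split $1 into multichar symbols and single characters,
# one tokenisation per line, tokens separated by spaces.
# $2 is the accumulator for the tokens consumed so far (empty on first call).
tokenisations() {
    local rest="$1" prefix="$2" sym
    if [ -z "$rest" ]; then
        printf '%s\n' "${prefix# }"
        return
    fi
    # Option 1: consume one character.
    tokenisations "${rest#?}" "$prefix ${rest%"${rest#?}"}"
    # Option 2: consume any multichar symbol that is a prefix of the rest.
    for sym in $SYMBOLS; do
        case "$rest" in
            "$sym"*) tokenisations "${rest#"$sym"}" "$prefix $sym" ;;
        esac
    done
}

tokenisations cbt
```

For the input `cbt` this prints three tokenisations: `c b t`, `cb t` and `cbt`. As far as I can tell, only the last one lines up with the `cbt:cat` pair in the error model, so a speller that only ever tries the character-by-character split would not reach the `cat` suggestion.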

After a lot of work, I derived this explanation by finding a minimal failing
example; I hope this effort helps with fixing the bug!

Script to reproduce (run in an empty directory):

echo -e "cat\nca\ncb\ncbt" > multichar
echo -e "cbt:cat\ncb:ca" > orth
echo '?*' | hfst-regexp2fst -o anystar.hfst
hfst-strings2fst -j -m multichar < orth \
| hfst-concatenate anystar.hfst - \
| hfst-concatenate - anystar.hfst \
| hfst-repeat -f 1 -t 3 \
| hfst-disjunct - anystar.hfst \
| hfst-fst2fst -w > errmodel.default.hfst
echo "cat" | hfst-strings2fst -j -m multichar | hfst-fst2fst -w > acceptor.default.hfst
cat > index.xml <<-EOF
<?xml version="1.0" encoding="utf-8"?>
<hfstspeller dtdversion="1.0" hfstversion="3">
    <info>
        <title>cat</title>
	<locale>xxx</locale>
	<producer>xxx</producer>
	<description>cat</description>
    </info>
    <acceptor type="general" id="acceptor.default.hfst">
        <title>cat</title>
	<description>cat</description>
    </acceptor>
    <errmodel id="errmodel.default.hfst">
        <title>error</title>
	<description>cat</description>
        <type type="default"/>
        <model>errmodel.default.hfst</model>
    </errmodel>
</hfstspeller>
EOF

zip test.zhfst index.xml errmodel.default.hfst acceptor.default.hfst
echo "cbt" | hfst-ospell -S test.zhfst
echo "cbt" | divvunspell -S -z test.zhfst
@ftyers

ftyers commented May 12, 2020

Output of the previous commands (hfst-ospell first, then divvunspell):

$ bash run
updating: index.xml (deflated 61%)
updating: errmodel.default.hfst (deflated 67%)
updating: acceptor.default.hfst (deflated 39%)
"cbt" is NOT in the lexicon:
Corrections for "cbt":
cat    0.000000

Reading from stdin...
Input: cbt		[INCORRECT]

bbqsrc added the bug label May 12, 2020
@bbqsrc
Member

bbqsrc commented May 18, 2020

There is no timeline on when this might be resolved. I have tested the behaviour with both a ZHFST and BHFST file and there is no difference (which is good).

@nlhowell
Author

Sorry, I want to clarify: you were able to reproduce the bug, and behavior is
the same between .zhfst and .bhfst files?

Or you were unable to reproduce the bug at all?

If the former, would you be willing to outline where in the code I should look?
I can try to come up with a patch.

@bbqsrc
Member

bbqsrc commented May 18, 2020

I can reproduce the bug, but I have absolutely no suggestion as to where the issue might be coming from or how you might go about debugging it, sorry.

@flammie
Contributor

flammie commented Nov 14, 2024

As far as I can tell, divvunspell uses characters (or grapheme clusters in the newest versions) and does no multichar tokenisation, at least for the input. I think this is probably the right way to go: surface levels of finite-state morphologies shouldn't contain arbitrary multicharacter sequences, since they cause hard-to-debug bugs more often than they are useful.

@snomos
Member

snomos commented Nov 15, 2024

I agree with @flammie: multichars should be all and only characters/grapheme clusters, and divvunspell should do no multichar tokenisation. This will make error modelling and debugging much easier. What needs to be ensured is that grapheme clusters are always defined as multichars in the FSTs; I am not sure that is always the case.

For the acceptor, this is already being handled automatically (especially since the tokeniser FSTs have the opposite requirement: no multichars at all on the surface level). So the part to investigate is the error model.
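
As a rough starting point for that investigation, the sketch below flags entries in a symbol list (one symbol per line, like the `multichar` file in the reproduction script) that are longer than one character. The function name `flag_long_symbols` is hypothetical, and the length check is only an approximation: `${#sym}` counts characters (or bytes, depending on shell and locale), not grapheme clusters, so a legitimate cluster made of several code points would be flagged as well.

```shell
# Read symbols on stdin, one per line, and print those longer than one
# character. NOTE: this does not understand grapheme clusters; it is a
# plain length check and will over-report on combining sequences.
flag_long_symbols() {
    local sym
    while IFS= read -r sym; do
        if [ "${#sym}" -gt 1 ]; then
            printf '%s\n' "$sym"
        fi
    done
}

# The symbol list from the reproduction script: every entry is flagged.
printf 'cat\nca\ncb\ncbt\n' | flag_long_symbols
```

With the list from the reproduction script, all four symbols (`cat`, `ca`, `cb`, `cbt`) are reported, which matches the point above: they are arbitrary multicharacter sequences rather than single characters or grapheme clusters.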


5 participants