divvunspell fails to find suggestions hfst-ospell does #20

nlhowell · 2020-05-12T14:17:25Z

It seems that hfst-ospell does a better job of considering all possible
tokenisations of a word; divvunspell fails to offer some suggestions when there
are multiple tokenisations (due to multichar symbols).

After a lot of work, I derived this explanation by finding a minimal failing
example; I hope this effort helps with fixing the bug!

Script to reproduce (run in an empty directory):

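# Multichar symbols shared by the error model and the acceptor.
echo -e "cat\nca\ncb\ncbt" > multichar
# Error-model correction pairs (input:output).
echo -e "cbt:cat\ncb:ca" > orth
# Identity transducer over any string.
echo '?*' | hfst-regexp2fst -o anystar.hfst
# Error model: one to three corrections, each with arbitrary context around it,
# or no correction at all; written in optimized-lookup weighted format.
hfst-strings2fst -j -m multichar < orth \
| hfst-concatenate anystar.hfst - \
| hfst-concatenate - anystar.hfst \
| hfst-repeat -f 1 -t 3 \
| hfst-disjunct - anystar.hfst \
| hfst-fst2fst -w > errmodel.default.hfst
# Acceptor: the lexicon contains the single word "cat".
echo "cat" | hfst-strings2fst -j -m multichar | hfst-fst2fst -w > acceptor.default.hfst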
cat > index.xml <<-EOF
<?xml version="1.0" encoding="utf-8"?>
<hfstspeller dtdversion="1.0" hfstversion="3">
    <info>
        <title>cat</title>
        <locale>xxx</locale>
        <producer>xxx</producer>
        <description>cat</description>
    </info>
    <acceptor type="general" id="acceptor.default.hfst">
        <title>cat</title>
        <description>cat</description>
    </acceptor>
    <errmodel id="errmodel.default.hfst">
        <title>error</title>
        <description>cat</description>
        <type type="default"/>
        <model>errmodel.default.hfst</model>
    </errmodel>
</hfstspeller>
EOF

zip test.zhfst index.xml errmodel.default.hfst acceptor.default.hfst
echo "cbt" | hfst-ospell -S test.zhfst
echo "cbt" | divvunspell -S -z test.zhfst


ftyers · 2020-05-12T14:29:56Z

Output of the previous commands:

$ bash run
updating: index.xml (deflated 61%)
updating: errmodel.default.hfst (deflated 67%)
updating: acceptor.default.hfst (deflated 39%)
"cbt" is NOT in the lexicon:
Corrections for "cbt":
cat    0.000000

Reading from stdin...
Input: cbt		[INCORRECT]

bbqsrc · 2020-05-18T07:46:22Z

There is no timeline on when this might be resolved. I have tested the behaviour with both a ZHFST and BHFST file and there is no difference (which is good).

nlhowell · 2020-05-18T08:25:36Z

Sorry, I want to clarify: you were able to reproduce the bug, and behavior is
the same between .zhfst and .bhfst files?

Or you were unable to reproduce the bug at all?

If the former, would you be willing to outline where in the code I should look?
I can try to come up with a patch.

bbqsrc · 2020-05-18T09:48:20Z

I can reproduce the bug, but I have absolutely no suggestion as to where the issue might be coming from or how you might go about debugging it, sorry.

flammie · 2024-11-14T01:04:40Z

As far as I can tell, divvunspell operates on characters (or grapheme clusters in the newest versions) and does no multichar tokenisation, at least for the input. I think this is probably the right way to go: surface levels of finite-state morphologies shouldn't contain arbitrary multicharacter sequences, since they cause hard-to-debug bugs more often than they are useful.
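
To make the tokenisation point concrete, here is a small shell sketch that lists every way the input can be split when the multichar symbols from the reproduction script are allowed. It is only an illustration, not code from divvunspell or hfst-ospell; the names `SYMBOLS` and `segment` are made up for this example, and it assumes ASCII input and that every listed symbol is longer than one character.

```shell
#!/bin/sh
# Hypothetical illustration: enumerate all tokenisations of a word given
# a set of multichar symbols. Not taken from divvunspell or hfst-ospell.

# Multichar symbols from the reproduction script (the "multichar" file).
SYMBOLS="cat ca cb cbt"

# segment REST PREFIX
#   REST   = part of the input still to be tokenised
#   PREFIX = tokens chosen so far, separated by spaces
segment() {
  if [ -z "$1" ]; then
    # Whole input consumed: print the tokenisation without the leading space.
    echo "${2# }"
    return
  fi
  # Option 1: consume a single character (always a valid symbol).
  first=${1%"${1#?}"}
  segment "${1#?}" "$2 $first"
  # Option 2: consume any multichar symbol that is a prefix of REST.
  for s in $SYMBOLS; do
    case $1 in
      "$s"*) segment "${1#"$s"}" "$2 $s" ;;
    esac
  done
}

segment cbt
```

For `cbt` this prints three tokenisations: `c b t`, `cb t` and `cbt`. If the `cbt:cat` correction is stored as a single multichar arc, then presumably only the last tokenisation can reach it, which would be consistent with a purely character-based reading of the input finding no suggestion.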

snomos · 2024-11-15T08:24:22Z

I agree with @flammie: multichars should be all and only characters/grapheme clusters, and divvunspell should do no multichar tokenisation. This will make error modelling and debugging much easier. What needs to be ensured is that grapheme clusters are always defined as multichars in the FSTs; I am not sure that is always the case.

For the acceptor, this is already being handled automatically (especially since the tokeniser FSTs have the opposite requirement: no multichars at all on the surface level). So the part to investigate is the error model.

bbqsrc added the bug label May 12, 2020

bbqsrc added the help wanted label May 18, 2020
