
Multiple acceptors and error models #25

Open
2 of 9 tasks
snomos opened this issue Nov 3, 2021 · 5 comments

@snomos
Member

snomos commented Nov 3, 2021

Both old ideas and new development suggest a more flexible approach to acceptors and error models. Below is a list of things discussed in the past, plus new ideas inspired by the ongoing machine learning work by @gusmakali on word completion and prediction. Some of the tasks mentioned in #19 are also relevant to this.

Multiple error models

The idea is that all of the above could be present in one and the same speller archive, with some configuration specifying when to apply which model. A very tentative idea: a machine learning error model will either get it right with the top hypothesis or fail completely (as determined by filtering the hypotheses against the lexicon), so use that model as a first step, then fall back to a hand-tuned error model, and when that fails (it could be written to be on the safe side, i.e. not suggest anything outside a certain set of errors), fall back to the default error model.

Exactly how this should work and interact is very much an open question, but divvunspell should provide the machinery so that linguists can experiment with it to reach an optimal setup for a given language and device type.
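
To illustrate, here is a minimal sketch of what such a fallback chain could look like in Rust. The ErrorModel and Lexicon traits and all names are purely illustrative assumptions, not divvunspell's actual API:

// Hypothetical interfaces; divvunspell's real types will differ.
trait ErrorModel {
    // Return correction candidates for `input`, best first.
    fn suggest(&self, input: &str) -> Vec<String>;
}

trait Lexicon {
    fn accepts(&self, word: &str) -> bool;
}

// Try each error model in order (e.g. ML, then hand-tuned, then default);
// return the first batch of suggestions that survives lexicon filtering.
fn suggest_with_fallback(
    input: &str,
    models: &[&dyn ErrorModel],
    lexicon: &dyn Lexicon,
) -> Vec<String> {
    for model in models {
        let accepted: Vec<String> = model
            .suggest(input)
            .into_iter()
            .filter(|s| lexicon.accepts(s))
            .collect();
        if !accepted.is_empty() {
            return accepted;
        }
    }
    Vec::new()
}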

Multiple acceptors

And possibly other variants too.

There are at least two ideas here:

  • we might want to be more careful with what we suggest, and an easy way to do that is to verify suggestions against a more restricted acceptor, e.g. one with no dynamic compounding or derivation (such words would still be accepted, just never suggested). Another way of restricting suggestions is to never suggest anything with a weight higher than a limit X, where X is configurable (this has been discussed several times in the past):
    • never suggest if the weight is higher than a configurable weight X
  • in productive word formation it is easy to overgenerate, e.g. for compounds, but subtracting illegal paths from an fst is hugely inefficient and space-consuming. A much better approach is a rejector fst that contains invalid strings; anything in that fst should always be rejected, in all cases except when the user has explicitly added it to a user dictionary (see the sketch after this list).
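
As a sketch of how these restrictions could combine at suggestion time (all names and the closure-based lookups are illustrative assumptions, not the actual divvunspell API):

// Illustrative only: combine the weight limit, restricted acceptor,
// rejector and user dictionary into one suggestion filter.
struct Suggestion {
    form: String,
    weight: f32,
}

fn filter_suggestions(
    candidates: Vec<Suggestion>,
    restricted_acceptor: &dyn Fn(&str) -> bool, // e.g. no dynamic compounding/derivation
    rejector: &dyn Fn(&str) -> bool,            // fst of known-bad strings
    user_dictionary: &dyn Fn(&str) -> bool,     // user additions override the rejector
    max_weight: f32,                            // the configurable limit X
) -> Vec<Suggestion> {
    candidates
        .into_iter()
        .filter(|s| s.weight <= max_weight)
        .filter(|s| restricted_acceptor(&s.form))
        .filter(|s| !rejector(&s.form) || user_dictionary(&s.form))
        .collect()
}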

As part of this work it will probably be necessary to rework the zhfst archive format, likely by making the bhfst format the standard, including the JSON config file used there.
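
Purely as a strawman for what such a configuration could express (these keys are invented for illustration and are not the actual bhfst JSON schema):

{
  "acceptors": {
    "default": "acceptor.default.hfst",
    "suggestion": "acceptor.restricted.hfst",
    "rejector": "rejector.hfst"
  },
  "errmodels": [
    { "name": "ml", "file": "errmodel.ml.bin" },
    { "name": "hand-tuned", "file": "errmodel.tuned.hfst" },
    { "name": "default", "file": "errmodel.default.hfst" }
  ],
  "strategy": {
    "order": ["ml", "hand-tuned", "default"],
    "max_suggestion_weight": 50.0
  }
}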

@snomos snomos added this to the 1.1 milestone Nov 3, 2021
@snomos snomos changed the title Multiple fst's and error models Multiple acceptors and error models Nov 3, 2021
@flammie
Contributor

flammie commented Nov 5, 2021

I have a few ideas based on experiments with e.g. optimising sizes in memory and on disk, and also some experiments with other spelling models and/or word completion.

In principle there is a quite direct trade-off between space, speed and complexity when keeping FSA components apart versus performing their compositions / lookups at runtime. So on that end, it might be good to just have a generally quite flexible model of parts that get assembled on the fly.

One FSA worth considering is a weighting model. This would allow the acceptors to be unweighted, which should theoretically save at least x bytes (however many a float uses) per state and edge, both in memory and on disk. While the weighting model should assign some weight to all strings, it will probably be less complex than the analyser, or it could be another statistical model altogether.
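
A rough sketch of what splitting acceptance and weighting could look like, with the acceptor and weigher passed in as plain functions (illustrative only, not any real API):

// Illustrative only: an unweighted acceptor for membership,
// plus a separate weighting model for ranking.
fn score_candidates(
    candidates: &[String],
    acceptor: &dyn Fn(&str) -> bool, // unweighted FSA: membership only
    weigher: &dyn Fn(&str) -> f32,   // separate model assigning a weight to any string
) -> Vec<(String, f32)> {
    let mut scored = Vec::new();
    for cand in candidates {
        if acceptor(cand) {
            scored.push((cand.clone(), weigher(cand)));
        }
    }
    // Best (lowest-weight) candidates first, as in a tropical-semiring fst.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored
}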

In prediction or word completion, there is also a place for a kind of morph acceptor model, since we want completions for potentially unfinished word-forms (e.g. compound forms that are bound and parts of complex words).

@gusmakali
Contributor

We now have a working next-word prediction and autocomplete ML model. Do I understand the idea correctly that on top of this ML model, there should be another one performing the spellchecking task?

For example:

Input --> gets checked by the spellchecking ML model --> if the input is OK, the model switches to the ML autocomplete/next-word prediction task. If the input is not correct, the spellchecking model suggests corrections.

@snomos
Member Author

snomos commented Dec 2, 2021

I am not sure I understand all of this, but here is what I think should happen:

  1. Input --> triggers autocomplete --> autocomplete suggestion is checked against speller, if ok, suggest to user
  2. When a suggestion is accepted by user, present the next word suggestions

I am not sure what role the regular spell checker should have beyond verifying suggestions from the ML model. It might be useful to run it against the ML suggestions, but it might just as well be better to simply filter the suggestions (that is necessary in any case). We need to test this and see how it behaves :)
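
For what it's worth, a rough sketch of that flow, with completely made-up completer/predictor/speller interfaces (not any existing API):

// Illustrative interfaces only.
trait Completer {
    fn complete(&self, context: &str, prefix: &str) -> Vec<String>;
}
trait Predictor {
    fn next_words(&self, context: &str) -> Vec<String>;
}
trait Speller {
    fn is_correct(&self, word: &str) -> bool;
}

// Step 1: input triggers autocomplete; only speller-verified completions are shown.
fn completions_to_show(
    completer: &dyn Completer,
    speller: &dyn Speller,
    context: &str,
    prefix: &str,
) -> Vec<String> {
    completer
        .complete(context, prefix)
        .into_iter()
        .filter(|w| speller.is_correct(w))
        .collect()
}

// Step 2: once the user accepts a completion, present next-word predictions.
fn after_acceptance(predictor: &dyn Predictor, context_with_accepted: &str) -> Vec<String> {
    predictor.next_words(context_with_accepted)
}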

@flammie
Contributor

flammie commented Dec 2, 2021

Yeah, I think the simplest first approach is to get a decently large n-best list of suggestions from the ML model and run it through the spell-checker, so that we only suggest completions that are probably understandable for the user.

If it is autocompletion after the user has input some letters of the word, and the ML model is only trained on complete word-forms in context from a corpus, there might be a need to account for the user already misspelling the initial part of the word, since this won't show up in the (gold) corpus. It might also be possible to make a model that maps initial misspellings to correct word-forms, using the corpus of marked-up errors. I'm thinking e.g. the user types something like "...uit norgga ark" and the autocomplete should probably be able to complete 'árktalaš' or so; I have a feeling this is how Gboard and Swype work for bigger languages. Not sure if an ML model trained on raw text can do that, but maybe?

As a comparison, the strictly rule-based or FSA model of completion without context (with context being a possible extension) is probably usually composed of at least the following (see the sketch after the list):

  1. input word (prefix or such)
  2. error models (for misspellings)
  3. completion models
  4. dictionary or analyser
  5. (weights and probabilities)
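
Expressed as a pipeline over made-up lookup functions (illustrative only, not how hfst or divvunspell actually compose these):

// Illustrative composition of the components listed above.
fn complete_word(
    prefix: &str,                              // 1. input word (prefix or such)
    error_model: &dyn Fn(&str) -> Vec<String>, // 2. corrected variants of the prefix
    completer: &dyn Fn(&str) -> Vec<String>,   // 3. possible completions of a prefix
    dictionary: &dyn Fn(&str) -> Option<f32>,  // 4.+5. accepts a full form and returns its weight
) -> Vec<(String, f32)> {
    let mut results = Vec::new();
    for corrected in error_model(prefix) {
        for full_form in completer(&corrected) {
            if let Some(weight) = dictionary(&full_form) {
                results.push((full_form, weight));
            }
        }
    }
    results.sort_by(|a, b| a.1.total_cmp(&b.1)); // best (lowest weight) first
    results
}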

As I understand it, it is just a question of how much of this can be baked into a single ML model. E.g., if the corpus data is correctly spelled and plentiful, the model would capture the dictionary of correctly spelled words, or morph, character or other such text-piece combinations, without needing to query the rule-based dictionary. But yeah, in practice we will see when we test stuff :-)

@snomos
Member Author

snomos commented Dec 2, 2021

When I first made this issue I didn't expect completion and prediction to be available this early, so I believe we now need to change the plans a bit.

Here is what I suggest for the next steps:

  • forget about multiple error models for now, and the same with multiple acceptors
  • use your context-aware word completion models as is (but they need filtering against the speller dictionary)
  • if possible, you could try training a single-word completion model (no context), to use when there is no context info available, like the first word of a sentence, or when the OS does not give us enough info
  • then use the next word prediction

@flammie 's point about misspelled input is very relevant. A variant of his suggestion is to use the current fst error model on any given input, feed the N best corrections to the ML model, and then return the most likely candidate of the lot.
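
As a sketch of that variant, with the error-model lookup and the ML scorer as made-up function parameters (illustrative only):

// Illustrative only: rank the error model's N best corrections with the ML model.
fn best_candidate(
    input: &str,
    context: &str,
    fst_corrections: &dyn Fn(&str) -> Vec<String>, // e.g. beam-limited N-best from the fst error model
    ml_score: &dyn Fn(&str, &str) -> f32,          // ML likelihood of a candidate given the context
) -> Option<String> {
    fst_corrections(input)
        .into_iter()
        .map(|cand| {
            let score = ml_score(context, &cand);
            (cand, score)
        })
        .max_by(|a, b| a.1.total_cmp(&b.1)) // most likely candidate of the lot
        .map(|(cand, _)| cand)
}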

A potential problem with this approach is the raw number of candidates from the error model, but that can be remedied by using the --beam option:

time echo ark | hfst-lookup -q -b 7 tools/spellcheckers/errmodel.default.hfst 
ark	ark	0,000000
ark	aqk	6,000000
ark	aqrk	6,000000
ark	arkq	6,000000
ark	arq	6,000000
ark	arqk	6,000000
ark	qark	6,000000
ark	qrk	6,000000
ark	árk	6,000000


real	0m0.379s
user	0m0.208s
sys	0m0.166s

Most of these are garbage, but the one we want is also there, and will probably produce good completion suggestions from the ML model. At least worth a try 😄
