
Multiple acceptors and error models #25

Open
2 of 9 tasks
snomos opened this issue Nov 3, 2021 · 5 comments

@snomos
Member

snomos commented Nov 3, 2021

Both old ideas and new development suggest a more flexible approach to acceptors and error models. Below is a list of things discussed in the past, plus new ideas inspired by the ongoing machine learning work by @gusmakali on word completion and prediction. Some of the tasks mentioned in #19 are also relevant to this.

Multiple error models

The idea is that all of the above could be present in one and the same speller archive, with some configuration specifying when to apply which model. A very tentative idea: a machine learning error model will either get it right with the top hypothesis or fail completely (as determined by filtering the hypotheses against the lexicon), so use that model as a first step, then fall back to a hand-tuned error model, and when that fails (it could be written to be on the safe side, i.e. not suggest anything outside a certain set of errors), fall back to the default error model.

Exactly how this should work and interact is very much an open question, but divvunspell should provide the machinery so that linguists can experiment with it to reach an optimal setup for a given language and device type.
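
To illustrate, here is a minimal sketch of what such a fallback chain could look like in Rust. The ErrorModel and Lexicon traits and all names are purely illustrative assumptions, not divvunspell's actual API:

// Hypothetical interfaces; divvunspell's real types will differ.
trait ErrorModel {
    // Return correction candidates for `input`, best first.
    fn suggest(&self, input: &str) -> Vec<String>;
}

trait Lexicon {
    fn accepts(&self, word: &str) -> bool;
}

// Try each error model in order (e.g. ML, then hand-tuned, then default);
// return the first batch of suggestions that survives lexicon filtering.
fn suggest_with_fallback(
    input: &str,
    models: &[&dyn ErrorModel],
    lexicon: &dyn Lexicon,
) -> Vec<String> {
    for model in models {
        let accepted: Vec<String> = model
            .suggest(input)
            .into_iter()
            .filter(|s| lexicon.accepts(s))
            .collect();
        if !accepted.is_empty() {
            return accepted;
        }
    }
    Vec::new()
}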

Multiple acceptors

And possibly other variants too.

There are at least two ideas here:

  • we might want to be more careful with what we suggest, and an easy way to do that is to verify suggestions against a more restricted acceptor, e.g. one with no dynamic compounding or derivation (such words would still be accepted, just never suggested). Another way of restricting suggestions is to never suggest anything with a weight higher than a limit X, where X is configurable (this has been discussed several times in the past):
    • never suggest if the weight is higher than a configurable weight X
  • in productive word formation it is easy to overgenerate, e.g. for compounds, but subtracting illegal paths from an fst is hugely inefficient and space-consuming. A much better approach is a rejector fst that contains invalid strings; anything in that fst should always be rejected, in all cases except when the user has explicitly added it to a user dictionary (see the sketch after this list).
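
As a sketch of how these restrictions could combine at suggestion time (all names and the closure-based lookups are illustrative assumptions, not the actual divvunspell API):

// Illustrative only: combine the weight limit, restricted acceptor,
// rejector and user dictionary into one suggestion filter.
struct Suggestion {
    form: String,
    weight: f32,
}

fn filter_suggestions(
    candidates: Vec<Suggestion>,
    restricted_acceptor: &dyn Fn(&str) -> bool, // e.g. no dynamic compounding/derivation
    rejector: &dyn Fn(&str) -> bool,            // fst of known-bad strings
    user_dictionary: &dyn Fn(&str) -> bool,     // user additions override the rejector
    max_weight: f32,                            // the configurable limit X
) -> Vec<Suggestion> {
    candidates
        .into_iter()
        .filter(|s| s.weight <= max_weight)
        .filter(|s| restricted_acceptor(&s.form))
        .filter(|s| !rejector(&s.form) || user_dictionary(&s.form))
        .collect()
}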

As part of this work it will probably be necessary to rework the zhfst archive format, likely by making the bhfst format the standard, including the JSON config file used there.
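
Purely as a strawman for what such a configuration could express (these keys are invented for illustration and are not the actual bhfst JSON schema):

{
  "acceptors": {
    "default": "acceptor.default.hfst",
    "suggestion": "acceptor.restricted.hfst",
    "rejector": "rejector.hfst"
  },
  "errmodels": [
    { "name": "ml", "file": "errmodel.ml.bin" },
    { "name": "hand-tuned", "file": "errmodel.tuned.hfst" },
    { "name": "default", "file": "errmodel.default.hfst" }
  ],
  "strategy": {
    "order": ["ml", "hand-tuned", "default"],
    "max_suggestion_weight": 50.0
  }
}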

@snomos snomos added this to the 1.1 milestone Nov 3, 2021
@snomos snomos changed the title Multiple fst's and error models Multiple acceptors and error models Nov 3, 2021
@flammie
Contributor

flammie commented Nov 5, 2021

I have a few ideas based on experiments with e.g. optimising sizes in memory and on disk, and also some experiments with other spelling models and/or word completion.

In principle there is a quite direct trade-off between space, speed and complexity when keeping FSA components apart versus performing their compositions / lookups at runtime. So on that end, it might be good to just have a generally quite flexible model of parts that get assembled on the fly.

One FSA worth considering is a weighting model. This would allow the acceptors to be unweighted, which should theoretically save at least x bytes (however many a float uses) per state and edge, both in memory and on disk. While the weighting model should assign some weight to all strings, it will probably be less complex than the analyser, or it could be another statistical model altogether.
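
A rough sketch of what splitting acceptance and weighting could look like, with the acceptor and weigher passed in as plain functions (illustrative only, not any real API):

// Illustrative only: an unweighted acceptor for membership,
// plus a separate weighting model for ranking.
fn score_candidates(
    candidates: &[String],
    acceptor: &dyn Fn(&str) -> bool, // unweighted FSA: membership only
    weigher: &dyn Fn(&str) -> f32,   // separate model assigning a weight to any string
) -> Vec<(String, f32)> {
    let mut scored = Vec::new();
    for cand in candidates {
        if acceptor(cand) {
            scored.push((cand.clone(), weigher(cand)));
        }
    }
    // Best (lowest-weight) candidates first, as in a tropical-semiring fst.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored
}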

In prediction or word completion, there is also a place for a kind of morph acceptor model, since we want completions for potentially unfinished word-forms (e.g. compound forms that are bound and parts of complex words).

@gusmakali
Contributor

We now have a working next-word prediction and autocomplete ML model. Do I understand the idea correctly that on top of this ML model, there should be another one performing the spellchecking task?

For example:

Input --> gets checked by the spellchecking ML model --> if the input is OK, the model switches to the ML autocomplete/next-word prediction task. If the input is not correct, the spellchecking model suggests corrections.

@snomos
Member Author

snomos commented Dec 2, 2021

I am not sure I understand all of this, but here is what I think should happen:

  1. Input --> triggers autocomplete --> autocomplete suggestion is checked against speller, if ok, suggest to user
  2. When a suggestion is accepted by user, present the next word suggestions

I am not sure what role the regular spell checker should have beyond verifying suggestions from the ML model. It might be useful to run it against the ML suggestions, but it might just as well be better to simply filter the suggestions (that is necessary in any case). We need to test this and see how it behaves :)
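
For what it's worth, a rough sketch of that flow, with completely made-up completer/predictor/speller interfaces (not any existing API):

// Illustrative interfaces only.
trait Completer {
    fn complete(&self, context: &str, prefix: &str) -> Vec<String>;
}
trait Predictor {
    fn next_words(&self, context: &str) -> Vec<String>;
}
trait Speller {
    fn is_correct(&self, word: &str) -> bool;
}

// Step 1: input triggers autocomplete; only speller-verified completions are shown.
fn completions_to_show(
    completer: &dyn Completer,
    speller: &dyn Speller,
    context: &str,
    prefix: &str,
) -> Vec<String> {
    completer
        .complete(context, prefix)
        .into_iter()
        .filter(|w| speller.is_correct(w))
        .collect()
}

// Step 2: once the user accepts a completion, present next-word predictions.
fn after_acceptance(predictor: &dyn Predictor, context_with_accepted: &str) -> Vec<String> {
    predictor.next_words(context_with_accepted)
}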

@flammie
Contributor

flammie commented Dec 2, 2021

Yeah, I think the simplest first approach is to get a decently large n-best list of suggestions from the ML model and run it through the spell-checker, so that we only suggest completions that are probably understandable for the user.

If it is autocompletion after the user has input some letters of the word, and the ML model is only trained on complete word-forms in context from a corpus, there might be a need to account for the user already misspelling the initial part of the word, since this won't show up in the (gold) corpus. It might also be possible to make a model that maps initial misspellings to correct word-forms, using the corpus of marked-up errors. I'm thinking e.g. the user types something like "...uit norgga ark" and the autocomplete should probably be able to complete 'árktalaš' or so; I have a feeling this is how Gboard and Swype work for bigger languages. Not sure if an ML model trained on raw text can do that, but maybe?

As a comparison, the strictly rule-based or FSA model of completion without context (with context being a possible extension) is probably usually composed of at least the following (see the sketch after the list):

  1. input word (prefix or such)
  2. error models (for misspellings)
  3. completion models
  4. dictionary or analyser
  5. (weights and probabilities)
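
Expressed as a pipeline over made-up lookup functions (illustrative only, not how hfst or divvunspell actually compose these):

// Illustrative composition of the components listed above.
fn complete_word(
    prefix: &str,                              // 1. input word (prefix or such)
    error_model: &dyn Fn(&str) -> Vec<String>, // 2. corrected variants of the prefix
    completer: &dyn Fn(&str) -> Vec<String>,   // 3. possible completions of a prefix
    dictionary: &dyn Fn(&str) -> Option<f32>,  // 4.+5. accepts a full form and returns its weight
) -> Vec<(String, f32)> {
    let mut results = Vec::new();
    for corrected in error_model(prefix) {
        for full_form in completer(&corrected) {
            if let Some(weight) = dictionary(&full_form) {
                results.push((full_form, weight));
            }
        }
    }
    results.sort_by(|a, b| a.1.total_cmp(&b.1)); // best (lowest weight) first
    results
}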

As I understand it, it is just a question of how much of this can be baked into a single ML model. E.g., if the corpus data is correctly spelled and plentiful, the model would capture the dictionary of correctly spelled words, or morph, character or other such text-piece combinations, without needing to query the rule-based dictionary. But yeah, in practice we will see when we test stuff :-)

@snomos
Member Author

snomos commented Dec 2, 2021

When I first made this issue I didn't expect completion and prediction to be available this early, so I believe we now need to change the plans a bit.

Here is what I suggest for the next steps:

  • forget about multiple error models for now, and the same with multiple acceptors
  • use your context-aware word completion models as is (but they need filtering against the speller dictionary)
  • if possible, you could try training a single-word completion model (no context), to use when there is no context info available, like the first word of a sentence, or when the OS does not give us enough info
  • then use the next word prediction

@flammie 's point about misspelled input is very relevant. A variant of his suggestion is to use the current fst error model on any given input, feed the N best corrections to the ML model, and then return the most likely candidate of the lot.
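
As a sketch of that variant, with the error-model lookup and the ML scorer as made-up function parameters (illustrative only):

// Illustrative only: rank the error model's N best corrections with the ML model.
fn best_candidate(
    input: &str,
    context: &str,
    fst_corrections: &dyn Fn(&str) -> Vec<String>, // e.g. beam-limited N-best from the fst error model
    ml_score: &dyn Fn(&str, &str) -> f32,          // ML likelihood of a candidate given the context
) -> Option<String> {
    fst_corrections(input)
        .into_iter()
        .map(|cand| {
            let score = ml_score(context, &cand);
            (cand, score)
        })
        .max_by(|a, b| a.1.total_cmp(&b.1)) // most likely candidate of the lot
        .map(|(cand, _)| cand)
}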

A potential problem with this approach is the raw number of candidates from the error model, but that can be remedied by using the --beam option:

time echo ark | hfst-lookup -q -b 7 tools/spellcheckers/errmodel.default.hfst 
ark	ark	0,000000
ark	aqk	6,000000
ark	aqrk	6,000000
ark	arkq	6,000000
ark	arq	6,000000
ark	arqk	6,000000
ark	qark	6,000000
ark	qrk	6,000000
ark	árk	6,000000


real	0m0.379s
user	0m0.208s
sys	0m0.166s

Most of these are garbage, but the one we want is also there, and will probably produce good completion suggestions from the ML model. At least worth a try 😄
