Symbols unknown to error model should also be accepted 'as is' #29

Open
snomos opened this issue Dec 8, 2016 · 0 comments

snomos commented Dec 8, 2016

Presently, input symbols unknown to the error model are always transformed into a symbol from the error model (or just removed). While this is often fine, for a certain group of input strings it makes the speller underperform heavily. Those strings are of two types: words containing digits, and words containing upper-case letters. Examples:

The input string 40:is should be corrected to 40:s, but the present North Sami speller behaves as follows:

echo 40:is | hfst-ospell -S se.zhfst 
"40:is" is NOT in the lexicon:
Unable to correct "40:is"!

The required correction is very simple, but since digits are not part of the error model, the speller is unable to produce it. Digits could be included in the error model, but that leads to a large number of unwanted suggestions of the form:

400:s
401:s
402:s
403:s
etc.
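
To illustrate where these suggestions come from, here is a minimal Python sketch. It is not hfst-ospell code: the toy alphabet, the lexicon pattern (any number followed by ":s") and the edit-distance limit of 2 are all assumptions made for the example, and the real speller uses weighted transducers rather than brute-force enumeration.

```python
import re

# Assumption: the lexicon accepts any digit string followed by ":s".
LEXICON_PATTERN = re.compile(r"[0-9]+:s")

# Assumption: a toy error model alphabet that includes the digits.
ALPHABET_WITH_DIGITS = "0123456789:is"


def edits1(word, alphabet):
    """Return every string one deletion, substitution or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {h + t[1:] for h, t in splits if t}
    substitutions = {h + c + t[1:] for h, t in splits if t for c in alphabet}
    inserts = {h + c + t for h, t in splits for c in alphabet}
    return deletes | substitutions | inserts


def suggestions(word, alphabet, max_distance=2):
    """Return all lexicon matches within max_distance edits of word."""
    frontier = {word}
    seen = {word}
    for _ in range(max_distance):
        frontier = {e for w in frontier for e in edits1(w, alphabet)}
        seen |= frontier
    return sorted(w for w in seen if LEXICON_PATTERN.fullmatch(w))


if __name__ == "__main__":
    for s in suggestions("40:is", ALPHABET_WITH_DIGITS):
        print(s)
```

In this sketch the wanted 40:s is found, but so is every form reachable by also inserting, replacing or deleting a digit, since the error model is now free to edit the number itself.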

The same goes for words containing upper-case letters, like proper nouns, abbreviations and acronyms. In almost every case one can assume that the upper-case letters are correct, and that the error is in the inflection (or in the case of names, in the non-initial part of the name). For some languages we have included upper-case letters in the error model, but that makes the error model several times bigger, and the speller much slower. It is thus highly desirable not to include upper-case letters in the error model.

There is an easy solution to both of these issues: leave unknown symbols untouched, and just move to the next symbol. Then the error model can do its job on the part of the word containing symbols known to it. This should give the correct output for both cases mentioned above.
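
As a rough illustration of the idea (not of how hfst-ospell is implemented), here is a small Python sketch. The alphabet, the two-word lexicon and the function names are hypothetical, and a plain edit distance of 1 stands in for the real weighted error model. Symbols outside the error model alphabet are copied unchanged, and only known symbols may be deleted or substituted.

```python
# Hypothetical toy data, not taken from the North Sami speller.
ERROR_MODEL_ALPHABET = set("abcdefghijklmnopqrstuvwxyz:")
LEXICON = {"40:s", "NATO:s"}


def edits1_known_only(word, alphabet):
    """Return strings one edit away, leaving unknown symbols untouched."""
    letters = sorted(alphabet)
    out = set()
    for i in range(len(word) + 1):
        head, tail = word[:i], word[i:]
        # Insert a known symbol at any position.
        for c in letters:
            out.add(head + c + tail)
        # Delete or substitute only if the current symbol is known
        # to the error model; unknown symbols are passed through as is.
        if tail and tail[0] in alphabet:
            out.add(head + tail[1:])
            for c in letters:
                if c != tail[0]:
                    out.add(head + c + tail[1:])
    return out


def suggest(word, lexicon=LEXICON, alphabet=ERROR_MODEL_ALPHABET):
    """Return the word itself if correct, else lexicon words one edit away."""
    if word in lexicon:
        return [word]
    candidates = edits1_known_only(word, alphabet)
    return sorted(w for w in candidates if w in lexicon)


if __name__ == "__main__":
    print(suggest("40:is"))
    print(suggest("NATO:is"))
```

With this toy setup both 40:is and NATO:is are corrected by deleting the i, while the digits and upper-case letters, which the error model does not know, are never edited.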

Of course, there are cases where unknown symbols should be replaced, so the present behavior must be kept as well.

I have tried adding an identity transform to the regex generated from the error model alphabet, but that did not help. Thus it looks to me like this feature needs to be hard-coded into the hfst-ospell library code. But given its general nature, that should be ok.


3 participants