Symbols unknown to error model should also be accepted 'as is' #29

snomos · 2016-12-08T07:10:28Z

Presently, input symbols that are unknown to the error model are always transformed into a symbol from the error model (or just removed). While this is often fine, for a certain group of input strings it makes the speller underperform heavily. Those strings are of two types: words containing digits, and words containing uppercase letters. Examples:

The input string 40:is should be corrected to 40:s, but the present North Sami speller behaves as follows:

echo 40:is | hfst-ospell -S se.zhfst 
"40:is" is NOT in the lexicon:
Unable to correct "40:is"!

The required correction is very simple (delete a single letter), but since digits are not part of the error model, the speller is unable to produce it. It is possible to include digits in the error model, but that leads to a large number of unwanted suggestions of the form:

400:s
401:s
402:s
403:s
etc.

The same goes for words containing uppercase letters, like proper nouns, abbreviations and acronyms. In almost every case one can assume that the uppercase letters are correct, and that the error is in the inflection (or, in the case of names, in the non-initial part of the name). For some languages we have included uppercase letters in the error model, but that makes the error model several times bigger and the speller much slower. It is thus highly desirable not to include uppercase letters in the error model.

There is an easy solution to both of these issues: leave unknown symbols untouched, and just move to the next symbol. Then the error model can do its job on the part of the word containing symbols known to it. This should give the correct output for both cases mentioned above.
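
To make the proposal concrete, here is a minimal toy sketch in Python. It is not hfst-ospell code and does not use any hfst-ospell API: the names `candidates` and `correct`, the one-edit error model, the lowercase-only alphabet and the tiny lexicon are all assumptions made for illustration. Symbols outside the error model alphabet are copied through unchanged, and edits are only applied to known symbols.

```python
import string

def candidates(word, alphabet):
    """Return all strings within one edit of `word` (hypothetical toy error model).

    Symbols not in `alphabet` are never deleted or substituted; they are
    passed through as is. Only known symbols may be inserted.
    """
    out = {word}
    for i in range(len(word) + 1):
        for c in alphabet:
            out.add(word[:i] + c + word[i:])          # insert a known symbol
        if i < len(word) and word[i] in alphabet:
            out.add(word[:i] + word[i + 1:])          # delete a known symbol
            for c in alphabet:
                out.add(word[:i] + c + word[i + 1:])  # substitute a known symbol
    return out

def correct(word, lexicon, alphabet):
    """Return the sorted candidates that are accepted by the lexicon."""
    return sorted(candidates(word, alphabet) & lexicon)

# Assumed error model alphabet: lowercase ASCII letters only,
# so digits, ':' and uppercase letters are unknown to it.
ALPHABET = set(string.ascii_lowercase)
# Assumed toy lexicon standing in for the real speller lexicon.
LEXICON = {"40:s", "400:s", "401:s", "NATOs"}

print(correct("40:is", LEXICON, ALPHABET))   # ['40:s']
print(correct("NATOis", LEXICON, ALPHABET))  # ['NATOs']
```

In this sketch the digits and uppercase letters stay out of the alphabet, so forms like 400:s or 401:s are never generated as suggestions, while the error in the known part of the word is still corrected.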

Of course, there are cases where unknown symbols should be replaced, so the present behavior must be kept as well.

I have tried adding an identity transform to the regex generated from the error model alphabet, but that did not help. Thus it looks to me like this feature needs to be hard-coded into the hfst-ospell library code. But given its general nature, that should be ok.

snomos added the bug and enhancement labels Dec 8, 2016

snomos assigned Traubert and eaxelson Dec 8, 2016
