Presently, symbols that are not part of the error model are always transformed into a symbol from the error model (or just removed). While this is often fine, for a certain group of input strings it makes the speller perform much worse. These strings are of two types: words containing digits, and words containing upper-case letters. Examples:
The input string 40:is should be corrected to 40:s, but the present North Sami speller behaves as follows:
echo 40:is | hfst-ospell -S se.zhfst
"40:is" is NOT in the lexicon:
Unable to correct "40:is"!
The change to the input string is very simple, but since digits are not part of the error model, the speller is unable to correct it. It is possible to include digits in the error model, but that leads to a large number of unwanted suggestions of the form:
400:s
401:s
402:s
403:s
etc.
The same goes for words containing upper-case letters, like proper nouns, abbreviations and acronyms. In almost every case one can assume that the upper-case letters are correct, and that the error is in the inflection (or, in the case of names, in the non-initial part of the name). For some languages we have included upper-case letters in the error model, but that makes the error model several times bigger and the speller much slower. It is thus highly desirable not to include upper-case letters in the error model.
There is an easy solution to both of these issues: leave unknown symbols untouched, and just move to the next symbol. Then the error model can do its job on the part of the word containing symbols known to it. This should give the correct output for both cases mentioned above.
Of course, there are cases where unknown symbols should be replaced, so the present behavior must be kept as well.
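To make the proposal concrete, here is a toy sketch in Python. It is not hfst-ospell code and does not use its API: the alphabet `KNOWN`, the functions `candidates` and `suggest`, and the three-word lexicon are all made up for illustration. It only tries edits (deletion, substitution, insertion) that involve symbols known to the error model, and copies every other symbol through unchanged.

```python
import string

# Hypothetical error-model alphabet: lower-case ASCII letters only.
# Digits, ':' and upper-case letters are "unknown" to this toy model.
KNOWN = set(string.ascii_lowercase)


def candidates(word, known=KNOWN):
    """Return all strings one edit away from `word`, never touching unknown symbols."""
    result = set()
    for i, ch in enumerate(word):
        if ch not in known:
            continue  # unknown symbol: leave it untouched and move on
        # Delete the known symbol at position i.
        result.add(word[:i] + word[i + 1:])
        # Substitute it with another known symbol.
        for c in known:
            if c != ch:
                result.add(word[:i] + c + word[i + 1:])
    # Insert a known symbol at any position (unknown symbols are never inserted).
    for i in range(len(word) + 1):
        for c in known:
            result.add(word[:i] + c + word[i:])
    return result


def suggest(word, lexicon, known=KNOWN):
    """Return the candidates that are in the lexicon, sorted alphabetically."""
    return sorted(candidates(word, known) & set(lexicon))


# Toy lexicon, for illustration only.
lexicon = {"40:s", "400:s", "401:s"}
print(suggest("40:is", lexicon))  # ['40:s']
```

Because the digits are never edited, the unwanted suggestions 400:s, 401:s, etc. cannot be produced, and the alphabet of the error model stays small.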
I have tried adding an identity transform to the regex generated from the error model alphabet, but that did not help. Thus it looks to me like this feature needs to be hard-coded into the hfst-ospell library code. But given its general nature, that should be OK.