Whitespace Tokenizer incorrectly removes vowel signs for text in Hindi #5998
Labels
area:rasa-oss 🎡
Anything related to the open source Rasa framework
type:bug 🐛
Inconsistencies or issues which will cause an issue or problem for users or implementors.
Rasa version: master
Python version: 3.6.5
Operating system (windows, osx, ...): osx
Issue:
WhitespaceTokenizer.tokenize tries to remove punctuation and symbols using a regex. This regex also strips all Devanagari vowel signs, which is wrong.
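A minimal sketch of the likely mechanism, assuming the cleanup relies on `\w` from Python's built-in `re` module. The pattern below is a simplified, hypothetical stand-in and not the actual regex in Rasa. In `re`, `\w` matches letters, digits, and the underscore, but not combining marks, so Devanagari vowel signs and the virama count as "symbols" and get replaced:

```python
import re
import unicodedata

text = "डेरी मिल्क"

# Hypothetical cleanup pattern (NOT Rasa's actual regex): replace every
# character that is neither a word character nor whitespace with a space.
naive_tokens = re.sub(r"[^\w\s]", " ", text).split()

# A plain whitespace split keeps each word intact.
expected_tokens = text.split()

# The characters that were dropped are combining marks (category Mn or Mc).
dropped_marks = [
    (f"U+{ord(ch):04X}", unicodedata.category(ch))
    for ch in text
    if unicodedata.category(ch).startswith("M")
]

print(naive_tokens)     # the words fall apart into bare consonants
print(expected_tokens)  # two intact words
print(dropped_marks)    # code points and categories of the removed marks
```

Each word breaks into bare consonants because every vowel sign is replaced by a space before the split.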
Example input:
50 क्या आपके पास डेरी मिल्क 10 वाले बॉक्स मिल सकते है
Ideally, the whitespace tokenizer should give the following list of tokens:
Instead, the tokens returned are:
This also leads to misaligned entity annotations as described here.
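One possible direction for a fix, sketched under assumptions: decide what to strip by Unicode general category rather than by `\w`, so that combining marks are kept along with letters and digits. The helper name `strip_symbols_keep_marks` is hypothetical, and the sketch deliberately ignores the special cases a real tokenizer would need (URLs, e-mail addresses, separators inside numbers):

```python
import unicodedata


def strip_symbols_keep_marks(text: str) -> str:
    """Replace punctuation and symbols with spaces; keep letters, marks, digits."""
    kept = []
    for ch in text:
        major = unicodedata.category(ch)[0]  # first letter of the category
        # L = letter, M = combining mark (vowel signs, virama), N = number
        kept.append(ch if major in "LMN" or ch.isspace() else " ")
    return "".join(kept)


text = "50 क्या आपके पास डेरी मिल्क 10 वाले बॉक्स मिल सकते है"

# Append a question mark to show that punctuation is still removed.
tokens = strip_symbols_keep_marks(text + "?").split()

print(tokens)
```

With this approach the tokens for the example input are the same as a plain whitespace split, so character offsets computed from the tokens should line up with the original words.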