Bug in Chinese tokenization in 1.10.12 #6754
Labels
area:rasa-oss 🎡
Anything related to the open source Rasa framework
type:bug 🐛
Inconsistencies or issues which will cause an issue or problem for users or implementors.
Rasa version: 1.10.12
Python version: 3.7.4
Operating system (windows, osx, ...): ubuntu 20.04
Command or request that led to error: rasa train
Issue:
When a Chinese training example contains an out-of-vocabulary (OOV) character, the tokenizer converts that character to the token string [UNK]. When Rasa then aligns tokens with entities, it counts each token's length and treats [UNK] as a 5-character string. This is incorrect, because the token stands in for a single character of the original text. As a result, Rasa creates tokens whose offsets do not match the original training example.

I will submit a PR to fix it.
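The offset drift described above can be illustrated with a minimal, standalone sketch. It does not use Rasa's code: the function names, the example sentence, and the choice of which character is OOV are all hypothetical, and it assumes each [UNK] replaces exactly one character of the original text.

```python
UNK = "[UNK]"


def naive_offsets(token_strings):
    """Advance by len(token string), which is the behavior the report describes."""
    offsets, pos = [], 0
    for tok in token_strings:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok)
    return offsets


def aligned_offsets(token_strings):
    """Advance by the width of the original text each token covers.

    Assumption for this sketch: every [UNK] covers exactly one character.
    """
    offsets, pos = [], 0
    for tok in token_strings:
        width = 1 if tok == UNK else len(tok)
        offsets.append((pos, pos + width))
        pos += width
    return offsets


# Hypothetical example: pretend the fourth character is out of vocabulary.
text = "我想吃饺子"
tokens = ["我", "想", "吃", UNK, "子"]

naive = naive_offsets(tokens)
aligned = aligned_offsets(tokens)

# With naive counting, the last token ends past the end of the 5-character text.
print(naive[-1][1] > len(text))
# With one-character [UNK] handling, slicing the text recovers each character.
print([text[s:e] for s, e in aligned])
```

With naive counting, an entity annotated on characters 3 to 5 of the original text no longer lines up with any token boundary after the [UNK] token, which is the mismatch reported here.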