Bug in Chinese tokenization in 1.10.12 #6754

Closed
howl-anderson opened this issue Sep 23, 2020 · 2 comments
Labels
area:rasa-oss 🎡 Anything related to the open source Rasa framework type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors.

Comments

@howl-anderson
Contributor

Rasa version: 1.10.12

Python version: 3.7.4

Operating system (windows, osx, ...): Ubuntu 20.04

Command or request that led to error: rasa train

Issue:
When a Chinese training example contains an out-of-vocabulary (OOV) character, that character is converted to the token string [UNK]. When Rasa then tries to align tokens with entities, it counts each token's length and treats [UNK] as a string of length 5, which is incorrect. As a result, it creates tokens that do not match the actual training example.

I will submit a PR to fix it.
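
For readers who want to see the misalignment concretely, here is a minimal sketch. It is not Rasa's code and not the fix from the PR; the function names `naive_offsets` and `aligned_offsets`, the sample sentence, and the assumption that each [UNK] stands for exactly one original character are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

UNK = "[UNK]"  # placeholder string emitted for an out-of-vocabulary character


def naive_offsets(tokens: List[str]) -> List[Tuple[int, int]]:
    """Derive (start, end) offsets by summing token string lengths.

    This reproduces the problem described above: "[UNK]" counts as 5 characters.
    """
    offsets, pos = [], 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok)
    return offsets


def aligned_offsets(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Derive offsets by walking the original text.

    Assumption (hypothetical): every "[UNK]" replaces exactly one character.
    """
    offsets, pos = [], 0
    for tok in tokens:
        if tok == UNK:
            length = 1
        else:
            found = text.find(tok, pos)
            if found < 0:
                raise ValueError(f"token {tok!r} not found in text")
            pos, length = found, len(tok)
        offsets.append((pos, pos + length))
        pos += length
    return offsets


# Hypothetical example: assume "龘" is not in the tokenizer vocabulary.
text = "我去龘山玩"
tokens = ["我", "去", UNK, "山", "玩"]

print(naive_offsets(tokens))          # the [UNK] token spans (2, 7), past the end of the text
print(aligned_offsets(text, tokens))  # the [UNK] token spans (2, 3), matching "龘"
```

With the naive version, every token after the [UNK] is shifted by 4 characters, so entity spans annotated on the original text no longer line up with the token boundaries.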

@howl-anderson howl-anderson added area:rasa-oss 🎡 Anything related to the open source Rasa framework type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors. labels Sep 23, 2020
@sara-tagger
Collaborator

Thanks for raising this issue, @koaning will get back to you about it soon✨

Please also check out the docs and the forum in case your issue was raised there too 🤗

@howl-anderson
Contributor Author

Closing the issue, since the bugfix #6755 has been merged.
