Bug of Chinese tokenization in 1.10.12 #6754

howl-anderson · 2020-09-23T11:15:23Z

Rasa version: 1.10.12

Python version: 3.7.4

Operating system (windows, osx, ...): ubuntu 20.04

Command or request that led to error: rasa train

Issue:
When a Chinese training example contains an OOV (out-of-vocabulary) character, that character is converted to the token string [UNK]. When Rasa then aligns tokens with entities, it counts each token's length and treats [UNK] as a string of length 5, which is incorrect: the resulting tokens no longer match the original training example.
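To illustrate the mismatch, here is a minimal sketch in plain Python. It is not Rasa's actual code and not necessarily how the fix in the PR works. The function names `naive_offsets` and `aligned_offsets` are hypothetical, and the sketch assumes each [UNK] stands in for exactly one source character, as in character-level tokenization of Chinese text.

```python
from typing import List, Tuple

UNK = "[UNK]"  # placeholder token string emitted for an OOV character


def naive_offsets(tokens: List[str]) -> List[Tuple[int, int]]:
    """Derive (start, end) offsets from len(token), as described in the issue."""
    offsets, pos = [], 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok)
    return offsets


def aligned_offsets(tokens: List[str]) -> List[Tuple[int, int]]:
    """Same walk, but count [UNK] as one source character (assumption)."""
    offsets, pos = [], 0
    for tok in tokens:
        width = 1 if tok == UNK else len(tok)
        offsets.append((pos, pos + width))
        pos += width
    return offsets


# Hypothetical example: the middle character is assumed to be OOV.
text = "去㵘京"
tokens = ["去", UNK, "京"]

print(naive_offsets(tokens))    # [(0, 1), (1, 6), (6, 7)]
print(aligned_offsets(tokens))  # [(0, 1), (1, 2), (2, 3)]
```

With the naive count, the last token ends at offset 7 although the text is only 3 characters long, so any entity span that follows the OOV character can no longer be matched to a token boundary.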

I will submit a PR to fix it.


sara-tagger · 2020-09-23T12:00:11Z

Thanks for raising this issue, @koaning will get back to you about it soon✨

Please also check out the docs and the forum in case your issue was raised there too 🤗

howl-anderson · 2020-10-07T11:49:06Z

Closing the issue, since the bugfix #6755 has been merged.

howl-anderson added the area:rasa-oss 🎡 and type:bug 🐛 labels Sep 23, 2020

howl-anderson mentioned this issue Sep 23, 2020

[bugfix][WIP] Fix the bug of Chinese tokenization in 1.10.12 #6755

Merged

4 tasks

howl-anderson closed this as completed Oct 7, 2020
