[bugfix][WIP] Fix the bug of Chinese tokenization in 1.10.12 #6755

howl-anderson · 2020-09-23T11:26:34Z

Proposed changes:

When counting string length, treat the special token [UNK] as having length 1, to fix the out-of-vocabulary (OOV) issue.

fix #6754
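A minimal sketch of the idea, not the actual Rasa code: the function name mirrors the `align_tokens` mentioned in the commits below, but the signature, the tuple return type, and the `UNK_TOKEN` constant are assumptions made for illustration. It shows why a placeholder such as [UNK], which is five characters long as a string, has to be counted as one character when mapping sub-word tokens back to offsets in the original text.

```python
from typing import List, Optional, Tuple

UNK_TOKEN = "[UNK]"  # assumed placeholder emitted for out-of-vocabulary characters


def align_tokens(
    sub_tokens: List[str],
    start: int,
    end: int,
    unk_token: Optional[str] = UNK_TOKEN,
) -> List[Tuple[str, int, int]]:
    """Assign (text, start, end) character offsets to sub-word tokens.

    The sub-tokens are assumed to cover the span [start, end) of the
    original text, in order and without gaps.
    """
    # The placeholder replaces a single source character, so count it as 1
    # instead of len("[UNK]") == 5. Passing unk_token=None disables this.
    lengths = [1 if tok == unk_token else len(tok) for tok in sub_tokens]

    aligned = []
    offset = start
    for i, (tok, length) in enumerate(zip(sub_tokens, lengths)):
        is_last = i == len(sub_tokens) - 1
        # The last sub-token always ends at the end of the span;
        # earlier ones are clamped so they never run past it.
        tok_end = end if is_last else min(offset + length, end)
        aligned.append((tok, offset, tok_end))
        offset = tok_end
    return aligned


# A four-character Chinese span whose second character is out of vocabulary.
pieces = ["我", "[UNK]", "北", "京"]

fixed = align_tokens(pieces, 0, 4)
# [("我", 0, 1), ("[UNK]", 1, 2), ("北", 2, 3), ("京", 3, 4)]

naive = align_tokens(pieces, 0, 4, unk_token=None)
# [("我", 0, 1), ("[UNK]", 1, 4), ("北", 4, 4), ("京", 4, 4)]
```

With the raw string length, the placeholder swallows the rest of the span and the following tokens collapse to empty ranges; counting it as one character keeps every token aligned with its source character.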

Status (please check what you already did):

added some tests for the functionality
updated the documentation
updated the changelog (please check changelog for instructions)
reformat files using black (please check Readme for instructions)

sara-tagger · 2020-09-23T12:00:11Z

Thanks for submitting a pull request 🚀 @rctatman will take a look at it as soon as possible ✨

tabergma

Thanks for tackling this issue!

I am not quite sure I fully understand the problem. Added some questions in the comments.

Also we would need to have a changelog entry and some tests. Thanks.

rasa/utils/train_utils.py

rasa/nlu/tokenizers/convert_tokenizer.py

tabergma · 2020-09-29T09:14:02Z

Looks good so far! Can you please add a changelog entry? Thanks.

tabergma

Looks great! 👍 Thanks for tackling all my comments.

Can you please merge the latest version of 1.10.x into this branch as the convert test fixes are merged there? Thanks.

rasa/utils/train_utils.py

add patch

8ea4c46

sara-tagger requested a review from rctatman September 23, 2020 12:00

tmbo requested review from tabergma and removed request for rctatman September 26, 2020 12:22

howl-anderson changed the title ~~[WIP] Bug of Chinese tokenization in 1.10.12~~ [bugfix]Bug of Chinese tokenization in 1.10.12 Sep 27, 2020

howl-anderson changed the title ~~[bugfix]Bug of Chinese tokenization in 1.10.12~~ [bugfix] Bug of Chinese tokenization in 1.10.12 Sep 27, 2020

howl-anderson changed the title ~~[bugfix] Bug of Chinese tokenization in 1.10.12~~ [bugfix] Fix the bug of Chinese tokenization in 1.10.12 Sep 27, 2020

tabergma requested changes Sep 28, 2020

rasa/utils/train_utils.py (outdated)

refactor align_tokens: add unk_token parameters

a5a4d07

tabergma reviewed Sep 29, 2020

rasa/utils/train_utils.py (outdated)

rasa/utils/train_utils.py (outdated)

rasa/nlu/tokenizers/convert_tokenizer.py (outdated)

howl-anderson changed the title ~~[bugfix] Fix the bug of Chinese tokenization in 1.10.12~~ [bugfix][WIP] Fix the bug of Chinese tokenization in 1.10.12 Sep 29, 2020

howl-anderson added 5 commits September 29, 2020 14:57

add Chinese test case for lm_tokenizer

511c8c0

refactor align_tokens

12af029

undo modification on convert_tokenizer

19f6691

Add comments about why we set the length of OOV token to 1

3dcfd4f

Merge branch 'ignore-convert-tests' of https://github.com/RasaHQ/rasa into bugfix/chinese_tokenization_in_rasa_v1

02e81d1

add changelog

46a96f8

tabergma approved these changes Sep 29, 2020

rasa/utils/train_utils.py (outdated)

rasa/utils/train_utils.py (outdated)

howl-anderson added 2 commits September 29, 2020 20:27

fix grammar error in comment

a0d5c9d

Merge branch '1.10.x' of https://github.com/RasaHQ/rasa into bugfix/chinese_tokenization_in_rasa_v1

1f503e1

tabergma merged commit 0dce587 into RasaHQ:1.10.x Sep 30, 2020

howl-anderson mentioned this pull request Oct 7, 2020

Bug of Chinese tokenization in 1.10.12 #6754

Closed
