[bugfix][WIP] Fix the bug of Chinese tokenization in 1.10.12 #6755

Merged

Conversation

howl-anderson
Contributor

Proposed changes:

  • When counting string length, treat the special token [UNK] as having length 1, to fix the out-of-vocabulary (OOV) issue

fix #6754
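
For readers without the diff at hand, the idea in the proposed change can be sketched as follows. This is only an illustration, not the code merged in this PR: the function names, the `UNK_TOKEN` constant, and the assumption that each [UNK] stands in for exactly one character of the original text are all hypothetical.

```python
from typing import List, Tuple

# Assumption: the tokenizer emits this literal string for out-of-vocabulary input.
UNK_TOKEN = "[UNK]"


def token_length(token: str) -> int:
    """Return how many characters of the original text a sub-token covers.

    Assumption: one [UNK] replaces exactly one original character, so its
    length is counted as 1 rather than len("[UNK]") == 5.
    """
    return 1 if token == UNK_TOKEN else len(token)


def align_offsets(sub_tokens: List[str], start: int) -> List[Tuple[str, int, int]]:
    """Assign (token, start, end) character offsets to consecutive sub-tokens."""
    aligned = []
    position = start
    for token in sub_tokens:
        end = position + token_length(token)
        aligned.append((token, position, end))
        position = end
    return aligned
```

With a plain `len(token)`, a three-character Chinese input whose middle character is out of vocabulary would be assigned offsets running to 7 instead of 3, which is the kind of misalignment the change is meant to avoid.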

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

@sara-tagger
Collaborator

Thanks for submitting a pull request 🚀 @rctatman will take a look at it as soon as possible ✨

@tmbo tmbo requested review from tabergma and removed request for rctatman September 26, 2020 12:22
@howl-anderson howl-anderson changed the title from "[WIP] Bug of Chinese tokenization in 1.10.12" to "[bugfix]Bug of Chinese tokenization in 1.10.12" on Sep 27, 2020
@howl-anderson howl-anderson changed the title from "[bugfix]Bug of Chinese tokenization in 1.10.12" to "[bugfix] Bug of Chinese tokenization in 1.10.12" on Sep 27, 2020
@howl-anderson howl-anderson changed the title from "[bugfix] Bug of Chinese tokenization in 1.10.12" to "[bugfix] Fix the bug of Chinese tokenization in 1.10.12" on Sep 27, 2020
Contributor

@tabergma tabergma left a comment

Thanks for tackling this issue!

I am not quite sure I fully understand the problem. Added some questions in the comments.

Also we would need to have a changelog entry and some tests. Thanks.

rasa/utils/train_utils.py (3 review threads, outdated and resolved)
rasa/nlu/tokenizers/convert_tokenizer.py (1 review thread, outdated and resolved)
@howl-anderson howl-anderson changed the title from "[bugfix] Fix the bug of Chinese tokenization in 1.10.12" to "[bugfix][WIP] Fix the bug of Chinese tokenization in 1.10.12" on Sep 29, 2020
@tabergma
Contributor

Looks good so far! Can you please add a changelog entry? Thanks.

Contributor

@tabergma tabergma left a comment

Looks great! 👍 Thanks for tackling all my comments.

Can you please merge the latest version of 1.10.x into this branch as the convert test fixes are merged there? Thanks.

rasa/utils/train_utils.py (2 review threads, outdated and resolved)
@tabergma tabergma merged commit 0dce587 into RasaHQ:1.10.x Sep 30, 2020
3 participants