Tokenizers don't split words into sub-words #5756

Merged May 5, 2020 (16 commits)

@tabergma tabergma commented Apr 30, 2020

Proposed changes:
Our entity extractors sometimes predicted an entity label for only part of a word. To work around this, we introduced a cleaning method that runs after prediction. However, we should avoid the incorrect prediction in the first place.
To achieve this, tokenizers no longer split words into sub-words. Instead, the feature vector of a word is the mean of the feature vectors of its sub-words.
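
The averaging step described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code from this PR: the function name `mean_pool_sub_words`, its parameters, and the example vectors are all hypothetical, and it assumes the sub-word feature vectors arrive in sentence order together with the number of sub-words each word was split into.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool_sub_words(
    sub_word_features: Sequence[Sequence[float]],
    sub_words_per_word: Sequence[int],
) -> List[Vector]:
    """Collapse sub-word feature vectors into one vector per word.

    sub_word_features: one feature vector per sub-word, in sentence order.
    sub_words_per_word: how many sub-words each word was split into.
    (Hypothetical helper for illustration only.)
    """
    if sum(sub_words_per_word) != len(sub_word_features):
        raise ValueError("sub-word counts do not match the number of feature vectors")

    word_features: List[Vector] = []
    start = 0
    for count in sub_words_per_word:
        if count < 1:
            raise ValueError("every word needs at least one sub-word")
        group = sub_word_features[start : start + count]
        # Element-wise mean over the sub-word vectors that belong to this word.
        word_features.append([sum(dim) / count for dim in zip(*group)])
        start += count
    return word_features


# Made-up example: the first word was split into two sub-words,
# the second word was kept whole.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
word_vectors = mean_pool_sub_words(features, [2, 1])
print(word_vectors)  # [[2.0, 3.0], [5.0, 6.0]]
```

Because every word ends up with exactly one feature vector, an entity extractor working on these features can only assign a label to a whole word, never to a fragment of it.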

fixes #5755
closes https://github.com/RasaHQ/research/issues/83

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

@tabergma tabergma requested a review from dakshvar22 April 30, 2020 08:54
@dakshvar22

@tabergma Do we have any performance numbers with and without this fix?

@tabergma

Yes, I tested it on carbon bot and the results were roughly the same (77.2% vs 77.8% for entities; branch: composite entities). I also verified locally on the smaller example bots that predictions are no longer made on sub-tokens.

@tabergma

Results on Sara (2-fold cross-validation):
master - micro f1: 83.8
fix-tokenization - micro f1: 85.5

@dakshvar22 dakshvar22 left a comment

Tested on carbon bot with BERT as well. Performance for entities improves by 2 points 🚀
Just one comment for an additional test.

tests/utils/test_train_utils.py (review thread resolved)
@tabergma tabergma changed the base branch from 1.10.x to master May 4, 2020 12:38
@tabergma tabergma requested a review from dakshvar22 May 5, 2020 06:49

@dakshvar22 dakshvar22 left a comment

Looks good! 🌟

Successfully merging this pull request may close these issues.

EntityExtractor in v1.10.0 yield wrong entity value for language without spaces