
Remove library specific tokenizers #108 #7027

Merged
merged 41 commits into master from iss-res-108 on Nov 10, 2020

Conversation

koernerfelicia
Contributor

@koernerfelicia koernerfelicia commented Oct 15, 2020

Research issue 108
Proposed changes

  • Remove LanguageModelTokenizer:

I set the sub-token information in HFTransformersNLP. This requires that tokenization occurs before the information is set.

We used to handle this by always tokenizing with the WhitespaceTokenizer inside HFTransformersNLP. Now, the Tokenizer must be placed before HFTransformersNLP in the pipeline, but any Tokenizer can be used (see the pipeline sketch below). This may be confusing to the user.

One alternative would be to move the logic of setting the sub-token information into LanguageModelFeaturizer. This requires moving several functions from HFTransformersNLP as well and reloading the model in LanguageModelFeaturizer.

Alternatively, the featurization and tokenization from HFTransformersNLP can be moved into LanguageModelFeaturizer, doing away with HFTransformersNLP entirely. The drawback may be less flexibility, should we want to apply features from HFTransformersNLP differently in the future. I'm not sure how likely that is; maybe @dakshvar22 knows?

  • Remove ConveRTTokenizer: the sub-token information is set in ConveRTFeaturizer.

Both Tokenizers emit a deprecation warning if used and fall back to WhitespaceTokenizer behaviour.
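For illustration, a minimal sketch of what a pipeline would look like under this change (component names follow Rasa's 2.x YAML config format; the model_name/model_weights values are only example assumptions):

```yml
pipeline:
  # Any Tokenizer can now come first; it no longer has to be the
  # removed LanguageModelTokenizer.
  - name: WhitespaceTokenizer
  # HFTransformersNLP sets the sub-token information on the tokens
  # produced above and computes the transformer features.
  - name: HFTransformersNLP
    model_name: "bert"                  # assumption: example value
    model_weights: "bert-base-uncased"  # assumption: example value
  # LanguageModelFeaturizer assigns the computed features to the tokens.
  - name: LanguageModelFeaturizer
```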

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

@CLAassistant

CLAassistant commented Oct 15, 2020

CLA assistant check
All committers have signed the CLA.

@koernerfelicia koernerfelicia changed the title from "draft PR: research issue 108" to "Remove library specific tokenizers #108" on Oct 15, 2020
@dakshvar22
Contributor

@koernerfelicia I started looking into the PR; the way you have implemented your intended logic looks good.
However, I have questions about the solution approach itself 😅 -

  1. You've mentioned this as well, and I agree that having the Tokenizer before HFTransformersNLP is super counter-intuitive. If we do that and decide to do away with LanguageModelTokenizer, then it makes a lot of sense to move the complete logic of HFTransformersNLP inside the LanguageModelFeaturizer and remove HFTransformersNLP.

  2. However, we lose a bit of flexibility if we follow the suggestion in (1). Having HFTransformersNLP also allowed users to build a custom component based on the sub-tokens that HFTransformersNLP set. If that logic is moved directly inside LanguageModelFeaturizer, we should still make that information available inside the Message object somehow. Also, for some languages, like Chinese, whitespace tokenization does not work, and users rely on either MitieTokenizer or the output of LanguageModelTokenizer. Neither is guaranteed to output the same set of tokens for the same input sentence.

I am not sure what the ideal solution is here, but I'll think a bit more about the original problem itself -

certain featurizers can only be used if certain tokenizers are used in the pipeline. Because of that, it is not possible to mix different word embeddings, as just one tokenizer per pipeline is allowed.

What if we get around that problem by relaxing the constraint that certain tokenizers must necessarily be used with certain featurizers?

@tabergma Thoughts on the above?

@tabergma
Contributor

I think we definitely need to get rid of the dependency between LanguageModelTokenizer and WhitespaceTokenizer. For example, it is currently not possible to use the LanguageModel components with a Chinese BERT model, as the sentences are whitespace-tokenized, which does not work for Chinese. We cannot just remove the constraint that certain featurizers rely on certain tokenizers. So, I think moving the functionality completely to the LanguageModelFeaturizer is the best approach. I think it is fine if this component sets the sub-token information as well. We can also try to decouple the functionality as much as possible, so that users can easily build new components with similar logic.

@koernerfelicia
Contributor Author

@tabergma @dakshvar22 To summarize: I'm still not sure how to change the logic so that the Tokenizer can come after HFTransformersNLP, unless we replace HFTransformersNLP with a Featurizer. Even if we figure out some solution to pass the sub-token information to the LanguageModelFeaturizer without attaching it to a token, the featurization inside HFTransformersNLP requires some sort of tokens (e.g. see _get_model_features_for_batch at line 529). I think the only reasonable solutions are:

  1. (currently implemented in this PR) Require the Tokenizer to come before HFTransformersNLP, use the tokens produced by the selected Tokenizer to set the sub-token information, and produce features in HFTransformersNLP. The features are then assigned to tokens in LanguageModelFeaturizer.
  2. (like Daksh said) Move the complete logic of HFTransformersNLP inside the LanguageModelFeaturizer and remove HFTransformersNLP.

(Both options are sketched below.)

Which solution should I select?
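To make the two options concrete, rough pipeline sketches (hedged; component names follow Rasa's 2.x config format, all other configuration omitted for brevity):

```yml
# Option 1 (as currently implemented in this PR):
pipeline:
  - name: WhitespaceTokenizer      # any Tokenizer, but it must come first
  - name: HFTransformersNLP        # sets sub-token info, computes features
  - name: LanguageModelFeaturizer  # assigns the features to the tokens
---
# Option 2 (move everything into the featurizer):
pipeline:
  - name: WhitespaceTokenizer      # any Tokenizer
  - name: LanguageModelFeaturizer  # loads the model, sets the sub-token
                                   # info, and featurizes in one component
```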

@tabergma
Contributor

I would go with option 2.

(Just as a note: we need to keep the old components and deprecate them. We cannot simply remove them.)

@dakshvar22
Contributor

@tabergma @koernerfelicia I agree option 2 seems much better. Just wondering about this -

For example, it is currently not possible to use the LanguageModel components with a Chinese BERT model, as the sentences are whitespace-tokenized, which does not work for Chinese.

So, if we move all the logic inside LanguageModelFeaturizer, then the user can only use MitieTokenizer for tokenization and not the tokenizer from the BERT model, right?

@tabergma
Contributor

Yes. For Chinese, our users would be able to use the MitieTokenizer to tokenize their messages. The LanguageModelFeaturizer can then use the Chinese BERT model to featurize the incoming utterance. It first "tokenizes" the individual tokens to find out how many sub-tokens each token has. This information is then used to calculate the average feature vector per token.
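To make that concrete, a hedged sketch of such a Chinese pipeline under option 2 (the MitieNLP model path and the model_weights value are assumptions; any Chinese BERT checkpoint would play the same role):

```yml
pipeline:
  # MITIE-based tokenization for languages where whitespace
  # tokenization fails, such as Chinese. MitieTokenizer needs the
  # MitieNLP component and a downloaded MITIE model file.
  - name: MitieNLP
    model: "data/total_word_feature_extractor_zh.dat"  # assumption: example path
  - name: MitieTokenizer
  # The featurizer sub-tokenizes each MITIE token with the BERT
  # tokenizer and averages the sub-token vectors per token.
  - name: LanguageModelFeaturizer
    model_name: "bert"                  # assumption: example values
    model_weights: "bert-base-chinese"
```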

@dakshvar22
Contributor

Okay, sounds good. 👍

@koernerfelicia
Contributor Author

@dakshvar22 @tabergma I've moved the logic from HFTransformersNLP into LanguageModelFeaturizer.

One tricky thing -- since we are only deprecating HFTransformersNLP, I assume we want the behaviour of pipelines containing HFTransformersNLP to be preserved. This means that if users have specified a model in HFTransformersNLP, that is the model that needs to be used to featurize the dense-featurizable messages (not the default model in LanguageModelFeaturizer). I have a check in the LanguageModelFeaturizer which outputs a warning at DEBUG level (per message); this may be too noisy, however. I wasn't able to get caplog in tests to show me anything at TRACE level.
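For example, a legacy pipeline like the following (a hedged sketch; the model_weights value is illustrative) should keep featurizing with the model configured on HFTransformersNLP rather than with LanguageModelFeaturizer's default:

```yml
pipeline:
  - name: WhitespaceTokenizer
  # Deprecated, but still present in existing user configs:
  - name: HFTransformersNLP
    model_name: "bert"
    model_weights: "bert-base-multilingual-cased"  # assumption: example value
  # Must use the model specified above, not its own default:
  - name: LanguageModelFeaturizer
```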

What do you think?

@tabergma
Contributor

Just took a brief look and added some first comments.

I think we can move the check so that we just print the warning once per batch. See inline comments.

Please also add a changelog entry.

Review threads:
  • docs/docs/components.mdx (resolved)
  • rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py (3 threads; outdated, resolved)
@dakshvar22
Contributor

@koernerfelicia We should merge this after #7089 is merged into the 2.0.x branch. We can then deprecate ConveRTTokenizer by shifting the mandatory model_url parameter to ConveRTFeaturizer as part of this PR (I'll take that on myself).
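A hedged sketch of what a config would look like once that parameter moves (assuming it keeps the name model_url; the value is a placeholder):

```yml
pipeline:
  - name: WhitespaceTokenizer   # instead of the deprecated ConveRTTokenizer
  - name: ConveRTFeaturizer
    # model_url becomes mandatory here rather than on ConveRTTokenizer:
    model_url: "<path or URL of the ConveRT model>"
```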

@tabergma
Contributor

@koernerfelicia Let me know if this is ready for another review round.

@koernerfelicia
Contributor Author

Hi @howl-anderson, you're right, it's a link to a private repo. Do you have any questions about the issue that I can answer?

@howl-anderson
Contributor

Hi @koernerfelicia, I am working on an issue related to this. So far everything is OK; if I have a question, I will let you know. Thank you!

@github-actions
Contributor

github-actions bot commented Nov 9, 2020

Commit: 15c5047. The full report is available as an artifact.

Dataset: Carbon Bot

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow) (test: 1m20s, train: 3m14s, total: 4m33s) | 0.7903 (0.00) | 0.6260 (0.00) | 0.4967 (0.01) |

@github-actions
Contributor

github-actions bot commented Nov 9, 2020

Hey @koernerfelicia! 👋 To run model regression tests, comment with the /modeltest command and a configuration.

Tips 💡: The model regression tests will be run on push events. You can re-run the tests by re-adding the status:model-regression-tests label or using the Re-run jobs button in the GitHub Actions workflow.

Tips 💡: Whenever you want to change the configuration, edit the comment containing the previous configuration.

You can copy this in your comment and customize:

/modeltest

```yml
##########
## Available datasets
##########
# - "Carbon Bot"
# - "Hermit"
# - "Private 1"
# - "Private 2"
# - "Private 3"
# - "Sara"

##########
## Available configurations
##########
# - "BERT + DIET(bow) + ResponseSelector(bow)"
# - "BERT + DIET(seq) + ResponseSelector(t2t)"
# - "Spacy + DIET(bow) + ResponseSelector(bow)"
# - "Spacy + DIET(seq) + ResponseSelector(t2t)"
# - "Sparse + BERT + DIET(bow) + ResponseSelector(bow)"
# - "Sparse + BERT + DIET(seq) + ResponseSelector(t2t)"
# - "Sparse + DIET(bow) + ResponseSelector(bow)"
# - "Sparse + DIET(seq) + ResponseSelector(t2t)"
# - "Sparse + Spacy + DIET(bow) + ResponseSelector(bow)"
# - "Sparse + Spacy + DIET(seq) + ResponseSelector(t2t)"

## Example configuration
#################### syntax #################
## include:
##   - dataset: ["<dataset_name>"]
##     config: ["<configuration_name>"]
#
## Example:
## include:
##  - dataset: ["Carbon Bot"]
##    config: ["Sparse + DIET(bow) + ResponseSelector(bow)"]
#
## Shortcut:
## You can use the "all" shortcut to include all available configurations or datasets
#
## Example: Use the "Sparse + EmbeddingIntent + ResponseSelector(bow)" configuration
## for all available datasets
## include:
##  - dataset: ["all"]
##    config: ["Sparse + DIET(bow) + ResponseSelector(bow)"]
#
## Example: Use all available configurations for the "Carbon Bot" and "Sara" datasets
## and for the "Hermit" dataset use the "Sparse + DIET + ResponseSelector(T2T)" and
## "BERT + DIET + ResponseSelector(T2T)" configurations:
## include:
##  - dataset: ["Carbon Bot", "Sara"]
##    config: ["all"]
##  - dataset: ["Hermit"]
##    config: ["Sparse + DIET(seq) + ResponseSelector(t2t)", "BERT + DIET(seq) + ResponseSelector(t2t)"]

include:
 - dataset: ["Carbon Bot"]
   config: ["Sparse + DIET(bow) + ResponseSelector(bow)"]

```

@github-actions
Contributor

github-actions bot commented Nov 9, 2020

/modeltest

include:
 - dataset: ["all"]
   config: ["BERT + DIET(bow) + ResponseSelector(bow)", "BERT + DIET(seq) + ResponseSelector(t2t)", "Sparse + BERT + DIET(bow) + ResponseSelector(bow)", "Sparse + BERT + DIET(seq) + ResponseSelector(t2t)"]

@github-actions
Contributor

github-actions bot commented Nov 9, 2020

The model regression tests have started. It might take a while, please be patient.
As soon as results are ready you'll see a new comment with the results.

Used configuration can be found in the comment.

@github-actions
Contributor

Commit: 15c5047. The full report is available as an artifact.

Dataset: Carbon Bot

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow) (test: 1m24s, train: 4m54s, total: 6m18s) | 0.7903 (0.00) | 0.6260 (0.00) | 0.4967 (0.01) |
| BERT + DIET(seq) + ResponseSelector(t2t) (test: 1m23s, train: 3m36s, total: 4m59s) | 0.7883 (0.00) | 0.8199 (0.00) | 0.5648 (0.00) |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow) (test: 1m16s, train: 3m14s, total: 4m30s) | 0.7961 (0.00) | 0.6260 (0.00) | 0.5581 (no data) |

Dataset: Hermit

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow) (test: 3m13s, train: 19m11s, total: 22m24s) | 0.8931 (0.00) | 0.7504 (0.00) | no data |
| BERT + DIET(seq) + ResponseSelector(t2t) (test: 2m37s, train: 12m22s, total: 14m59s) | 0.8894 (0.00) | 0.8011 (0.00) | no data |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow) (test: 3m14s, train: 20m51s, total: 24m4s) | 0.8699 (0.00) | 0.7504 (0.00) | no data |

Dataset: Private 1

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow) (test: 1m42s, train: 3m25s, total: 5m6s) | 0.9106 (0.00) | 0.9612 (0.00) | no data |
| BERT + DIET(seq) + ResponseSelector(t2t) (test: 1m57s, train: 3m17s, total: 5m13s) | 0.9179 (0.00) | 0.9699 (0.00) | no data |

Dataset: Private 2

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow) (test: 1m52s, train: 3m33s, total: 5m25s) | 0.8757 (0.00) | no data | no data |

Dataset: Private 3

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow) (test: 52s, train: 1m0s, total: 1m51s) | 0.9342 (0.00) | no data | no data |
| BERT + DIET(seq) + ResponseSelector(t2t) (test: 55s, train: 48s, total: 1m42s) | 0.8148 (0.00) | no data | no data |

Dataset: Sara

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow) (test: 2m6s, train: 5m17s, total: 7m22s) | 0.8452 (0.00) | 0.8683 (0.00) | 0.8913 (-0.00) |
| BERT + DIET(seq) + ResponseSelector(t2t) (test: 2m18s, train: 4m1s, total: 6m18s) | 0.8570 (0.00) | 0.8824 (0.00) | 0.8826 (0.00) |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow) (test: 2m11s, train: 6m34s, total: 8m45s) | 0.8658 (0.00) | 0.8683 (0.00) | 0.8804 (-0.01) |

@rasabot rasabot merged commit df7a5b9 into master Nov 10, 2020
@rasabot rasabot deleted the iss-res-108 branch November 10, 2020 12:30