
Remove library specific tokenizers #108 #7027

Merged
merged 41 commits into master from iss-res-108 on Nov 10, 2020

Conversation

koernerfelicia
Contributor

@koernerfelicia koernerfelicia commented Oct 15, 2020

Research issue 108
Proposed changes

  • Remove LanguageModelTokenizer:

I set the sub-token information in HFTransformersNLP. This requires that tokenization occurs before the information is set.

We used to handle this by always tokenizing with the WhitespaceTokenizer inside HFTransformersNLP. Now, the Tokenizer must be placed before HFTransformersNLP in the pipeline, but any Tokenizer can be used (see the pipeline sketch below). This may be confusing to the user.

One alternative would be to move the logic of setting the sub-token information into LanguageModelFeaturizer. This requires moving several functions from HFTransformersNLP as well and reloading the model in LanguageModelFeaturizer.

Alternatively, the featurization and tokenization from HFTransformersNLP can be moved into LanguageModelFeaturizer, doing away with HFTransformersNLP entirely. The drawback may be less flexibility, should we want to apply features from HFTransformersNLP differently in the future. I'm not sure how likely that is; maybe @dakshvar22 knows?

  • Remove ConveRTTokenizer: the sub-token information is set in ConveRTFeaturizer.

Both Tokenizers emit a deprecation warning if used and fall back to WhitespaceTokenizer behaviour.
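For illustration, a minimal sketch of what a pipeline would look like under this change (component names follow Rasa's 2.x YAML config format; the model_name/model_weights values are only example assumptions):

```yml
pipeline:
  # Any Tokenizer can now come first; it no longer has to be the
  # removed LanguageModelTokenizer.
  - name: WhitespaceTokenizer
  # HFTransformersNLP sets the sub-token information on the tokens
  # produced above and computes the transformer features.
  - name: HFTransformersNLP
    model_name: "bert"                  # assumption: example value
    model_weights: "bert-base-uncased"  # assumption: example value
  # LanguageModelFeaturizer assigns the computed features to the tokens.
  - name: LanguageModelFeaturizer
```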

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

@CLAassistant

CLAassistant commented Oct 15, 2020

CLA assistant check
All committers have signed the CLA.

@koernerfelicia koernerfelicia changed the title from "draft PR: research issue 108" to "Remove library specific tokenizers #108" on Oct 15, 2020
@dakshvar22
Contributor

@koernerfelicia I started looking into the PR; the way you have implemented your intended logic looks good.
However, I have questions about the solution approach itself 😅 -

  1. You've mentioned this as well, and I agree that having the Tokenizer before HFTransformersNLP is super counter-intuitive. If we do that and decide to do away with LanguageModelTokenizer, then it makes a lot of sense to move the complete logic of HFTransformersNLP inside the LanguageModelFeaturizer and remove HFTransformersNLP.

  2. However, we lose a bit of flexibility if we follow the suggestion in (1). Having HFTransformersNLP also allowed users to build a custom component based on the sub-tokens that HFTransformersNLP set. If that logic is moved directly inside LanguageModelFeaturizer, we should still make that information available inside the Message object somehow. Also, for some languages, like Chinese, whitespace tokenization does not work, and users rely on either MitieTokenizer or the output of LanguageModelTokenizer. Neither is guaranteed to output the same set of tokens for the same input sentence.

I am not sure what the ideal solution is here, but I'll think a bit more about the original problem itself -

certain featurizers can only be used if certain tokenizers are used in the pipeline. Because of that, it is not possible to mix different word embeddings, as just one tokenizer per pipeline is allowed.

What if we get around that problem by relaxing the constraint that certain tokenizers must necessarily be used with certain featurizers?

@tabergma Thoughts on the above?

@tabergma
Contributor

I think we definitely need to get rid of the dependency between LanguageModelTokenizer and WhitespaceTokenizer. For example, it is currently not possible to use the LanguageModel components with a Chinese BERT model, as the sentences are whitespace-tokenized, which does not work for Chinese. We cannot just remove the constraint that certain featurizers rely on certain tokenizers. So, I think moving the functionality completely to the LanguageModelFeaturizer is the best approach. I think it is fine if this component sets the sub-token information as well. We can also try to decouple the functionality as much as possible, so that users can easily build new components with similar logic.

@koernerfelicia
Contributor Author

@tabergma @dakshvar22 To summarize: I'm still not sure how to change the logic so that the Tokenizer can come after HFTransformersNLP, unless we replace HFTransformersNLP with a Featurizer. Even if we figure out some solution to pass the sub-token information to the LanguageModelFeaturizer without attaching it to a token, the featurization inside HFTransformersNLP requires some sort of tokens (e.g. see _get_model_features_for_batch at line 529). I think the only reasonable solutions are:

  1. (currently implemented in this PR) Require the Tokenizer to come before HFTransformersNLP, use the tokens produced by the selected Tokenizer to set the sub-token information, and produce features in HFTransformersNLP. The features are then assigned to tokens in LanguageModelFeaturizer.
  2. (like Daksh said) Move the complete logic of HFTransformersNLP inside the LanguageModelFeaturizer and remove HFTransformersNLP.

(Both options are sketched below.)

Which solution should I select?
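To make the two options concrete, rough pipeline sketches (hedged; component names follow Rasa's 2.x config format, all other configuration omitted for brevity):

```yml
# Option 1 (as currently implemented in this PR):
pipeline:
  - name: WhitespaceTokenizer      # any Tokenizer, but it must come first
  - name: HFTransformersNLP        # sets sub-token info, computes features
  - name: LanguageModelFeaturizer  # assigns the features to the tokens
---
# Option 2 (move everything into the featurizer):
pipeline:
  - name: WhitespaceTokenizer      # any Tokenizer
  - name: LanguageModelFeaturizer  # loads the model, sets the sub-token
                                   # info, and featurizes in one component
```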

@tabergma
Contributor

I would go with option 2.

(Just as a note: we need to keep the old components and deprecate them. We cannot simply remove them.)

@dakshvar22
Contributor

@tabergma @koernerfelicia I agree option 2 seems much better. Just wondering about this -

For example, it is currently not possible to use the LanguageModel components with a Chinese BERT model, as the sentences are whitespace-tokenized, which does not work for Chinese.

So, if we move all the logic inside LanguageModelFeaturizer, then the user can only use MitieTokenizer for tokenization and not the tokenizer from the BERT model, right?

@tabergma
Contributor

Yes. For Chinese, our users would be able to use the MitieTokenizer to tokenize their messages. The LanguageModelFeaturizer can then use the Chinese BERT model to featurize the incoming utterance. It first "tokenizes" the individual tokens to find out how many sub-tokens each token has. This information is then used to calculate the average feature vector per token.
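To make that concrete, a hedged sketch of such a Chinese pipeline under option 2 (the MitieNLP model path and the model_weights value are assumptions; any Chinese BERT checkpoint would play the same role):

```yml
pipeline:
  # MITIE-based tokenization for languages where whitespace
  # tokenization fails, such as Chinese. MitieTokenizer needs the
  # MitieNLP component and a downloaded MITIE model file.
  - name: MitieNLP
    model: "data/total_word_feature_extractor_zh.dat"  # assumption: example path
  - name: MitieTokenizer
  # The featurizer sub-tokenizes each MITIE token with the BERT
  # tokenizer and averages the sub-token vectors per token.
  - name: LanguageModelFeaturizer
    model_name: "bert"                  # assumption: example values
    model_weights: "bert-base-chinese"
```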

@dakshvar22
Contributor

Okay, sounds good. 👍

@koernerfelicia
Contributor Author

@dakshvar22 @tabergma I've moved the logic from HFTransformersNLP into LanguageModelFeaturizer.

One tricky thing -- since we are only deprecating HFTransformersNLP, I assume we want the behaviour of pipelines containing HFTransformersNLP to be preserved. This means that if users have specified a model in HFTransformersNLP, that is the model that needs to be used to featurize the dense-featurizable messages (not the default model in LanguageModelFeaturizer). I have a check in the LanguageModelFeaturizer which outputs a warning at DEBUG level (per message); this may be too noisy, however. I wasn't able to get caplog in tests to show me anything at TRACE level.
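For example, a legacy pipeline like the following (a hedged sketch; the model_weights value is illustrative) should keep featurizing with the model configured on HFTransformersNLP rather than with LanguageModelFeaturizer's default:

```yml
pipeline:
  - name: WhitespaceTokenizer
  # Deprecated, but still present in existing user configs:
  - name: HFTransformersNLP
    model_name: "bert"
    model_weights: "bert-base-multilingual-cased"  # assumption: example value
  # Must use the model specified above, not its own default:
  - name: LanguageModelFeaturizer
```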

What do you think?

@tabergma
Contributor

Just took a brief look and added some first comments.

I think we can move the check so that we just print the warning once per batch. See inline comments.

Please also add a changelog entry.

Review threads:
  • docs/docs/components.mdx (resolved)
  • rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py (3 threads; outdated, resolved)
@dakshvar22
Contributor

@koernerfelicia We should merge this after #7089 is merged into the 2.0.x branch. We can then deprecate ConveRTTokenizer by shifting the mandatory model_url parameter to ConveRTFeaturizer as part of this PR (I'll take that on myself).
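A hedged sketch of what a config would look like once that parameter moves (assuming it keeps the name model_url; the value is a placeholder):

```yml
pipeline:
  - name: WhitespaceTokenizer   # instead of the deprecated ConveRTTokenizer
  - name: ConveRTFeaturizer
    # model_url becomes mandatory here rather than on ConveRTTokenizer:
    model_url: "<path or URL of the ConveRT model>"
```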

@tabergma
Contributor

@koernerfelicia Let me know if this is ready for another review round.

@koernerfelicia
Contributor Author

Hi @howl-anderson, you're right, it's a link to a private repo. Do you have any questions about the issue that I can answer?

@howl-anderson
Contributor

Hi @koernerfelicia, I am working on an issue related to this. So far everything is OK; if I have a question, I will let you know. Thank you!

@github-actions
Contributor

github-actions bot commented Nov 9, 2020

Commit: 15c5047. The full report is available as an artifact.

Dataset: Carbon Bot

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow) (test: 1m20s, train: 3m14s, total: 4m33s) | 0.7903 (0.00) | 0.6260 (0.00) | 0.4967 (0.01) |

@github-actions
Contributor

github-actions bot commented Nov 9, 2020

Hey @koernerfelicia! 👋 To run model regression tests, comment with the /modeltest command and a configuration.

Tips 💡: The model regression tests will be run on push events. You can re-run the tests by re-adding the status:model-regression-tests label or using the Re-run jobs button in the GitHub Actions workflow.

Tips 💡: Whenever you want to change the configuration, edit the comment containing the previous configuration.

You can copy this in your comment and customize:

/modeltest

```yml
##########
## Available datasets
##########
# - "Carbon Bot"
# - "Hermit"
# - "Private 1"
# - "Private 2"
# - "Private 3"
# - "Sara"

##########
## Available configurations
##########
# - "BERT + DIET(bow) + ResponseSelector(bow)"
# - "BERT + DIET(seq) + ResponseSelector(t2t)"
# - "Spacy + DIET(bow) + ResponseSelector(bow)"
# - "Spacy + DIET(seq) + ResponseSelector(t2t)"
# - "Sparse + BERT + DIET(bow) + ResponseSelector(bow)"
# - "Sparse + BERT + DIET(seq) + ResponseSelector(t2t)"
# - "Sparse + DIET(bow) + ResponseSelector(bow)"
# - "Sparse + DIET(seq) + ResponseSelector(t2t)"
# - "Sparse + Spacy + DIET(bow) + ResponseSelector(bow)"
# - "Sparse + Spacy + DIET(seq) + ResponseSelector(t2t)"

## Example configuration
#################### syntax #################
## include:
##   - dataset: ["<dataset_name>"]
##     config: ["<configuration_name>"]
#
## Example:
## include:
##  - dataset: ["Carbon Bot"]
##    config: ["Sparse + DIET(bow) + ResponseSelector(bow)"]
#
## Shortcut:
## You can use the "all" shortcut to include all available configurations or datasets
#
## Example: Use the "Sparse + EmbeddingIntent + ResponseSelector(bow)" configuration
## for all available datasets
## include:
##  - dataset: ["all"]
##    config: ["Sparse + DIET(bow) + ResponseSelector(bow)"]
#
## Example: Use all available configurations for the "Carbon Bot" and "Sara" datasets
## and for the "Hermit" dataset use the "Sparse + DIET + ResponseSelector(T2T)" and
## "BERT + DIET + ResponseSelector(T2T)" configurations:
## include:
##  - dataset: ["Carbon Bot", "Sara"]
##    config: ["all"]
##  - dataset: ["Hermit"]
##    config: ["Sparse + DIET(seq) + ResponseSelector(t2t)", "BERT + DIET(seq) + ResponseSelector(t2t)"]

include:
 - dataset: ["Carbon Bot"]
   config: ["Sparse + DIET(bow) + ResponseSelector(bow)"]

```

@github-actions
Contributor

github-actions bot commented Nov 9, 2020

/modeltest

include:
 - dataset: ["all"]
   config: ["BERT + DIET(bow) + ResponseSelector(bow)", "BERT + DIET(seq) + ResponseSelector(t2t)", "Sparse + BERT + DIET(bow) + ResponseSelector(bow)", "Sparse + BERT + DIET(seq) + ResponseSelector(t2t)"]

@github-actions
Contributor

github-actions bot commented Nov 9, 2020

The model regression tests have started. It might take a while, please be patient.
As soon as results are ready you'll see a new comment with the results.

Used configuration can be found in the comment.

@github-actions
Contributor

Commit: 15c5047. The full report is available as an artifact.

Dataset: Carbon Bot

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow) (test: 1m24s, train: 4m54s, total: 6m18s) | 0.7903 (0.00) | 0.6260 (0.00) | 0.4967 (0.01) |
| BERT + DIET(seq) + ResponseSelector(t2t) (test: 1m23s, train: 3m36s, total: 4m59s) | 0.7883 (0.00) | 0.8199 (0.00) | 0.5648 (0.00) |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow) (test: 1m16s, train: 3m14s, total: 4m30s) | 0.7961 (0.00) | 0.6260 (0.00) | 0.5581 (no data) |

Dataset: Hermit

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow) (test: 3m13s, train: 19m11s, total: 22m24s) | 0.8931 (0.00) | 0.7504 (0.00) | no data |
| BERT + DIET(seq) + ResponseSelector(t2t) (test: 2m37s, train: 12m22s, total: 14m59s) | 0.8894 (0.00) | 0.8011 (0.00) | no data |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow) (test: 3m14s, train: 20m51s, total: 24m4s) | 0.8699 (0.00) | 0.7504 (0.00) | no data |

Dataset: Private 1

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow) (test: 1m42s, train: 3m25s, total: 5m6s) | 0.9106 (0.00) | 0.9612 (0.00) | no data |
| BERT + DIET(seq) + ResponseSelector(t2t) (test: 1m57s, train: 3m17s, total: 5m13s) | 0.9179 (0.00) | 0.9699 (0.00) | no data |

Dataset: Private 2

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow) (test: 1m52s, train: 3m33s, total: 5m25s) | 0.8757 (0.00) | no data | no data |

Dataset: Private 3

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow) (test: 52s, train: 1m0s, total: 1m51s) | 0.9342 (0.00) | no data | no data |
| BERT + DIET(seq) + ResponseSelector(t2t) (test: 55s, train: 48s, total: 1m42s) | 0.8148 (0.00) | no data | no data |

Dataset: Sara

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
| --- | --- | --- | --- |
| BERT + DIET(bow) + ResponseSelector(bow) (test: 2m6s, train: 5m17s, total: 7m22s) | 0.8452 (0.00) | 0.8683 (0.00) | 0.8913 (-0.00) |
| BERT + DIET(seq) + ResponseSelector(t2t) (test: 2m18s, train: 4m1s, total: 6m18s) | 0.8570 (0.00) | 0.8824 (0.00) | 0.8826 (0.00) |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow) (test: 2m11s, train: 6m34s, total: 8m45s) | 0.8658 (0.00) | 0.8683 (0.00) | 0.8804 (-0.01) |

@rasabot rasabot merged commit df7a5b9 into master Nov 10, 2020
@rasabot rasabot deleted the iss-res-108 branch November 10, 2020 12:30