Use HF Auto classes in LanguageModelFeaturizer #10624
Conversation
…mask to max sequence length.
Commit: a20e992. The full report is available as an artifact. Dataset:
@TyDunn Since this PR is close to being finished and contains a feature update (specifically, expanding the functionality of the current
Let me know if you have any questions or if we should discuss details in a short meeting.
@mleimeister Since this contains enhancements, it should go into the next minor release (3.1).
Hi @koernerfelicia, the latest commits now contain the changes we discussed. In particular:
Let me know if this looks ok to you 🙂
Some small comments. Also, it occurred to me that we should have someone else review the docs part (I don't know if CSE still does this). I'm not sure whether the concept of weights and models, and how they relate to each other, comes across clearly, since we were not always good about consistent terminology here. Maybe we can have fresh eyes on just that docs section?
I think we can request this in #product-docs, unless we think of someone specific (they should not need deep knowledge of how HF transformers models are organised). Also, have you let DevRel know about this? I think it is worth a forum post to drum up excitement 💃
Can we check some non-Latin alphabets here? It'd be a shame if this works for English but breaks for Hebrew, Korean, Chinese, or Arabic. This is probably best served as a "slow test" that we run once a week or so; running it on every PR would be horrendously slow.
@koaning That would be great. I'm actually looking for similar functionality to regularly check that the tokenizer cleanup works as expected. Do we already have a process in place for running such "slow tests"? I guess it could be a cron job(?). I created this follow-up ticket to discuss with QA how best to do this: #10893
It's not super formal, but we have a cronjob for the base Rasa install that might serve as a source of inspiration. That one runs daily, though; this might be better served as an optional/weekly job.
@koaning Ah yes, I saw that. I expanded the follow-up ticket to also cover testing of non-Latin script models. I'll set up a meeting with QA next week to discuss how best to implement this, perhaps as part of a general strategy for running such "slow tests". Would you be happy to have this handled there, in order not to expand this PR further? Do you have any concerns about the docs being understandable for bot developers?
@mleimeister yeah for sure the topic of "what are good slow tests to run from cron" is a larger topic that's a bit out of scope for us here. |
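For reference, a weekly scheduled job of the kind discussed above could look roughly like the following hypothetical GitHub Actions sketch. The workflow name, job layout, and `make test-slow` target are all assumptions for illustration, not part of this PR or the existing Rasa cronjob:

```yaml
# Hypothetical sketch of a weekly "slow tests" workflow (all names assumed)
name: slow-tests
on:
  schedule:
    - cron: "0 3 * * 1"   # Mondays at 03:00 UTC
  workflow_dispatch: {}    # also allow manual runs
jobs:
  slow-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run slow tests
        run: make test-slow   # assumed target covering non-Latin / tokenizer checks
```

Using `schedule` instead of `pull_request` triggers keeps the expensive checks off the per-PR critical path, which matches the weekly cadence suggested above.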
I guess the main thing in my mind when I read the docs is "this reads fine, but maybe it's time that we have a benchmarking guide". My main fear is that folks try out the Huggingface models but forget to check the computational overhead. That's a whole 'nother effort though. Also related to "can we make the benchmarking dev experience better". |
@koaning maybe we could have a forum post announcing this change and link to your video where you go through why it's important to focus on other things like data quality over fancy, heavy embeddings? |
A blog post wouldn't hurt, but the docs might be a more appropriate place. If you were interested in doing a benchmark, would you sooner look for info on the blog or in the docs? This algorithm whiteboard video might be appropriate to link.
# Conflicts:
#	.github/tests/test_download_pretrained.py
🚀 A preview of the docs has been deployed at the following URL: https://10624--rasahq-docs-rasa-v2.netlify.app/docs/rasa
@dakshvar22 Do you think we should still follow up on this PR? |
In order to allow users to use arbitrary HuggingFace models in `LanguageModelFeaturizer`, this PR gets rid of the hard-coded mapping between model architecture and model/tokenizer classes. Instead, the model classes are inferred from the specified weights using `AutoTokenizer` and `TFAutoModel`. The implementation aims to provide the following:
`transformers`
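As a minimal sketch of the Auto-class approach described above (assuming the `transformers` library is installed and the weights can be downloaded; `"bert-base-uncased"` is only an example identifier, and the model side works analogously via `TFAutoModel`):

```python
from transformers import AutoTokenizer

# Any HF weights identifier can be passed; no hard-coded mapping from
# architecture to tokenizer class is needed. "bert-base-uncased" is just
# an example; the PR loads the model analogously via TFAutoModel.
weights = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(weights)

# The Auto class resolved the concrete tokenizer from the weights' config.
print(type(tokenizer).__name__)              # a BERT tokenizer class
print(tokenizer.cls_token, tokenizer.sep_token)
```

The point of the change is exactly this: the caller never names a tokenizer or model class, so any architecture published on the Hub can be used without extending a mapping table.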
TODO:
`##` in BERT). The current version has a fixed list of known delimiter tokens (from BERT, GPT2, XLNet). Should investigate how the `Tokenizer.convert_tokens_to_string` function, which implements this cleaning step in the child classes, can be used. -> The 3 existing delimiter tokens are the ones currently listed in the HF documentation/tokenizer course.
`CLS`, `BOS`, `EOS` be filtered?) -> Testing various tokenizers, removing the `UNK` token seems the correct way. Nothing would prevent someone from writing a custom tokenizer that breaks this, though. The latest HF version contains a base class function `SpecialTokensMixin.get_special_tokens_mask`, which however includes the `UNK` token in the mask and is therefore not useful for our purpose.
`nlu.utils.huggingface` subdirectory

Proposed changes:
Status (please check what you already did):
- `black` (please check Readme for instructions)
- `convert_tokens_to_string` behaves as expected with our cleanup function
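The delimiter-token cleanup discussed in the TODO list could be sketched along the following lines. This is a hypothetical illustration, not the PR's actual code; the helper names are assumptions, and the delimiter markers mirror the three tokenizer families mentioned above (BERT's `##`, GPT-2's `Ġ`, XLNet/SentencePiece's `▁`):

```python
# Hypothetical cleanup based on the fixed list of known delimiter tokens
# mentioned in the TODO list: "##" (BERT WordPiece continuation marker),
# "Ġ" (GPT-2 byte-level BPE space marker), "▁" (SentencePiece space marker).
DELIMITERS = ("##", "Ġ", "▁")

def clean_token(token: str) -> str:
    """Strip known sub-word delimiter markers from a single token."""
    for delim in DELIMITERS:
        token = token.replace(delim, "")
    return token

def clean_tokens(tokens):
    """Clean all tokens, dropping any that become empty after cleanup."""
    return [t for t in (clean_token(tok) for tok in tokens) if t]

# Example with a BERT-style WordPiece split (illustrative segmentation)
print(clean_tokens(["un", "##believ", "##able"]))  # → ['un', 'believ', 'able']
```

A test of the checklist item above could then compare this output against the tokenizer's own `convert_tokens_to_string` result for the same token sequence.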