Use HF Auto classes in LanguageModelFeaturizer #10624
Conversation
…mask to max sequence length.
Commit: a20e992. The full report is available as an artifact. Dataset:
@TyDunn Since this PR is close to being finished and contains a feature update (specifically, expanding the functionality of the current
Let me know if you have any questions or if we should discuss details in a short meeting.
@mleimeister Since this contains enhancements, it should go into the next minor release (3.1).
Hi @koernerfelicia, the latest commits now contain the changes we discussed. In particular:
Let me know if this looks ok to you 🙂
Some small comments. Also, it occurred to me that we should have someone else review the docs part (I don't know if CSE still does this). I'm not sure whether the concept of weights and models, and how they relate to each other, comes across clearly, since we were not always good about consistent terminology here. Maybe we can have fresh eyes on just that docs section?
I think we can request this in #product-docs, unless we think of someone specific (they should not need deep knowledge of how HF transformers models are organised). Also, have you let DevRel know about this? I think it is worth a forum post to drum up excitement 💃
Can we check some non-Latin alphabets here? It'd be a shame if this works for English but breaks for Hebrew, Korean, Chinese, or Arabic. This is probably best served as a "slow test" that we run once a week or so; running it on every PR would be horrendously slow.
@koaning That would be great. I'm actually looking for similar functionality to regularly check that the tokenizer cleanup works as expected. Do we already have a process in place for running such "slow tests"? I guess it could be a cron job(?). I created this follow-up ticket to discuss with QA how best to do this: #10893
It's not super formal, but we have a cronjob for the base Rasa install that might serve as a source of inspiration. That one runs daily, though; this might be better served as an optional/weekly job.
@koaning Ah yes, I saw that. I expanded the follow-up ticket to also cover testing of non-Latin script models. I'll set up a meeting with QA next week to discuss how best to implement this, perhaps as part of a general strategy for running such "slow tests". Would you be happy to have this handled there, in order not to expand this PR further? Do you have any concerns about the docs being understandable for bot developers?
@mleimeister yeah for sure the topic of "what are good slow tests to run from cron" is a larger topic that's a bit out of scope for us here. |
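For reference, a weekly scheduled job of the kind discussed above could look roughly like the following hypothetical GitHub Actions sketch. The workflow name, job layout, and `make test-slow` target are all assumptions for illustration, not part of this PR or the existing Rasa cronjob:

```yaml
# Hypothetical sketch of a weekly "slow tests" workflow (all names assumed)
name: slow-tests
on:
  schedule:
    - cron: "0 3 * * 1"   # Mondays at 03:00 UTC
  workflow_dispatch: {}    # also allow manual runs
jobs:
  slow-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run slow tests
        run: make test-slow   # assumed target covering non-Latin / tokenizer checks
```

Using `schedule` instead of `pull_request` triggers keeps the expensive checks off the per-PR critical path, which matches the weekly cadence suggested above.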
I guess the main thing in my mind when I read the docs is "this reads fine, but maybe it's time that we have a benchmarking guide". My main fear is that folks try out the Huggingface models but forget to check the computational overhead. That's a whole 'nother effort though. Also related to "can we make the benchmarking dev experience better". |
@koaning maybe we could have a forum post announcing this change and link to your video where you go through why it's important to focus on other things like data quality over fancy, heavy embeddings? |
A blog post wouldn't hurt, but the docs might be a more appropriate place. If you were interested in doing a benchmark, would you sooner look for info on the blog or in the docs? This algorithm whiteboard video might be appropriate to link.
# Conflicts:
#	.github/tests/test_download_pretrained.py
🚀 A preview of the docs has been deployed at the following URL: https://10624--rasahq-docs-rasa-v2.netlify.app/docs/rasa
@dakshvar22 Do you think we should still follow up on this PR? |
In order to allow users to use arbitrary HuggingFace models in `LanguageModelFeaturizer`, this PR gets rid of the hard-coded mapping between model architecture and model/tokenizer classes. Instead, the model classes are inferred from the specified weights using `AutoTokenizer` and `TFAutoModel`. The implementation aims to provide the following:
`transformers`
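As a minimal sketch of the Auto-class approach described above (assuming the `transformers` library is installed and the weights can be downloaded; `"bert-base-uncased"` is only an example identifier, and the model side works analogously via `TFAutoModel`):

```python
from transformers import AutoTokenizer

# Any HF weights identifier can be passed; no hard-coded mapping from
# architecture to tokenizer class is needed. "bert-base-uncased" is just
# an example; the PR loads the model analogously via TFAutoModel.
weights = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(weights)

# The Auto class resolved the concrete tokenizer from the weights' config.
print(type(tokenizer).__name__)              # a BERT tokenizer class
print(tokenizer.cls_token, tokenizer.sep_token)
```

The point of the change is exactly this: the caller never names a tokenizer or model class, so any architecture published on the Hub can be used without extending a mapping table.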
TODO:
`##` in BERT). The current version has a fixed list of known delimiter tokens (from BERT, GPT2, XLNet). Should investigate how the `Tokenizer.convert_tokens_to_string` function, which implements this cleaning step in the child classes, can be used. -> The 3 existing delimiter tokens are the ones currently listed in the HF documentation/tokenizer course.
`CLS`, `BOS`, `EOS` be filtered?) -> Testing various tokenizers, removing the `UNK` token seems the correct way. Nothing would prevent someone from writing a custom tokenizer that breaks this, though. The latest HF version contains a base class function `SpecialTokensMixin.get_special_tokens_mask`, which however includes the `UNK` token in the mask and is therefore not useful for our purpose.
`nlu.utils.huggingface` subdirectory

Proposed changes:
Status (please check what you already did):
- `black` (please check Readme for instructions)
- `convert_tokens_to_string` behaves as expected with our cleanup function
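The delimiter-token cleanup discussed in the TODO list could be sketched along the following lines. This is a hypothetical illustration, not the PR's actual code; the helper names are assumptions, and the delimiter markers mirror the three tokenizer families mentioned above (BERT's `##`, GPT-2's `Ġ`, XLNet/SentencePiece's `▁`):

```python
# Hypothetical cleanup based on the fixed list of known delimiter tokens
# mentioned in the TODO list: "##" (BERT WordPiece continuation marker),
# "Ġ" (GPT-2 byte-level BPE space marker), "▁" (SentencePiece space marker).
DELIMITERS = ("##", "Ġ", "▁")

def clean_token(token: str) -> str:
    """Strip known sub-word delimiter markers from a single token."""
    for delim in DELIMITERS:
        token = token.replace(delim, "")
    return token

def clean_tokens(tokens):
    """Clean all tokens, dropping any that become empty after cleanup."""
    return [t for t in (clean_token(tok) for tok in tokens) if t]

# Example with a BERT-style WordPiece split (illustrative segmentation)
print(clean_tokens(["un", "##believ", "##able"]))  # → ['un', 'believ', 'able']
```

A test of the checklist item above could then compare this output against the tokenizer's own `convert_tokens_to_string` result for the same token sequence.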