fix train_new_from_iterator in the case of byte-level tokenizers #17549
Conversation
The documentation is not available anymore as the PR was closed or merged.
if tokenizer_class is not None:
    try:
        tokenizer = get_tiny_tokenizer_from_checkpoint(checkpoint)
        tiny_config.vocab_size = len(tokenizer)
I needed to modify the basis of the pipeline tests as they use the train_new_from_iterator feature to create toy tokenizers.
The problem that arose was the difference between the default tiny_config.vocab_size attribute and the size of the toy tokenizer vocabulary. With my modification, the vocabulary is larger (more than 256 tokens) and exceeded this default value (which was, for example, 99 for Bart).
So I reordered the lines (to create the tokenizer before the model) and added this line.
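For readers without the full diff, a sketch of the resulting order (the helper and variable names come from the snippet above; the final model-construction line is only an assumption about the surrounding test code):

tokenizer = None
if tokenizer_class is not None:
    try:
        # Build the toy tokenizer first, then sync the tiny config so the
        # model's embedding matrix covers the >256-token byte-level vocabulary.
        tokenizer = get_tiny_tokenizer_from_checkpoint(checkpoint)
        tiny_config.vocab_size = len(tokenizer)
    except Exception:
        tokenizer = None
model = model_architecture(tiny_config)  # hypothetical: model created after the tokenizer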
I am not a huge fan of this modification, as it leaves config modified by the test in a not really obvious way.
It enables silent issues to creep up (the issues raised by your train_new_from_iterator change were actually not silent, which was a good thing IMO).
How many tests were impacted?
The other fix I can see would be instead to modify ModelTester.get_pipeline_config to increase vocab_size so that we at least have enough vocabulary to be able to retrain the tokenizer.
wdyt?
Thanks a lot for your review @Narsil 🤗!

> How many tests were impacted?

80 tests failed in the run_tests_pipelines_tf CI and 132 in the run_tests_pipelines_torch CI. I've observed that it impacted 11 different model configurations ('Bart', 'Blenderbot', 'Deberta', 'GPT2', 'GPTJ', 'GPTNeo', 'IBert', 'LED', 'Longformer', 'Roberta' and 'Yoso').

> The other fix I can see would be instead to modify ModelTester.get_pipeline_config to increase vocab_size so that we at least have enough vocabulary to be able to retrain the tokenizer. wdyt?

I really like this suggestion! I made the changes in the last commits. Can you tell me if you're ok with those? I put a vocabulary size of 300, i.e. 256 rounded up to the nearest hundred.
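For illustration, here is roughly what such an override can look like (a sketch only; SomeModelTester and the call to self.get_config() are assumptions, while get_pipeline_config and the value 300 come from the discussion above):

class SomeModelTester:
    # ... regular tiny-model attributes, e.g. a default vocab_size of 99 ...

    def get_pipeline_config(self):
        # Reuse the regular tiny config but enlarge the vocabulary so that a
        # byte-level tokenizer (256 byte symbols plus specials) can be
        # retrained for the pipeline tests.
        config = self.get_config()
        config.vocab_size = 300
        return config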
Thanks for this!
I really think this is better this way! Thanks for taking care of it.
LGTM, thanks for fixing those!
Thanks for this.
I would prefer not changing the config directly within the tests if possible, as IMO it will enable silent bugs to appear.
I can take care of the modifications if you want.
@@ -699,6 +700,8 @@ def train_new_from_iterator(
                kwargs["end_of_word_suffix"] = tokenizer_json["model"]["end_of_word_suffix"]
        if tokenizer_json["model"]["type"] == "Unigram" and unk_token is not None:
            kwargs["unk_token"] = unk_token
        if tokenizer_json["pre_tokenizer"]["type"] == "ByteLevel":
            kwargs["initial_alphabet"] = pre_tokenizers_fast.ByteLevel.alphabet()
At some point everything in this should be ported directly within tokenizers.
Information flow from the Tokenizer to the trainer is a long-standing issue (some options are recoverable, some are not, but it's inconsistent currently).
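As a side note for readers, initial_alphabet is an option the tokenizers trainers already expose; a minimal standalone sketch of the same idea at the tokenizers level (the toy corpus and vocab size are made up):

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# A byte-level BPE tokenizer built from scratch.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

# Without initial_alphabet, bytes that never occur in the (tiny) corpus would
# be missing from the learned vocabulary; seeding the trainer with the 256
# byte symbols is what the new kwargs["initial_alphabet"] line does upstream.
trainer = trainers.BpeTrainer(
    vocab_size=300,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(["a tiny toy corpus", "just a few sentences"], trainer=trainer)
print(tokenizer.get_vocab_size())  # at least 256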
Very clean! Thanks for working on it @SaulLu!
LGTM
What does this PR do?
This PR aims at allowing the use of train_new_from_iterator when the original tokenizer backend was using a ByteLevel pre-tokenization. Before this fix, the learned vocabulary wasn't correct because the initial bytes were missing.
Fixes #17371
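As an illustration of what the fix enables (the checkpoint name and toy corpus below are only examples, not taken from the PR):

from transformers import AutoTokenizer
from tokenizers import pre_tokenizers

# GPT-2 uses a byte-level BPE tokenizer.
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Retrain it on a small toy corpus.
corpus = ["an example sentence", "another example sentence"]
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus, vocab_size=300)

# With the fix, all 256 byte-level symbols end up in the new vocabulary, so
# any string can still be encoded even if its bytes never appeared in the
# training corpus.
byte_alphabet = set(pre_tokenizers.ByteLevel.alphabet())
assert byte_alphabet.issubset(new_tokenizer.get_vocab())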
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Would love to have the feedback of @LysandreJik and @sgugger on the tokenizer part and @Narsil on the pipeline tests (and also the tokenizer if you have more time!)