fix train_new_from_iterator in the case of byte-level tokenizers #17549
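For context, here is a minimal sketch (not code from this PR) of the call the title refers to, assuming a GPT-2-style byte-level BPE checkpoint; the tiny corpus and the `vocab_size` value are purely illustrative.

```python
from transformers import AutoTokenizer

# Byte-level BPE checkpoint (GPT-2); train_new_from_iterator requires the
# fast (Rust-backed) tokenizer.
old_tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)

# Any iterator of strings works; this corpus is illustrative only.
corpus = iter(["some example text", "more example text"])

# Retrain the same tokenization pipeline on the new corpus; this is the
# method the PR fixes for byte-level tokenizers.
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus, vocab_size=300)
print(len(new_tokenizer))
```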
```diff
@@ -143,6 +143,18 @@ def gen_test(ModelClass, checkpoint, tiny_config, tokenizer_class, feature_extra
     @skipIf(tiny_config is None, "TinyConfig does not exist")
     @skipIf(checkpoint is None, "checkpoint does not exist")
     def test(self):
+        if tokenizer_class is not None:
+            try:
+                tokenizer = get_tiny_tokenizer_from_checkpoint(checkpoint)
+                tiny_config.vocab_size = len(tokenizer)
```
Review comment (PR author): I needed to modify the basis of the pipeline tests as they use the … The problem that arose was the difference between the default … So I reordered the lines (to create the tokenizer before the model) and added this line.

Review comment (@Narsil): I am not a huge fan of this modification as it makes … How many tests were impacted? The other fix I can see would be instead to modify … wdyt?

Review comment (PR author): Thanks a lot for your review @Narsil 🤗! 80 tests failed in … I really like this suggestion! I made the changes in the last commits. Can you tell me if you're ok with those? I put a vocabulary size of 300 so that it would be rounded up to the nearest hundred above 256 (see the sketch after the diff).

Review comment (@Narsil): Thanks for this! I really think this is better this way! Thanks for taking care of it.
```diff
+            # Rust Panic exception are NOT Exception subclass
+            # Some test tokenizer contain broken vocabs or custom PreTokenizer, so we
+            # provide some default tokenizer and hope for the best.
+            except:  # noqa: E722
+                self.skipTest(f"Ignoring {ModelClass}, cannot create a simple tokenizer")
+        else:
+            tokenizer = None
+
         if ModelClass.__name__.endswith("ForCausalLM"):
             tiny_config.is_encoder_decoder = False
             if hasattr(tiny_config, "encoder_no_repeat_ngram_size"):
@@ -160,24 +172,14 @@ def test(self):
         )
         if hasattr(model, "eval"):
             model = model.eval()
-        if tokenizer_class is not None:
-            try:
-                tokenizer = get_tiny_tokenizer_from_checkpoint(checkpoint)
-                # XLNet actually defines it as -1.
-                if isinstance(model.config, (RobertaConfig, IBertConfig)):
-                    tokenizer.model_max_length = model.config.max_position_embeddings - 2
-                elif (
-                    hasattr(model.config, "max_position_embeddings")
-                    and model.config.max_position_embeddings > 0
-                ):
-                    tokenizer.model_max_length = model.config.max_position_embeddings
-            # Rust Panic exception are NOT Exception subclass
-            # Some test tokenizer contain broken vocabs or custom PreTokenizer, so we
-            # provide some default tokenizer and hope for the best.
-            except:  # noqa: E722
-                self.skipTest(f"Ignoring {ModelClass}, cannot create a simple tokenizer")
-        else:
-            tokenizer = None
+        if tokenizer is not None:
+            # XLNet actually defines it as -1.
+            if isinstance(model.config, (RobertaConfig, IBertConfig)):
+                tokenizer.model_max_length = model.config.max_position_embeddings - 2
+            elif hasattr(model.config, "max_position_embeddings") and model.config.max_position_embeddings > 0:
+                tokenizer.model_max_length = model.config.max_position_embeddings
+
         feature_extractor = get_tiny_feature_extractor_from_checkpoint(
             checkpoint, tiny_config, feature_extractor_class
         )
```
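Regarding the vocabulary size of 300 mentioned in the review thread above, this is a hedged sketch (not code from the PR) of why 256 is the floor: a byte-level BPE starts from the 256 byte symbols, so any trained vocabulary must hold at least those entries before merges and special tokens. It uses the `tokenizers` library directly, and the training corpus is illustrative.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

# Byte-level BPE: the pre-tokenizer maps text to byte symbols, so the
# vocabulary always contains the 256-byte initial alphabet.
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)

trainer = BpeTrainer(
    vocab_size=300,  # must stay above the 256-byte initial alphabet
    initial_alphabet=ByteLevel.alphabet(),  # the 256 byte-level symbols
)
tokenizer.train_from_iterator(["some example text", "more example text"], trainer=trainer)
assert tokenizer.get_vocab_size() >= 256
```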
Review comment: At some point everything in this should be ported directly within `tokenizers`. Information flow from the `Tokenizer` to the `trainer` is a long-standing issue (some options are recoverable, some are not, but it's inconsistent currently).
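As an illustration of the information-flow point above (my own sketch, not from this PR): a serialized `Tokenizer` records its components but not the trainer options it was built with, so those have to be reconstructed when retraining. This assumes the `gpt2` checkpoint on the Hub ships a `tokenizer.json`.

```python
import json

from tokenizers import Tokenizer

# Load the serialized tokenizer and inspect its JSON: the components
# (model, pre_tokenizer, ...) are stored, but no trainer configuration
# is kept alongside them.
tok = Tokenizer.from_pretrained("gpt2")
state = json.loads(tok.to_str())

print(state["model"]["type"])          # "BPE"
print(state["pre_tokenizer"]["type"])  # "ByteLevel"
print("trainer" in state)              # False: trainer settings are not serialized
```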