
fix train_new_from_iterator in the case of byte-level tokenizers #17549

Merged (15 commits into huggingface:main, Jun 8, 2022)

Conversation

@SaulLu (Contributor) commented Jun 3, 2022

What does this PR do?

This PR allows train_new_from_iterator to be used when the original tokenizer backend uses ByteLevel pre-tokenization. Before this fix, the learned vocabulary wasn't correct because the initial bytes were missing.

Fixes #17371
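To make the failure concrete, here is a minimal sketch at the tokenizers level (the corpus and vocabulary size are illustrative, not from this PR): a byte-level BPE trainer has to be seeded with the full 256-character byte alphabet, otherwise any byte that never appears in the training iterator is dropped from the learned vocabulary.

# Minimal sketch (assumed setup, not the PR's code): retraining byte-level BPE.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=300,
    # The essence of the fix: seed the trainer with all 256 byte-level symbols
    # so that bytes absent from the corpus still end up in the vocabulary.
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(["a tiny illustrative corpus"], trainer=trainer)
assert tokenizer.get_vocab_size() >= 256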

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. I would love feedback from @LysandreJik and @sgugger on the tokenizer part, and from @Narsil on the pipeline tests (and also on the tokenizer if you have more time!).

@SaulLu changed the title from "[WIP] fix train_new_from_iterator in the case of bytelevel tokenizers" to "[WIP] fix train_new_from_iterator in the case of byte-level tokenizers" on Jun 3, 2022
@HuggingFaceDocBuilderDev commented Jun 3, 2022

The documentation is not available anymore as the PR was closed or merged.

if tokenizer_class is not None:
    try:
        tokenizer = get_tiny_tokenizer_from_checkpoint(checkpoint)
        tiny_config.vocab_size = len(tokenizer)
@SaulLu (Contributor, Author) commented Jun 6, 2022

I needed to modify the base of the pipeline tests, as they use the train_new_from_iterator feature to create toy tokenizers.

The problem that arose was the difference between the default tiny_config.vocab_size attribute and the size of the toy tokenizer's vocabulary. With my modification, the vocabulary is larger (more than 256 tokens) and exceeds this default value (which was, for example, 99 for Bart).

So I reordered the lines (to create the tokenizer before the model) and added this line.

@Narsil (Contributor)

I am not a huge fan of this modification, as it modifies the config within the test in a way that is not really obvious. It enables silent issues to creep in (the issues raised by your train_new_from_iterator change were actually not silent, which was a good thing IMO).

How many tests were impacted?

The other fix I can see would be to modify ModelTester.get_pipeline_config instead, adding the necessary vocab_size so that we at least have enough vocab to be able to retrain the tokenizer.

wdyt?

@SaulLu (Contributor, Author)

Thanks a lot for your review @Narsil 🤗!

> How many tests were impacted?

80 tests failed in the run_tests_pipelines_tf CI and 132 in the run_tests_pipelines_torch CI. I've observed that it impacted 11 different model configurations ('Bart', 'Blenderbot', 'Deberta', 'GPT2', 'GPTJ', 'GPTNeo', 'IBert', 'LED', 'Longformer', 'Roberta' and 'Yoso').

> The other fix I can see would be to modify ModelTester.get_pipeline_config instead, adding the necessary vocab_size so that we at least have enough vocab to be able to retrain the tokenizer.
> wdyt?

I really like this suggestion! I made the changes in the last commits. Can you tell me if you're ok with those? I set the vocabulary size to 300, i.e. 256 rounded up to the nearest hundred.
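For illustration, a sketch of what this looks like for one model tester (the class name is hypothetical; the PR applies the same pattern to the testers of the 11 configurations listed above):

class SomeModelTester:  # hypothetical name, standing in for e.g. the Bart tester
    def get_config(self):
        ...  # default tiny config used by unit tests, e.g. vocab_size=99

    def get_pipeline_config(self):
        # Pipeline tests get a config whose vocab can hold the 256 byte-level
        # tokens: 300 is 256 rounded up to the nearest hundred.
        config = self.get_config()
        config.vocab_size = 300
        return config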

@Narsil (Contributor)

Thanks for this!

I really think this is better this way! Thanks for taking care of it.

@SaulLu changed the title from "[WIP] fix train_new_from_iterator in the case of byte-level tokenizers" to "fix train_new_from_iterator in the case of byte-level tokenizers" on Jun 6, 2022
@sgugger (Collaborator) left a comment

LGTM, thanks for fixing those!

@Narsil (Contributor) left a comment

Thanks for this.

I would prefer not changing the config directly within the tests if possible, as IMO it will enable silent bugs to appear.

I can take care of the modifications if you want.

@@ -699,6 +700,8 @@ def train_new_from_iterator(
             kwargs["end_of_word_suffix"] = tokenizer_json["model"]["end_of_word_suffix"]
         if tokenizer_json["model"]["type"] == "Unigram" and unk_token is not None:
             kwargs["unk_token"] = unk_token
+        if tokenizer_json["pre_tokenizer"]["type"] == "ByteLevel":
+            kwargs["initial_alphabet"] = pre_tokenizers_fast.ByteLevel.alphabet()
@Narsil (Contributor)

At some point, everything in this should be ported directly into tokenizers.

Information flow from the Tokenizer to the trainer is a long-standing issue (some options are recoverable, some are not, but it's currently inconsistent).
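For reference, a usage sketch of the fixed behaviour at the transformers level (the checkpoint and corpus are illustrative):

# Usage sketch: with this PR, retraining a tokenizer whose backend uses
# ByteLevel pre-tokenization seeds the trainer via `initial_alphabet`.
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")  # byte-level BPE tokenizer
corpus = (line for line in ["a handful of", "training lines"])
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus, vocab_size=300)
# All 256 byte-level symbols now survive retraining.
assert len(new_tokenizer) >= 256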


@LysandreJik (Member) left a comment

Very clean! Thanks for working on it @SaulLu!

@Narsil (Contributor) left a comment

LGTM

@SaulLu merged commit ae7bae8 into huggingface:main on Jun 8, 2022
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Jun 16, 2022
Merging this pull request closed the linked issue #17371: tokenizer object incorrectly modified in PreTrainedTokenizerFast.train_new_from_iterator()

5 participants