fix train_new_from_iterator in the case of byte-level tokenizers #17549
Conversation
The documentation is not available anymore as the PR was closed or merged.
if tokenizer_class is not None:
    try:
        tokenizer = get_tiny_tokenizer_from_checkpoint(checkpoint)
        tiny_config.vocab_size = len(tokenizer)
I needed to modify the basis of the pipeline tests as they use the train_new_from_iterator feature to create toy tokenizers.
The problem that arose was the difference between the default tiny_config.vocab_size attribute and the size of the toy tokenizer vocabulary. With my modification, the vocabulary is larger (more than 256 tokens) and exceeded this default value (which was, for example, 99 for Bart).
So I reordered the lines (to create the tokenizer before the model) and added this line.
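For readers without the full diff, a sketch of the resulting order (the helper and variable names come from the snippet above; the final model-construction line is only an assumption about the surrounding test code):

tokenizer = None
if tokenizer_class is not None:
    try:
        # Build the toy tokenizer first, then sync the tiny config so the
        # model's embedding matrix covers the >256-token byte-level vocabulary.
        tokenizer = get_tiny_tokenizer_from_checkpoint(checkpoint)
        tiny_config.vocab_size = len(tokenizer)
    except Exception:
        tokenizer = None
model = model_architecture(tiny_config)  # hypothetical: model created after the tokenizer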
I am not a huge fan of this modification, as it leaves config modified by the test in a not really obvious way.
It enables silent issues to creep up (the issues raised by your train_new_from_iterator change were actually not silent, which was a good thing IMO).
How many tests were impacted?
The other fix I can see would be instead to modify ModelTester.get_pipeline_config to increase vocab_size so that we at least have enough vocabulary to be able to retrain the tokenizer.
wdyt?
Thanks a lot for your review @Narsil 🤗!

> How many tests were impacted?

80 tests failed in the run_tests_pipelines_tf CI and 132 in the run_tests_pipelines_torch CI. I've observed that it impacted 11 different model configurations ('Bart', 'Blenderbot', 'Deberta', 'GPT2', 'GPTJ', 'GPTNeo', 'IBert', 'LED', 'Longformer', 'Roberta' and 'Yoso').

> The other fix I can see would be instead to modify ModelTester.get_pipeline_config to increase vocab_size so that we at least have enough vocabulary to be able to retrain the tokenizer. wdyt?

I really like this suggestion! I made the changes in the last commits. Can you tell me if you're ok with those? I put a vocabulary size of 300, i.e. 256 rounded up to the nearest hundred.
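For illustration, here is roughly what such an override can look like (a sketch only; SomeModelTester and the call to self.get_config() are assumptions, while get_pipeline_config and the value 300 come from the discussion above):

class SomeModelTester:
    # ... regular tiny-model attributes, e.g. a default vocab_size of 99 ...

    def get_pipeline_config(self):
        # Reuse the regular tiny config but enlarge the vocabulary so that a
        # byte-level tokenizer (256 byte symbols plus specials) can be
        # retrained for the pipeline tests.
        config = self.get_config()
        config.vocab_size = 300
        return config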
Thanks for this!
I really think this is better this way! Thanks for taking care of it.
LGTM, thanks for fixing those!
Thanks for this.
I would prefer not changing the config directly within the tests if possible, as IMO it will enable silent bugs to appear.
I can take care of the modifications if you want.
@@ -699,6 +700,8 @@ def train_new_from_iterator(
                kwargs["end_of_word_suffix"] = tokenizer_json["model"]["end_of_word_suffix"]
        if tokenizer_json["model"]["type"] == "Unigram" and unk_token is not None:
            kwargs["unk_token"] = unk_token
        if tokenizer_json["pre_tokenizer"]["type"] == "ByteLevel":
            kwargs["initial_alphabet"] = pre_tokenizers_fast.ByteLevel.alphabet()
At some point everything in this should be ported directly within tokenizers.
Information flow from the Tokenizer to the trainer is a long-standing issue (some options are recoverable, some are not, but it's inconsistent currently).
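As a side note for readers, initial_alphabet is an option the tokenizers trainers already expose; a minimal standalone sketch of the same idea at the tokenizers level (the toy corpus and vocab size are made up):

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# A byte-level BPE tokenizer built from scratch.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

# Without initial_alphabet, bytes that never occur in the (tiny) corpus would
# be missing from the learned vocabulary; seeding the trainer with the 256
# byte symbols is what the new kwargs["initial_alphabet"] line does upstream.
trainer = trainers.BpeTrainer(
    vocab_size=300,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(["a tiny toy corpus", "just a few sentences"], trainer=trainer)
print(tokenizer.get_vocab_size())  # at least 256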
Very clean! Thanks for working on it @SaulLu!
LGTM
What does this PR do?
This PR aims at allowing the use of train_new_from_iterator when the original tokenizer backend was using a ByteLevel pre-tokenization. Before this fix, the learned vocabulary wasn't correct because the initial bytes were missing.
Fixes #17371
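As an illustration of what the fix enables (the checkpoint name and toy corpus below are only examples, not taken from the PR):

from transformers import AutoTokenizer
from tokenizers import pre_tokenizers

# GPT-2 uses a byte-level BPE tokenizer.
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Retrain it on a small toy corpus.
corpus = ["an example sentence", "another example sentence"]
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus, vocab_size=300)

# With the fix, all 256 byte-level symbols end up in the new vocabulary, so
# any string can still be encoded even if its bytes never appeared in the
# training corpus.
byte_alphabet = set(pre_tokenizers.ByteLevel.alphabet())
assert byte_alphabet.issubset(new_tokenizer.get_vocab())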
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Would love to have the feedback of @LysandreJik and @sgugger on the tokenizer part and @Narsil on the pipeline tests (and also the tokenizer if you have more time!)