[Tokenizer] Fix slow and fast serialization #26570

Conversation
Feel free to merge when ready, as discussed offline with you.
I ran into the error below, so I added some prints and got these intermediate values:
According to the output, I made a fix, which seemed to work:

```python
# Passing AddedTokens and not strings to the class to prevent it from casting
# the string to a different AddedToken
for key in cls.SPECIAL_TOKENS_ATTRIBUTES & init_kwargs.keys():
    if added_tokens_map != {} and init_kwargs[key] is not None:
        if key != "additional_special_tokens":
            # >>> debug
            def print_info(name, obj):
                print(f"{name}: ({type(obj).__name__}){obj}")

            print_info("cls.SPECIAL_TOKENS_ATTRIBUTES", cls.SPECIAL_TOKENS_ATTRIBUTES)
            print_info("added_tokens_map", added_tokens_map)
            print_info("init_kwargs", init_kwargs)
            print_info("key", key)
            print_info("init_kwargs[key]", init_kwargs[key])
            # <<< debug

            # before: looked up the token value itself
            # init_kwargs[key] = added_tokens_map.get(init_kwargs[key], init_kwargs[key])
            # after: look up the attribute name instead
            init_kwargs[key] = added_tokens_map.get(key, init_kwargs[key])  # fix
```
Could you share a reproducer? Would help me a lot as well!
Sorry, I'm too busy to do so right now 😭 But this only happened when I loaded the tokenizer of Llemma-7B. I hope this description helps you reproduce the error.
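A minimal reproducer along those lines might look like the following; the hub id `EleutherAI/llemma_7b` is an assumption based on the model name mentioned above, and the snippet is untested:

```python
from transformers import AutoTokenizer

# Assumed hub id for Llemma-7B; loading its tokenizer is what reportedly
# triggered the serialization error discussed in this thread.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/llemma_7b")
```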
The squashed commit history:

* fix
* last attempt
* current work
* fix forward compatibility
* save all special tokens
* current state
* revert additional changes
* updates
* remove tokenizer.model
* add a test and the fix
* nit
* revert one more break
* fix typefield issue
* quality
* more tests
* fix fields for FC
* more nits?
* new additional changes
* how
* some updates
* simplify all
* more nits
* revert some things to original
* nice
* nits
* a small hack
* more nits
* ahhaha
* fixup
* update
* make test run on ci
* use subtesting
* update
* Update .circleci/create_circleci_config.py
* updates
* fixup
* nits
* replace typo
* fix the test
* nits
* update
* None max dif pls
* a partial fix
* had to revert one thing
* test the fast
* updates
* fixup
* and more nits
* more fixes
* update
* Oupsy 👁️
* nits
* fix marian
* on our way to heaven
* Update src/transformers/models/t5/tokenization_t5.py (Co-authored-by: Lysandre Debut <[email protected]>)
* fixup
* Update src/transformers/tokenization_utils_fast.py (Co-authored-by: Leo Tronchon <[email protected]>)
* Update src/transformers/tokenization_utils_base.py (Co-authored-by: Leo Tronchon <[email protected]>)
* fix phobert
* skip some things, test more
* nits
* fixup
* fix deberta
* update
* update
* more updates
* skip one test
* more updates
* fix camembert
* can't test this one
* more good fixes
* kind of a major update
  - seperate what is only done in fast in fast init and refactor
  - add_token(AddedToken(..., speicla = True)) ignores it in fast
  - better loading
* fixup
* more fixups
* fix pegasus and mpnet
* remove skipped tests
* fix phoneme tokenizer if self.verbose
* fix individual models
* update common tests
* update testing files
* all over again
* nits
* skip test for markup lm
* fixups
* fix order of addition in fast by sorting the added tokens decoder
* proper defaults for deberta
* correct default for fnet
* nits on add tokens, string initialized to special if special
* skip irrelevant herbert tests
* main fixes
* update test added_tokens_serialization
* the fix for bart like models and class instanciating
* update bart
* nit!
* update idefix test
* fix whisper!
* some fixup
* fixups
* revert some of the wrong chanegs
* fixup
* fixup
* skip marian
* skip the correct tests
* skip for tf and flax as well

---------

Co-authored-by: Lysandre Debut <[email protected]>
Co-authored-by: Leo Tronchon <[email protected]>
What does this PR do?
Fixes the serialization of added tokens, for both slow and fast tokenizers:

* the `added_tokens.json` file: a recent push made it save the whole added tokens encoder, but it should only save the indexes greater than the vocab size, for forward compatibility (see the first sketch after this list).
* the `additional_special_tokens` that were added twice / overwritten.
* `add_tokens`: if the added token is a string, we check whether it is already in the added vocab instead of always defaulting to strip left or right.
* the `"__type": "AddedToken"` field is added to the added tokens, otherwise the previous versions of transformers will try to load them (see the second sketch below).

fixes #26732, fixes #26775, fixes #26773, fixes #26768, fixes #26859
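As a rough illustration of the first point, with made-up values rather than the PR's actual code: only added tokens whose index falls outside the base vocabulary should end up in `added_tokens.json`.

```python
# Hypothetical values for illustration; not the PR's actual implementation.
vocab_size = 32000
added_tokens_encoder = {"<pad>": 0, "</s>": 2, "<new_tok>": 32000}

# Forward compatibility: only indexes >= vocab_size belong in
# added_tokens.json, since everything below is already in the base vocab.
added_tokens_json = {
    tok: idx for tok, idx in added_tokens_encoder.items() if idx >= vocab_size
}
print(added_tokens_json)  # {'<new_tok>': 32000}
```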
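For the last point, a hedged sketch of what the restored serialization might look like: entries in `tokenizer_config.json` carry a `"__type": "AddedToken"` marker so that older versions rebuild an `AddedToken` rather than a plain string. The field names and values below are assumptions based on the legacy format, not copied from the PR.

```python
import json

# Illustrative only: an eos_token entry as it might appear in
# tokenizer_config.json once the "__type" marker is saved again.
eos_entry = {
    "__type": "AddedToken",  # tells older loaders to rebuild an AddedToken
    "content": "</s>",
    "single_word": False,
    "lstrip": False,
    "rstrip": False,
    "normalized": True,
}
print(json.dumps({"eos_token": eos_entry}, indent=2))
```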