Nit-added-tokens #26538

ArthurZucker · 2023-10-02T12:48:57Z

What does this PR do?

Fixes #26500, fixes #26536

HuggingFaceDocBuilderDev · 2023-10-02T13:06:00Z

The documentation is not available anymore as the PR was closed or merged.

ArthurZucker · 2023-10-02T15:06:41Z

src/transformers/tokenization_utils_base.py

@@ -2382,8 +2387,8 @@ def save_pretrained(
        tokenizer_config = copy.deepcopy(self.init_kwargs)

        # TODO: Ensure the modified attributes (those are also in the __init__ kwargs) will give identical tokenizers
-        # target_keys = self.init_kwargs.keys()
-        target_keys = ["model_max_length", "clean_up_tokenization_spaces", "additional_special_tokens"]
+        target_keys = list(self.init_kwargs.keys())


When saving, we should overwrite the init_kwargs with the content of self. I don't know why this was not the case before.
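A minimal sketch of that idea, using a hypothetical `ToyTokenizer` (not the real tokenizer base class) and a hypothetical `build_config` helper: every key recorded in `init_kwargs` is refreshed from the live attribute on `self`, so values changed after construction are the ones that get saved.

```python
import copy


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer; not the transformers class."""

    def __init__(self, **kwargs):
        # Remember the kwargs used at construction, and expose them as attributes.
        self.init_kwargs = dict(kwargs)
        for key, value in kwargs.items():
            setattr(self, key, value)

    def build_config(self):
        config = copy.deepcopy(self.init_kwargs)
        # Refresh every init kwarg from the current state of `self`.
        for key in list(self.init_kwargs.keys()):
            if hasattr(self, key):
                config[key] = getattr(self, key)
        return config


tok = ToyTokenizer(model_max_length=512, pad_token="<pad>")
tok.pad_token = "[PAD]"  # modified after construction
config = tok.build_config()
print(config)
```

With a fixed allow-list of keys, an attribute such as `pad_token` in this toy example would be saved with its stale construction-time value.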

ArthurZucker · 2023-10-02T15:07:01Z

src/transformers/tokenization_utils_base.py

@@ -2227,7 +2232,7 @@ def _from_pretrained(
            if added_tokens_file is not None:
                with open(added_tokens_file, encoding="utf-8") as added_tokens_handle:
                    added_tok_encoder = json.load(added_tokens_handle)
-                # legacy: we have to init with (rstrip=True, lstrip=True)
+                # legacy: we have to init with (rstrip=True, lstrip=True) (if the token is new? Failing test)


We might have to update this: the tests are poor, and the legacy default is biting us.

ArthurZucker · 2023-10-02T15:07:38Z

src/transformers/tokenization_utils_base.py

+                    if str(token) in additional_special_tokens:
+                        # at this point if the token is in `additional_special_tokens` as an str, should be updated
+                        additional_special_tokens.remove(str(token))


Only use the default legacy values for AddedToken if the token is not already in the added tokens decoder.
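A rough sketch of that rule, assuming a hypothetical `ToyAddedToken` dataclass and a hypothetical `load_legacy_tokens` helper (neither is the library's API): tokens already present in the decoder keep their stored settings, and only tokens found solely in the legacy `{content: index}` file get the legacy `lstrip=True, rstrip=True` defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyAddedToken:
    """Hypothetical stand-in for AddedToken, with only the fields used here."""

    content: str
    lstrip: bool = False
    rstrip: bool = False
    special: bool = False


def load_legacy_tokens(added_tok_encoder, added_tokens_decoder):
    """Merge a legacy {content: index} mapping into an {index: token} decoder."""
    known = {tok.content for tok in added_tokens_decoder.values()}
    merged = dict(added_tokens_decoder)
    for content, index in sorted(added_tok_encoder.items(), key=lambda kv: kv[1]):
        if content not in known:
            # Token only exists in the legacy file: fall back to the legacy defaults.
            merged[index] = ToyAddedToken(content, lstrip=True, rstrip=True)
    return merged


decoder = {0: ToyAddedToken("<pad>", special=True)}
legacy = {"<pad>": 0, "<new>": 1}
merged = load_legacy_tokens(legacy, decoder)
print(merged)
```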

LysandreJik

LGTM

ArthurZucker · 2023-10-03T10:00:09Z

A small benchmark of get_added_vocab():

from transformers import AutoTokenizer
import time

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-moe-54b")

# Time a call to get_added_vocab()
start = time.time(); tokenizer.get_added_vocab(); print(time.time() - start)
>>> 0.17021536827087402

# Time building the content -> index mapping directly from added_tokens_decoder
start = time.time(); {k.content: v for v, k in sorted(tokenizer.added_tokens_decoder.items(), key=lambda item: item[0])}; print(time.time() - start)
>>> 0.0054759979248046875

# Time accessing added_tokens_decoder alone
start = time.time(); tokenizer.added_tokens_decoder; print(time.time() - start)
>>> 0.0007669925689697266

I will update the Rust code to make tokenizer.added_tokens_encoder available.
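For readers without the checkpoint at hand, the second measurement can be approximated on synthetic data. This is only a sketch: `ToyAddedToken`, `added_vocab_from_decoder`, and the 1000-entry decoder are assumptions made for illustration, and `timeit` is swapped in for the one-shot `time.time()` calls to average over repeated runs.

```python
import timeit
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyAddedToken:
    """Hypothetical stand-in for AddedToken; only `content` is needed here."""

    content: str


# Synthetic {index: token} decoder, standing in for tokenizer.added_tokens_decoder.
added_tokens_decoder = {i: ToyAddedToken(f"<extra_{i}>") for i in range(1000)}


def added_vocab_from_decoder(decoder):
    """Build a {content: index} mapping, ordered by index."""
    return {tok.content: idx for idx, tok in sorted(decoder.items(), key=lambda item: item[0])}


vocab = added_vocab_from_decoder(added_tokens_decoder)

# Average over repeated calls instead of timing a single call.
runs = 100
elapsed = timeit.timeit(lambda: added_vocab_from_decoder(added_tokens_decoder), number=runs)
print(f"{elapsed / runs:.6f} s per call")
```

The absolute numbers will differ from the ones reported above, since they depend on the machine and on the number of added tokens.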

HuggingFaceDocBuilderDev · 2023-10-03T10:27:49Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

ArthurZucker · 2023-10-03T16:51:03Z

src/transformers/tokenization_utils_base.py

+                    if str(token) in additional_special_tokens:
+                        # at this point the token is in `additional_special_tokens` as an str, let's add the AddedToken info
+                        additional_special_tokens.remove(str(token))
+                    if token.special and token not in additional_special_tokens:
+                        additional_special_tokens.append(token)


cc @LysandreJik here
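The effect of the lines above can be sketched in isolation. `ToyAddedToken` and `upgrade_special_tokens` are hypothetical names introduced for this illustration, under the assumption that `str(token)` returns the token's content: a plain-string entry is dropped, and the richer token object is appended in its place when it is marked special.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyAddedToken:
    """Hypothetical stand-in for AddedToken; str() yields the content."""

    content: str
    special: bool = False

    def __str__(self):
        return self.content


def upgrade_special_tokens(additional_special_tokens, tokens):
    """Return a copy of the list where string entries are replaced by token objects."""
    result = list(additional_special_tokens)
    for token in tokens:
        if str(token) in result:
            # The token is listed as a plain str: drop it so the object can replace it.
            result.remove(str(token))
        if token.special and token not in result:
            result.append(token)
    return result


upgraded = upgrade_special_tokens(
    ["<a>", "<b>"],
    [ToyAddedToken("<a>", special=True), ToyAddedToken("<c>")],
)
print(upgraded)
```

In this sketch the upgraded entry moves to the end of the list, and a non-special token such as `<c>` is never appended.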

* fix stripping
* nits
* fix another test
* styling
* fix?
* update
* revert bad merge
* found the bug
* YES SIR
* is that change really required?
* make fast even faster
* re order functions

ArthurZucker added 5 commits September 28, 2023 12:07

fix stripping

32be323

Merge branch 'main' of https://github.com/huggingface/transformers

603d4e9

nits

cb4e48a

fix another test

e9bc0e6

styling

fa93ed3

ArthurZucker added 3 commits October 2, 2023 15:40

fix?

f031f5e

Merge branch 'main' of https://github.com/huggingface/transformers into nit-added-tokens

ec9530b

update

fb80bf9

ArthurZucker marked this pull request as ready for review October 2, 2023 15:01

Merge branch 'main' of github.com:huggingface/transformers into nit-added-tokens

2260c85

ArthurZucker commented Oct 2, 2023

ArthurZucker added 2 commits October 2, 2023 17:48

revert bad merge

a9b8845

Merge branch 'nit-added-tokens' of https://github.com/ArthurZucker/transformers into nit-added-tokens

3807fcb

LysandreJik approved these changes Oct 2, 2023

ArthurZucker added 3 commits October 2, 2023 19:34

found the bug

339ce67

YES SIR

d093b5c

is that change really required?

c12a2f9

LysandreJik approved these changes Oct 3, 2023

ArthurZucker added 2 commits October 3, 2023 12:02

make fast even faster

02922e1

re order functions

93152be

ArthurZucker merged commit 1a2e966 into huggingface:main Oct 3, 2023
3 checks passed

ArthurZucker deleted the nit-added-tokens branch October 3, 2023 10:36

This was referenced Oct 3, 2023

new T5 tokenisation unexpected behaviour immediately after added special tokens #26543

Closed

[Whisper Tokenizer] Test timestamps #26053

Closed
