
fix train_new_from_iterator in the case of byte-level tokenizers #17549

Merged · 15 commits · Jun 8, 2022
3 changes: 3 additions & 0 deletions src/transformers/tokenization_utils_fast.py
@@ -21,6 +21,7 @@
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple, Union

import tokenizers.pre_tokenizers as pre_tokenizers_fast
from tokenizers import Encoding as EncodingFast
from tokenizers import Tokenizer as TokenizerFast
from tokenizers.decoders import Decoder as DecoderFast
@@ -699,6 +700,8 @@ def train_new_from_iterator(
kwargs["end_of_word_suffix"] = tokenizer_json["model"]["end_of_word_suffix"]
if tokenizer_json["model"]["type"] == "Unigram" and unk_token is not None:
kwargs["unk_token"] = unk_token
if tokenizer_json["pre_tokenizer"]["type"] == "ByteLevel":
kwargs["initial_alphabet"] = pre_tokenizers_fast.ByteLevel.alphabet()
Contributor commented:

At some point, everything in this should be ported directly into tokenizers.

Information flow from the Tokenizer to the trainer is a long-standing issue (some options are recoverable, some are not, and it's currently inconsistent).


trainer_class = MODEL_TO_TRAINER_MAPPING[tokenizer_json["model"]["type"]]
trainer = trainer_class(vocab_size=vocab_size, special_tokens=special_tokens, **kwargs)
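To see what the new initial_alphabet kwarg buys, here is a small standalone sketch (not part of the patch, and written directly against the tokenizers library rather than through train_new_from_iterator): seeding the trainer with ByteLevel.alphabet() is what guarantees that all 256 byte symbols end up in the vocabulary even when the training corpus never contains them.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE: the pre-tokenizer maps raw UTF-8 bytes to printable symbols.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

# Without initial_alphabet, bytes that never appear in the training data would
# be missing from the vocabulary, and unseen characters could not be encoded.
trainer = trainers.BpeTrainer(
    vocab_size=300,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # the 256 byte symbols
)
tokenizer.train_from_iterator(("a" for _ in range(1000)), trainer=trainer)

print(tokenizer.encode("a🤗").ids)  # 5 ids, even though 🤗 was never seen in training
```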
38 changes: 20 additions & 18 deletions tests/pipelines/test_pipelines_common.py
@@ -143,6 +143,18 @@ def gen_test(ModelClass, checkpoint, tiny_config, tokenizer_class, feature_extra
@skipIf(tiny_config is None, "TinyConfig does not exist")
@skipIf(checkpoint is None, "checkpoint does not exist")
def test(self):
if tokenizer_class is not None:
try:
tokenizer = get_tiny_tokenizer_from_checkpoint(checkpoint)
tiny_config.vocab_size = len(tokenizer)
Contributor Author (@SaulLu) commented on Jun 6, 2022:

I needed to modify the basis of the pipeline tests because they use the train_new_from_iterator feature to create toy tokenizers.

The problem was the mismatch between the default tiny_config.vocab_size attribute and the size of the toy tokenizer's vocabulary. With my modification, the vocabulary is larger (more than 256 tokens) and exceeds that default value (which was, for example, 99 for Bart).

So I reordered the lines (to create the tokenizer before the model) and added this line.

Contributor replied:

I am not a huge fan of this modification, as it changes the config inside the test in a way that is not really obvious.
It lets silent issues creep in (the issues raised by your train_new_from_iterator change were actually not silent, which was a good thing IMO).

How many tests were impacted?

The other fix I can see would be to instead modify ModelTester.get_pipeline_config to add the necessary vocab_size so that we at least have enough vocabulary to retrain the tokenizer.

wdyt?

Contributor Author (@SaulLu) replied:

Thanks a lot for your review @Narsil 🤗!

> How many tests were impacted?

80 tests failed in the run_tests_pipelines_tf CI and 132 in the run_tests_pipelines_torch CI. I observed that this impacted 11 different model configurations ('Bart', 'Blenderbot', 'Deberta', 'GPT2', 'GPTJ', 'GPTNeo', 'IBert', 'LED', 'Longformer', 'Roberta' and 'Yoso').

> The other fix I can see would be to instead modify ModelTester.get_pipeline_config to add the necessary vocab_size so that we at least have enough vocabulary to retrain the tokenizer.
> wdyt?

I really like this suggestion! I made the changes in the last commits. Could you tell me if you're OK with them? I set the vocabulary size to 300, the nearest hundred above 256.

Contributor replied:

Thanks for this!

I really think this is better this way! Thanks for taking care of it.
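
To make the agreed-upon approach concrete, here is a hypothetical sketch (class and attribute names are illustrative assumptions, not the actual code added in the PR commits): a model tester keeps its tiny default config but exposes a pipeline-specific config whose vocab_size can hold the 256 byte-level symbols of a retrained tokenizer.

```python
# Illustrative only: DummyModelTester and the dict-based config are assumptions,
# not the real transformers test classes.
class DummyModelTester:
    vocab_size = 99  # typical tiny default, too small for a byte-level vocabulary

    def get_config(self):
        return {"vocab_size": self.vocab_size}

    def get_pipeline_config(self):
        # Pipeline tests retrain a byte-level tokenizer, which needs at least
        # 256 symbols, so bump the vocab to the nearest hundred above that.
        config = self.get_config()
        config["vocab_size"] = 300
        return config


print(DummyModelTester().get_pipeline_config())  # {'vocab_size': 300}
```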

# Rust Panic exception are NOT Exception subclass
# Some test tokenizer contain broken vocabs or custom PreTokenizer, so we
# provide some default tokenizer and hope for the best.
except: # noqa: E722
self.skipTest(f"Ignoring {ModelClass}, cannot create a simple tokenizer")
else:
tokenizer = None

if ModelClass.__name__.endswith("ForCausalLM"):
tiny_config.is_encoder_decoder = False
if hasattr(tiny_config, "encoder_no_repeat_ngram_size"):
@@ -160,24 +172,14 @@ def test(self):
)
if hasattr(model, "eval"):
model = model.eval()
if tokenizer_class is not None:
try:
tokenizer = get_tiny_tokenizer_from_checkpoint(checkpoint)
# XLNet actually defines it as -1.
if isinstance(model.config, (RobertaConfig, IBertConfig)):
tokenizer.model_max_length = model.config.max_position_embeddings - 2
elif (
hasattr(model.config, "max_position_embeddings")
and model.config.max_position_embeddings > 0
):
tokenizer.model_max_length = model.config.max_position_embeddings
# Rust Panic exception are NOT Exception subclass
# Some test tokenizer contain broken vocabs or custom PreTokenizer, so we
# provide some default tokenizer and hope for the best.
except: # noqa: E722
self.skipTest(f"Ignoring {ModelClass}, cannot create a simple tokenizer")
else:
tokenizer = None

if tokenizer is not None:
# XLNet actually defines it as -1.
if isinstance(model.config, (RobertaConfig, IBertConfig)):
tokenizer.model_max_length = model.config.max_position_embeddings - 2
elif hasattr(model.config, "max_position_embeddings") and model.config.max_position_embeddings > 0:
tokenizer.model_max_length = model.config.max_position_embeddings

feature_extractor = get_tiny_feature_extractor_from_checkpoint(
checkpoint, tiny_config, feature_extractor_class
)
10 changes: 10 additions & 0 deletions tests/tokenization/test_tokenization_fast.py
@@ -39,6 +39,7 @@ def setUp(self):
self.test_rust_tokenizer = True

model_paths = ["robot-test/dummy-tokenizer-fast", "robot-test/dummy-tokenizer-wordlevel"]
self.bytelevel_bpe_model_name = "SaulLu/dummy-tokenizer-bytelevel-bpe"

# Inclusion of 2 tokenizers to test different types of models (Unigram and WordLevel for the moment)
self.tokenizers_list = [(PreTrainedTokenizerFast, model_path, {}) for model_path in model_paths]
@@ -99,6 +100,15 @@ def test_training_new_tokenizer_with_special_tokens_change(self):
shutil.rmtree(self.tmpdirname)
self.tmpdirname = tmpdirname_orig

def test_training_new_tokenizer_with_bytelevel(self):
tokenizer = self.rust_tokenizer_class.from_pretrained(self.bytelevel_bpe_model_name)

toy_text_iterator = ("a" for _ in range(1000))
new_tokenizer = tokenizer.train_new_from_iterator(text_iterator=toy_text_iterator, length=1000, vocab_size=50)

encoding_ids = new_tokenizer.encode("a🤗")
self.assertEqual(encoding_ids, [64, 172, 253, 97, 245])


@require_tokenizers
class TokenizerVersioningTest(unittest.TestCase):
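
A side note on the expected IDs in the new test: a byte-level tokenizer operates on the UTF-8 bytes of its input, so even though the toy corpus only ever contained "a", the emoji stays encodable. A quick sanity check (not part of the PR) of why "a🤗" yields exactly five token IDs:

```python
# "a" is 1 byte and "🤗" is 4 bytes in UTF-8; with no merges applicable to these
# bytes, a byte-level tokenizer emits one token per byte, i.e. 5 tokens.
text = "a🤗"
print(len(text.encode("utf-8")))  # 5
```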