
bloom add_prefix_space= True #24846

Closed
Dongximing opened this issue Jul 17, 2023 · 5 comments · Fixed by #25563
Dongximing commented Jul 17, 2023

System Info

Hi,
I am using BloomTokenizerFast as my tokenizer and ran into an issue.

Version: 4.28.0

When I use BloomTokenizerFast, I find that add_prefix_space=True has no effect.
Here is the code:

```python
import transformers
from transformers import BloomTokenizerFast

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom", add_prefix_space=True)
print(tokenizer.add_prefix_space)             # True
print(tokenizer("Hello world")["input_ids"])  # [59414, 8876]
print(transformers.__version__)               # 4.28.0
```

And here is the other code:

```python
from transformers import BloomTokenizerFast

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom")
print(tokenizer("Hello world")["input_ids"])  # [59414, 8876]
```
I don't understand why they encode to the same result.

Please have a look!
Thanks
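
For illustration (a minimal sketch, not from the original report, assuming the same bigscience/bloom checkpoint): manually prepending the space, which is what add_prefix_space=True is expected to do, does change the encoding.

```python
# Illustration sketch only; assumes the same bigscience/bloom checkpoint as above.
from transformers import BloomTokenizerFast

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom")
print(tokenizer("Hello world")["input_ids"])   # [59414, 8876], as in the report
print(tokenizer(" Hello world")["input_ids"])  # first id should differ once a space is prepended
```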

Who can help?

@arth

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run the two snippets above; they should encode to different results, since add_prefix_space=True is set.

Expected behavior

The two snippets should encode to different results, since add_prefix_space=True is set.
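
For comparison, GPT-2's fast tokenizer is a case where the same flag does change the encoding. A minimal sketch of the expected contrast, assuming the public gpt2 checkpoint:

```python
# Comparison sketch (assumes the public "gpt2" checkpoint); not part of the Bloom report.
from transformers import GPT2TokenizerFast

tok_default = GPT2TokenizerFast.from_pretrained("gpt2")
tok_prefix = GPT2TokenizerFast.from_pretrained("gpt2", add_prefix_space=True)

# With the flag set, the first word is encoded as " Hello" rather than "Hello",
# so the two input_ids lists differ in their first id.
print(tok_default("Hello world")["input_ids"])
print(tok_prefix("Hello world")["input_ids"])
```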

sgugger (Collaborator) commented Jul 17, 2023

cc @ArthurZucker and @younesbelkada

Dongximing (Author):

@younesbelkada @ArthurZucker

younesbelkada (Contributor):

cc @ArthurZucker, as he is more familiar with tokenizers than I am

ArthurZucker (Collaborator):

Hey! Thanks for opening this issue. This is half a tokenizers issue (even if you save the tokenizer and modify the tokenizer_config.json to set add_prefix_space=True in the pre_tokenizer, the outputs are the same) and half a transformers issue (setting add_prefix_space=False and then saving does not change the value that is saved!).

Will try to fix it 👍🏻
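
For context, the pre_tokenizer mentioned above can be inspected through the fast tokenizer's backend. A minimal inspection sketch (not the fix itself), assuming the usual transformers/tokenizers fast-tokenizer API:

```python
# Inspection sketch only, not a fix. Per the discussion above, Bloom's fast tokenizer
# carries a Sequence pre_tokenizer whose ByteLevel step holds the add_prefix_space flag.
from transformers import BloomTokenizerFast

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom")
backend = tokenizer.backend_tokenizer          # the underlying tokenizers.Tokenizer
print(backend.pre_tokenizer)                   # the configured pre_tokenizer(s)

# Running the pre_tokenizer on its own shows whether a leading space is injected.
print(backend.pre_tokenizer.pre_tokenize_str("Hello world"))
```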

ArthurZucker self-assigned this Jul 20, 2023
huggingface deleted a comment from github-actions bot Aug 16, 2023
ArthurZucker (Collaborator) commented Aug 16, 2023

Very nice catch, opening a fix right now! There is an issue with sequence pre_tokenizers!
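
For readers following along, here is a simplified stand-in for a Sequence pre_tokenizer with a ByteLevel step, which is where the add_prefix_space flag lives. This is a hypothetical configuration built with the tokenizers library, not Bloom's actual one:

```python
# Simplified stand-in, not Bloom's real pre_tokenizer: a Sequence whose ByteLevel
# step carries add_prefix_space. The bug discussed above concerns this flag not
# taking effect in some Sequence setups.
from tokenizers import pre_tokenizers

seq = pre_tokenizers.Sequence([
    pre_tokenizers.Punctuation(),                     # hypothetical first step
    pre_tokenizers.ByteLevel(add_prefix_space=True),  # the flag at issue
])

# If the flag is respected, the first word comes back space-prefixed ("ĠHello").
print(seq.pre_tokenize_str("Hello world"))
```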
