
bloom add_prefix_space= True #24846

Closed
Dongximing opened this issue Jul 17, 2023 · 5 comments · Fixed by #25563
Dongximing commented Jul 17, 2023

System Info

Hi,
I am using BloomTokenizerFast as my tokenizer and ran into an issue.

Version: 4.28.0

When I use BloomTokenizerFast, I find that add_prefix_space=True has no effect.
Here is the code:

```python
import transformers
from transformers import BloomTokenizerFast

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom", add_prefix_space=True)
print(tokenizer.add_prefix_space)             # True
print(tokenizer("Hello world")["input_ids"])  # [59414, 8876]
print(transformers.__version__)               # 4.28.0
```

And here is the other code:

```python
from transformers import BloomTokenizerFast

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom")
print(tokenizer("Hello world")["input_ids"])  # [59414, 8876]
```
I don't understand why they encode to the same result.

Please have a look!
Thanks
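
For illustration (a minimal sketch, not from the original report, assuming the same bigscience/bloom checkpoint): manually prepending the space, which is what add_prefix_space=True is expected to do, does change the encoding.

```python
# Illustration sketch only; assumes the same bigscience/bloom checkpoint as above.
from transformers import BloomTokenizerFast

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom")
print(tokenizer("Hello world")["input_ids"])   # [59414, 8876], as in the report
print(tokenizer(" Hello world")["input_ids"])  # first id should differ once a space is prepended
```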

Who can help?

@arth

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run the two snippets above; they should encode to different results, since add_prefix_space=True is set.

Expected behavior

The two snippets should encode to different results, since add_prefix_space=True is set.
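
For comparison, GPT-2's fast tokenizer is a case where the same flag does change the encoding. A minimal sketch of the expected contrast, assuming the public gpt2 checkpoint:

```python
# Comparison sketch (assumes the public "gpt2" checkpoint); not part of the Bloom report.
from transformers import GPT2TokenizerFast

tok_default = GPT2TokenizerFast.from_pretrained("gpt2")
tok_prefix = GPT2TokenizerFast.from_pretrained("gpt2", add_prefix_space=True)

# With the flag set, the first word is encoded as " Hello" rather than "Hello",
# so the two input_ids lists differ in their first id.
print(tok_default("Hello world")["input_ids"])
print(tok_prefix("Hello world")["input_ids"])
```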

sgugger (Collaborator) commented Jul 17, 2023

cc @ArthurZucker and @younesbelkada

Dongximing (Author):

@younesbelkada @ArthurZucker

younesbelkada (Contributor):

cc @ArthurZucker, as he is more familiar with tokenizers than I am

ArthurZucker (Collaborator):

Hey! Thanks for opening this issue. This is half a tokenizers issue (even if you save the tokenizer and modify the tokenizer_config.json to set add_prefix_space=True in the pre_tokenizer, the outputs are the same) and half a transformers issue (setting add_prefix_space=False and then saving does not change the value that is saved!).

Will try to fix it 👍🏻
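
For context, the pre_tokenizer mentioned above can be inspected through the fast tokenizer's backend. A minimal inspection sketch (not the fix itself), assuming the usual transformers/tokenizers fast-tokenizer API:

```python
# Inspection sketch only, not a fix. Per the discussion above, Bloom's fast tokenizer
# carries a Sequence pre_tokenizer whose ByteLevel step holds the add_prefix_space flag.
from transformers import BloomTokenizerFast

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom")
backend = tokenizer.backend_tokenizer          # the underlying tokenizers.Tokenizer
print(backend.pre_tokenizer)                   # the configured pre_tokenizer(s)

# Running the pre_tokenizer on its own shows whether a leading space is injected.
print(backend.pre_tokenizer.pre_tokenize_str("Hello world"))
```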

ArthurZucker self-assigned this Jul 20, 2023
huggingface deleted a comment from github-actions bot Aug 16, 2023
ArthurZucker (Collaborator) commented Aug 16, 2023

Very nice catch, opening a fix right now! There is an issue with sequence pre_tokenizers!
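
For readers following along, here is a simplified stand-in for a Sequence pre_tokenizer with a ByteLevel step, which is where the add_prefix_space flag lives. This is a hypothetical configuration built with the tokenizers library, not Bloom's actual one:

```python
# Simplified stand-in, not Bloom's real pre_tokenizer: a Sequence whose ByteLevel
# step carries add_prefix_space. The bug discussed above concerns this flag not
# taking effect in some Sequence setups.
from tokenizers import pre_tokenizers

seq = pre_tokenizers.Sequence([
    pre_tokenizers.Punctuation(),                     # hypothetical first step
    pre_tokenizers.ByteLevel(add_prefix_space=True),  # the flag at issue
])

# If the flag is respected, the first word comes back space-prefixed ("ĠHello").
print(seq.pre_tokenize_str("Hello world"))
```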
