bloom add_prefix_space= True #24846
Comments
cc @ArthurZucker and @younesbelkada
cc @ArthurZucker as he is more familiar than me regarding tokenizers
Hey! Thanks for opening this issue. This is half a Will try to fix it 👍🏻
Very nice catch, opening a fix right now! There is an issue with sequence pre_tokenizers!
System Info
Hi,
I am using BloomTokenizerFast as my tokenizer and ran into an issue.
transformers version: 4.28.0
When I use BloomTokenizerFast, passing add_prefix_space=True has no effect.
Here is the code:
import transformers
from transformers import BloomTokenizerFast

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom", add_prefix_space=True)
print(tokenizer.add_prefix_space)
print(tokenizer("Hello world")["input_ids"])
print(transformers.__version__)

Output:
True
[59414, 8876]
4.28.0
Here is the same code without add_prefix_space:
from transformers import BloomTokenizerFast

tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom")
print(tokenizer("Hello world")["input_ids"])

Output:
[59414, 8876]
I don't know why both calls produce the same encoding.
Please have a look!
Thanks
Who can help?
@arth
Reproduction
The two snippets should produce different encodings, since the first one sets add_prefix_space=True.
Expected behavior
With add_prefix_space=True, "Hello world" should be encoded differently than it is without the option.
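For context on why a different result is expected: in byte-level BPE tokenizers, add_prefix_space is documented as adding a space before the first word when there isn't one already, so that the first word is tokenized the same way as a word in the middle of a sentence. Below is a minimal pure-Python sketch of that pre-processing step. The helper name `apply_prefix_space` is hypothetical and exists only for illustration; it is not the transformers or tokenizers implementation.

```python
def apply_prefix_space(text: str, add_prefix_space: bool) -> str:
    """Hypothetical helper: mimic the documented effect of add_prefix_space.

    A space is prepended only when the option is enabled and the text
    does not already start with one.
    """
    if add_prefix_space and not text.startswith(" "):
        return " " + text
    return text


# With the option on, the tokenizer should see " Hello world";
# with it off, it should see "Hello world" unchanged.
with_prefix = apply_prefix_space("Hello world", add_prefix_space=True)
without_prefix = apply_prefix_space("Hello world", add_prefix_space=False)
```

Because the text handed to the BPE model differs between the two cases, the resulting input_ids would normally be expected to differ as well, which is not what the output above shows.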