Allow users to select/write encoding strategies #1655

Open
pietrolesci opened this issue Oct 16, 2024 · 2 comments

@pietrolesci

Hi there,

Do you plan to add the possibility to control how tokenizers behave at inference time?

For example, the user could choose between standard BPE (merge-based) inference and, e.g., the longest-prefix encoding strategy. See "Greed is All You Need: An Evaluation of Tokenizer Inference Methods" for why this can be useful.

Thanks in advance for your time!

Best,
Pietro


Example. Consider a BPE tokenizer with merges M = {yu, yum, my} and initial alphabet A = {y, u, m}. Given the string s = yummy, the standard BPE merge-based strategy tokenizes s as yu | m | my while BPE with the longest prefix encoding strategy tokenizes s as yum | my.
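
To make the difference concrete, here is a minimal sketch of the two inference strategies in plain Python. It is not the `tokenizers` implementation. Whether the two strategies diverge depends on how each merge is decomposed into a pair and on merge priority, which the example above leaves unspecified. The sketch therefore assumes a hypothetical ordered merge list (y+u, m+y, u+m, y+um), which adds a token `um` not listed above, so that the outputs match the ones stated. The function names are illustrative only.

```python
from typing import List, Sequence, Set, Tuple

Merge = Tuple[str, str]


def bpe_merge_encode(word: str, merges: Sequence[Merge]) -> List[str]:
    """Merge-based BPE inference: repeatedly apply the best-ranked applicable merge."""
    rank = {pair: i for i, pair in enumerate(merges)}  # lower index = higher priority
    tokens = list(word)
    while len(tokens) > 1:
        # Collect every adjacent pair that has a merge rule, with its rank and position.
        candidates = [
            (rank[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in rank
        ]
        if not candidates:
            break
        _, i = min(candidates)  # best rank first, leftmost position on ties
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


def longest_prefix_encode(word: str, vocab: Set[str]) -> List[str]:
    """Greedy inference: at each position, take the longest vocabulary entry that matches."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            raise ValueError(f"no vocabulary entry matches at position {start}")
        tokens.append(word[start:end])
        start = end
    return tokens


# Assumed (hypothetical) ordered merges; the issue only gives the unordered set {yu, yum, my}.
merges = [("y", "u"), ("m", "y"), ("u", "m"), ("y", "um")]
vocab = {"y", "u", "m"} | {a + b for a, b in merges}

print(bpe_merge_encode("yummy", merges))     # ['yu', 'm', 'my']
print(longest_prefix_encode("yummy", vocab))  # ['yum', 'my']
```

Under a different decomposition, e.g. y+u, yu+m, m+y in that order, both functions return `['yum', 'my']`, so the choice of merge rules matters for the comparison.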

@ArthurZucker
Collaborator

Hey! If the community asks for it, for sure! 🤗 I think it would still be quite hard to make it super efficient (changing it would take some time).

@pietrolesci
Author

Hi @ArthurZucker,
Thanks a lot for your swift reply!
I think it would be super useful, especially for research purposes. Perhaps the simplest thing would be to allow BPE tokenizers to behave like WordPiece at inference time. In the same way users can assign, e.g., pre_tokenizers to a tokenizer class, they could in principle pass, e.g., a predictor too. What do you think?
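
As a rough illustration of what a pluggable predictor could look like, here is a toy sketch in plain Python. Everything in it (`Predictor`, `ToyTokenizer`, `longest_prefix`, `characters`) is hypothetical and made up for this comment; none of it is an existing `tokenizers` API.

```python
from typing import Callable, List, Set

# Hypothetical interface: an inference strategy maps (word, vocab) to a list of tokens.
Predictor = Callable[[str, Set[str]], List[str]]


def longest_prefix(word: str, vocab: Set[str]) -> List[str]:
    """WordPiece-like greedy matching: take the longest vocabulary entry at each position."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            raise ValueError(f"no vocabulary entry matches at position {start}")
        tokens.append(word[start:end])
        start = end
    return tokens


def characters(word: str, vocab: Set[str]) -> List[str]:
    """Trivial alternative strategy: one token per character (ignores the vocabulary)."""
    return list(word)


class ToyTokenizer:
    """Toy tokenizer whose inference strategy is an exchangeable component."""

    def __init__(self, vocab: Set[str], predictor: Predictor) -> None:
        self.vocab = vocab
        self.predictor = predictor

    def encode(self, text: str) -> List[str]:
        # Whitespace splitting stands in for a real pre-tokenizer.
        return [tok for word in text.split() for tok in self.predictor(word, self.vocab)]


tok = ToyTokenizer({"y", "u", "m", "yu", "yum", "my"}, predictor=longest_prefix)
print(tok.encode("yummy yum"))  # ['yum', 'my', 'yum']

tok.predictor = characters  # swap the strategy without touching the vocabulary
print(tok.encode("yum"))    # ['y', 'u', 'm']
```

The point is only that the vocabulary and the inference strategy are decoupled, so the same trained vocabulary can be evaluated under several strategies.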
