Exception: Custom PreTokenizer cannot be serialized #613

Closed
carter54 opened this issue Jan 30, 2021 · 3 comments

@carter54

Hello~ I'm trying to train a BPE tokenizer with a custom pre_tokenizer.
The custom pre_tokenizer uses a third-party package, similar to what is shown in

After training the tokenizer, I tried to use

tokenizer.save(tokenizer_path)

to save the tokenizer, but an Exception appeared:

Exception: Custom PreTokenizer cannot be serialized

As I understand it, a custom pre_tokenizer cannot be saved together with the main tokenizer model, so I should save the main model separately and, when loading the tokenizer, attach the pre_tokenizer again manually. Am I right?

@n1t0
Member

n1t0 commented Jan 30, 2021

Yes, you are right, though this is something that we'd like to support in the future.

In the meantime you can either:

  • save the model and then load everything back manually. If you don't have a complicated tokenizer with many special tokens and components, this might be well suited.
  • use a "placeholder" PreTokenizer before saving your tokenizer, and replace it with your custom one after loading it back.
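
For anyone landing here, a minimal sketch of the second option (the "placeholder" approach) might look like the following. It makes several assumptions that are not from this thread: `DashPreTokenizer` is a hypothetical stand-in for the custom pre_tokenizer, a tiny `WordLevel` vocabulary replaces the trained BPE model so the snippet runs without a training corpus, and `Whitespace` is an arbitrary choice of placeholder.

```python
import os
import tempfile
from typing import List

from tokenizers import NormalizedString, PreTokenizedString, Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import PreTokenizer, Whitespace


class DashPreTokenizer:
    """Hypothetical custom pre-tokenizer: splits on '-' and drops the dashes."""

    def _split(self, i: int, normalized: NormalizedString) -> List[NormalizedString]:
        text = str(normalized)
        pieces, start = [], 0
        for pos, ch in enumerate(text):
            if ch == "-":
                if pos > start:
                    # Slicing keeps the offsets tied to the original text
                    pieces.append(normalized[start:pos])
                start = pos + 1
        if start < len(text):
            pieces.append(normalized[start:])
        return pieces

    def pre_tokenize(self, pretok: PreTokenizedString) -> None:
        pretok.split(self._split)


# Stand-in for the trained tokenizer (assumption: a toy WordLevel model, not BPE)
vocab = {"[UNK]": 0, "foo": 1, "bar": 2}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = PreTokenizer.custom(DashPreTokenizer())

# Before saving: swap in a serializable placeholder pre-tokenizer
path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
tokenizer.pre_tokenizer = Whitespace()
tokenizer.save(path)

# After loading: put the custom pre-tokenizer back
loaded = Tokenizer.from_file(path)
loaded.pre_tokenizer = PreTokenizer.custom(DashPreTokenizer())

tokens = loaded.encode("foo-bar").tokens
print(tokens)
```

Note that the saved file then describes the placeholder rather than the custom component, so any code that loads it has to re-attach the custom pre_tokenizer before encoding.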

@n1t0
Member

n1t0 commented Feb 3, 2021

Closing this as it is a duplicate of #581.

n1t0 closed this as completed Feb 3, 2021
@carter54
Author

carter54 commented Feb 4, 2021

I see @n1t0 thx for the answer~
