Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save tokenizer in conversion script #128

Merged
merged 4 commits into from
Oct 7, 2021
Merged

Save tokenizer in conversion script #128

merged 4 commits into from
Oct 7, 2021

Conversation

jaketae
Copy link
Member

@jaketae jaketae commented Oct 7, 2021

This PR implements the following:

  • Accept tokenizer_type and tokenizer_name_or_path as conversion script arguments
  • Save pretrained tokenizers alongside the converted model

@jaketae jaketae linked an issue Oct 7, 2021 that may be closed by this pull request
@jaketae jaketae requested a review from stas00 October 7, 2021 06:39
@stas00
Copy link
Contributor

stas00 commented Oct 7, 2021

Thank you for working on it, @jaketae

FYI, in the future you can add:

Fixes: #126

to the OP, and it'll automatically close the Issue when PR is merged.


oh, sorry, perhaps my spec wasn't clear enough. The args are already in the checkpoint's args:

ds_args = ds_checkpoint.get_args()

No need to do it 2nd time.


Additionally to the original spec

it looks like we also need to set:

tokenizer_class in the config object, it should correspond to the HF Transformers tokenizer class. It's probably safe to use the fast tokenizers, so it'd be one of GPT2TokenizerFast and T5TokenizerFast:

Please see huggingface/transformers#13906 for context

@jaketae
Copy link
Member Author

jaketae commented Oct 7, 2021

@stas00 Thanks for the feedback!

Would hard-coding the tokenizer class name be preferable over something like type(tokenizer).__name__ (or some other way of getting the class name from a Python object)? I'm wondering how robust the assumption is that we would always be using one of the two tokenizers.


I used to put "Fixes X", until I realized that one can explicitly link an issue to a PR. I think it achieves the same thing, namely closing the issue when a PR is merged. But I'll keep that in mind.

@stas00
Copy link
Contributor

stas00 commented Oct 7, 2021

I used to put "Fixes X", until I realized that one can explicitly link an issue to a PR. I think it achieves the same thing, namely closing the issue when a PR is merged. But I'll keep that in mind.

Oh, you did it manually, I see. I missed that. I guess this is just a convention we use at HF, so the reviewers quickly see which issue(s) it's resolving. I can now see the linked issue in the right bar. So either way works, it's just further away from the OP and not always immediately obvious.

I myself always start a PR with something like:

"This PR is addressing issue #xxx , "

but that's just my personal convention.

Would hard-coding the tokenizer class name be preferable over something like type(tokenizer).__name__ (or some other way of getting the class name from a Python object)? I'm wondering how robust the assumption is that we would always be using one of the two tokenizers.

That's even better - I forgot that we were creating the corresponding tokenizer anyway to get its files, so by all means yes - your proposal is great!

Thank you, @jaketae

@jaketae
Copy link
Member Author

jaketae commented Oct 7, 2021

@stas00 Thanks for the feedback! I've updated the code so that it saves the tokenizer_class field in config.json. The class name is dynamically fetched as proposed in my previous comment.

I'll make sure to link relevant issues more explicitly in future PRs. Thank you!

Copy link
Contributor

@stas00 stas00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic, thank you, @jaketae

Just a small suggestion inside.

The test will be part of #121 once that is merged.

tools/convert_checkpoint/deepspeed_to_transformers.py Outdated Show resolved Hide resolved
@stas00
Copy link
Contributor

stas00 commented Oct 7, 2021

The following comment is for a new Issue/PR:

Since you started to use auto-formatter - let's add the config you used to something we all can run automatically and use the same setup - I highly recommend replicating HF transformers setup, since it's already done and has been thought through.

again, we can do that only for the test suite, since the main code needs to remain in the same format, in order for us to be able to easily sync with Megatron-LM and Megatron-Deepspeed original trees.

@jaketae jaketae merged commit 23dded0 into main Oct 7, 2021
@jaketae jaketae deleted the feat-save-tokenizer branch October 7, 2021 19:38
@jaketae
Copy link
Member Author

jaketae commented Oct 7, 2021

@stas00 Sounds great! I think HF uses black and isort. So maybe I'll just add a few line to the Makefile to make sure that test and tool directories are correctly formatted. I'll open an issue and a corresponding PR in the next day or two. Thanks!

@stas00
Copy link
Contributor

stas00 commented Oct 7, 2021

it does, but I meant to copy the specific config both use:

  • black's config is in pyproject.toml
  • isort's is in setup.cfg
  • and most importantly we have to require the same minimum versions of each, (best to sync with transformers) - since different versions lead to very different outcomes even with the same config, so us developers need to be in sync.

@stas00
Copy link
Contributor

stas00 commented Oct 7, 2021

tools is in Megatron-LM, so not sure about that... and Meg-DS original uses it too.. so best to leave it it alone.

@jaketae
Copy link
Member Author

jaketae commented Oct 7, 2021

tools is in Megatron-LM, so not sure about that... and Meg-DS original uses it too.. so best to leave it it alone.

Wasn't aware of that, I'll make sure to remove them from the formatted directories. I was thinking maybe the conversation directory could be included.

I'll also make sure to check the formatter configuration files. Thanks for the heads up!

@stas00
Copy link
Contributor

stas00 commented Oct 7, 2021

I was thinking maybe the conversation directory could be included.

Sure and then we can ask Meg-DS original to sync with ours. Let's just not forget that if we reformat Tunji's files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

auto-add tokenizer files to the converted model
2 participants