Save tokenizer in conversion script #128
Thank you for working on it, @jaketae. FYI, in the future you can add "Fixes: #126" to the OP, and it will automatically close the issue when the PR is merged.
Oh, sorry, perhaps my spec wasn't clear enough. The args are already in the checkpoint's
No need to do it a second time. In addition to the original spec, it looks like we also need to set:
Please see huggingface/transformers#13906 for context.
@stas00 Thanks for the feedback! Would hard-coding the tokenizer class name be preferable over something like

I used to put "Fixes X", until I realized that one can explicitly link an issue to a PR. I think it achieves the same thing, namely closing the issue when a PR is merged. But I'll keep that in mind.
Oh, you did it manually, I see. I missed that. I guess this is just a convention we use at HF, so that reviewers quickly see which issue(s) a PR is resolving. I can now see the linked issue in the right bar. So either way works; it's just further away from the OP and not always immediately obvious. I myself always start a PR with something like "This PR is addressing issue #xxx", but that's just my personal convention.
That's even better. I forgot that we were creating the corresponding tokenizer anyway to get its files, so by all means yes, your proposal is great! Thank you, @jaketae
@stas00 Thanks for the feedback! I've updated the code so that it saves the

I'll make sure to link relevant issues more explicitly in future PRs. Thank you!
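The exchange above contrasts hard-coding the tokenizer class name with deriving it from the tokenizer object the script already creates. A minimal sketch of the second option follows. It is an illustration under stated assumptions, not the code from this PR: `save_tokenizer_and_class` and `DummyTokenizer` are hypothetical names, and writing the class name under a `tokenizer_class` key in `config.json` is an assumption about what the conversion script records. The only tokenizer method relied on is `save_pretrained`, which Hugging Face tokenizers provide.

```python
import json
import os


class DummyTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer (only save_pretrained is mimicked)."""

    def save_pretrained(self, save_directory):
        # Write a placeholder tokenizer file, as a real tokenizer would write its own files.
        os.makedirs(save_directory, exist_ok=True)
        path = os.path.join(save_directory, "tokenizer_config.json")
        with open(path, "w") as f:
            json.dump({}, f)
        return (path,)


def save_tokenizer_and_class(tokenizer, config, output_dir):
    """Save the tokenizer files and record the tokenizer's class name in the config.

    Assumption: the config is a plain dict written to config.json, and the
    class name belongs under the "tokenizer_class" key.
    """
    # Derive the class name from the object instead of hard-coding a string.
    config["tokenizer_class"] = type(tokenizer).__name__
    tokenizer.save_pretrained(output_dir)
    with open(os.path.join(output_dir, "config.json"), "w") as f:
        json.dump(config, f, indent=2)
    return config
```

Deriving the name with `type(tokenizer).__name__` keeps the saved config in step with whichever tokenizer was actually instantiated, which is the advantage over a hard-coded string.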
The following comment is for a new Issue/PR: Since you started to use an auto-formatter, let's add the config you used to something we all can run automatically, so that everyone uses the same setup. I highly recommend replicating the HF transformers setup, since it's already done and has been thought through. Again, we can do that only for the test suite, since the main code needs to remain in the same format in order for us to be able to easily sync with the original Megatron-LM and Megatron-Deepspeed trees.
Co-authored-by: Stas Bekman <[email protected]>
@stas00 Sounds great! I think HF uses
It does, but I meant to copy the specific config both use:
Wasn't aware of that, I'll make sure to remove them from the formatted directories. I was thinking maybe the conversation directory could be included. I'll also make sure to check the formatter configuration files. Thanks for the heads up!
Sure, and then we can ask the original Meg-DS to sync with ours. Let's just not forget that if we reformat Tunji's files.
This PR implements the following:
tokenizer_type and tokenizer_name_or_path as conversion script arguments
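As a rough illustration of the PR description, the two arguments could be exposed through `argparse` along these lines. This is a sketch under assumptions, not the script's actual code: the flag spellings `--tokenizer_type` and `--tokenizer_name_or_path`, their defaults, and the help strings are hypothetical, and `build_parser` and `load_tokenizer` are names introduced only for this example. `AutoTokenizer.from_pretrained` is the standard Hugging Face loader; it is imported lazily so that argument parsing works without `transformers` installed.

```python
import argparse


def build_parser():
    """Build a parser with the two tokenizer arguments (flag names are assumptions)."""
    parser = argparse.ArgumentParser(description="Checkpoint conversion sketch")
    parser.add_argument(
        "--tokenizer_type",
        type=str,
        default=None,
        help="Tokenizer type used during training (hypothetical help text)",
    )
    parser.add_argument(
        "--tokenizer_name_or_path",
        type=str,
        default=None,
        help="Hub name or local path of the tokenizer to save (hypothetical help text)",
    )
    return parser


def load_tokenizer(args):
    """Load the tokenizer named on the command line.

    Requires the transformers package and, for hub names, network access.
    """
    from transformers import AutoTokenizer  # lazy import keeps parsing dependency-free

    return AutoTokenizer.from_pretrained(args.tokenizer_name_or_path)


if __name__ == "__main__":
    # Example invocation with illustrative values.
    args = build_parser().parse_args(["--tokenizer_name_or_path", "gpt2"])
    print(args.tokenizer_type, args.tokenizer_name_or_path)
```

Note that the review comments point out that these values are already stored with the checkpoint, so command-line flags like these may be redundant in the final version of the script.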