add kill switch file support to gracefully exit training at runtime #412
megatron/arguments.py
Outdated
@@ -678,6 +679,9 @@ def _add_network_size_args(parser):
help='Untie embeddings and output weights.'),
group.add_argument('--embedding-weights-in-fp32', action='store_true',
help='Cast word embedding weights to fp32 before embedding fwd.'),
group.add_argument('--kill-switch-path', type=str, default=None,
help='Path to look for a kill switch. '
@polisettyvarma, is the kill switch meant to be a file or a folder, or both? Can this be clarified in the help message?
@tjruwase, can you please review this?
@polisettyvarma, the PR looks good to me overall. My question is about usage: it is currently unclear to me whether a kill switch is a file or a folder. Can you please update the docs to clarify this?
@tjruwase, I have changed the code; please have a look again. What do you mean by docs here?
Thanks for making the changes. Looks great. Sorry for the confusion; by docs I meant the help message.
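For readers following the file-versus-folder question, here is a minimal sketch of how a kill switch file might be checked during training. It is not the code from this PR: only the `--kill-switch-path` flag name comes from the diff above, while the help wording and the names `build_parser`, `kill_switch_triggered`, and `train` are hypothetical. It assumes the kill switch is a regular file whose existence signals that training should stop.

```python
import argparse
import os


def build_parser():
    """Hypothetical stand-alone parser; only the flag name matches the diff."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--kill-switch-path', type=str, default=None,
                        help='Path to a kill switch file. If a file exists '
                             'at this path, training exits gracefully.')
    return parser


def kill_switch_triggered(path):
    """Return True only if a path was given and a regular file exists there.

    os.path.isfile returns False for directories, so a folder at the
    same path does not trigger the switch (an assumption of this sketch).
    """
    return path is not None and os.path.isfile(path)


def train(args, max_steps=5):
    """Toy loop: check the switch before each step, return steps completed."""
    for step in range(max_steps):
        if kill_switch_triggered(args.kill_switch_path):
            # A real trainer would likely save a checkpoint here before exiting.
            return step
        # Placeholder for the actual training step.
    return max_steps
```

With this sketch, creating the file while the job is running (for example with `touch` on the chosen path) makes the loop stop at the next check. In a multi-rank job, all ranks would need to observe the same result, for instance by using a path on a shared filesystem; that detail is left out here.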
No description provided.