-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate global step with progress tracking #11805
Conversation
a5c9580
to
70e64f9
Compare
70e64f9
to
6356ef3
Compare
ca4c65d
to
f0dfd23
Compare
0e235fe
to
7afb814
Compare
f0dfd23
to
e218bfc
Compare
e218bfc
to
87bf779
Compare
87bf779
to
dec43b9
Compare
b543100
to
dc28ad0
Compare
4843cda
to
d95f2b7
Compare
5afe552
to
124e9ce
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work !
Hi @carmocca, it seems that Can you help us understand under what circumstances |
I think |
Hi @carmocca, this change broke many Meta's internal use cases' tests. To summarize how it's affecting us:
I wonder:
|
@yifuwang After reading your quote:
I'm updating the docs for that argument #12332. The description is wrong, the behavior is different in general. With default checkpointing behavior, the statement is still true though. Have you found a case where it is not? |
Thanks for the prompt response @awaelchli! Please see the example provided by @jjenniferdai above for how the behavior has differed after this change:
|
Before this PR, with all the manual decrements and increments to As Adrian mentioned, the existing implementation meant the doc were not strictly correct. One could avoid the issues you are seeing by instantiating a separate Following this structure would also make sense if one wants to checkpoint
This might not have been intended. Can you elaborate on what's the new observed and expected behaviour? |
I think users would still expect to save |
@carmocca for the following snippet, a
The following contract was respected prior to this PR, but not anymore after:
It is a BC-breaking change that changes a behavior some users rely on, regardless of whether it is believed to be a "bug". |
Yes, I'm fine restoring the previous behaviour for |
I think it can cause confusion, especially when using the default filename ModelCheckpoint(filename="{epoch}-{step}"), where epoch numbers from 0 and step numbers from 1. See #16636 Also, it is too counter-intuitive to number something from 1 instead of 0 in python. I can't find any explicit statement about this numbering stragtegy in the documentation ModelCheckpoint. |
What does this PR do?
trainer.global_step
now returns the progress tracking's optimizer step count.ModelCheckpoint
and other checkpoint logic now use this attribute.trainer.fit_loop.epoch_loop.global_step
counter is now renamed totrainer.fit_loop.epoch_loop._batches_that_stepped
, a name that matches its actual behaviour.Fixes #7406
Does your PR introduce any breaking changes? If yes, please list them.
trainer.global_step
during an intra-training validation hook will now correctly return the number of optimizer steps taken already. That would benew_global_step == master_global_step + 1
. In pseudocodeBefore:
Now:
Saved checkpoints that use the global step value as part of the filename are now increased by 1 for the reason in the bullet before.
If users were using TBPTT or multiple optimizes, the
trainer.global_step
value will account for those and be different from the value in current master.The
Trainer
arguments{min,max}_steps
compare with the newglobal_step
value so they suffer from the same breaking changes. In the case of multiple optimizers or TBPTT users will need to adjust them.Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃
cc @tchaton @rohitgr7 @akihironitta @awaelchli @ananthsub @ninginthecloud @Borda @carmocca