Ensure sync across val/test step when using DDP #371
Conversation
Codecov Report
@@           Coverage Diff           @@
##           master     #371   +/-   ##
=======================================
  Coverage   82.00%   82.00%
=======================================
  Files         100      100
  Lines        5639     5639
=======================================
  Hits         4624     4624
  Misses       1015     1015
To reduce confusion I can also remove the returned loss from the validation/test steps, since we're explicitly logging within the function. Let me know if this works @ananyahjha93
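For illustration, a minimal sketch of what logging inside the step (with no return value) might look like; the model and metric names here are placeholders, not code from this PR:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.layer(x.view(x.size(0), -1))

    def test_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        # Log inside the step instead of returning the loss;
        # sync_dist=True reduces the value across DDP processes.
        self.log('test_loss', loss, sync_dist=True)
```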
LGTM, @ananyahjha93 mind testing it?
@ananyahjha93 and I spoke earlier, need to clean this up a bit. We have both class metrics and normal metrics being used, and we're logging the val loss. This means the sync_dist call doesn't need to be added for the class metric:

    self.log('val_loss', loss, prog_bar=True, sync_dist=True)  # requires sync_dist because it's not a class metric
    self.log('val_acc', self.val_acc)  # class metric handles sync for us

I'm not sure what the resolution here is. I've tried to rally for setting sync_dist=True as the default, but this has unexpected side effects (what if you want to set the val loss reduction to sum but you forgot?)
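As a hedged sketch of how the two kinds of metrics might sit side by side in a module (the `torchmetrics.Accuracy` setup is an assumption for illustration, not code from this PR):

```python
import torch
import torch.nn.functional as F
import torchmetrics
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.layer = torch.nn.Linear(28 * 28, num_classes)
        # Class metric: accumulates state and handles the
        # cross-process reduction itself under DDP.
        self.val_acc = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes)

    def forward(self, x):
        return self.layer(x.view(x.size(0), -1))

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        # Plain tensor: needs sync_dist=True to be reduced across processes.
        self.log('val_loss', loss, prog_bar=True, sync_dist=True)
        self.val_acc(logits, y)
        # Class metric: no sync_dist needed, the metric syncs for us.
        self.log('val_acc', self.val_acc)
```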
Note this came up in Lightning-AI/pytorch-lightning#4323 too
Hey @ananthsub @Borda, what do you think about adding a warning in self.log if this case comes up? Is there a way to make it a one-time warning? I.e. if sync_dist is False for a logged metric while running distributed, raise a warning?
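One generic way such a one-time warning could be implemented (a sketch of the idea only, not Lightning's actual mechanism; in practice a utility like `rank_zero_warn` would likely be used):

```python
import warnings

_warned_sync_dist = False


def maybe_warn_missing_sync(metric_name: str, sync_dist: bool, is_distributed: bool) -> None:
    """Warn once if a metric is logged without sync_dist while running distributed."""
    global _warned_sync_dist
    if is_distributed and not sync_dist and not _warned_sync_dist:
        warnings.warn(
            f"'{metric_name}' was logged with sync_dist=False while running in "
            "distributed mode; its value may differ across processes.",
            UserWarning,
        )
        _warned_sync_dist = True
```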
What does this PR do?
Fixes Lightning-AI/pytorch-lightning#4693
Lightning's behaviour is that if no `monitor` key is passed to the checkpoint callback, it falls back to `val_loss` or `checkpoint_on` as the key used to save the top-k models. This means we have to ensure this value is reduced correctly across all processes, so that they agree when tracking the best models.
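As a hedged illustration of why the synced value matters for checkpointing (the callback and Trainer arguments below are assumptions for the example, not taken from this PR):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# If 'val_loss' is not reduced across DDP processes (sync_dist=True when
# logging it), each rank may track a different "best" checkpoint.
checkpoint_cb = ModelCheckpoint(monitor='val_loss', mode='min', save_top_k=3)

trainer = pl.Trainer(
    callbacks=[checkpoint_cb],
    accelerator='gpu',
    devices=2,
    strategy='ddp',
)
```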
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃