Add support for logging in different trainer stages with DeviceStatsMonitor #15794
Comments
Hi @thesofakillers
Hi, I'd like to work on this. I'm new to this library, but I'm currently reading through everything related: the Trainer run function, the train/eval/predict Loops and EpochLoops, and the logger connector.
Currently, for 'fit' runs, DeviceStatsMonitor only logs every n steps, as defined by the Trainer's 'log_every_n_steps' argument. How should we decide how often to log for 'test' (i.e. 'eval') runs? With the same 'log_every_n_steps' value, or something else?
To have a base to start from, here is a fork where I enabled DeviceStatsMonitor logging for eval runs: f37d373. Test code used:
I log at the same every-N-steps interval as fit runs. However, the fit epoch loop keeps track of _batches_that_stepped, while the eval epoch loop does not. As far as I can tell (not certain), the eval epoch loop's batch_progress.total.completed tracks the same thing. Clarifications, comments, and instructions are welcome! Once I know what to do, I will continue, add support for logging in the predict loop, and eventually send a PR.
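To make the question concrete, here is a minimal, library-independent sketch of the "log every N steps" gating applied to an eval-style batch counter. It is not Lightning's implementation: `EvalStepCounter`, `should_log_step`, and `logged_steps` are hypothetical names, and the counter only stands in for something like batch_progress.total.completed under the assumption that it increments once per completed batch.

```python
def should_log_step(batches_completed: int, log_every_n_steps: int) -> bool:
    """Return True when the completed-batch count lands on a logging boundary."""
    if log_every_n_steps <= 0:
        # Assumption for this sketch: a non-positive interval disables logging.
        return False
    return batches_completed % log_every_n_steps == 0


class EvalStepCounter:
    """Hypothetical stand-in for an eval loop's completed-batch counter."""

    def __init__(self) -> None:
        self.completed = 0

    def advance(self) -> int:
        # Called once after each eval batch finishes.
        self.completed += 1
        return self.completed


def logged_steps(num_batches: int, log_every_n_steps: int) -> list:
    """Simulate an eval run and collect the steps at which stats would be logged."""
    counter = EvalStepCounter()
    steps = []
    for _ in range(num_batches):
        step = counter.advance()
        if should_log_step(step, log_every_n_steps):
            steps.append(step)
    return steps


if __name__ == "__main__":
    print(logged_steps(num_batches=10, log_every_n_steps=3))
```

Reusing the same interval for eval runs keeps the behaviour consistent with fit runs, but whether the real eval counter has the same semantics as _batches_that_stepped still needs confirmation from the maintainers.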
not stale
Bug description
I would like to use DeviceStatsMonitor during a trainer.test() call. I followed the relevant documentation, which makes no mention of whether this callback is exclusive to trainer.fit().
Despite following the docs, I get no device stats logs in my TensorBoard.
How to reproduce the bug
Run the following script. You will see that no stats are logged, despite the DeviceStatsMonitor callback being attached.
Environment
More info
I have verified this on both GPU and CPU. The example above uses CPU.
cc @Borda @awaelchli