
Where and when should device availability checks happen? #11831

Closed

ananthsub opened this issue Feb 9, 2022 · 3 comments
Labels: accelerator, design (Includes a design discussion), won't fix (This will not be worked on)

Comments


ananthsub commented Feb 9, 2022

This came up in discussions around #11818, #11797, #11798, #11799

Currently, device checks happen inside of the accelerator connector during trainer initialization.
https://github.com/PyTorchLightning/pytorch-lightning/blob/a2d8c4f6a6080234e47ccc5ad593912303d29bf9/pytorch_lightning/trainer/connectors/accelerator_connector.py#L197-L229
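For illustration, a connector-level check of this kind could be sketched as follows. All names here are hypothetical (they are not the actual PyTorch Lightning implementation), and `device_count_fn` stands in for a real hardware probe such as `torch.cuda.device_count()`:

```python
# Hypothetical sketch of an availability check living in the connector,
# i.e. outside the Accelerator classes themselves.
class AcceleratorConnector:
    def __init__(self, accelerator: str, device_count_fn=lambda: 0):
        # `device_count_fn` stands in for e.g. torch.cuda.device_count(),
        # injected here so the sketch runs on any machine.
        if accelerator == "gpu" and device_count_fn() == 0:
            raise RuntimeError(
                "You requested GPUs, but none are available on this machine."
            )
        self.accelerator = accelerator
```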

Discussion 1:

Should the per-device check move out of the Trainer to each Accelerator?

Pros for Trainer:
???

Pros for Accelerator:

  • Logic is better encapsulated
  • More extensible for new hardware without requiring changes to the Trainer
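One way the encapsulation could look, sketched under assumed names (this `is_available` hook is an illustration, not taken from the linked code):

```python
from abc import ABC, abstractmethod

class Accelerator(ABC):
    """Hypothetical base class: each accelerator owns its availability check."""

    @staticmethod
    @abstractmethod
    def is_available() -> bool:
        ...

class CPUAccelerator(Accelerator):
    @staticmethod
    def is_available() -> bool:
        return True  # a CPU is always present

class GPUAccelerator(Accelerator):
    @staticmethod
    def is_available() -> bool:
        # Stand-in for a real torch.cuda.is_available() call; hard-coded
        # here so the sketch runs without GPU hardware.
        return False
```

A new hardware backend would then ship its own subclass, and the Trainer would not need to enumerate device types itself.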

Discussion 2:

Assuming device checks happen inside of the Accelerator class, should runtime checks happen by default? Or should each accelerator determine when to assert availability?

Pros for default checks:

  • Additional safety independent of strategy logic

Pros for each accelerator determining this on their own:

  • More flexibility around when the check is invoked

Discussion 3:

Assuming device checks happen automatically inside of the Accelerator class, should these happen at initialization or during setup_environment as the first thing to happen during the Trainer's runtime?

Pros for setup environment:

  • Mimics torch device experience
    One can create torch.device("cuda") on a host without GPUs. However, moving a tensor to this device would fail because CUDA is unavailable. The corollary here would be the ability to create a GPUAccelerator on a host without GPUs. But calling GPUAccelerator.setup_environment would fail.
  • Easier for testing other parts of these classes (doesn't require mocking device availability)
  • Instantiating the Accelerator class doesn't imply that the model is actually on the device. It's simply an intent of what hardware the model should be trained with

Pros for constructor:

  • Fails faster
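The two timings in Discussion 3 can be contrasted in a small sketch (all names hypothetical; `available_fn` stands in for a real hardware probe such as `torch.cuda.is_available()`):

```python
class GPUAccelerator:
    """Contrast an eager (constructor) check with a lazy (setup-time) check."""

    def __init__(self, available_fn=lambda: False, eager: bool = False):
        self._available_fn = available_fn
        if eager and not available_fn():
            # Constructor variant: fails as early as possible.
            raise RuntimeError("No GPUs found (checked in __init__).")

    def setup_environment(self) -> None:
        # Lazy variant: like torch.device("cuda") on a CPU-only host,
        # construction succeeds but the first real use fails.
        if not self._available_fn():
            raise RuntimeError("No GPUs found (checked at setup).")
```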

Originally asked by @rohitgr7 in #11797 (comment)

cc @tchaton @justusschock @awaelchli @Borda @akihironitta @rohitgr7 @four4fish

ananthsub added the accelerator and design labels on Feb 9, 2022
@awaelchli
Contributor

IMO the check should not go into the accelerator __init__. It's better to have it in the accelerator connector, because that way we can build better error messages. For example, we can suggest a better option based on the hardware that is available, which would not fit into the responsibility of the accelerator itself. For the same reason, the compatibility checks for strategies live not in the strategy but in the AcceleratorConnector.

@rohitgr7
Contributor

I am fine with keeping it inside the accelerator connector. I guess this way it will cover most of the pros and will fail faster too.

@stale

stale bot commented Mar 13, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!

The stale bot added the won't fix label on Mar 13, 2022, and closed the issue on Apr 19, 2022.

3 participants