
Support ddp_fork strategy with native AMP by attempting NVML-based CUDA availability assessment #14981

Closed
speediedan opened this issue Oct 3, 2022 · 0 comments · Fixed by #14984
Labels
bug Something isn't working precision: amp Automatic Mixed Precision strategy: ddp DistributedDataParallel

Comments

@speediedan
Contributor

speediedan commented Oct 3, 2022

🚀 Feature

ddp_fork (and its alias strategies) cannot currently be used together with native AMP because constructing the GradScaler in the NativeMixedPrecisionPlugin invokes the CUDA Runtime API:

https://github.com/Lightning-AI/lightning/blob/c059db446e7bfea03fba91e598ad503f0d1c6581/src/pytorch_lightning/plugins/precision/native_amp.py#L53

which in turn initializes CUDA and poisons subsequent forks.
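For illustration, a minimal sketch of the failure mode (this assumes a machine with at least two CUDA GPUs; the exact error text may vary by PyTorch version):

```python
import torch
import torch.multiprocessing as mp


def _child(rank: int) -> None:
    # The forked child inherits the parent's already-initialized CUDA state,
    # so any CUDA call here fails with something like
    # "RuntimeError: Cannot re-initialize CUDA in forked subprocess".
    torch.zeros(1, device=f"cuda:{rank}")


if __name__ == "__main__":
    # Constructing GradScaler calls torch.cuda.is_available(), which goes
    # through the CUDA Runtime API and initializes CUDA in the parent process...
    scaler = torch.cuda.amp.GradScaler()
    # ...poisoning the subsequent forks.
    mp.start_processes(_child, nprocs=2, start_method="fork")
```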

A future version of PyTorch may change the default behavior of torch.cuda.is_available() to use an NVML-based CUDA assessment, which Lightning could then rely on everywhere. In the meantime, patching torch.cuda.is_available() with Lightning's implementation of the upstream NVML-based assessment can unlock this functionality.
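A rough sketch of the idea, using pynvml purely for illustration; the helper names below (_nvml_cuda_available, patched_cuda_is_available) are hypothetical and not Lightning's actual API, which instead reuses Lightning's own port of the upstream NVML-based assessment:

```python
import contextlib
from unittest import mock

import pynvml  # NVML bindings; querying NVML does not initialize the CUDA runtime
import torch


def _nvml_cuda_available() -> bool:
    # Count devices via NVML so the CUDA runtime is never touched in the
    # parent process, keeping later fork() calls safe.
    try:
        pynvml.nvmlInit()
        try:
            return pynvml.nvmlDeviceGetCount() > 0
        finally:
            pynvml.nvmlShutdown()
    except pynvml.NVMLError:
        return False


@contextlib.contextmanager
def patched_cuda_is_available():
    # Temporarily swap in the NVML-based check while fork-sensitive code runs.
    with mock.patch("torch.cuda.is_available", _nvml_cuda_available):
        yield


# Usage sketch: GradScaler's availability check now goes through NVML,
# so CUDA is not initialized in the parent process.
# with patched_cuda_is_available():
#     scaler = torch.cuda.amp.GradScaler()
```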

I'll be opening a PR shortly that patches torch.cuda.is_available() within NativeMixedPrecisionPlugin (both the Lite and PL versions) and adds a standalone test for the ddp_fork strategy in a CUDA + AMP context (only for PL, given how expensive the standalone multi-GPU tests can be).

Motivation

Many users run AMP inside Jupyter notebooks, where ddp_fork is the relevant strategy for using multiple GPUs, so supporting this combination is important.

Pitch

Allow AMP to be used with the ddp_fork strategy (e.g. for multi-GPU training in Jupyter notebooks).
I will open a small PR shortly that makes this available.

Additional context

There is a related PR in PyTorch that may make torch.cuda.is_available() NVML-based by default. Once the relevant PyTorch version becomes Lightning's minimum, Lightning could drop the patching and its own NVML-based assessment.

cc @justusschock @awaelchli @carmocca

@speediedan speediedan added the needs triage Waiting to be triaged by maintainers label Oct 3, 2022
@carmocca carmocca added bug Something isn't working strategy: ddp spawn precision: amp Automatic Mixed Precision and removed needs triage Waiting to be triaged by maintainers labels Oct 4, 2022
@awaelchli awaelchli added strategy: ddp DistributedDataParallel and removed strategy: ddp spawn labels Nov 4, 2023