
Support ddp_fork strategy with native AMP by attempting NVML-based CUDA availability assessment #14984

Conversation

speediedan
Contributor

@speediedan commented Oct 3, 2022

What does this PR do?

Fixes #14981

ddp_fork (and its associated alias strategies) cannot currently be used with native AMP because constructing the GradScaler in the NativeMixedPrecisionPlugin invokes the CUDA Runtime API:

https://github.com/Lightning-AI/lightning/blob/c059db446e7bfea03fba91e598ad503f0d1c6581/src/pytorch_lightning/plugins/precision/native_amp.py#L53

which in turn initializes CUDA and poisons subsequent forks.
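
To make the failure mode concrete, here is a minimal sketch (illustrative only, not code from this PR; in Lightning the fork happens inside the ddp_fork strategy rather than via a direct torch.multiprocessing call):

```python
# Minimal sketch of the fork-poisoning problem (illustrative, not from this PR).
# Once the parent process touches the CUDA Runtime API, CUDA is initialized and
# forked children can no longer use it.
import torch
import torch.multiprocessing as mp


def worker(rank: int) -> None:
    # In a forked child of a CUDA-initialized parent, this typically fails with
    # "Cannot re-initialize CUDA in forked subprocess ...".
    torch.zeros(1, device="cuda")


if __name__ == "__main__":
    torch.cuda.is_available()  # initializes the CUDA runtime (pre-fix behavior)
    mp.start_processes(worker, nprocs=1, start_method="fork")
```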

A future version of PyTorch may allow the default behavior of torch.cuda.is_available() to be switched to an NVML-based CUDA assessment throughout Lightning. In the meantime, patching torch.cuda.is_available() with Lightning's implementation of the upstream NVML-based assessment unlocks this functionality.
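
For reference, a minimal sketch of what an NVML-based availability check looks like, assuming the pynvml package (Lightning's actual helper lives in lightning_lite.accelerators.cuda and may differ in detail):

```python
# Hedged sketch of an NVML-based CUDA availability check (assumes pynvml is
# installed; Lightning's real implementation may differ). NVML queries the
# driver directly and never initializes the CUDA runtime, so calling it in the
# parent process keeps subsequent forks usable.
import pynvml


def nvml_cuda_available() -> bool:
    try:
        pynvml.nvmlInit()
        try:
            return pynvml.nvmlDeviceGetCount() > 0
        finally:
            pynvml.nvmlShutdown()
    except pynvml.NVMLError:
        # No NVIDIA driver/NVML library present: treat as "no CUDA".
        return False
```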

This PR patches torch.cuda.is_available() within the NativeMixedPrecisionPlugin (both the Lite and PL versions) and adds a standalone test exercising the ddp_fork strategy in a CUDA + native AMP context (only for PL, given how expensive standalone multi-GPU tests can be).
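
The patching approach can be sketched as follows (illustrative, not the PR's exact code; my_nvml_utils and nvml_cuda_available are hypothetical names standing in for Lightning's helper):

```python
# Hedged sketch of the patching approach (not the PR's exact code). While the
# patch is active, GradScaler's availability probe hits the fork-safe NVML check
# instead of the CUDA Runtime API, so CUDA is never initialized in the parent.
from unittest import mock

import torch

from my_nvml_utils import nvml_cuda_available  # hypothetical module: the sketch above

with mock.patch("torch.cuda.is_available", nvml_cuda_available):
    scaler = torch.cuda.amp.GradScaler()
```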

After this PR, ddp_fork (and its associated alias strategies) can be used with native AMP in interactive contexts (e.g., Jupyter notebooks). Note that this PR does not address or pertain to bf16-based AMP support.
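
Concretely, a configuration like the following should work (values are illustrative; `model` is a LightningModule defined elsewhere):

```python
# Illustrative Trainer configuration unlocked by this PR (values are examples).
# With strategy="ddp_fork", enabling native AMP (precision=16) no longer
# initializes CUDA in the parent process, so forked workers can use the GPUs.
from pytorch_lightning import Trainer

trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp_fork", precision=16)
# trainer.fit(model)  # model: a LightningModule defined elsewhere
```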

Additional context

There is a related PR in PyTorch that may allow modifying torch.cuda.is_available() throughout Lightning without patching the function or maintaining Lightning's own NVML-based assessment (once the relevant PyTorch version becomes Lightning's minimum supported version).

Does your PR introduce any breaking changes? If yes, please list them.

None

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@github-actions bot added the "pl" (Generic label for PyTorch Lightning package) label Oct 3, 2022
@speediedan marked this pull request as ready for review October 4, 2022 00:33
@speediedan marked this pull request as draft October 4, 2022 00:40
@speediedan force-pushed the enable_ddp_fork_strategy_with_native_amp branch 2 times, most recently from 51f7669 to 09a42e5 on October 4, 2022 03:14
@speediedan marked this pull request as ready for review October 4, 2022 03:40

codecov bot commented Oct 4, 2022

Codecov Report

Merging #14984 (bba3e82) into master (7fed7a1) will increase coverage by 1%.
The diff coverage is 72%.

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #14984     +/-   ##
=========================================
+ Coverage      83%      84%     +1%     
=========================================
  Files         396      276    -120     
  Lines       29089    21245   -7844     
=========================================
- Hits        24074    17842   -6232     
+ Misses       5015     3403   -1612     

tests/tests_pytorch/helpers/runif.py (outdated review thread, resolved)
tests/tests_pytorch/models/test_ddp_fork_amp.py (outdated review thread, resolved)
tests/tests_pytorch/models/test_ddp_fork_amp.py (outdated review thread, resolved)
src/lightning_lite/accelerators/cuda.py (outdated review thread, resolved)
@mergify bot added the "has conflicts" label Oct 4, 2022
tests/tests_lite/helpers/runif.py (outdated review thread, resolved)
tests/tests_pytorch/models/test_amp.py (outdated review thread, resolved)
@awaelchli added the "community" (This PR is from the community) label Oct 4, 2022
@awaelchli added this to the pl:1.8 milestone Oct 4, 2022
@awaelchli self-assigned this Oct 4, 2022
@awaelchli added the "accelerator: cuda" (Compute Unified Device Architecture GPU) and "precision: amp" (Automatic Mixed Precision) labels Oct 4, 2022
src/lightning_lite/accelerators/cuda.py (outdated review thread, resolved)
src/lightning_lite/accelerators/cuda.py (outdated review thread, resolved)
src/lightning_lite/accelerators/cuda.py (outdated review thread, resolved)
src/lightning_lite/accelerators/cuda.py (outdated review thread, resolved)
@speediedan force-pushed the enable_ddp_fork_strategy_with_native_amp branch from f818027 to 8441726 on October 5, 2022 01:05
@mergify bot removed the "has conflicts" label Oct 5, 2022
@speediedan force-pushed the enable_ddp_fork_strategy_with_native_amp branch from 867f7c7 to 6cc8e44 on October 5, 2022 01:15
@mergify bot added the "has conflicts" label and removed the "ready" (PRs ready to be merged) label Oct 5, 2022
@mergify bot added the "ready" (PRs ready to be merged) label and removed the "has conflicts" label Oct 5, 2022
@awaelchli merged commit 3b75c52 into Lightning-AI:master Oct 5, 2022
@otaj mentioned this pull request Oct 10, 2022
nicolai86 pushed a commit that referenced this pull request Oct 13, 2022
Support ddp_fork strategy with native AMP by attempting NVML-based CUDA availability assessment (#14984)

Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <[email protected]>
Labels
"accelerator: cuda" (Compute Unified Device Architecture GPU), "community" (This PR is from the community), "pl" (Generic label for PyTorch Lightning package), "precision: amp" (Automatic Mixed Precision), "ready" (PRs ready to be merged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support ddp_fork strategy with native AMP by attempting NVML-based CUDA availability assessment
5 participants