Disable non blocking to device with MPS #14368

Merged
merged 19 commits into from
Aug 26, 2022

j0rd1smit
Contributor

What does this PR do?

This PR ensures that the race condition bug in PyTorch (pytorch/pytorch#83015) does not affect Lightning. As discussed with @justusschock in #13285, this PR disables non-blocking moves to MPS devices, since such a move can leave the tensor on the MPS device with different values than the source tensor. To verify the fix, the test case test_data_is_not_changed_after_move_to_mps_device has been added.

Fixes #13285
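
For readers who want to see the idea in code, here is a minimal sketch of the approach described above: request a non-blocking transfer only when the target device type is not on a "blocking" list. It is not the merged implementation. The names `BLOCKING_DEVICE_TYPES`, `device_type`, `transfer_kwargs` and `move_to_device` are hypothetical and chosen for illustration; the only PyTorch API assumed is the `non_blocking` keyword of `Tensor.to`.

```python
from typing import Any, Dict

# Hypothetical constant for this sketch: device types that always get a
# blocking (synchronous) transfer. "mps" is listed because of the race
# condition reported in pytorch/pytorch#83015.
BLOCKING_DEVICE_TYPES = ("cpu", "mps")


def device_type(device: Any) -> str:
    """Return the device type of a torch.device-like object or a string such as "cuda:0"."""
    dev_type = getattr(device, "type", None)
    if dev_type is None:
        # Plain strings like "mps" or "cuda:1": keep the part before the index.
        dev_type = str(device).split(":", 1)[0]
    return dev_type


def transfer_kwargs(device: Any) -> Dict[str, Any]:
    """Build the keyword arguments to pass to ``tensor.to(device, **kwargs)``."""
    kwargs: Dict[str, Any] = {}
    # Only ask for an asynchronous copy when the target is not a blocking device.
    if device_type(device) not in BLOCKING_DEVICE_TYPES:
        kwargs["non_blocking"] = True
    return kwargs


def move_to_device(tensor: Any, device: Any) -> Any:
    """Move anything with a ``.to`` method (for example a torch.Tensor) to ``device``."""
    return tensor.to(device, **transfer_kwargs(device))
```

With PyTorch installed, `move_to_device(torch.ones(3), torch.device("mps"))` would then perform a regular blocking copy, while a CUDA target would still receive `non_blocking=True`.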

Does your PR introduce any breaking changes? If yes, please list them.

No.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Yes, this was my first ever PR 😀. It was quite fun to do. However, since it is my first time, I hope I did everything OK. If not, let me know. I'm happy to learn.

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Aug 23, 2022
@j0rd1smit j0rd1smit changed the title Bugfix/13285 disable non blocking to device mps [WIP] Bugfix/13285 disable non blocking to device mps Aug 23, 2022
@carmocca carmocca added community This PR is from the community accelerator: mps Apple Silicon GPU labels Aug 23, 2022
@carmocca carmocca added this to the pl:1.7.x milestone Aug 23, 2022
@carmocca carmocca added the bug Something isn't working label Aug 23, 2022
Contributor

@carmocca carmocca left a comment

Somebody will need to manually run the CI on an MPS device

@carmocca carmocca changed the title [WIP] Bugfix/13285 disable non blocking to device mps Disable non blocking to device with MPS Aug 23, 2022
@carmocca carmocca self-assigned this Aug 23, 2022
codecov bot commented Aug 23, 2022

Codecov Report

Merging #14368 (39af0b9) into master (33a5ed9) will decrease coverage by 3%.
The diff coverage is 100%.

❗ Current head 39af0b9 differs from pull request most recent head 523b3a8. Consider uploading reports for the commit 523b3a8 to get more accurate results

@@            Coverage Diff             @@
##           master   #14368      +/-   ##
==========================================
- Coverage      79%      76%      -3%     
==========================================
  Files         111      332     +221     
  Lines        7258    26894   +19636     
==========================================
+ Hits         5740    20432   +14692     
- Misses       1518     6462    +4944     

@mergify mergify bot added the ready PRs ready to be merged label Aug 24, 2022
@awaelchli
Contributor

> Somebody will need to manually run the CI on an MPS device

@carmocca Easier said than done :) See #14012
But perhaps we can run some selected tests related to device transfers

@justusschock
Member

@j0rd1smit seems like you fixed it. I think it's reasonable to update the type, but I'd do so in another PR as this might require further typing changes for mypy. If you want to tackle that as well, feel free to do so :)

@j0rd1smit
Contributor Author

@justusschock agreed, that should be a separate PR. So I guess this PR is ready.

@mergify mergify bot added has conflicts and removed ready PRs ready to be merged labels Aug 25, 2022
@mergify mergify bot added ready PRs ready to be merged and removed has conflicts ready PRs ready to be merged labels Aug 26, 2022
@justusschock justusschock enabled auto-merge (squash) August 26, 2022 06:00
@mergify mergify bot added has conflicts and removed ready PRs ready to be merged labels Aug 26, 2022
@mergify mergify bot added ready PRs ready to be merged has conflicts and removed has conflicts ready PRs ready to be merged labels Aug 26, 2022
@mergify mergify bot added ready PRs ready to be merged and removed has conflicts ready PRs ready to be merged labels Aug 26, 2022
@justusschock justusschock merged commit cced335 into Lightning-AI:master Aug 26, 2022
rohitgr7 pushed a commit that referenced this pull request Aug 27, 2022
* disable non-blocking for mps due to race condition bug

* fixed typo

* fixed: unknown mps device for non arm systems

* Removed unrobust test case

* moved _MPS_DEVICES such that it can be used in apply_func

* Resolve circular dependencies

* Comment rewording

* changed torchElasticEnvironment to a global import

* simplified if statement to blocking device type

* Added change to CHANGELOG

* Update src/pytorch_lightning/utilities/apply_func.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixed mypy not detecting casting of device

* Moved check into if statement to maintain original behavior

Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: Justus Schock <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <[email protected]>
lexierule pushed a commit that referenced this pull request Aug 31, 2022
Labels
accelerator: mps Apple Silicon GPU bug Something isn't working community This PR is from the community pl Generic label for PyTorch Lightning package ready PRs ready to be merged
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

MPS Inf/Nan Loss
6 participants