Make gradients available for all_gather on TPU #15003

stekiri · 2022-10-05T14:17:23Z

What does this PR do?

Support autograd for all_gather on TPU using torch_xla.core.functions.all_gather

Fixes #6295
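
As a rough illustration of the change described above, the switch between a differentiable and a plain gather might look like the sketch below. This is not the code from this PR: the helper name `gather_on_tpu`, the `sync_grads` flag, and the injectable `xf`/`xm` parameters are assumptions made for illustration. Only `torch_xla.core.functions.all_gather` (named in the description) and `torch_xla.core.xla_model.all_gather` are real torch_xla entry points.

```python
from types import SimpleNamespace


def gather_on_tpu(tensor, sync_grads=False, xf=None, xm=None):
    """Hypothetical helper: choose a differentiable or a plain all_gather.

    `xf` and `xm` can be injected for testing; when omitted, the real
    torch_xla modules are imported lazily, so this file can be loaded on
    machines without torch_xla installed.
    """
    if xf is None:
        import torch_xla.core.functions as xf  # autograd-aware collectives
    if xm is None:
        import torch_xla.core.xla_model as xm  # plain collectives
    if sync_grads:
        # Gradients can flow back through the gathered result.
        return xf.all_gather(tensor)
    # No autograd support needed, so use the plain collective.
    return xm.all_gather(tensor)


# Stand-in modules (assumptions, for demonstration only) that record
# which code path was taken instead of talking to a TPU.
fake_xf = SimpleNamespace(all_gather=lambda t: ("with_grad", t))
fake_xm = SimpleNamespace(all_gather=lambda t: ("no_grad", t))

if __name__ == "__main__":
    print(gather_on_tpu([1.0, 2.0], sync_grads=True, xf=fake_xf, xm=fake_xm))
    print(gather_on_tpu([1.0, 2.0], sync_grads=False, xf=fake_xf, xm=fake_xm))
```

On an actual TPU runtime the helper would be called without the `xf` and `xm` arguments, inside a process spawned by torch_xla, so that the real collectives are used.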

Does your PR introduce any breaking changes? If yes, please list them.

None

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

carmocca

#15349 needs to be merged first

stekiri · 2022-11-02T13:01:49Z

@carmocca, I ran my new tests successfully on Colab using a TPU runtime. However, the test-on-tpus run in this PR keeps failing: I seem to hit the 100-minute timeout, and there seems to be a problem with the logs: error: You must be logged in to the server (Unauthorized).

carmocca

You can ignore the "Unauthorized" error. If the job timed out, it's because no hardware is available at the moment.

awaelchli · 2022-11-26T22:04:16Z

@stekiri We haven't forgotten this PR. Some more patience is required, sorry about this. We are having some issues getting the TPU CI to work on GitHub, see #15788.

github-actions bot added the "pl" label (Generic label for PyTorch Lightning package) Oct 5, 2022

stekiri marked this pull request as ready for review October 13, 2022 11:53

stekiri requested review from awaelchli, carmocca, justusschock, rohitgr7, otaj and kaushikb11 as code owners October 13, 2022 11:53

carmocca reviewed Oct 13, 2022

src/lightning_lite/strategies/xla.py (outdated, resolved)

awaelchli self-assigned this Oct 13, 2022

awaelchli added the "accelerator: tpu" (Tensor Processing Unit) and "fabric" (lightning.fabric.Fabric) labels Oct 13, 2022

carmocca added the "feature" (Is an improvement or enhancement) and "community" (This PR is from the community) labels and removed the "fabric" (lightning.fabric.Fabric) label Oct 13, 2022

awaelchli reviewed Oct 19, 2022

src/pytorch_lightning/strategies/tpu_spawn.py (outdated, resolved)

src/lightning_lite/strategies/xla.py (outdated, resolved)

tests/tests_pytorch/accelerators/test_tpu.py (outdated, resolved)

stekiri and others added 5 commits October 24, 2022 19:02

Make gradients available for all_gather on TPU

cb0b12e

Modify switch and tests

8fd5eb8

[pre-commit.ci] auto fixes from pre-commit.com hooks

8cf0877

for more information, see https://pre-commit.ci

Merge branch 'master' into tpu-all-gather-with-grads

fb15ca7

Merge branch 'master' into tpu-all-gather-with-grads

1036b8d

awaelchli reviewed Oct 30, 2022

tests/tests_lite/strategies/test_xla.py (outdated, resolved)

tests/tests_pytorch/accelerators/test_tpu.py (outdated, resolved)

stekiri and others added 2 commits October 31, 2022 09:17

Apply suggestions from code review

2c339cf

Co-authored-by: Adrian Wälchli <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

b623b0d

for more information, see https://pre-commit.ci

justusschock approved these changes Oct 31, 2022

carmocca suggested changes Oct 31, 2022

stekiri added 4 commits October 31, 2022 19:33

Merge branch 'master' into tpu-all-gather-with-grads

26c6829

Modify tests

371dd9a

Fix test

410a6be

Merge branch 'master' into tpu-all-gather-with-grads

77f562e

Merge branch 'master' into tpu-all-gather-with-grads

1e1fd78

carmocca approved these changes Nov 3, 2022

tests/tests_pytorch/accelerators/test_tpu.py (outdated, resolved)

mergify bot added the "ready" label (PRs ready to be merged) Nov 3, 2022

carmocca assigned carmocca and unassigned awaelchli Nov 9, 2022

Drop test

b22a288

stekiri requested a review from williamFalcon as a code owner November 14, 2022 16:40

pre-commit-ci bot and others added 2 commits November 14, 2022 16:41

[pre-commit.ci] auto fixes from pre-commit.com hooks

6262ead

for more information, see https://pre-commit.ci

Merge branch 'master' into tpu-all-gather-with-grads

7c4d4fb

Borda requested review from awaelchli and removed request for otaj, rohitgr7 and kaushikb11 November 22, 2022 08:37

Merge branch 'master' into tpu-all-gather-with-grads

5c4ff7a

carmocca added this to the v1.9 milestone Nov 26, 2022

Borda added 2 commits November 30, 2022 08:23

Merge branch 'master' into tpu-all-gather-with-grads

d59cc6b

Merge branch 'master' into tpu-all-gather-with-grads

aec3d45

Borda enabled auto-merge (squash) December 8, 2022 04:08

Merge branch 'master' into tpu-all-gather-with-grads

c2372b9

Borda merged commit 0d822e4 into Lightning-AI:master Dec 8, 2022

stekiri deleted the tpu-all-gather-with-grads branch December 8, 2022 13:12
