Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ColossalAI strategy #14224

Merged
merged 66 commits into from
Oct 11, 2022
Merged

Add ColossalAI strategy #14224

merged 66 commits into from
Oct 11, 2022

Conversation

ver217
Copy link
Contributor

@ver217 ver217 commented Aug 16, 2022

What does this PR do?

Fixes hpcaitech/ColossalAI#1330
Fixes #12733

Add ColossalAI strategy which supports ZeRO-DP with chunk-based memory management.

Does your PR introduce any breaking changes? If yes, please list them.

No.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Aug 16, 2022
@carmocca carmocca added feature Is an improvement or enhancement strategy labels Aug 19, 2022
@carmocca carmocca changed the title add colossalai strategy Add ColossalAI strategy Aug 19, 2022
@carmocca carmocca added this to the pl:1.8 milestone Aug 19, 2022
@ver217 ver217 marked this pull request as ready for review September 7, 2022 06:54
Copy link
Contributor

@rohitgr7 rohitgr7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work so far!
High-level review.

Are you planning to add docs in this PR or another?

src/pytorch_lightning/strategies/colossalai.py Outdated Show resolved Hide resolved
src/pytorch_lightning/strategies/colossalai.py Outdated Show resolved Hide resolved
src/pytorch_lightning/strategies/colossalai.py Outdated Show resolved Hide resolved
@github-actions github-actions bot added pl Generic label for PyTorch Lightning package and removed pl Generic label for PyTorch Lightning package labels Sep 9, 2022
@otaj
Copy link
Contributor

otaj commented Oct 7, 2022

Dockers are hopefully fixed. Will check in the morning

@otaj
Copy link
Contributor

otaj commented Oct 8, 2022

@rohitgr7, dockers are building. Docker image build of build-pl is slightly wrong, because it bases itself on the result of build-cuda images available in dockerhub, which is not available there yet, but it's not an issue, since result of build-cuda will be pushed to dockerhub on merge to master, so once that happens, everything will work out as it should.

However, there is one GPU test, that is consistently failing, and I think it was written by @awaelchli, do you think you can take a look there, please?

@otaj
Copy link
Contributor

otaj commented Oct 10, 2022

So, I found the issue with the failing test. In #14984 we started making sure, that the fork process is not poisoned by CUDA calls. However, just importing colossalai caused a call to torch.cuda.num_devices() (through calling torch_amp). The fix is to guard any colossalai imports by our context manager _patch_cuda_is_available. I'm not really sure, if this is the nicest possible solution, but hey, at least it works.

cc @carmocca to give a decision on whether it's okay enough solution

requirements/pytorch/strategies.txt Show resolved Hide resolved
tests/tests_pytorch/strategies/test_colossalai_strategy.py Outdated Show resolved Hide resolved
tests/tests_pytorch/strategies/test_colossalai_strategy.py Outdated Show resolved Hide resolved
tests/tests_pytorch/strategies/test_colossalai_strategy.py Outdated Show resolved Hide resolved
tests/tests_pytorch/strategies/test_colossalai_strategy.py Outdated Show resolved Hide resolved
tests/tests_pytorch/strategies/test_colossalai_strategy.py Outdated Show resolved Hide resolved
tests/tests_pytorch/strategies/test_colossalai_strategy.py Outdated Show resolved Hide resolved
tests/tests_pytorch/strategies/test_colossalai_strategy.py Outdated Show resolved Hide resolved
tests/tests_pytorch/strategies/test_colossalai_strategy.py Outdated Show resolved Hide resolved
src/pytorch_lightning/strategies/colossalai.py Outdated Show resolved Hide resolved
@mergify mergify bot added the ready PRs ready to be merged label Oct 11, 2022
@carmocca carmocca merged commit 2fef6d9 into Lightning-AI:master Oct 11, 2022
@rohitgr7 rohitgr7 deleted the feature/colossalai branch October 11, 2022 12:03
@1SAA 1SAA mentioned this pull request Oct 12, 2022
12 tasks
nicolai86 pushed a commit that referenced this pull request Oct 13, 2022
Co-authored-by: HELSON <[email protected]>
Co-authored-by: rohitgr7 <[email protected]>
Co-authored-by: otaj <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Is an improvement or enhancement pl Generic label for PyTorch Lightning package ready PRs ready to be merged strategy: colossalai (removed) Colossal-AI strategy
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[FEATURE]: PyTorch Lightning Integration Integrate Colossal.AI Engine
7 participants