[feature] Add support for OffloadModel to enable training large models on 1 GPU. #432
Conversation
…s 2 --optim_type oss_offload_ddp --batch_size=32 --model vit_large_patch16_224
* initial fwd/bwd commit * checkpoint work * modify shard loop * activation offloading and test to start with * fix lint errors * update comments * fix lint * remove unused var * remove commented out lines * modify name * remove break * remove profiler comments * avoid saving inputs * fix lint errors Co-authored-by: Anjali Sridhar <[email protected]>
* initial fwd/bwd commit * checkpoint work * modify shard loop * activation offloading and test to start with * fix lint errors * update comments * fix lint * remove unused var * remove commented out lines * modify name * remove break * remove profiler comments * add support for fp16 * add unit tests * fix lint errors * fix test failure Co-authored-by: Anjali Sridhar <[email protected]>
* initial fwd/bwd commit * checkpoint work * modify shard loop * activation offloading and test to start with * fix lint errors * update comments * fix lint * remove unused var * remove commented out lines * modify name * remove break * remove profiler comments * add support for fp16 * add unit tests * fix lint errors * fix test failure * cp work, incorrect output dimensions still need to be fixed * fixed activation outputs * intermediate cp of work * add tests * fix lint errors Co-authored-by: Anjali Sridhar <[email protected]>
quick question on the test file location. Should it be tests/nn/experimental/test_offload.py or tests/experimental/nn/test_offload.py? I think we mirror the dirs. File names can be shortened, like we have test_fsdp*.py but all in the same mirrored dir. That seems like a good convention?
also, see this comment: Lightning-AI/pytorch-lightning#6152 (comment)
I agree. I want it to be in experimental/ just like I moved tests for ampnet. |
fairscale/experimental/nn/offload.py
def __init__(
    self,
    model_cpu: nn.Sequential,  # hard pre-requisite for now, easier model slicing
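To illustrate the inline comment above: requiring an nn.Sequential makes model slicing trivial, since each shard is just a contiguous index range. A minimal sketch of that idea (the shard count and even-split strategy here are illustrative, not the PR's actual slicing logic):

```python
import torch.nn as nn

# A toy sequential model; 6 equally sized layers for easy slicing.
model = nn.Sequential(*[nn.Linear(32, 32) for _ in range(6)])

num_shards = 3
per_shard = len(model) // num_shards

# Each shard is itself an nn.Sequential built from a contiguous slice.
shards = [
    nn.Sequential(*list(model.children())[i * per_shard : (i + 1) * per_shard])
    for i in range(num_shards)
]
assert sum(len(s) for s in shards) == len(model)
```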
discussing elsewhere, but I think that the FSDP way (wrap submodules) could apply here, so why not keep both options open (either one monolithic nn.Sequential wrap, or a per-module wrap)? I think that adds a lot of flexibility and could be good enough in practice
practically speaking this means that https://pytorch.org/docs/stable/generated/torch.nn.Module.html?highlight=forward%20hook#torch.nn.Module.register_forward_pre_hook can be used, but the latency will be pretty terrible if used "naively" (wait for the FW wavefront to touch a module, then pull it in), so it's not really a silver bullet
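For concreteness, a minimal sketch of that naive pre-hook approach (the hook name and toy model are made up; this is not the PR's implementation, just the pattern being described):

```python
import torch
import torch.nn as nn

def pull_to_gpu(module: nn.Module, inputs):
    # Synchronous H2D copy the first time the FW wavefront touches the module;
    # this blocking copy is exactly the latency problem described above.
    module.to(torch.device("cuda"))
    return None  # leave the inputs unchanged

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
for submodule in model:
    submodule.register_forward_pre_hook(pull_to_gpu)

# Requires a CUDA device. Note the modules are never offloaded back to CPU
# here, which a real implementation would handle with a forward (post) hook.
out = model(torch.randn(8, 512, device="cuda"))
```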
Thanks for the great PR @anj-s, it's super comprehensive! I think we can try to make it more generic over time; it does not have to be perfect right now, and it's a very solid basis I believe. Minor nits if you don't mind, and curious to have @min-xu-ai's eyes on this
Sorry for being late to the party. I agree with Ben that this gives us a good start. There are lots of interesting things we can potentially do with this.
Before submitting
What does this PR do?
Add experimental support for the OffloadModel API, which enables training large models on a single GPU. OffloadModel chunks the given model into a list of modules and copies a given chunk from CPU to GPU during the FW pass. After the FW computation, the chunk is copied back to the CPU, and the process is repeated for the BW pass. The current implementation supports activation offloading and FP16 training (see the commit list above).
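A rough usage sketch of the API: only the model_cpu argument is taken from the constructor snippet above; the device argument (and the existence of a .parameters() passthrough) are assumptions about this PR, not confirmed by it.

```python
import torch
import torch.nn as nn
from fairscale.experimental.nn.offload import OffloadModel

# The model must be an nn.Sequential (hard pre-requisite, per the constructor).
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
)

# model_cpu comes from the constructor snippet above; device is an assumption.
offload_model = OffloadModel(model_cpu=model, device=torch.device("cuda"))

optimizer = torch.optim.SGD(offload_model.parameters(), lr=1e-2)
inputs = torch.randn(32, 1024, device="cuda")

# Shards are pulled CPU->GPU during FW, copied back after, and again for BW.
loss = offload_model(inputs).sum()
loss.backward()
optimizer.step()
```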
Caveats:
References:
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃