Reducing the memory overhead of creating large models for multi-GPU runs #1244
Conversation
Regarding self.module.to(torch.cuda.current_device()): does this work as we intend? Is it the same implementation as this one?
When I tested this PR, it worked for me (before/after results shown above). The memory is not split completely evenly across the number of GPUs, but it is acceptable. Thank you.
From looking at the code, this PR also seems to solve #1192.
This PR addresses the memory overhead of creating large models for inference. We first partition the parallelizable tensors on CPU, based on the number of GPUs the model will run on, and then move each partition to its device (see the sketch below).
This fixes #1209 and #1161.
This PR also updates the inference test scripts to reflect these changes.
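Below is a minimal sketch of the partition-then-move idea described above, not the PR's actual code. The helper name `partition_and_move`, the use of `torch.chunk`, and the choice of split dimension are illustrative assumptions.

```python
import torch
import torch.distributed as dist

def partition_and_move(weight_cpu: torch.Tensor, dim: int = 0) -> torch.Tensor:
    """Split a CPU tensor along `dim` into world_size chunks and return only
    this rank's chunk, placed on the local GPU."""
    # Assumes torch.distributed has already been initialized.
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    # torch.chunk may produce a smaller last chunk when the dimension is not
    # evenly divisible by world_size, which matches the "not completely
    # divisible" memory observation in the conversation above.
    shards = torch.chunk(weight_cpu, world_size, dim=dim)
    local_shard = shards[rank].contiguous()
    # Only the local partition is copied to GPU memory, so peak device memory
    # for this tensor is roughly 1/world_size of its full size.
    device = torch.device("cuda", torch.cuda.current_device())
    return local_shard.to(device)
```

The point of partitioning on CPU first is that the full tensor never has to exist on any single GPU; each rank only ever copies its own shard to the device.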