
parallelize and deparallelize method for GPT-Neo series model #11751

Closed
Ankit-Dhankhar opened this issue May 17, 2021 · 2 comments

@Ankit-Dhankhar

🚀 Feature request

Add parallelize and deparallelize methods for distributing the attention modules across multiple GPUs.
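
For context, the existing GPT-2 interface that this request would mirror looks roughly like the sketch below. The checkpoint and the device/layer split are illustrative assumptions, not a prescribed configuration:

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

# device_map assigns transformer blocks to GPU indices
# (gpt2-xl has 48 blocks; this even split is just an example).
device_map = {
    0: list(range(0, 24)),
    1: list(range(24, 48)),
}

model.parallelize(device_map)  # spread the blocks across GPUs 0 and 1
# ... fine-tuning or generation ...
model.deparallelize()          # move the model back to the CPU
```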

Motivation

Fine-tuning the GPT-Neo 2.7B model on a 12 GB GPU gives an out-of-memory error. A parallelize method would allow us to train that model by splitting the attention modules across multiple GPUs with smaller VRAM.

Your contribution

Considering this line in the GPT-2 code and the absence of the parallelize method from the GPT-2 documentation, I wanted to know whether these methods are still supported. If not, what is the recommended way to fine-tune large transformer models like GPT-Neo?

If they are still supported, I can take up this task and submit a PR for both methods as well as a documentation fix.

@aphedges
Contributor

This is answered in #11054.

(I'm in a similar situation to you. I'm just going to go with the suggestion there and use DeepSpeed instead of model parallelism.)
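
For anyone finding this later, the DeepSpeed route via the Trainer integration might look roughly like the sketch below. The config values (ZeRO stage 2 with optimizer offload to CPU) and the file name are assumptions for illustration, not a recommendation from this thread:

```python
import json
from transformers import TrainingArguments

# Minimal ZeRO stage-2 config; "auto" lets the Trainer fill in
# values from TrainingArguments at launch time (illustrative only).
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "fp16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f)

training_args = TrainingArguments(
    output_dir="gpt-neo-finetune",
    per_device_train_batch_size=1,
    fp16=True,
    deepspeed="ds_config.json",  # hand optimizer/gradient sharding to DeepSpeed
)
# Then launch the training script with the `deepspeed` launcher, e.g.:
#   deepspeed run_clm.py --deepspeed ds_config.json ...
```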

@Ankit-Dhankhar
Author

Thanks, I didn't see that. The parallelism notes are also awesome.
