🚀 Feature request
`parallelize` and `deparallelize` methods for distributing attention modules across multiple GPUs.
Motivation
Fine-tuning the GPT-Neo 2.7B model on a 12 GB GPU gives an out-of-memory error. A `parallelize` method would allow us to train that model by splitting its attention modules across multiple GPUs with smaller VRAM.
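For reference, a minimal sketch of how the existing GPT-2 `parallelize` API is used today; the assumption here is that a GPT-Neo implementation would follow the same `device_map` pattern, and the specific layer split shown is only illustrative:

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

# device_map: GPU index -> list of transformer block indices to place on that GPU.
# Splitting gpt2-xl's 48 blocks evenly across two GPUs is an assumed example,
# not a recommended configuration.
device_map = {
    0: list(range(0, 24)),
    1: list(range(24, 48)),
}

model.parallelize(device_map)   # shard the blocks across the listed GPUs
# ... fine-tune as usual, feeding inputs to the model's first device ...
model.deparallelize()           # move the model back to CPU when finished
```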
Your contribution
Considering this line in the GPT-2 code and the absence of documentation for the `parallelize` method in the GPT-2 documentation, I wanted to know whether these methods are still supported. If not, what is the recommended approach for fine-tuning large transformer models like GPT-Neo?
If they are still supported, I can take up this task and submit a PR for both methods as well as a documentation fix.