
parallelize and deparallelize method for GPT-Neo series model #11751

Closed
Ankit-Dhankhar opened this issue May 17, 2021 · 2 comments

@Ankit-Dhankhar

🚀 Feature request

Add parallelize and deparallelize methods for distributing the attention modules across multiple GPUs.
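
For context, the existing GPT-2 interface that this request would mirror looks roughly like the sketch below. The checkpoint and the device/layer split are illustrative assumptions, not a prescribed configuration:

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

# device_map assigns transformer blocks to GPU indices
# (gpt2-xl has 48 blocks; this even split is just an example).
device_map = {
    0: list(range(0, 24)),
    1: list(range(24, 48)),
}

model.parallelize(device_map)  # spread the blocks across GPUs 0 and 1
# ... fine-tuning or generation ...
model.deparallelize()          # move the model back to the CPU
```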

Motivation

Fine-tuning the GPT-Neo 2.7B model on a 12 GB GPU gives an out-of-memory error. A parallelize method would allow us to train that model by splitting the attention modules across multiple GPUs with smaller VRAM.

Your contribution

Considering this line in the GPT-2 code and the absence of the parallelize method from the GPT-2 documentation, I wanted to know whether these methods are still supported. If not, what is the recommended way to fine-tune large transformer models like GPT-Neo?

If they are still supported, I can take up this task and submit a PR for both methods as well as a documentation fix.

@aphedges
Contributor

This is answered in #11054.

(I'm in a similar situation to you. I'm just going to go with the suggestion there and use DeepSpeed instead of model parallelism.)
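
For anyone finding this later, the DeepSpeed route via the Trainer integration might look roughly like the sketch below. The config values (ZeRO stage 2 with optimizer offload to CPU) and the file name are assumptions for illustration, not a recommendation from this thread:

```python
import json
from transformers import TrainingArguments

# Minimal ZeRO stage-2 config; "auto" lets the Trainer fill in
# values from TrainingArguments at launch time (illustrative only).
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "fp16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f)

training_args = TrainingArguments(
    output_dir="gpt-neo-finetune",
    per_device_train_batch_size=1,
    fp16=True,
    deepspeed="ds_config.json",  # hand optimizer/gradient sharding to DeepSpeed
)
# Then launch the training script with the `deepspeed` launcher, e.g.:
#   deepspeed run_clm.py --deepspeed ds_config.json ...
```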

@Ankit-Dhankhar
Author

Thanks, I didn't see that. The parallelism notes are also awesome.
