
Reducing the memory-overhead of creating large-models for multi-GPU run #1244

Merged (3 commits merged into master on Aug 26, 2021)

Conversation

@RezaYazdaniAminabadi (Contributor) commented Jul 20, 2021

This PR addresses the memory overhead of creating large models for inference. We first partition the parallelizable tensors on the CPU, based on the number of GPUs that will run the model, and then move each partition to its device.

This fixes #1209 and #1161

The inference test scripts are updated to reflect these changes.
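As a generic illustration of that idea (not the actual DeepSpeed code in this PR), a weight that is parallel along one dimension can be chunked on the CPU into one piece per rank, and only the local piece is copied to that rank's GPU:

import torch

def partition_and_move(weight: torch.Tensor, rank: int, world_size: int,
                       dim: int = 0) -> torch.Tensor:
    # Split the full tensor on the CPU into world_size chunks along `dim`,
    # then copy only this rank's chunk to its GPU. The full tensor never
    # needs to be materialized in GPU memory.
    chunks = torch.chunk(weight, world_size, dim=dim)
    return chunks[rank].contiguous().to(torch.device("cuda", rank))

# e.g. each of 4 ranks holds one quarter of a large fp16 weight:
# local_w = partition_and_move(full_weight, rank=local_rank, world_size=4)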

@acostin1 commented Jul 21, 2021

I just tested this branch (I misread it as applying to training, not inference). For training I have this other issue: when I start the model on the CPU, it splits the GPT-Neo 125M model equally across the 8 A100 GPUs, but performance is roughly 10% slower.

Configuration before, with DeepSpeed 0.4.3: load the model with .cuda(), ZeRO-2, batch size 100.

  • Memory on the first GPU: 39 GB (20 GB taken on model load); memory occupied on the other GPUs: 20 GB (it was OOMing on the first GPU when I tried to increase the batch size)
  • Time to finish my training: 73 h

Configuration with 0.4.4: load the model with .cpu(), batch size 240. Now it occupies memory on all GPUs equally, but the estimated time to finish is 85 h.

Not an expert so I might be doing something very wrong :)

My training arguments:

num_train_epochs=10, logging_steps=500, save_steps=1000,
per_device_train_batch_size=240, per_device_eval_batch_size=240,
save_total_limit=3, gradient_accumulation_steps=2,
learning_rate=5e-4, ddp_find_unused_parameters=False,
warmup_steps=100, weight_decay=0.01, logging_dir='./logs')

And the ZeRO-2 config:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
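For context, and assuming the arguments above come from a HuggingFace TrainingArguments call (the surrounding call is not shown in the comment), a ZeRO-2 config file like this is typically wired in through the Trainer's deepspeed argument; ds_config_zero2.json and output_dir below are placeholders:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",              # placeholder output path
    num_train_epochs=10,
    per_device_train_batch_size=240,
    per_device_eval_batch_size=240,
    gradient_accumulation_steps=2,
    learning_rate=5e-4,
    warmup_steps=100,
    weight_decay=0.01,
    fp16=True,
    deepspeed="ds_config_zero2.json",    # path to the JSON config above
)

# The "auto" fields in the JSON are then resolved from these arguments:
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()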

@hyunwoongko (Contributor) commented:

self.module.to(torch.cuda.current_device())

Does this work as we want it to?
I implemented my code so that only the parameters that are on the CPU are uploaded to the GPU:
https://github.com/tunib-ai/parallelformers/blob/main/parallelformers/parallel/engine.py#L86

Is it the same implementation as this one?
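As a rough sketch of that approach (not the code from either repository), moving only the parameters and buffers that still live on the CPU might look like this:

import torch

def upload_cpu_params(module: torch.nn.Module) -> None:
    # Move only tensors that are still on the CPU; partitions that were
    # already placed on a GPU are left untouched.
    device = torch.device("cuda", torch.cuda.current_device())
    for param in module.parameters():
        if param.device.type == "cpu":
            param.data = param.data.to(device)
    for buf in module.buffers():
        if buf.device.type == "cpu":
            buf.data = buf.data.to(device)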

@switiz commented Jul 23, 2021

When I tested this PR, it worked for me. I tested a 13B model with half precision.

Before:
1-8 GPUs: OOM

After:
1 GPU: 30 GB
2 GPUs: 16.5 GB
4 GPUs: 8.5 GB
8 GPUs: 6 GB

The per-GPU memory does not go down exactly in proportion to the number of GPUs, but it is acceptable.

thank you
BR

@aphedges (Contributor) commented:

From looking at the code, this PR also seems to solve #1192.

@RezaYazdaniAminabadi merged commit 49b6a63 into master on Aug 26, 2021
@mrwyattii deleted the reyazda/large-model-inference branch on July 7, 2023