
Reducing the memory-overhead of creating large-models for multi-GPU run #1244

Merged (3 commits merged into master on Aug 26, 2021)

Conversation

@RezaYazdaniAminabadi (Contributor) commented Jul 20, 2021

This PR addresses the memory overhead of creating large models for inference. We first partition the parallelizable tensors on the CPU, based on the number of GPUs that will run the model, and then move each partition to its device.

This fixes #1209 and #1161

The inference test scripts are updated to reflect these changes.
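As a generic illustration of that idea (not the actual DeepSpeed code in this PR), a weight that is parallel along one dimension can be chunked on the CPU into one piece per rank, and only the local piece is copied to that rank's GPU:

import torch

def partition_and_move(weight: torch.Tensor, rank: int, world_size: int,
                       dim: int = 0) -> torch.Tensor:
    # Split the full tensor on the CPU into world_size chunks along `dim`,
    # then copy only this rank's chunk to its GPU. The full tensor never
    # needs to be materialized in GPU memory.
    chunks = torch.chunk(weight, world_size, dim=dim)
    return chunks[rank].contiguous().to(torch.device("cuda", rank))

# e.g. each of 4 ranks holds one quarter of a large fp16 weight:
# local_w = partition_and_move(full_weight, rank=local_rank, world_size=4)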

@acostin1 commented Jul 21, 2021

I just tested this branch (I misread it as applying to training, not inference). For training I have this other issue: when I start the model on the CPU, it splits the GPT-Neo 125M model equally across the 8 A100 GPUs, but performance is roughly 10% slower.

Configuration before, with DeepSpeed 0.4.3: load the model with .cuda(), ZeRO-2, batch size 100.

  • Memory on the first GPU: 39 GB (20 GB taken on model load); memory occupied on the other GPUs: 20 GB (it was OOMing on the first GPU when I tried to increase the batch size)
  • Time to finish my training: 73 h

Configuration with 0.4.4: load the model with .cpu(), batch size 240. Now it occupies memory on all GPUs equally, but the estimated time to finish is 85 h.

Not an expert so I might be doing something very wrong :)

My training arguments:

num_train_epochs=10, logging_steps=500, save_steps=1000,
per_device_train_batch_size=240, per_device_eval_batch_size=240,
save_total_limit=3, gradient_accumulation_steps=2,
learning_rate=5e-4, ddp_find_unused_parameters=False,
warmup_steps=100, weight_decay=0.01, logging_dir='./logs')

And the ZeRO-2 config:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
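For context, and assuming the arguments above come from a HuggingFace TrainingArguments call (the surrounding call is not shown in the comment), a ZeRO-2 config file like this is typically wired in through the Trainer's deepspeed argument; ds_config_zero2.json and output_dir below are placeholders:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",              # placeholder output path
    num_train_epochs=10,
    per_device_train_batch_size=240,
    per_device_eval_batch_size=240,
    gradient_accumulation_steps=2,
    learning_rate=5e-4,
    warmup_steps=100,
    weight_decay=0.01,
    fp16=True,
    deepspeed="ds_config_zero2.json",    # path to the JSON config above
)

# The "auto" fields in the JSON are then resolved from these arguments:
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()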

@hyunwoongko (Contributor) commented:

self.module.to(torch.cuda.current_device())

Does this work as we want it to?
I implemented my code so that only the parameters that are on the CPU are uploaded to the GPU:
https://github.com/tunib-ai/parallelformers/blob/main/parallelformers/parallel/engine.py#L86

Is it the same implementation as this one?
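As a rough sketch of that approach (not the code from either repository), moving only the parameters and buffers that still live on the CPU might look like this:

import torch

def upload_cpu_params(module: torch.nn.Module) -> None:
    # Move only tensors that are still on the CPU; partitions that were
    # already placed on a GPU are left untouched.
    device = torch.device("cuda", torch.cuda.current_device())
    for param in module.parameters():
        if param.device.type == "cpu":
            param.data = param.data.to(device)
    for buf in module.buffers():
        if buf.device.type == "cpu":
            buf.data = buf.data.to(device)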

@switiz commented Jul 23, 2021

When I tested this PR, it worked for me. I tested a 13B model with half precision.

Before:
1-8 GPUs: OOM

After:
1 GPU: 30 GB
2 GPUs: 16.5 GB
4 GPUs: 8.5 GB
8 GPUs: 6 GB

The per-GPU memory does not go down exactly in proportion to the number of GPUs, but it is acceptable.

thank you
BR

@aphedges (Contributor) commented:

From looking at the code, this PR also seems to solve #1192.

@RezaYazdaniAminabadi merged commit 49b6a63 into master on Aug 26, 2021
@mrwyattii deleted the reyazda/large-model-inference branch on July 7, 2023