
Enables ZeRO-3 inference #1514

Merged
merged 22 commits into from
Nov 19, 2021

Conversation

jeffra
Collaborator

@jeffra jeffra commented Nov 2, 2021

This enables ZeRO-3 inference support, meaning no optimizer is specified. It supports ZeRO-3 with multiple GPUs and also ZeRO-3 with parameter CPU offload.

This enables initial functional support; we have not fully evaluated performance or all memory reduction scenarios. More PRs to come, I imagine :)
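
As a rough sketch of what this unlocks (the config keys follow the DeepSpeed ZeRO JSON schema; the engine-creation call is shown only in a comment since it needs an actual model and GPUs), a ZeRO-3 config with no `optimizer` section can now be passed for inference:

```python
import json

# Minimal ZeRO-3 inference config: note there is deliberately no
# "optimizer" section, which is what this PR makes work.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
    },
    "fp16": {"enabled": True},
    # Batch-size keys may still be required by deepspeed.initialize
    # even when only evaluating.
    "train_batch_size": 16,
}

# With DeepSpeed installed, the engine would be created without an optimizer:
#   engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
print(json.dumps(ds_config, indent=2))
```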

@jeffra jeffra changed the title Add support for no optimizer init for Z3 [WIP] Add support for no optimizer init for Z3 Nov 2, 2021
@jeffra jeffra marked this pull request as ready for review November 2, 2021 21:27
@jeffra jeffra changed the title [WIP] Add support for no optimizer init for Z3 Allow ZeRO-3 to work without an optimizer Nov 2, 2021
@jeffra jeffra changed the title Allow ZeRO-3 to work without an optimizer Enables ZeRO-3 inference Nov 2, 2021
@stas00
Collaborator

stas00 commented Nov 17, 2021

It definitely uses less GPU memory with offload configured, compared to without.

2 GPUs, w/ offload:

export BS=16; rm -r output_dir; PYTHONPATH=src USE_TF=0 deepspeed --num_gpus=2 \
  examples/pytorch/translation/run_translation.py \
  --model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 \
  --evaluation_strategy=steps --do_eval --label_smoothing 0.1 --learning_rate 3e-5 \
  --logging_first_step --logging_steps 500 --max_source_length 128 --max_target_length 128 \
  --overwrite_output_dir --per_device_eval_batch_size $BS --predict_with_generate \
  --sortish_sampler --source_lang en --target_lang ro --dataset_name wmt16 \
  --dataset_config "ro-en" --source_prefix "translate English to Romanian: " \
  --val_max_target_length 128 --warmup_steps 50 --max_eval_samples 50 \
  --deepspeed tests/deepspeed/ds_config_zero3.json --fp16 --skip_memory_metrics 0

  before_init_mem_cpu       =     5552MB
  before_init_mem_gpu       =       32MB
  eval_bleu                 =    28.1872
  eval_gen_len              =      34.88
  eval_loss                 =     3.6001
  eval_mem_cpu_alloc_delta  =     2175MB
  eval_mem_cpu_peaked_delta =        0MB
  eval_mem_gpu_alloc_delta  =        0MB
  eval_mem_gpu_peaked_delta =      264MB
  eval_runtime              = 0:00:16.08
  eval_samples              =         50
  eval_samples_per_second   =      3.108
  eval_steps_per_second     =      0.124
  init_mem_cpu_alloc_delta  =        5MB
  init_mem_cpu_peaked_delta =        0MB
  init_mem_gpu_alloc_delta  =        0MB
  init_mem_gpu_peaked_delta =        0MB

w/o offload (same command and config, but with "device" changed to "none"):

  before_init_mem_cpu       =     5428MB
  before_init_mem_gpu       =      106MB
  eval_bleu                 =    28.1872
  eval_gen_len              =      34.88
  eval_loss                 =     3.6001
  eval_mem_cpu_alloc_delta  =      668MB
  eval_mem_cpu_peaked_delta =        0MB
  eval_mem_gpu_alloc_delta  =      332MB
  eval_mem_gpu_peaked_delta =      264MB
  eval_runtime              = 0:00:30.00
  eval_samples              =         50
  eval_samples_per_second   =      1.666
  eval_steps_per_second     =      0.067
  init_mem_cpu_alloc_delta  =        3MB
  init_mem_cpu_peaked_delta =        0MB
  init_mem_gpu_alloc_delta  =        0MB
  init_mem_gpu_peaked_delta =        0MB

the change was:

--- a/tests/deepspeed/ds_config_zero3.json
+++ b/tests/deepspeed/ds_config_zero3.json
@@ -30,11 +30,11 @@
     "zero_optimization": {
         "stage": 3,
         "offload_optimizer": {
-            "device": "cpu",
+            "device": "none",
             "pin_memory": true
         },
         "offload_param": {
-            "device": "cpu",
+            "device": "none",
             "pin_memory": true
         },
         "overlap_comm": true,

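The same toggle can be applied programmatically rather than by editing the JSON by hand. A small sketch (assuming the config has been loaded as a plain dict; `disable_offload` is a hypothetical helper, not part of DeepSpeed):

```python
import json

def disable_offload(config: dict) -> dict:
    """Set both ZeRO offload targets to 'none', mirroring the diff above."""
    zero = config.setdefault("zero_optimization", {})
    for key in ("offload_optimizer", "offload_param"):
        if key in zero:
            zero[key]["device"] = "none"
    return config

# Start from the offload-enabled shape used in ds_config_zero3.json.
cfg = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    }
}
disable_offload(cfg)
print(json.dumps(cfg["zero_optimization"], indent=2))
```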