
AttributeError: 'DeepSpeedZeRoOffload' object has no attribute 'checkpoint_event_prologue' [BUG] #3601

Closed
wj210 opened this issue May 24, 2023 · 6 comments · Fixed by #4171
Labels: bug, inference, training


wj210 commented May 24, 2023

During inference, the error "AttributeError: 'DeepSpeedZeRoOffload' object has no attribute 'checkpoint_event_prologue'" occurs, similar to #2449.

To Reproduce

```python
train_params = dict(
    strategy=DeepSpeedStrategy(
        stage=3,
        offload_optimizer=args.cpu_offload,
        offload_parameters=args.cpu_offload,
    ) if args.strategy == 'deepspeed_stage_3' else args.strategy,
    accumulate_grad_batches=args.gradient_accumulation_steps,
    accelerator='gpu' if args.n_gpu > 0 else None,
    devices=[args.gpu_no] if args.n_gpu <= 1 else args.n_gpu,
    # num_nodes=args.n_gpu,
    max_epochs=args.num_train_epochs,
    precision='bf16' if args.fp_16 else 32,
    gradient_clip_val=args.max_grad_norm,  # default args.max_grad_norm
    enable_checkpointing=False,
    callbacks=callbacks,
    # log_every_n_steps=sum([len(t[0]) for t in curr_train_ids]) // args.train_batch_size,
)
trainer = pl.Trainer(**train_params)
ckpt_path = os.path.join(args.output_dir, 'best.ckpt')

# Train model
trainer.fit(model)
trainer.save_checkpoint(ckpt_path)

## Test phase
test_metrics = trainer.test()  # error happens here
```

The error happens at `trainer.test()`: the trainer loads the checkpoint saved in the logs and somehow hits that error.

I was able to mitigate it by setting `ckpt_path` to `'last'`, but it then started printing an extremely long list of numbers like
`1017, 1024, 1018, 1019, 1023, 1020, 1022, 1021, 1025, 1026, 1028, 1034, 1029, 1030, 1031, 1032, 1033, 1035, 1036, 1042, 1037, 1038, 1041, 1043, 1044, 1051, 1045, 1046, 1050, 
1047, 1049, 1048, 1052, 1053, 1055, 1061, 1056, 1057, 1058, 1059, 1060, 1062, 1063, 1069, 1064, 1065, 1068, 1070, 1071, 1078, 1072, 1073, 1077, 1074, 1076, 1075, 1079, 1080, 
1082, 1088, 1083, 1084, 1085, 1086, 1087, 1089, 1090, 1096, 1091, 1092, 1095, 1097, 1098, 1105, 1099, 1100, 1104, 1101, 1103, 1102, 1106, 1107, 1109, 1115, 1110, 1111, 1112, 
1113, 1114, 1116, 1117, 1123, 1118, 1119, 1122, 1124, 1125, 1132, 1126, 1127, 1131, 1128, 1130, 1129, 1133, 1134, 1135, 1136]`
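
To make the two call patterns concrete, here is a minimal sketch (names taken from the snippet above; the default-vs-`'last'` behaviour is as described in this report):

```python
# Default call: ckpt_path falls back to the "best" checkpoint, which reloads
# the saved checkpoint and raises the AttributeError reported above.
test_metrics = trainer.test()

# Mitigation: load the "last" checkpoint instead. The AttributeError goes
# away, but the long list of numbers shown above gets printed.
test_metrics = trainer.test(ckpt_path='last')
```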

**ds_report output**
```
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/torch']
torch version .................... 1.12.1+cu113
deepspeed install path ........... ['/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.8.2, unknown, unknown
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
```

wj210 added the bug and inference labels on May 24, 2023
tjruwase (Contributor) commented

@wj210, can you please share full repro steps, including full script and command line? Thanks!

wj210 (Author) commented May 25, 2023

The code is fairly long, but the error mainly happens because, during testing, `trainer.test`'s `ckpt_path` was left at the default ("best"), which loads from the saved checkpoint. As you can see, when `ckpt_path` was set to "last" the error did not occur, though something else happened: the printing of the numbers shown above.

```python
train_params = dict(
    strategy=DeepSpeedStrategy(
        stage=3,
        offload_optimizer=args.cpu_offload,
        offload_parameters=args.cpu_offload,
    ) if args.strategy == 'deepspeed_stage_3' else args.strategy,
    accumulate_grad_batches=args.gradient_accumulation_steps,
    accelerator='gpu' if args.n_gpu > 0 else None,
    devices=[args.gpu_no] if args.n_gpu <= 1 else args.n_gpu,
    # num_nodes=args.n_gpu,
    max_epochs=args.num_train_epochs,
    precision='bf16' if args.fp_16 else 32,
    # gradient_clip_val=args.max_grad_norm,  # default args.max_grad_norm
    enable_checkpointing=True,
    callbacks=callbacks,
    # log_every_n_steps=sum([len(t[0]) for t in curr_train_ids]) // args.train_batch_size,
)
trainer = pl.Trainer(**train_params)
ckpt_path = os.path.join(args.output_dir, 'best_ckpt')

# Train model
trainer.fit(model)
# trainer.save_checkpoint(ckpt_path)  # if ckpt_path is set to 'last', no error

## Test phase
```

I believe the same code from #2449 will reproduce the error.

The command line used was:

```bash
torchrun \
  --standalone \
  --nnodes=1 \
  --nproc_per_node=4 train_t5.py \
  --model_name_or_path 'google/flan-t5-base' \
  --tokenizer_name_or_path 'google/flan-t5-base' \
  --data_dir '../data/expl_sentiment_data/movies' \
  --n_gpu 4 \
  --data_explanation True \
  --num_train_epochs 20 \
  --strategy 'deepspeed_stage_3' \
  --train_batch_size 16 \
  --eval_batch_size 16 \
  --fp_16 True
```
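
A possible workaround worth sketching (an assumption, not something confirmed in this thread): consolidate the sharded ZeRO stage-3 checkpoint into a single fp32 file with Lightning's `convert_zero_checkpoint_to_fp32_state_dict` utility, load it into the module directly, and run `trainer.test(model)` without a `ckpt_path`, so the DeepSpeed checkpoint-loading path that raises the AttributeError is never exercised. Paths are placeholders, and `model`/`trainer` are the objects from the snippet above.

```python
import torch
from pytorch_lightning.utilities.deepspeed import (
    convert_zero_checkpoint_to_fp32_state_dict,
)

sharded_ckpt_dir = "outputs/best.ckpt"       # placeholder: directory of ZeRO-3 shards written during fit()
single_file_ckpt = "outputs/best_fp32.ckpt"  # placeholder: consolidated checkpoint file to create

# Merge the per-rank ZeRO shards into one fp32 checkpoint file on disk.
convert_zero_checkpoint_to_fp32_state_dict(sharded_ckpt_dir, single_file_ckpt)

# Load the consolidated weights into the LightningModule and test without
# passing ckpt_path (assumes the converted file carries a 'state_dict' entry,
# as Lightning's converter writes a full checkpoint dict).
state = torch.load(single_file_ckpt, map_location="cpu")
model.load_state_dict(state["state_dict"])
test_metrics = trainer.test(model)
```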

RoyJames commented Jul 3, 2023

Can confirm this happens. This is a huge inconvenience.

iamlockelightning commented

👀

iamlockelightning commented

Same situation, waiting for a solution to this problem. The official docs use deepspeed_stage_3 as an example; I don't know why deepspeed_stage_3_offload is not working.

My environment:

```
torch                     2.0.1
torchaudio                2.0.2
torchmetrics              0.11.4
torchvision               0.15.2
lightning                 2.0.2
```
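
For reference, a minimal sketch of the configuration in question, assuming Lightning 2.0's registered strategy shorthands (this only illustrates the setup being discussed, not a fix):

```python
import lightning.pytorch as pl

# "deepspeed_stage_3" is the shorthand the official docs use as an example;
# "deepspeed_stage_3_offload" is the offload variant reported here as failing
# when test() reloads a checkpoint.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="deepspeed_stage_3_offload",
    precision="bf16-mixed",
)
```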

tjruwase (Contributor) commented

@wj210, @iamlockelightning, @RoyJames, FYI
