
AttributeError: 'DeepSpeedZeRoOffload' object has no attribute 'checkpoint_event_prologue' [BUG] #3601

Closed
wj210 opened this issue May 24, 2023 · 6 comments · Fixed by #4171
Labels: bug, inference, training


wj210 commented May 24, 2023

During inference, the error "AttributeError: 'DeepSpeedZeRoOffload' object has no attribute 'checkpoint_event_prologue'" occurs, similar to #2449.

To Reproduce

```python
train_params = dict(
    strategy=DeepSpeedStrategy(
        stage=3,
        offload_optimizer=args.cpu_offload,
        offload_parameters=args.cpu_offload,
    ) if args.strategy == 'deepspeed_stage_3' else args.strategy,
    accumulate_grad_batches=args.gradient_accumulation_steps,
    accelerator='gpu' if args.n_gpu > 0 else None,
    devices=[args.gpu_no] if args.n_gpu <= 1 else args.n_gpu,
    # num_nodes=args.n_gpu,
    max_epochs=args.num_train_epochs,
    precision='bf16' if args.fp_16 else 32,
    gradient_clip_val=args.max_grad_norm,  # default args.max_grad_norm
    enable_checkpointing=False,
    callbacks=callbacks,
    # log_every_n_steps=sum([len(t[0]) for t in curr_train_ids]) // args.train_batch_size,
)
trainer = pl.Trainer(**train_params)
ckpt_path = os.path.join(args.output_dir, 'best.ckpt')

# Train model
trainer.fit(model)
trainer.save_checkpoint(ckpt_path)

## Test phase
test_metrics = trainer.test()  # error happens here
```

The error happens at `trainer.test()`: the trainer loads the checkpoint saved in the logs and somehow hits that error.

I was able to mitigate it by setting `ckpt_path` to `'last'`, but it then started printing an extremely long list of numbers like
`1017, 1024, 1018, 1019, 1023, 1020, 1022, 1021, 1025, 1026, 1028, 1034, 1029, 1030, 1031, 1032, 1033, 1035, 1036, 1042, 1037, 1038, 1041, 1043, 1044, 1051, 1045, 1046, 1050, 
1047, 1049, 1048, 1052, 1053, 1055, 1061, 1056, 1057, 1058, 1059, 1060, 1062, 1063, 1069, 1064, 1065, 1068, 1070, 1071, 1078, 1072, 1073, 1077, 1074, 1076, 1075, 1079, 1080, 
1082, 1088, 1083, 1084, 1085, 1086, 1087, 1089, 1090, 1096, 1091, 1092, 1095, 1097, 1098, 1105, 1099, 1100, 1104, 1101, 1103, 1102, 1106, 1107, 1109, 1115, 1110, 1111, 1112, 
1113, 1114, 1116, 1117, 1123, 1118, 1119, 1122, 1124, 1125, 1132, 1126, 1127, 1131, 1128, 1130, 1129, 1133, 1134, 1135, 1136]`
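
To make the two call patterns concrete, here is a minimal sketch (names taken from the snippet above; the default-vs-`'last'` behaviour is as described in this report):

```python
# Default call: ckpt_path falls back to the "best" checkpoint, which reloads
# the saved checkpoint and raises the AttributeError reported above.
test_metrics = trainer.test()

# Mitigation: load the "last" checkpoint instead. The AttributeError goes
# away, but the long list of numbers shown above gets printed.
test_metrics = trainer.test(ckpt_path='last')
```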

**ds_report output**
```
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/torch']
torch version .................... 1.12.1+cu113
deepspeed install path ........... ['/home/weijie/anaconda3/envs/flan/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.8.2, unknown, unknown
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
```

wj210 added the bug and inference labels on May 24, 2023
tjruwase (Contributor) commented

@wj210, can you please share full repro steps, including full script and command line? Thanks!

wj210 (Author) commented May 25, 2023

The code is fairly long, but the error mainly happens because, during testing, `trainer.test`'s `ckpt_path` was left at the default ("best"), which loads from the saved checkpoint. As you can see, when `ckpt_path` was set to "last" the error did not occur, though something else happened: the printing of the numbers shown above.

```python
train_params = dict(
    strategy=DeepSpeedStrategy(
        stage=3,
        offload_optimizer=args.cpu_offload,
        offload_parameters=args.cpu_offload,
    ) if args.strategy == 'deepspeed_stage_3' else args.strategy,
    accumulate_grad_batches=args.gradient_accumulation_steps,
    accelerator='gpu' if args.n_gpu > 0 else None,
    devices=[args.gpu_no] if args.n_gpu <= 1 else args.n_gpu,
    # num_nodes=args.n_gpu,
    max_epochs=args.num_train_epochs,
    precision='bf16' if args.fp_16 else 32,
    # gradient_clip_val=args.max_grad_norm,  # default args.max_grad_norm
    enable_checkpointing=True,
    callbacks=callbacks,
    # log_every_n_steps=sum([len(t[0]) for t in curr_train_ids]) // args.train_batch_size,
)
trainer = pl.Trainer(**train_params)
ckpt_path = os.path.join(args.output_dir, 'best_ckpt')

# Train model
trainer.fit(model)
# trainer.save_checkpoint(ckpt_path)  # if ckpt_path is set to 'last', no error

## Test phase
```

I believe the same code from #2449 will reproduce the error.

The command line used was:

```bash
torchrun \
  --standalone \
  --nnodes=1 \
  --nproc_per_node=4 train_t5.py \
  --model_name_or_path 'google/flan-t5-base' \
  --tokenizer_name_or_path 'google/flan-t5-base' \
  --data_dir '../data/expl_sentiment_data/movies' \
  --n_gpu 4 \
  --data_explanation True \
  --num_train_epochs 20 \
  --strategy 'deepspeed_stage_3' \
  --train_batch_size 16 \
  --eval_batch_size 16 \
  --fp_16 True
```
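
A possible workaround worth sketching (an assumption, not something confirmed in this thread): consolidate the sharded ZeRO stage-3 checkpoint into a single fp32 file with Lightning's `convert_zero_checkpoint_to_fp32_state_dict` utility, load it into the module directly, and run `trainer.test(model)` without a `ckpt_path`, so the DeepSpeed checkpoint-loading path that raises the AttributeError is never exercised. Paths are placeholders, and `model`/`trainer` are the objects from the snippet above.

```python
import torch
from pytorch_lightning.utilities.deepspeed import (
    convert_zero_checkpoint_to_fp32_state_dict,
)

sharded_ckpt_dir = "outputs/best.ckpt"       # placeholder: directory of ZeRO-3 shards written during fit()
single_file_ckpt = "outputs/best_fp32.ckpt"  # placeholder: consolidated checkpoint file to create

# Merge the per-rank ZeRO shards into one fp32 checkpoint file on disk.
convert_zero_checkpoint_to_fp32_state_dict(sharded_ckpt_dir, single_file_ckpt)

# Load the consolidated weights into the LightningModule and test without
# passing ckpt_path (assumes the converted file carries a 'state_dict' entry,
# as Lightning's converter writes a full checkpoint dict).
state = torch.load(single_file_ckpt, map_location="cpu")
model.load_state_dict(state["state_dict"])
test_metrics = trainer.test(model)
```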

RoyJames commented Jul 3, 2023

Can confirm this happens. This is a huge inconvenience.

iamlockelightning commented

👀

iamlockelightning commented

Same situation, waiting for a solution to this problem. The official docs use deepspeed_stage_3 as an example; I don't know why deepspeed_stage_3_offload is not working.

My environment:

```
torch                     2.0.1
torchaudio                2.0.2
torchmetrics              0.11.4
torchvision               0.15.2
lightning                 2.0.2
```
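
For reference, a minimal sketch of the configuration in question, assuming Lightning 2.0's registered strategy shorthands (this only illustrates the setup being discussed, not a fix):

```python
import lightning.pytorch as pl

# "deepspeed_stage_3" is the shorthand the official docs use as an example;
# "deepspeed_stage_3_offload" is the offload variant reported here as failing
# when test() reloads a checkpoint.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="deepspeed_stage_3_offload",
    precision="bf16-mixed",
)
```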

tjruwase (Contributor) commented

@wj210, @iamlockelightning, @RoyJames, FYI
