Describe the bug
I used enc_dec_nmt.py to build an NLP/MT model based on aayn_base.yml (nemo:1.8.2, based on pytorch:22.04-py3). Training was interrupted before reaching the final epoch. Trying to resume training from the last checkpoint by passing +exp_manager.resume_if_exists=true to the enc_dec_nmt.py call fails with the following trace
Steps/Code to reproduce bug
Train with examples/enc_dec_nmt.py long enough to produce at least one checkpoint, interrupt training, and rerun with the exp_manager.resume_if_exists flag set.
Expected behavior
Training resumes from last checkpoint. This used to work in nemo:1.3.0.
Environment overview (please complete the following information)
pytorch:22.04-py3
nemo:1.8.2
This is something strange that has been happening since we moved to PTL (PyTorch Lightning) 1.6: when a model is being restored, PTL calls validation_epoch_end() without first running the validation steps that compute validation BLEU scores etc., so outputs is just an empty dict/list. There are two potential fixes:
1. At the beginning of the eval_epoch_end() function in mt_enc_dec_model.py, add: if not outputs: return
2. Move back to PTL 1.5.10 and see if that helps.
I think we've raised an issue with PTL about this; we'll see if something changes in PTL 1.6.1.
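For illustration, here is a minimal sketch of what the first fix could look like. It uses a hypothetical stand-in class (ToyTranslationModel) rather than the actual NeMo model, and the method signature, the "loss" key, and the averaging logic are assumptions made only to keep the example runnable. The point is the early return when outputs is empty.

```python
from typing import Any, Dict, List, Optional


class ToyTranslationModel:
    """Hypothetical stand-in for the real model class; this is not NeMo code."""

    def eval_epoch_end(
        self, outputs: List[Dict[str, Any]], mode: str
    ) -> Optional[Dict[str, float]]:
        # Guard: when resuming, the epoch-end hook may be called before any
        # validation step has run, so `outputs` can be empty. Return early
        # instead of trying to aggregate nothing.
        if not outputs:
            return None

        # Placeholder aggregation (assumption): average a per-batch loss.
        losses = [batch["loss"] for batch in outputs]
        return {f"{mode}_loss": sum(losses) / len(losses)}


model = ToyTranslationModel()
print(model.eval_epoch_end([], "val"))  # empty outputs: guard returns None
print(model.eval_epoch_end([{"loss": 1.0}, {"loss": 3.0}], "val"))
```

With the guard in place, an empty outputs list is skipped quietly, and the normal aggregation only runs once validation steps have actually produced results.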
Thanks, this works. Interestingly, the call to validation_epoch_end() happens even if trainer.num_sanity_val_steps=0, which, as I understand it, should skip running any validation before training.