
Trying to resume training enc_dec_nmt fails #4224

Closed
itzsimpl opened this issue May 22, 2022 · 4 comments
Labels: bug (Something isn't working)

@itzsimpl
Contributor

Describe the bug

I used enc_dec_nmt.py to build an NLP/MT model based on aayn_base.yml (nemo:1.8.2 based on pytorch:22.04-py3). Training was interrupted before reaching the final epoch; now, trying to resume training from the last checkpoint by passing
+exp_manager.resume_if_exists=true to the enc_dec_nmt.py call fails with the following trace:

Traceback (most recent call last):
  File "examples/nlp/machine_translation/enc_dec_nmt.py", line 147, in <module>
    main()
  File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
    _run_hydra(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "examples/nlp/machine_translation/enc_dec_nmt.py", line 140, in main
    trainer.fit(mt_model)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in _run
    results = self._run_stage()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1321, in _run_stage
    return self._run_train()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1351, in _run_train
    self.fit_loop.run()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 205, in run
    self.on_advance_end()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 255, in on_advance_end
    self._run_validation()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 309, in _run_validation
    self.val_loop.run()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 211, in run
    output = self.on_run_end()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 187, in on_run_end
    self._evaluation_epoch_end(self._outputs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 309, in _evaluation_epoch_end
    self.trainer._call_lightning_module_hook("validation_epoch_end", output_or_outputs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1593, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/workspace/nemo/nemo/collections/nlp/models/machine_translation/mt_enc_dec_model.py", line 493, in validation_epoch_end
    self.eval_epoch_end(outputs, 'val', self.global_rank)
  File "/workspace/nemo/nemo/collections/nlp/models/machine_translation/mt_enc_dec_model.py", line 407, in eval_epoch_end
    if isinstance(outputs[0], dict):
IndexError: list index out of range

Steps/Code to reproduce bug

Train with examples/nlp/machine_translation/enc_dec_nmt.py long enough to produce at least one checkpoint, interrupt training, and rerun with the exp_manager.resume_if_exists flag set.
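For concreteness, the resume invocation looks roughly like the following. Only the +exp_manager.resume_if_exists=true override comes from this report; the config path and name are illustrative and depend on how the original run was launched:

```shell
# Re-run the same training command, adding the resume override.
python examples/nlp/machine_translation/enc_dec_nmt.py \
  --config-path=conf \
  --config-name=aayn_base \
  +exp_manager.resume_if_exists=true
```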

Expected behavior

Training resumes from last checkpoint. This used to work in nemo:1.3.0.

Environment overview

pytorch:22.04-py3
nemo:1.8.2

@itzsimpl itzsimpl added the bug Something isn't working label May 22, 2022
@MaximumEntropy
Contributor

MaximumEntropy commented May 22, 2022

This is something strange that has been happening since we moved to PTL 1.6: when a model is being restored, PTL calls validation_epoch_end() without running any of the validation steps that compute validation BLEU scores etc., so outputs is just an empty dict/list. There are two potential fixes:

1. At the beginning of the eval_epoch_end() function in mt_enc_dec_model.py, add if not outputs: return.
2. Move back to PTL 1.5.10 and see if that helps.

I think we've raised an issue with PTL about this; we'll see if something changes in PTL 1.6.1.
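The guard in fix 1 can be sketched as follows. This is a toy stand-in, not NeMo's actual eval_epoch_end implementation; it only demonstrates why an early return on empty outputs avoids the IndexError from the traceback:

```python
# Minimal sketch of the suggested guard: bail out early when PTL hands
# the hook an empty outputs list on checkpoint restore.

def eval_epoch_end(outputs, mode="val"):
    """Toy stand-in for MTEncDecModel.eval_epoch_end."""
    if not outputs:
        # On restore, PTL 1.6 may call this hook without having run any
        # validation steps; returning early avoids indexing an empty list.
        return None
    # The real method inspects outputs[0] to normalize single-dataloader
    # results; with an empty list this line is what raised IndexError.
    if isinstance(outputs[0], dict):
        outputs = [outputs]
    return len(outputs)
```

Without the guard, calling this with an empty list reaches outputs[0] and raises IndexError: list index out of range, matching the trace above.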

@MaximumEntropy MaximumEntropy self-assigned this May 22, 2022
@itzsimpl
Contributor Author

Thanks, this works. Interestingly, the call to validation_epoch_end() happens even if trainer.num_sanity_val_steps=0, which, as I understand it, should skip running any validation before training.

@MaximumEntropy
Contributor

Yeah, this is strange behavior that we've seen since PTL 1.6. Have a PR that "fixes" it here for 1.9.0 - #4265

@MaximumEntropy
Contributor

Please re-open this if you see something similar.
