
Trying to resume training enc_dec_nmt fails #4224

Closed
itzsimpl opened this issue May 22, 2022 · 4 comments
Labels: bug (Something isn't working)

@itzsimpl
Contributor

Describe the bug

I used enc_dec_nmt.py to build an NLP/MT model based on aayn_base.yml (nemo:1.8.2 based on pytorch:22.04-py3). Training was interrupted before reaching the final epoch; now, trying to resume training from the last checkpoint by passing
+exp_manager.resume_if_exists=true to the enc_dec_nmt.py call fails with the following trace:

Traceback (most recent call last):
  File "examples/nlp/machine_translation/enc_dec_nmt.py", line 147, in <module>
    main()
  File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
    _run_hydra(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "examples/nlp/machine_translation/enc_dec_nmt.py", line 140, in main
    trainer.fit(mt_model)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in _run
    results = self._run_stage()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1321, in _run_stage
    return self._run_train()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1351, in _run_train
    self.fit_loop.run()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 205, in run
    self.on_advance_end()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 255, in on_advance_end
    self._run_validation()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 309, in _run_validation
    self.val_loop.run()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 211, in run
    output = self.on_run_end()
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 187, in on_run_end
    self._evaluation_epoch_end(self._outputs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 309, in _evaluation_epoch_end
    self.trainer._call_lightning_module_hook("validation_epoch_end", output_or_outputs)
  File "/ceph/hpc/home/ilb/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1593, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/workspace/nemo/nemo/collections/nlp/models/machine_translation/mt_enc_dec_model.py", line 493, in validation_epoch_end
    self.eval_epoch_end(outputs, 'val', self.global_rank)
  File "/workspace/nemo/nemo/collections/nlp/models/machine_translation/mt_enc_dec_model.py", line 407, in eval_epoch_end
    if isinstance(outputs[0], dict):
IndexError: list index out of range

Steps/Code to reproduce bug

Train with examples/nlp/machine_translation/enc_dec_nmt.py long enough to produce at least one checkpoint, interrupt training, and rerun with the exp_manager.resume_if_exists flag set.
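For concreteness, the resume invocation looks roughly like the following. Only the +exp_manager.resume_if_exists=true override comes from this report; the config path and name are illustrative and depend on how the original run was launched:

```shell
# Re-run the same training command, adding the resume override.
python examples/nlp/machine_translation/enc_dec_nmt.py \
  --config-path=conf \
  --config-name=aayn_base \
  +exp_manager.resume_if_exists=true
```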

Expected behavior

Training resumes from last checkpoint. This used to work in nemo:1.3.0.

Environment overview

pytorch:22.04-py3
nemo:1.8.2

@itzsimpl itzsimpl added the bug Something isn't working label May 22, 2022
@MaximumEntropy
Contributor

MaximumEntropy commented May 22, 2022

This is something strange that has been happening since we moved to PTL 1.6: when a model is being restored, PTL calls validation_epoch_end() without running any of the validation steps that compute validation BLEU scores etc., so outputs is just an empty dict/list. There are two potential fixes:

1. At the beginning of the eval_epoch_end() function in mt_enc_dec_model.py, add if not outputs: return.
2. Move back to PTL 1.5.10 and see if that helps.

I think we've raised an issue with PTL about this; we'll see if something changes in PTL 1.6.1.
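The guard in fix 1 can be sketched as follows. This is a toy stand-in, not NeMo's actual eval_epoch_end implementation; it only demonstrates why an early return on empty outputs avoids the IndexError from the traceback:

```python
# Minimal sketch of the suggested guard: bail out early when PTL hands
# the hook an empty outputs list on checkpoint restore.

def eval_epoch_end(outputs, mode="val"):
    """Toy stand-in for MTEncDecModel.eval_epoch_end."""
    if not outputs:
        # On restore, PTL 1.6 may call this hook without having run any
        # validation steps; returning early avoids indexing an empty list.
        return None
    # The real method inspects outputs[0] to normalize single-dataloader
    # results; with an empty list this line is what raised IndexError.
    if isinstance(outputs[0], dict):
        outputs = [outputs]
    return len(outputs)
```

Without the guard, calling this with an empty list reaches outputs[0] and raises IndexError: list index out of range, matching the trace above.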

@MaximumEntropy MaximumEntropy self-assigned this May 22, 2022
@itzsimpl
Contributor Author

Thanks, this works. Interestingly, the call to validation_epoch_end() happens even if trainer.num_sanity_val_steps=0, which, as I understand it, should skip running any validation before training.

@MaximumEntropy
Contributor

Yeah, this is strange behavior that we've seen since PTL 1.6. Have a PR that "fixes" it here for 1.9.0 - #4265

@MaximumEntropy
Contributor

Please re-open this if you see something similar.
