Fine-tuning broken #54

Open
domef opened this issue May 20, 2024 · 1 comment

domef commented May 20, 2024

I can't load last.ckpt of my fine-tuned model:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[6], line 5
      3 ckpt="logs/myexp/checkpoints/last.ckpt"
      4 config = OmegaConf.load(f"{config}")
----> 5 model = load_model_from_config(config, f"{ckpt}")
      6 sampler = DDIMSampler(model)

Cell In[4], line 38
     36 if "global_step" in pl_sd:
     37     print(f"Global Step: {pl_sd['global_step']}")
---> 38 sd = pl_sd["state_dict"]
     39 model = instantiate_from_config(config.model)
     40 m, u = model.load_state_dict(sd, strict=False)

KeyError: 'state_dict'
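
A quick way to confirm what the file actually contains is to look at its top-level keys before indexing into it. The sketch below is a minimal, hypothetical helper (`extract_state_dict` is not part of the repository). It assumes the checkpoint has already been loaded into a plain dict, for example with `torch.load(ckpt, map_location="cpu")`.

```python
from typing import Any, Mapping


def extract_state_dict(pl_sd: Mapping[str, Any]) -> Mapping[str, Any]:
    """Return the model weights from a loaded checkpoint dict.

    Hypothetical helper: raises a descriptive error listing the keys that
    are present instead of a bare KeyError when "state_dict" is missing.
    """
    if "state_dict" not in pl_sd:
        found = sorted(pl_sd.keys())
        raise KeyError(
            f"checkpoint has no 'state_dict' entry; top-level keys found: {found}"
        )
    return pl_sd["state_dict"]
```

In `load_model_from_config`, the line `sd = pl_sd["state_dict"]` could then become `sd = extract_state_dict(pl_sd)`. If the reported key list is empty or lacks the weights, the file was written without the model's state dict.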

This is probably because the model was not saved correctly. When fine-tuning finishes, the script crashes:

Epoch 0:  10%| | 61001/616605 [5:46:21<52:34:42,  2.94it/s, loss=0.166, v_num=0, train/l
Saving latest checkpoint...

Traceback (most recent call last):
  File "main.py", line 779, in <module>
    trainer.test(model, data)
  File "/home/federico/Desktop/InST/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 911, in test
    return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/home/federico/Desktop/InST/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/federico/Desktop/InST/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 954, in _test_impl
    results = self._run(model, ckpt_path=self.tested_ckpt_path)
  File "/home/federico/Desktop/InST/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1128, in _run
    verify_loop_configurations(self)
  File "/home/federico/Desktop/InST/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 42, in verify_loop_configurations
    __verify_eval_loop_configuration(trainer, model, "test")
  File "/home/federico/Desktop/InST/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 186, in __verify_eval_loop_configuration
    raise MisconfigurationException(f"No `{loader_name}()` method defined to run `Trainer.{trainer_method}`.")
pytorch_lightning.utilities.exceptions.MisconfigurationException: No `test_dataloader()` method defined to run `Trainer.test`.

Edit:
Even if I comment out these lines so that no exception is raised, the checkpoint is still not saved correctly:

        if not opt.no_test and not trainer.interrupted:
            trainer.test(model, data)
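
Rather than commenting the lines out, the test run could be skipped only when no test data is available. This is a sketch under assumptions: `should_run_test` and the `has_test_data` flag are hypothetical and not part of `main.py`, and how to detect a configured test split depends on the data module.

```python
def should_run_test(no_test: bool, interrupted: bool, has_test_data: bool) -> bool:
    """Decide whether to call trainer.test after fitting.

    Mirrors the original condition (`not opt.no_test and not
    trainer.interrupted`) and adds one assumed check: a test split exists.
    """
    return (not no_test) and (not interrupted) and has_test_data
```

The guarded call would then read `if should_run_test(opt.no_test, trainer.interrupted, has_test_data): trainer.test(model, data)`. This only avoids the `MisconfigurationException`; it does not change what is written to last.ckpt.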

domef commented May 20, 2024

From issue #45, it seems that I probably do not need to load the new last.ckpt and should load the pretrained checkpoint instead. Is that correct?
