🐛 Bug
When fine-tuning from saved weights in bolts, trainer.test() picks up a reference to checkpoints that have already been deleted or have not yet been created.
The checkpoint was created using default trainer options, with no callbacks added on the user's side.
Please reproduce using the BoringModel
Not sure how to reproduce fine-tuning from a checkpoint using the boring model.
To Reproduce
1. Clone bolts: git clone https://github.com/PyTorchLightning/pytorch-lightning-bolts.git
2. Download the pretrained SwAV weights: wget 'https://pl-bolts-weights.s3.us-east-2.amazonaws.com/swav/checkpoints/swav_stl10.pth.tar'
3. Fine-tune from the downloaded checkpoint, then call trainer.test().
The latest saved checkpoint is, say, 'epoch=33.ckpt', but line 712 in trainer.py looks for other saved checkpoints, from epochs before or after the one actually present in the checkpoints folder.
Error (the missing checkpoint differs across runs):
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 712, in test
    results = self.__test_using_best_weights(ckpt_path, test_dataloaders)
FileNotFoundError: [Errno 2] No such file or directory: '/home/jovyan/pytorch_lightning_bolts/pl_bolts/models/self_supervised/swav/lightning_logs/version_3/checkpoints/epoch=7.ckpt'
FileNotFoundError: [Errno 2] No such file or directory: '/home/jovyan/pytorch_lightning_bolts/pl_bolts/models/self_supervised/swav/lightning_logs/version_3/checkpoints/epoch=21.ckpt'
FileNotFoundError: [Errno 2] No such file or directory: '/home/jovyan/pytorch_lightning_bolts/pl_bolts/models/self_supervised/swav/lightning_logs/version_3/checkpoints/epoch=37.ckpt'
Expected behavior
trainer.test(datamodule=dm) should pick up the reference to the correct checkpoint saved in lightning_logs/version_x/checkpoints.
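As a possible workaround until the stale reference is fixed, the newest checkpoint can be resolved from the checkpoints directory itself and passed explicitly (trainer.test() accepts a ckpt_path argument in these Lightning versions). This is a sketch; latest_checkpoint is a hypothetical helper, not part of any library.

```python
import os
import re

def latest_checkpoint(ckpt_dir):
    """Return the path of the highest-epoch file named like
    'epoch=33.ckpt' in ckpt_dir, or None if none exist."""
    pattern = re.compile(r"epoch=(\d+)\.ckpt$")
    best_epoch, best_path = -1, None
    for name in os.listdir(ckpt_dir):
        m = pattern.search(name)
        if m and int(m.group(1)) > best_epoch:
            best_epoch = int(m.group(1))
            best_path = os.path.join(ckpt_dir, name)
    return best_path

# Usage (hypothetical paths):
# ckpt = latest_checkpoint("lightning_logs/version_3/checkpoints")
# trainer.test(datamodule=dm, ckpt_path=ckpt)
```

This sidesteps the trainer's internal bookkeeping by trusting only what is actually on disk.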
Environment
- PyTorch Lightning version: 1.0.4+ (tested with both 1.0.4 and 1.0.6)
- bolts: master
- PyTorch version: 1.6
- OS: Linux
- How you installed PyTorch: pip
- Build command you used (if compiling from source):
- Python version: 3.7
- CUDA/cuDNN version:
- GPU models and configuration: V100s
- Any other relevant information:
Additional context