
--resume bug #141

Closed
zidanexu opened this issue Jun 20, 2020 · 8 comments
Labels
bug (Something isn't working), Stale (scheduled for closing soon)

Comments

@zidanexu

zidanexu commented Jun 20, 2020

@glenn-jocher I was training with the default command on multiple GPUs. At around epoch 255 the program crashed, so I used the --resume option to continue. After resuming, though, the gap in mAP is large: at the point where it broke, the mAP(0.5-0.95) was 0.302, but right after resuming (epoch 256) the mAP(0.5-0.95) quickly dropped to only 0.16. Am I missing something in how to use it?
[screenshot]

What I also find strange is that when I kill the program and resume again, the mAP(0.5-0.95) goes back to a normal ~0.3.
[screenshot]

zidanexu added the bug label on Jun 20, 2020
glenn-jocher changed the title from "reume bug" to "--resume bug" on Jun 20, 2020
@glenn-jocher
Member

@zidanexu --resume has unresolved issues, possibly related to LR continuity. I doubt they will be resolved anytime soon, so I recommend you restart from 0 and train fully.

If you come up with a fix though please let us know!

@zidanexu
Author

@glenn-jocher
I tried saving the state_dict of the scheduler like this:
[screenshot]
and loading it from the weights on resume, like below:
[screenshot]
I also disabled your option and changed the setup like below:
[screenshot]
There is some positive effect, but it is not completely fixed. Do you think the EMA function contributes to this difference?
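For reference, a minimal sketch of the save-and-restore-the-scheduler idea described above, assuming a standard PyTorch training loop; the object names and checkpoint keys here are illustrative, not the exact code from the screenshots:

    import torch
    import torch.nn as nn

    # Stand-ins for the real training objects; illustrative only.
    device = 'cpu'
    model = nn.Linear(10, 1).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
    epoch = 5

    # At checkpoint time: save the scheduler state alongside the weights.
    ckpt = {
        'epoch': epoch,
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'scheduler': scheduler.state_dict(),  # preserves LR continuity across --resume
    }
    torch.save(ckpt, 'last.pt')

    # On resume: restore everything, not just the model weights.
    ckpt = torch.load('last.pt', map_location=device)
    model.load_state_dict(ckpt['model'])
    optimizer.load_state_dict(ckpt['optimizer'])
    scheduler.load_state_dict(ckpt['scheduler'])
    start_epoch = ckpt['epoch'] + 1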

@glenn-jocher
Member

@zidanexu yes, that is a good point. EMA will play a part. The EMA weights are used for all testing and checkpointing, so in theory both the normal model and the EMA model must be saved in every checkpoint for resume to function correctly. Otherwise the resumed EMA will be built off of the saved EMA rather than the normal model.

Honestly, resuming is much more headache than it's worth. My advice is to never resume.
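To make the "save both the normal model and the EMA model" point concrete, a rough sketch with a generic EMA helper; this is a stand-in for illustration, not the repo's actual ModelEMA implementation:

    import copy
    import torch
    import torch.nn as nn

    class SimpleEMA:
        # Generic EMA helper for illustration; not the repo's ModelEMA.
        def __init__(self, model, decay=0.9999):
            self.ema = copy.deepcopy(model).eval()  # EMA copy used for test/checkpoint
            self.decay = decay
            for p in self.ema.parameters():
                p.requires_grad_(False)

        def update(self, model):
            # Blend the current training weights into the EMA weights.
            with torch.no_grad():
                msd = model.state_dict()
                for k, v in self.ema.state_dict().items():
                    if v.dtype.is_floating_point:
                        v.mul_(self.decay).add_(msd[k], alpha=1.0 - self.decay)

    model = nn.Linear(10, 1)
    ema = SimpleEMA(model)

    # Save BOTH the training model and the EMA model at every checkpoint...
    torch.save({'model': model.state_dict(), 'ema': ema.ema.state_dict()}, 'last.pt')

    # ...so that after a resume the EMA keeps averaging the live training weights
    # instead of being rebuilt on top of the saved EMA weights alone.
    ckpt = torch.load('last.pt')
    model.load_state_dict(ckpt['model'])
    ema.ema.load_state_dict(ckpt['ema'])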

@glenn-jocher
Member

@zidanexu I was referencing this issue in a recent PR. I think I will remove all official support for resuming training, as it has never, ever worked correctly, either here or in ultralytics/yolov3, and it is creating much more confusion and headache than it's worth.

@zcode86

zcode86 commented Jul 2, 2020

So, has it been solved now?

@glenn-jocher
Member

@iamastar88 --resume runs without error, but I would strongly advise against using it, as it does not seamlessly do what it promises. If possible, you should always train fully from start to finish in one go.

@github-actions
Contributor

github-actions bot commented Aug 1, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the Stale label on Aug 1, 2020
github-actions bot closed this as completed on Aug 8, 2020
@glenn-jocher
Member

UPDATE: --resume is fully functional now; it has graduated into an officially supported feature. You can use it by itself with no arguments, or point it to a last.pt to resume from:

python train.py --resume  # resume from most recent last.pt
python train.py --resume runs/exp0/weights/last.pt  # resume from specific weights
