
--resume bug #141

Closed
zidanexu opened this issue Jun 20, 2020 · 8 comments
Labels
bug (Something isn't working), Stale (scheduled for closing soon)

Comments

@zidanexu

zidanexu commented Jun 20, 2020

@glenn-jocher I was training with the default command on multiple GPUs. At around epoch 255 the program crashed, so I used the --resume option to continue. After resuming, though, the gap in mAP is large: at the point where it broke, the mAP(0.5-0.95) was 0.302, but right after resuming (epoch 256) the mAP(0.5-0.95) quickly dropped to only 0.16. Am I missing something in how to use it?
[screenshot]

What I also find strange is that when I kill the program and resume again, the mAP(0.5-0.95) goes back to a normal ~0.3.
[screenshot]

zidanexu added the bug label on Jun 20, 2020
glenn-jocher changed the title from "reume bug" to "--resume bug" on Jun 20, 2020
@glenn-jocher
Member

@zidanexu --resume has unresolved issues, possibly related to LR continuity. I doubt they will be resolved anytime soon, so I recommend you restart from 0 and train fully.

If you come up with a fix though please let us know!

@zidanexu
Author

@glenn-jocher
I tried saving the state_dict of the scheduler like this:
[screenshot]
and loading it from the weights on resume, like below:
[screenshot]
I also disabled your option and changed the setup like below:
[screenshot]
There is some positive effect, but it is not completely fixed. Do you think the EMA function contributes to this difference?
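For reference, a minimal sketch of the save-and-restore-the-scheduler idea described above, assuming a standard PyTorch training loop; the object names and checkpoint keys here are illustrative, not the exact code from the screenshots:

    import torch
    import torch.nn as nn

    # Stand-ins for the real training objects; illustrative only.
    device = 'cpu'
    model = nn.Linear(10, 1).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
    epoch = 5

    # At checkpoint time: save the scheduler state alongside the weights.
    ckpt = {
        'epoch': epoch,
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'scheduler': scheduler.state_dict(),  # preserves LR continuity across --resume
    }
    torch.save(ckpt, 'last.pt')

    # On resume: restore everything, not just the model weights.
    ckpt = torch.load('last.pt', map_location=device)
    model.load_state_dict(ckpt['model'])
    optimizer.load_state_dict(ckpt['optimizer'])
    scheduler.load_state_dict(ckpt['scheduler'])
    start_epoch = ckpt['epoch'] + 1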

@glenn-jocher
Member

@zidanexu yes, that is a good point. EMA will play a part. The EMA weights are used for all testing and checkpointing, so in theory both the normal model and the EMA model must be saved in every checkpoint for resume to function correctly. Otherwise the resumed EMA will be built off of the saved EMA rather than the normal model.

Honestly, resuming is much more headache than it's worth. My advice is to never resume.
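To make the "save both the normal model and the EMA model" point concrete, a rough sketch with a generic EMA helper; this is a stand-in for illustration, not the repo's actual ModelEMA implementation:

    import copy
    import torch
    import torch.nn as nn

    class SimpleEMA:
        # Generic EMA helper for illustration; not the repo's ModelEMA.
        def __init__(self, model, decay=0.9999):
            self.ema = copy.deepcopy(model).eval()  # EMA copy used for test/checkpoint
            self.decay = decay
            for p in self.ema.parameters():
                p.requires_grad_(False)

        def update(self, model):
            # Blend the current training weights into the EMA weights.
            with torch.no_grad():
                msd = model.state_dict()
                for k, v in self.ema.state_dict().items():
                    if v.dtype.is_floating_point:
                        v.mul_(self.decay).add_(msd[k], alpha=1.0 - self.decay)

    model = nn.Linear(10, 1)
    ema = SimpleEMA(model)

    # Save BOTH the training model and the EMA model at every checkpoint...
    torch.save({'model': model.state_dict(), 'ema': ema.ema.state_dict()}, 'last.pt')

    # ...so that after a resume the EMA keeps averaging the live training weights
    # instead of being rebuilt on top of the saved EMA weights alone.
    ckpt = torch.load('last.pt')
    model.load_state_dict(ckpt['model'])
    ema.ema.load_state_dict(ckpt['ema'])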

@glenn-jocher
Member

@zidanexu I was referencing this issue in a recent PR. I think I will remove all official support for resuming training, as it has never, ever worked correctly, either here or in ultralytics/yolov3, and it is creating much more confusion and headache than it's worth.

@zcode86

zcode86 commented Jul 2, 2020

So, has it been solved now?

@glenn-jocher
Member

@iamastar88 --resume runs without error, but I would strongly advise against using it, as it does not seamlessly do what it promises. If possible, you should always train fully from start to finish in one go.

@github-actions
Contributor

github-actions bot commented Aug 1, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the Stale label on Aug 1, 2020
github-actions bot closed this as completed on Aug 8, 2020
@glenn-jocher
Member

UPDATE: --resume is fully functional now; it has graduated into an officially supported feature. You can use it by itself with no arguments, or point it to a last.pt to resume from:

python train.py --resume  # resume from most recent last.pt
python train.py --resume runs/exp0/weights/last.pt  # resume from specific weights
