--resume bug #141
@zidanexu --resume has unresolved issues, possibly related to learning-rate (LR) continuity. I doubt they will be resolved anytime soon, so I recommend you restart from epoch 0 and train fully. If you come up with a fix, though, please let us know!
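For context on what "LR continuity" means here, below is a minimal sketch (not the repo's actual code, names are illustrative) of keeping the learning-rate schedule continuous across a resume by checkpointing the scheduler state alongside the model and optimizer. If the scheduler state is not saved and restored, the resumed run restarts the LR schedule from epoch 0.

```python
# Minimal sketch, assuming a plain PyTorch training loop (not YOLOv5's exact code).
import torch

model = torch.nn.Linear(10, 2)                       # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda e: 0.95 ** e)

# --- saving a checkpoint at the end of `epoch` ---
epoch = 4
ckpt = {
    'epoch': epoch,
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'scheduler': scheduler.state_dict(),              # without this, the LR schedule restarts at 0
}
torch.save(ckpt, 'last.pt')

# --- resuming later ---
ckpt = torch.load('last.pt')
model.load_state_dict(ckpt['model'])
optimizer.load_state_dict(ckpt['optimizer'])
scheduler.load_state_dict(ckpt['scheduler'])          # restores last_epoch so the LR continues
start_epoch = ckpt['epoch'] + 1
```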
@glenn-jocher
@zidanexu yes, that is a good point. EMA will play a part. The EMA weights are used for all testing and checkpointing, so in theory both the normal model and the EMA model must be saved at every checkpoint for resume to function correctly. Otherwise the resumed EMA will be building off of the saved EMA rather than the normal model. Honestly, resuming is much more headache than it's worth. My advice is to never resume.
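To illustrate the point about checkpointing both sets of weights, here is a minimal sketch (hypothetical names, not the repo's exact implementation) of an EMA wrapper whose state is saved and restored separately from the raw model, so a resumed run keeps updating the EMA from the live training weights instead of from the old EMA:

```python
# Minimal sketch of dual (model + EMA) checkpointing, assuming plain PyTorch.
import copy
import torch

class ModelEMA:
    """Exponential moving average of model parameters (illustrative, simplified)."""
    def __init__(self, model, decay=0.9999):
        self.ema = copy.deepcopy(model).eval()        # EMA copy, used for val/checkpoints
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    def update(self, model):
        with torch.no_grad():
            for e, m in zip(self.ema.state_dict().values(), model.state_dict().values()):
                if e.dtype.is_floating_point:
                    e.mul_(self.decay).add_(m.detach(), alpha=1.0 - self.decay)

model = torch.nn.Linear(10, 2)                        # placeholder model
ema = ModelEMA(model)

# Save BOTH states; saving only the EMA forces the resumed run to rebuild
# its EMA on top of the old EMA rather than on the live training weights.
torch.save({'model': model.state_dict(), 'ema': ema.ema.state_dict()}, 'last.pt')

# Resume: restore each state dict into its own object.
ckpt = torch.load('last.pt')
model.load_state_dict(ckpt['model'])
ema.ema.load_state_dict(ckpt['ema'])
```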
@zidanexu was referencing this issue in a recent PR. I think I will remove all official support for resuming training, as it has never worked correctly, either here or in ultralytics/yolov3, and it is creating much more confusion and headache than it's worth.
So, has it been solved now?
@iamastar88 --resume runs without error, but I would strongly advise against using it, as it does not seamlessly resume training the way it should. If possible, you should always train fully from start to finish in one go.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
UPDATE: --resume is fully functional now and has graduated into an officially supported feature. You can use it by itself with no arguments, or point it at a last.pt to resume from:
```bash
python train.py --resume                              # resume from most recent last.pt
python train.py --resume runs/exp0/weights/last.pt    # resume from specific weights
```
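As a rough illustration of what a bare `--resume` could do under the hood, here is a hedged sketch of locating the most recently modified last.pt under runs/. The helper name and search path are assumptions for illustration, not the repository's actual code:

```python
# Hypothetical helper: find the newest last*.pt under runs/ to resume from.
from pathlib import Path

def find_latest_last(search_dir='runs'):
    candidates = list(Path(search_dir).rglob('last*.pt'))
    return max(candidates, key=lambda p: p.stat().st_mtime) if candidates else None

ckpt = find_latest_last()
print(f'Resuming from {ckpt}' if ckpt else 'No last.pt found')
```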
@glenn-jocher I was training with the default command on multiple GPUs. At around epoch 255 the program crashed, so I used the --resume option to continue, but I found a large difference in mAP. For example, at the break point (epoch 200) the mAP(0.5-0.95) was 0.302, but right after resuming (epoch 256) it quickly dropped to only 0.16. Did I miss something?
Strangely, when I kill the program and resume again, the mAP(0.5-0.95) returns to the normal ~0.3.